So how do I process my data?

Analysis workflows

As mentioned in the modules, multiple workflows have been developed to help you process your data. The Broad Institute has developed a set of best practices for certain analyses as part of its Genome Analysis Toolkit (GATK). Flowcharts for their pipelines are available at their website.

In addition to the workflows developed by the Broad Institute, a few other workflows can be found here:

  • RNA-Sequencing pipeline: Differential Expression

  • SNV Detection Workflow: SNV Detection

  • CNV Detection Workflow: CNV Detection

File formats

  • GenBank Flat File: human-readable file format used by GenBank
  • FASTA: sequence(s)
  • FASTQ: sequence(s) with qualities
  • SAM/BAM: aligned reads, either human readable (SAM) or binary (BAM)
  • VCF: variant file

For definitions of file formats, check here.
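
To make the difference between FASTA and FASTQ concrete, here is a minimal sketch that assumes a well-formed FASTQ file with exactly four lines per record (header, sequence, separator, qualities). The reads, the file names and the temporary directory are made up for illustration; real files are often gzip-compressed and would need decompressing first.

```shell
# Work in a throwaway directory so nothing in your project is overwritten.
workdir=$(mktemp -d)

# A two-read FASTQ file with made-up reads (hypothetical data).
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n' > "$workdir/example.fastq"

# Each record is four lines, so the number of reads is the line count divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$workdir/example.fastq")
echo "reads: $n_reads"

# Convert to FASTA: keep the header (swapping "@" for ">") and the sequence,
# and drop the separator and quality lines.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' "$workdir/example.fastq" > "$workdir/example.fasta"

cat "$workdir/example.fasta"
```

The conversion throws away the quality line, which is exactly the information FASTQ carries beyond FASTA.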

Databases

NCBI

  • many, many resources including literature (PubMed and PMC), health (dbGaP and ClinVar), genomes (WG and RefSeq), proteins (Seq and 3D structures), and chemicals (PubChem and BioSystems)
  • includes Entrez gene, the best way to explore gene information (according to BFFO)

ICGC data portal

  • provides tools for visualizing, querying and downloading the data released quarterly by the consortium’s member projects.
  • protected and non-protected data from 90 global cancer projects
  • includes tumor/normal pairs with genome, transcriptome, methylome, and clinical data
  • includes TCGA data

TCGA: The Cancer Genome Atlas

COSMIC: Catalogue of Somatic Mutations in Cancer

European Genome-Phenome Archive (EGA)

Dockstore

UCSC Human Cancer

GDC: Genomic Data Commons Data Portal

Cancer Genome Collaboratory

Variation databases

dbSNP

dbGaP

Gene fusion databases

ChimerDB 3

TCGA gene fusion portal

COSMIC

ConjoinDB

Genome browsers

Genome browsers are used to explore genomes and can load “tracks” of extra information. The most common is the UCSC Genome Browser, followed by Ensembl and then NCBI Map Viewer. Many others exist.

Tools

Tools for CNVs

CRMAv2: http://www.aroma-project.org/vignettes/CRMAv2
DNACopy: http://www.bioconductor.org/packages/release/bioc/html/DNAcopy.html
HMM-Dosage: http://compbio.bccrc.ca/software/hmm-dosage/
PICNIC: http://www.sanger.ac.uk/genetics/CGP/Software/PICNIC/
OncoSNP: https://sites.google.com/site/oncosnp/
HMMCopy: http://compbio.bccrc.ca/software/hmmcopy/
Apolloh: http://compbio.bccrc.ca/software/apolloh/
Control-FreeC: http://bioinfo-out.curie.fr/projects/freec/
ASCAT: https://www.crick.ac.uk/peter-van-loo/software/ASCAT
ABSOLUTE: http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/ABSOLUTE/1
TITAN: http://compbio.bccrc.ca/software/titan/

Tools for SNVs

GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
SamTools: http://samtools.sourceforge.net/
MuTect: http://www.broadinstitute.org/software/cprg/?q=node/34
MuTect2: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php
Strelka (from Illumina): https://sites.google.com/site/strelkasomaticvariantcaller/
SomaticSniper: http://gmt.genome.wustl.edu/packages/somatic-sniper/
VarScan: http://varscan.sourceforge.net/

Tools for Annotation

ANNOVAR: http://www.openbioinformatics.org/annovar/
SNPEFF: http://snpeff.sourceforge.net/

Pathway and Network Analysis

Reactome

g:Profiler

Command-line tricks

Pipe output for column view and scrolling

your command that would output to the screen in messy columns | column -t | less -S

column -t aligns your delimited output into nice columns (it splits on any whitespace by default; pass -s with a tab character to split on tabs only)
less -S makes long lines in less scrollable left/right instead of wrapping them onto the next line
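
As a concrete illustration of the trick, the sketch below builds a small tab-delimited table and aligns it. The file name and the values in it are made up, and it assumes the column utility is installed (it ships with util-linux and the BSD utilities). The interactive less -S step is shown only as a comment so the example can run unattended.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)

# A tiny tab-delimited table (hypothetical values, for illustration only).
printf 'chrom\tpos\tref\talt\nchr1\t12345\tA\tG\nchr17\t7674220\tC\tT\n' \
    > "$workdir/example_variants.tsv"

# A literal tab character, written portably for plain sh.
tab=$(printf '\t')

# -t builds a table; -s "$tab" splits on tabs only, so a field that
# contains spaces would stay in one column.
aligned=$(column -t -s "$tab" "$workdir/example_variants.tsv")
printf '%s\n' "$aligned"

# For interactive viewing, pipe the same output into less without wrapping:
#   column -t -s "$tab" "$workdir/example_variants.tsv" | less -S
```

Inside less, the left and right arrow keys scroll horizontally and q quits.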