Gene Fusions Tutorial - Prediction
General Setup
Create tutorial directory structure on the instance
Create a directory in the temporary workspace and change to that directory.
mkdir /mnt/workspace/Module7/
cd /mnt/workspace/Module7/
All content for this tutorial is located in a bitbucket repo at
https://dranew@bitbucket.org/dranew/cbw_tutorial.git
. Some of the data, scripts and config files from this repo will be used in the tutorial. Link the repo directory to your workspace.
ln -s /media/cbwdata/CG_data/Module7/cbw_tutorial
The reference genomes have been indexed and fusion caller specific setup scripts have been run for you. Instructions for your reference are available in the install section of the tutorial. Link the ref data to your workspace.
ln -s /media/cbwdata/CG_data/Module7/refdata
Sample data has been prepared for you, link to your workspace.
ln -s /media/cbwdata/CG_data/Module7/sampledata
For visualization purposes, a list of bams is available for reads aligning to fused genes. Link those into your workspace.
ln -s /media/cbwdata/CG_data/Module7/bams
Set the $TUTORIAL_HOME
environment variable so subsequent commands know the
locatioin of installed binaries, libraries, and reference genome data. Also,
subsequent analyses will fall into subdirectories of $TUTORIAL_HOME
.
TUTORIAL_HOME=/mnt/workspace/Module7
Environment setup
For this tutorial, we now assume the environment variable $TUTORIAL_HOME
has been set to an existing directory to which the user has write access.
All binaries used in this tutorial will be installed using conda
. Modify the PATH
environment variable to point to binaries in the anaconda installation.
export PATH="/home/ubuntu/CourseData/CG_data/Module7/anaconda/bin:$PATH"
Reference Genome
We will be using a modified reference genome comprising a single chromosome to allow for quicker running times.
Environment setup
Create variables for the reference genome and gene models.
TUTORIAL_CHROMOSOME=2
ENSEMBL_GTF_FILENAME=$TUTORIAL_HOME/refdata/Homo_sapiens.GRCh37.75.gtf
GENCODE_GTF_FILENAME=$TUTORIAL_HOME/refdata/gencode.v19.annotation.gtf
ENSEMBL_GENOME_FILENAME=$TUTORIAL_HOME/refdata/Homo_sapiens.GRCh37.75.dna.chromosome.$TUTORIAL_CHROMOSOME.fa
UCSC_GENOME_FILENAME=$TUTORIAL_HOME/refdata/chr$TUTORIAL_CHROMOSOME.fa
ENSEMBL_GENOME_PREFIX=$TUTORIAL_HOME/refdata/Homo_sapiens.GRCh37.75.dna.chromosome.$TUTORIAL_CHROMOSOME
UCSC_GENOME_PREFIX=$TUTORIAL_HOME/refdata/chr$TUTORIAL_CHROMOSOME
For gmap, set a directory for the gmap indices, and the id of the reference.
GMAP_REF_ID=chr$TUTORIAL_CHROMOSOME
GMAP_INDEX_DIR=$TUTORIAL_HOME/refdata/gmap/
For STAR, set the directory for the indices.
STAR_GENOME_INDEX=$TUTORIAL_HOME/refdata/star/
HCC1395 sample data
Environment setup
We will create environment variables pointing to the data on the instance, and set a sample id for the data.
We will be working with a publically available HCC1395 breast cancer cell line sequenced for teaching and benchmarking purposes. The dataset is available from washington university servers, and can be accessed via the github page for the Genome Modeling System.
Special thanks to Malachi Griffith for providing access to the data.
SAMPLE_ID=HCC1395
SAMPLE_FASTQ_1=/media/cbwdata/CG_data/Module7/sampledata/HCC1395/rnaseq/SAMPLE_R1.fastq
SAMPLE_FASTQ_2=/media/cbwdata/CG_data/Module7/sampledata/HCC1395/rnaseq/SAMPLE_R2.fastq
Running times for each tool
chimerascan: 12 minutes
defuse: 18 minutes
star: 11 minutes
trinity: 12 minutes
Run the ChimeraScan fusion prediction tool
Environment setup
Set variable for index directory.
CHIMERASCAN_INDEX=$TUTORIAL_HOME/refdata/chimerascan/indices/
Execution
Run chimerascan by providing the index directory, the pair of fastq files, and the sample specific output directory
mkdir -p $TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/
chimerascan_run.py -p 4 \
$CHIMERASCAN_INDEX \
$SAMPLE_FASTQ_1 $SAMPLE_FASTQ_2 \
$TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/
After chimerascan has completed we can run an additional script to create an html output to browse the results.
chimerascan_html_table.py --read-throughs \
-o $TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/chimeras.html \
$TUTORIAL_HOME/analysis/chimerascan/$SAMPLE_ID/chimeras.bedpe
You can view the resulting html report at the following url in your browser, where you need to replace #
with your instance id.
http://cbw#.dyndns.info/Module7/analysis/chimerascan/HCC1395/chimeras.html
Run the deFuse gene fusion prediction tool
Environment setup
Set variable for the config filename and the two scripts.
DEFUSE_CONFIG=$TUTORIAL_HOME/cbw_tutorial/config/defuse_chr1.txt
DEFUSE_REF_DATA=$TUTORIAL_HOME/refdata/defuse/
Execution
Running deFuse involves invoking a single script using perl. The output is to a directory which will contain a number of temporary and results files. Create an output directory and run the defuse script specifying the paired end reads and the configuration filename.
mkdir -p $TUTORIAL_HOME/analysis/defuse/$SAMPLE_ID/
defuse_run.pl \
-c $DEFUSE_CONFIG \
-d $DEFUSE_REF_DATA \
-1 $SAMPLE_FASTQ_1 \
-2 $SAMPLE_FASTQ_2 \
-o $TUTORIAL_HOME/analysis/defuse/$SAMPLE_ID/
The results are output as a table in TSV format at $TUTORIAL_HOME/analysis/defuse/$SAMPLE_ID/results.filtered.tsv
. They can be viewed in R or MS Excel. To view the read evidence for a specific fuion, use the defuse_get_reads.pl
script.
defuse_get_reads.pl \
-c $DEFUSE_CONFIG \
-d $DEFUSE_REF_DATA \
-o $TUTORIAL_HOME/analysis/defuse/$SAMPLE_ID/ \
-i 20
Run the STAR RNA-Seq aligner on a sample
Environment
Specify the directory in which the reference genome data will be stored.
STAR_GENOME_INDEX=$TUTORIAL_HOME/refdata/star/
STAR_FUSION_GENOME_INDEX=$TUTORIAL_HOME/refdata/star/GRCh37_gencode_v19_CTAT_lib/
Execution
Run the star aligner
Running STAR on paired end read fastqs can be performed in a single step. Set the prefix argument --outFileNamePrefix
to specify the location of the output.
Assume we have end 1 and end 2 files as $SAMPLE_FASTQ_1
and $SAMPLE_FASTQ_2
environment variables pointing to paired fastq files for sample $SAMPLE_ID
.
mkdir -p $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/
STAR --runThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--outReadsUnmapped None \
--chimSegmentMin 15 \
--chimJunctionOverhangMin 15 \
--chimOutType WithinBAM \
--alignMatesGapMax 200000 \
--alignIntronMax 200000 \
--genomeDir $STAR_GENOME_INDEX \
--readFilesIn $SAMPLE_FASTQ_1 $SAMPLE_FASTQ_2 \
--outFileNamePrefix $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/
Note: In order to obtain alignments of chimeric reads potentially supporting fusions, we have added the
--chimSegmentMin 20
option to obtain chimerica reads anchored by at least 20nt on either side of the fusion boundary, and--chimOutTypeWithinBAM
to report such alignments in the sam/bam output.
Note: We are running with additional command line options for fusion discovery as described in the manual for STAR-Fusion.
The file $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam
should contain
the resulting alignments in bam format, sorted by position.
Index the bam file using samtools index
.
samtools index $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam
Generate a precalculated coverage file for use with IGV. This will allow viewing of read depth across the genome at all scales, and will speed up IGV viewing of the bam file.
igvtools count $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam \
$TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Aligned.sortedByCoord.out.bam.tdf hg19
You can view the bam file in IGV at:
http://cbw#.dyndns.info/Module7/analysis/star/HCC1395/Aligned.sortedByCoord.out.bam
where you need to replace # with your student ID.
Run STAR fusion caller
The STAR fusion caller parses the chimeric alignments produced by the STAR aligner and predicts fusions from these alignments.
STAR-Fusion \
--chimeric_junction $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.junction \
--genome_lib_dir $STAR_FUSION_GENOME_INDEX \
--output_dir $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/starfusion
The fusion reads are output to the sam file Chimeric.out.sam
. To view these in IGV, first import into bam format, sort, and then index.
samtools view -bt $UCSC_GENOME_FILENAME \
$TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.sam \
| samtools sort - -o $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam
samtools index $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam
igvtools count $TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam \
$TUTORIAL_HOME/analysis/star/$SAMPLE_ID/Chimeric.out.bam.tdf hg19
You can view the bam file of fusion reads in IGV at:
http://cbw#.dyndns.info/Module7/analysis/star/HCC1395/Chimeric.out.bam
where you need to replace # with your student ID. Fusion reads can be found
at this location chr2:160,832,184-160,837,535
.
Run the Trinity RNA-Seq assembler / GMAP contig mapper
Execution
Assemble using Trinity
Create a directory for the trinity analysis. Run trinity with fastq (fq) as the sequence type. Specify memory and threads, and provide the fastq paths and output paths.
mkdir -p $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/trinity/
cp $SAMPLE_FASTQ_1 $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_1.fq
cp $SAMPLE_FASTQ_2 $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_2.fq
Trinity \
--seqType fq --max_memory 12G --CPU 4 \
--left $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_1.fq \
--right $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/sample_reads_2.fq \
--output $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/trinity/
Align using GMap
Use gmap to map the resulting contigs to the reference genome and produce a GFF3 file.
cd $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/trinity
gmap \
-f gff3_gene \
-D $GMAP_INDEX_DIR \
-d $GMAP_REF_ID \
Trinity.fasta \
> Trinity.gff3
Postprocess
Post-process the gff3 file to produce a list of fusions. We will use a custom script to simply pull out chimeric contigs. Realistically, further processing would be required to identify true fusions.
cd $TUTORIAL_HOME/analysis/trinity/$SAMPLE_ID/trinity
python $TUTORIAL_HOME/cbw_tutorial/scripts/gmap_extract_fusions.py \
Trinity.gff3 Trinity.fasta Fusions.fasta
The following sequence is the PLA2R-RBMS1
fusion.
>TRINITY_DN532_c0_g1_i1 len=712 path=[1379:0-711] [-1, 1379, -2]
TTTTGTGAATGAGCTCTTACATTCAAAATTTAATTGGACAGAAGAAAGGCAGTTCTGGAT
TGGATTTAATAAAAGAAACCCACTGAATGCCGGCTCATGGGAGTGGTCTGATAGAACTCC
TGTTGTCTCTTCGTTTTTAGACAACACTTATTTTGGAGAAGATGCAAGAAACTGTGCTGT
TTATAAGGCAAACAAAACATTGCTGCCCTTACACTGTGGTTCCAAACGTGAATGGATATG
CAAAATCCCAAGAGATGTGAAACCCAAGATTCCGTTCTGGTACCAGTACGATGTACCCTG
GCTCTTTTATCAGGATGCAGAATACCTTTTTCATACCTTTGCCTCAGAATGGTTGAACTT
TGAGTTTGTCTGTAGCTGGCTGCACAGTGATCTTCTCACAATTCATTCTGCACATGAGCA
AGAATTCATCCACAGCAAAATAAAAGCGCAACAGGAACAAGATCCTACCAACCTCTACAT
TTCTAATTTGCCACTCTCCATGGATGAGCAAGAACTAGAAAATATGCTCAAACCATTTGG
ACAAGTTATTTCTACAAGGATACTACGTGATTCCAGTGGTACAAGTCGTGGTGTTGGCTT
TGCTAGGATGGAATCAACAGAAAAATGTGAAGCTGTTATTGGTCATTTTAATGGAAAATT
TATTAAGACACCACCAGGAGTTTCTGCCCCCACAGAACCTTTATTGTGTAAG