Gene Fusions Tutorial - DISCASM and GMAP-Fusion
DISCASM and GMAP-Fusion Tutorial
DISCASM and GMAP-fusion are software focused on exploring transcripts that are ill-represented within a reference genome, either lacking genome representation (missing genes, foreign transcripts) or otherwise not well represented in the reference genome sequence due to chromosomal rearrangements. This is highly relevant for cancer biology for detecting fusion transcripts and oncoviruses from RNA-Seq data. DISCASM and GMAP-Fusion were developed as components of the broader Trinity Cancer Transcriptome Analysis Toolkit (CTAT). Each tool is briefly described below.
-
DISCASM targets RNA-Seq reads that are discordantly aligned to the genome or lacking alignment to the genome altogether for de novo transcriptome assembly. De novo transcriptome assembly is performed using either Trinity or Oases.
-
GMAP-fusion use the GMAP aligner to identify candidate fusion transcripts followed by additional analysis to weed out likely artifacts and further score fusion candidates based on RNA-Seq read supporting evidence.
These tools can be used separately or synergistically. In this tutorial, we showcase the latter, where we use DISCASM to first assemble discordant and unmapped reads, and then use GMAP-fusion to identify reconstructed fusion transcripts. We also further explore the reconstructed transcripts from discordant and genome-unmapped reads to see if we have evidence of foreign transcripts such as derived from cancer-relevant viruses. For exploring foreign transcripts, we leverage Centrifuge.
Tutorial Software and Data Setup
All software and data required to run through the tutorial is installed locally on the server. You just need to copy it over to your workspace like so:
% cp -r ~/CourseData/CG_data/Module9/GMAP-fusion .
This will add the tools GMAP-fusion and DISCASM along with the required tutorial data.
We next just need to do a little environment configuration so that the system will find the required software.
Set up the following environmental variables like so:
% cd GMAP-fusion/
% export GMAP_FUSION_HOME=`pwd`
% export DISCASM_HOME=`pwd`/DISCASM
Tutorial data contents
Next enter the Tutorial/ subdirectory
% cd Tutorial
The data here correspond to a small region of human chromosome 2 and RNA-Seq reads derived from a breast cancer cell line HCC1395, and further supplemented with data from another cell line SKBR3 provided by Henrik Edgren et al.
Note, parts of this tutorial are based on earlier work by Andrew McPherson using cell line data generated by Malachi Griffith.
List the files in the Tutorial directoryL
% ls -l
-rw-rw-r-- 1 ubuntu ubuntu 2488924 Mar 14 10:14 HCC1395-miniplus_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 74 Mar 14 10:14 README.md
-rw-rw-r-- 1 ubuntu ubuntu 2500464 Mar 14 10:14 HCC1395-miniplus_2.fastq.gz
-rwxrwxr-x 1 ubuntu ubuntu 195 Mar 14 10:14 cleanMe.sh
-rw-rw-r-- 1 ubuntu ubuntu 1565706 Mar 14 10:14 minichr2.fa
-rw-rw-r-- 1 ubuntu ubuntu 684002 Mar 14 10:14 minichr2.gtf
-rw-rw-r-- 1 ubuntu ubuntu 98099026 Mar 14 10:14 centrifuge_VirusDB.tar.gz
drwxrwxr-x 2 ubuntu ubuntu 4096 Mar 14 10:14 centrifuge_VirusDB
Data provided in the tutorial specifically include:
minichr2.fa : short segment of human chr2
minichr2.gtf : corresponding gene annotations for this region of chr2
HCC1395-miniplus_1.fastq.gz : 'left' fragment cancer line RNA-Seq reads
HCC1395-miniplus_2.fastq.gz : 'right' fragment cancer line RNA-Seq reads
Steps of the Tutorial
The following tutorial will take you through preparing the above provided genome and annotation files for use with Trinity CTAT tools, followed by aligning reads, capturing the discordant and unmapped reads, performing de novo transcriptome assembly, and identifying reconstructed fusion transcripts as well as any interesting transcripts supported by the RNA-Seq data but not represented by the target genome.
Prep the genome for use with Trinity CTAT
Run the following to prepare the genome and annotations for use with GMAP-Fusion, DISCASM, STAR, and other processes we’ll need for fusion discovery and analysis as part of Trinity CTAT.
% $GMAP_FUSION_HOME/FusionFilter/prep_genome_lib.pl \
--genome_fa minichr2.fa \
--gtf minichr2.gtf \
--gmap_build
Align RNA-Seq reads to the genome
Use STAR to align the RNA-Seq data to the genome:
STAR --genomeDir ctat_genome_lib_build_dir/ref_genome.fa.star.idx \
--readFilesIn HCC1395-miniplus_1.fastq.gz HCC1395-miniplus_2.fastq.gz \
--outReadsUnmapped None \
--chimSegmentMin 12 \
--chimJunctionOverhangMin 12 \
--alignSJDBoverhangMin 10 \
--alignMatesGapMax 100000 \
--alignIntronMax 100000 \
--alignSJstitchMismatchNmax 5 -1 5 5 \
--runThreadN 4 \
--outSAMstrandField intronMotif \
--outSAMunmapped Within \
--outSAMtype BAM Unsorted \
--outSAMattrRGline ID:GRPundef \
--chimSegmentReadGapMax 3 \
--genomeLoad NoSharedMemory \
--twopassMode Basic \
--readFilesCommand 'gunzip -c'
# takes less than a minute
There are two important output files that will be generated by the above alignment command:
Aligned.out.bam : bam file containing the aligned reads
Chimeric.out.junction : identification of reads aligning discordantly to the genome
We’ll use both of these files when running DISCASM below.
De novo assemble discordant and genome-unmapped reads using DISCASM
Assemble the discordantly-aligned and genome-unmapped reads using DISCASM with the Trinity assembler, leveraging the outputs from STAR above like so:
% $DISCASM_HOME/DISCASM --chimeric_junctions Chimeric.out.junction \
--aligned_bam Aligned.out.bam \
--denovo_assembler Trinity \
--left_fq HCC1395-miniplus_1.fastq.gz \
--right_fq HCC1395-miniplus_2.fastq.gz \
--out_dir discasm
# takes a few minutes
Once the above completes, you should find the de novo assembled transcripts at:
discasm/trinity_out_dir/Trinity.fasta
Take a look at the file:
% head discasm/trinity_out_dir/Trinity.fasta .
>TRINITY_DN0_c0_g2_i1 len=243 path=[0:0-242]
GCCCAGGAGGCAGAGGCTTCAGTGAGCTGAGATTGTGCTACTATACTCCAACCTGGGTGAGTGTGAGAATACAGGTGTGCACACCACACTGCACTGCTTTTTAAATTTTTTGTAGATGTGAAGTCTCACTGTATTGACCAGGCTGGTCTCGAACTTAGGAGATCAAGCAGTCTTCCTGCCTCACCCTCCAAAGGTGCTGGGATTACAGGCATGAGCTACTGCACCTGGCCAGAGGCAACATTA
>TRINITY_DN11_c0_g1_i1 len=231 path=[1:0-230]
CTTGGTGGATTAACACAGCAAAGGTTTATTTCTCACCCACACCTGCACGGGCTGGCAAGAGGACTTTGCTCTGTTCAGTCATTCATGGGTCCATGCTTCTCCTATATAGCCAGCCCTCTGTCTCCTAAGGCCTGGAGTCCTCCATGGGCCTCCAATCTGTGGCTGACAGAGGATGTTGACTTGAGACCCATCTTTGATTCATTATGGAGCTTCCACAGGCTGTTGAAATCC
>TRINITY_DN11_c0_g1_i2 len=187 path=[0:0-186]
CTTTGTGTTTGAGAGGTTGGGAAAGAAGAACCAGGGGTGCTCTCAGGTAAATCAGTTTTAAGAAAAAATATACATGAAGATTCTGGAAATTAAGTTTCTTAAATACCCATGCTCCTGTATTCCTTAGGCATGTGTCCCCAGTTTGAGGACTTTTGACTTAGAGGATTTCAACAGCCTGTGGAAGCTC
>TRINITY_DN11_c0_g2_i1 len=261 path=[0:0-260]
TTGATTTGGCCTCATTCTGGCTGTCCCTGGTGCTGTCTTATCTCTTGAGAACTGGCTGTGTCTCCTGTTGAATTCCCTATCCGAACACTGCAGTTCTTGGCTACAAGGCAGTCACTTCACCCCTGCAGTGGTGAGATCCCATTAGCTACCATTAGCTACTCTTTTCGTCCAGAAAGGCGAAATGCCATAGCAAACAATCCCACAGTCTTGGTGGATTAACACAGCAAAGTCCTCTTGCCAGCCCATGCAGGTGTGGGTGAG
>TRINITY_DN12_c1_g3_i1 len=113 path=[0:0-112]
AGTATGATGTATGCTTTGTGCTGTTCTTCCAGCAATATCTGTAGTATGATAAGCTATTTGTCTCCTTGTTTCTCTAAATAAATGTTCCCATACAGCTCTTCCTACTATATTAT
Mine the reconstructed transcripts for fusions using GMAP-fusion
Discordantly-aligned reads are a signature for fusion transcripts. See if we were able to assembly an fusion transcripts from the discordantly (or unmapped) reads by running GMAP-fusion on the reconstructed transcripts:
% $GMAP_FUSION_HOME/GMAP-fusion \
-T discasm/trinity_out_dir/Trinity.fasta \
--left_fq HCC1395-miniplus_1.fastq.gz \
--right_fq HCC1395-miniplus_2.fastq.gz \
--genome_lib_dir ctat_genome_lib_build_dir/ \
--output gmapf
# takes a couple seconds
The fusion predictions will be found as file ‘gmapf/GMAP-fusion.final’.
Examine the fusion predictions. Do you find any?
% cat gmapf/GMAP-fusion.final
.
#fusion_name J S trans_acc trans_brkpt geneA chrA coordA geneB chrB coordB junction_type
PLA2R1--RBMS1 199 41 TRINITY_something... 272-271 PLA2R1 chr2 272580 RBMS1 chr2 599999 ONLY_REF_SPLICE
Exploring fusions using the UCSC Genome Browser
From the ‘discasm/trinity_out_dir/Trinity.fasta’ file, locate the fasta-formatted sequence for the transcript predicted as a fusion. (Try using ‘less’ unix utility and searching for the entry).
if you can’t easily locate yours, here’s mine
Visit the UCSC Genome Browser.
From the menu, go to ‘Tools’->’Blat’
Paste your fusion transcript into the form field and submit.
You should find the following results similar to this:
Explore the top two hits that are 100% identical to regions of chromosome 2.
Which genes do they match? Can you infer where the fusion breakpoint is based on the transcript alignments?
Comparison of fusion transcript with DNA-based fusion breakpoint
Earlier work by others using WGS sequence data and the destruct software yielded the following sequence supporting the physical breakpoint in the genome.
>destruct_31240
TGCAAAAGATCTGGAAAAATGCAGTCTGGTATTTACACATAATTTAAGTTCACAGTGC
AACTGCTCCCATAACCCTAGCTGAAACTGTCTCTTCTTAGTCATTTTTAATTTTCCAA
GATAACTTGGCAAAGCTATTGTTGTTGACATAATAAAGACTGGGCAGAAGGCTTACCT
AGCAAAGCCAACACCACGACTTGTACCACTGGAATCACGTAGTATCCTTGTAGAAATA
ACTTGTCCAAATGGTTTGAGCATATTTTCTAGTTCTTGCTCATCCATGGAGAGTGGCA
AATTAGAAATGTAGAGGTTGGTAGGATCTTGTTCCTGTTGCTAAAACAGAAGAGAGTG
TTGTCCATTAATTTCCAACAGAAGGTGAGATATTTATGTTAACACACCTATTTTTATT
AGCTACTTTCTTTGCTCAAGTCCTTTTAAAGTACTCAGAACCTCAGAACACCAAAGTC
ACCCTGGACTCTTGAAAATAGTGTCTGAAGCTTGGACAA[AA]AAAAAGTAATATTAG
AAAATGAATTCATTTTCTGACAAAAAATTATTGGCTCATCCTCTCAGTTATTTACCCT
CTCAGTGATTTATAATTCATTGCATATGTCACATGTATTTGAAAAACAATTCAAGGTA
TCAAGGCATCATTAGTATAAAGATACTGATTTTAGGTATTAGTCTGATTGCTAAGCTT
TAAGCAGTATAAGCTTTCCTTCCCATTCAAATAGAGAGACACAATATAGGACAAAAGA
ATACTACAGAGTGCCCAGTGTTTGACAACTAGAAAATTATCCTTTTGATGAGTTCATG
TCCTTTGCAGGGACATGGATGAAGCTGGAAACCATCAATCTCAGCAAACTAACACATG
AACAGAAAACCAAACACCGCATGTTCTCACTCATAAGTGGGAGCTGAACAATGAGAAC
ACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCGG
Use the UCSC Blat utility to align the above along with your fusion transcript sequence and examine the alignment results in the context of the fusion transcript alignments results.
How do the genomic breakpoints compare to the fusion-transcript breakpoints? Why do they differ?
Detection of foreign transcripts (eg. oncoviruses)
Search the de novo reconstructed transcripts for evidence of viruses using the Centrifuge software along with the database of viral sequences provided.
% centrifuge -x centrifuge_VirusDB/abv -f \
-U discasm/trinity_out_dir/Trinity.fasta \
> centrifuge.matches.txt
The above will identify transcripts with sequence matches to known viruses, with the matches provided in the ‘centrifuge.matches.txt’ file, and summarizes the findings in a file ‘centrifuge_report.tsv’.
Examine the ‘centrifuge_report.tsv’ file. Were there any matches?
% cat centrifuge_report.tsv
name taxID taxRank genomeSize numReads numUniqueReads abundance
Bos taurus polyomavirus 1 1891754 species 4697 11 11 1
Remember, this was a cancer cell line RNA-Seq data set. Why might we find the above polyomavirus in such a sample? (Try using Google with terms ‘Bos taurus polyomavirus cell culture’ and see what you find)
Concluding remarks
Congratulations on finishing the DISCASM / GMAP-fusion tutorial! This was, of course, a fairly contrived example of how one might use these tools to explore the somewhat ‘dark matter of the cancer transcriptome’. The data sets used here were small enough to be run through on a basic laptop computer in a short amount of time. Running the above system on larger cancer data sets is far more time consuming and computationally challenging, but highly rewarding with respect to biological discoveries.
Be sure to visit our Trinity CTAT site and subscribe to our Trinity CTAT Google forum to stay apprised of our latest developments and announcements.