Cancer Analyis 2021 Module 3

Module 3: Genome Alignment

By Mathieu Bourgey, Ph.D
https://bitbucket.org/mugqic/genpipes

================================

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.

================================

In this workshop, we will present the main steps that are commonly used to process and to analyze cancer sequencing data. We will focus only on whole genome data and provide command lines that allow creating high quality alignment files usable for variant detection. This workshop will show you how to launch individual the first steps of a complete DNA-Seq SNV pipeline using to analyze cancer data

Data Source

We will be working on a CageKid sample pair, patient C0098. The CageKid project is part of ICGC and is focused on renal cancer in many of it’s forms. The raw data can be found on EGA and calls, RNA and DNA, can be found on the ICGC portal. For more details about CageKid

For practical reasons we subsampled the reads from the sample because running the whole dataset would take way too much time and resources.

All data used and generated in this workshop is accessible here

Environment setup


mkdir -p $HOME/workspace/CBW_CAN_2021/Module3/

docker run --privileged -v /tmp:/tmp --network host -it -w $PWD -v $HOME:$HOME \
--user $UID:$GROUPS -v /etc/group:/etc/group  -v /etc/passwd:/etc/passwd \
-v /etc/fonts/:/etc/fonts/ -v /media:/media c3genomics/genpipes:0.8


export REF=$MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh37/


cd $HOME/workspace/CBW_CAN_2021/Module3/

ln -s $HOME/CourseData/CAN_data/Module3/* .

Software requirements

These are all already installed, but here are the original links.

We should load the corresponding modules

module load mugqic/java/openjdk-jdk1.8.0_72 \
   mugqic/bvatools/1.6 \
   mugqic/trimmomatic/0.36 \
   mugqic/samtools/1.9 \
   mugqic/bwa/0.7.17 \
   mugqic/GenomeAnalysisTK/4.1.0.0 \
   mugqic/R_Bioconductor/3.5.0_3.7

Original Setup

The initial structure of your folders should look like this:

<ROOT>
|-- raw_reads/               # fastqs from the center (down sampled)
    `-- normal               # The blood sample directory
        `-- run*_?           # Lane directory by run number. Contains the fastqs
    `-- tumor                # The tumor sample directory
        `-- run*_?           # Lane directory by run number. Contains the fastqs
|-- savedResults             # Folder containing precomputed results
|-- scripts                  # cheat sheet folder
|-- adapters.fa              # fasta file containing the adapter used for sequencing

Cheat file

You can find all the unix command lines of this practical in the file: commands.sh

First data glance

So you’ve just received an email saying that your data is ready for download from the sequencing center of your choice.

What should you do ? solution

Fastq files

Let’s first explore the fastq file.

Try these commands

zless -S raw_reads/normal/run62DVGAAXX_1/normal.64.pair1.fastq.gz

Why was it like that ? solution

Now try these commands:

zcat raw_reads/normal/run62DVGAAXX_1/normal.64.pair1.fastq.gz | head -n4
zcat raw_reads/normal/run62DVGAAXX_1/normal.64.pair2.fastq.gz | head -n4

What was special about the output ? Why was it like that? Solution

You could also just count the reads

zgrep -c "^@HWUSI" raw_reads/normal/run62DVGAAXX_1/normal.64.pair1.fastq.gz

We should obtain 4003 reads

Quality

We can’t look at all the reads. Especially when working with whole genome 50x data. You could easily have Billions of reads.

Tools like FastQC and BVATools readsqc can be used to plot many metrics from these data sets.

Let’s look at the data:

# Generate original QC
mkdir -p originalQC/
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc --quality 64 \
  --read1 raw_reads/normal/run62DVGAAXX_1/normal.64.pair1.fastq.gz \
  --read2 raw_reads/normal/run62DVGAAXX_1/normal.64.pair2.fastq.gz \
  --threads 2 --regionName normalrun62DVGAAXX_1 --output originalQC/

To view the images hosted on your AWS instance, open an internet browser tab and type in http://<your IPv4 or your IPv4 DNS>, where you replace "<IPv4 or your IPv4 DNS>" with the ID for your AWS instance. This is the same ID you used to ssh into your instance (ssh -i CBW.pem ubuntu@<your IPv4 or your IPv4 DNS>). From here you can navigate through the files in your AWS workspace. The images are stored in CBW_CAN_2021/Module3/originalQC.

All the generated graphics have their uses. But 3 of them are particularly useful to get an overal picture of how good or bad a run went. - The Quality box plots - The nucleotide content graphs. - The Box plot shows the quality distribution of your data.

The quality of a base is computated using the Phread quality score.

The quality of a base is computated using the Phread quality score. Phred quality score formula

In the case of base quality the probability use represents the probability of base to have been wrongly called Base Quality values

The formula outputs an integer that is encoded using an ASCII table.

The way the lookup is done is by taking the the phred score adding 33 and using this number as a lookup in the table.

Older illumina runs, and the data here, were using phred+64 instead of phred+33 to encode their fastq files.

ACII table

What stands out in the graphs ? Solution

Why do we see adapters ? solution

Although nowadays this doesn’t happen often, it does still happen. In some cases, miRNA, it is expected to have adapters.

Trimming

Since adapter are not part of the genome they should be removed

To do that we will use Trimmomatic.

The adapter file is in your work folder.

cat adapters.fa

Why are there 2 different ones ? Solution

trimming with trimmomatic:

# Trim and convert data
for file in raw_reads/*/run*_?/*.pair1.fastq.gz;
do
  FNAME=`basename $file`;
  DIR=`dirname $file`;
  OUTPUT_DIR=`echo $DIR | sed 's/raw_reads/reads/g'`;

  mkdir -p $OUTPUT_DIR;
  java -Xmx2G -cp $TRIMMOMATIC_JAR org.usadellab.trimmomatic.TrimmomaticPE -threads 2 -phred64 \
    $file \
    ${file%.pair1.fastq.gz}.pair2.fastq.gz \
    ${OUTPUT_DIR}/${FNAME%.64.pair1.fastq.gz}.t30l50.pair1.fastq.gz \
    ${OUTPUT_DIR}/${FNAME%.64.pair1.fastq.gz}.t30l50.single1.fastq.gz \
    ${OUTPUT_DIR}/${FNAME%.64.pair1.fastq.gz}.t30l50.pair2.fastq.gz \
    ${OUTPUT_DIR}/${FNAME%.64.pair1.fastq.gz}.t30l50.single2.fastq.gz \
    TOPHRED33 ILLUMINACLIP:adapters.fa:2:30:15 TRAILING:30 MINLEN:50 \
    2> ${OUTPUT_DIR}/${FNAME%.64.pair1.fastq.gz}.trim.out ; 
done

cat reads/normal/run62DVGAAXX_1/normal.trim.out

note on trimmomatic command

What does Trimmomatic says it did ? Solution

Exercice: Let’s generate the new graphs Solution

How does it look now ? Solution

Alignment

The raw reads are now cleaned up of artefacts we can align each lane separatly.

Why should this be done separatly? Solution

Why is it important to set Read Group information ? Solution

Alignment with bwa-mem

# Align data
for file in reads/*/run*/*.pair1.fastq.gz;
do
  FNAME=`basename $file`;
  DIR=`dirname $file`;
  OUTPUT_DIR=`echo $DIR | sed 's/reads/alignment/g'`;
  SNAME=`echo $file | sed 's/reads\/\([^/]\+\)\/.*/\1/g'`;
  RUNID=`echo $file | sed 's/.*\/run\([^_]\+\)_.*/\1/g'`;
  LANE=`echo $file | sed 's/.*\/run[^_]\+_\(.\).*/\1/g'`;

  mkdir -p $OUTPUT_DIR;

  bwa mem -M -t 3 \
    -R "@RG\\tID:${SNAME}_${RUNID}_${LANE}\\tSM:${SNAME}\\t\
LB:${SNAME}_${RUNID}\\tPU:${RUNID}_${LANE}\\tCN:Centre National de Genotypage\\tPL:ILLUMINA" \
    ${REF}/genome/bwa_index/Homo_sapiens.GRCh37.fa \
    $file \
    ${file%.pair1.fastq.gz}.pair2.fastq.gz \
  | java -Xmx2G -jar ${GATK_JAR}  SortSam \
    -I /dev/stdin \
    -O ${OUTPUT_DIR}/${SNAME}.sorted.bam \
    --CREATE_INDEX true --SORT_ORDER coordinate --MAX_RECORDS_IN_RAM 500000
done

Why did we pipe the output of one to the other ? Solution

Could we have done it differently ? Solution

Lane merging

We now have alignments for each of the sequences lanes:

This is not practical in it’s current form.
What we wan’t to do now is merge the results into one BAM.

Since we identified the reads in the BAM with read groups, even after the merging, we can still identify the origin of each read.

# Merge Data
java -Xmx2G -jar ${GATK_JAR}  MergeSamFiles \
  -I alignment/normal/run62DPDAAXX_8/normal.sorted.bam \
  -I alignment/normal/run62DVGAAXX_1/normal.sorted.bam \
  -I alignment/normal/run62MK3AAXX_5/normal.sorted.bam \
  -I alignment/normal/runA81DF6ABXX_1/normal.sorted.bam \
  -I alignment/normal/runA81DF6ABXX_2/normal.sorted.bam \
  -I alignment/normal/runBC04D4ACXX_2/normal.sorted.bam \
  -I alignment/normal/runBC04D4ACXX_3/normal.sorted.bam \
  -I alignment/normal/runBD06UFACXX_4/normal.sorted.bam \
  -I alignment/normal/runBD06UFACXX_5/normal.sorted.bam \
  -O alignment/normal/normal.sorted.bam \
  --CREATE_INDEX true

java -Xmx2G -jar ${GATK_JAR}  MergeSamFiles \
  -I alignment/tumor/run62DU0AAXX_8/tumor.sorted.bam \
  -I alignment/tumor/run62DUUAAXX_8/tumor.sorted.bam \
  -I alignment/tumor/run62DVMAAXX_4/tumor.sorted.bam \
  -I alignment/tumor/run62DVMAAXX_6/tumor.sorted.bam \
  -I alignment/tumor/run62DVMAAXX_8/tumor.sorted.bam \
  -I alignment/tumor/run62JREAAXX_4/tumor.sorted.bam \
  -I alignment/tumor/run62JREAAXX_6/tumor.sorted.bam \
  -I alignment/tumor/run62JREAAXX_8/tumor.sorted.bam \
  -I alignment/tumor/runAC0756ACXX_5/tumor.sorted.bam \
  -I alignment/tumor/runBD08K8ACXX_1/tumor.sorted.bam \
  -I alignment/tumor/run62DU6AAXX_8/tumor.sorted.bam \
  -I alignment/tumor/run62DUYAAXX_7/tumor.sorted.bam \
  -I alignment/tumor/run62DVMAAXX_5/tumor.sorted.bam \
  -I alignment/tumor/run62DVMAAXX_7/tumor.sorted.bam \
  -I alignment/tumor/run62JREAAXX_3/tumor.sorted.bam \
  -I alignment/tumor/run62JREAAXX_5/tumor.sorted.bam \
  -I alignment/tumor/run62JREAAXX_7/tumor.sorted.bam \
  -I alignment/tumor/runAC0756ACXX_4/tumor.sorted.bam \
  -I alignment/tumor/runAD08C1ACXX_1/tumor.sorted.bam \
  -O alignment/tumor/tumor.sorted.bam \
  --CREATE_INDEX true

You should now have one BAM containing all your data.

Let’s double check

ls -l alignment/normal/
samtools view -H alignment/normal/normal.sorted.bam | grep "^@RG"

You should have your 9 read group entries.

Why did we use the -H switch? Solution

Try without. What happens? Solution

lane merging note

SAM/BAM exploration

Let’s spend some time to explore bam files.

samtools view alignment/normal/normal.sorted.bam | head -n4

Here you have examples of alignment results. A full description of the flags can be found in the SAM specification http://samtools.sourceforge.net/SAM1.pdf

You can try using picards explain flag site to understand what is going on with your reads http://broadinstitute.github.io/picard/explain-flags.html

The flag is the 2nd column.

What do the flags of the first 4th reads mean? solutions

Exercice: Let’s take the 3rd one, the one that is in proper pair, and find it’s mate. solutions

Why the pairing information is important ? solutions

SAM/BAM filtering

You can use samtools to filter reads as well.

Exercice: How many reads mapped and unmapped were there? solution

SAM/BAM CIGAR string

Another useful bit of information in the SAM is the CIGAR string. It’s the 6th column in the file.

This column explains how the alignment was achieved.

    M == base aligns *but doesn't have to be a match*. A SNP will have an M even if it disagrees with the reference.
    I == Insertion
    D == Deletion
    S == soft-clips. These are handy to find un removed adapters, viral insertions, etc.

An in depth explanation of the CIGAR can be found here

The exact details of the cigar string can be found in the SAM spec as well.

We won’t go into too much detail at this point since we want to concentrate on cancer specific issues now.

Cleaning up alignments

We started by cleaning up the raw reads. Now we need to fix some alignments.

The first step for this is to realign around indels and snp dense regions.

The Genome Analysis toolkit has a tool for this called IndelRealigner.

It basically runs in 2 steps:

Find the targets
Realign them

##GATK IndelRealigner

# Realign
#switch to old GATK 3.8
module unload  mugqic/GenomeAnalysisTK/4.1.0.0
module load mugqic/GenomeAnalysisTK/3.8

java -Xmx2G  -jar ${GATK_JAR} \
  -T RealignerTargetCreator \
  -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
  -o alignment/normal/realign.intervals \
  -I alignment/normal/normal.sorted.bam \
  -I alignment/tumor/tumor.sorted.bam \
  -L 9

java -Xmx2G -jar ${GATK_JAR} \
  -T IndelRealigner \
  -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
  -targetIntervals alignment/normal/realign.intervals \
  --nWayOut .realigned.bam \
  -I alignment/normal/normal.sorted.bam \
  -I alignment/tumor/tumor.sorted.bam

  mv normal.sorted.realigned.ba* alignment/normal/
  mv tumor.sorted.realigned.ba* alignment/tumor/


#return to GATK 4
module unload mugqic/GenomeAnalysisTK/3.8
module load  mugqic/GenomeAnalysisTK/4.1.0.0

Why did we use both normal and tumor together? Solution

How could we make this go faster ? Solution

How many regions did it think needed cleaning ? Solution

Indel Realigner also makes sure the called deletions are left aligned when there is a microsatellite or homopolymer.

This
ATCGAAAA-TCG
into
ATCG-AAAATCG

or
ATCGATATATATA--TCG
into
ATCG--ATATATATATCG

Why it is important ?Solution

Mark duplicates

What are duplicate reads ? Solution

What are they caused by ? Solution

What are the ways to detect them ? Solution

Here we will use picards approach:

# Mark Duplicates
java -Xmx2G -jar ${GATK_JAR}  MarkDuplicates \
  --REMOVE_DUPLICATES false --CREATE_INDEX true \
  -I alignment/normal/normal.sorted.realigned.bam \
  -O alignment/normal/normal.sorted.dup.bam \
  --METRICS_FILE alignment/normal/normal.sorted.dup.metrics

java -Xmx2G -jar ${GATK_JAR}  MarkDuplicates \
  --REMOVE_DUPLICATES false --CREATE_INDEX=true \
  -I alignment/tumor/tumor.sorted.realigned.bam \
  -O alignment/tumor/tumor.sorted.dup.bam \
  --METRICS_FILE alignment/tumor/tumor.sorted.dup.metrics

We can look in the metrics output to see what happened.

less -S alignment/normal/normal.sorted.dup.metrics
less -S alignment/tumor/tumor.sorted.dup.metrics

How many duplicates were there ? Solution

We can see that it computed separate measures for each library.

Why is this important to do not combine everything ? Solution

Note on Duplicate rate

Base Quality recalibration

Why do we need to recalibrate base quality scores ? Solution

It runs in 2 steps:
1 - Build covariates based on context and known snp sites
2 - Correct the reads based on these metrics

GATK BaseRecalibrator:

# Recalibrate
for i in normal tumor
do
  java -Xmx2G -jar ${GATK_JAR} BaseRecalibrator \
    -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
    --known-sites ${REF}/annotations/Homo_sapiens.GRCh37.dbSNP150.vcf.gz \
    -L 9:130215000-130636000 \
    -O alignment/${i}/${i}.sorted.dup.recalibration_report.grp \
    -I alignment/${i}/${i}.sorted.dup.bam

    java -Xmx2G -jar ${GATK_JAR} ApplyBQSR \
      -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
      -bqsr alignment/${i}/${i}.sorted.dup.recalibration_report.grp \
      -O alignment/${i}/${i}.sorted.dup.recal.bam \
      -I alignment/${i}/${i}.sorted.dup.bam
done

Extract BAM metrics

Once your whole bam is generated, it’s always a good thing to check the data again to see if everything makes sense.

Compute coverage If you have data from a capture kit, you should see how well your targets worked

Insert Size It tells you if your library worked

Alignment metrics It tells you if your sample and you reference fit together

Compute coverage

Both GATK and BVATools have depth of coverage tools.

Here we’ll use the GATK one

# Get Depth
#switch to old GATK 3.8
module unload  mugqic/GenomeAnalysisTK/4.1.0.0
module load mugqic/GenomeAnalysisTK/3.8

for i in normal tumor
do
  java  -Xmx2G -jar ${GATK_JAR} \
    -T DepthOfCoverage \
    --omitDepthOutputAtEachBase \
    --summaryCoverageThreshold 10 \
    --summaryCoverageThreshold 25 \
    --summaryCoverageThreshold 50 \
    --summaryCoverageThreshold 100 \
    --start 1 --stop 500 --nBins 499 -dt NONE \
    -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
    -o alignment/${i}/${i}.sorted.dup.recal.coverage \
    -I alignment/${i}/${i}.sorted.dup.recal.bam \
    -L 9:130215000-130636000 
done

#return to GATK 4
module unload mugqic/GenomeAnalysisTK/3.8
module load  mugqic/GenomeAnalysisTK/4.1.0.0

note on DepthOfCoverage command

Coverage is the expected ~50x in this project

Look at the coverage:

cat alignment/normal/normal.sorted.dup.recal.coverage.sample_interval_summary
cat alignment/tumor/tumor.sorted.dup.recal.coverage.sample_interval_summary

Is the coverage fit with the expectation ? solution

Insert Size

It corresponds to the size of DNA fragments sequenced.

Different from the gap size (= distance between reads) !

These metrics are computed using Picard:

# Get insert size
for i in normal tumor
do
  java -Xmx2G -jar ${GATK_JAR}  CollectInsertSizeMetrics \
    -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
    -I alignment/${i}/${i}.sorted.dup.recal.bam \
    -O alignment/${i}/${i}.sorted.dup.recal.metric.insertSize.tsv \
    -H alignment/${i}/${i}.sorted.dup.recal.metric.insertSize.histo.pdf \
    --METRIC_ACCUMULATION_LEVEL LIBRARY
done

look at the output

head -20 alignment/normal/normal.sorted.dup.recal.metric.insertSize.tsv
head -20 alignment/tumor/tumor.sorted.dup.recal.metric.insertSize.tsv

There is something interesting going on with our libraries.

Can you tell what it is? Solution

Which library is the most suitable for cancer analysis ? Solution

Alignment metrics

For the alignment metrics, samtools flagstat is very fast but with bwa-mem since some reads get broken into pieces, the numbers are a bit confusing.

We prefer the Picard way of computing metrics:

# Get alignment metrics
for i in normal tumor
do
  java -Xmx2G -jar ${GATK_JAR}  CollectAlignmentSummaryMetrics \
    -R ${REF}/genome/Homo_sapiens.GRCh37.fa \
    -I alignment/${i}/${i}.sorted.dup.recal.bam \
    -O alignment/${i}/${i}.sorted.dup.recal.metric.alignment.tsv \
    --METRIC_ACCUMULATION_LEVEL LIBRARY
done

explore the results

less -S alignment/normal/normal.sorted.dup.recal.metric.alignment.tsv
less -S alignment/tumor/tumor.sorted.dup.recal.metric.alignment.tsv

Do you think the sample and the reference genome fit together ? Solution

Exit the container environment

exit

Aknowledgments

I would like to thank and acknowledge Pierre-Olivier Quirion, Robert Eveleigh, Edouard Henrion for their inputs while building this workshop.