Module 3
Lab
Introduction
This workshop will show you how to launch individual first steps of a DNA-Seq pipeline
We will be working on a 1000 genome sample, NA12878. You can find the whole raw data on the 1000 genome website: http://www.1000genomes.org/data
NA12878 is the child of the trio while NA12891 and NA12892 are her parents.
| Mother | Father | Child |
|---|---|---|
| NA12892 | NA12891 | NA12878 |
If you finish early, feel free to perform the same steps on the other two individuals: NA12891 & NA12892.
For practical reasons we subsampled the reads from the sample because running the whole dataset would take way too much time and resources. We’re going to focus on the reads extracted from a 300 kbp stretch of chromosome 1.
| Chromosome | Start | End |
|---|---|---|
| 1 | 17704860 | 18004860 |
Original Setup
Environment setup
Launching the container
mkdir -p $HOME/workspace/HTG/Module3/
docker run --privileged -v /tmp:/tmp --network host -it -w $PWD -v $HOME:$HOME \
--user $UID:$GROUPS -v /etc/group:/etc/group -v /etc/passwd:/etc/passwd \
-v /etc/fonts/:/etc/fonts/ -v /media:/media c3genomics/genpipes:0.8
Variable Assignment
export WORK_DIR_M3=$HOME/workspace/HTG/Module3/
export REF=$MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh37/
Data Files
The initial structure of your folders should look like this:
ROOT
|-- raw_reads/ # fastqs from the center (down sampled)
`-- NA12878/ # Child sample directory
`-- NA12891/ # Father sample directory
`-- NA12892/ # Mother sample directory
`-- scripts/ # command lines scripts
`-- saved_results/ # precomputed final files
`-- adapters.fa # fasta file containing the adapter used for sequencing`
First Data Glance
So you’ve just recieved an email saying that your data is ready for download from the sequencing center of your choice.
1. What should you do?
Solution (click here)
The first thing to do is to download it.
The second thing is making sure it is of good quality.
FASTQ Files
Let’s first explore the FASTQ file.
Try these commands.
less -S raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz
These are FASTQ files.
1. Could you describe the FASTQ format?
Solution (click here)
There is four lines for each read:
- Header 1
- DNA sequence
- Header 2
- Quality values
zcat raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz | head -n4
zcat raw_reads/NA12878/NA12878_CBW_chr1_R2.fastq.gz | head -n4
2. What was special about the output and why was it like like? :::: {.callout title=“Solution (click here)”type=“gray” style=“subtle” icon=“true” collapsible=“true”}
It's the same header with a /1 or /2 towards the end. Meaning these are paired data.
::::
You could also count the reads.
zgrep -c "^@SN1114" raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz
We found 56512 reads.
3. Why shouldn’t you just do?
zgrep -c "^@" raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz
Solution (click here)
Because the ASCII quality character has @ as a valid value. If the quality line starts with this character you'll count it as a read.
By this method 82325 counts are found!
Quality
We can’t look at all the reads. Especially when working with whole genome 30x data. You could easilly have Billions of reads.
Tools like FastQC and BVATools readsqc can be used to plot many metrics from these data sets.
Let’s look at the data:
mkdir -p originalQC/NA12878/
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc \
--read1 raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz \
--read2 raw_reads/NA12878/NA12878_CBW_chr1_R2.fastq.gz \
--threads 2 --regionName ACTL8 --output originalQC/NA12878/
Open the images
1. What stands out in the graphs?
Solution (click here)
Of the raw data we see that:
- Some reads have 3' ends.
All the generated graphics have their uses. This being said, 2 of them are particularly useful to get an overal picture of how good or bad a run went.
These are the Quality box plots

and the nucleotide content graphs.

The Box plot shows the quality distribution of your data. The Graph goes > 100 because both ends are appended one after the other.
The quality of a base is computated using the Phread quality score.

The formula outputs an integer that is encoded using an ASCII table.

The way the lookup is done is by taking the the phred score adding 33 and using this number as a lookup in the table. The Wikipedia entry for the FASTQ format has a summary of the varying values.
Older illumina runs were using phred+64 instead of phred+33 to encode their fastq files.
Trimming
After this careful analysis of the raw data we see that
- Some reads have bad 3’ ends.
- No read has adapter sequences in it.
Although nowadays this doesn’t happen often, it does still happen. In some cases, miRNA, it is expected to have adapters. Since they are not part of the genome of interest they should be removed if enough reads have them.
To be able to remove adapters and low qualtity bases, we will use Trimmomatic.
The adapter file is already in your reference folder.
We can look at the adapters
cat adapters.fa
1. Why are there 2 different ones?
Solution (click here)
Because both ends of the fragment don't have the same adapter.
Let’s try removing them and see what happens.
mkdir -p reads/NA12878/
java -Xmx2G -cp $TRIMMOMATIC_JAR org.usadellab.trimmomatic.TrimmomaticPE -threads 2 -phred33 \
raw_reads/NA12878/NA12878_CBW_chr1_R1.fastq.gz \
raw_reads/NA12878/NA12878_CBW_chr1_R2.fastq.gz \
reads/NA12878/NA12878_CBW_chr1_R1.t20l32.fastq.gz \
reads/NA12878/NA12878_CBW_chr1_S1.t20l32.fastq.gz \
reads/NA12878/NA12878_CBW_chr1_R2.t20l32.fastq.gz \
reads/NA12878/NA12878_CBW_chr1_S2.t20l32.fastq.gz \
ILLUMINACLIP:adapters.fa:2:30:15 TRAILING:20 MINLEN:32 \
2> reads/NA12878/NA12878.trim.out
cat reads/NA12878/NA12878.trim.out
2. What does Trimmomatic says it did?
Solution (click here)
Of the 56512 input pairs:
- 99.92% were kept
- 0.06% had only a valid read1
- 0.02% had only a valid read2
- None were fully discarded
Let’s look at the graphs now
mkdir -p postTrimQC/
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc \
--read1 reads/NA12878/NA12878_CBW_chr1_R1.t20l32.fastq.gz \
--read2 reads/NA12878/NA12878_CBW_chr1_R2.t20l32.fastq.gz \
--threads 2 --regionName ACTL8 --output postTrimQC/
** 3. How does it look now?**
Solution (click here)
It looks better, but there is still some medium-low quality bases that remain.
** 4. Could we have done a better job?**
Solution (click here)
A sliding windows approach would have been more efficient in this case but it comes with the cost of losing more reads (92% both survived)
Raw:
<img src="./img/raw.png" width="1000" alt="Raw.">
Trailing:
<img src="./img/TR_trim_res.png" width="1000" alt="Trimming
Sliding windows:
<img src="./img/SW_trim_res.png" width="1000" alt="Sliding window.">
Alignment
The raw reads are now cleaned up of artefacts we can align the read to the reference.
In case you have multiple readsets or library you should align them separately!
1. Why should this be done separately?
Solution (click here)
For speed, you can align each in parallel.
To track where the reads came from.
mkdir -p alignment/NA12878/
bwa mem -M -t 2 \
-R '@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPU:runNA12878_1\tCN:Broad Institute\tPL:ILLUMINA' \
$REF/genome/bwa_index/Homo_sapiens.GRCh37.fa \
reads/NA12878/NA12878_CBW_chr1_R1.t20l32.fastq.gz \
reads/NA12878/NA12878_CBW_chr1_R2.t20l32.fastq.gz \
| java -Xmx2G -jar ${GATK_JAR} SortSam \
-I /dev/stdin \
-O alignment/NA12878/NA12878.sorted.bam \
-SO coordinate \
--CREATE_INDEX true --MAX_RECORDS_IN_RAM 500000
2. Why is it important to set Read Group information?
Solution (click here)
Many tools require it (not the best reason).
To help differentiate lanes of sequencing in the final BAM.
When generating metrics, many tools can use this information to generate separate metrics.
The details of the fields can be found in the SAM/BAM specifications Here For most cases, only the sample name, platform unit and library one are important.
3. Why did we pipe the output of one to the other? Could we have done it differently?
Solution (click here)
Mostly to save IO operations and space. Piping skips the SAM generation.
The problem though is that more RAM and processors are needed since we sort the output automatically.
Lane Merging (Optional)
In case we generate multiple lane of sequencing or mutliple library. It is not practical to keep the data splited and all the reads should be merge into one massive file.
Since we identified the reads in the BAM with read groups, even after the merging, we can still identify the origin of each read.
SAM/BAM
Let’s spend some time to explore bam files.
Try
samtools view alignment/NA12878/NA12878.sorted.bam | head -n4
Here you have examples of alignment results. A full description of the flags can be found in the SAM specification.
Try using picards explain flag site to understand what is going on with your reads.
The flag is the 2nd column.
1. What do the flags of the first 4 reads mean?
Solution (click here)
Flag 99:
The read is paired The read is mapped in proper pair The read is the first in the pair The mate is in reverse strand Flag 147:
The read is paired The read is mapped in proper pair The read is the second in the pair The read is on reverse strand Flag 161:
The read is paired The read is the second in the pair The mate is on reverse strand Flag 65:
The read is paired The read is the first in the pair By looking at the read names you can notice that these 4 entries represent 4 read of 3 different pairs
Let’s take the 3nd one and find it’s pair.
Try
samtools view alignment/NA12878/NA12878.sorted.bam | grep "1313:19317:61840"
2. Why did searching one name find both reads?
Solution (click here)
Because we sequenced paired-end reads and the bam file contains the 2 reads of the pair.
You can use samtools to filter reads as well.
# Say you want to count the *un-aligned* reads, you can use
samtools view -c -f4 alignment/NA12878/NA12878.sorted.bam
# Or you want to count the *aligned* reads you, can use
samtools view -c -F4 alignment/NA12878/NA12878.sorted.bam
3. How many reads mapped and unmapped were there?
Solution (click here)
If you want to count the un-aligned reads you can use:
Number of unmapped reads : 284
Number of mapped reads : 112740
Another useful bit of information in the SAM is the CIGAR string. It’s the 6th column in the file. This column explains how the alignment was achieved.
- M == base aligns but doesn’t have to be a match. A SNP will have an M even if it disagrees with the reference.
- I == Insertion
- D == Deletion
- S == soft-clips. These are handy to find un removed adapters, viral insertions, etc.
An in depth explanation of the CIGAR can be found here The exact details of the cigar string can be found in the SAM spec as well. Another good site
Cleaning up Alignments
We started by cleaning up the raw reads. Now we need to fix and clean some alignments.
Indel realignment
The first step for this is to realign around indels and snp dense regions. The Genome Analysis toolkit has a tool for this called IndelRealigner.
It basically runs in 2 steps
1 - Find the targets 2 - Realign them.
#switch to old GATK 3.8
module unload mugqic/GenomeAnalysisTK/4.1.0.0
module load mugqic/GenomeAnalysisTK/3.8
java -Xmx2G -jar ${GATK_JAR} \
-T RealignerTargetCreator \
-R $REF/genome/Homo_sapiens.GRCh37.fa \
-o alignment/NA12878/realign.intervals \
-I alignment/NA12878/NA12878.sorted.bam \
-L 1
java -Xmx2G -jar ${GATK_JAR} \
-T IndelRealigner \
-R $REF/genome/Homo_sapiens.GRCh37.fa \
-targetIntervals alignment/NA12878/realign.intervals \
-o alignment/NA12878/NA12878.realigned.sorted.bam \
-I alignment/NA12878/NA12878.sorted.bam
#return to GATK 4
module unload mugqic/GenomeAnalysisTK/3.8
module load mugqic/GenomeAnalysisTK/4.1.0.0
1. How many regions did it think needed cleaning?
Solution (click here)
wc -l alignment/NA12878/realign.intervals
286 to be precise. But it does pickup all the regions with any reads with deletions. Which does not mean that every region has to be realigned.
Mark Duplicates
As the step says, this is to mark duplicate reads.
2. What are duplicate reads?
Solution (click here)
Different read pairs representing the same initial DNA fragment.
3. What are they caused by?
Solution (click here)
PCR reactions (PCR duplicates).
Some clusters that are thought of being separate in the flowcell but are the same (optical duplicates - old illumina flowcell).
in the new illumina flowcell when the loading is low one fragement can propagate from one nanowell to another free neighbour nanowell (Clustering duplicates).
4. What are the ways to detect them?
Solution (click here)
GATK and samtools uses the alignment positions:
- Both 5’ ends of both reads need to have the same positions.
- Each reads have to be on the same strand as well.
Another method is to use a kmer approach:
- Take a part of both ends of the fragment
- Build a hash table
- Count the similar hits
Brute force, compare all the sequences.
Here we will use the GATK approach:
java -Xmx2G -jar ${GATK_JAR} MarkDuplicates \
--REMOVE_DUPLICATES false --CREATE_INDEX true \
-I alignment/NA12878/NA12878.realigned.sorted.bam \
-O alignment/NA12878/NA12878.sorted.dup.bam \
--METRICS_FILE=alignment/NA12878/NA12878.sorted.dup.metrics
We can look in the metrics output to see what happened.
less alignment/NA12878/NA12878.sorted.dup.metrics
5. How many duplicates were there?
Solution (click here)
- 0.6%
Note that most of theses duplicates come form PCR reactions
This is very low, we expect in general <2%.
Note it computed the metrics for each library.
6. Why is this important to do it by library and not to combine everything?
Solution (click here)
Each library represents a set of different DNA fragments.
Each library involves different PCR reactions.
So PCR duplicates can not occur between fragment of two different libraries.
But similar fragment could be found between libraries when the coverage is high.
Recalibration
This is the last BAM cleaning up step.
The goal for this step is to try to recalibrate base quality scores. The vendors tend to inflate the values of the bases in the reads. Also, this step tries to lower the scores of some biased motifs for some technologies.
It runs in 2 steps,
1 - Build covariates based on context and known snp sites 2 - Correct the reads based on these metrics
java -Xmx2G -jar ${GATK_JAR} BaseRecalibrator \
-R ${REF}/genome/Homo_sapiens.GRCh37.fa \
--known-sites ${REF}/annotations/Homo_sapiens.GRCh37.dbSNP150.vcf.gz \
-L 1:17704860-18004860 \
-O alignment/NA12878/NA12878.sorted.dup.recalibration_report.grp \
-I alignment/NA12878/NA12878.sorted.dup.bam
java -Xmx2G -jar ${GATK_JAR} ApplyBQSR \
-R ${REF}/genome/Homo_sapiens.GRCh37.fa \
-bqsr alignment/NA12878/NA12878.sorted.dup.recalibration_report.grp \
-O alignment/NA12878/NA12878.sorted.dup.recal.bam \
-I alignment/NA12878/NA12878.sorted.dup.bam
Extract Metrics
Once your whole bam is generated, it’s always a good thing to check the data again to see if everything makes sense.
Compute coverage
If you have data from a capture kit, you should see how well your targets worked.
Both GATK and BVATools have depth of coverage tools. We wrote our own in BVAtools because
- GATK was deprecating theirs -GATK’s is very slow
- We were missing some output that we wanted from the GATK’s one (GC per interval, valid pairs, etc)
Here we’ll use the GATK one
#switch to old GATK 3.8
module unload mugqic/GenomeAnalysisTK/4.1.0.0
module load mugqic/GenomeAnalysisTK/3.8
java -Xmx2G -jar ${GATK_JAR} \
-T DepthOfCoverage \
--omitDepthOutputAtEachBase \
--summaryCoverageThreshold 10 \
--summaryCoverageThreshold 25 \
--summaryCoverageThreshold 50 \
--summaryCoverageThreshold 100 \
--start 1 --stop 500 --nBins 499 -dt NONE \
-R ${REF}/genome/Homo_sapiens.GRCh37.fa \
-o alignment/NA12878/NA12878.sorted.dup.recal.coverage \
-I alignment/NA12878/NA12878.sorted.dup.recal.bam \
-L 1:17700000-18100000
#return to GATK 4
module unload mugqic/GenomeAnalysisTK/3.8
module load mugqic/GenomeAnalysisTK/4.1.0.0
#### Look at the coverage
less -S alignment/NA12878/NA12878.sorted.dup.recal.coverage.sample_interval_summary
Coverage is the expected ~30x.
summaryCoverageThreshold is a usefull function to see if your coverage is uniform. Another way is to compare the mean to the median. If both are almost equal, your coverage is pretty flat. If both are quite different, that means something is wrong in your coverage.
Insert Size
java -Xmx2G -jar ${GATK_JAR} CollectInsertSizeMetrics \
-R ${REF}/genome/Homo_sapiens.GRCh37.fa \
-I alignment/NA12878/NA12878.sorted.dup.recal.bam \
-O alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.tsv \
-H alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.histo.pdf \
--METRIC_ACCUMULATION_LEVEL LIBRARY
#look at the output
less -S alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.tsv
1. What is the insert size and the corresponding standard deviation?
Solution (click here)
head -9 alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.tsv | tail -3 | cut -f6,7
- Insert size = 295bp
- SD = 58bp
2. Is the insert-size important?
Solution (click here)
Yes.
In general we try to obtain the longest fragment with high precision (smallest SD).
The final size of the fragment is not a big issue if it’s generally longer than the sequencing design (here 200bp) to avoid overlapp. That being said, longer fragment basically could help to catch larger SV events. The precision is important because smaller SD will allow detecting a wider range of SV events.
Alignment metrics
For the alignment metrics, we used to use samtools flagstat but with bwa mem since some reads get broken into pieces, the numbers are a bit confusing. You can try it if you want.
We prefer the GATK way of computing metrics
java -Xmx2G -jar ${GATK_JAR} CollectAlignmentSummaryMetrics \
-R ${REF}/genome/Homo_sapiens.GRCh37.fa \
-I alignment/NA12878/NA12878.sorted.dup.recal.bam \
-O alignment/NA12878/NA12878.sorted.dup.recal.metric.alignment.tsv \
--METRIC_ACCUMULATION_LEVEL LIBRARY
#### explore the results
less -S alignment/NA12878/NA12878.sorted.dup.recal.metric.alignment.tsv
3. What is the percent of aligned reads?
Solution (click here)
head -10 alignment/NA12878/NA12878.sorted.dup.recal.metric.alignment.tsv | tail -4 | cut -f7
The alignment rate is 99.7%.
Usually, we consider:
- A good alignment if > 90%
- Reference assembly issues if [75-90]%
- Probably a mismatch between sample and reference if < 75 %
Investigating the Trio (Optional)
At this point we have aligned and called variants in one individual. However, we actually have FASTQ and BAM files for three family members (mother and father)!
As additional practice, perform the same steps for the other two individuals (her parents): NA12891 and NA12892.
Quit working node Environment
exit
Summary
In this lab, we aligned reads from the sample NA12878 to the reference genome GRCh37:
- We became familiar with FASTQ and SAM/BAM formats.
- We checked read QC with BVAtools.
- We trimmed unreliable bases from the read ends using Trimmomatic.
- We aligned the reads to the reference using BWA.
- We sorted the alignments by chromosome position using GATK.
- We realigned short indels using GATK.
- We fixed mate issues using GATK.
- We recalibrate the Base Quality using GATK.
- We generate alignment metrics using GATK.