Module 4: Single Nucleotide Variant Calling

Introduction

The goal of this practical session is to identify single nucleotide variants (SNVs) in a human genome and to annotate them. In the previous module 3, we have aligned the reads from NA12878 (daughter) in a small region on chromosome 1. We will continue to use the data generated during the Module 3.

NA12878 is the child of the trio while NA12891 and NA12892 are her parents.

Pedigree

For practical reasons we subsampled the reads from the sample because running the whole dataset would take way too much time and resources. We’re going to focus on the reads extracted from a 300 kbp stretch of chromosome 1

Chromosome	Start	End
1	17704860	18004860

Original Setup

Software Requirements

These are all already installed, but here are the original links.

In this session, we will particularly focus on GATK HaplotypeCaller SNV detection tool. The main advantage of HaplotypeCaller is to do the calling using a local de-novo assembly approach. When the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allow a better accuracy in regions that are traditionally difficult to call, for example when they contain different types of variants close to each other.

Environment setup

mkdir -p $HOME/workspace/HTG/Module4/

docker run --privileged -v /tmp:/tmp --network host -it -w $PWD -v $HOME:$HOME \
--user $UID:$GROUPS -v /etc/group:/etc/group  -v /etc/passwd:/etc/passwd \
-v /etc/fonts/:/etc/fonts/ -v /media:/media c3genomics/genpipes:0.8

export WORK_DIR_M4=$HOME/workspace/HTG/Module4/
export REF=$MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh37/
mkdir -p ${WORK_DIR_M4}/bam/NA12878
cd $WORK_DIR_M4

ln -s ~/CourseData/HTG_data/Module4/* .
cp $HOME/workspace/HTG/Module3/alignment/NA12878/NA12878.sorted.ba* bam/NA12878
cp $HOME/workspace/HTG/Module3/alignment/NA12878/NA12878.sorted.dup.recal.ba* bam/NA12878


module load mugqic/java/openjdk-jdk1.8.0_72 mugqic/GenomeAnalysisTK/4.1.0.0 mugqic/snpEff/4.3

Data files

The initial structure of your folders should look like this:

ROOT
|-- bam/               # bam file from the previous Module(down sampled)
    `-- NA12878/             # Child sample directory
`-- scripts/                 # command lines scripts
`-- saved_results/           # precomputed final files

Cheat sheets

Input files

Let’s look into the NA12878 bam folders.

ls bam/NA12878/

Our starting data set consists of 100 bp paired-end Illumina reads from the child (NA12878) that have been aligned to GRCh37 during one of the previous modules (NA12878.sorted.bam). We also have the same data after duplicate removal, indel realignment and base recalibration (NA12878.sorted.dup.recal.bam).

1. Do you know what are the .bai files?

Calling Variants with GATK

If you recall from the previous module, we first mapped the reads to GRCh37 and then we removed duplicate reads and realigned the reads around the indels.

Let’s call SNPs in NA12878 using both the original and the improved bam files:

mkdir -p variants
#NA12878.sort
java -Xmx2g -jar $GATK_JAR HaplotypeCaller \
-R $REF/genome/Homo_sapiens.GRCh37.fa \
-I bam/NA12878/NA12878.sorted.bam \
-O variants/NA12878.hc.vcf \
-L 1:17704860-18004860

#NA12878.sort.rmdup.realign
java -Xmx2g -jar $GATK_JAR HaplotypeCaller \
-R $REF/genome/Homo_sapiens.GRCh37.fa \
-I bam/NA12878/NA12878.sorted.dup.recal.bam \
-O variants/NA12878.rmdup.realign.hc.vcf \
-L 1:17704860-18004860

-Xmx2g instructs java to allow up 2 GB of RAM to be used for GATK.

-R specifies which reference sequence to use.

-I specifies the input BAM files.

-L indicates the reference region where SNP calling should take place

Investigating the SNP Calls

Use less to take a look at the vcf files:

less -S variants/NA12878.rmdup.realign.hc.vcf

Vcf is a daunting format at first glance, but you can find some basic information about the format [here](https://www.internationalgenome.org/wiki/Analysis/vcf4.0.

Fields vary from caller to caller. Some values are more constant. The ref vs alt alleles, variant quality (QUAL column) and the per-sample genotype (GT) values are almost always there.

1. How do you figure out what the genotype is for each variant?

2. Do we have any annotation information yet?

3. How many SNPs were found?

4. Did we find the same number of variants using the files before and after duplicate removal and realignment?

The number of variant can be computed using this simple command:

grep -v "^#" variants/NA12878.hc.vcf | wc -l

We have 399 in the raw vcf, we previously saw 404 in the realigned vcf.

In that case the impact of bam improvement is very low because HaplotypeCaller perform internally a similar step to indel-realignment.

Looking for Differences Between the Two VCF Files

Use the following command to pull out differences between the two files:

diff <(grep -v "^#" variants/NA12878.hc.vcf | cut -f1-2 | sort) \
<(grep -v "^#" variants/NA12878.rmdup.realign.hc.vcf | cut -f1-2 | sort)

103a104
> 1      17744709
211a213
> 1      17811139
244a247
> 1      17830757
265,266c268,270
< 1      17845333
< 1      17845337
---
> 1      17845298
> 1      17845318
> 1      17845334
274a279
> 1      17850941
303a309
> 1      17875141
308d313
< 1      17879077

5. Is this result concordant to our previous comparison of files before and after duplicate removal and realignment?

Use IGV to investigate the SNPs

The best way to see and understand the differences between the two vcf files will be to look at them in IGV.

If you need, the IGV color codes can be found here: [IGV color code by insert size]99(https://igv.org/doc/desktop/) and IGV color code by pair orientation.

Option 1: You can view your files (bam and vcf files) in the IGV browser by using the URL for that file.

In a browser, like Firefox, type in the server name and all files will be shown there. Find your bam and your vcf files, right click it and ‘copy the link location’.

Next, open IGV and select b37 as the reference genome as you did in the visualization module.

In IGV, load both the original and the realigned bam files (NA12878.bwa.sort.bam and NA12878.bwa.sort.rmdup.realign.bam) using (File->Load from URL…).

After you have loaded the two bam files, load the two vcf files (NA12878.hc.vcf and NA12878.rmdup.realign.hc.vcf) in the same way.

Option 2: Alternatively, you can download all the NA12878.* files in the current directory to your local computer:

To do this you can use the procedure that was described previously. Open another terminal and use the scp command to copy your data into your local computer

Next, open IGV and select b37 as the reference genome as you did in the visualization module.

In IGV, load both the original and the realigned bam files (NA12878.bwa.sort.bam and NA12878.bwa.sort.rmdup.realign.bam) using (File->Load from file…).

After you have loaded the two bam files, load the two vcf files (NA12878.hc.vcf and NA12878.rmdup.realign.hc.vcf) in the same way.

Finally, go to a region on chromsome 1 with reads (1:17704860-18004860) and spend some time SNP gazing…

6. Do the SNPs look believable?

7. Are there any positions that you think should have been called as a SNP, but weren’t?

Looking for INDELs

INDELs can be found by looking for rows where the reference base column and the alternate base column are different lengths. It’s slightly more complicated than that since, you’ll also pick up the comma delimited alternate bases.

Here’s an awk expression that almost picks out the INDELs:

grep -v "^#" variants/NA12878.rmdup.realign.hc.vcf \
| awk '{ if(length($4) != length($5)) { print $0 } }' \
| less -S

You can find a slightly more advanced awk script that separates the SNPs from the INDELs here.

8. Did you find any INDELs?

The number of indel can be computed using this command:

grep -v "^#" variants/NA12878.rmdup.realign.hc.vcf | awk '{ if(length($4) != length($5)) { print $0 } }' | wc -l

We have 82 INDELs in the realigned vcf.

9. Can you find the largest INDEL?

Largest indels size can be computed by extending the previous command while printing the size of the indel:

grep -v "^#" variants/NA12878.rmdup.realign.hc.vcf | \
awk '{ if(length($4) != length($5)) { print sqrt((length($4) - length($5))^2) "\t"$0 } }' | \
sort -k1,1nr | head

The indel sizes are computed as the absolute value of the difference in charater size between the 2 alleles.

The largest ones are 15bp long chr1 17798774 and chr1 17817280.

Filter the Variants

Typically variant callers will only perform a minimal amount of filtering when presenting variant calls.

To perform more rigorous filtering, another program must be used. In our case, we will use the VariantFiltration tool in GATK.

Note

The best practice when using GATK is to use the VariantRecalibrator. In our data set, we had too few variants to accurately use the variant recalibrator and therefore we used the VariantFiltration tool instead.

java -Xmx2g -jar $GATK_JAR VariantFiltration \
-R $REF/genome/Homo_sapiens.GRCh37.fa \
-V variants/NA12878.rmdup.realign.hc.vcf \
-O variants/NA12878.rmdup.realign.hc.filter.vcf \
-filter "QD < 2.0" \
-filter "FS > 200.0" \
-filter "MQ < 40.0" \
--filter-name QDFilter \
--filter-name FSFilter \
--filter-name MQFilter

-Xmx2g instructs java to allow up 2 GB of RAM to be used for GATK.

-R specifies which reference sequence to use.

-V specifies the input vcf file.

-O specifies the output vcf file.

-filter defines an expression using the vcf INFO and genotype variables.

--filter-name defines what the filter field should display if that filter is true.

1. What is QD, FS, and MQ?

Adding Functional Consequence

The next step in trying to make sense of the variant calls is to assign functional consequence to each variant.

At the most basic level, this involves using gene annotations to determine if variants are sense, missense, or nonsense.

We typically use SnpEff but many use Annovar and VEP as well.

Let’s run snpEff

java -Xmx4G -jar $SNPEFF_HOME/snpEff.jar eff \
-v -no-intergenic \
-i vcf -o vcf GRCh37.75 variants/NA12878.rmdup.realign.hc.filter.vcf >  variants/NA12878.rmdup.realign.hc.filter.snpeff.vcf

-Xmx2g instructs java to allow up 4 GB of RAM to be used for snpEff.

-v specifies verbose output.

-no-intergenic specifies that we want to skip functional consequence testing in intergenic regions.

-i and -o specify the input and output file format respectively. In this case, we specify vcf for both.

GRCh37.75 specifies that we want to use the GRCh37.75 annotation database.

variants/NA12878.rmdup.realign.hc.filter.vcf specifies our input vcf filename

variants/NA12878.rmdup.realign.hc.filter.snpeff.vcf specifies our output vcf filename

Investigating the Functional Consequence of Variants

You can learn more about the meaning of snpEff annotations here.

Use less to look at the new vcf file:

less -S variants/NA12878.rmdup.realign.hc.filter.snpeff.vcf

We can see in the vcf that snpEff added few sections. These are hard to decipher directly from the VCF other tools or scripts, need to be used to make sens of this.

The annotation is presented in the INFO field using the new ANN format. For more information on this field see here. Typically, we have:

Here’s an example of a typical annotation:

ANN=C|intron_variant|MODIFIER|PADI6|PADI6|transcript|NM_207421.4|Coding|5/16|c.553+80T>C||||||

1. What does the example annotation actually mean?

Next, you should view or download the report generated by snpEff.

Use the procedure described previously to retrieve:

snpEff_summary.html

Next, open the file in any web browser.

Finding Impactful Variants

One nice feature in snpEff is that it tries to assess the impact of each variant. You can read more about the effect categories here.

2. How many variants had a high impact?

3. How many variants had a moderate impact?

4. What effect categories were represented in these variants?

5. Open that position in IGV, what do you see?

Adding dbSNP Annotations

Go back to looking at your last vcf file:

less -S variants/NA12878.rmdup.realign.hc.filter.snpeff.vcf

1. What do you see in the third column?

The third column in the vcf file is reserved for identifiers. Perhaps the most common identifier is the dbSNP rsID.

Use the following command to generate dbSNP rsIDs for our vcf file:

#switch to old GATK 3.8
module unload  mugqic/GenomeAnalysisTK/4.1.0.0
module load mugqic/GenomeAnalysisTK/3.8

java -Xmx2g -jar $GATK_JAR -T VariantAnnotator \
-R $REF/genome/Homo_sapiens.GRCh37.fa \
--dbsnp $REF/annotations/Homo_sapiens.GRCh37.dbSNP150.vcf.gz \
-V variants/NA12878.rmdup.realign.hc.filter.snpeff.vcf \
-o variants/NA12878.rmdup.realign.hc.filter.snpeff.dbsnp.vcf \
-L 1:17704860-18004860

#return to GATK 4
module unload mugqic/GenomeAnalysisTK/3.8
module load  mugqic/GenomeAnalysisTK/4.1.0.0

-Xmx2g instructs java to allow up 2 GB of RAM to be used for GATK.

-R specifies which reference sequence to use.

--dbsnp specifies the input dbSNP vcf file. This is used as the source for the annotations.

-V specifies the input vcf file.

-o specifies the output vcf file.

-L defines which regions we should annotate. In this case, I chose the chromosomes that contain the regions we are investigating.

2. What percentage of the variants that passed all filters were also in dbSNP?

The precentage could be estimate directly by this command:

awk ' BEGIN {FS="\t" ; p=0; d=0 ; OFS="\t"} \
{if ($7 == "PASS") {p++; if ($3 != ".") {d++}}} \
END {print "PASS:", p , "dbSNP:" , d , "%:", d/p} ' variants/NA12878.rmdup.realign.hc.filter.snpeff.dbsnp.vcf

PASS:      404      dbSNP:      397      %:      0.982673

For those unfamiliar with awk commands, they can use these commands and then compute manually the percentage.

grep -v ^# variants/NA12878.rmdup.realign.hc.filter.snpeff.dbsnp.vcf | grep PASS | grep -c rs
grep -v ^# variants/NA12878.rmdup.realign.hc.filter.snpeff.dbsnp.vcf | grep -c PASS

397
404

3. Can you find a variant that passed and wasn’t in dbSNP?

Investigating the Trio (Optional)

At this point we have aligned and called variants in one individual. However, we actually have FASTQ and BAM files for three family members!

As additional practice, perform the same steps for the other two individuals (her parents): NA12891 and NA12892. Here are some additional things that you might want to look at:

1. If you load up all three realigned BAM files and all three final vcf files into IGV, do the variants look plausible? Use a Punnett square to help evaluate this. i.e. if both parents have a homozygous reference call and the child has a homozygous variant call at that locus, this might indicate a trio conflict.

2. Do you find any additional high or moderate impact variants in either of the parents?

We can extract the HIGH impact variant in all sample using this command:

grep HIGH variants/NA128*.rmdup.realign.hc.filter.snpeff.vcf

We don’t find any high impact variant in the trio.

3. Do all three family members have the same genotype for Rs7538876 and Rs2254135?

For Rs7538876:

grep rs7538876 variants/NA128*.rmdup.realign.hc.filter.snpeff.dbsnp.vcf

One parent is Homozygote variant the second one is Heterozygote and the child is Homozygote variant.

For Rs2254135:

grep rs2254135 variants/NA128*.rmdup.realign.hc.filter.snpeff.dbsnp.vcf

The two parents are Heterozygotes and the child is Homozygote variant.

GATK produces even better variant calling results if all three BAM files are specified at the same time (i.e. specifying multiple -I filename options). Try this and then perform the rest of module 4 on the trio vcf file. Does this seem to improve your variant calling results? Does it seem to reduce the trio conflict rate?

Quit the Container Environment

exit

Acknowledgements

This module is heavily based on a previous module prepared by Michael Stromberg and Guillaume Bourque.

Lab Complete!

You’re done Module 4! We hope that you enjoyed the lab and that you continue to enjoy Single Nucleotide Variant Calling.

Lecture

Lab

Introduction

Original Setup

Software Requirements

Environment setup

Data files

Cheat sheets

Input files

Calling Variants with GATK

Investigating the SNP Calls

Looking for Differences Between the Two VCF Files

Use IGV to investigate the SNPs

Looking for INDELs

Filter the Variants

Adding Functional Consequence

Investigating the Functional Consequence of Variants

Finding Impactful Variants

Adding dbSNP Annotations

Investigating the Trio (Optional)

Acknowledgements