Module 2
Lab (Part 1)
Docker
Docker is what is known as a container service. Docker containers are like miniature versions of virtual machines, containing a base operating system and other code. Most often they exist to run specific programs in a controlled and self-contained environment, allowing you to run programs with complex software dependencies without needing to install those dependencies yourself.
Many of the software tools you will use in this lab are versions distributed as docker containers. Docker itself could be the subject of an entire workshop, but getting started with it can be easy. For our purposes we will only need to be able to run a program via Docker and understand how Docker containers interact with the file system of the AWS virtual machine we are using.
First, verify docker is installed:
which docker
Question: What is the path for the docker executable installed on your system?
Docker containers come in the form of an image, and these can be pulled (like code can be pulled with git) from remote repositories. A number of docker images have already been pulled onto this AWS instance. You can see what images are already available on a system with the following command:
docker image list
There should be three images on the system already. These images are listed as repository/image name. Repositories can have a directory structure: everything to the left of the final / is the repository and directory path, while the string to the right of the final / is the image name. Images are also tagged with names that may provide version numbers or the names of what is known as a release. The most recent release of an image in a repository is also given the tag latest.
Question: What are the names of the 3 docker images on the system? What repository did they come from? Can you use this information to find where on the internet the repository is for the two images that come from the same repository?
Docker images/containers are run on a system using the docker run command, which takes the following structure (many of these are optional arguments):
docker run [docker options] [repository]/[image name]:[tag] [command] [command options]
As an example, you can run the bwa command in its associated container with the following:
docker run pegi3s/bwa bwa
BWA is an example of a program that, when run without any options, prints a help message and exits, giving you information on how to run it. These messages often provide other information as well, such as the version of the software being run.
Question: What is the version of BWA being run with this docker container?
One important aspect of Docker containers is that because they are self-contained pieces of code, they have their own filesystem “inside” the container. This exists in memory as the container is run, and is gone when the container stops running. Without some help a docker container doesn’t “see” your filesystem and can’t read or write files on that filesystem, it can only interact with files in its own system.
In order for a docker container to interact with your filesystem, you need to mount parts of that filesystem inside the docker container as volumes. This is handled as an option to docker run with the -v command-line flag. This takes the following structure:
docker run -v [host path]:[container path] [other docker options] [repository]/[image name]:[tag] [command] [command options]
You can specify -v [host path]:[container path] multiple times to mount multiple different directories inside your running container. It is also possible to run a container in interactive mode with the -it command-line flag so you can navigate and work "inside" the container. This can be useful for understanding all of the software that may exist within the container as well as its internal filesystem structure. In this tutorial you will be given appropriate examples of volume mounting in order to mount /CourseData and /workspace.
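For example, a minimal sketch of interactive mode (assuming the pegi3s/bwa image includes a shell, as Ubuntu-based images generally do):

# Open an interactive shell inside the bwa container, with CourseData mounted at /data
docker run --rm -it -v /home/ubuntu/CourseData:/data pegi3s/bwa /bin/bash
# Type 'exit' (or press Ctrl-D) to leave; the container stops when the shell exits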
Investigating Some Genomics File Formats and Metadata
FastQ Files
For most Next-Generation sequencing instruments the primary output is in the form of FastQ files, which are most often encountered in their compressed form: .fastq.gz (remember that Linux doesn’t actually care about filename extensions and several alternatives such as .fq.gz are in common use).
You can examine FastQ files, gzipped or not, from the command-line of a Linux or MacOS system with standard tools. One commonly used tool for viewing plain-text files from the command-line in chunks, instead of loading the whole thing into memory (which would be a bad idea with most FastQ files as they are very large), is the less command. When dealing with compressed files, a version of the tool called zless is used instead. There are several small FastQ files we will use in this lab in /home/ubuntu/CourseData/Module2/FastQs/.
Note: these ‘z’ forms exist for many common command-line tools:
grep/zgrep, less/zless, and cat/zcat, among others.
Note: If you want to hear a story of deeply nerdy humour, and why the tool less is called less, feel free to ask.
Take a look at the files (one at a time) normal_test_1.fastq.gz and normal_test_2.fastq.gz in that directory using the zless tool. To page through the file, hit the spacebar.
zless normal_test_1.fastq.gz
zless normal_test_2.fastq.gz
Questions: How many base pairs are the sequencing reads? What is the instrument ID of the sequencer these sequencing reads were generated on? What sequencing lane were the reads from in the normal and the tumour? What were the flowcell IDs?
Because of the structure of a FastQ file, you can use the number of lines in a file to determine the number of sequencing reads: zcat will output all of the lines to standard out, we can pipe that to wc -l to count the number of lines, and dividing by 4 gives the number of reads. This can be especially important for checking that the Read 1 and Read 2 files contain the same number of reads. If they don't, something may be wrong with your data, or it has come from a source where you have unpaired (singleton) reads.
zcat normal_test_1.fastq.gz | wc -l
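You can also have the shell do the division for you; a small sketch using arithmetic expansion:

# Count the lines, then divide by 4 to get the number of reads
echo $(( $(zcat normal_test_1.fastq.gz | wc -l) / 4 ))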
Questions: How many reads are in the files? Are they all the same size?
Because FastQ ID lines start with the @ symbol, you might think you could use the grep matching command to count the number of reads without needing to do any math. And on this particular data, it may actually work. Test it yourself:
zgrep "@" normal_test_1.fastq.gz | wc -l
Question: Why would this be a bad idea in practice?
BED Files
BED (Browser Extensible Data) files are used most often to represent genomic features. Many programs can use BED files to draw representations, like lines, on top of some sort of genomic data display, for instance the UCSC Genome Browser, or IGV, which you will use later on in this workshop. An example BED file can be found in /home/ubuntu/CourseData/Module2/accessory_files. A common use case for BED files is to give genomic ranges.
The BED format didn’t have a formal specification until 2021! The original UCSC description of the file format is here. The formal 1.0 specification from the Global Alliance for Genomics and Health is here.
You can print the full contents of this file to standard output:
cat gene_target_regions.bed
This is a minimal BED file with only the 3 required columns.
Question: What does each column represent?
Note: BED files use a 0-based coordinate system, such that the first base pair of a contig (typically a chromosome in our work) is at position 0 instead of 1.
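For example, a feature covering the first 100 bases of chromosome 1 would be written with a start of 0 and an end of 100 (the end coordinate is exclusive):

chr1	0	100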
BED files are structured into column-based data, where each column should be separated by a tab character. Tabs and other whitespace (like simple spaces) can't be distinguished when we output a file like this, but mixing up tabs and spaces is a common source of files not behaving, or not behaving correctly, in genomics tools. Unix lets us see these special characters, like tabs and newline characters.
cat -A gene_target_regions.bed
Question: What combination of characters are used to represent tabs and newlines?
Make a copy of this file (cp gene_target_regions.bed test.bed) and then edit the copy you made using the command-line program nano: nano test.bed. This opens a simple text editor that you can navigate with your keyboard arrows but not with your mouse. Delete a tab and replace it with several spaces instead, then repeat the cat -A command on your edited test.bed file. Notice the difference between tabs and spaces when you view the file this way.
BAM Files
BAM files (Binary SAM) are another file format for containing information about sequencing reads. While we most often see these storing alignment (short-read mapping) data, they can actually contain unaligned sequencing reads as well. In fact, newer versions of the GATK pipeline we are dealing with here often start by converting FastQ files to unaligned BAMs (we aren't going to do that here). BAM files are often (but not always) accompanied by an index file (*.bai). Because BAMs are compressed and generally sorted in some fashion (usually by genomic coordinate but sometimes by read name), index files allow other tools to move through and read the file efficiently, for instance by jumping directly to the part of the file where a genomic coordinate starts instead of reading it line by line.
The SAM/BAM specification gives you the current format specifications of SAM and BAM files. Remember, a BAM is a compressed binary version of the SAM specification, which is encoded in plain text, although there are some differences, for instance SAM files use a 1-based coordinate system while BAMs are 0-based.
Here we will use the tool samtools to explore BAM files. We have some example BAM files in /home/ubuntu/CourseData/Module2/BAMs.
All of the metadata associated with a file is found in its header. SAM/BAM files contain a number of different 'tags' like @RG and @SQ in the header. Information on what these tags are is found in section 1.3 of the linked specification PDF (page 3). View the header of your provided BAM file and reference the specification to interpret what each of these tagged sections is referencing. To view the header of the Normal sample in this directory:
samtools head C-GIAB.normal.regions_of_interest.bam
You’ll notice there is a lot of information here. You can page through all of the info by piping the output of this command to less, again using the spacebar to page through the file:
samtools head C-GIAB.normal.regions_of_interest.bam | less
You’ll see the sort order of a BAM file is specified with the SO tag.
Question: What are the sort orders of the sample BAM files?
After the @SQ tags, which give you the sequences the reads were aligned against, @PG tags give you the different programs and command-lines that have been used on the data, appearing in the order they were used. In this case there have been some very lengthy command-lines with lots of options used on these files! Luckily the program is listed first, with the ID field giving its name, followed by all of the options.
Question: What programs were used on this file?
We can use samtools to do lots of things. Most commonly we use it to sort files into different sort orders and to convert from SAM to BAM or vice versa, and we can even use it to view alignments.
To take a BAM file and “undo” the alignment process to create paired FastQ files, we will first need to change the sort order so that it is sorted by read name instead of coordinate sort order.
samtools sort -n -@ 4 -O BAM -o [output_file_name.bam] [input.bam]
Remember to specify paths if necessary for input and output files, and make sure the output filename doesn't already exist. Specifying samtools sort -h will give you the list of options you can use with samtools sort.
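For example, working in the ~/CourseData/Module2/BAMs directory on the normal sample BAM (the output name here is just a suggestion):

samtools sort -n -@ 4 -O BAM -o C-GIAB.normal.name_sorted.bam C-GIAB.normal.regions_of_interest.bam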
Question: What do the -n, -@ and -O command-line flags do?
To output the read information back into FastQ format you use the samtools fastq command:
samtools fastq -1 [output_file_name_1.fastq.gz] -2 [output_file_name_2.fastq.gz] -@ 2 [input_file.bam]
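Continuing the example above (output names are again suggestions; the .gz extensions tell samtools to compress the output):

samtools fastq -1 C-GIAB.normal_1.fastq.gz -2 C-GIAB.normal_2.fastq.gz -@ 2 C-GIAB.normal.name_sorted.bam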
Question: If you use zless to look at these FastQ files, what do you notice compared to the ones you looked at previously?
VCF Files
Variant Call Format (VCF) files are used to store information about genomic variants, a more generic term than mutation. The current specification for VCF format is here.
Note: VCF files use a 1-based coordinate system. Keep this in mind when converting back and forth between BAMs, BEDs, and VCFs! Conversion tools will handle this for you, but it often trips people up when, for instance, looking up mutations from VCF files in BAM files, which you will be doing later!
Like FastQs, VCFs may be encountered in either their compressed (*.gz) or uncompressed form. VCFs are typically compressed with bgzip, which produces output that standard gzip tools can read and uses the same file extension. BGzip is basically a special version of gzip that was created specifically for genomics file formats like the VCF, with a block structure that allows indexing. Like BAMs, compressed VCFs should be indexed (here the command/tool for indexing a compressed VCF is tabix).
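For reference, compressing and indexing an uncompressed VCF looks like this (a sketch using a hypothetical file name):

# Compress with bgzip (replaces my_variants.vcf with my_variants.vcf.gz)
bgzip my_variants.vcf
# Create a tabix index (produces my_variants.vcf.gz.tbi)
tabix -p vcf my_variants.vcf.gz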
Note: If you just google the term VCF you may encounter an entirely different file format which was created for cell phones a long time ago. Sometimes you have to specify genomics or variant or similar in a google search term!
Uncompressed VCFs can be examined using the less command, while gzip/bgzip-compressed VCFs can be examined with zless. As with other genomic file formats, the metadata of a VCF file is contained in the header. The header of a VCF file consists of all of the lines that start with one or more '#' characters; the last of these gives the column headers for the variant data that follows. The variant data section of a VCF, like a BED file, is column-based, with each column separated by a tab character.
The header section of a VCF file contains information about the various tags and fields in the VCF file. In addition, when a VCF file is annotated, all of the annotations that are added to the INFO column of the data have their various tags explained in the header. The various values that can be found in the FILTER column are also described there.
A couple of example VCF files can be found in ~/CourseData/Module2/VCFs. One of these is an extremely minimal test VCF while the other is the output of MuTect2, a somatic variant caller you will use later in this workshop. The general structure of the VCF file is more apparent if you investigate the test_panel.GRCh38.minimal.vcf file first, before looking at the mutect2_tumour_normal.filtered.regions_of_interest.vcf.gz file.
Question: What file format specification was used for each file? Is it the same or different?
Question: In the MuTect2-generated VCF, how many different values can the FILTER field take?
Question: What is the ID in the FORMAT field for the approximate read depth at a site/variant?
Question: What are the genomic coordinates of the first variant listed in the file? What are the Reference and Alt alleles? What entries are given in the filter field?
Lab (Part 2) - Implementing the GATK Best Practices Workflow
In this portion of the lab we will be implementing a basic approximation of the Broad Institute's [Best Practices Workflows](https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows), which have been a standard pipeline for the analysis of human-derived next-generation sequencing data, particularly whole genomes, for over a decade. There are in fact a number of best-practices workflows. Here we will focus on the Data pre-processing for variant discovery workflow, which takes data from FastQs or unaligned BAMs to aligned BAMs that are ready for variant calling. This workflow is generally applicable and can feed into germline, somatic, or mitochondrial short variant calling, copy-number analysis, etc. Notably it is NOT appropriate for handling RNA-derived data such as a transcriptome.
For the GATK portion of the lab we will be using the set of small FastQs found in the /home/ubuntu/CourseData/Module2/FastQs directory. These files only have 100,000 reads in them, so they will run through each step of the process very fast. That isn't enough reads, when drawn from the whole genome, to do anything particularly useful, but some other example data is included in the ~/CourseData folder that can be used for visualization of BAMs with IGV or for running variant calling like MuTect2.
FastQC Quality Control
Prior to doing any actual analysis of data we receive from a sequencing experiment, we first have to perform some basic quality control to make sure the data is actually good. FastQC is a program that has been around "forever" in the NGS space for doing just that, and it is one of the programs we will be using via a docker container. We also want to make sure that our output goes into the ~/workspace directory of our AWS image, as this directory is viewable in a web browser, which will let us view compatible output files as well as download data to our local machine.
To run FastQC on the FastQ file with Read 1 data from the normal sample in the FastQs directory, and output to a FastQC folder in the ~/workspace directory, we will mount ~/CourseData to the /data directory of the container and ~/workspace to /workspace. We also need to create the output directory first. In the ~/workspace directory, create a folder called FastQC: mkdir FastQC.
We can then run FastQC with docker. Note that this particular docker container automatically runs the fastqc command when the docker container is invoked, so it is not specified here in the command-line.
docker run --rm -v /home/ubuntu/CourseData:/data -v /home/ubuntu/workspace:/workspace pegi3s/fastqc -o /workspace/FastQC/ /data/Module2/FastQs/normal_test_1.fastq.gz
FastQC can even be run on more than one FastQ file at a time. Take a look at the command line options for FastQC by running docker run --rm pegi3s/fastqc -h and then modify the above command to run on all of the FastQ files in the FastQ directory.
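One possible version, once you have looked at the options (a sketch; FastQC simply accepts multiple input files listed one after another):

docker run --rm -v /home/ubuntu/CourseData:/data -v /home/ubuntu/workspace:/workspace pegi3s/fastqc -o /workspace/FastQC/ /data/Module2/FastQs/normal_test_1.fastq.gz /data/Module2/FastQs/normal_test_2.fastq.gz /data/Module2/FastQs/tumour_test_1.fastq.gz /data/Module2/FastQs/tumour_test_2.fastq.gz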
You can view the contents of your ~/workspace directory by going to: http://NUM.uhn-hpc.ca/ where NUM is the number for your particular AWS instance. You can click on any of the HTML files in the FastQC directory to open the FastQC reports directly.
Question: How many reads are in the normal_test_1.fastq.gz file according to its FastQC output? How many total bases?
Question: What is the GC % of the data?
Question: What is the average quality per read?
Alignment/Short-Read Mapping with BWA-MEM
In this and following steps, the command-lines provided for you in code blocks:
Like This
May not be complete. They will serve as guidance for building the complete and proper command-line yourself.
The first step of the GATK workflow is to use a short-read mapping algorithm to align our FastQs to a human reference genome (GRCh38). We will use the program BWA for this, again running BWA-MEM from a docker container. As general input, BWA-MEM takes 1 or more FastQ files and a reference genome file, and then outputs the results to standard out in the uncompressed SAM format; we will redirect that output to a file. In this case, and for many of our intermediate processing steps, we will not output our results to the ~/workspace directory, but instead to locations in the ~/CourseData directory. Since the outputs of one program will be the inputs of the next program in our chain, this is generally more convenient. We will do all of our intermediate processing in the ~/CourseData/Module2/Processing directory. We also have a tumour and a normal sample in our FastQs; each step, until we get to variant calling, must be run on each sample individually. For convenience, the command-lines provided below will assume you are running on the normal tissue sample. Simply repeat, changing file names as appropriate, for the tumour.
We can view all of the command-line options for BWA-MEM by running the following command, which will also serve as the base for all of our BWA-MEM commands.
docker run --rm -v "/home/ubuntu/CourseData:/data" pegi3s/bwa bwa mem
We see the basic usage is: bwa mem [options] <idxbase> <in1.fq> [in2.fq]. In our case we have two FastQ files, which are gzipped (BWA-MEM reads gzipped FastQs just fine). Our GRCh38 human reference is located at: ~/CourseData/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta.
Question: What other files are located in the ~/CourseData/tools/reference/GATK-bundle-GRCh38/ directory? Hint: The various iterations of Homo_sapiens_assembly38.fasta with extra extensions are index files used by various programs to efficiently use the FASTA reference file.
So the command-line structure for BWA-MEM (without the docker bits added on) looks like this to align two gzipped FastQ files to a reference FASTA file and output to a SAM file:
bwa mem /path/to/reference.fasta /path/to/sample_1.fastq.gz /path/to/sample_2.fastq.gz -o /path/to/output.sam
You have test FastQ files for the normal sample:
/home/ubuntu/CourseData/Module2/FastQs/normal_test_1.fastq.gz
/home/ubuntu/CourseData/Module2/FastQs/normal_test_2.fastq.gz
and the Tumour sample:
/home/ubuntu/CourseData/Module2/FastQs/tumour_test_1.fastq.gz
/home/ubuntu/CourseData/Module2/FastQs/tumour_test_2.fastq.gz
Create the command-line, using the docker elements and the correct paths inside the container, to align the two normal FastQ files to the provided human reference and output to a SAM file in the ~/CourseData/Module2/Processing directory.
Some hints: if you use the same volume mounting paths and names as I did above with this docker command, docker run --rm -v "/home/ubuntu/CourseData:/data" pegi3s/bwa bwa mem, then the normal FastQ files would have the following paths in the Docker container:
/data/Module2/FastQs/normal_test_1.fastq.gz
/data/Module2/FastQs/normal_test_2.fastq.gz
and the processing directory path would be /data/Module2/Processing.
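Putting the hints together, one possible complete command for the normal sample looks like this (a sketch; double-check the paths against your own setup before running):

docker run --rm -v "/home/ubuntu/CourseData:/data" pegi3s/bwa bwa mem /data/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta /data/Module2/FastQs/normal_test_1.fastq.gz /data/Module2/FastQs/normal_test_2.fastq.gz -o /data/Module2/Processing/normal.sam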
Use Samtools to sort the results and convert a SAM to a BAM
For further processing with the GATK we need to sort the SAM so that it is coordinate-sorted rather than in read order, and convert it to a BAM file. We can do this from inside our ~/CourseData/Module2/Processing directory, and we can do it in a single step:
samtools sort -@ 4 -O bam -o normal.bam normal.sam
Replace normal.bam and normal.sam with whatever you called your SAM file and whatever you want to call the BAM version of that file. Typically we would keep the file names the same and just change the extension. Repeat for the tumour SAM.
Question: What command-line flag could you give to the samtools sort command above to have it automatically create an index file for the output BAM?
Question: If you wanted to sort the SAM but output to a SAM file instead, how would you modify the command? What about if you wanted to create a BAM file but change its compression level to 0, making a binary but uncompressed file?
GATK Pre-Processing
Aligned BAM files are not yet ready for variant calling, and some intermediate pre-processing needs to be done.
Add Read Group Data
First, we want to add Read Groups to our data, which give a sample name and information like library and lane IDs, as well as other metadata like the kind of sequencing that was used. Here we will use the GATK software package itself, which we run from another docker container. Again, the example below will just be for the normal sample; you'll need to repeat it for the tumour. I will assume that your BAM file is just called normal.bam, so you'll need to change that and other possible parameters to match your own preferences. You can call files pretty much anything you want, after all, but I will use some naming conventions here that I find helpful for tracking where in the workflow a file was produced.
You can read more about Read Groups here, as well as in the SAM specification you were provided earlier.
The Basic Docker part of your command-line for the GATK should be:
docker run -v /home/ubuntu/CourseData:/gatk/CourseData --rm public.ecr.aws/aws-genomics/broadinstitute/gatk:4.2.6.1-corretto-11 gatk
I will assume this portion in the command-lines below and replace it with [docker gatk]. Remember that the [] here is not meant to be typed literally; it indicates a variable that you replace with the appropriate value, like the docker-associated command-line above.
[docker gatk] AddOrReplaceReadGroups -I /gatk/CourseData/Module2/Processing/normal.bam -O /gatk/CourseData/Module2/Processing/normal.rg.bam --RGID normal_lane1 --RGLB normal_lane1 --RGSM normal --RGPL ILLUMINA --RGPU Illumina
You can put whatever you want in the various RG tags you are adding. Here I am giving the Read Group ID (RGID) and library (RGLB) the same value, and specifying the data came from Lane 1 of the flowcell. You can alter this to whatever the lane was that the normal and tumour came from in their respective files, for accuracy. I have also given it the sample name 'normal'; for the tumour you would want to use --RGSM tumour. This will be important for variant calling later on!
Modify this command-line however you need to, and also modify and run again for the tumour sample.
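For the tumour, a sketch might look like this (the lane number in the RG values here is an assumption; substitute the actual lane you found in the tumour FastQ headers):

[docker gatk] AddOrReplaceReadGroups -I /gatk/CourseData/Module2/Processing/tumour.bam -O /gatk/CourseData/Module2/Processing/tumour.rg.bam --RGID tumour_lane2 --RGLB tumour_lane2 --RGSM tumour --RGPL ILLUMINA --RGPU Illumina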
MarkDuplicates
PCR duplication rates vary, but even at their lowest are around 1-2%, so even in our 100,000 test reads we probably have a few. Either way, for whole genome sequencing workflows this is a crucial data quality control step.
Note: While we refer to this step colloquially as “Duplicate Removal” we are actually only marking them in this step. The duplicate reads are still in the BAM file, but a flag is set that marks them as being a PCR duplicate. Therefore this is what we call a lossless step.
The preferred method of doing this is with the MarkDuplicatesSpark tool; however, because of the Java version in use in our container we need to use the non-Spark version for this test setup. The commands to invoke them are nearly identical. MarkDuplicatesSpark scales better on real data as it can use multiple CPU threads, but for our purposes the non-Spark version works just as well.
[docker gatk] MarkDuplicates -I /gatk/CourseData/Module2/Processing/normal.rg.bam -O /gatk/CourseData/Module2/Processing/normal.dup_marked.bam -M /gatk/CourseData/Module2/Processing/normal.duplicate_metrics.txt
Question: What information is found in the duplicate_metrics.txt file? How many duplicates were marked by this process?
Recalibrate Base Qualities
The base quality scores that come from Illumina sequencing data are good, but the GATK recalibrates them using its own error models together with the data in your alignment. This step works best with whole genome sequencing. It runs in two parts: first we do a calculation that creates a recalibration table, and then we apply it.
The recalibration here uses some additional reference files: VCF files that contain variant information from dbSNP and gnomAD, two databases of population allele frequencies. These give us the locations of known polymorphisms in the human genome, which is an input to the model.
[docker gatk] BaseRecalibrator -I /gatk/CourseData/Module2/Processing/normal.dup_marked.bam -R /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta --known-sites /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf --known-sites /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/af-only-gnomad.hg38.vcf.gz -O /gatk/CourseData/Module2/Processing/normal.bqsr_recal.table
Now apply it:
[docker gatk] ApplyBQSR -R /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta -I /gatk/CourseData/Module2/Processing/normal.dup_marked.bam --bqsr-recal-file /gatk/CourseData/Module2/Processing/normal.bqsr_recal.table -O /gatk/CourseData/Module2/BAMs/normal.recalibrated.bam
At this point we have a complete and pre-processed BAM file ready for variant calling!
Question: We changed the output here for the final recalibrated BAM file as this is a ‘final’ output. Why is this good practice?
Alternatives to Try
Some tools can be run in different modes or with different options, and are worth exploring. For instance, it is possible to add Read Group information while running bwa mem with the -R flag followed by a string containing the information:
-R '@RG\tID:foo\tSM:bar'
Try taking the RG tags you added with the GATK tool AddOrReplaceReadGroups and use this option with BWA-MEM to add them in one go. You can then use samtools head to view the headers of your files and compare the @RG sections. Try taking this SAM, converting it into a BAM, and then proceeding with the downstream GATK tools. You can skip the AddOrReplaceReadGroups step, adjusting your BAM file names accordingly.
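For reference, the normal-sample version of that string, mirroring the tags from the AddOrReplaceReadGroups example above, might look like this (one possible answer; adjust the values to your own data):

-R '@RG\tID:normal_lane1\tLB:normal_lane1\tSM:normal\tPL:ILLUMINA\tPU:Illumina'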
If your version of this workflow includes a SetNmMdAndUqTags step, you can also try skipping it.
Question: What does your command line look like?
Putting Everything Together
Running commands one-by-one on the command-line is the best way to learn the ins and outs of a process and how to use specific tools. However, if we are going to do this more than once we want our work to be reproducible, which means putting things together as a script that we can simply configure and run whenever we need it. When you hear a bioinformatician refer to a pipeline that is all they really mean. A bit of code infrastructure that automates a process, taking inputs through a series of steps to get a final output.
In the pre-work for this course you learned about simple shell scripting. While there are more complex ways of implementing a pipeline, these are usually only needed when you are going to run lots of data and want robust error handling and other features. If you are only going to occasionally run a task, are working primarily on a server like this one, and don’t need to handle a lot of complex data then a simple shell script will suffice.
Take the commands you have executed so far and put them into a basic shell script. To make it more useful, so that you don't have to edit all your commands to change file names every time you want to run it, look for all of the places where you can substitute values for variable names. That way you can simply change the variable values as appropriate.
The Module2 directory structure is a good one for this kind of work. It organizes inputs (FastQs and accessory files), gives locations for final outputs (BAMs and VCFs), and has a central Processing directory whose contents can be safely deleted once the final outputs are created.
The best targets for variables are:
- Paths to specific directories, especially a 'base' working directory of some kind, for instance /home/ubuntu/CourseData/Module2 here
- Sample names or IDs that can be substituted into file names
- Paths to specific files that get used a lot, like the GRCh38 reference sequence
- Long strings of supporting code that get used again and again, like some of the docker wrapping code used here for many programs
Some other pointers for good script writing (a minimal skeleton is sketched after this list):
- Scripts execute the commands inside them 'silently', so it is good practice to use echo in your scripts to print each command to standard output before executing it
- Outputting some basic information at the top of a script, like printing out what sample name or IDs are being used, references, binaries, etc can also be useful
- Repeat exact code as little as possible to minimize the number of places you need to make changes for fixing bugs or altering code
- Define as many variables as possible at the top of the code
- Try and keep all the places you need to make edits for having the script run on a different sample be at the top
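A minimal skeleton following these pointers, assuming the directory layout and docker commands used in this lab (file names and variable choices are just suggestions):

#!/bin/bash
# --- Everything you should need to edit lives up here ---
SAMPLE=normal
BASE_DIR=/home/ubuntu/CourseData/Module2
REF=/data/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta  # path as seen inside the container
DOCKER_BWA="docker run --rm -v /home/ubuntu/CourseData:/data pegi3s/bwa"

# Report what we are about to do
echo "Processing sample: ${SAMPLE}"
echo "Reference: ${REF}"

# Align: echo the command first, then run it
CMD="${DOCKER_BWA} bwa mem ${REF} /data/Module2/FastQs/${SAMPLE}_test_1.fastq.gz /data/Module2/FastQs/${SAMPLE}_test_2.fastq.gz -o /data/Module2/Processing/${SAMPLE}.sam"
echo "${CMD}"
${CMD}

# Sort and convert to BAM (runs on the host, so use host paths)
echo "Sorting ${SAMPLE}.sam"
samtools sort -@ 4 -O bam -o ${BASE_DIR}/Processing/${SAMPLE}.bam ${BASE_DIR}/Processing/${SAMPLE}.sam

# ...remaining GATK steps follow the same echo-then-run pattern...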
Once you have a shell script you should be able to execute it:
sh [scriptname]
Viewing BAMs in IGV
Some additional BAMs, made from the same data source as what we have been using so far, have been provided in the ~/CourseData/Module2/BAMs directory. Here a complete alignment of all of the data was made, and then the BAM was “subset” so that it only contained data from a few regions of the genome. The regions of interest are the ones specified in the BED file we investigated earlier. We will investigate these BAMs in IGV to see what genes were included in our BAM.
The Integrative Genomics Viewer (IGV) is a tool primarily used for viewing BAM files so you can visually see how the reads are aligned, the context of mutations, etc. You can download the program here. While there is an alternative version that runs in a browser window without installing anything, in my opinion that should only be used as a last resort. At the download link there are a number of versions available depending on your operating system, and on CPU chipset in the case of MacOS versions. Go ahead and download and install IGV on your computer.
In order to make downloading the BAM files in the ~/CourseData/Module2/BAMs directory and their associated index files easy, we will create what is known as a soft-link to them in the ~/workspace directory so you can download them from the browser. Here is the general form of the command:
ln -s /path/to/original_file_or_directory /path/to/symlink
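For example, for the normal BAM (a sketch; repeat for the other files, adjusting the names):

# The index file name here is an assumption; run ls on the BAMs directory to see the actual names
ln -s /home/ubuntu/CourseData/Module2/BAMs/C-GIAB.normal.regions_of_interest.bam /home/ubuntu/workspace/
ln -s /home/ubuntu/CourseData/Module2/BAMs/C-GIAB.normal.regions_of_interest.bam.bai /home/ubuntu/workspace/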
Run this for each of the BAM and BAI files in the Module2 BAMs directory, and also for the BED file in the ~/CourseData/Module2/accessory_files directory. You can then download them via the browser, the same way you viewed the FastQC outputs previously.
First make sure the correct reference sequence is selected in the drop down menu in the upper left. The default for a new installation should be the human GRCh38 reference, which is what we want to use. IGV can be used with any reference, not just human sequences so if you ever find yourself doing microbial or viral genomics, IGV is still a useful tool!
Learning all of the ins and outs of IGV is beyond the scope of this lab; however, using it to view the basics of a BAM file only requires default options.
When you have IGV installed and running, you can open files in it with File -> Load From File. You can use this to load BAM files, and you can open multiple files at once. Navigate to where you downloaded the BAMs and their BAI index files, select both BAM files (no need to select the BAIs) and then open them. Files load as Tracks in IGV; at the very bottom there is a representation of the reference sequence. Each BAM loads two tracks, one for the alignments themselves (only viewable when zoomed in to an appropriate resolution level) and one plotting the coverage.
We can also load VCF files, which will then indicate mutations from that file as their own Track, and BED files which will plot the ranges from that file in their own Track. VCFs will tend to load as tracks at the top while BEDs will load at the bottom. Use File -> Load From File to open the BED file you downloaded as well, which plots the targeted regions of the genome that are of interest.
There are two easy ways to navigate to particular regions of the genome in IGV. You can type in a gene name, like KRAS, or specify genomic coordinates (like those in your BED file) with the format chr9:1000-1500. You can then zoom in and out to view your alignment. Try navigating to the KRAS gene and also to each of the genomic coordinates in the BED file to inspect your BAM.
By default IGV will show base pairs where the base in the sequencing read and that in the reference genome match as just a grey block. Mismatches will have the base shown and coloured, insertions in the read show up as an I character, and deletions as a - character.
Question: What genes of interest were included in the regions selected for your BAM files? Which of these are tumour suppressors and which are oncogenes?
You will use IGV more heavily tomorrow, especially when reviewing data for the Cases!
Somatic Variant Calling with MuTect2
If you have time, you can continue on with Variant Calling and Variant Annotation, which will get you started for Day 2!
Once both your tumour and normal Recalibrated BAM files are produced, you are ready for Variant Calling. However, the simple test data we have used so far won’t produce any mutation calls. After all we only aligned 100,000 sequencing reads across the entire human genome. Instead we will use the more complete BAMs you investigated with IGV above for variant calling.
MuTect2 is included in the GATK package and is run the same way, with the same docker-associated code on your command-line. Variant calling is more resource-intensive on the RAM usage side, so the Java virtual machine in the docker container needs some more specific options to make use of the RAM. In this case we are going to give it 12 GB of RAM (our virtual machine has 16) with --java-options "-Xmx12G".
MuTect2 needs a Reference FASTA file like we have used previously, and of course when running any sort of genomics analysis you will always use the exact same reference throughout.
MuTect2 can be run in a few different modes; it is a somatic variant caller optimized for tumour data, and can run with tumour-only or with matched tumour-normal data. The second option is what we will be using. To run in this mode you specify the -I input file flag twice, giving it both the tumour BAM and the normal BAM, and then use the -normal flag to give it the name of the normal sample. This needs to match the Sample (SM) Read Group tag you gave during the GATK pre-processing steps earlier; in this case 'normal' was the string we used.
MuTect2 can be run with lots of different options (see here).
In this case we will also be giving it a germline resource VCF with the --germline-resource command-line parameter and are using the GnomAD set here, which contains allele frequency information for polymorphisms and mutations seen in tens of thousands of genomes.
[docker gatk] --java-options "-Xmx12G" Mutect2 -R /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta -I /gatk/CourseData/Module2/BAMs/C-GIAB.tumour.regions_of_interest.bam -I /gatk/CourseData/Module2/BAMs/C-GIAB.normal.regions_of_interest.bam -normal normal --germline-resource /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/af-only-gnomad.hg38.vcf.gz -O /gatk/CourseData/Module2/VCFs/mutect2_tumour_normal.regions_of_interest.vcf.gz
After running baseline somatic variant calling with MuTect2 we can also apply the FilterMutectCalls program. This can incorporate various additional pieces of information, such as read bias artifacts when dealing with FFPE data or information about potential contamination. We have neither of those here, but will run FilterMutectCalls anyway.
Note: The output file name provided here will overwrite the existing VCF in the VCFs directory. You should change its name to be unique, or output it to the workspace directory!
[docker gatk] --java-options "-Xmx12G" FilterMutectCalls -R /gatk/CourseData/tools/reference/GATK-bundle-GRCh38/Homo_sapiens_assembly38.fasta -V /gatk/CourseData/Module2/VCFs/mutect2_tumour_normal.regions_of_interest.vcf.gz -O /gatk/CourseData/Module2/VCFs/mutect2_tumour_normal.filtered.regions_of_interest.vcf.gz
The output here is now a ‘complete’ VCF.
Variant Annotation with VEP
Called variants on their own are not generally the most useful. If we want to gather information like what gene a mutation is in, what the impact of a mutation is at the transcript and protein levels, or whether a mutation is a known polymorphism in the gnomAD database or a known pathogenic mutation in ClinVar, we have to add annotations.
The annotator we will use here is VEP (Variant Effect Predictor) from Ensembl (Documentation).
We won't be using a docker container here, but something similar called Apptainer. You can see it works much the same way, although -B flags are used instead of -v.
Here we are going to annotate using a GRCh38 reference set, and are telling VEP to give us 'everything'. VEP can be extended with all sorts of plug-ins and modules that go beyond what is added here, but this will be plenty to get you started looking at annotations!
apptainer exec -B /home/ubuntu/CourseData/Module2/VCFs:/data -B /media/cbwdata/CourseData/tools/vep_data:/cache ~/CourseData/tools/ensembl-vep.sif vep -i /data/test_panel.GRCh38.minimal.vcf -o /data/test_panel_GRCh38.minimal.vep.txt --cache --dir_cache /cache --species homo_sapiens --assembly GRCh38 --everything