Informatics on High-Throughput Sequencing Data
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.
CBW HT-seq Integrative Assignment
Written originally by Mathieu Bourgey, edited by Florence Cavalli
Task
We will perform the same analysis as in Module 3, but using the mother and father samples, i.e. samples NA12891 and NA12892.
The fastq files are in the following directory of the cloud instance: ~/CourseData/HT_data/Module3/
* raw_reads/NA12891_CBW_chr1_R1.fastq.gz
* raw_reads/NA12891_CBW_chr1_R2.fastq.gz
* raw_reads/NA12892_CBW_chr1_R1.fastq.gz
* raw_reads/NA12892_CBW_chr1_R2.fastq.gz
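Before starting, it can help to confirm that both read files of each sample are present and contain the same number of reads. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_reads` and `DATA_DIR` are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file.
# Assumes each record spans exactly 4 lines (no wrapped sequences).
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# DATA_DIR is an assumed variable; it defaults to the course data directory.
DATA_DIR="${DATA_DIR:-$HOME/CourseData/HT_data/Module3}"

for sample in NA12891 NA12892; do
  r1="$DATA_DIR/raw_reads/${sample}_CBW_chr1_R1.fastq.gz"
  r2="$DATA_DIR/raw_reads/${sample}_CBW_chr1_R2.fastq.gz"
  if [ -s "$r1" ] && [ -s "$r2" ]; then
    # R1 and R2 should report the same count for properly paired files
    echo "$sample R1=$(count_reads "$r1") R2=$(count_reads "$r2")"
  else
    echo "$sample: FASTQ files not found under $DATA_DIR/raw_reads" >&2
  fi
done
```

If the two counts differ for a sample, the pair is out of sync and should be re-copied before any trimming or alignment.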
Environment setup
# Set up the tool locations and the working directory
export SOFT_DIR=/usr/local/
export WORK_DIR=~/workspace/HTseq/Integrative_Assignment/
export TRIMMOMATIC_JAR=$SOFT_DIR/Trimmomatic-0.36/trimmomatic-0.36.jar
export PICARD_JAR=$SOFT_DIR/picard/picard.jar
export GATK_JAR=$SOFT_DIR/GATK/GenomeAnalysisTK.jar
export BVATOOLS_JAR=$SOFT_DIR/bvatools/bvatools-1.6-full.jar
export REF=$WORK_DIR/reference/
# Start from a clean working directory (this deletes any previous results), then link the course data into it
rm -rf "$WORK_DIR"
mkdir -p "$WORK_DIR"
cd "$WORK_DIR"
ln -s ~/CourseData/HT_data/Module3/* .
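Once the environment is set up, a quick check that each exported path actually exists can save time later. This is a sketch only; `check_path` is a hypothetical helper, not part of the course material.

```shell
# Hypothetical helper: report whether a path set during setup exists.
check_path() {
  # $1 = variable name (used as a label only), $2 = path to test
  if [ -e "$2" ]; then
    echo "found    $1: $2"
  else
    echo "missing  $1: $2"
    return 1
  fi
}

# Unset variables are treated as empty and therefore reported as missing.
missing=0
check_path TRIMMOMATIC_JAR "${TRIMMOMATIC_JAR:-}" || missing=$((missing + 1))
check_path PICARD_JAR      "${PICARD_JAR:-}"      || missing=$((missing + 1))
check_path GATK_JAR        "${GATK_JAR:-}"        || missing=$((missing + 1))
check_path BVATOOLS_JAR    "${BVATOOLS_JAR:-}"    || missing=$((missing + 1))
check_path REF             "${REF:-}"             || missing=$((missing + 1))
echo "$missing path(s) missing"
```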
Task list:
- Check read QC
- Trim unreliable bases from the read ends
- Align the reads to the reference
- Sort the alignments by chromosome position
- Realign short indels
- Fix mate issues (optional)
- Mark duplicates
- Recalibrate the base quality
- Generate alignment metrics
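The task list above could be laid out roughly as follows. This is a dry-run sketch, not the downloadable solution: `plan_sample` is a hypothetical function that only prints one candidate command per task and executes nothing. The output directories (`qc/`, `trimmed/`, `alignment/`, `metrics/`), the reference name `hg19.fa` and the known-sites file `dbSNP_135_chr1.vcf.gz` are assumptions, as are the `readsqc` options and the trimming thresholds; check them against the Module 3 commands and the files in your workspace. The GATK commands assume GATK 3 syntax, as suggested by `GenomeAnalysisTK.jar`.

```shell
# Dry-run sketch: print one candidate command per task for a sample.
# Nothing is executed. File names, hg19.fa and the dbSNP VCF are hypothetical.
plan_sample() {
  # @S@ is a placeholder replaced by the sample name given as $1
  sed "s/@S@/$1/g" <<'EOF'
mkdir -p qc trimmed alignment metrics

# 1. Check read QC (bvatools readsqc; options assumed, confirm against Module 3)
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc --quality 33 --threads 2 --regionName chr1 \
  --read1 raw_reads/@S@_CBW_chr1_R1.fastq.gz \
  --read2 raw_reads/@S@_CBW_chr1_R2.fastq.gz --output qc/@S@/

# 2. Trim unreliable bases from the read ends (Trimmomatic)
java -Xmx2G -jar ${TRIMMOMATIC_JAR} PE -threads 2 -phred33 \
  raw_reads/@S@_CBW_chr1_R1.fastq.gz raw_reads/@S@_CBW_chr1_R2.fastq.gz \
  trimmed/@S@.pair1.fastq.gz trimmed/@S@.single1.fastq.gz \
  trimmed/@S@.pair2.fastq.gz trimmed/@S@.single2.fastq.gz \
  TRAILING:20 MINLEN:32

# 3. Align the reads to the reference (bwa mem), piped directly into
# 4. Sort the alignments by chromosome position (Picard SortSam)
bwa mem -M -t 2 -R '@RG\tID:@S@\tSM:@S@\tLB:@S@\tPL:ILLUMINA' ${REF}/hg19.fa \
  trimmed/@S@.pair1.fastq.gz trimmed/@S@.pair2.fastq.gz \
  | java -Xmx2G -jar ${PICARD_JAR} SortSam INPUT=/dev/stdin \
      OUTPUT=alignment/@S@.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true

# 5. Realign short indels (GATK 3 RealignerTargetCreator + IndelRealigner)
java -Xmx2G -jar ${GATK_JAR} -T RealignerTargetCreator -R ${REF}/hg19.fa \
  -I alignment/@S@.sorted.bam -o alignment/@S@.realign.intervals
java -Xmx2G -jar ${GATK_JAR} -T IndelRealigner -R ${REF}/hg19.fa \
  -I alignment/@S@.sorted.bam -targetIntervals alignment/@S@.realign.intervals \
  -o alignment/@S@.realigned.bam

# 6. Fix mate issues, optional (Picard FixMateInformation)
java -Xmx2G -jar ${PICARD_JAR} FixMateInformation INPUT=alignment/@S@.realigned.bam \
  OUTPUT=alignment/@S@.matefixed.bam SORT_ORDER=coordinate CREATE_INDEX=true

# 7. Mark duplicates (Picard MarkDuplicates)
java -Xmx2G -jar ${PICARD_JAR} MarkDuplicates INPUT=alignment/@S@.matefixed.bam \
  OUTPUT=alignment/@S@.dup.bam METRICS_FILE=metrics/@S@.dup.metrics \
  REMOVE_DUPLICATES=false CREATE_INDEX=true

# 8. Recalibrate the base quality (GATK 3 BaseRecalibrator + PrintReads)
java -Xmx2G -jar ${GATK_JAR} -T BaseRecalibrator -R ${REF}/hg19.fa \
  -knownSites ${REF}/dbSNP_135_chr1.vcf.gz \
  -I alignment/@S@.dup.bam -o alignment/@S@.recal.grp
java -Xmx2G -jar ${GATK_JAR} -T PrintReads -R ${REF}/hg19.fa \
  -I alignment/@S@.dup.bam -BQSR alignment/@S@.recal.grp -o alignment/@S@.recal.bam

# 9. Generate alignment metrics (Picard CollectAlignmentSummaryMetrics)
java -Xmx2G -jar ${PICARD_JAR} CollectAlignmentSummaryMetrics \
  REFERENCE_SEQUENCE=${REF}/hg19.fa INPUT=alignment/@S@.recal.bam \
  OUTPUT=metrics/@S@.alignment.metrics
EOF
}

# Print the plan for both parents; redirect to a file to review or edit it.
for sample in NA12891 NA12892; do
  plan_sample "$sample"
done
```

Printing the commands rather than running them makes it easy to compare each step with the Module 3 commands and to adjust file names before executing anything.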
Discussion/Questions:
- Explain the purpose of each step.
- Which software tool can be used for each step?
The full commands are available for download in the solution.