Introduction to ChIP-seq and analysis
Introduction to ChIP-seq and analysis
by Martin Hirst and Edmund Su
Table of contents:
Common tools of the trade
- Explaination of tools
BWA Genomic Sequence Read alignment tool
PICARD Toolkit for BAM file manipulation
SAMTOOLS Toolkit for BAM file manipulation
BEDTOOLSToolkit for BED/BEDGRAPH file manipulation
MACS2 Enriched region identifier for ChIP-seq
UCSCtoolsManipulation of Wigs and Bed/Bedgraph to binary forms
- Resource files
- BWA INDEX:
- ~/CourseData/EPI_data/Module1/BWA_index/
- Enhancer file.bed:
- ~/CourseData/EPI_data/Module1/QC_resources/
- download various state7 and merge https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/
- TSS.bed:
- ~/CourseData/EPI_data/Module1/QC_resources/
- Generated by downloading Ensemblv79 GTF convert to Bed +/-2kb of TSS. See https://www.biostars.org/p/56280/
4.HOX regions.bed
- ~/CourseData/EPI_data/Module1/QC_resources/
- Generated by downloading Ensemblv79 GTF convert to Bed then selecting for HOX
- Hg38 Black list regions
- ~/CourseData/EPI_data/Module1/QC_resources/
- https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
- Resource files
1.CEMT Pooled Breast Basal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69069.CEEHRC.CEMT0035.H3K27ac.peak_calls.bigBed
2.CEMT Pooled Breast Basal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69067.CEEHRC.CEMT0035.H3K27ac.signal_unstranded.bigWig
3.CEMT Pooled Breast Basal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69063.CEEHRC.CEMT0035.H3K27me3.peak_calls.bigBed
4.CEMT Pooled Breast Basal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69061.CEEHRC.CEMT0035.H3K27me3.signal_unstranded.bigWig
5.CEMT Pooled Breast Basal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
-
https://epigenomesportal.ca/tracks/CEEHRC/hg38/69054.CEEHRC.CEMT0035.H3K4me1.peak_calls.bigBed
6.CEMT Pooled Breast Basal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69052.CEEHRC.CEMT0035.H3K4me1.signal_unstranded.bigWig
7.CEMT Pooled Breast Basal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69057.CEEHRC.CEMT0035.H3K4me3.peak_calls.bigBed
8.CEMT Pooled Breast Basal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69055.CEEHRC.CEMT0035.H3K4me3.signal_unstranded.bigWig
9.CEMT Pooled Breast Basal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69070.CEEHRC.CEMT0035.Input.signal_unstranded.bigWig
10.CEMT Pooled Breast Stromal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69088.CEEHRC.CEMT0036.H3K27ac.peak_calls.bigBed
11.CEMT Pooled Breast Stromal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69086.CEEHRC.CEMT0036.H3K27ac.signal_unstranded.bigWig
12.CEMT Pooled Breast Stromal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69082.CEEHRC.CEMT0036.H3K27me3.peak_calls.bigBed
13.CEMT Pooled Breast Stromal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69080.CEEHRC.CEMT0036.H3K27me3.signal_unstranded.bigWig
14.CEMT Pooled Breast Stromal-
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69073.CEEHRC.CEMT0036.H3K4me1.peak_calls.bigBed
15.CEMT Pooled Breast Stromal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69071.CEEHRC.CEMT0036.H3K4me1.signal_unstranded.bigWig
16.CEMT Pooled Breast Stromal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69076.CEEHRC.CEMT0036.H3K4me3.peak_calls.bigBed
17.CEMT Pooled Breast Stromal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69074.CEEHRC.CEMT0036.H3K4me3.signal_unstranded.bigWig
18.CEMT Pooled Breast Stromal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69089.CEEHRC.CEMT0036.Input.signal_unstranded.bigWig
19.CEMT Pooled Breast Luminal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69107.CEEHRC.CEMT0037.H3K27ac.peak_calls.bigBed
20.CEMT Pooled Breast Luminal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69105.CEEHRC.CEMT0037.H3K27ac.signal_unstranded.bigWig
- CEMT Pooled Breast Luminal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69101.CEEHRC.CEMT0037.H3K27me3.peak_calls.bigBed
22.CEMT Pooled Breast Luminal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69099.CEEHRC.CEMT0037.H3K27me3.signal_unstranded.bigWig
23.CEMT Pooled Breast Luminal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69092.CEEHRC.CEMT0037.H3K4me1.peak_calls.bigBed
24.CEMT Pooled Breast Luminal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69090.CEEHRC.CEMT0037.H3K4me1.signal_unstranded.bigWig
25.CEMT Pooled Breast Luminal
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69095.CEEHRC.CEMT0037.H3K4me3.peak_calls.bigBed
26.CEMT Pooled Breast Luminal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69093.CEEHRC.CEMT0037.H3K4me3.signal_unstranded.bigWig
27.CEMT Pooled Breast Luminal
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69108.CEEHRC.CEMT0037.Input.signal_unstranded.bigWig
28.CEMT Pooled Breast Luminal Progenitor
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69126.CEEHRC.CEMT0038.H3K27ac.peak_calls.bigBed
29.CEMT Pooled Breast Luminal Progenitor
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69124.CEEHRC.CEMT0038.H3K27ac.signal_unstranded.bigWig
30.CEMT Pooled Breast Luminal Progenitor
- ~/CourseData/EPI_data/Module1/CHEERC_resources
-
https://epigenomesportal.ca/tracks/CEEHRC/hg38/69120.CEEHRC.CEMT0038.H3K27me3.peak_calls.bigBed
31.CEMT Pooled Breast Luminal Progenitor
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69118.CEEHRC.CEMT0038.H3K27me3.signal_unstranded.bigWig
32.CEMT Pooled Breast Luminal Progenitor
- ~/CourseData/EPI_data/Module1/CHEERC_resources
-
https://epigenomesportal.ca/tracks/CEEHRC/hg38/69111.CEEHRC.CEMT0038.H3K4me1.peak_calls.bigBed
33.CEMT Pooled Breast Luminal Progenitor
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69109.CEEHRC.CEMT0038.H3K4me1.signal_unstranded.bigWig
34.CEMT Pooled Breast Luminal Progenitor
- ~/CourseData/EPI_data/Module1/CHEERC_resources
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69114.CEEHRC.CEMT0038.H3K4me3.peak_calls.bigBed
35.CEMT Pooled Breast Luminal Progenitor
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69112.CEEHRC.CEMT0038.H3K4me3.signal_unstranded.bigWig
36.CEMT Pooled Breast Luminal Progenitor
- https://epigenomesportal.ca/tracks/CEEHRC/hg38/69127.CEEHRC.CEMT0038.Input.signal_unstranded.bigWig
Module 1. BWA Alignment + BAM post processing
Setup
mkdir ~/workspace/{alignments,fastqc}
1. Make a reference directory and download appropriate fasta reference @ https://hgdownload.cse.ucsc.edu/goldenpath/ (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz). For the live tutorial this step was done for you. We will be using a shortened version containining only chr19
Code:
mkdir ~/CourseData/BWA_index
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -O ~/CourseData/BWA_index/
gunzip ~/CourseData/EPI_data/Module1/BWA_index/hg38.fa.gz
2 .Index Fasta using BWA. For the live tutorial, this step was done for you.
Code:
bwa index ~/CourseData/EPI_data/Module1/BWA_index/chr19.hg38_no_alt.fa
ls ~/CourseData/EPI_data/Module1/BWA_index/ -lth
Output:
total 155M
8126488 -rw-rw-r-- 1 ubuntu ubuntu 28M Sep 7 19:22 chr19.hg38_no_alt.fa.sa
8126486 -rw-rw-r-- 1 ubuntu ubuntu 108 Sep 7 19:22 chr19.hg38_no_alt.fa.amb
8126485 -rw-rw-r-- 1 ubuntu ubuntu 153 Sep 7 19:22 chr19.hg38_no_alt.fa.ann
8126484 -rw-rw-r-- 1 ubuntu ubuntu 14M Sep 7 19:22 chr19.hg38_no_alt.fa.pac
8126487 -rw-rw-r-- 1 ubuntu ubuntu 56M Sep 7 19:22 chr19.hg38_no_alt.fa.bwt
8126483 -rw-rw-r-- 1 ubuntu ubuntu 57M Sep 7 18:36 chr19.hg38_no_alt.fa
3. Run quality Check
Code:
mkdir -p ~/workspace/fastqc
fastq_file_A=~/CourseData/EPI_data/Module1/MCF10A_resources/R1_input.fastq.gz
fastq_file_B=~/CourseData/EPI_data/Module1/MCF10A_resources/R1_h3k27ac.fastq.gz
fastqc ${fastq_file_A} -o ~/workspace/fastqc
fastqc ${fastq_file_B} -o ~/workspace/fastqc
Output:
Started analysis of R1_input.fastq.gz
Approx 5% complete for R1_input.fastq.gz
Approx 10% complete for R1_input.fastq.gz
Approx 15% complete for R1_input.fastq.gz
Approx 20% complete for R1_input.fastq.gz
Approx 25% complete for R1_input.fastq.gz
Approx 30% complete for R1_input.fastq.gz
Approx 35% complete for R1_input.fastq.gz
Approx 40% complete for R1_input.fastq.gz
Approx 45% complete for R1_input.fastq.gz
Approx 50% complete for R1_input.fastq.gz
Approx 55% complete for R1_input.fastq.gz
Approx 60% complete for R1_input.fastq.gz
Approx 65% complete for R1_input.fastq.gz
Approx 70% complete for R1_input.fastq.gz
Approx 75% complete for R1_input.fastq.gz
Approx 80% complete for R1_input.fastq.gz
Approx 85% complete for R1_input.fastq.gz
Approx 90% complete for R1_input.fastq.gz
Approx 95% complete for R1_input.fastq.gz
Analysis complete for R1_input.fastq.gz
Comment:
Check out the HTML file produced!
Note the GC distribution of Input VS H3K27ac
Input
H3K27ac
4. Run alignment
Code:
mkdir -p ~/workspace/alignments
ref=~/CourseData/EPI_data/Module1/BWA_index/chr19.hg38_no_alt.fa
read1=~/CourseData/EPI_data/Module1/MCF10A_resources/R1_input.fastq.gz
read2=~/CourseData/EPI_data/Module1/MCF10A_resources/R2_input.fastq.gz
sample=MCF10A_input_chr19
bwa mem -M -t 4 ${ref} ${read1} ${read2} 2>~/workspace/alignments/alignment.log | samtools view -hbS -o ~/workspace/alignments/${sample}.bam
Output:
Check log
Code Breakdown:
mkdir -p ~/workspace/alignments # Make working directory
ref=~/CourseData/EPI_data/Module1/BWA_index/chr19.hg38_no_alt.fa # Preset set variables
read1=~/CourseData/EPI_data/Module1/MCF10A_resources/R1_input.fastq.gz
read2=~/CourseData/EPI_data/Module1/MCF10A_resources/R2_input.fastq.gz
sample=MCF10A_input_chr19
bwa mem -M -t 4 ${ref} ${read1} ${read2} 2>~/workspace/alignments/alignment.log # Alignment step {"-t 2":Peform alignment using four threads. Our tutorial uses m5.xlarge therefore provides four cores,"-M":mark short split hits as secondary}
| samtools view -hbS -o ~/workspace/alignments/${sample}.bam # Converts SAM to BAM format {"h":Include Header, "b": output BAM, "S":detect input format}
5. Coordinate Sort alignment File
Code:
sample=MCF10A_input_chr19
samtools sort ~/workspace/alignments/${sample}.bam -o ~/workspace/alignments/${sample}.sorted.bam
samtools view ~/workspace/alignments/${sample}.bam | head
samtools view ~/workspace/alignments/${sample}.sorted.bam | head
Output:
Output from ~/workspace/alignments/MCF10A_input_chr19.bam:
HS10_346:2:1101:3349:1883 83 chr19 49469101 60 75M = 49468882 -294 CTTGACAAGAAGGTTTTGAGGCCCCGCCCTTAGGACTCAAGTTACTAAGGAAGAGGCTGTCCTTAGCAACAGGGN DEEDDDDDDDDDDDDDDDDDDDFJJJJJJJJJJJJJJJJIIIIIIJJJJJJJJJJJJJIHJJHHHHHFFFFD=1# NM:i:1 MD:Z:74G0 MC:Z:75M AS:i:74 XS:i:0
HS10_346:2:1101:3349:1883 163 chr19 49468882 60 75M = 49469101 294 CCAGCAGGCCTGGCCAACGTGGTGACAGGAGACCGGGACCATCTGACCCGCTGCCTGGCCTTGCACCAAGACGTC CCCFFFFFGHGHGIJJJJJFHIEHIIJJJICHHIIGGIJGIJJIJJJJHGBEDDEEDDDDDDDCCDCDDDDDDBD NM:i:0 MD:Z:75 MC:Z:75M AS:i:75 XS:i:0
HS10_346:2:1101:3408:1953 83 chr19 23790237 60 75M = 23790104 -208 TCCACCCGCCTAGGCTTCCCAAAGTGGTGGGATTACAGACGTGAGCCACTGGACCCGGCCTGATTTTCTCTTGAN BDB9DDDEEFFFFFHHHHEJJJJJJJJIJJJJJJGIJJJIIJJJJJHGJJJIHJJJJJJJJJHHHHHFFFFD=1# NM:i:2 MD:Z:47G26A0 MC:Z:5M1I69M AS:i:69 XS:i:35
HS10_346:2:1101:3408:1953 163 chr19 23790104 60 5M1I69M = 23790237 208 ATTCTCCTGTCTCAGTCTCCTGAGTAGCTGGGATTACAGGCGCCCGCCACTGTGCCCGGCTACTTTTTGTATTTT CCCFFFFFDFFHHJJ?FHGIJJEHEFHIIGJJEGGIIFIIGHIJIJJJJIFIGJJIGGHEDEFEFEEED?DCFEE NM:i:1 MD:Z:74 MC:Z:75M AS:i:69 XS:i:50
HS10_346:2:1101:11940:1853 99 chr19 42651286 60 75M = 42651536 325 NTTTCTCCATCAACTTAGCTGGCAGCTCCTGTCCCCAGCAGCATCAGAGGCCCCATGAAAAGAGCTCCAGCAGGG #1=DFFFFHHHHHJJJJJJJJJIJGIJJJIIIIJJJJIJJJJJJJJJJJJJJJJJJIJIJJJJJHHHHHFFFFFD NM:i:1 MD:Z:0G74 MC:Z:75M AS:i:74 XS:i:31
HS10_346:2:1101:11940:1853 147 chr19 42651536 60 75M = 42651286 -325 GGGAGGAGAGAAGGGAACTGTAGGCCAATGGCTTTATTGGGTCTAGGGTGTTATTGACAGGTTTCCAGAAGGGAG HHHHJIJJJJJJIJIGHJJJJJJIHJJJJJIJJJJJJJJJJIGIJJJJJIHFCCCJJJJJJJHHHHHFFFFFCCC NM:i:0 MD:Z:75 MC:Z:75M AS:i:75 XS:i:0
HS10_346:2:1101:20208:1926 83 chr19 5876554 60 75M = 5876421 -208 GCGGTGAGGCCATCTATGCCCCTCGTTGGGGTCCTGGTCTTCATTGGACACCCCAGCTCCTCCCTCAGCCTGGGN DDDDDDDDDDDEDDDDDDDDFFHHHJJJJJJJJJJJJJJJJJIJJIHDD:JJJJJJJJJJJJHHHHHFFFFD=4# NM:i:2 MD:Z:15G58C0 MC:Z:32M1D24M2I17M AS:i:69 XS:i:24
HS10_346:2:1101:20208:1926 163 chr19 5876421 60 32M1D24M2I17M = 5876554 208 GGGGCTCTCTTGCCCTCCCAGCCAGATCATCCTTCTACTGGCTCCTCCAACCACCCCTGTGCCCCTGATTCTAGG CCCFFFFFHHHHHJJJJJIJJJJJJJJJJIJIIJIJJJJJIJJJICHIGHHHIIJJJCGEGGHHH=DBEFFDDEE NM:i:3 MD:Z:32^T41 MC:Z:75M AS:i:58 XS:i:0
HS10_346:2:1101:21139:1859 83 chr19 41375797 60 75M = 41375616 -256 AGCAAGACCCCGTTTTTTAAAAAATAATAATAAAAAAAAAATCCGCCGGGCGCGGTTGCTCACGCCTGTAATCCN ####################################################################?<@?70# NM:i:4 MD:Z:13C29T12G17T0 MC:Z:75M AS:i:59 XS:i:28
HS10_346:2:1101:21139:1859 163 chr19 41375616 60 75M = 41375797 256 AACAAGCCAACACGCCTTCAGCACTCCTCCGCAAAAAAACACCCCTAAACAAAATAGGCCAGGCGCGGTGACTCA :+:=?;+<<A7=<)<1+1+22?@3@B>B>30?A04==AA#################################### NM:i:2 MD:Z:13C31A29 MC:Z:75M AS:i:65 XS:i:24
Output from ~/workspace/alignments/$MCF10A_input_chr19.sorted.bam:
HS10_346:2:1304:19247:16557 117 chr19 60146 0 * = 60146 0 GTTGAGTAATTGCTGAGATGGGCAGTAGAGATGCTCAGGTCTGTGGTCCCTTTCCATCCCCACTTGATCTATTTT ########################################################################### MC:Z:54M21S AS:i:0 XS:i:0
HS10_346:2:1304:19247:16557 185 chr19 60146 60 54M21S = 60146 0 TACAAGGATAATCTGACCTGCAGGGTCGAGGAGTTGACGGTGCTGAGTTCCCTGGATGGCACCAAGATCGGCCCT DCCDDDCCCDCDECEEEFFFFHGHHIIHFGHIGGEIFAHCIEHGGIGHGHEHEDGEIIIHFG?HFFDDFFFD@@@ NM:i:0 MD:Z:54 AS:i:54 XS:i:0
HS10_346:2:1314:2495:92223 163 chr19 60167 60 75M = 60313 221 AGGGTCAAGGAGTTGACGGTGCTGAGTTCCCTGCACTCTCAGTAGGGACAGGCCCTATGCTGCCACCTGTACATG @@?DDDDDHHHHFGHJIJJFEHJJJJEIIIGHIJJJJGEIBHHGIJJIIJJJJJJJJIJJIJJJHGHHFDDFCEF NM:i:1 MD:Z:6G68 MC:Z:75M AS:i:70 XS:i:0
HS10_346:2:1207:4332:57794 99 chr19 60172 60 75M = 60350 253 CGAGGAGTTGACGGTGCTGAGTTCCCTGCACTCTCAGTAGGGACAGGCCCTATGCTGCCACCTGTACATGCTATC @C@FFDFDFFFHHIHIIJHHGGIIJGIEEHIJIJJJIJJIJJGIGHIHIHHJJJJJJJJIHHHEEHHHHFFFFFF NM:i:0 MD:Z:75 MC:Z:75M AS:i:75 XS:i:0
HS10_346:2:2315:10492:97635 121 chr19 60173 60 75M = 60173 0 GAGGAGTGGACGGTGCTGAGTTCCCTGCTCTCTCAGTAGGGACAGGCCCTATGCTGCCACCTGTACATGCTATCA ########################################################################### NM:i:3 MD:Z:7T20A45T0 AS:i:64 XS:i:0
HS10_346:2:2315:10492:97635 181 chr19 60173 0 * = 60173 0 AAAAATCGAAAATACTTTTAACAATTTGTATTTGATTTATAACTTTTAAACATTTTTATAATGACATTTAAAAAA IJIGIHFCBHEC=8GHJIIIHIEEEIHDGGHGIGIGGHGEEEGCIHHAC?HEGJJGJIIGHBHDDHHFFFFD@@@ MC:Z:75M AS:i:0 XS:i:0
HS10_346:2:1314:17241:18282 117 chr19 60211 0 * = 60211 0 CTCTGTGATCTTCTCCATGGCAGGATCTCCCAGCAGGTAAAGCAGAGCCGGAGCCAGGTGCAGGCCATTGGAGAG @CCD@@EEEHFHEA=HGGC@CF7=@@F;HF<CCBBDDB9DB@<>GG@E@HEIHACHDDHFDADF>HBDDDDB?@@ MC:Z:75M AS:i:0 XS:i:0
HS10_346:2:1314:17241:18282 185 chr19 60211 60 75M = 60211 0 GGGACAGGCCCTATGCTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGTATTTA ##AA=5;;BA;BB=(7.8/**)A7=909<A??BB????*A:7@11)22+3<3?3,33<2<=?CA<,C7?A?===: NM:i:0 MD:Z:75 AS:i:75 XS:i:0
HS10_346:2:1201:2781:75049 99 chr19 60221 60 75M = 60484 338 CTATGCTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGTATTTACACATGCACA CCCFFFFFHGGHHJJIJJJJJJJJJJJJJJJJJJJIIJJJJJGIJIIJJJJJIJJJJJJ@GIIJJJJIJHHHHHF NM:i:0 MD:Z:75 MC:Z:75M AS:i:75 XS:i:0
HS10_346:2:1314:2495:92223 83 chr19 60313 60 75M = 60167 -221 CAAGCACTTCACAACCCCTCATGATCACGTGCAGCAGACAAAGTGGCCTCTGCAGAGGGGGAACGGAGACCGGAG DCADDDDEEDCDFDHIIIEJJJJIJIHFHGGIEHGIHIIJJJJJJJIJIGIJIIJJJJJJIHFGHHHFDFFFCCC NM:i:1 MD:Z:41T33 MC:Z:75M AS:i:70 XS:i:0
Code Breakdown:
Coordinate sorts
Comment:
Take a look at the coordinate sorted bam vs original. When we view, notice the coordinates in the sorted bam were altered
6. Duplicate marking alignment
Code:
sample=MCF10A_input_chr19
java -jar /usr/local/picard/picard.jar MarkDuplicates I=~/workspace/alignments/${sample}.sorted.bam O=~/workspace/alignments/${sample}.dup_marked.sorted.bam M=~/workspace/alignments/${sample}.dup_marked.output.log ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT > ~/workspace/alignments/${sample}.dup_marked.error.log
Code Breakdown:
java -jar \ # Requires java to intepret
/usr/local/picard/picard.jar \ #Point to toolkit
MarkDuplicates \ #Specify function from toolkit
I=~/workspace/alignments/${sample}.sorted.bam \ #Input
O=~/workspace/alignments/${sample}.dup_marked.sorted.bam \ #output
M=~/workspace/alignments/${sample}.dup_marked.output.log \ #Work log
ASSUME_SORTED=TRUE \ #B/C we already sorted this will be true
VALIDATION_STRINGENCY=LENIENT \ #Emit warnings but keep going if possible
> ~/workspace/alignments/{sample}.dup_marked.error.log
Output:
See log. Its a bit long but the breakdown as follows:
1. A summary of the command used (so we can check the parameters)
2. A metric rollup equivalent to the flagstat step run later on
3. A histogram where, Col 1 is expected coverage col 2 is actual
"In case of many duplicates, the second column will result in much lower values, indicating that sequencing more will not add proportionally to the obtained effective coverage."
https://github.com/broadinstitute/picard/issues/917
7. Flagstats
Code:
sample=MCF10A_input_chr19
samtools flagstat ~/workspace/alignments/${sample}.dup_marked.sorted.bam > ~/workspace/alignments/${sample}.dup_marked.sorted.flagstat
Output:
1726243 + 0 in total (QC-passed reads + QC-failed reads)
34545 + 0 secondary
0 + 0 supplementary
64636 + 0 duplicates
1545613 + 0 mapped (89.54% : N/A)
1691698 + 0 paired in sequencing
845849 + 0 read1
845849 + 0 read2
1192976 + 0 properly paired (70.52% : N/A)
1330944 + 0 with itself and mate mapped
180124 + 0 singletons (10.65% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Comment:
Run flagstat on MCF10A_input_chr19.sorted.bam. What are the differences?
Try running the alignment with MCF10A_H3K27ac_chr19!
8. Clean up!
Code:
rm ~/workspace/alignments/${sample}.sorted.bam
rm ~/workspace/alignments/${sample}.bam
Comment:
Good practice to remove redundant files
Module 2. Peak Calling
Setup
mkdir ~/workspace/peaks
samtools index -@4 ~/CourseData/EPI_data/Module1/MCF10A_resources/*.bam
1. Dedup BAM file
Code:
treatment_bam=~/CourseData/EPI_data/Module1/MCF10A_resources/MCF10A_H3K27ac.bam
treatment_dedup=~/workspace/peaks/MCF10A_H3K27ac.dedup.bam
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 -o ${treatment_dedup}
input_bam=~/CourseData/EPI_data/Module1/MCF10A_resources/MCF10A_Input.bam
input_dedup=~/workspace/peaks/MCF10A_Input.dedup.bam
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 -o ${treatment_dedup}
Code Breakdown:
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 chr19# {"-F1028": removes reads that unmapped and duplicated,
"-@4": uses two threads,
"-bh": outputs in binary,
"q10": MAPQ must be >=10
Comments:
This step will take a while, copies have been prepared in resources. Smallest file taking 3 minutes. Longest taking 10.
2. Running a peak caller
Code:
mkdir -p ~/workspace/peaks
treatment_frag=~/CourseData/EPI_data/Module1/MCF10A_resources/MCF10A_H3K27ac.dedup.bam
input_frag=~/CourseData/EPI_data/Module1/MCF10A_resources/MCF10A_Input.dedup.bam
name=MCF10A_H3K27ac
macs3 callpeak -t ${treatment_frag} -c ${input_frag} -f BAMPE -g hs -n ${name} --keep-dup all --outdir ~/workspace/peaks/ --bdg 1> ~/workspace/peaks/${name}.out.log 2> ~/workspace/peaks/${name}.err.log
Code Breakdown:
macs3 callpeak \ #General purpose peak calling mode
-t ${treatment_frag} \ #Treatment file, can provide more than one to "pool"
-c ${input_frag} \ #Control File/Input
-f BAMPE \ #Instructs MACS2 on what kind of file to expect. Single/Paired-end bed/bam
-g hs \ #Sets appropriate genome size for background sampling hs=human, mm=humouse
-n ${name} \ #name
#-q 0.05 #FDR q value default
#-broad #"broad mode" for broad marks - stitches small peaks together
--outdir ~/workspace/peaks/ \ #where to output files otherwise stores in current working directory
--bdg \ #outputs pileup into bedgraphs
1> ~/workspace/peaks/${name}.out.log \ #output log
2> ~/workspace/peaks/${name}.err.log #error log
cat ~/workspace/peaks/${name}.err.lo
Comments:
List for common marks but is not exhaustive. Best practice would be to observe trends on genome browser and then decide.
Narrow Marks:
H3K27ac
H3K4me3
Broad Marks:
H3K27me3
H3K4me1
H3K36me3
H3K36me2
H3K9me3
Take a look at the err.log we see:
1. Summary of command (a good way of troubleshooting if the wrong paramters were provided
2. Progress of loading the two bams
3. Note the line "mean fragment size is determined as 227.0 bp from treatment"
Output:
See logs
2 Black list removal/Inspection
-
2.1 Black list removal
Code:
blacklist=~/CourseData/EPI_data/Module1/QC_resources/hg38_blacklist.bed sample="MCF10A_H3K27ac_peaks" bedtools intersect -v -a ~/workspace/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/peaks/${sample}.blacklistRemoved.narrowPeak bedtools intersect -u -a ~/workspace/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/peaks/${sample}.blacklist.narrowPeak
Output:
No log. Bedtools completes Silently
-
2.2 NarrowPeak inspection
Code:
sample="MCF10A_H3K27ac_peaks"
wc -l ~/workspace/peaks/${sample}.blacklistRemoved.narrowPeak
head ~/workspace/peaks/${sample}.blacklistRemoved.narrowPeak -n5
Output:
32384 /home/ubuntu/workspace/peaks/MCF10A_H3K27ac_peaks.blacklistRemoved.narrowPeak
chr1 10003 10467 MCF10A_H3K27ac_peak_1 2160 . 47.2436 221.596 216.028 179
chr1 17283 17514 MCF10A_H3K27ac_peak_2 27 . 3.9983 5.1706 2.76136 82
chr1 180312 180997 MCF10A_H3K27ac_peak_3 803 . 28.8766 85.1306 80.3433 461
chr1 777953 778720 MCF10A_H3K27ac_peak_4 210 . 10.1154 24.4429 21.0334 503
chr1 778984 779223 MCF10A_H3K27ac_peak_5 44 . 4.87285 7.00538 4.45379 69
Comments:
chromosome name
peak start
peak stop
peak name
int(-10*log10pvalue)
N/A
Fold change at peak summit
-log10 P-value at Peak
-log10 Q-value at Peak
Summit position relative to peak
-
2.3 Quality control via % of reads in key genomic areas
Code:
sample="MCF10A_H3K27ac"
query_bam=~/CourseData/EPI_data/Module1/MCF10A_resources/${sample}.bam
samtools view -@4 -q 10 -F 1028 $query_bam -c
samtools view -@4 -q 10 -F 1028 $query_bam -L ~/CourseData/EPI_data/Module1/QC_resources/encode_enhancers_liftover.bed -c
samtools view -@4 -q 10 -F 1028 $query_bam -L ~/workspace/peaks/${sample}_peaks.blacklistRemoved.narrowPeak -c
Output:
19632094
10752392
2028377
Comments:
samtools view #Viewing entirety or subset of BAM file
-q 10 \ #Parsing for reads with 5>=MAPQ
-F 1028 \ # Remove reads that have the following flags:UNMAP,DUP #https://broadinstitute.github.io/picard/explain-flags.html
$query_bam \ # Bam of interest
-c # Count number of reads instead of displaying
-L # Region of interest. Bed File
Enrichment at key regions:
H3K4me1 | H3K27me3 | H3K27ac | H3K4me3 | |
---|---|---|---|---|
Reads Count | ||||
Total | 71,545,746 | 55,438,916 | 19,632,094 | 26,381,511 |
Enhancer | 48,466,245 | 25,742,977 | 10,752,392 | 12,302,265 |
HOX | 13,866 | 13,085 | 4,504 | 28,674 |
TSS | 7,303,584 | 5,111,631 | 2,702,194 | 18,041,080 |
In Peaks | 27,539,876 | 25,072,097 | 2,028,377 | 20,920,281 |
H3K4me1 | H3K27me3 | H3K27ac | H3K4me3 | |
Percent of Total | ||||
Total | 100.00 | 100.00 | 100.00 | 100.00 |
Enhancer | 67.74 | 46.43 | 54.77 | 46.63 |
HOX | 0.02 | 0.02 | 0.02 | 0.11 |
TSS | 10.21 | 9.22 | 13.76 | 68.39 |
In Peaks | 38.49 | 45.22 | 10.33 | 79.30 |
3. Visualization
-
3.1 Visualization (Coverage)
Code:
mkdir -p ~/workspace/{bigBed,bigWig} sample="MCF10A_H3K27ac" input_bedgraph=~/workspace/peaks/${sample}_treat_pileup.bdg output_bigwig=~/workspace/bigWig/${sample}_treat_pileup.bw chrom_sizes=~/CourseData/EPI_data/Module1/QC_resources/hg38.chrom.sizes sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/bigWig/tmp bedGraphToBigWig ~/workspace/bigWig/tmp ${chrom_sizes} ${output_bigwig} rm ~/workspace/bigWig/tmp
input_bedgraph=~/workspace/peaks/${sample}_control_lambda.bdg
output_bigwig=~/workspace/bigWig/${sample}_control_lambda.bw
chrom_sizes=~/CourseData/EPI_data/Module1/QC_resources/hg38.chrom.sizes
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/bigWig/tmp
bedGraphToBigWig ~/workspace/bigWig/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/bigWig/tmp
Output:
N/A
Code Breakdown:
mkdir -p ~/workspace/{bigBed,bigWig}
sample="MCF10A_H3K27ac" # Specify variables
input_bedgraph=~/workspace/peaks/${sample}_treat_pileup.bdg #
output_bigwig=~/workspace/bigWig/${sample}_treat_pileup.bw #
chrom_sizes=~/CourseData/EPI_data/Module1/QC_resources/hg38.chrom.sizes #
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/bigWig/tmp # Sorts peaks as to match alphabetical order in col 1 and numerical in col 2 #$(chrom_sizes) into a temporary file
bedGraphToBigWig ~/workspace/bigWig/tmp ${chrom_sizes} ${output_bigwig} # Conversion step
rm ~/workspace/bigWig/tmp #remove temporary file
-
3.2 Visualization (Peaks)
Code:
mkdir -p ~/workspace/{bigBed,bigWig} sample="MCF10A_H3K27ac" input_bed=~/workspace/peaks/${sample}_peaks.blacklistRemoved.narrowPeak output_bigbed=~/workspace/bigBed/${sample}_peaks.blacklistRemoved.bb chrom_sizes=~/CourseData/EPI_data/Module1/QC_resources/hg38.chrom.sizes cut -f1-3 ${input_bed} | sort -k1,1 -k2,2n > ~/workspace/bigBed/tmp bedToBigBed ~/workspace/bigBed/tmp ${chrom_sizes} ${output_bigbed} rm ~/workspace/bigBed/tmp
Output:
N/A
Code Breakdown:
sample="MCF10A_H3K27ac"
input_bed=~/workspace/peaks/${sample}_peaks.blacklistRemoved.narrowPeak
output_bigbed=~/workspace/bigBed/${sample}_peaks.blacklistRemoved.bb
chrom_sizes=~/CourseData/EPI_data/Module1/QC_resources/hg38.chrom.sizes
cut -f1-3 ${input_bed} | sort -k1,1 -k2,2n > ~/workspace/bigBed/tmp
bedToBigBed ~/workspace/bigBed/tmp ${chrom_sizes} ${output_bigbed} # Largely the same as above, except we trim our narrowPeak file to just 3 columns
rm ~/workspace/bigBed/tmp
-
3.3 Visualization
- Download tracks via public url
- Upload onto IGV
- Explore!
Comments:
```Shell If you’re following via amazon AWS, you should have a public port open to the IPv4 associated with your instance in which you can interact via browser The files you would want to import would be: