Epigenomic Data Analysis 2018 Module 2 Lab

CALLING ENRICHMENTS using MACS2

by Misha Bilenky

Set up our variables

source /home/partage/epigenomics/chip-seq/setup.sh

Load Python2.7.* and MACS2 modules

module load mugqic/python/2.7.11
module load mugqic/MACS2/2.1.1.20160309

The MACS2 tutorial can be found:

https://github.com/taoliu/MACS

Check macs2 version (and see that it is properly loaded)

macs2 --version

Regular peak calling with macs2:

cd
signal=/home/partage/epigenomics/chip-seq/H1.chr19/H3K4me3.H1.chr19.bam
control=/home/partage/epigenomics/chip-seq/H1.chr19/Input.H1.chr19.bam
macs2 callpeak -t $signal -c $control -f BAM -g hs -n H3K4me3.H1.chr19 -B -q 0.01 --outdir ~/H1.macs2.chr19 

Parameters (see https://github.com/taoliu/MACS for more details)

-t treatment
-c control
-f input files format
-g using human genome
-n name
-B generate output as a bedGraph

That is what we see on the screen:

[stud02@workshop103 ~]$ macs2 callpeak -t $signal -c $control -f BAM -g hs -n H3K4me3.H1.chr19 -B -q 0.01 --outdir ~/H1.macs2.chr19 
INFO  @ Tue, 19 Jun 2018 05:51:16: 
# Command line: callpeak -t /home/partage/epigenomics/chip-seq/H1.chr19/H3K4me3.H1.chr19.bam -c /home/partage/epigenomics/chip-seq/H1.chr19/Input.H1.chr19.bam -f BAM -g hs -n H3K4me3.H1.chr19 -B -q 0.01 --outdir /home/stud02/H1.macs2.chr19
# ARGUMENTS LIST:
# name = H3K4me3.H1.chr19
# format = BAM
# ChIP-seq file = ['/home/partage/epigenomics/chip-seq/H1.chr19/H3K4me3.H1.chr19.bam']
# control file = ['/home/partage/epigenomics/chip-seq/H1.chr19/Input.H1.chr19.bam']
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 1.00e-02
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
# Broad region calling is off
# Paired-End mode is off
 
INFO  @ Tue, 19 Jun 2018 05:51:16: #1 read tag files... 
INFO  @ Tue, 19 Jun 2018 05:51:16: #1 read treatment tags... 
INFO  @ Tue, 19 Jun 2018 05:51:23:  1000000 
INFO  @ Tue, 19 Jun 2018 05:51:27: #1.2 read input tags... 
INFO  @ Tue, 19 Jun 2018 05:51:33:  1000000 
INFO  @ Tue, 19 Jun 2018 05:51:40:  2000000 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 tag size is determined as 36 bps 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 tag size = 36 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1  total tags in treatment: 1697605 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 user defined the maximum tags... 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1  tags after filtering in treatment: 1235613 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1  Redundant rate of treatment: 0.27 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1  total tags in control: 2539258 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 user defined the maximum tags... 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1  tags after filtering in control: 2227048 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1  Redundant rate of control: 0.12 
INFO  @ Tue, 19 Jun 2018 05:51:43: #1 finished! 
INFO  @ Tue, 19 Jun 2018 05:51:43: #2 Build Peak Model... 
INFO  @ Tue, 19 Jun 2018 05:51:43: #2 looking for paired plus/minus strand peaks... 
INFO  @ Tue, 19 Jun 2018 05:51:44: #2 number of paired peaks: 13050 
INFO  @ Tue, 19 Jun 2018 05:51:44: start model_add_line... 
INFO  @ Tue, 19 Jun 2018 05:51:44: start X-correlation... 
INFO  @ Tue, 19 Jun 2018 05:51:44: end of X-cor 
INFO  @ Tue, 19 Jun 2018 05:51:44: #2 finished! 
INFO  @ Tue, 19 Jun 2018 05:51:44: #2 predicted fragment length is 297 bps 
INFO  @ Tue, 19 Jun 2018 05:51:44: #2 alternative fragment length(s) may be 297 bps 
INFO  @ Tue, 19 Jun 2018 05:51:44: #2.2 Generate R script for model : /home/stud02/H1.macs2.chr19/H3K4me3.H1.chr19_model.r 
INFO  @ Tue, 19 Jun 2018 05:51:44: #3 Call peaks... 
INFO  @ Tue, 19 Jun 2018 05:51:44: #3 Pre-compute pvalue-qvalue table... 
INFO  @ Tue, 19 Jun 2018 05:51:46: #3 In the peak calling step, the following will be performed simultaneously: 
INFO  @ Tue, 19 Jun 2018 05:51:46: #3   Write bedGraph files for treatment pileup (after scaling if necessary)... H3K4me3.H1.chr19_treat_pileup.bdg 
INFO  @ Tue, 19 Jun 2018 05:51:46: #3   Write bedGraph files for control lambda (after scaling if necessary)... H3K4me3.H1.chr19_control_lambda.bdg 
INFO  @ Tue, 19 Jun 2018 05:51:46: #3   Pileup will be based on sequencing depth in treatment. 
INFO  @ Tue, 19 Jun 2018 05:51:46: #3 Call peaks for each chromosome... 
INFO  @ Tue, 19 Jun 2018 05:51:46: #4 Write output xls file... /home/stud02/H1.macs2.chr19/H3K4me3.H1.chr19_peaks.xls 
INFO  @ Tue, 19 Jun 2018 05:51:46: #4 Write peak in narrowPeak format file... /home/stud02/H1.macs2.chr19/H3K4me3.H1.chr19_peaks.narrowPeak 
INFO  @ Tue, 19 Jun 2018 05:51:46: #4 Write summits bed file... /home/stud02/H1.macs2.chr19/H3K4me3.H1.chr19_summits.bed 
INFO  @ Tue, 19 Jun 2018 05:51:46: Done! 
[stud02@workshop103 ~]$ 

Check the output folder

cd H1.macs2.chr19
ls -l

We see following files:

[stud02@workshop103 H1.macs2.chr19]$ ll
total 82468
-rw-r--r-- 1 stud02 stud 157201685 Jun 19 02:12 H3K4me3.H1.chr19_control_lambda.bdg
-rw-r--r-- 1 stud02 stud     77804 Jun 19 02:12 H3K4me3.H1.chr19_model.r
-rw-r--r-- 1 stud02 stud    150042 Jun 19 02:12 H3K4me3.H1.chr19_peaks.narrowPeak
-rw-r--r-- 1 stud02 stud    167335 Jun 19 02:12 H3K4me3.H1.chr19_peaks.xls
-rw-r--r-- 1 stud02 stud    100876 Jun 19 02:12 H3K4me3.H1.chr19_summits.bed
-rw-r--r-- 1 stud02 stud  55449194 Jun 19 02:12 H3K4me3.H1.chr19_treat_pileup.bdg

Check how many peaks we called

wc -l H3K4me3.H1.chr19_peaks.narrowPeak

1698 H3K4me3.H1.chr19_peaks.narrowPeak

This was only for chr19 (as an example)

Running same command for the whole genome took ~1h. Results are located in

ls -l /home/partage/epigenomics/chip-seq/macs2

You see

total 2614211
-rw-r--r-- 1 stud02 stud 5374836445 Jun 19 01:20 H3K4me3.H1_control_lambda.bdg
-rw-r--r-- 1 stud02 stud      78292 Jun 19 00:55 H3K4me3.H1_model.r
-rw-r--r-- 1 stud02 stud    3550063 Jun 19 01:20 H3K4me3.H1_peaks.narrowPeak
-rw-r--r-- 1 stud02 stud    3988586 Jun 19 01:20 H3K4me3.H1_peaks.xls
-rw-r--r-- 1 stud02 stud    2309723 Jun 19 01:20 H3K4me3.H1_summits.bed
-rw-r--r-- 1 stud02 stud 1038752798 Jun 19 01:20 H3K4me3.H1_treat_pileup.bdg

Total number of H3K4me3 peaks genome wide is

[stud02@workshop103 macs2]$ wc -l H3K4me3.H1_peaks.narrowPeak 
44679 H3K4me3.H1_peaks.narrowPeak

Using ‘awk’ we can count number of bases in enriched peaks genome wide:

[stud02@workshop103 macs2]$ less H3K4me3.H1_peaks.narrowPeak | awk '{s=s+$3-$2+1} END{print s}'
41872307

This is approximatly ~1.6% of mappable genome 41872307/2700000000.