1. Generating data
These form a rough set of instructions on how to re-generate the lab data.
Data derived from:
Wu, F., Zhao, S., Yu, B. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020). https://doi.org/10.1038/s41586-020-2008-3
fasterq-dump --outdir reads --split-files SRR10971381
gzip reads/SRR10971381_1.fastq
gzip reads/SRR10971381_2.fastq
1.1. Downsample
I used the seqtk
command to select a random subset of 10% of the reads in each file (the 0.1
value) so that the lab would run faster. The -s 19542
is the seed for the random number generator and needs to be identical when selecting subsets of paired-end reads so that the proper pairs still match each other in both fastq.gz files.
seqtk sample -s 19542 reads/SRR10971381_1.fastq.gz 0.1 | gzip --stdout > reads-downsampled/SRR10971381_1.fastq.gz
seqtk sample -s 19542 reads/SRR10971381_2.fastq.gz 0.1 | gzip --stdout > reads-downsampled/SRR10971381_2.fastq.gz
1.2. Copy final data
mkdir -p module8_workspace/data
cp reads-downsampled/SRR10971381_1.fastq.gz module8_workspace/data/emerging-pathogen-reads_1.fastq.gz
cp reads-downsampled/SRR10971381_2.fastq.gz module8_workspace/data/emerging-pathogen-reads_2.fastq.gz
2. Generate blast database
The update_blastdb.pl
script comes with the BLAST software and can be used to download BLAST databases from NCBI. I then further refined the BLAST database by removing the SARS-CoV-2 genome (otherwise, when working through the lab this would be a perfect match to the data, which wouldn’t be the case for an emerging new pathogen that hasn’t been seen before).
# Download original ref_viruses_rep_genomes
pushd data/blast_db
update_blastdb.pl ref_viruses_rep_genomes --decompress
popd
# Convert database back to fasta
pushd data/
blastdbcmd -db blast_db/ref_viruses_rep_genomes -entry all > ref_viruses.fasta
# (Manually) remove NC_045512.2 (Wuhan-Hu-1 SARS-CoV-2 genome) from fasta file
# Add in MG772933.1 (Bat SARS-like Coronavirus which was the top-match to the unknown sequence in https://www.nature.com/articles/s41586-020-2008-3)
cat ref_viruses.fasta ~/MG772933.1.fasta > blast_db/ref_viruses_rep_genomes_modified.fasta
# Make database
makeblastdb -in blast_db/ref_viruses_rep_genomes_modified.fasta -dbtype nucl -title ref_viruses_rep_genomes_modified -parse_seqids -out blast_db/ref_viruses_rep_genomes_modified -blastdb_version 4
gzip blast_db/ref_viruses_rep_genomes_modified.fasta
popd
3. Generate kraken2 database
I downloaded the existing Kraken2 libraries for collections of organisms and then removed the SARS-CoV-2 genome so that it wouldn’t be a perfect match to the lab data prior to building the Kraken2 database.
export KRAKEN2_USE_FTP=1
kraken2-build --download-taxonomy --db kraken2_db
kraken2-build --download-library bacteria --db kraken2_db
kraken2-build --download-library viral --db kraken2_db
kraken2-build --download-library human --db kraken2_db
# Manually removed "Severe acute respiratory syndrome 2" entry from viral database
kraken2-build --build --db kraken2_db/ --threads 48 --fast-build --max-db-size 8589934592