Microbiome Analysis 2021

This lab is based on the 16S rRNA gene analysis method originally developed by Michael Hall and Robert Beiko, and was updated and modified by Diana Haider for the 2021 Microbiome workshop.

Introduction

The lab in this module is a walkthrough of an end-to-end pipeline using the command line interface for the analysis of high-throughput marker gene data including diversity, taxonomic and statistical explorations. The pipeline described is embedded in the latest version of QIIME2 (version 2021.4, and stands for Quantitative Insights into Microbial Ecology) which is a popular microbiome bioinformatics platform for microbial ecology. It uses software packages called plugins that can be developed by anyone and generates zipped artifacts that can be tracked through a provenance file automatically generated by QIIME2. Two scripts, accessible here, make the whole thing work: “download_Willis_et_al_data.sh” retrieves the sequences from the European Bioinformatics Institute through their relatively awesome API, and “qiime_commands.sh” to execute a chain of QIIME2 commands and generate a number of .qza and .qzv files. These files can be unzipped to reveal the sweet contents inside in case you want to (hypothetically, of course) feed them to a non-QIIME2 application. For this tutorial, we will go through each step individually after the import section, starting with the CBW_Willis_reads.qza file.

Commonly used marker genes for microbiome analysis include the 16S ribosomal RNA (rRNA) for prokaryotes, 18S rRNA for eukaryotes, and the internal transcribed spacer (ITS) for fungi. In this tutorial, we will explore a 16S rRNA dataset from environmental samples collected along three stations from the Halifax Line at four different depths, amplified using two different primers. One interesting question is whether the primer choice under- or overrepresents certain taxa groups, and then impacts diversity measurements. Details on the data collection and processing can be found in the original paper that published the data.

After this lab you will be able to:

  • Conduct an end-to-end microbiome analysis using QIIME2
  • Know of many tools available to conduct statistical and diversity analyses

Getting started

QIIME2 is recommended to be run in an independent conda environment to ensure consistency between different versions of the packages it uses. Here are the instructions for installation. For this instance, a qiime2 environment was already installed, it just needs to be activated.

source activate qiime2-2021.4

You can visualize all available plugins by typing qiime. If it prints out all commands, you’re good to go.

image 1 To start your analysis, two files are required:

  • Sequencing data
  • Metadata

The sequencing data will be in multiple FASTQ files. Each FASTQ file contains the raw DNA sequences, and information on the sequences, such as an identifier, the DNA sequence and the quality score of each called bases in text format. These files are usually very large, and are an intermediate for further analyses. The metadata file contains information about the samples allowing us to interpret the data in their biological context.

The structure of your folders should look like this

<ROOT>
|-- README.md                # Read me file
|-- download_Willis_et_al_data.sh # Bash script to retrieve reads from EBI
|-- qiime_commands.sh        # Bash script to run all qiime commands at once
|-- sequence data/           # Folder with collected data
    `-- raw reads/           # Folder with the raw reads in FASTQ format
        `-- manifest.txt     # Manifest file
        `-- run*_            # Fastqs
    `-- METADATA.txt         # The metadata file
|-- qiime_artifacts/         # Folder containing the results
    `-- CBW_Willis_reads.qza # Reads artifact imported in QIIME2 that you need to download
    `-- DADA2_results/       # Results from denoising the sequencing data
        `-- table.qza
        `-- stats.qza
        `-- representative_sequences.qza
    `-- METADATA.txt         # The metadata file
|-- qiime_solutions/         # Intermediate files of expected outputs

In the interest of time, we will skip the import step, and the denoising step. The files required will be linked in the tutorial, and available in the qiime_artifacts folder.

16S data processing

Import raw reads to QIIME2

You can do this step yourself, but for this live lab, we will start with the already imported CBW_Willis_reads.qza file. If you want to fetch the dataset from EBI yourself, you can run bash download_Willis_et_al_data.sh bash script, but it is time consuming (~30 mins). The script downloads the FASTQ files from the EBI database, and creates a manifest file which is a text file with three columns. It guides the import in QIIME2 by providing the sample name, the file path and the direction of the read if the data has paired reads (either reverse or forward).

Once the manifest file is ready, you can just use the qiime import function:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path import_to_qiime \
  --output-path CBW_Willis_reads

Output

Since we won’t run the import step ourselves to save time, we can just download the CBW_Willis_reads.qza file using wget in the sequence_data folder:

cd workspace/
ln -s ~/CourseData/MIC_data/Module2/qiime_artifacts/
cd qiime_artifacts/
wget http://quoc.ca/static/CBW_Willis_reads.qza

Quality control

Let’s start here. The sequences first undergo quality control, where we decide on a quality trimming threshold based on the quality profiles of the reads. During sequencing, each base called is given a base calling accuracy metric measured with Phred quality scores. The goal is to retain the most bases so the read is long enough to be informative, while removing bases that have low quality scores since they can increase the risk of false-positives. QIIME2 has a visualizer that helps you decide on a threshold.

This table adapted from Wikipedia can help us interpret quality scores. Generally, a score >20 is acceptable, and >30 is really good.

Phred Quality Score Error Accuracy (1 - Error)
10 1/10 = 10% 90%
20 1/100 = 1% 99%
30 1/1000 = 0.1% 99.9%
40 1/10000 = 0.01% 99.99%
50 1/100000 = 0.001% 99.999%
60 1/1000000 = 0.0001% 99.9999%

To run this, make sure you are in the qiime_artifacts/ directory, that’s where the reads are.

qiime demux summarize \
  --p-n 10000 \
  --i-data CBW_Willis_reads.qza \
  --o-visualization qual_viz.qzv

Output

QIIME has an interface that can view .qza or .qzv files. One motivation to view a qza files is to inspect it’s provenance (telling you from which file it comes from, with which parameters it was generated). Qzv are visualizations that we will generate throughout the tutorial. To see them, you have to download the file to your local machine.

  1. Type your public ip in a browser
  2. Download the .qzv we want to see into a repository in your computer
  3. Open QIIME’s Viewer
  4. Upload the .qzv to QIIME’s Viewer

Another option when the files are in your local machine is to use qiime’s view command which will open a browser. Careful, to run this, you need to be in the correct environment (QIIME2’s) and correct repository with your files of interest.

qiime tools view \
  qual_viz.qzv

This function randomly samples 10 000 of your reads without replacement, and presents the Q scores in a boxplot. Each bar represents the distribution of quality scores from the subsampled reads for one base position. A drop in quality is expected towards the 3’ end of reads due to phasing (Schirmer et al. 2016). Since we have paired-end reads, we need to be cautious about trimming too much since we need to have enough overlapping regions between forward and reverse reads for the joining step. Reads that are shorther than the selected trim length will be discarded, and reads that fail to join are also discarded.

Question : How many samples are there?

Question : Inspect the quality plots, what do you think would be a good trim length for the forward and reverse reads?

Clustering 16S sequences

From the quality score boxplot, we decided to trim our sequences at 240 for both forward and reverse reads. I also ran the table with a different trim length, and will leave the file in the qiime_solutions/f290r240/ if you want to compare it on your own time. While the original authors clustered their sequences into OTUs, we will make ASVs using DADA2. DADA2 clusters sequences based on quality score, and discards reads that have a high probability to be an error.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs CBW_Willis_reads.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 240 \
  --p-n-threads 4 \
  --o-representative-sequences representative_sequences.qza \
  --o-table unfiltered_table.qza \
  --o-denoising-stats denoise_stats.qza \
  --verbose #verbose can be added to any command line to print the details of what the computer is doing

Output

image 1

Denoising can take ~30mins, so we will use the outputs above that we cooked in advance! The outputs are the foundation for the rest of this analysis.

From the feature table

The feature table is a BIOM table of samples x ASVs. ASVs can be referred as the features, and each one has a strictly positive frequency per sample. Remember, this data is compositional, meaning they have an irrelevant constant sum imposed by the sequencer, independent from the initial microbial load. We can inspect the summary statistics:

qiime feature-table summarize --i-table unfiltered_table.qza --o-visualization table_summary

Output

(Remember to view the exact qzv files we create on our instance, we need to download them to our local machine then upload them to QIIME2’s Viewer, but you can use the visualize link here as a backup)

Question : Did we lose any samples during denoising? In which scenario would you lose a sample during denoising?

Question : What observation can you make on the ratio of abundant vs rare ASVs?

From the representative sequences

Taxonomy classification

It’s hard to make ASVs interesting, so let’s add their taxonomic identification. Naive Bayes Classifiers provide highly accurate taxonomic classification along with low run times for microbiome studies. It is recommended to use a classifier trained on your own data, but it can take a lot of time, and there are pre-trained classifiers available on QIIME2’s resources page, such as the greengenes and SILVA classifiers. Greengenes is a smaller database for Archaea and Bacteria, and was last updated in 2013. SILVA is a is regurlarly updated and includes eukaryotes. If you have a lot of memory, it’s better to use SILVA, but for this tutorial we will use a classifier trained on greengenes.

First, we need to download the classifier from QIIME2’s database.

wget https://data.qiime2.org/2021.4/common/gg-13-8-99-nb-classifier.qza

Then, we can classify our data.

qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-nb-classifier.qza \
  --i-reads representative_sequences.qza \
  --o-classification taxonomy.qza

Output

This step is another one that takes a while to run, but it shouldn’t be more than 5-10 minutes for the size of our dataset! To view the taxonomy, QIIME generates nice barplots:

qiime taxa barplot \
  --i-table unfiltered_table.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file METADATA.txt \
  --o-visualization taxa-bar-plots

Output

Alignment and phylogeny

Still with the DNA sequences, we can align our sequences using MAFFT to inspect evolutionary relationships between the microbes in the sampled community.

qiime alignment mafft \
  --i-sequences representative_sequences.qza \
  --o-alignment aligned_representative_sequences

Artifact outputs

  • aligned_representative_sequences.qza : download

You can also run a masked alignment if you want to mask certain patterns (eg. repetitions or conserved regions) in the sequences before the alignment:

qiime alignment mask \
  --i-alignment aligned_representative_sequences.qza \
  --o-masked-alignment masked_aligned_representative_sequences

Artifact outputs

  • masked_aligned_representative_sequences.qza : download

We can visualize the alignment using a phylogenetic tree built with FastTree. The tree can be unrooted, or rooted. QIIME doesn’t have a function to visualize the trees, but we can export them and visualize them using other popular programs such as iTOL web interface or ggtree in R.

qiime phylogeny fasttree \
  --i-alignment masked_aligned_representative_sequences.qza \
  --o-tree unrooted_tree

Artifact outputs

We can also make a midpoint-rooted tree which finds the midpoint of the longest path between a pair of leaves (two ASVs) and puts the root right there.

qiime phylogeny midpoint-root \
  --i-tree unrooted_tree.qza \
  --o-rooted-tree rooted_tree

Artifact outputs

Here’s the rooted tree visualized on iTol:

image 1

Diversity analyses (Barplots and heatmaps)

Now that we have phylogenetic information, we can use it to calculate diversity metrics. Before calculating them, we need to correct for unequal sampling effort by rarefying our data. Rarefaction randomly subsamples all samples at the same read depth without replacement, resulting in all samples having the same read depth. We need to be careful at this step, because samples with a read depth smaller than the chosen subsampling depth will be discarded. To choose a the value, we will refer to our summarized table under the Interactive Sample Detail tab. The goal is to keep the most reads, and the most samples.

QIIME2 has a diversity plugin which generates alpha and beta diversity metrics on the data rarefied at a given depth.

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted_tree.qza \
  --i-table unfiltered_table.qza \
  --p-sampling-depth 2000 \
  --output-dir diversity_2000 \
  --m-metadata-file METADATA.txt

Visualization outputs

Artifact outputs

This is A LOT of information. We can start by looking at the beta-diversity plots. Remember, beta-diversity calculates the distance between two communities to tell us how similar (jaccard index) or dissimilar (bray-curtis index) they are. By including taxonomic information, we can calculate the UniFrac distance which uses the branch length between a set of taxa to determine if two communities are significantly different. The unweighted UniFrac distance uses presence/absence information of the taxa, whereas the weighted unifrac takes into consideration the relative abundances of each taxa to measure the differences between the microbial communities (eg. different environments). Each of these metrics are calculated between pairs of samples, therefore, they output a distance matrix. The distance matrix is used to make a PCoA plot that we can now visualize!

Question : Do you see different groupings with different beta diversity metrics (ie. Bray-Curtis, Jaccard and UniFrac)?

We can now investigate alpha diversity! Alpha diversity gives us information on the observed richness (Shannon), evenness (Pielou) or phylogenetic diversity (faith pd). We can compare the alpha diversity between groups by using the alpha-group-significance plugin:

qiime diversity alpha-group-significance \
  --i-alpha-diversity diversity_2000/faith_pd_vector.qza \
  --m-metadata-file METADATA.txt \
  --o-visualization diversity_2000/alpha_PD_significance

Visualization outputs

qiime diversity alpha-group-significance \
  --i-alpha-diversity diversity_2000/shannon_vector.qza \
  --m-metadata-file METADATA.txt \
  --o-visualization diversity_2000/alpha_shannon_significance

Visualization outputs

Question : Are the beta and alpha diversity giving you the same patterns? Why or why not?

Question : Are the groups significantly different respective to any metadata parameter?

We can also inspect the alpha-rarefaction curve which helps us determine whether we sampled sufficiently the environment to represent the true diversity. We expect to see an initial rapid growth as common species are found, then it starts to plateau as rarer species are found.

qiime diversity alpha-rarefaction \
  --i-table unfiltered_table.qza \
  --i-phylogeny rooted_tree.qza \
  --p-max-depth 2000 \
  --m-metadata-file METADATA.txt \
  --o-visualization diversity_2000/alpha_rarefaction.qzv

Visualization outputs

Back to the feature table

Differential abundance

Compositional data can be corrected by rarefaction to calculate alpha and beta diversity metrics like the Shannon diversity index, or the Bray-Curtis dissimilarity index we calculated just above. Another way is to take a compositional approach to the statistical analysis by using Compositional Data Analysis (CoDA) methods which are methods based on ratios instead of proportions. These methods are based on data transformations (CLR, ILR, ALR…) presented in module 5. QIIME2 has two CoDA methods that identify which microbes have significantly different abundances between two communities: ANCOM and ALDEx2. We could try to use ALDEx2, but this plugin has to be updated by the original authors to every new QIIME2 version. Unfortunately, the latest version of the ALEx2 plugin is not compatible with our current version of QIIME2, but you can try it on your own time with QIIME2 version 2019.7.

We will filter our table to remove very rare ASVs (with an abundance <100) because it can really slow down things with ANCOM!

qiime feature-table filter-features \
  --i-table unfiltered_table.qza \
  --p-min-frequency 50 \
  --o-filtered-table abund-filtered-table.qza

Artifact outputs

If you remember from the lecture, log-transformations don’t tolerate 0s so we need to add a small pseudo-counts to our tables.

qiime composition add-pseudocount \
  --i-table abund-filtered-table.qza \
  --o-composition-table comp-table

Artifact outputs

qiime composition ancom \
  --i-table comp-table.qza \
  --m-metadata-file METADATA.txt \
  --m-metadata-column '16S Variable Region' \
  --o-visualization ancom-size.qzv

Visualization outputs

Question : Which taxa are significantly different across different size fraction samples? Can you find their taxonomic ID?

Let’s return the same previous command, but change the metadata-column to Depth this time.

Question : Which taxa are significantly different across depth samples?

Question : If you had to repeat this analysis, which tools would you use?

References

  • Bokulich, N.A., Robeson, M., Dillon, M.R. bokulich-lab/RESCRIPt. Zenodo. http://doi.org/10.5281/zenodo.3891931
  • Bokulich, N.A., Kaehler, B.D., Rideout, J.R. et al. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome 6, 90 (2018). https://doi.org/10.1186/s40168-018-0470-z
  • Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9
  • Benjamin J Callahan, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. Dada2: high-resolution sample inference from illumina amplicon data. Nature methods, 13(7):581, 2016. doi:10.1038/nmeth.3869
  • Kazutaka Katoh and Daron M Standley. Mafft multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4):772–780, 2013. doi:10.1093/molbev/mst010