CBW Infectious Disease Epidemiology 2023 Module 6 Lab

Table of contents

  1. Download Files
  2. Introduction
  3. CARD Website and Antibiotic Resistance Ontology
  4. RGI for Genome Analysis
  5. RGI at the Command Line
  6. RGI for Merged Metagenomics Reads
  7. Metagenomic Sequencing Reads and the KMA Algorithm
  8. Pathogen of Origin Prediction

Download Files

If you are doing this demo live, you can download all the files we will be viewing here: https://github.com/bioinformaticsdotca/IDE_2023/tree/main/module6/downloads_for_demo

Introduction

This module gives an introduction to prediction of antimicrobial resistome and phenotype based on comparison of genomic or metagenomic DNA sequencing data to reference sequence information. While there is a large diversity of reference databases and software, this tutorial is focused on the Comprehensive Antibiotic Resistance Database (CARD) for genomic AMR prediction.

There are several databases (see here for a list) which try and organise information about AMR as well as helping with interpretation of resistome results. Many of these are either specialised on a specific type of resistance gene (e.g., beta-lactamases), organism (e.g., Mycobacterium tuberculosis), or are an automated amalgamation of other databases (e.g., MEGARes). There are also many tools for detecting AMR genes each with their own strengths and weaknesses (see this paper for a non-comprehensive list of tools!).

The “Big 3” databases that are comprehensive (involving many organisms, genes, and types of resistance), regularly updated, have their own gene identification tool(s), and are carefully maintained and curated are:

  1. Comprehensive Antibiotic Resistance Database (CARD) with the Resistance Gene Identifier (RGI).
  2. National Center for Biotechnology Information’s National Database of Antibiotic Resistant Organisms (NDARO) with AMRFinderPlus.
  3. ResFinder database with its associated ResFinder tool.

In this practical we are going to focus on CARD and the associated RGI tool because:

CARD Website and Antibiotic Resistance Ontology

The relationship between AMR genotype and AMR phenotype is complicated and no tools for complete prediction of phenotype from genotype exist. Instead, analyses focus on prediction or catalog of the AMR resistome - the collection of AMR genes and mutants in the sequenced sample. While BLAST and other sequence similarity tools can be used to catalog the resistance determinants in a sample via comparison to a reference sequence database, interpretation and phenotypic prediction are often the largest challenge. To start the tutorial, we will use the Comprehensive Antibiotic Resistance Database (CARD) website to examine the diversity of resistance mechanisms, how they influence bioinformatics analysis approaches, and how CARD’s Antibiotic Resistance Ontology (ARO) can provide an organizing principle for interpretation of bioinformatics results.

CARD’s website provides the ability to:

  • Browse the Antibiotic Resistance Ontology (ARO) and associated knowledgebase.
  • Browse the underlying AMR detection models, reference sequences, and SNP matrices.
  • Download the ARO, reference sequence data, and indices in a number of formats for custom analyses.
  • Perform integrated genome analysis using the Resistance Gene Identifier (RGI).

In this part of the tutorial, your instructor will walk you through the following use of the CARD website to familiarize yourself with its resources:

  1. What are the mechanisms of resistance described in the Antibiotic Resistance Ontology?
  2. Examine the NDM-1 beta-lactamase protein, it’s mechanism of action, conferred antibiotic resistance, it’s prevalence, and it’s detection model.
  3. Examine the AAC(6’)-Iaa aminoglycoside acetyltransferase, it’s mechanism of action, conferred antibiotic resistance, it’s prevalence, and it’s detection model.
  4. Examine the fluoroquinolone resistant gyrB for M. tuberculosis, it’s mechanism of action, conferred antibiotic resistance, and it’s detection model.
  5. Examine the MexAB-OprM efflux complex with MexR mutations, it’s mechanism of action, conferred antibiotic resistance, it’s prevalence, and it’s detection model(s).
Answers: 1. + antibiotic target alteration + antibiotic target replacement + antibiotic target protection + antibiotic inactivation + antibiotic efflux + reduced permeability to antibiotic + resistance by absence + modification to cell morphology + resistance by host-dependent nutrient acquisition 2. NDM-1: antibiotic inactivation; beta-lactams (penam, cephamycin, carbapenem, cephalosporin); over 40 pathogens (lots of ESKAPE pathogens) - note strong association with plasmids; protein homolog model 3. AAC(6')-Iaa: antibiotic inactivation; aminogylcosides; _Salmonella enterica_; protein homolog model 4. gyrB: antibiotic target alteration; fluoroquinolones; _Mycobacterium_; protein variant model 5. MexAB-OprM with MexR mutations: antibiotic efflux; broad range of drug classes; looking at MexA sub-unit: _Pseudomonas_; efflux meta-model

RGI for Genome Analysis

As illustrated by the exercise above, the diversity of antimicrobial resistance mechanisms requires a diversity of detection algorithms and a diversity of detection limits. CARD’s Resistance Gene Identifier (RGI) currently integrates four CARD detection models: Protein Homolog Model, Protein Variant Model, rRNA Variant Model, and Protein Overexpression Model. Unlike naïve analyses, CARD detection models use curated cut-offs, currently based on BLAST/DIAMOND bitscore cut-offs. Many other available tools are based on BLASTN or BLASTP without defined cut-offs and avoid resistance by mutation entirely.

In this part of the tutorial, your instructor will walk you through the following use of CARD’s Resistome Gene Identifier with default settings “Perfect and Strict hits only”, “Exclude nudge”, and “High quality/coverage”:

  • Resistome prediction for the multidrug resistant Acinetobacter baumannii MDR-TJ, complete genome (NC_017847).
  • Resistome prediction for the plasmid isolated from Escherichia coli strain MRSN388634 plasmid (KX276657).
  • Explain the difference in fluoroquinolone resistance MIC between two clinical strains of Pseudomonas aeruginosa that appear clonal based on identical MLST (Pseudomonas1.fasta, Pseudomonas2.fasta - these files can be found in this GitHub repo). Hint, look at SNPs.
Answers: The first two examples list the predicted resistome of the analyzed genome and plasmid, while the third example illustrates that `Pseudomonas2.fasta` contains an extra T83I mutation in gyrA conferring resistance to fluoroquinolones, above that provided by background efflux.

RGI at the Command Line

RGI is a command line tool as well, so we’ll do a demo analysis of 112 clinical multi-drug resistant E. coli from Hamilton area hospitals, sequenced on MiSeq and assembled using SPAdes (an older genome assembler). We’ll additionally try RGI’s heat map tool to compare genomes.

Login into your course account’s working directory and make a module6 directory:

cd ~/workspace
mkdir module6
cd module6

Take a peak at the list of E. coli samples:

ls /home/ubuntu/CourseData/IDE_data/module6/ecoli

RGI has already been installed using Conda, list all the available software in Conda, activate RGI, and then review the RGI help screen:

conda env list
conda activate rgi
rgi -h

First we need to acquire the latest AMR reference data from the CARD website:

rgi load -h
wget https://card.mcmaster.ca/latest/data
tar -xvf data ./card.json
less card.json
rgi load --card_json ./card.json --local
ls

We don’t have time to analyze all 112 samples, so let’s analyze 1 as an example (the course GitHub repo contains an EXCEL version of the resulting C0001.txt file). When analyzing FASTA files we use the main sub-command, here with default settings “Perfect and Strict hits only”, “Exclude nudge”, and “High quality/coverage”:

rgi main -h
rgi main -i /home/ubuntu/CourseData/IDE_data/module6/ecoli/C0001_E_coli.contigs.fasta -o C0001 -t contig -a DIAMOND -n 4 --local --clean
ls
less C0001.json
less C0001.txt
column -t -s $'\t' C0001.txt  | less -S
Discussion Points: Default RGI **main** analysis of C0001 lists 17 Perfect annotations and 52 Strict annotations. Yet, 44 annotations are efflux components common in *E. coli* that may or may not lead to clinical levels of AMR. Nonetheless, outside of efflux there are some antibiotic inactivation, target replacement, or target alteration genes known to be high risk (e.g., sul1, TEM-1, CTX-M-15, APH(6)-Id, and gyrA mutations). This is a MDR isolate of *E. coli*.

What if these results did not explain our observed phenotype? We might want to explore the RGI Loose hits (the course GitHub repo contains an EXCEL version of the resulting C0001_IncludeLoose.txt file), shown here with settings “Perfect, Strict, and Loose hits”, “Include nudge”, and “High quality/coverage”:

rgi main -h
rgi main -i /home/ubuntu/CourseData/IDE_data/module6/ecoli/C0001_E_coli.contigs.fasta -o C0001_IncludeLoose -t contig -a DIAMOND -n 4 --local --clean --include_nudge --include_loose
ls
column -t -s $'\t' C0001_IncludeLoose.txt  | less -S
Discussion Points: An additional 3 nudged Strict annotations (*Escherichia coli* PtsI with mutation conferring resistance to fosfomycin, EC-5 beta-lactamase, *Escherichia coli* EF-Tu mutants conferring resistance to pulvomycin) and 390 Loose annotations have been added to investigate for leads that could explain the observed phenotype. Note this scenario is unlikely for clinical isolates given CARD's reference data, but is possible for environmental isolates.

We have pre-compiled results for all 112 samples under “Perfect and Strict hits only”, “Exclude nudge”, and “High quality/coverage”, so let’s try RGI’s heat map tool (pre-compiled images can be downloaded from the course GitHub repo) (please ignore the FutureWarning):

ls /home/ubuntu/CourseData/IDE_data/module6/ecoli_json
rgi heatmap -h
rgi heatmap -i /home/ubuntu/CourseData/IDE_data/module6/ecoli_json -o genefamily_samples --category gene_family --cluster samples
rgi heatmap -i /home/ubuntu/CourseData/IDE_data/module6/ecoli_json -o drugclass_samples --category drug_class --cluster samples
rgi heatmap -i /home/ubuntu/CourseData/IDE_data/module6/ecoli_json -o cluster_both --cluster both
rgi heatmap -i /home/ubuntu/CourseData/IDE_data/module6/ecoli_json -o cluster_both_frequency --frequency --cluster both
ls
Discussion Points: The last analysis is the most informative, showing that many of these isolates share the same complement of efflux variants, yet most isolates are unique in their resistome, with a subset sharing TEM-1, sul1, and other higher risk genes.

RGI for Merged Metagenomic Reads

The standard RGI tool can be used to analyze metagenomics read data, but only for assembled or merged reads with Prodigal calling of partial open reading frames (ORFs). Here we will demonstrate analysis of merged reads. This is a computationally expensive approach, since each merged read set may contain a partial ORF, requiring RGI to perform massive amounts of BLAST/DIAMOND analyses. While computationally intensive (and thus generally not recommended), this does allow analysis of metagenomic sequences in protein space, including key substitutions, overcoming issues of high-stringency read mapping relative to nucleotide reference databases.

Lanza et al. (Microbiome 2018, 15:11) used AMR gene bait capture to sample human gut microbiomes for AMR genes. Using the online RGI under “Perfect, Strict and Loose hits”, “Include nudge”, and “Low quality/coverage” settings, analyze the first 500 merged metagenomic reads from their analysis (file ResCap_first_500.fasta). Take a close look at the predicted “sul2” and “sul4” hits in the results table. How good is the evidence for these AMR genes in this enriched metagenomics sample?

Discussion Points: There are three merged reads with 100% identity to ~25% of the sul2 gene each, while the 9 merged reads annotated as the sul4 gene encode less than 50% identity to the reference sul2 protein, suggesting they are spurious annotations.

Metagenomic Sequencing Reads and the KMA Algorithm

The most common tools for metagenomic data annotation are based on high-stringency read mapping, such as the KMA read aligner due to its documented better performance for redundant databases such as CARD. Available methods almost exclusively focus on acquired resistance genes (e.g., sequences referenced in CARD’s protein homolog models), not those involving resistance via mutation. However, CARD and other AMR reference databases utilize reference sequences from the published literature with clear experimental evidence of elevated minimum inhibitory concentration (MIC). This has implications for molecular surveillance as sequences in agricultural or environmental samples may differ in sequence from characterized & curated reference sequences, which are predominantly from clinical isolates, creating false negative results for metagenomic reads for these environments. As such, CARD’s tools for read mapping can use either canonical CARD (reference sequences from the literature) or predicted AMR resistance alleles and sequence variants from bulk resistome analyses, i.e. CARD Resistomes & Variants data set.

To demonstrate read mapping using RGI bwt, we will analyze a ~160k paired read subset of the raw sequencing reads from Lanza et al.’s (Microbiome 2018, 15:11) use of AMR gene bait capture to sample human gut microbiomes.

First we need to acquire the additional AMR reference data from the previous CARD website download:

rgi card_annotation -i ./card.json > card_annotation.log 2>&1
rgi load --card_json ./card.json --card_annotation card_database_v3.2.6.fasta --local
ls

Let’s take a look at the raw gut metagenomics data to remind ourselves of the FASTQ format:

ls /home/ubuntu/CourseData/IDE_data/module6/gut_sample
less /home/ubuntu/CourseData/IDE_data/module6/gut_sample/gut_R1.fastq

We can now map the metagenomic reads to the sequences in CARD’s protein homolog models using the KMA algorithm:

rgi bwt -1 /home/ubuntu/CourseData/IDE_data/module6/gut_sample/gut_R1.fastq -2 /home/ubuntu/CourseData/IDE_data/module6/gut_sample/gut_R2.fastq -a kma -n 4 -o gut_sample.kma --local
ls

RGI bwt produces a LOT of output files, see the details at the RGI GitHub repo. First, let’s look at the summary statistics:

cat gut_sample.kma.overall_mapping_stats.txt
ls

However, the file we are most interested in for now is gut_sample.kma.gene_mapping_data.txt and the course GitHub repo contains an EXCEL version for easy viewing, but let’s look at it on the command line:

column -t -s $'\t' gut_sample.kma.gene_mapping_data.txt  | less -S
cut -f 1 gut_sample.kma.gene_mapping_data.txt | sort -u | wc -l
ls
  • Ignoring efflux, which AMR gene had the most mapped reads?
  • Ignoring efflux, which AMR gene had the highest % coverage?
  • How many AMR genes were found in total?
  • From these results and what you know about assembly, what do you think are the advantages/disadvantages of read-based methods?
Answers: Top 5 (non-efflux) for number of mapped reads: * tet(Q) with 40345 reads * tet(X) with 7205 reads * ErmF with 6510 reads * CblA-1 with 4160 reads * tet(O) with 1608 reads Top 5 (non-efflux) for % length coverage (all had 100%): * tet(Q) * tet(X) * ErmF * CblA-1 * tet(O) 90 AMR genes had sequencing reads mapped. Read-based analyses advantages and disadvantages: * Higher sensitivity (we find as many AMR genes as possible) * Lower specificity (we are more likely to make mistakes when identifying AMR genes) * Incomplete data (we are likely to find fragments of genes instead of whole genes, this can lead to confusion between similar genes) * No genomic context (we don't know where a gene we detect comes from in the genome, is it associated with a plasmid?)

We can repeat the read mapping analysis, but include more sequence variants in the reference set by including the CARD Resistomes & Variants data set. First we need to acquire the Resistomes & Variant data from the CARD website:

THE FOLLOWING STEPS TAKE TOO LONG, DO NOT PERFORM DURING DEMO SESSION, INSTEAD PLEASE VIEW PRE-COMPILED RESULTS. FEEL FREE TO TRY THESE STEPS OUTSIDE OF CLASS.

wget -O wildcard_data.tar.bz2 https://card.mcmaster.ca/latest/variants
mkdir -p wildcard
tar -xjf wildcard_data.tar.bz2 -C wildcard
gunzip wildcard/*.gz
rgi wildcard_annotation -i wildcard --card_json ./card.json -v 4.0.0 > wildcard_annotation.log 2>&1
rgi load --card_json ./card.json --wildcard_annotation wildcard_database_v4.0.0.fasta --wildcard_index ./wildcard/index-for-model-sequences.txt --card_annotation card_database_v3.2.6.fasta --local

Map reads to canonical CARD (reference sequences from the literature) plus predicted AMR resistance alleles and sequence variants from bulk resistome analyses, i.e. CARD Resistomes & Variants data set:

THE FOLLOWING STEPS TAKE TOO LONG, DO NOT PERFORM DURING DEMO SESSION, INSTEAD PLEASE VIEW PRE-COMPILED RESULTS. FEEL FREE TO TRY THESE STEPS OUTSIDE OF CLASS.

rgi bwt -1 /home/ubuntu/CourseData/IDE_data/module6/gut_sample/gut_R1.fastq -2 /home/ubuntu/CourseData/IDE_data/module6/gut_sample/gut_R2.fastq -a kma -n 4 -o gut_sample_wildcard.kma --local --include_wildcard
ls

The pre-compiled results can be viewed in the EXCEL version of gut_sample_wildcard.kma.gene_mapping_data.txt in the GitLab repo, but let’s first compare statistics, where you’ll see we aligned some additional reads:

YOU CAN EXECUTE THESE COMMANDS AS WE HAVE PROVIDED PRE-COMPUTED RESULTS.

clear
cat /home/ubuntu/CourseData/IDE_data/module6/kmaresults/gut_sample.kma.overall_mapping_stats.txt
cat /home/ubuntu/CourseData/IDE_data/module6/kmaresults/gut_sample_wildcard.kma.overall_mapping_stats.txt
cut -f 1 /home/ubuntu/CourseData/IDE_data/module6/kmaresults/gut_sample_wildcard.kma.gene_mapping_data.txt | sort -u | wc -l
ls

Looking at the pre-compiled EXCEL spreadsheet, note that we have more information based on CARD Resistomes & Variants data set, such as mappings to multiple alleles, flags for association with plasmids, and taxonomic distribution of the mapped alleles.

  • Ignoring efflux, which AMR gene had the most mapped reads?
  • How many AMR genes were found in total?
  • Which genes associated with plasmids have the most mapped reads?
Answers: Top 5 (non-efflux) for number of mapped reads gives the same list but with more data: * tet(Q) with 42684 mapped reads (up from 40345 reads) * tet(X) with 7393 mapped reads (up from 7205 reads) * ErmF with 6987 mapped reads (up from 6510 reads) * CblA-1 with 4160 mapped reads (no change) * tet(O) with 1870 mapped reads (up from 1608 reads) 114 AMR genes had sequencing reads mapped (up from 90). Top 5 (plasmid associated) for number of mapped reads: * tet(X) with 7393 reads * acrD with 1881 Reads * APH(6)-Id with 1418 reads * sul2 with 961 reads * aad(6) with 99 reads

Pathogen of Origin Prediction

If there is time in the tutorial, we will demonstrate how to predict pathogen-of-origin for the AMR gene reads in the gut metagenomics data using k-mers. Please note this algorithm is not yet published and is currently undergoing validation. It is also slow and has a high memory burden as algorithm optimization has yet to be performed.

First, the reference data needs to be formatted for k-mer analysis (see the details at the RGI GitHub repo):

DO NOT ATTEMPT THESE COMMANDS ON THE CLASS SERVERS, THEY REQUIRE MORE MEMORY

rgi clean --local		
wget https://card.mcmaster.ca/latest/data
tar -xvf data ./card.json
rgi load --card_json ./card.json --local
rgi card_annotation -i ./card.json > card_annotation.log 2>&1		
rgi load -i ./card.json --card_annotation card_database_v3.2.6.fasta --local
wget -O wildcard_data.tar.bz2 https://card.mcmaster.ca/latest/variants
mkdir -p wildcard
tar -xjf wildcard_data.tar.bz2 -C wildcard
gunzip wildcard/*.gz
rgi load --card_json ./card.json --kmer_database ./wildcard/61_kmer_db.json --amr_kmers ./wildcard/all_amr_61mers.txt --kmer_size 61 --local --debug > kmer_load.61.log 2>&1

Now we can predict pathogen-of-origin for our metagenomics analysis that included canonical CARD (reference sequences from the literature) plus predicted AMR resistance alleles and sequence variants from bulk resistome analyses, i.e. CARD Resistomes & Variants data set:

DO NOT ATTEMPT THESE COMMANDS ON THE CLASS SERVERS, THEY REQUIRE MORE MEMORY

rgi kmer_query --bwt --kmer_size 61 --threads 4 --minimum 10 --input ./gut_sample_wildcard.kma.sorted.length_100.bam --output gut_sample_wildcard.pathogen --local

The pre-compiled results can be viewed in the EXCEL version of gut_sample_wildcard.pathogen_61mer_analysis.gene.txt in the GitLab repo, but let’s look at some extracted results for the genes outlined above:

ARO term Mapped reads with kmer DB hits CARD*kmer Prediction
tet(X) 6951 Escherichia coli (chromosome or plasmid): 1; Elizabethkingia anophelis (chromosome or plasmid): 1;
acrD 1860 Escherichia coli (chromosome): 102; Escherichia coli (chromosome or plasmid): 664;
APH(6)-Id 1388 Escherichia coli (chromosome or plasmid): 12; Salmonella enterica (chromosome or plasmid): 2; Vibrio parahaemolyticus (chromosome or plasmid): 1; Enterobacter hormaechei (chromosome or plasmid): 1; Acinetobacter baumannii (chromosome or plasmid): 1; Escherichia coli (plasmid): 3;
sul2 898 Escherichia coli (chromosome or plasmid): 3; Bacillus anthracis (chromosome or plasmid): 2; Klebsiella pneumoniae (chromosome or plasmid): 1; Pseudomonas aeruginosa (chromosome or plasmid): 1; Salmonella enterica (chromosome or plasmid): 1;
EC-8 517 Escherichia coli (chromosome): 127; Shigella boydii (chromosome): 1; Escherichia coli (chromosome or plasmid): 26;
APH(3’’)-Ib 387 Escherichia coli (chromosome or plasmid): 3; Enterobacter hormaechei (chromosome or plasmid): 1;
aad(6) 97 none
CblA-1 0 none

Note that those AMR genes associated with plasmids according to the CARD Resistomes & Variants data set cannot easily be assigned to a specific pathogen, while those like acrD and EC-8 that are predominantly known from chromosomes have a reliable pathogen-of-origin prediction.