Module 4 Outbreak Analysis

Author: Finlay Maguire

Last Modified: 2025-11-11

Lecture

Lab

In this lab practical we will be using microbial genomes (and associated contextual metadata) to investigate a suspected nosocomial (hospital-acquired) outbreak of Methicillin resistance Staphylococcus aureus (MRSA) in a UK-based Neonatal Intensive Care Unit (NICU). The data we are using comes from a classic study which demonstrated the utility of whole genome sequencing for investigating an MRSA outbreak within a Special Care Baby Unit (also known as a NICU) of the Cambridge University Hospitals NHS Foundation Trust (CUH).

Harris SR, Cartwright EJP, Török ME, et al. Whole-genome sequencing for analysis of an outbreak of methicillin-resistant Staphylococcus aureus: a descriptive study. Lancet Infect Dis 2012 10.1016/S1473-3099(12)70268-2

This practical will step you through the genomic components of a typical nosocomial bacterial outbreak investigation, specifically:

Identifying potential outbreak clusters
Determining SNP distances
Inferring outbreak phylogenies
Performing transmission inference

Background

Staphylococcus aureus is a Gram-positive coccoidal bacteria that forms a normal commensal member of the microbiome in 20-40% of people (Karsten, 2017). However, when skin and mucosal barriers are disrupted (e.g., through medical procedures such as catheterisation, tracheal tubes, surgeries) S. aureus can cause opportunistic invasive infections (Lee, 2018).

Since the 1960s, variants have emerged and spread globally that are resistant to the majority of β-lactam antibiotics through the independent acquisitions of the staphylococcal cassette chromosome mec (SCCmec) (IWG-SSC, 2009. This combination of commensal carriage, antimicrobial resistance, and healthcare-associated infection opportunities (along with additional virulence factors) have led to MRSA becoming a leading cause of hospital-associated mortality and morbidity globally (GBD, 2024). Therefore, tracking and preventing MRSA outbreaks is a major priority for infection prevention and control (IPAC or IPC) within hospital settings especially within the NICU due to the vulnerability of immunocompromised preterm infants and the high frequency of invasive procedures. This has led to adoption of genomic epidemiological approaches to detect MRSA outbreaks and track transmission links not apparent from traditional approaches alone (Blane, 2024). These sort of outbreak investigations are generally led by the infection prevention & control (IPAC or IPC) team in a clinical setting or by the relevant public health team depending on the suspected infection source, geographic scale, and regional resource availability (e.g., Public Health Agency of Canada/Canadian Food Inspection Agency or provincial bodies like Public Health Ontario).

Within the SCBU/NICU of Cambridge University Hopitals NHS Trust (and most hospitals) all inpatients are screened for MRSA carriage upon admission and once per week thereafter using culture-based or nucleic-acid based test. Specific IPAC policies and guidelines will determine criteria for initiating an outbreak investigation (e.g., a certain number of positive screens within a specific time and/or location).

Routine surveillance has identified that 3 infants within the SCBU/NICU (P11-P13) are positive for MRSA carriage at the same time. Isolates from these 3 infants have also been shown to have the same pattern of antibiotic resistances. The IPAC team has therefore been activated to further investigate this as a suspected outbreak in the SCBU/NICU at CUH. They have performed a systematic review of all MRSA isolates from the SCBU/NICU over the preceding 6-12 months and identified a series of overlapping MRSA carriages with this same set of resistances. Due to multi-week gaps in the positive SCBU/NICU isolates they have also collected potentially matching MRSA isolates from parents and the wider hospital/community. To gain better insight into this potential outbreak and identify which of these isolates are linked by direct transmission chains, CUH have sequenced these isolates using 150bp paired-end reads via an Illumina MiSeq platform.

You are a clinical bioinformatician who has been asked to support the investigation and analyses these genomes

Set-Up

This lab practical will involve running the following software:

Data

Using ssh connect to the provided analysis server instance:

ssh -i YOUR_KEY.pem YOUR_USER@XX.uhn-hpc.ca

Now create a folder in your ~/workspace and copy/link over the files you will need:

mkdir -p ~/workspace/module4
cd ~/workspace/module4
cp -r ~/CourseData/module4/contextual_data.csv  ~/CourseData/module4/SASCBU26_reference.fna .
ln -s ~/CourseData/module4/assemblies .

When you are finished with these steps you should be inside your work directory ~/workspace/module4. You can verify this by running the command pwd. s Output after running pwd

~/workspace/module4

You now see an assemblies/ folder in the current directory along with a contextual_metadata.tsv (all the contextual data needed for this analysis) and SASCBU26_reference.fna (the reference genome we will use later):

Output after running lss

SASCBU26_reference.fna  assemblies  contextual_data.csv

Activate environment

Next we will activate the pre-installed conda environment, which will have all the tools needed by this tutorial pre-installed. To do this please run the following:

Commands

conda activate outbreak

You should see the command-prompt (where you type commands) switch to include (outbreak) at the beginning, showing you are inside this environment. You should also be able to run the mlst command like mlst --version and see output:

Output after running mlst --version

mlst 2.23.0

Find your IP address

Similar to yesterday, we will want to either use the assigned hostname (e.g., xx.uhn-hpc.ca where xx is your instance number) or find the IP address of your machine on AWS so we can access some files from your machine on the web browser. To find your IP address you can run:

Commands

curl http://checkip.amazonaws.com

This should print a number like XX.XX.XX.XX. Once you have your address, try going to http://xx.uhn-hpc.ca or http://IP-ADDRESS and clicking the link for module4. This page will be referred to later to easily download/view some of our output files.

Cluster Identification

All isolates that are putatively linked to the outbreak have been done so on the basis of time, location, and phenotype (specifically antibiotic susceptibilities). However, just because 2 isolates have the same resistance pattern does not mean they are closely related. Although this is a relatively small set of isolates we are often evaluating very large numbers of genomes for their potential connection to an outbreak. Additionally, most downstream outbreak analyses (such as transmission inference) perform best when applied to as little diversity as possible (i.e., just genomes from a single outbreak event).

This means our first task is to identify and group which isolates are actually connect

are likely to be linked using the sort of typing methods you encountered in the previous lab practical. This helps us to eliminate unrelated genomes and determine whether 1 or more outbreaks are likely to be taking place.

To make things faster for this practical, we have provided pre-inferred genome assemblies (generated using shovill with the skesa assembler).

You can infer the 7-gene MLST using the pubMLST Staphylococcus aureus scheme by running the mlst tool on each of the genome assemblies.

mkdir -p outbreak_cluster_identification; cd outbreak_cluster_identification
mlst --scheme saureus ../assemblies/*.fa > mlst.tsv

Now download the mlst.tsv output file (using either scp or the IP address listed above) and open it in your tabular data tool of choice (e.g., pandas/python, R, excel).

Based on these results, which genomes do you suspect may not be part of this outbreak?

Why might genomes with a different MLST still be closely related or those with the same MLST be relatively unrelated?

We could solve this challenge using a more fine-grained scheme like cg/wgMLST (as discussed in the previous module) but we are going to use poppunk in this lab.

poppunk is a rapid k-mer based genome clustering tool that groups genomes together based on both their core and accessory genome distances. This allows differential clustering of similar genomes which have acquired novel plasmids and so on.

The developers of poppunk provide some pre-computed databases (although it is relatively easy to create your own) for different species which are available here.

We have already downloaded a reference database for you, so all you have to do is prepare a 2-column file genomes.txt which has a series of names in the first column and the path to the related genome assembly in the other. We can do this using a quick bash one-line loop:

for i in ../assemblies/*.fa; do isolate=$(echo $i | cut -d '/' -f3 | cut -d '.' -f1); echo -e "$isolate\t$i" >> genomes.txt; done

Then we can run the following to get poppunk cluster assignments:

poppunk_assign --db ~/CourseData/module4/staphylococcus_aureus_v1_full --query genomes.txt --output poppunk

Now download/inspect poppunk/poppunk_clusters.csv file

Which genomes does this analysis suggest should be excluded? Do they match the MLST results?

SNP Analysis

Now that we have eliminated some isolates that are unlikely to be connected to our main outbreak we can do a deeper analysis of the evolutionary relationships between our isolates. For this we are going to perform a phylogenetic analysis. There are several way this could be done:

Inferring and aligning the core genome of our isolates by running a tool like panaroo on the genome assembly annotation files (generated with prokka or bakta).
Mapping individual reads against a reference genome with a tool like snippy.
Mapping SNPs between genomes using ska (then using a reference to order these SNPs)

For SNP analyses, why might we want use a reference derived from an outbreak isolate instead of standard species reference genome?

For this lab, we are going to do the last option using a complete high quality reference genome generated from this outbreak subsequent to the original manuscript: SASCBU26

First let’s generate the input file for SKA using the poppunk results

cd ~/workspace/module4; mkdir -p ska_phylogeny; cd ska_phylogeny
paste <(grep "3$" ../outbreak_cluster_identification/poppunk/poppunk_clusters.csv | cut -d, -f1) <(grep "3$" ../outbreak_cluster_identification/poppunk/poppunk_clusters.csv | cut -d, -f1 | perl -ne 'chomp; print "../assemblies/$_.fa\n"' ) > ska_input
echo -e "SASCBU26\t../SASCBU26_reference.fna" >> ska_input

Now we need to generate an SKA index:

ska build -f ska_input -k 31 -o ska_index --threads 4

Then map these split k-mers against the reference (required to order them for the next step):

ska map -o ska.aln --ambig-mask ../SASCBU26_reference.fna ska_index.skf

This gives us an alignment but due the distorting effects of recombination we typically want to try and remove any potential recombinant sites from the alignment. There are several tools that can help us do this (verticall, ClonalFrameML, gubbins) and today we are going to use gubbins.

mkdir -p gubbins
run_gubbins.py --prefix gubbins/gubbins ska.aln
mask_gubbins_aln.py --aln ska.aln --gff gubbins/gubbins.recombination_predictions.gff --out gubbins.masked.aln

Now that we have a masked alignment gubbins.masked.aln we can analyse the SNP network using GraphSNP to further refine our potential outbreak samples. First, generate a SNP distance matrix using snp-dists:

snp-dists gubbins.masked.aln | sed 's/\t/,/g' > raw_outbreak_matrix.csv

Now download the raw_outbreak_matrix.csv, navigate to GraphSNP in your browser, and upload the raw_outbreak_matrix.csv as your Alignment/matrix.

Then click on “Graph” at the top of the page (may be hidden behind 3 horizontal lines on smaller screens), increase your “Cutoff number” to 50, and then hit the “Create Graph” button.

This will create a minimum spanning tree grouped using a 50 SNP cut-off.

Which genomes do you think you can eliminate from being part of these immediate outbreak cluster?

Let’s also look at these isolates using a traditional phylogenetic approach. Go back to the server and run the following command to build a maximum-likelihood tree using iqtree:

iqtree -s gubbins.masked.aln

This will generate a lot of outputs but if you download the newick formatted phylogeny gubbins.masked.aln.treefile and go to microreact we can visualise the tree.

Hit the “Upload” button and select your treefile.

Then hit the settings slider icon at the top right

Finally, select radial/unrooted tree.

Does the phylogeny show the same genomes as being outside the main outbreak cluster

Now we have refined our final set of suspected outbreak isolates we can create our final alignment of just these sequences.

Create a SNP distance matrix and phylogeny for just the final outbreak isolates by repeating the above steps with a modified ska_input (if you create a new folder you’ll avoid getting mixed up with your new datas).

Transmission Inference

With our refined outbreak distance matrix and phylogeny we are going to perform some transmission analyses by combining these with some of our context data.

First we are going to use the SeqTrack method as implemented in GraphSNP.

To do this download contextual_metadata.csv and your refined distance matrix (e.g., refined_outbreak_matrix.csv or whatever you have named it) and then navigate to GraphSNP.
This time you must upload the new distance matrix under Alignment/matrix and the metadata table under Metadata.

Then navigate to “Graph” but this time select analysis type “transmission” and click “Create Graph”

Which person is inferred to have infected the largest number of other patients?

Which baby/babies was/were inferred to be infected by a parent?

Congratulations you’ve performed your first transmission analysis!

The next step would be to try and perform a more sophisticated transmission inference method using TransPhylo. However, this would require us to infer a time-scaled phylogeny and you’ll discover more about how to do that in the next module!