Module 3 Mobile Genetic Elements

Lecture

Lab

November 21, 2025

Overview

This tutorial has three parts. In the first two parts we explore two tools for mobile genetic element prediction:

MOB-suite (command line) — for plasmid prediction (Part 1)
IslandCompare (online) — for genomic island prediction (Part 2)

Additionally, RGI predictions were generated using the ARETE pipeline. With these results, we will perform interactive visualization with Microreact (Part 3).

By the end of this tutorial, you will be able to do the following:

- Run MOB-suite to generate plasmid predictions from input sequence files.
- Generate, visualize, and interpret genomic island predictions in IslandCompare.
- Upload data to Microreact and examine the phylogenetic distribution of different types of predicted genomic features.

Estimated time: ~20 minutes per part.

“Do:” indicates actions you should perform.

Dataset

Using NCBI Datasets, 45 Salmonella genomes were retrieved with the following requirements:

Annotated as “complete”
Having at least two annotated contigs (to increase chance of plasmids)

Two key files in module3-MGE/Part1-MOB_suite:

Module3-SalmonellaGenomes.tar.gz — all 45 genomes (.fna)
Module3-MOBsuite-example-GCF_003325255.tar.gz — example MOB-suite results

NCBI Datasets: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

See the Appendix for details on how the datasets were retrieved.

Part 1: MOB-suite

MOB-suite GitHub: https://github.com/phac-nml/mob-suite

We will use MOB-suite version 3.19 in this tutorial.

Running all 45 genomes is impractical during a short tutorial, so we will examine one genome.

Questions to answer:

How many predicted plasmids are there in the selected genome?
How large are they?
What is their predicted host range?

Activate the environment

To run MOB-suite, you first need to activate the Mamba environment on the server. You do this with the following command:

Do:

mamba activate mob_suite3

(capitalization and underscores matter!)

You can check the availability of the MOB-suite tools by typing in mob_ and hitting the Tab key twice; this should bring up the four tools at your disposal.

Extract the genome files

If you’d like to have a go at a genome, you’ll want to extract an .fna file to work with. First, you can list the files in the archive:

Do:

tar tfz Module3-SalmonellaGenomes.tar.gz

Instead of extracting everything from the archive, choose a single genome to work with. You have two options:

Option 1: Choose a genome (prescriptive way):

Choose the genome with the ID ‘GCF_003325255’ as your example.

Do:

tar xvfz Module3-SalmonellaGenomes.tar.gz GCF_003325255.1/GCF_003325255.1_ASM332525v1_genomic.fna

The corresponding directory is ‘GCF_003325255.1’ (the ‘.1’ is the version number of the assembly in RefSeq).

Option 2: Choose a genome (fun, random way):

Choose a genome at random by shuffling the archive listing and taking a single entry from it. You can do this with the following command:

Do:

tar xvfz Module3-SalmonellaGenomes.tar.gz "$(tar tzf Module3-SalmonellaGenomes.tar.gz | shuf -n 1)"

This will give you a single directory with your lucky match. When doing a test run of the tutorial I landed on ‘GCF_001831985.2’ so I will use that in the commands below, but please substitute your directory/genome file.
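If you prefer to script the random pick, here is a minimal Python sketch of the same idea using only the standard library. The function names are made up for illustration, and unlike the shell command it restricts the choice to .fna entries so that the result is always a genome file.

```python
import random
import tarfile


def pick_random_fna(archive_path, seed=None):
    """Return the name of one randomly chosen .fna member of a tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Keep only FASTA files so we never pick a bare directory entry.
        names = sorted(n for n in archive.getnames() if n.endswith(".fna"))
    if not names:
        raise ValueError(f"no .fna files found in {archive_path}")
    # A fixed seed makes the "random" choice repeatable; None gives a fresh pick.
    return random.Random(seed).choice(names)


def extract_member(archive_path, member_name, dest="."):
    """Extract a single named member, leaving the rest of the archive untouched."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extract(member_name, path=dest)
```

Calling pick_random_fna("Module3-SalmonellaGenomes.tar.gz") and passing the result to extract_member should leave you with one genome directory, as the tar command does.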

Now Do:

cd GCF_001831985.2
ls -l *.fna

There’s your genome!

Let’s predict some plasmids!

Finally! We’re ready to use the MOB-suite tools. Two programs are relevant to us here: mob_recon and mob_typer. mob_recon assigns contigs from draft assemblies to either the chromosome or one of the plasmid clusters defined in the reference database, whereas mob_typer reports the plasmid types, host range, and other information. However, mob_recon also produces the results generated by mob_typer, so we’ll use mob_recon here. These are complete genomes, which should not have fragmented assemblies, but mob_recon is the tool you will likely want to run if you’re working with draft genomes.

Do:

mob_recon --help

This will list a dizzying array of options, including different sets of source files and various similarity / distance / coverage thresholds for deciding where to assign a contig. We could play with some of these settings to watch plasmids come and go, but we’re just going to use the default parameters.

Do:

mob_recon --infile GCF_001831985.2_ASM183198v2_genomic.fna --outdir results

(substituting your genome name as appropriate)

Note that the run will gack (fail with an error) if the output directory already exists. If this is the case, just do rm -rf results/ and try again.

This sets the process in motion – it will likely take a couple of minutes to complete. As with most other tasks in bioinformatics, homology search (or some other sequence-comparison-y thing) is the rate-limiting step.

[time passes…]

When the run finishes, Do:

cd results/
ls -l

to see the output files. You should see the following:

- One (chromosome) and possibly more (plasmid) FASTA-formatted files
- biomarkers.blast.txt: biomarkers identified in the plasmids
- contig_report.txt: assignment information for each contig
- mge.report.txt: information about signature genes of mobile genetic elements (insertion sequences, etc)
- mobtyper_results.txt: various typing statistics including reference plasmid IDs and accessions

Coming back to our original questions:

- How many plasmids are there? That is simply the number of ‘plasmid_’ FASTA files that are returned by MOB-suite.
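Counting the files by eye is fine for one genome, but a short script helps when you have many result directories. The sketch below assumes the plasmid files are named with the ‘plasmid_’ prefix and a .fasta extension (check your own results directory), and the function names are invented for illustration. It also sums the sequence lengths in each file, which gives a first answer to the size question.

```python
from pathlib import Path


def fasta_length(path):
    """Total number of sequence characters in a FASTA file (header lines excluded)."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def plasmid_sizes(results_dir):
    """Map each plasmid FASTA file name to its total sequence length.

    ASSUMPTION: plasmid files match 'plasmid_*.fasta' inside results_dir.
    """
    return {
        path.name: fasta_length(path)
        for path in sorted(Path(results_dir).glob("plasmid_*.fasta"))
    }
```

The number of predicted plasmids is then len(plasmid_sizes("results")).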

We can answer our next questions by looking at mobtyper_results.txt. This file has a header line, followed by one line per predicted plasmid. There are over 20 fields (which are documented on the MOB-suite website), but we can quickly home in on the ones we want by using the cut command:

Do:

cut -f1,2,3,14,18,19,20,21 mobtyper_results.txt

Here we’re asking for specific columns from a tab-separated file. This command gives us the sample ID, number of contigs associated with the plasmid, the size of the plasmid in nucleotides, predicted mobility, primary cluster ID, secondary cluster ID, predicted host range (taxonomic rank), and predicted host range (taxonomic group). For my genome, there is one plasmid with the following statistics:

- Number of contigs: 1
- Plasmid size: 26,158 nt
- Predicted mobility: non-mobilizable
- Plasmid cluster ID: AD306
- Host range rank: order
- Host range taxonomic group: Enterobacterales
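The same column selection can be done in Python if you want to post-process the values. This is a minimal sketch that mirrors the cut command by position (1-based, as in cut) rather than by column name, so it makes no assumption about the header labels. The function name is hypothetical.

```python
import csv

# Same 1-based column positions as in the cut command.
COLUMNS = (1, 2, 3, 14, 18, 19, 20, 21)


def select_columns(tsv_path, columns=COLUMNS):
    """Return the chosen columns of every non-empty line in a tab-separated file."""
    selected = []
    with open(tsv_path, newline="") as handle:
        for record in csv.reader(handle, delimiter="\t"):
            if not record:
                continue  # skip blank lines
            # Convert 1-based cut-style positions to 0-based list indices.
            selected.append([record[i - 1] for i in columns])
    return selected
```

The first row returned is the header line, so select_columns("mobtyper_results.txt")[1:] holds one list per predicted plasmid.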

Part 2: IslandCompare

IslandCompare is a software package and online tool for prediction and visualization of genomic islands (GIs) developed by Fiona Brinkman, Kristen Gray, and others. The software uses a combination of tools to generate GI predictions; you can investigate subsets of predictions or consider them all, and download results locally.

In this section, we will ask the following questions:

- How dense is the coverage of genomic islands in our Salmonella genomes?
- What is the balance between predicted versus curated genomic islands?
- How well conserved is genomic island order and content across the genomes?
- How conserved is gene content within genomic islands?
- Which GIs show evidence of AMR genes and other signature elements?

To avoid hammering the IslandCompare server and to keep us to time, I have uploaded all 45 genomes along with two subsets and generated complete sets of predictions. You can see these by following the link below.

Note that IslandCompare requires GenBank-formatted flat files (.gbff), which contain gene annotations as well as sequences. Those are not included in the package of files on the student instance but the command I used to retrieve them is given at the end of this document.

Navigate to https://islandcompare.ca/analysis?id=9d05ae20-0caf-11ef-8455-fd7d6262f3b3

This will default to the “Analyze” page. If you click on the “PROJECTS” tab on the left-hand side of the screen you will see three projects, each with their own associated datasets:

- MIG-2025: The full set of 45 genomes.
- MIG-2025-SIMILAR: Eight closely related genomes sampled from the larger set.
- MIG-2025-DIVERSE: Eight more-distantly related genomes.

In principle I could have used a single project and done all three analyses based on the same source files, but it was easier in this case to have the subset files as separate projects.

For each project, the right-hand side of the screen will show you the full list of uploaded genomes. You can generate specific analyses by selecting any set or subset of these genomes, optionally uploading a phylogenetic tree, and clicking “Submit”. We will not do this right now, but you can try it later using the uploaded datasets.

Do:

Navigate to the Job History page. This page shows you the completion status of each job and lets you visualize and analyze the results. The “All” dataset is more than we need, so we will focus on the “Diverse” and “Similar” subsets and see how their conservation differs.

Click on Visualize on the right-hand side of the screen. This will bring you to the visualization page (not surprisingly), where each genome is shown as a horizontal line, with a SNP tree constructed using ParSNP (Treangen et al., 2014) relating the genomes to each other on the left. The default view has genomic islands that are likely to be homologous assigned the same colour (that’s the “compare” part of IslandCompare).

Hovering over a predicted island will also highlight related islands in other genomes, which immediately gives you a sense of the phylogenetic distribution of any island, as well as similarity in length and relative position in the genome.

Do the following:

- Drag a rectangle over some region of the genome alignment to zoom into that part of the genomes. If you zoom in closely enough you can see individual annotated genes.
- Switch from “Colour genes by similarity” to “Colour genes by predictor”. By choosing different predictive tools you can investigate the total number and length of islands predicted by IslandPath, Sigi-HMM, and BLAST, as well as the manually curated islands.
- Click on a genomic island to view a close-up of it and its putative homologs in other genomes on a new page. See if you can find GIs with length variations or inversions relative to others.
- Note that the phylogenetic tree is an unrooted cladogram, which shows relationships but lacks informative branch lengths. Click on “Toggle Branches” and you might be able to make a pretty good guess as to where the root of the tree should lie. The degrees of relatedness might help explain some of the phylogenetic distributions of islands you are seeing as well.

You should see some significant differences in conservation between the two subsets, with greater variation in GI presence and orientation in the Diverse set. Viewing the Diverse set in IslandCompare provides an excellent qualitative view of how lateral gene transfer impacts the diversity and functional capabilities of Salmonella.

Once you have finished exploring the complement of GIs in these genomes you can download the predictions as text files or a publication-quality image.

Part 3: Microreact

Parts 1 and 2 both give you results that can be interpreted as presence/absence values across a given set of genomes: these are often referred to as phylogenetic profiles. These profiles are useful because features (plasmids, genes, etc) with similar presence and absence distributions may have something else in common: function, genomic localization, etc. There are a million caveats on this, but that’s the basic idea. In Part 3 we will investigate and compare the distribution of several sets of features to see if we can find any links.

Microreact (https://microreact.org/) is a platform that supports the visualization of phylogenomic data, particularly in the context of genomic epidemiology. The strength of Microreact is in its ability to integrate phylogenetic, temporal, geographic, and contextual information about pathogen isolates. There are many impressive data visualization examples on the front page; here we will use their visualization environment to examine the distribution of an “interesting” subset of features in the isolates related by our phylogenetic tree. We will use Microreact to explore questions about our Salmonella genomes, such as:

- Can we identify any similarities in distribution that might imply some kind of functional or other connection?
- Can we see such linkages between different types of features?

Inputs

To carry out this analysis, we need two files. The first is a .tsv file that contains a matrix where the rows are genomes, and each column represents a feature predicted by either MOB-suite, IslandCompare, or the Resistance Gene Identifier (that’s this afternoon!). A ‘0’ in the matrix means that this feature is absent from the corresponding genome, whereas a ‘1’ indicates its presence.

Matrix file: Here is how I generated the MicroReactMatrix.tsv file:

- Formatted the three sets of outputs into three tab-separated files (RGI, MOB-suite, IslandCompare) with column1 = genome ID and column2 = feature name. Each row indicates the presence of one feature in one genome.
- Used a Python script to merge these files into the matrix file.
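To make the merge step concrete, here is a minimal sketch of how such a script could work. It is not the script used to build MicroReactMatrix.tsv. The function names are invented, and it assumes the two-column layout described above (genome ID, then feature name) with no header line.

```python
import csv


def read_pairs(path):
    """Read (genome ID, feature name) pairs from a two-column tab-separated file."""
    with open(path, newline="") as handle:
        return [(row[0], row[1])
                for row in csv.reader(handle, delimiter="\t") if len(row) >= 2]


def build_matrix(pairs):
    """Turn presence pairs into a header plus one 0/1 row per genome."""
    genomes = sorted({genome for genome, _ in pairs})
    features = sorted({feature for _, feature in pairs})
    present = set(pairs)
    header = ["genome_id"] + features
    rows = [[genome] + [1 if (genome, feature) in present else 0
                        for feature in features]
            for genome in genomes]
    return header, rows


def write_matrix(header, rows, out_path):
    """Write the matrix as a tab-separated file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
```

To merge three sources, concatenate the lists returned by read_pairs before calling build_matrix. One caveat of this sketch is that a genome with no features in any input file gets no row, so a complete genome list would have to be supplied separately.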

I used our ARETE pipeline to generate the RGI predictions and the phylogenetic tree you will upload into Microreact.

I used an additional filter to include only features that are present in between 3 and 43 out of the 45 genomes. Why? At the lower end, one-offs (or two-offs) aren’t super-interesting for identifying interesting distributions of feature combinations across our tree. Also, there are a LOT of them so cutting them out reduces the size of our matrix and makes the rest easier to work with. Similarly, features that are present in all or nearly all genomes again have kinda uninteresting distributions, and can be misleading if they represent consistent false positives across the entire set. So I removed any feature that was present in 1, 2, 44, or 45 genomes.
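The filter can be sketched in a few lines of Python. This is an illustration and not the original code: it assumes the matrix is held in memory as a header list plus rows whose first entry is the genome ID, and the function name and default thresholds (3 and 43, taken from the text) are parameters you can change.

```python
def filter_features(header, rows, min_count=3, max_count=43):
    """Keep only feature columns present in min_count..max_count genomes (inclusive)."""
    keep = [0]  # always keep the genome ID column
    for col in range(1, len(header)):
        # Each cell is 0 or 1, so the column sum is the number of genomes with the feature.
        count = sum(int(row[col]) for row in rows)
        if min_count <= count <= max_count:
            keep.append(col)
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows
```

With 45 genomes and the default thresholds, this drops exactly the features found in 1, 2, 44, or 45 genomes (and any feature found in none).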

Tree file: You can use any phylogenetic analysis tool you like to generate a reference tree for comparative purposes. I used the phylogenomic analysis pipeline in ARETE, which does the following:

- Uses the PPanGGOLiN software (Gautreau et al., 2020) to identify core genes present in all or nearly all of the input genomes;
- Builds a concatenated reference alignment that includes all these core genes;
- Uses FastTree (Price et al., 2010) to build an unrooted phylogenetic tree from this alignment.

The result is the file core_gene_alignment.tre.

Do:

Retrieve the matrix and tree file from the course GitHub page.

Using Microreact

Do:

Navigate to https://microreact.org/upload

Drag the tree file and matrix file onto the page. Microreact should correctly recognize the format of these files. Click “Continue”.

You will then be prompted to adjust properties of the data table if you wish. You need to confirm that “genome_id” is indeed the correct ID column. You can also change the datatype of different attributes and choose different colour schemes. Click “Continue”.

You should now see two panels: a phylogenetic tree at the top, and the metadata frame (i.e., the matrix) at the bottom. Like the tree IslandCompare generated for us, this tree is unrooted.

- For the sake of aesthetics I recommend you right-click on the canvas and select “Midpoint Root”.
- Things can get crowded in the tree view; you can focus on a more manageable set by right-clicking on an edge and selecting “View subtree”.
- You can reset everything by right-clicking on the canvas and selecting “Redraw Original Tree”.

We can adjust all manner of things by clicking on the slider button, which brings up options to change the tree visualization, show metadata, and muck around with the node/label visuals.

I recommend you start with the “Nodes & Labels” button to change the appearance to your liking.

Now, select “Metadata blocks”. You will have the option to select one or more attributes, or all of them if you like. Start by turning them all on (except for “Genome ID”, which is not very informative!).

- MOB-suite identifiers have two letters followed by three numbers, e.g. “AA474”
- IslandCompare identifiers start with “SPI-”, for “Salmonella Pathogenicity Island”
- RGI uses gene names: sometimes these follow the standard nomenclature (e.g., “emrA”), and sometimes they don’t. Sometimes they REALLY don’t.

By default the colour scheme is green for absent and yellow for present. We can adjust this by clicking the three horizontal lines next to the slider icon and choosing “Edit Tree”. Click on “Metadata” in the resulting window, select any column, click on “Categorical” and you’ll be able to choose a custom palette.

Back to those questions:

- Can we identify any similarities in distribution that might imply some kind of functional or other connection? An obvious one is the mdsABC genes: turns out these are subunits of an antibiotic efflux pump (https://card.mcmaster.ca/ontology/37169) so that makes sense.
- Can we see such linkages between different types of features? If you keep looking you might notice that many other attributes show a similar distribution pattern to the mds genes. This may or may not be meaningful, though! Can you think of a reason why we might see similar distributional patterns for different features that have no functional connection?

Understanding these patterns of correlation can involve a deep and fascinating dive into the literature (or maybe you discovered a new association!)
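Spotting matching columns by eye gets harder as the matrix grows, so here is a small Python sketch that lists groups of features sharing exactly the same presence/absence profile. It assumes a tab-separated matrix with a “genome_id” column followed by 0/1 feature columns, as described for the matrix file; the function name is hypothetical. Identical profiles are only a starting point, since features can share a distribution simply because the genomes carrying them are closely related.

```python
import csv
from collections import defaultdict


def group_identical_profiles(matrix_path, id_column="genome_id"):
    """Return lists of feature names that have identical presence/absence columns."""
    with open(matrix_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        features = [name for name in reader.fieldnames if name != id_column]
        rows = list(reader)
    groups = defaultdict(list)
    for feature in features:
        # The profile is the column of 0/1 values read down the genomes.
        profile = tuple(row[feature] for row in rows)
        groups[profile].append(feature)
    # Only groups with at least two features are interesting.
    return [names for names in groups.values() if len(names) > 1]
```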

References

Bertelli C, Gray KL, Woods N, Lim AC, Tilley KE, Winsor GL, Hoad GR, Roudgar A, Spencer A, Peltier J, Warren D. Enabling genomic island prediction and comparison in multiple genomes to investigate bacterial evolution and outbreaks. Microbial genomics. 2022 May 18;8(5):000818.

Gautreau G, Bazin A, Gachet M, Planel R, Burlot L, Dubois M, Perrin A, Médigue C, Calteau A, Cruveiller S, Matias C. PPanGGOLiN: depicting microbial diversity via a partitioned pangenome graph. PLoS computational biology. 2020 Mar 19;16(3):e1007732.

Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010 Mar 10;5(3):e9490.

Robertson J, Bessonov K, Schonfeld J, Nash JH. Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microbial Genomics. 2020 Oct;6(10):e000435.

Robertson J, Nash JH. MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microbial genomics. 2018 Aug;4(8):e000206.

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome biology. 2014 Nov;15:1-5.

Appendix

NCBI Datasets: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

MOB-suite GitHub repo: https://github.com/phac-nml/mob-suite

MOB-suite database download link: https://zenodo.org/records/10304948/files/data.tar.gz?download=1

IslandCompare: https://islandcompare.ca/

Microreact: https://microreact.org/

Command for downloading Salmonella genomes in .fna format for MOB-suite:

datasets summary genome taxon "Salmonella" --assembly-level complete --limit 100 | jq '.reports[] | select(.assembly_stats.number_of_contigs != 1) | .accession + " Contigs:" + (.assembly_stats.number_of_contigs|tostring)' | sed "s/\"\(.*\) .*/datasets download genome accession \1 --filename \1.zip/" | sh
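The one-liner above packs three steps into a pipe: list complete Salmonella assemblies as JSON, keep those whose contig count is not 1, and build one download command per accession. As a readable sketch of the last two steps, the Python below works on JSON text already saved from the datasets summary call. The field names (reports, accession, assembly_stats.number_of_contigs) are taken from the jq expression above; the function names are invented for illustration.

```python
import json


def multi_contig_accessions(summary_json_text):
    """Return accessions of assemblies whose contig count is not exactly 1."""
    summary = json.loads(summary_json_text)
    return [
        report["accession"]
        for report in summary.get("reports", [])
        if report["assembly_stats"]["number_of_contigs"] != 1
    ]


def download_commands(accessions):
    """Build the same download command line that the sed step generates."""
    return [
        f"datasets download genome accession {acc} --filename {acc}.zip"
        for acc in accessions
    ]
```

The returned strings can be printed and reviewed before being run in the shell, which is a little safer than piping straight into sh.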

Command for downloading Salmonella genomes in .gbff format for IslandCompare:

I followed the instructions on the NCBI documentation page to set up the required conda environment.

In case you’re interested, column 11 contains the RefSeq accessions for the plasmids in the reference cluster. You can look at these by pasting one of the IDs into the search bar at the NCBI home page and following the links. For example, if I paste ‘NC_013284’ I get a 22,448 bp plasmid from Cronobacter turicensis, similar but not identical.

Novel plasmids: Predicted plasmids that do not match sufficiently well to any in the reference database will be assigned a label of ‘novel_’ followed by an md5 hash. These can provide useful avenues of further investigation but will not be consistent between runs.