Module 5
Clustering
Lab
Preprocessing
Before we do any clustering, we need to perform QC, Normalization, and Dimensionality Reduction (HVGs + PCA). We already covered these steps for the Liver dataset, so we’ll simply load that object from when we saved it:
Recall: to make our analysis reproducible we need to set the random seed.
Clustering
Once we’ve done those steps it’s time to cluster our data. The most common approach used for single-cell data is “Louvain” or “Leiden” clustering, which is used by Seurat and scanpy. These methods are just two different algorithms for optimizing the modularity score of clusters in a graph. Thus we first need to turn our single-cell data into a graph before we can use these algorithms.
Cell-cell graph
To turn our data into a graph we will link together cells with weighted edges based on their number of shared nearest neighbours. To do this we need to choose a number of nearest neighbours to consider (k). By default Seurat uses k=20, which seems to work well on human and mouse data sets with ~2,000-10,000 cells.
The NN/SNN graph is stored within the Seurat object in the “graphs” slot.
Here a “1” means that two cells have identical sets of neighbours, while 0.25 means that those two cells share 25% of their neighbours, and a “.” or “0” means those two cells share no neighbours. These are treated as “weights” for the edges in the graph.
Exercise 1
1) Use rowSums() and mean() to find average across cells of the total weight between that cell and all other cells.
- Change k.param to 40, how does this affect the average weight for each cell?
Seurat uses an approximation heuristic to calculate nearest neighbours in a faster & more scalable way but does not have guaranteed accuracy thus it’s advisable to use k >= 10. For small data sets (i.e. < 750 cells) we recommend using other clustering methods such as SC3
Clustering
Now we can run the Louvain/Leiden clustering algorithm on our cell-cell graph. Note that this is a stochastic algorithm meaning each time it is run we can get a slightly different solution. By default Seurat runs 10 iterations of the algorithm and keeps the solution with the highest modularity which improves the stability of the solution. In addition, there is a crucial resolution parameter resolution which influences the size of the cluster the algorithm will find. By default resolution is set at 0.8.
Here you can see that by default when we plot out data using DimPlot Seurat plots a UMAP coloured by the most recent set of clusters you have generated. Here we set label=TRUE so that the name of the cluster is plotted on the UMAP since it can be hard to distinguish colours when we have many clusters.
Note: Seurat always numbers clusters in order of number of cells from most cells to least cells.
We can find these cluster labels for each cell in our metadata table.
We can use table() to see how many cells are in each cluster.
The seurat_clusters column holds the most recent clusters, but all clusters you generate for a particular object will be stored in columns labelled with the assay, graph and resolution you used. Let’s check this by clustering at a different resolution.
Note that the “k” of the snn is not recorded in the name of these columns, so if we want to explore the effect of different “k” we’ll need save the clustering results in a column with our own name.
Exercise 2 Cluster these data with resolution = 0.2, 0.5, 1, 1.5, or 2. How many clusters do you get at different resolutions? Cluster these data using resolution 0.8 and k=10, 20, 50. How many clusters do you get with different ks?
Finding optimal clusters
Once we have a bunch of different clusterings at different resolutions we want to find the “best” one. We will show you three ways to compare your clusterings to make an informed decision on what parameters to use: the silhouette index, adjusted Rand index, and marker genes.
Silhouette Index
For the Silhouette Index we need a matrix of all pair-wise distances between cells. We will use the same PCA dimensions as we used for clustering to calculate these distances because PCA preserves distances between datapoints. Plus, we need a set of clusters. We will get a score for each cell for how close it is to its cluster vs other clusters (high = distinct clusters, low = overlapping clusters), we can aggregate these scores across each cluster or across the entire clustering.
require(cluster) # contains silhouette function
dist.matrix <- dist(x = Embeddings(object = seur_obj[["pca"]])[, 1:20])
sil <- silhouette(as.numeric(as.character(seur_obj@meta.data$seurat_clusters)), dist=dist.matrix)
head(sil)
mean(sil[,3])Overall we have a silhouette score of 0.35 for this set of clusters. The Silhouette score depends on a lot of factors so it is difficult to interpret the magnitude of the score in isolation, but positive silhouette scores are generally “good” and negative scores are “bad”.
Here we can see that most of the clusters have silhouette scores > 0 except for cluster 1. This means that when we go to annotate our clusters we may want to merge cluster 1 with one of its neighbouring clusters, or that we may want to change the clustering resolution.
Exercise 3 Pick 5 clusterings you have calculated, using the silhouette score what is the best clustering out of them?
Adjusted Rand Index
We can also look at how stable our clusters are across different clustering parameters. We usually assume that clusters that stay the same across different resolutions are more likely to represent real cell-types than those that change. We can use the Adjusted Rand Index (ARI) to measure how similar two sets of clusters are. Identical clusters will get ARI = 1, random cluster labels will get ARI = 0, generally ARI > 0.8 is considered “good”.
require(igraph) # package containing ARI function
ARI <- compare(seur_obj@meta.data$SCT_snn_res.0.8, seur_obj@meta.data$SCT_snn_res.0.5, method="adjusted.rand")
ARIExercise 4 Calculate ARI between your sets of 5 clusterings, which are the most similar?
Marker Genes
Lastly, we can look at the number of significant marker genes we see for our clusters, more significant markers means better defined clusters. Note that by default Seurat uses several arguments to increase the speed of calculation, however these violate the assumptions of p-value calculations. This is designed for quickly annotating your cluster not for validating that your clusters are good. Fortunately, we can change these arguments to make it suitable for evaluating our clustering.
Note: We use max.cells.per.ident here to speed up calculations, this is not necessary in the ordinary use of this approach
markers <- FindAllMarkers(seur_obj, group.by="SCT_snn_res.0.8", logfc.threshold = -Inf, only.pos=TRUE, max.cells.per.ident=100)
markers <- markers[ markers[,"p_val_adj"] < 0.05, ]
table(markers$cluster)You can see that cluster #2 is poorly defined vs the other clusters.
Seurat’s marker gene tool does not account for biases due to the nature of clustering which means these DE gene results should be considered as over estimates of the true number of DE genes. And this overestimate will be greater for smaller clusters than larger ones. See: DOI: 10.1016/j.cels.2019.07.012 for accurate marker gene identification after clustering accounting for these biases.
Exercise 5 For your 5 clusterings use marker genes to determine which is the best clustering.
Bonus Exercise Compare your clustering results to those of the people around you, out of all the clusterings which do you think is the best?
Annotating Clusters
Marker-based Annotation
Once we have chosen our best clustering we need to annotate each of our clusters. The most common way to do this is to examine the top marker genes for each cluster and search various databases or the literature to determine which cell-type they most likely represent. We can check the top marker genes for each of our clusters from the results above like so:
Then we can look them up in CellMarker 2.0 to see what cell-type this might be.
CellMarker Instructions: 1. Go to: http://117.50.127.228/CellMarker/ 2. Click “Cell Tools” 3. Select “Cell annotation” 4. Type marker genes into the Input box separated by just a “,” e.g. “IL32,CD3D,CD2” 5. Set Species to “Human” (optional set Tissue to “Liver”) 6. Press “Submit”
You should be able to find that this cluster would be annotated as some kind of “T cell”.
Let’s save the marker genes we used to come up with this annotation and the annotation itself so we can plot them later.
nclusters <- length(unique(seur_obj@meta.data$SCT_snn_res.0.8))
# Create a vector where the names of the vector are the cluster number
# and the values of the vector are the annotated names.
cluster_annotation <- rep("unannotated", nclusters)
names(cluster_annotation) <- levels(seur_obj@meta.data$SCT_snn_res.0.8)
# Fill in the annotation for the first cluster
cluster_annotation["0"] <- "T cell"
# Save the genes we used for annotation
anno_genes <- list("cluster0"=c("CD3D","CD3G", "TRAC", "CCL5", "GZMK", "RUNX3"))Exercise 6 Annotate clusters 2,4,6,9 and 10 using CellMarker.
Once all the clusters have been annotated we can rename our clusters in our seurat object:
seur_obj = SetIdent(seur_obj, value=seur_obj@meta.data$SCT_snn_res.0.8)
seur_obj = RenameIdents(seur_obj, cluster_annotation)
seur_obj@meta.data$manual_anno <- Idents(seur_obj)
head(seur_obj@meta.data)We can visualize our marker gene annotation using a DotPlot:
Exercise 7 How many of your marker genes were shared between multiple clusters?
Reference based annotation
There are many tools to compare single-cell data to a reference but here we’ll cover the most simple approach which is simply looking at correlations between the average gene expression in your clusters vs the average gene expression from a reference dataset.
First let’s load our reference dataset:
Next we calculate the average expression of each gene in each of our clusters.
query = AverageExpression(seur_obj, group.by = "SCT_snn_res.0.8", assays="RNA") # we use RNA b/c our reference was not normalized using SCT.
head(query$RNA)Now we need to select a subset of genes that are biologically relevant (i.e. HVGs) and are present in both datasets, and ensure both the reference and our data are subset to just those genes and the genes are in the same order.
genes = VariableFeatures(seur_obj);
genes = genes[genes %in% rownames(ref)]
query2 = query$RNA[match(genes, rownames(query$RNA)),]
ref2 = ref[match(genes, rownames(ref)),]
identical(rownames(ref2), rownames(query2))Now we can correlate them with each other.
cross_cors = cor(as.matrix(query2), as.matrix(ref2))
heatmap(cross_cors, scale="none", margins = c(12,4))Using these correlations we can quickly annotate most of our clusters to at least a general cell-type.
# Create a vector where the names of the vector are the cluster number
# and the values of the vector are the annotated names.
cluster_annotation2 <- rep("unannotated", nclusters)
names(cluster_annotation2) <- levels(seur_obj@meta.data$SCT_snn_res.0.8)
cluster_annotation2["0"] = "T cell"
cluster_annotation2["1"] = "T cell"
cluster_annotation2["2"] = "T cell"
cluster_annotation2["3"] = "InfMac"
cluster_annotation2["4"] = "NK cell"
cluster_annotation2["5"] = "gdT cell"
cluster_annotation2["6"] = "CD3T cell"
cluster_annotation2["7"] = "B/T cell"
cluster_annotation2["8"] = "B cell"
cluster_annotation2["9"] = "Endothelial"
cluster_annotation2["10"] = "Macrophage"
cluster_annotation2["11"] = "Hepatocyte"
cluster_annotation2["12"] = "AntiB"
cluster_annotation2["13"] = "cvLSEC"
cluster_annotation2["14"] = "Erythroid"
cluster_annotation2["15"] = "Cholangio"
seur_obj = SetIdent(seur_obj, value=seur_obj@meta.data$SCT_snn_res.0.8)
seur_obj = RenameIdents(seur_obj, cluster_annotation2)
seur_obj@meta.data$ref_anno <- Idents(seur_obj)
head(seur_obj@meta.data)Exercise 8 How many genes did we use for our correlations? How do your manual annotations compare to the annotations you got from using the reference data?
Exercise 9 Repeat the cell-type x cluster correlations but this time use the marker genes you collected during manual annotation. Do your annotation change?
Differential Expression
Data
So far we’ve only been using one sample for our analysis. To perform differential expression we will need multiple samples. For this we will use some multiplexed mouse brain data from 10X Genomics (see: https://www.10xgenomics.com/resources/datasets) : Mouse E18 Combined Cortex, Hippocampus and Subventricular Zone Nuclei Multiplexed, 12 CMOs: 3’v3.1 Targeted, Custom Neuroscience Panel
This data contains single nucleus RNAseq data from 3 multiplexed replicates of mouse brain tissue from 4 different mice. We have already QCed, Normalized, Clustered and Annotated these data for you (see: 10XMouseBrain.Rhistory for how we processed the data).
This data will allow us to examine how much variability there is between individual mice, compared to technical variability between our replicates.
Pseudobulks
To calculate pseudobulk expression we will use our package: MultiPath (currently in beta-testing). Let’s load it now:
Now let’s choose a cell-type will will perform DE for. First check which cell-types are present in the data.
We’ll use Astrocytes for this demonstration.
Now we need to choose how to aggregate the data. Typically we would aggregate by individual if we were doing case-control (i.e. “Mouse”), however in this experiment we don’t have case-control rather we have multiple replicates (“feature_call”) and different mice as our biological condition.
table( subset_obj@meta.data$feature_call, subset_obj@meta.data$Mouse)
#CMOs are sample-identifying molecular barcodes used to multiplex multiple biological samples in one sequencing runHere you can see the replicate x biological condition for our experiment. We will use our group_rowmeans function to calculate the pseudobulks.
We need to give the function an expression matrix (MAT), a vector of group labels (group_labs) and tell it whether to calculate the mean or the sum within each group. We can use it to calculate pseudobulks like this:
pseudobulks <- group_rowmeans(subset_obj@assays$RNA@counts, subset_obj@meta.data$feature_call, type="sum")
head(pseudobulks)Let’s remove unexpressed genes to save us some time when we do the differential expression:
Exercise 10 What is the condition we are using above to filter genes? How many genes are filtered out?
Now lets set up our differential expression. First we need to create a vector of the biological condition for each column of our pseudobulks. We can do this manually using by looking at our metadata.
table( subset_obj@meta.data$feature_call, subset_obj@meta.data$Mouse)
colnames(pseudobulks_filt)
bio_condition <- c("Mouse1","Mouse1","Mouse1",
"Mouse2","Mouse2","Mouse2",
"Mouse3","Mouse3","Mouse3",
"Mouse4","Mouse4","Mouse4")We could also do this automatically using apply function but that’s beyond the scope of this course.
Bonus Exercise
Create the bio_condition vector from the colnames() of pseudobulk_filt computationally using the metadata table.
EdgeR
Now that we have our pseudobulks and biological condition vector we only need to run edgeR (or other bulk RNAseq DE tool.)
First define our model:
Here you can see this has defined our four “B”s for us. B0 = Intercept (Mous1), B1 = Mouse2-Mouse1, B2 = Mouse3-Mouse1, B3=Mouse4-Mouse1.
Next we create a “DGEList” for edgeR, and estimate the library size that will be used by the model to correct for the library size.
pseudobulk_dge <- DGEList(counts=pseudobulks_filt, group=bio_condition)
pseudobulk_dge <- calcNormFactors(pseudobulk_dge)Then we fit the model:
And obtain the DE genes for each mouse vs the reference mouse (our “Bs”)
res_2vs1 <- glmQLFTest(fit, coef=2) #You are testing Mouse2 vs Mouse1 (reference)
topTags(res_2vs1, 20)Exercise 11 Which of the four mice are male? which are female? (hint: Xist is only highly expressed in females)
Exercise 12 How many genes are differentially expressed (FDR5%) between our mice?
Bonus Exercise Do Inhibitory Neurons have the same differentially expressed genes between Mouse 1 and Mouse 2 as Astrocytes?
Overview
KEY POINTS - There are two key parameters for Louvain clustering: - k for the sNN - res for the clustering algorithm
We can use a variety of metrics to determine the best clustering parameter, and/or to guide our annotation
Clusters can be annotated by comparing to a reference dataset, or by cross referencing marker genes with the literature or databases.
Pseudobulk approaches allow us to use bulk RNAseq differential expression tools to identify cell-type specific differential expression between biological conditions.