Module 1
Intro to R & Seurat Objects
Lab
scRNA-seq Analysis: datastructures
Load the required packages.
library(knitr)
opts_chunk$set(cache = TRUE, fig.align = "center", echo=TRUE)
require(Seurat)
require(DropletUtils)
require(SingleCellExperiment)
require(Matrix)
Basic R Datastructures
There are four basic data structures used in R: - vector : a 1D ordered set of 1 or more values that are all the same type - matrix : a 2D grid of values that are all the same type - data.frame : a 2D grid of values where each column can store data of a different type. - list : a 1D ordered set of 1 or more values where each value can be a different type.
x = c(1,-5,2,10,7) ###vector of numbers
x[2]
y = c("canadian", "bioinformatics", "workshop", "single", "cell") ###vector of strings
y[1]
We can combine the vectors above into a matrix.
mat1 = cbind(x,x)
mat1
However, if we try to mix data-types R will “coerce” them into the same type. In this case strings, as you can see from the quotation marks that are now around the numbers.
mat2 = rbind(x,y)
mat2
To properly store numbers and text together we want a data.frame:
df = data.frame(num=x, text=y)
df
If we want to store matrices, dataframes, and vectors together we can use a list:
lt = list(mat=mat2, vec=x, df=df)
lt
To access data within these different objects we use square brackets with the “index” i.e. the item number or row+column number for the data we want to access. If our data is named we can also use the name to access it.
###the "vec" in our list can be accessed in all of these ways
lt[[2]]
lt[["vec"]]
lt$vec
###the column "num" in our dataframe can be accessed in all of these ways
df$num
df[,"num"]
df[,colnames(df) == "num"]
R “Objects” Single-cell RNAseq data involves many different data structures for different parts of the data. Metadata is generally a dataframe. Gene expression is stored as a matrix. Lower dimensional projections are paired matrices, and cell-cell graphs require special datastructures.
We could combine these together using a list but a list is so flexible when we give a list to a function it can be hard to find the right data in it. e.g. is the normalized data called “norm”, “lognorm”, “normdata”, “expr_mat” etc… To solve this problem we use “Objects” that act like a list but with pre-defined slots for specific types of data.
You can tell when you are dealing with an object by the “@” symbol in the Global Environment or when using “str()” on the variable. This “@” indicated a pre-defined slot that must always exist in the object with a specific name and may to limited to storing a particular type of data. E.g. a “obj@counts” slot likely must contain either a matrix or a sparse matrix.
Common Components of Single-cell Data
Single cell data & analysis uses several distinct types of information, each of which can be stored in different ways: - genes x cells gene expression (numbers), this can be raw counts or after one or two levels of normalization / scaling - metadata about cells (numbers, text), this often involves various categorical classifications of cells, e.g. processing batch, patient ID, disease status, cell-type, but can also involve various quality control information. - lower dimension representation (numbers), this includes a cells x dimensions matrix for each representation, and may include a genes x dimensions matrix of weights.
Different single-cell analysis packages use certain objects to organize these data, so functions can always find what they need.
For more details on SingleCellObject, refer to: Amezquita, R.A., Lun, A.T.L., Becht, E. et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods 17, 137–145 (2020). https://doi.org/10.1038/s41592-019-0654-x
###Example Data
Single cells isolated from the stria vascularis (SV) from the inner ear of lab mice (Korrapati et al. 2019). Technology: 10X Chromium Database: Gene Expression Omnibus GEO accession GSE136196. Preprocessing: cellranger(7.1.0)
The SingleCellExperiment Object (SCE)
We have data from a 10X experiment as a sparse matrix, where the data is stored as a triplet (row, column, value), so that only non-zero values are stored. Since our matrix 98% zero, this greatly reduces the amount of information we need to store!
Sparse matrices are defined in the Matrix package and it provides behind-the-scenes code that let us work with sparse matrices in the exact same way as regular matrices.
Since one letter require the same amount of space to store as any number, the text portion of the matrix (row and column names) are stored once separately in features.tsv.gz and barcodes.tsv.gz. While the triplets are stored in matrix.mtx as (row number, column number, umi count).
We could use Matrix::readMM() to read in matrix.mtx and read.delim() to read in the text files and then put them together, but many packages provide a function to automatically do this for us already.
To make a SingleCellExperiment object automatically:
sce <- DropletUtils::read10xCounts("/home/ubuntu/CourseData/Korrapati_MouseEar/filtered_feature_bc_matrix")
sce
To make a SingleCellExperiment object manually:
sparsematrix <- Matrix::readMM("~/CourseData/Korrapati_MouseEar/filtered_feature_bc_matrix/matrix.mtx.gz")
my_rownames <- read.delim("~/CourseData/Korrapati_MouseEar/filtered_feature_bc_matrix/features.tsv.gz", header=F)
my_colnames <- read.delim("~/CourseData/Korrapati_MouseEar/filtered_feature_bc_matrix/barcodes.tsv.gz", header=F)
head(my_rownames)
We need to pick which gene IDs we’ll use, we will stick with Ensembl IDs for now.
rownames(sparsematrix) <- my_rownames[,1]
colnames(sparsematrix) <- my_colnames[,1]
class(sparsematrix)
dim(sparsematrix)
sce2 <- SingleCellExperiment(
assays = list(counts = sparsematrix),
rowData = my_rownames
)
sce2
Here we can see we have: - 32285 genes (rows) - 552 cells (columns) - 1 assay called “counts” - no lower dimensional representations - DropletUtils has automatically added information about each cell that is the “Sample” and “Barcode” which we don’t have when we made it manually.
Let’s check what DropletUtils did:
head(colData(sce), 10)
Here we retrieve the column data (cells) from the SCE using colData() then print the top 10 rows using head(). We can see that “Barcode” has stored the cell barcode and “Sample” has stored the path to the folder we read in. This is a bit messy to look at so let’s replace that with a shorter name for this sample.
colData(sce)$Sample <- "Korrapati_exons"
head(colData(sce))
How many rows did head() print by default?
If we look at the structure of our sce in the Environment browser or using str() we can see all the different slots in our SCE object.
Exercise 1 - Working with SingleCellExperiment Objects Using the following “accessor” functions obtain the following information from the SCE object:
- the gene symbol of the 1013th row.
- the expression level for the 19108th gene in the 271th cell.
- the cellbarcodes of the first 25 cells.
| Function | Description |
|---|---|
rowData(sce) |
Table of gene metadata. |
colData(sce) |
Table of cell metadata. |
assay(sce, "counts") |
The assay named “counts”. |
reducedDim(sce, "PCA") |
The reduced dimensionality table named “PCA” |
sce$colname |
Shortcut to access the column “colname” from colData. This is equivalent to colData(sce)$colname |
sce[<rows>, <columns>] |
We can use the square brackets to subset the SCE object by rows or columns, similarly to how we subset a matrix or data.frame |
:::
Naming Assays
Assays can have any name we wish. However, there are some conventions we can follow:
counts: Raw count data, e.g. number of reads or transcripts for a particular gene.normcounts: Normalized values on the same scale as the original counts. For example, counts divided by cell-specific size factors that are centred at unity.logcounts: Log-transformed counts or count-like values. In most cases, this will be defined as log-transformed normcounts, e.g. using log base 2 and a pseudo-count of 1.cpm: Counts-per-million. This is the read count for each gene in each cell, divided by the library size of each cell in millions.tpm: Transcripts-per-million. This is the number of transcripts for each gene in each cell, divided by the total number of transcripts in that cell (in millions).
Each of these has a function, so that we can access the “counts” assay using the counts() function.
Therefore, these two are equivalent:
While this is simply conventions, many packages we use will assume the assay names above and you may get errors if your assays are incorrectly named. Fortunately, many functions will automatically create the appropriately named assay.
The Seurat Object
The other commonly used software package for 10X Chromium data is Seurat. Like with SingleCellExperiment there are some provided functions to make it easier to read data from cellranger.
sparsematrix <- Seurat::Read10X("~/CourseData/Korrapati_MouseEar/filtered_feature_bc_matrix/")
seur_obj <- Seurat::CreateSeuratObject(sparsematrix)
seur_obj
Seurat has expanded its data object in order to store multi-modal data : e.g. RNA & ATACseq. So here we can see we have one assay called “RNA”, and we will have to look inside of “RNA” to find the count matrix and the normalized matrix (once we generate one).
Seurat::GetAssayData(seur_obj, assay="RNA", "counts")[1:5,1:5]
seur_obj@assays$RNA@layers$counts[1:5,1:5]
If we look at the structure of the Seurat Object as before, we can see that we have 3 slots for assays within our “RNA” slot: - “counts” : raw umi counts - “data” : normalized gene expression - “scale.data” : scaling each row of “data” to have a mean of 0 and standard deviation of 1
We also don’t have rowData or colData, instead Seurat puts information about the cells into meta.data and doesn’t store gene information at all - it uses gene symbols by default.
Seurat also automatically calculates some quality measures for us and puts them in the metadata. Let’s have a look:
head(seur_obj@meta.data)
Here we see the total UMI counts for each cell (“nCount_RNA”) and the number of different genes detected per cell (“nFeature_RNA”). As with DropletUtils, Seurat has automatically added a sample id orig.ident but this time it’s even less helpful.
Exercise 2 - Working with Seurat Object
A) Fix the orig.ident column to contain a more useful name.
- Obtain/Calculate the following information:
- the UMI counts for the 501st gene in the 322nd cell.
- the mean detected genes per cell.
- the total UMIs detected in the whole dataset
Hints:
sum() calculates the sum of a vector.
mean() calculates the mean of a vector.
GetAssayData() accessed the expression matrices in a Seurat Object.
Rather than fix the sample ID afterwards, we can set it when we create the Seurat Object:
seur_obj <- Seurat::CreateSeuratObject(sparsematrix, project="Korrapati_exons")
Modifying Objects
We’ve already seen that we can modify parts of the metadata. We can use a similar approach to modify all parts of an object. However, each slot has specific requirements that must be fulfilled to assign a new value. You can use class() to help figure out what is required. These are to ensure the appropriate data is present in each slot so that other functions that use the object can run correctly.
Exercise 3
Manually read in the introns_and_exons.mtx matrix. It contains counts for the number of intronic reads plus the exonic reads for our data. Replace the counts in our seur_obj with the intronic + exonic counts.
We could also create a new assay for our intronic + exonic counts and add it to the Seurat Object like so:
introns <- Seurat::CreateAssayObject(sparsematrix)
seur_obj@assays$Introns <- introns
Some more useful functions
To create your own analysis pipelines and explore your data it is helpful to have some quick ways to summarize data stored in matrices:
| Function | Description |
|---|---|
rowMeans(sparsematrix) |
mean of each row. |
colMeans(sparsematrix) |
mean of each column. |
rowSums(sparsematrix) |
sum of each row. |
colSums(sparsematrix) |
sum of each column. |
t() |
swaps rows and columns of a matrix. |
Let’s use these functions to calculate the same summary statistics we have in our SeuratObject for our SingleCellExperiment object.
colData(sce)$nCount <- colSums(counts(sce))
colData(sce)$nFeatures <- colSums(counts(sce) > 0)
When we do math with TRUE/FALSE values, the are swapped to numbers: TRUE = 1, FALSE = 0.
Subsetting Single Cell Objects
Similar to the standard data.frame and matrix objects in R, we can use the [ operator to subset our SingleCellExperiment or SeuratObject either by rows (genes) or columns (cells).
The general syntax is: sce[rows_of_interest, columns_of_interest]. We can use gene/column names, logical values, or row/column numbers.
For example, we might want to select only the cells with >1000 detected UMIs:
filtered_sce <- sce[,colData(sce)$nCount > 1000]
dim(filtered_sce)
Or we may want to remove mitochondrial genes (which start with “mt-” in the mouse) from the matrix:
filtered_sce <- sce[!grepl("^mt-", rowData(sce)$Symbol),]
dim(filtered_sce)
grepl("^mt-") uses a regular expression to find all the genes that start with (“^”), a particular pattern (“mt-”). It returns TRUE if the rownames matches the pattern and FALSE if it doesn’t. The details of all the options for regular expressions is beyond the scope of this course but use ?grepl to see the related R functions and see this website for more information on regular expressions: Intro to Regular Expressions
Exercise 4 - Subsetting objects
1. Subset our Seurat object using the same criteria as the SCE.
2. Subset our SCE and Seurat objects to contain only genes with total UMIs of at least 100 (hint: look at the rowSums() function)
3. Subset our Seurat object to remove ribosomal genes (hint: their gene names start with “Rps” or “Rpl” for mice)
Overview
KEY POINTS
The SingleCellExperiment (SCE) object is used to store expression data as well as information about our cells (columns) and genes (rows).
To create a new SCE object we can use the
SingleCellExperiment()function. To read the output from cellranger we can use the dedicated functionDropletUtils::read10xCounts().The main parts of this object are:
- assay - one or more matrices of expression quantification.
- There is one essential assay named “counts”, which contains the raw counts on which all other analyses are based on.
- rowData - information about our genes.
- colData - information about our cells.
- reducedDim - one or more reduced dimension representations of our data.
- assay - one or more matrices of expression quantification.
We can add/modify parts of this object using the
The Seurat Object can store gene expression as well as other types of single-cell data such as scATAC or Multiome data. It stores count matrices as well as information about cells in a meta.data dataframe.
The main parts of this object are:
- assays - counts, log transformed + normalized counts (data), scaled data (scale.data)
- meta.data - information about the cells
- reductions - one or more reduced dimension representations of our data.
We can access parts of these object using ‘accessor’ functions or directly using
@and$notation. For exampleseur_obj@assays$RNA@countsWe can add/modify part of these objects using the assignment operator
<-. For exampleassay(sce, "logcounts") <- log2(counts(sce) + 1)would add a new assay named “logcounts” to our object.Matrix summary metrics are very useful to explore the properties of our expression data. Some of the more useful functions include
rowSums()/colSums()androwMeans()/colMeans().Combining matrix summaries with conditional operators (
>,<,==,!=) can be used for conditional subsetting using[.
sessionInfo()
View session info
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.7.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Toronto
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] htmltools_0.5.9 jsonlite_2.0.0 readr_2.2.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 vctrs_0.7.3 cli_3.6.6 knitr_1.51
## [5] rlang_1.2.0 xfun_0.57 otel_0.2.0 bit_4.6.0
## [9] glue_1.8.1 sass_0.4.10 hms_1.1.4 rmarkdown_2.31
## [13] evaluate_1.0.5 jquerylib_0.1.4 tibble_3.3.1 tzdb_0.5.0
## [17] fastmap_1.2.0 yaml_2.3.12 lifecycle_1.0.5 bookdown_0.46
## [21] compiler_4.4.2 pkgconfig_2.0.3 rstudioapi_0.18.0 digest_0.6.39
## [25] R6_2.6.1 tidyselect_1.2.1 parallel_4.4.2 vroom_1.7.1
## [29] pillar_1.11.1 magrittr_2.0.5 bslib_0.10.0 bit64_4.8.0
## [33] tools_4.4.2 cachem_1.1.0