Introduction to Single Cell RNA-Seq Part 2: QA and filtering
Set up workspace
First, we need to load the required libraries.
library(Seurat)
library(ggplot2)
library(tidyr)
library(dplyr)
library(kableExtra)
If you are continuing directly from part 1, the experiment.aggregate object is likely already in your workspace. In case you cleared your workspace at the end of the previous section, or are working on this project at a later date after re-starting R, you can use the readRDS function to read your saved Seurat object from part 1.
experiment.aggregate <- readRDS("scRNA_workshop-01.rds")
experiment.aggregate
## An object of class Seurat
## 38606 features across 60697 samples within 1 assay
## Active assay: RNA (38606 features, 0 variable features)
## 1 layer present: counts
First lets look at the number of valid cells in each batch
| Sample | Number of Cells |
|---|---|
| LRTI_WRK1 | 28124 |
| LRTI_WRK2 | 14933 |
| LRTI_WRK3 | 3343 |
| LRTI_WRK4 | 14297 |
A seed is used to initialize pseudo-random functions. Some of the functions we will be using have pseudo-random elements. Setting a common seed ensures that all of us will get the same results, and that the results will remain stable when re-run.
set.seed(12345)
Display metadata QA/QC
Using a few nested functions, we can produce prettier, more detailed, versions of the simple exploratory summary statistics we generated for the available metadata in the last section. In the code below sections below,
Further, Seurat has a number of convenient built-in functions for visualizing metadata. These functions produce ggplot objects, which can easily be modified using ggplot2. Of course, all of these visualizations can be reproduced with custom code as well, and we will include some examples of both modifying Seurat plots and generating plots from scratch as the analysis continues.
1) 10% quantile tables are produced for each metadata value, separated by sample identity. 2) Ridge Plot 3) Violin Plot
Each is a different way of looking at the data as to get a better understanding of how each sample compares
Feature counts (genes) per cell
| LRTI_WRK1 | LRTI_WRK2 | LRTI_WRK3 | LRTI_WRK4 | |
|---|---|---|---|---|
| 0% | 300.00 | 300.0 | 300.0 | 300.0 |
| 5% | 305.00 | 387.0 | 337.1 | 701.4 |
| 10% | 311.00 | 522.2 | 375.0 | 1671.0 |
| 15% | 320.00 | 993.6 | 435.3 | 2686.4 |
| 20% | 330.00 | 1650.0 | 499.4 | 3509.0 |
| 25% | 345.00 | 2104.0 | 585.0 | 3982.0 |
| 30% | 368.00 | 2789.0 | 727.6 | 4310.0 |
| 35% | 401.00 | 3472.0 | 908.0 | 4543.0 |
| 40% | 441.00 | 3981.8 | 1186.8 | 4732.0 |
| 45% | 494.00 | 4398.4 | 1699.6 | 4913.0 |
| 50% | 602.00 | 4725.0 | 2184.0 | 5066.0 |
| 55% | 900.30 | 5011.0 | 2617.1 | 5212.8 |
| 60% | 1386.00 | 5260.0 | 3196.0 | 5347.0 |
| 65% | 1886.90 | 5496.0 | 3785.0 | 5493.0 |
| 70% | 2616.10 | 5691.0 | 4287.2 | 5630.0 |
| 75% | 3352.00 | 5894.0 | 4799.5 | 5793.0 |
| 80% | 4076.40 | 6099.0 | 5387.4 | 5960.0 |
| 85% | 4682.00 | 6311.2 | 5961.0 | 6178.6 |
| 90% | 5221.70 | 6580.0 | 6703.8 | 6460.0 |
| 95% | 5836.85 | 7010.0 | 7645.3 | 6982.2 |
| 100% | 13205.00 | 10859.0 | 13321.0 | 9953.0 |


UMI counts per cell
| LRTI_WRK1 | LRTI_WRK2 | LRTI_WRK3 | LRTI_WRK4 | |
|---|---|---|---|---|
| 0% | 333.00 | 341.0 | 337.0 | 329.0 |
| 5% | 374.00 | 633.0 | 548.0 | 1550.4 |
| 10% | 387.00 | 989.0 | 663.0 | 3475.0 |
| 15% | 402.00 | 1989.2 | 822.0 | 7046.8 |
| 20% | 421.00 | 3332.4 | 999.0 | 11645.6 |
| 25% | 448.00 | 4581.0 | 1217.5 | 15134.0 |
| 30% | 489.00 | 7493.6 | 1642.2 | 17535.8 |
| 35% | 542.00 | 10957.2 | 2123.2 | 19545.8 |
| 40% | 614.20 | 14282.8 | 2790.8 | 21091.8 |
| 45% | 736.00 | 17231.0 | 3963.5 | 22609.4 |
| 50% | 1051.00 | 20197.0 | 5167.0 | 23955.0 |
| 55% | 1660.65 | 23045.8 | 6994.1 | 25328.0 |
| 60% | 2496.00 | 25809.2 | 9699.0 | 26753.6 |
| 65% | 3615.85 | 28400.8 | 12977.6 | 28163.8 |
| 70% | 6011.20 | 30822.4 | 16366.8 | 29885.4 |
| 75% | 8947.50 | 33341.0 | 21187.5 | 31696.0 |
| 80% | 13408.60 | 36139.0 | 27606.0 | 33712.6 |
| 85% | 18554.55 | 38939.4 | 35278.6 | 36513.2 |
| 90% | 23435.40 | 42897.6 | 46873.4 | 40754.0 |
| 95% | 29076.05 | 49827.2 | 67186.0 | 49096.0 |
| 100% | 170061.00 | 113899.0 | 374591.0 | 130605.0 |


Percentage of Mitochondria per cell
| LRTI_WRK1 | LRTI_WRK2 | LRTI_WRK3 | LRTI_WRK4 | |
|---|---|---|---|---|
| 0% | 0.000 | 0.000 | 0.000 | 0.000 |
| 5% | 0.368 | 0.525 | 0.180 | 1.614 |
| 10% | 0.721 | 1.464 | 0.357 | 2.055 |
| 15% | 1.018 | 2.116 | 0.584 | 2.297 |
| 20% | 1.311 | 2.472 | 0.796 | 2.477 |
| 25% | 1.606 | 2.737 | 1.075 | 2.629 |
| 30% | 1.887 | 2.963 | 1.302 | 2.752 |
| 35% | 2.123 | 3.156 | 1.570 | 2.879 |
| 40% | 2.356 | 3.327 | 1.863 | 2.996 |
| 45% | 2.583 | 3.511 | 2.103 | 3.118 |
| 50% | 2.820 | 3.675 | 2.336 | 3.239 |
| 55% | 3.100 | 3.839 | 2.642 | 3.364 |
| 60% | 3.434 | 4.017 | 2.907 | 3.494 |
| 65% | 3.909 | 4.196 | 3.312 | 3.639 |
| 70% | 4.605 | 4.391 | 3.755 | 3.806 |
| 75% | 5.650 | 4.615 | 4.394 | 4.004 |
| 80% | 7.262 | 4.895 | 5.388 | 4.275 |
| 85% | 9.610 | 5.273 | 7.478 | 4.672 |
| 90% | 14.019 | 5.942 | 10.352 | 5.566 |
| 95% | 30.216 | 9.451 | 25.616 | 10.212 |
| 100% | 94.769 | 87.811 | 91.625 | 88.044 |


Percentage of Ribosomal (protein) per cell
| LRTI_WRK1 | LRTI_WRK2 | LRTI_WRK3 | LRTI_WRK4 | |
|---|---|---|---|---|
| 0% | 0.000 | 0.000 | 0.000 | 0.144 |
| 5% | 0.882 | 0.611 | 0.192 | 1.498 |
| 10% | 1.907 | 1.135 | 0.301 | 5.342 |
| 15% | 2.961 | 3.782 | 0.393 | 6.252 |
| 20% | 3.648 | 5.220 | 0.502 | 6.804 |
| 25% | 4.128 | 5.671 | 0.639 | 7.213 |
| 30% | 4.556 | 6.018 | 0.833 | 7.565 |
| 35% | 4.923 | 6.312 | 1.142 | 7.902 |
| 40% | 5.263 | 6.581 | 1.899 | 8.217 |
| 45% | 5.600 | 6.836 | 3.208 | 8.521 |
| 50% | 5.974 | 7.089 | 4.098 | 8.824 |
| 55% | 6.364 | 7.354 | 4.817 | 9.118 |
| 60% | 6.806 | 7.620 | 5.400 | 9.440 |
| 65% | 7.345 | 7.912 | 6.147 | 9.783 |
| 70% | 8.078 | 8.245 | 6.969 | 10.176 |
| 75% | 9.089 | 8.659 | 7.845 | 10.635 |
| 80% | 10.385 | 9.250 | 8.859 | 11.157 |
| 85% | 11.861 | 10.313 | 10.125 | 11.868 |
| 90% | 13.561 | 12.554 | 11.672 | 13.130 |
| 95% | 15.965 | 16.720 | 14.853 | 16.216 |
| 100% | 45.035 | 38.339 | 38.631 | 52.027 |


Modifying Seurat plots
Modifying the ggplot objects produced by a Seurat plotting function works best on individual panels. Therefore, to recreate the function above with modifications, we can use lapply to create a list of plots. In some cases it may be more appropriate to create the plots individually so that different modifications can be applied to each plot.
VlnPlot(experiment.aggregate, features = "nCount_RNA", pt.size = 0.01) +
scale_y_continuous(trans = "log10") +
scale_fill_viridis_d(option = "mako") +
ggtitle("log10(nCount_RNA)")

These can later be stitched together with another library, like patchwork, or cowplot.
Custom plots
The Seurat built-in functions are useful and easy to interact with, but sometimes you may wish to visualize something for which a plotting function does not already exist. For example, we might want to see how many cells are expressing each gene over some UMI threshold.
The code below produces a ranked plot similar to the barcode inflection plots from the last section. On the x-axis are the genes arranged from least ubiquitously expressed to most. In a single cell dataset, many genes are expessed in a relatively small number of cells, or not at all. The y-axis displays the number of cells in which each gene is expressed.
Note: This function is SLOW. You may want to skip this code block or run it while you take a break for a few minutes.
plot(sort(Matrix::rowSums(GetAssayData(experiment.aggregate) >= 3)) , xlab="gene rank", ylab="number of cells", main="Cells per genes (reads/gene >= 3 )")

Scatter plots
Scatter plots allow us to visualize the relationships between the metadata variables.
Gene Plot, scatter plot of gene expression across cells, (colored by sample)



Cell filtering
We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 10% and unique UMI counts under 50,000 or greater than 500. Further requiring the number of features persent per cell to be 1000 genes.

Cell Complexity
The standard way of calculating this is log10(genes)/log10(counts) however this gives absolute values which are difficult to judge. A possibly better approach is to fit a line through the cloud and then calculate the difference from the observed value to the expected.
##
## Call:
## lm(formula = log10(qc.metrics$nFeature_RNA) ~ log10(qc.metrics$nCount_RNA))
##
## Coefficients:
## (Intercept) log10(qc.metrics$nCount_RNA)
## 0.8111 0.6615
And we can add the information to the scatter plot

Histograms of metadata
Histograms can also be useful to look at the distribution over all the cells

Displaying the number of genes per cell

With Mitochondrial expression

Cell filtering
The goal of cell filtering is to remove cells with anomolous expression profiles, typically low UMI cells, which may correspond to low-quality cells or background barcodes. It may also be appropriate to remove outlier cells with extremely high UMI counts. In this case, the proposed cut-offs on the high end of the distributions are quite conservative, in part to reduce the size of the object and speed up analysis during the workshop.
These filters can be put in place with the subset function.
Prefiltered data
| Sample | Number of Cells |
|---|---|
| LRTI_WRK1 | 28124 |
| LRTI_WRK2 | 14933 |
| LRTI_WRK3 | 3343 |
| LRTI_WRK4 | 14297 |
Post filtered data
| Sample | Number of Cells |
|---|---|
| LRTI_WRK1 | 12372 |
| LRTI_WRK2 | 12480 |
| LRTI_WRK3 | 2084 |
| LRTI_WRK4 | 13116 |
Play with the filtering parameters, and see how the results change. Is there a set of parameters you feel is more appropriate? Why?
Feature filtering
When creating the base Seurat object, we had the opportunity filter out some genes using the “min.cells” argument. At the time, we set the min feature to keep a cell to 300. Since we didn’t filter out any features then (set to 0), we can apply a filter at this point. If we had filtered when the object was created, this would be an opportunity to be more aggressive. The custom code below provides a function that filters genes requiring a min.umi in at least min.cells, or takes a user-provided list of genes.
# define function
FilterGenes <- function(object, min.umi = NA, min.cells = NA, genes = NULL) {
genes.use = NA
if (!is.null(genes)) {
genes.use = intersect(rownames(object), genes)
} else if (min.cells & min.umi) {
num.cells = Matrix::rowSums(GetAssayData(object) >= min.umi)
genes.use = names(num.cells[which(num.cells >= min.cells)])
}
object = object[genes.use,]
object = LogSeuratCommand(object = object)
return(object)
}
# apply filter
experiment.filter <- FilterGenes(object = experiment.aggregate.filtered, min.umi = 1, min.cells = 2)
# filtering results
experiment.filter
## An object of class Seurat
## 31828 features across 40052 samples within 1 assay
## Active assay: RNA (31828 features, 0 variable features)
## 1 layer present: counts
We are going to choose to not filter the genes at this time, but know you can

Prepare for the next section
Save object
saveRDS(experiment.aggregate.filtered, file="scRNA_workshop-02.rds")
Download Rmd
download.file("https://raw.githubusercontent.com/ucsf-cat-bioinformatics/2024-08-SCRNA-Seq-Analysis/main/data_analysis/03-normalize_scale.Rmd", "03-normalize_scale.Rmd")
Session Information
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 dplyr_1.1.4 tidyr_1.3.1 ggplot2_3.5.1
## [5] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] deldir_2.0-4 pbapply_1.7-2 gridExtra_2.3
## [4] rlang_1.1.4 magrittr_2.0.3 RcppAnnoy_0.0.22
## [7] spatstat.geom_3.3-2 matrixStats_1.3.0 ggridges_0.5.6
## [10] compiler_4.4.1 systemfonts_1.1.0 png_0.1-8
## [13] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
## [16] pkgconfig_2.0.3 fastmap_1.2.0 labeling_0.4.3
## [19] utf8_1.2.4 promises_1.3.0 rmarkdown_2.28
## [22] purrr_1.0.2 xfun_0.47 cachem_1.1.0
## [25] jsonlite_1.8.8 goftest_1.2-3 highr_0.11
## [28] later_1.3.2 spatstat.utils_3.1-0 irlba_2.3.5.1
## [31] parallel_4.4.1 cluster_2.1.6 R6_2.5.1
## [34] ica_1.0-3 spatstat.data_3.1-2 bslib_0.8.0
## [37] stringi_1.8.4 RColorBrewer_1.1-3 reticulate_1.38.0
## [40] spatstat.univar_3.0-0 parallelly_1.38.0 lmtest_0.9-40
## [43] jquerylib_0.1.4 scattermore_1.2 Rcpp_1.0.13
## [46] knitr_1.48 tensor_1.5 future.apply_1.11.2
## [49] zoo_1.8-12 sctransform_0.4.1 httpuv_1.6.15
## [52] Matrix_1.7-0 splines_4.4.1 igraph_2.0.3
## [55] tidyselect_1.2.1 abind_1.4-5 rstudioapi_0.16.0
## [58] yaml_2.3.10 spatstat.random_3.3-1 codetools_0.2-20
## [61] miniUI_0.1.1.1 spatstat.explore_3.3-2 listenv_0.9.1
## [64] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [67] withr_3.0.1 shiny_1.9.1 ROCR_1.0-11
## [70] evaluate_0.24.0 Rtsne_0.17 future_1.34.0
## [73] fastDummies_1.7.4 survival_3.7-0 polyclip_1.10-7
## [76] xml2_1.3.6 fitdistrplus_1.2-1 pillar_1.9.0
## [79] KernSmooth_2.23-24 plotly_4.10.4 generics_0.1.3
## [82] RcppHNSW_0.6.0 munsell_0.5.1 scales_1.3.0
## [85] globals_0.16.3 xtable_1.8-4 glue_1.7.0
## [88] lazyeval_0.2.2 tools_4.4.1 data.table_1.15.4
## [91] RSpectra_0.16-2 RANN_2.6.2 leiden_0.4.3.1
## [94] dotCall64_1.1-1 cowplot_1.1.3 grid_4.4.1
## [97] colorspace_2.1-1 nlme_3.1-166 patchwork_1.2.0
## [100] cli_3.6.3 spatstat.sparse_3.1-0 spam_2.10-0
## [103] fansi_1.0.6 viridisLite_0.4.2 svglite_2.1.3
## [106] uwot_0.2.2 gtable_0.3.5 sass_0.4.9
## [109] digest_0.6.37 progressr_0.14.0 ggrepel_0.9.5
## [112] farver_2.1.2 htmlwidgets_1.6.4 htmltools_0.5.8.1
## [115] lifecycle_1.0.4 httr_1.4.7 mime_0.12
## [118] MASS_7.3-61