Home
Discussion and Lectures
Intro to the Workshop
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Closing thoughts
Data Reduction
Files and Filetypes
Generating Expression Matrix
Prerequisites
CLI
R
Data analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization
Part 4- Dimensionality reduction
Part 6- Enrichment and DE
Support
Cheat Sheets
Software and Links
Scripts
ETC
CAT website
Github page
Report Errors

Introduction to Single Cell RNA-Seq Part 3: Normalize and scale

Set up workspace

First, we need to load the required libraries.

library(Seurat)
library(kableExtra)
library(ggplot2)
library(dplyr)

We will be continuing the work from Part 1 and so need to load in the RDS file

experiment.aggregate <- readRDS("scRNA_workshop-02.rds")
experiment.aggregate

## An object of class Seurat 
## 38606 features across 40052 samples within 1 assay 
## Active assay: RNA (38606 features, 0 variable features)
##  1 layer present: counts

Lets go ahead and set that common seed for everyone

set.seed(12345)

Normalize the data

After filtering, the next step is to normalize the data. We employ a global-scaling normalization method, LogNormalize, that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and then log-transforms the data.

?NormalizeData

experiment.aggregate <- NormalizeData(
  object = experiment.aggregate,
  normalization.method = "LogNormalize",
  scale.factor = 10000)

experiment.aggregate

## An object of class Seurat 
## 38606 features across 40052 samples within 1 assay 
## Active assay: RNA (38606 features, 0 variable features)
##  2 layers present: counts, data

The function produces a new layer in the object called “data”. We can now access the normalised data using LayerData. We can use this to show that we can get a list of the most highly expressed genes overall.

norm.expression <- LayerData(experiment.aggregate,layer="data")

mean.expression <- apply(norm.expression, 1, mean)

names(mean.expression) <- rownames(experiment.aggregate)

mean.expression <- sort(mean.expression, decreasing = TRUE)

head(mean.expression, n=50)

##     FTH1   MT-CO1      FTL     CD74      B2M   TMSB4X   MALAT1   MT-CO2 
## 4.646231 4.539242 4.525198 4.489474 4.458964 4.284744 4.198105 3.860866 
##     ACTB   EEF1A1  HLA-DRA     TPT1   MT-CO3    HLA-B      VIM    HLA-A 
## 3.855886 3.711285 3.653941 3.587522 3.534209 3.503641 3.407929 3.285413 
## HLA-DRB1   TMSB10  S100A11    RPLP1   MT-CYB    IFI30    HLA-C    RPL28 
## 3.276857 3.240642 3.203091 3.174771 3.137407 3.085733 3.068303 3.030152 
##    RPL41    RPL10     PFN1  MT-ND4L  MT-ATP6   TYROBP HLA-DPA1   MT-ND4 
## 2.953827 2.952570 2.910644 2.886496 2.880725 2.875505 2.870495 2.859882 
##   S100A6   MT-ND5     MYL6   RPL13A   MT-ND2     RPS8    RPL13     PSAP 
## 2.846236 2.844457 2.785106 2.785023 2.764556 2.721428 2.716863 2.712893 
##    RPS19   MT-ND1      LYZ    RPS12     CST3     CTSD     RPS2    RPLP2 
## 2.709430 2.688671 2.657034 2.656320 2.655618 2.648870 2.647233 2.607146 
## HLA-DPB1 HLA-DRB5 
## 2.599841 2.597642

We see a number of Mitochondrial genes that are highly expressed as well as some of our Ribosomal products.

Explore a couple more of these genes

Cell cycle assignment

Cell cycle phase can be a significant source of variation in single cell and single nucleus experiments. There are a number of automated cell cycle stage detection methods available for single cell data. For this workshop, we will be using the built-in Seurat cell cycle function, CellCycleScoring. This tool compares gene expression in each cell to a list of cell cycle marker genes and scores each barcode based on marker expression. The phase with the highest score is selected for each barcode. Seurat includes a list of cell cycle genes in human single cell data.

Seurat comes with the human gene lists needed for Cell Cycle determination.

cc.genes.updated.2019

## $s.genes
##  [1] "MCM5"     "PCNA"     "TYMS"     "FEN1"     "MCM7"     "MCM4"    
##  [7] "RRM1"     "UNG"      "GINS2"    "MCM6"     "CDCA7"    "DTL"     
## [13] "PRIM1"    "UHRF1"    "CENPU"    "HELLS"    "RFC2"     "POLR1B"  
## [19] "NASP"     "RAD51AP1" "GMNN"     "WDR76"    "SLBP"     "CCNE2"   
## [25] "UBR7"     "POLD3"    "MSH2"     "ATAD2"    "RAD51"    "RRM2"    
## [31] "CDC45"    "CDC6"     "EXO1"     "TIPIN"    "DSCC1"    "BLM"     
## [37] "CASP8AP2" "USP1"     "CLSPN"    "POLA1"    "CHAF1B"   "MRPL36"  
## [43] "E2F8"    
## 
## $g2m.genes
##  [1] "HMGB2"   "CDK1"    "NUSAP1"  "UBE2C"   "BIRC5"   "TPX2"    "TOP2A"  
##  [8] "NDC80"   "CKS2"    "NUF2"    "CKS1B"   "MKI67"   "TMPO"    "CENPF"  
## [15] "TACC3"   "PIMREG"  "SMC4"    "CCNB2"   "CKAP2L"  "CKAP2"   "AURKB"  
## [22] "BUB1"    "KIF11"   "ANP32E"  "TUBB4B"  "GTSE1"   "KIF20B"  "HJURP"  
## [29] "CDCA3"   "JPT1"    "CDC20"   "TTK"     "CDC25C"  "KIF2C"   "RANGAP1"
## [36] "NCAPD2"  "DLGAP5"  "CDCA2"   "CDCA8"   "ECT2"    "KIF23"   "HMMR"   
## [43] "AURKA"   "PSRC1"   "ANLN"    "LBR"     "CKAP5"   "CENPE"   "CTCF"   
## [50] "NEK2"    "G2E3"    "GAS2L3"  "CBX5"    "CENPA"

For other species, a user-provided gene list may be substituted, or the orthologs of the human gene list used instead. Do not run the code below for human experiments!

# mouse code DO NOT RUN for human data
library(biomaRt)
convertHumanGeneList <- function(x){
  require("biomaRt")
  human = useEnsembl("ensembl",
                     dataset = "hsapiens_gene_ensembl",
                     mirror = "uswest")
  mouse = useEnsembl("ensembl",
                     dataset = "mmusculus_gene_ensembl",
                     mirror = "uswest")
  genes = getLDS(attributes = c("hgnc_symbol"),
                 filters = "hgnc_symbol",
                 values = x ,
                 mart = human,
                 attributesL = c("mgi_symbol"),
                 martL = mouse,
                 uniqueRows=T)
  humanx = unique(genes[, 2])
  print(head(humanx)) # print first 6 genes found to the screen
  return(humanx)
}
# convert lists to mouse orthologs
s.genes <- convertHumanGeneList(cc.genes.updated.2019$s.genes)
g2m.genes <- convertHumanGeneList(cc.genes.updated.2019$g2m.genes)

Once an appropriate gene list has been identified, the CellCycleScoring function can be run.

experiment.aggregate <- CellCycleScoring(experiment.aggregate,
                                         s.features = cc.genes.updated.2019$s.genes,
                                         g2m.features = cc.genes.updated.2019$g2m.genes,
                                         set.ident = TRUE)

table(experiment.aggregate[["Phase"]]) %>%
  kable(caption = "Number of Cells in each Cell Cycle Stage",
        col.names = c("Stage", "Count"),
        align = "c") %>%
  kable_styling()

Number of Cells in each Cell Cycle Stage
Stage	Count
G1	24873
G2M	4501
S	10678

We can visualize how the algorithm assigns state based on the computed S.Score and G2M.Score

Because the “set.ident” argument was set to TRUE (this is also the default behavior), the active identity of the Seurat object was changed to the phase. To return the active identity to the sample identity, use the Idents function.

table(Idents(experiment.aggregate))

## 
##    G1     S   G2M 
## 24873 10678  4501

Idents(experiment.aggregate) <- "orig.ident"
table(Idents(experiment.aggregate))

## 
## LRTI_WRK1 LRTI_WRK2 LRTI_WRK3 LRTI_WRK4 
##     12372     12480      2084     13116

Identify variable genes

The function FindVariableFeatures identifies the most highly variable genes (default 2000 genes) by fitting a line to the relationship of log(variance) and log(mean) using loess smoothing, uses this information to standardize the data, then calculates the variance of the standardized data. This helps avoid selecting genes that only appear variable due to their expression level.

?FindVariableFeatures

experiment.aggregate <- FindVariableFeatures(
  object = experiment.aggregate,
  selection.method = "vst") ## vst having issues??

length(VariableFeatures(experiment.aggregate))

## [1] 2000

The function places data in an object called HVFInfo, we can extract and look at the results of find variable features for each gene.

##               mean    variance variance.expected variance.standardized
## SPP1     3.1527514  2686.06256        13.3812444              70.68009
## WFDC2    2.3125936  1828.29703         8.0168136              70.56587
## SERPINB3 0.9510137   328.98015         2.0923136              68.68482
## SLPI     1.8165635  1393.96303         5.4465838              64.16529
## BPIFB1   2.5344552  3206.84176         9.3129384              61.54649
## CCL17    1.7835314   377.07781         5.2919254              60.07977
## TPSAB1   0.4262459   112.10639         0.7843354              60.06400
## PTGDS    0.4777789    98.86088         0.8900317              58.90872
## CPA3     0.3148157    56.25267         0.5730767              57.78090
## LCN2     1.0441676   712.20201         2.3810364              57.29496
## CCL20    1.0298362   136.89489         2.3354899              56.27489
## RIMS1    0.6606162    87.75036         1.3035991              54.89394
## CCL2     2.0587486   499.79263         6.6439926              50.74984
## SCGB3A1  2.6130031  6191.75650         9.7929258              50.11783
## PIGR     0.5649905    78.83029         1.0793266              49.02960
## KRT19    0.6037152    81.37270         1.1681873              47.55942
## CXCL10   3.1525267   624.45735        13.3796467              46.31380
## MSMB     0.9065215   886.44963         1.9604775              43.78314
## PPBP     0.3517677    81.13333         0.6405779              43.03506
## SCGB1A1  3.1295316 13863.77504        13.2166073              40.59522

We can plot this data, while marking the top 20

Goes mostly flat at around 100, so we will keep our list of variable genes

How do the results change if you use selection.method = “dispersion” or selection.method = “mean.var.plot”?

FindVariableFeatures isn’t the only way to set the “variable features” of a Seurat object. Another reasonable approach is to select a set of “minimally expressed” genes.

min.value <- 2
min.cells <- 10

num.cells <- Matrix::rowSums(GetAssayData(experiment.aggregate, slot = "count") > min.value)
genes.use <- names(num.cells[which(num.cells >= min.cells)])
length(genes.use)

## [1] 15841

VariableFeatures(experiment.aggregate) <- genes.use

Prepare for the next section

Save object

saveRDS(experiment.aggregate, file = "scRNA_workshop-03.rds")

Download Rmd

download.file("https://raw.githubusercontent.com/ucsf-cat-bioinformatics/2024-08-SCRNA-Seq-Analysis/main/data_analysis/04-dimensionality_reduction.Rmd", "04-dimensionality_reduction.Rmd")

Session Information

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4        ggplot2_3.5.1      kableExtra_1.4.0   Seurat_5.1.0      
## [5] SeuratObject_5.0.2 sp_2.1-4          
## 
## loaded via a namespace (and not attached):
##   [1] deldir_2.0-4           pbapply_1.7-2          gridExtra_2.3         
##   [4] rlang_1.1.4            magrittr_2.0.3         RcppAnnoy_0.0.22      
##   [7] spatstat.geom_3.3-2    matrixStats_1.3.0      ggridges_0.5.6        
##  [10] compiler_4.4.1         systemfonts_1.1.0      png_0.1-8             
##  [13] vctrs_0.6.5            reshape2_1.4.4         stringr_1.5.1         
##  [16] pkgconfig_2.0.3        fastmap_1.2.0          labeling_0.4.3        
##  [19] utf8_1.2.4             promises_1.3.0         rmarkdown_2.28        
##  [22] purrr_1.0.2            xfun_0.47              cachem_1.1.0          
##  [25] jsonlite_1.8.8         goftest_1.2-3          highr_0.11            
##  [28] later_1.3.2            spatstat.utils_3.1-0   irlba_2.3.5.1         
##  [31] parallel_4.4.1         cluster_2.1.6          R6_2.5.1              
##  [34] ica_1.0-3              spatstat.data_3.1-2    bslib_0.8.0           
##  [37] stringi_1.8.4          RColorBrewer_1.1-3     reticulate_1.38.0     
##  [40] spatstat.univar_3.0-0  parallelly_1.38.0      lmtest_0.9-40         
##  [43] jquerylib_0.1.4        scattermore_1.2        Rcpp_1.0.13           
##  [46] knitr_1.48             tensor_1.5             future.apply_1.11.2   
##  [49] zoo_1.8-12             sctransform_0.4.1      httpuv_1.6.15         
##  [52] Matrix_1.7-0           splines_4.4.1          igraph_2.0.3          
##  [55] tidyselect_1.2.1       abind_1.4-5            rstudioapi_0.16.0     
##  [58] yaml_2.3.10            spatstat.random_3.3-1  codetools_0.2-20      
##  [61] miniUI_0.1.1.1         spatstat.explore_3.3-2 listenv_0.9.1         
##  [64] lattice_0.22-6         tibble_3.2.1           plyr_1.8.9            
##  [67] withr_3.0.1            shiny_1.9.1            ROCR_1.0-11           
##  [70] evaluate_0.24.0        Rtsne_0.17             future_1.34.0         
##  [73] fastDummies_1.7.4      survival_3.7-0         polyclip_1.10-7       
##  [76] xml2_1.3.6             fitdistrplus_1.2-1     pillar_1.9.0          
##  [79] KernSmooth_2.23-24     plotly_4.10.4          generics_0.1.3        
##  [82] RcppHNSW_0.6.0         munsell_0.5.1          scales_1.3.0          
##  [85] globals_0.16.3         xtable_1.8-4           glue_1.7.0            
##  [88] lazyeval_0.2.2         tools_4.4.1            data.table_1.15.4     
##  [91] RSpectra_0.16-2        RANN_2.6.2             leiden_0.4.3.1        
##  [94] dotCall64_1.1-1        cowplot_1.1.3          grid_4.4.1            
##  [97] tidyr_1.3.1            colorspace_2.1-1       nlme_3.1-166          
## [100] patchwork_1.2.0        cli_3.6.3              spatstat.sparse_3.1-0 
## [103] spam_2.10-0            fansi_1.0.6            viridisLite_0.4.2     
## [106] svglite_2.1.3          uwot_0.2.2             gtable_0.3.5          
## [109] sass_0.4.9             digest_0.6.37          progressr_0.14.0      
## [112] ggrepel_0.9.5          farver_2.1.2           htmlwidgets_1.6.4     
## [115] htmltools_0.5.8.1      lifecycle_1.0.4        httr_1.4.7            
## [118] mime_0.12              MASS_7.3-61

☰ Menu

Single Cell RNA-Seq Analysis