☰ Menu

      Single Cell RNA-Seq Analysis

Home
Discussion and Lectures
Intro to the Workshop
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Closing thoughts
Data Reduction
Files and Filetypes
Generating Expression Matrix
Prerequisites
CLI
R
Data analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization
Part 4- Dimensionality reduction
Part 6- Enrichment and DE
Support
Cheat Sheets
Software and Links
Scripts
ETC
CAT website
Github page
Report Errors

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)

We will be continuing the work from Part 1 and so need to load in the RDS file

experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
experiment.aggregate
## An object of class Seurat 
## 38606 features across 14018 samples within 1 assay 
## Active assay: RNA (38606 features, 2000 variable features)
##  3 layers present: counts, data, scale.data
##  2 dimensional reductions calculated: pca, umap

Lets go ahead and set that common seed for everyone

set.seed(12345)

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
GO.ID Term Annotated Significant Expected Fisher
GO:0002181 cytoplasmic translation 159 134 38.18 < 1e-30
GO:0042776 proton motive force-driven mitochondrial ATP synthesis 57 53 13.69 1.6e-28
GO:0043161 proteasome-mediated ubiquitin-dependent protein catabolic pr... 388 190 93.18 2.0e-24
GO:0051301 cell division 640 284 153.70 1.7e-20
GO:0006120 mitochondrial electron transport, NADH to ubiquinone 42 37 10.09 2.4e-18
GO:0032543 mitochondrial translation 130 76 31.22 4.2e-17
GO:0000398 mRNA splicing, via spliceosome 304 200 73.01 5.1e-15
GO:0006338 chromatin remodeling 655 285 157.30 9.6e-14
GO:0032981 mitochondrial respiratory chain complex I assembly 55 39 13.21 2.6e-13
GO:0045893 positive regulation of DNA-templated transcription 1677 556 402.74 9.4e-13
GO:1903241 U2-type prespliceosome assembly 25 23 6.00 9.6e-13
GO:0070936 protein K48-linked ubiquitination 82 53 19.69 2.3e-12
GO:0006886 intracellular protein transport 671 307 161.14 1.1e-11
GO:0006281 DNA repair 598 296 143.61 1.7e-11
GO:0006406 mRNA export from nucleus 71 48 17.05 3.0e-11
GO:0006325 chromatin organization 798 358 191.64 8.3e-11
GO:0050821 protein stabilization 211 91 50.67 6.6e-10
GO:0090148 membrane fission 44 30 10.57 7.0e-10
GO:0006446 regulation of translational initiation 84 57 20.17 1.2e-09
GO:0008380 RNA splicing 451 280 108.31 1.5e-09

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$orig.ident)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent.mito + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupLRTI_WRK1 proper.groupLRTI_WRK2 proper.groupLRTI_WRK3 proper.groupLRTI_WRK4 S.Score G2M.Score percent.mito nFeature_RNA
AAACCAGCACATTAGC+LRTI_WRK1 1 0 0 0 0.1586883 1.0371359 3.202255 5479
AAACCCATCAATGTAC+LRTI_WRK1 1 0 0 0 0.7347704 0.6561995 3.305649 6910
AAACCGTGTCGTACCT+LRTI_WRK1 1 0 0 0 0.8925801 0.4061249 3.362421 4621
AAACGGCCAGGTACAA+LRTI_WRK1 1 0 0 0 0.5515437 0.8579274 4.211121 4638
AAACTTATCGGTTCGG+LRTI_WRK1 1 0 0 0 0.5691890 0.9509525 5.031228 7667
AAAGGGTTCATCAGGC+LRTI_WRK1 1 0 0 0 0.8256540 0.3402400 6.204631 6241
tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupLRTI_WRK1 proper.groupLRTI_WRK2 proper.groupLRTI_WRK3 proper.groupLRTI_WRK4 S.Score G2M.Score percent.mito nFeature_RNA
GCCCATTGTCACGGGT+LRTI_WRK2 0 1 0 0 0.0361409 -0.0412531 2.910205 2253
GGCTTAATCGGCGATT+LRTI_WRK2 0 1 0 0 -0.0639132 -0.0778717 4.119639 1131
GGTACTATCGTAGCGT+LRTI_WRK2 0 1 0 0 -0.0276898 0.0288580 2.727273 2167
CTCGCCAGTACTATGA+LRTI_WRK3 0 0 1 0 -0.0426008 -0.0503667 2.546757 6586
AATCCGTCAAAGCGTA+LRTI_WRK4 0 0 0 1 0.4301881 0.2368817 2.721987 7958
CACACTCGTACCACCT+LRTI_WRK4 0 0 0 1 0.0814435 -0.0556294 5.963636 836
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupLRTI_WRK1 proper.groupLRTI_WRK2 proper.groupLRTI_WRK3 proper.groupLRTI_WRK4 S.Score G2M.Score percent.mito nFeature_RNA
LINC01128 0.0104547 0.1828878 0.0116083 -0.0300109 0.0698463 0.0062522 0.0064823 -3.80e-06
NOC2L -0.0822753 0.1396044 -0.0935686 -0.0203252 0.1025678 -0.0104968 0.0069699 5.26e-05
KLHL17 0.0713555 0.0281549 0.0366542 0.1126129 0.0452909 -0.0207237 -0.0065486 -2.90e-06
ISG15 1.6378853 0.6244244 0.2260651 1.9633809 -0.3275326 -0.0656142 -0.0763379 3.31e-05
C1orf159 0.0279527 0.0208602 0.8150750 -0.0640302 0.1261940 0.0248038 0.0134276 -6.60e-06
TNFRSF18 1.1823790 1.4462475 1.4931473 0.1115185 -0.2780431 -0.2435467 -0.0291735 2.46e-05
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupLRTI_WRK1	 - proper.groupLRTI_WRK2	, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupLRTI_WRK1 - proper.groupLRTI_WRK2
proper.groupLRTI_WRK1 1
proper.groupLRTI_WRK2 -1
proper.groupLRTI_WRK3 0
proper.groupLRTI_WRK4 0
S.Score 0
G2M.Score 0
percent.mito 0
nFeature_RNA 0
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
logFC AveExpr t P.Value adj.P.Val B
VAV3 -1.7508024 0.3110052 -13.236840 0 0e+00 62.96266
TBC1D4 -1.3900664 0.2833838 -11.568702 0 0e+00 48.94408
HDAC9 -1.2517491 0.1753274 -10.333661 0 0e+00 39.11020
IL1R1 -0.9995498 0.1300682 -10.226861 0 0e+00 38.28639
CCR8 -0.8353974 0.1163754 -10.036890 0 0e+00 36.83248
EML4 1.5052329 1.7012424 9.561668 0 0e+00 33.26267
ARPC2 1.1548512 2.4380833 9.464631 0 0e+00 32.54606
MT-ND2 -1.0490250 2.3930152 -8.952967 0 0e+00 28.84073
LAYN -0.7911447 0.1180913 -8.823339 0 0e+00 27.92232
RPS4Y1 1.0730097 1.1304680 8.153916 0 0e+00 23.31908
ENSG00000227240 -1.2591894 0.2575334 -7.784149 0 0e+00 20.88238
CAPZB 1.0079794 1.8103635 7.756462 0 0e+00 20.70311
TXNIP -1.2963896 0.6787166 -7.580633 0 0e+00 19.57527
CAPZA1 0.9017427 1.2495491 7.516820 0 0e+00 19.17053
FTL -1.2627828 2.1523660 -7.179795 0 0e+00 17.07449
FRYL 0.9189332 0.8948729 7.026423 0 0e+00 16.14428
UTY 0.9005716 0.5991670 6.983129 0 0e+00 15.88444
SGPP1 -0.6447435 0.2190291 -6.961139 0 0e+00 15.75292
IL2RG 1.0240127 1.8196712 6.898402 0 0e+00 15.37944
GIMAP7 1.4491604 1.4094892 6.831560 0 0e+00 14.98436
PTPRM -0.7061117 0.1497160 -6.752201 0 0e+00 14.51912
IRF1 0.9660324 1.1877894 6.728966 0 0e+00 14.38370
MCF2L2 -0.7044998 0.2238412 -6.688666 0 0e+00 14.14968
DEF6 0.8446568 0.9440441 6.655879 0 0e+00 13.96009
TUT7 0.8658212 0.7549669 6.647806 0 0e+00 13.91351
TTN -0.7275733 0.1723645 -6.544020 0 1e-07 13.31871
UTRN 1.0748347 1.2238449 6.437446 0 1e-07 12.71557
CELF2 1.1100942 1.7269740 6.408570 0 2e-07 12.55350
PPP1R18 0.7727838 1.0353629 6.363937 0 2e-07 12.30411
SYNRG 0.8090564 0.8666973 6.330475 0 2e-07 12.11806

Output columns:

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

Prepare for the next section

Session Information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.19.1  kableExtra_1.4.0     dplyr_1.1.4         
##  [4] topGO_2.56.0         SparseM_1.84-2       GO.db_3.19.1        
##  [7] AnnotationDbi_1.66.0 IRanges_2.38.1       S4Vectors_0.42.1    
## [10] Biobase_2.64.0       graph_1.82.0         BiocGenerics_0.50.0 
## [13] limma_3.60.4         Seurat_5.1.0         SeuratObject_5.0.2  
## [16] sp_2.1-4            
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] magrittr_2.0.3          spatstat.utils_3.1-0    rmarkdown_2.28         
##   [7] zlibbioc_1.50.0         vctrs_0.6.5             ROCR_1.0-11            
##  [10] memoise_2.0.1           spatstat.explore_3.3-2  htmltools_0.5.8.1      
##  [13] sass_0.4.9              sctransform_0.4.1       parallelly_1.38.0      
##  [16] KernSmooth_2.23-24      bslib_0.8.0             htmlwidgets_1.6.4      
##  [19] ica_1.0-3               plyr_1.8.9              plotly_4.10.4          
##  [22] zoo_1.8-12              cachem_1.1.0            igraph_2.0.3           
##  [25] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [28] Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
##  [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.2-1      future_1.34.0          
##  [34] shiny_1.9.1             digest_0.6.37           colorspace_2.1-1       
##  [37] patchwork_1.2.0         tensor_1.5              RSpectra_0.16-2        
##  [40] irlba_2.3.5.1           RSQLite_2.3.7           progressr_0.14.0       
##  [43] fansi_1.0.6             spatstat.sparse_3.1-0   httr_1.4.7             
##  [46] polyclip_1.10-7         abind_1.4-5             compiler_4.4.1         
##  [49] bit64_4.0.5             DBI_1.2.3               fastDummies_1.7.4      
##  [52] highr_0.11              MASS_7.3-61             tools_4.4.1            
##  [55] lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.2    
##  [58] goftest_1.2-3           glue_1.7.0              nlme_3.1-166           
##  [61] promises_1.3.0          grid_4.4.1              Rtsne_0.17             
##  [64] cluster_2.1.6           reshape2_1.4.4          generics_0.1.3         
##  [67] gtable_0.3.5            spatstat.data_3.1-2     tidyr_1.3.1            
##  [70] data.table_1.16.0       xml2_1.3.6              XVector_0.44.0         
##  [73] utf8_1.2.4              spatstat.geom_3.3-2     RcppAnnoy_0.0.22       
##  [76] ggrepel_0.9.5           RANN_2.6.2              pillar_1.9.0           
##  [79] stringr_1.5.1           spam_2.10-0             RcppHNSW_0.6.0         
##  [82] later_1.3.2             splines_4.4.1           lattice_0.22-6         
##  [85] bit_4.0.5               survival_3.7-0          deldir_2.0-4           
##  [88] tidyselect_1.2.1        Biostrings_2.72.1       miniUI_0.1.1.1         
##  [91] pbapply_1.7-2           knitr_1.48              gridExtra_2.3          
##  [94] svglite_2.1.3           scattermore_1.2         xfun_0.47              
##  [97] statmod_1.5.0           matrixStats_1.3.0       UCSC.utils_1.0.0       
## [100] stringi_1.8.4           lazyeval_0.2.2          yaml_2.3.10            
## [103] evaluate_0.24.0         codetools_0.2-20        tibble_3.2.1           
## [106] cli_3.6.3               uwot_0.2.2              systemfonts_1.1.0      
## [109] xtable_1.8-4            reticulate_1.38.0       munsell_0.5.1          
## [112] jquerylib_0.1.4         GenomeInfoDb_1.40.1     Rcpp_1.0.13            
## [115] globals_0.16.3          spatstat.random_3.3-1   png_0.1-8              
## [118] spatstat.univar_3.0-0   parallel_4.4.1          blob_1.2.4             
## [121] ggplot2_3.5.1           dotCall64_1.1-1         listenv_0.9.1          
## [124] viridisLite_0.4.2       scales_1.3.0            ggridges_0.5.6         
## [127] crayon_1.5.3            leiden_0.4.3.1          purrr_1.0.2            
## [130] rlang_1.1.4             KEGGREST_1.44.1         cowplot_1.1.3