Home
Discussion and Lectures
Intro to the Workshop
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Closing thoughts
Data Reduction
Files and Filetypes
Generating Expression Matrix
Prerequisites
CLI
R
Data analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization
Part 4- Dimensionality reduction
Part 6- Enrichment and DE
Support
Cheat Sheets
Software and Links
Scripts
ETC
CAT website
Github page
Report Errors

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)

We will be continuing the work from Part 1 and so need to load in the RDS file

experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
experiment.aggregate

## An object of class Seurat 
## 38606 features across 14018 samples within 1 assay 
## Active assay: RNA (38606 features, 2000 variable features)
##  3 layers present: counts, data, scale.data
##  2 dimensional reductions calculated: pca, umap

Lets go ahead and set that common seed for everyone

set.seed(12345)

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

GO.ID	Term	Annotated	Significant	Expected	Fisher
GO:0002181	cytoplasmic translation	159	134	38.18	< 1e-30
GO:0042776	proton motive force-driven mitochondrial ATP synthesis	57	53	13.69	1.6e-28
GO:0043161	proteasome-mediated ubiquitin-dependent protein catabolic pr...	388	190	93.18	2.0e-24
GO:0051301	cell division	640	284	153.70	1.7e-20
GO:0006120	mitochondrial electron transport, NADH to ubiquinone	42	37	10.09	2.4e-18
GO:0032543	mitochondrial translation	130	76	31.22	4.2e-17
GO:0000398	mRNA splicing, via spliceosome	304	200	73.01	5.1e-15
GO:0006338	chromatin remodeling	655	285	157.30	9.6e-14
GO:0032981	mitochondrial respiratory chain complex I assembly	55	39	13.21	2.6e-13
GO:0045893	positive regulation of DNA-templated transcription	1677	556	402.74	9.4e-13
GO:1903241	U2-type prespliceosome assembly	25	23	6.00	9.6e-13
GO:0070936	protein K48-linked ubiquitination	82	53	19.69	2.3e-12
GO:0006886	intracellular protein transport	671	307	161.14	1.1e-11
GO:0006281	DNA repair	598	296	143.61	1.7e-11
GO:0006406	mRNA export from nucleus	71	48	17.05	3.0e-11
GO:0006325	chromatin organization	798	358	191.64	8.3e-11
GO:0050821	protein stabilization	211	91	50.67	6.6e-10
GO:0090148	membrane fission	44	30	10.57	7.0e-10
GO:0006446	regulation of translational initiation	84	57	20.17	1.2e-09
GO:0008380	RNA splicing	451	280	108.31	1.5e-09

Annotated: number of genes (out of all.genes) that are annotated with that GO term
Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
Fisher: (Raw) p-value from Fisher’s Exact Test

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$orig.ident)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent.mito + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupLRTI_WRK1	S.Score	G2M.Score	percent.mito	nFeature_RNA
AAACCAGCACATTAGC+LRTI_WRK1	1	0.1586883	1.0371359	3.202255	5479
AAACCCATCAATGTAC+LRTI_WRK1	1	0.7347704	0.6561995	3.305649	6910
AAACCGTGTCGTACCT+LRTI_WRK1	1	0.8925801	0.4061249	3.362421	4621
AAACGGCCAGGTACAA+LRTI_WRK1	1	0.5515437	0.8579274	4.211121	4638
AAACTTATCGGTTCGG+LRTI_WRK1	1	0.5691890	0.9509525	5.031228	7667
AAAGGGTTCATCAGGC+LRTI_WRK1	1	0.8256540	0.3402400	6.204631	6241

tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupLRTI_WRK2	proper.groupLRTI_WRK3	proper.groupLRTI_WRK4	S.Score	G2M.Score	percent.mito	nFeature_RNA
GCCCATTGTCACGGGT+LRTI_WRK2	1	0	0	0.0361409	-0.0412531	2.910205	2253
GGCTTAATCGGCGATT+LRTI_WRK2	1	0	0	-0.0639132	-0.0778717	4.119639	1131
GGTACTATCGTAGCGT+LRTI_WRK2	1	0	0	-0.0276898	0.0288580	2.727273	2167
CTCGCCAGTACTATGA+LRTI_WRK3	0	1	0	-0.0426008	-0.0503667	2.546757	6586
AATCCGTCAAAGCGTA+LRTI_WRK4	0	0	1	0.4301881	0.2368817	2.721987	7958
CACACTCGTACCACCT+LRTI_WRK4	0	0	1	0.0814435	-0.0556294	5.963636	836

# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupLRTI_WRK1	proper.groupLRTI_WRK2	proper.groupLRTI_WRK3	proper.groupLRTI_WRK4	S.Score	G2M.Score	percent.mito	nFeature_RNA
LINC01128	0.0104547	0.1828878	0.0116083	-0.0300109	0.0698463	0.0062522	0.0064823	-3.80e-06
NOC2L	-0.0822753	0.1396044	-0.0935686	-0.0203252	0.1025678	-0.0104968	0.0069699	5.26e-05
KLHL17	0.0713555	0.0281549	0.0366542	0.1126129	0.0452909	-0.0207237	-0.0065486	-2.90e-06
ISG15	1.6378853	0.6244244	0.2260651	1.9633809	-0.3275326	-0.0656142	-0.0763379	3.31e-05
C1orf159	0.0279527	0.0208602	0.8150750	-0.0640302	0.1261940	0.0248038	0.0134276	-6.60e-06
TNFRSF18	1.1823790	1.4462475	1.4931473	0.1115185	-0.2780431	-0.2435467	-0.0291735	2.46e-05

# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupLRTI_WRK1	 - proper.groupLRTI_WRK2	, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupLRTI_WRK1 - proper.groupLRTI_WRK2
proper.groupLRTI_WRK1	1
proper.groupLRTI_WRK2	-1
proper.groupLRTI_WRK3	0
proper.groupLRTI_WRK4	0
S.Score	0
G2M.Score	0
percent.mito	0
nFeature_RNA	0

fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	logFC	AveExpr	t	adj.P.Val	B
VAV3	-1.7508024	0.3110052	-13.236840	0e+00	62.96266
TBC1D4	-1.3900664	0.2833838	-11.568702	0e+00	48.94408
HDAC9	-1.2517491	0.1753274	-10.333661	0e+00	39.11020
IL1R1	-0.9995498	0.1300682	-10.226861	0e+00	38.28639
CCR8	-0.8353974	0.1163754	-10.036890	0e+00	36.83248
EML4	1.5052329	1.7012424	9.561668	0e+00	33.26267
ARPC2	1.1548512	2.4380833	9.464631	0e+00	32.54606
MT-ND2	-1.0490250	2.3930152	-8.952967	0e+00	28.84073
LAYN	-0.7911447	0.1180913	-8.823339	0e+00	27.92232
RPS4Y1	1.0730097	1.1304680	8.153916	0e+00	23.31908
ENSG00000227240	-1.2591894	0.2575334	-7.784149	0e+00	20.88238
CAPZB	1.0079794	1.8103635	7.756462	0e+00	20.70311
TXNIP	-1.2963896	0.6787166	-7.580633	0e+00	19.57527
CAPZA1	0.9017427	1.2495491	7.516820	0e+00	19.17053
FTL	-1.2627828	2.1523660	-7.179795	0e+00	17.07449
FRYL	0.9189332	0.8948729	7.026423	0e+00	16.14428
UTY	0.9005716	0.5991670	6.983129	0e+00	15.88444
SGPP1	-0.6447435	0.2190291	-6.961139	0e+00	15.75292
IL2RG	1.0240127	1.8196712	6.898402	0e+00	15.37944
GIMAP7	1.4491604	1.4094892	6.831560	0e+00	14.98436
PTPRM	-0.7061117	0.1497160	-6.752201	0e+00	14.51912
IRF1	0.9660324	1.1877894	6.728966	0e+00	14.38370
MCF2L2	-0.7044998	0.2238412	-6.688666	0e+00	14.14968
DEF6	0.8446568	0.9440441	6.655879	0e+00	13.96009
TUT7	0.8658212	0.7549669	6.647806	0e+00	13.91351
TTN	-0.7275733	0.1723645	-6.544020	1e-07	13.31871
UTRN	1.0748347	1.2238449	6.437446	1e-07	12.71557
CELF2	1.1100942	1.7269740	6.408570	2e-07	12.55350
PPP1R18	0.7727838	1.0353629	6.363937	2e-07	12.30411
SYNRG	0.8090564	0.8666973	6.330475	2e-07	12.11806

Output columns:

logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
AveExpr: Average expression across all cells in expr2
t: logFC divided by its standard error
P.Value: Raw p-value (based on t) from test that logFC differs from 0
adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
B: log-odds that gene is DE

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

Prepare for the next section

Session Information

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.19.1  kableExtra_1.4.0     dplyr_1.1.4         
##  [4] topGO_2.56.0         SparseM_1.84-2       GO.db_3.19.1        
##  [7] AnnotationDbi_1.66.0 IRanges_2.38.1       S4Vectors_0.42.1    
## [10] Biobase_2.64.0       graph_1.82.0         BiocGenerics_0.50.0 
## [13] limma_3.60.4         Seurat_5.1.0         SeuratObject_5.0.2  
## [16] sp_2.1-4            
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] magrittr_2.0.3          spatstat.utils_3.1-0    rmarkdown_2.28         
##   [7] zlibbioc_1.50.0         vctrs_0.6.5             ROCR_1.0-11            
##  [10] memoise_2.0.1           spatstat.explore_3.3-2  htmltools_0.5.8.1      
##  [13] sass_0.4.9              sctransform_0.4.1       parallelly_1.38.0      
##  [16] KernSmooth_2.23-24      bslib_0.8.0             htmlwidgets_1.6.4      
##  [19] ica_1.0-3               plyr_1.8.9              plotly_4.10.4          
##  [22] zoo_1.8-12              cachem_1.1.0            igraph_2.0.3           
##  [25] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [28] Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
##  [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.2-1      future_1.34.0          
##  [34] shiny_1.9.1             digest_0.6.37           colorspace_2.1-1       
##  [37] patchwork_1.2.0         tensor_1.5              RSpectra_0.16-2        
##  [40] irlba_2.3.5.1           RSQLite_2.3.7           progressr_0.14.0       
##  [43] fansi_1.0.6             spatstat.sparse_3.1-0   httr_1.4.7             
##  [46] polyclip_1.10-7         abind_1.4-5             compiler_4.4.1         
##  [49] bit64_4.0.5             DBI_1.2.3               fastDummies_1.7.4      
##  [52] highr_0.11              MASS_7.3-61             tools_4.4.1            
##  [55] lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.2    
##  [58] goftest_1.2-3           glue_1.7.0              nlme_3.1-166           
##  [61] promises_1.3.0          grid_4.4.1              Rtsne_0.17             
##  [64] cluster_2.1.6           reshape2_1.4.4          generics_0.1.3         
##  [67] gtable_0.3.5            spatstat.data_3.1-2     tidyr_1.3.1            
##  [70] data.table_1.16.0       xml2_1.3.6              XVector_0.44.0         
##  [73] utf8_1.2.4              spatstat.geom_3.3-2     RcppAnnoy_0.0.22       
##  [76] ggrepel_0.9.5           RANN_2.6.2              pillar_1.9.0           
##  [79] stringr_1.5.1           spam_2.10-0             RcppHNSW_0.6.0         
##  [82] later_1.3.2             splines_4.4.1           lattice_0.22-6         
##  [85] bit_4.0.5               survival_3.7-0          deldir_2.0-4           
##  [88] tidyselect_1.2.1        Biostrings_2.72.1       miniUI_0.1.1.1         
##  [91] pbapply_1.7-2           knitr_1.48              gridExtra_2.3          
##  [94] svglite_2.1.3           scattermore_1.2         xfun_0.47              
##  [97] statmod_1.5.0           matrixStats_1.3.0       UCSC.utils_1.0.0       
## [100] stringi_1.8.4           lazyeval_0.2.2          yaml_2.3.10            
## [103] evaluate_0.24.0         codetools_0.2-20        tibble_3.2.1           
## [106] cli_3.6.3               uwot_0.2.2              systemfonts_1.1.0      
## [109] xtable_1.8-4            reticulate_1.38.0       munsell_0.5.1          
## [112] jquerylib_0.1.4         GenomeInfoDb_1.40.1     Rcpp_1.0.13            
## [115] globals_0.16.3          spatstat.random_3.3-1   png_0.1-8              
## [118] spatstat.univar_3.0-0   parallel_4.4.1          blob_1.2.4             
## [121] ggplot2_3.5.1           dotCall64_1.1-1         listenv_0.9.1          
## [124] viridisLite_0.4.2       scales_1.3.0            ggridges_0.5.6         
## [127] crayon_1.5.3            leiden_0.4.3.1          purrr_1.0.2            
## [130] rlang_1.1.4             KEGGREST_1.44.1         cowplot_1.1.3

☰ Menu

Single Cell RNA-Seq Analysis