Contents


Based on Gilis et al. (2023)

Gilis J, Perin L, Malfait M, Van den Berge K,
Assefa AT, Verbist B, Risso D, and Clement L:
Differential detection workflows for
multi-sample single-cell RNA-seq data.
bioRxiv (2023). DOI: 10.1101/2023.12.17.572043

Load packages

library(dplyr)
library(purrr)
library(tidyr)
library(scater)
library(muscat)
library(ggplot2)
library(patchwork)

1 Introduction

Single-cell RNA-sequencing (scRNA-seq) has improved our understanding of complex biological processes by elucidating cell-level heterogeneity in gene expression. One of the key tasks in the downstream analysis of scRNA-seq data is studying differential gene expression (DE). Most DE analysis methods aim to identify genes for which the average expression differs between biological groups of interest, e.g., between cell types or between diseased and healthy cells. As such, most methods allow for assessing only one aspect of the gene expression distribution: the mean. However, in scRNA-seq data, differences in other characteristics between count distributions can commonly be observed.

One such characteristic is gene detection, i.e., the number of cells in which a gene is (detectably) expressed. Analogous to a DE analysis, a differential detection (DD) analysis aims to identify genes for which the average fraction of cells in which the gene is detected changes between groups. In Gilis et al. (2023), we show how DD analysis contain information that is biologically relevant, and that is largely orthogonal to the information obtained from DE analysis on the same data.

In this vignette, we display how muscat can be used to perform DD analyses in multi-sample, multi-group, multi-(cell-)subpopulation scRNA-seq data. Furthermore, we show how DD and DS analysis results on the same data can be effectively combined using a two-stage testing approach. This workflow thus allows users to jointly assess two biological hypotheses containing orthogonal information, which thus can be expected to improve their understanding of complex biological phenomena, at no extra cost.

2 Setup

We will use the same data as in the differential state (DS) analyses described in muscat, namely, scRNA-seq data acquired on PBMCs from 8 patients before and after IFN-\(\beta\) treatment. For a more detailed description of these data and subsequent preprocessing, we refer to muscat.

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "Kang")
## ExperimentHub with 3 records
## # snapshotDate(): 2024-11-13
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: character, SingleCellExperiment, BSseq
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1661"]]' 
## 
##            title                                               
##   EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH2259 | Kang18_8vs8
(sce <- eh[["EH2259"]])
## class: SingleCellExperiment 
## dim: 35635 29065 
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
##   TTTGCATGTCTTAC-1
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):

We further apply some minimal filtering to remove low-quality genes and cells, and use prepSCE() to standardize cell metadata such that slots specifying cluster (cell), sample (stim+ind), and group (stim) identifiers conform with the muscat framework:

sce <- sce[rowSums(counts(sce) > 0) > 0, ]
qc <- perCellQCMetrics(sce)
sce <- sce[, !isOutlier(qc$detected, nmads=2, log=TRUE)]
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
sce$id <- paste0(sce$stim, sce$ind)
sce <- prepSCE(sce, "cell", "id", "stim")
table(sce$cluster_id, sce$group_id)
##                    
##                     ctrl stim
##   B cells           1422 1313
##   CD14+ Monocytes   2855 2662
##   CD4 T cells       5874 5897
##   CD8 T cells       1369 1188
##   Dendritic cells     90  130
##   FCGR3A+ Monocytes  698  791
##   Megakaryocytes     117  130
##   NK cells          1038 1246
table(sce$sample_id)
## 
##  ctrl101 ctrl1015 ctrl1016 ctrl1039  ctrl107 ctrl1244 ctrl1256 ctrl1488 
##      912     2880     2062      436      586     2091     2267     2229 
##  stim101 stim1015 stim1016 stim1039  stim107 stim1244 stim1256 stim1488 
##     1197     2494     1901      639      578     1642     2127     2779

2.1 Aggregation

In general, aggregateData() will aggregate the data by the colData variables specified with argument by, and return a SingleCellExperiment containing pseudobulk data.

To perform a pseudobulk-level analysis, measurements must be aggregated at the cluster-sample level (default by = c("cluster_id", "sample_id"). In this case, the returned SingleCellExperiment will contain one assay per cluster, where rows = genes and columns = samples. Arguments assay and fun specify the input data and summary statistic, respectively, to use for aggregation.

In a differential detection (DD) analysis, the default choice of the summary statistic used for aggregation is fun = "num.detected". This strategy can be thought of as first binarizing the gene expression values (1: expressed, 0: not expressed), and subsequently performing a simple summation of the binarized gene expression counts for cells belonging to the same cluster-sample level. Hence, the resulting pseudobulk-level expression count reflects the total number of cells in a particular cluster-sample level with a non-zero gene expression value.

In a differential state (DS) analysis, the default choice for aggregation is fun = "sum", which amounts to the simple summation of the raw gene expression counts of cells belonging to the same cluster-sample level.

pb_sum <- aggregateData(sce,
    assay="counts", fun="sum",
    by=c("cluster_id", "sample_id"))
pb_det <- aggregateData(sce,
    assay="counts", fun="num.detected",
    by=c("cluster_id", "sample_id"))
t(head(assay(pb_det)))
##          NOC2L HES4 ISG15 TNFRSF18 TNFRSF4 SDF4
## ctrl101     11    0    10        7       1   10
## ctrl1015    37    4    86       43      10   39
## ctrl1016    12    2    30        7       0   15
## ctrl1039     2    1    10        3       1    1
## ctrl107      5    0     4        3       1    5
## ctrl1244    17    5    14       17       4    6
## ctrl1256    21    1    45       22       6   16
## ctrl1488    16    1    28       18       3   20
## stim101     12    8   142        3       1    8
## stim1015    32   26   356       16       2   16
## stim1016     5    7   129        1       3    6
## stim1039     2    2    39        1       1    1
## stim107      3    2    56        1       0    3
## stim1244    11    9    93        5       2    5
## stim1256    14    8   210        8       1    6
## stim1488    16    4   282        9       1   15

Qiu (2020) demonstrated that binarizing scRNA-seq counts generates expression profiles that still accurately reflect biological variation. This finding was confirmed by Bouland, Mahfouz, and Reinders (2021), who showed that the frequencies of zero counts capture biological variability, and further claimed that a binarized representation of the single-cell expression data allows for a more robust description of the relative abundance of transcripts than counts.

pbMDS(pb_sum) + ggtitle("Σ counts") +
pbMDS(pb_det) + ggtitle("# detected") +
plot_layout(guides="collect") +
plot_annotation(tag_levels="A") &
theme(legend.key.size=unit(0.5, "lines"))
Pseudobulk-level multidimensional scaling (MDS) plot based on (A) sum of counts and (B) sum of binarized counts (i.e., counting the number of detected features) in each cluster-sample.

Figure 1: Pseudobulk-level multidimensional scaling (MDS) plot based on (A) sum of counts and (B) sum of binarized counts (i.e., counting the number of detected features) in each cluster-sample

2.2 Analysis

Once we have assembled the pseudobulk data, we can test for DD using pbDD(). By default, a \(\sim\)group_id model is fit, and the last coefficient of the linear model is tested to be equal to zero.

res_DD <- pbDD(pb_det, min_cells=0, filter="none", verbose=FALSE)

2.3 Handling and visualizing results

Inspection, manipulation, and visualization of DD analysis results follows the same principles as for a DS analysis. For a detailed description, we refer to the DS analysis vignettemuscat. Below, some basic functionalities are being displayed.

tbl <- res_DD$table[[1]]
# one data.frame per cluster
names(tbl)
## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"
# view results for 1st cluster
k1 <- tbl[[1]]
head(format(k1[, -ncol(k1)], digits = 2))
##       gene cluster_id    logFC logCPM       F    p_val p_adj.loc p_adj.glb
## 1    NOC2L    B cells -3.0e-01    7.6 2.4e+00  1.2e-01   2.9e-01   3.1e-01
## 2     HES4    B cells  2.2e+00    6.5 3.5e+01  1.1e-08   2.4e-07   2.8e-07
## 3    ISG15    B cells  2.6e+00   10.3 1.1e+03  9.3e-88   1.1e-84   8.6e-84
## 4 TNFRSF18    B cells -1.4e+00    7.3 3.4e+01  1.6e-08   3.5e-07   4.0e-07
## 5  TNFRSF4    B cells -1.1e+00    5.8 5.7e+00  1.8e-02   7.5e-02   8.0e-02
## 6     SDF4    B cells -8.4e-01    7.3 1.5e+01  1.7e-04   1.7e-03   1.8e-03
# filter FDR < 5%, |logFC| > 1 & sort by adj. p-value
tbl_fil <- lapply(tbl, \(u)
    filter(u, 
        p_adj.loc < 0.05, 
        abs(logFC) > 1) |>
        arrange(p_adj.loc))

# nb. of DS genes & % of total by cluster
n_de <- vapply(tbl_fil, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DD" = n_de, "%DD" = p_de, check.names = FALSE)
##                    #DD   %DD
## B cells            715 10.04
## CD14+ Monocytes   1982 27.84
## CD4 T cells        498  7.00
## CD8 T cells        333  4.68
## Dendritic cells    277  3.89
## FCGR3A+ Monocytes 1110 15.59
## Megakaryocytes     109  1.53
## NK cells           423  5.94
library(UpSetR)
de_gs_by_k <- map(tbl_fil, "gene")
upset(fromList(de_gs_by_k))

3 Stagewise anaysis

While DD analysis results may contain biologically relevant information in their own right, we show in Gilis et al. (2023) that combing DD and DS analysis results on the same data can further improve our understanding of complex biological phenomena. In the remainder of this vignette, we show how DD and DS analysis results on the same data can be effectively combined.

For this, we build on the two-stage testing paradigm proposed by Van den Berge et al. (2017). In the first stage of this testing procedure, we identify differential genes by using an omnibus test for differential detection and differential expression (DE). The null hypothesis for this test is that the gene is neither differentially detected, nor differentially expressed.

In the second stage, we perform post-hoc tests on the differential genes from stage one to unravel whether they are DD, DE or both. Compared to the individual DD and DS analysis results, the two-stage approach increases statistical power and provides better type 1 error control.

res_DS <- pbDS(pb_sum, min_cells=0, filter="none", verbose=FALSE)
res <- stagewise_DS_DD(res_DS, res_DD, verbose=FALSE)
head(res[[1]][[1]]) # results for 1st cluster
## DataFrame with 6 rows and 8 columns
##          gene       p_adj    p_val.DS    p_val.DD  cluster_id    contrast
##   <character>   <numeric>   <numeric>   <numeric> <character> <character>
## 1       NOC2L 3.73143e-01          NA          NA     B cells        stim
## 2        HES4 4.71932e-07 1.35239e-05 5.53293e-08     B cells        stim
## 3       ISG15 2.17494e-84 2.62823e-32 4.67394e-87     B cells        stim
## 4    TNFRSF18 6.86768e-07 1.00944e-04 8.19769e-08     B cells        stim
## 5     TNFRSF4 1.24236e-02 4.70871e-03 9.07348e-02     B cells        stim
## 6        SDF4 3.21728e-03 6.89599e-02 8.71664e-04     B cells        stim
##                           res_DS                         res_DD
##                     <data.frame>                   <data.frame>
## 1    NOC2L:B cells:-0.208711:...    NOC2L:B cells:-0.303141:...
## 2     HES4:B cells: 2.225287:...     HES4:B cells: 2.201890:...
## 3    ISG15:B cells: 5.521586:...    ISG15:B cells: 2.568004:...
## 4 TNFRSF18:B cells:-1.271499:... TNFRSF18:B cells:-1.382451:...
## 5  TNFRSF4:B cells:-1.645505:...  TNFRSF4:B cells:-1.126530:...
## 6     SDF4:B cells:-0.612369:...     SDF4:B cells:-0.844448:...

3.1 Comparison

# for each approach, get adjusted p-values across clusters
ps <- map_depth(res, 2, \(df) {
    data.frame(
        df[, c("gene", "cluster_id")],
        p_adj.stagewise=df$p_adj,
        p_adj.DS=df$res_DS$p_adj.loc,
        p_adj.DD=df$res_DD$p_adj.loc)
}) |> 
    lapply(do.call, what=rbind) |>
    do.call(what=rbind) |>
    data.frame(row.names=NULL)
head(ps)
##       gene cluster_id p_adj.stagewise     p_adj.DS     p_adj.DD
## 1    NOC2L    B cells    3.731430e-01 6.064017e-01 2.863777e-01
## 2     HES4    B cells    4.719323e-07 5.686554e-05 2.399617e-07
## 3    ISG15    B cells    2.174939e-84 1.834502e-29 1.077728e-84
## 4 TNFRSF18    B cells    6.867680e-07 3.576604e-04 3.468337e-07
## 5  TNFRSF4    B cells    1.242358e-02 9.526608e-03 7.494398e-02
## 6     SDF4    B cells    3.217278e-03 8.069408e-02 1.727711e-03

To get an overview of how different approaches compare, we can count the number of genes found differential in each cluster for a given FDR threshold:

# for each approach & cluster, count number 
# of genes falling below 5% FDR threshold
ns <- lapply(seq(0, 0.2, 0.005), \(th) {
    ps |>
        mutate(th=th) |>
        group_by(cluster_id, th) |>
        summarise(
            .groups="drop",
            across(starts_with("p_"), 
            \(.) sum(. < th, na.rm=TRUE)))
}) |> 
    do.call(what=rbind) |>
    pivot_longer(starts_with("p_"))
ggplot(ns, aes(th, value, col=name)) + 
    geom_line(linewidth=0.8, key_glyph="point") +
    geom_vline(xintercept=0.05, lty=2, linewidth=0.4) +
    guides(col=guide_legend(NULL, override.aes=list(size=3))) +
    labs(x="FDR threshold", y="number of significantly\ndifferential genes") +
    facet_wrap(~cluster_id, scales="free_y", nrow=2) + 
    theme_bw() + theme(
        panel.grid.minor=element_blank(),
        legend.key.size=unit(0.5, "lines"))

We can further identify which hits are shared between or unique to a given approach. In the example below, for instance, the vast majority of hits is common to all approaches, many hits are shared between DD and stagewise testing, and only few genes are specific to any one approach:

# subset adjuster p-values for cluster of interest
qs <- ps[grep("CD4", ps$cluster_id), grep("p_", names(ps))]
# for each approach, extract genes at 5% FDR threshold
gs <- apply(qs, 2, \(.) ps$gene[. < 0.05])
# visualize set intersections between approaches
UpSetR::upset(UpSetR::fromList(gs), order.by="freq")
Upset plot of differential findings (FDR < 0.05) across DS, DD, and stagewise analysis for an exemplary cluster; shown are the 50 most frequent interactions.

Figure 2: Upset plot of differential findings (FDR < 0.05) across DS, DD, and stagewise analysis for an exemplary cluster; shown are the 50 most frequent interactions

# extract genes unique to stagewise testing
sw <- grep("stagewise", names(gs))
setdiff(gs[[sw]], unlist(gs[-sw]))
## [1] "OLIG1"   "MRPL39"  "SLC31A1"

Session info

sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] tidyr_1.3.1                 UpSetR_1.4.0               
##  [3] scater_1.35.0               scuttle_1.17.0             
##  [5] muscData_1.21.0             SingleCellExperiment_1.29.1
##  [7] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [9] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
## [11] IRanges_2.41.0              S4Vectors_0.45.1           
## [13] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [15] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
## [17] BiocFileCache_2.15.0        dbplyr_2.5.0               
## [19] BiocGenerics_0.53.2         generics_0.1.3             
## [21] purrr_1.0.2                 muscat_1.21.0              
## [23] limma_3.63.2                ggplot2_3.5.1              
## [25] dplyr_1.1.4                 cowplot_1.1.3              
## [27] patchwork_1.3.0             BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22         splines_4.5.0            filelock_1.0.3          
##   [4] bitops_1.0-9             tibble_3.2.1             lifecycle_1.0.4         
##   [7] Rdpack_2.6.1             edgeR_4.5.0              doParallel_1.0.17       
##  [10] globals_0.16.3           lattice_0.22-6           MASS_7.3-61             
##  [13] backports_1.5.0          magrittr_2.0.3           sass_0.4.9              
##  [16] rmarkdown_2.29           jquerylib_0.1.4          yaml_2.3.10             
##  [19] sctransform_0.4.1        DBI_1.2.3                minqa_1.2.8             
##  [22] RColorBrewer_1.1-3       multcomp_1.4-26          abind_1.4-8             
##  [25] zlibbioc_1.53.0          EnvStats_3.0.0           glmmTMB_1.1.10          
##  [28] TH.data_1.1-2            rappdirs_0.3.3           sandwich_3.1-1          
##  [31] circlize_0.4.16          GenomeInfoDbData_1.2.13  ggrepel_0.9.6           
##  [34] pbkrtest_0.5.3           irlba_2.3.5.1            listenv_0.9.1           
##  [37] parallelly_1.39.0        codetools_0.2-20         DelayedArray_0.33.1     
##  [40] tidyselect_1.2.1         shape_1.4.6.1            UCSC.utils_1.3.0        
##  [43] farver_2.1.2             lme4_1.1-35.5            ScaledMatrix_1.15.0     
##  [46] viridis_0.6.5            jsonlite_1.8.9           GetoptLong_1.0.5        
##  [49] BiocNeighbors_2.1.0      survival_3.7-0           iterators_1.0.14        
##  [52] emmeans_1.10.5           foreach_1.5.2            tools_4.5.0             
##  [55] progress_1.2.3           Rcpp_1.0.13-1            blme_1.0-6              
##  [58] glue_1.8.0               gridExtra_2.3            SparseArray_1.7.1       
##  [61] xfun_0.49                mgcv_1.9-1               DESeq2_1.47.0           
##  [64] stageR_1.29.0            withr_3.0.2              numDeriv_2016.8-1.1     
##  [67] BiocManager_1.30.25      fastmap_1.2.0            boot_1.3-31             
##  [70] fansi_1.0.6              caTools_1.18.3           digest_0.6.37           
##  [73] rsvd_1.0.5               mime_0.12                R6_2.5.1                
##  [76] estimability_1.5.1       colorspace_2.1-1         Cairo_1.6-2             
##  [79] gtools_3.9.5             RSQLite_2.3.7            RhpcBLASctl_0.23-42     
##  [82] utf8_1.2.4               variancePartition_1.37.1 data.table_1.16.2       
##  [85] corpcor_1.6.10           prettyunits_1.2.0        httr_1.4.7              
##  [88] S4Arrays_1.7.1           uwot_0.2.2               pkgconfig_2.0.3         
##  [91] gtable_0.3.6             blob_1.2.4               ComplexHeatmap_2.23.0   
##  [94] XVector_0.47.0           remaCor_0.0.18           htmltools_0.5.8.1       
##  [97] bookdown_0.41            TMB_1.9.15               clue_0.3-66             
## [100] scales_1.3.0             png_0.1-8                fANCOVA_0.6-1           
## [103] reformulas_0.4.0         knitr_1.49               reshape2_1.4.4          
## [106] rjson_0.2.23             curl_6.0.0               coda_0.19-4.1           
## [109] nlme_3.1-166             nloptr_2.1.1             cachem_1.1.0            
## [112] zoo_1.8-12               GlobalOptions_0.1.2      stringr_1.5.1           
## [115] BiocVersion_3.21.1       KernSmooth_2.23-24       parallel_4.5.0          
## [118] vipor_0.4.7              AnnotationDbi_1.69.0     pillar_1.9.0            
## [121] grid_4.5.0               vctrs_0.6.5              gplots_3.2.0            
## [124] BiocSingular_1.23.0      beachmat_2.23.1          xtable_1.8-4            
## [127] cluster_2.1.6            beeswarm_0.4.0           evaluate_1.0.1          
## [130] magick_2.8.5             tinytex_0.54             mvtnorm_1.3-2           
## [133] cli_3.6.3                locfit_1.5-9.10          compiler_4.5.0          
## [136] rlang_1.1.4              crayon_1.5.3             future.apply_1.11.3     
## [139] labeling_0.4.3           plyr_1.8.9               ggbeeswarm_0.7.2        
## [142] stringi_1.8.4            viridisLite_0.4.2        BiocParallel_1.41.0     
## [145] Biostrings_2.75.1        lmerTest_3.1-3           munsell_0.5.1           
## [148] aod_1.3.3                Matrix_1.7-1             hms_1.1.3               
## [151] bit64_4.5.2              future_1.34.0            KEGGREST_1.47.0         
## [154] statmod_1.5.0            rbibutils_2.3            memoise_2.0.1           
## [157] broom_1.0.7              bslib_0.8.0              bit_4.5.0

References

Bouland, Gerard A, Ahmed Mahfouz, and Marcel J T Reinders. 2021. “Differential Analysis of Binarized Single-Cell RNA Sequencing Data Captures Biological Variation.” NAR Genomics and Bioinformatics 3 (4): lqab118.
Gilis, Jeroen, Laura Perin, Milan Malfait, Koen Van den Berge, Alemu Takele Assefa, Bie Verbist, Davide Risso, and Lieven Clement. 2023. “Differential Detection Workflows for Multi-Sample Single-Cell RNA-seq Data.” bioRxiv.
Qiu, Peng. 2020. “Embracing the Dropouts in Single-Cell RNA-seq Analysis.” Nature Communications 11 (1): 1169.
Van den Berge, Koen, Charlotte Soneson, Mark D Robinson, and Lieven Clement. 2017. “stageR: A General Stage-Wise Method for Controlling the Gene-Level False Discovery Rate in Differential Expression and Differential Transcript Usage.” Genome Biology 18 (151).