One of the most intuitive ways to evaluate a gene expression change is using Differential Expression (DE) analysis. Traditionally, DE analysis focuses on identifying genes that are up- or down-regulated (increased or decreased expression) between conditions, typically employing a basic mean-difference approach. We propose a paradigm shift that acknowledges the central role of gene expression variability in cellular function and challenges the current dominance of mean-based DE analysis in single-cell studies. We suggest that scRNA-seq data analysis should embrace the role of inherent gene expression variability in defining cellular function and move beyond mean-based approaches.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("Xenon8778/SplineDV")
For this vignette, we’ll use SplineDV in conjunction with the scRNAseq package. While standard data preprocessing can be applied beforehand, the package itself handles quality control internally, requiring only a raw count matrix as input.
library(scRNAseq)
library(SplineDV)
Spline-DV is designed to compare expression variability between two experimental conditions. To illustrate this, we’ll utilize a publicly available mouse lung single-cell RNA-seq dataset from the scRNAseq Bioconductor package by Zilionis et al. [1]. This dataset includes tumor-infiltrating myeloid cells from both tumor and healthy conditions.
sce <- fetchDataset('zilionis-lung-2019','2023-12-20', path="mouse")
sce
## class: SingleCellExperiment
## dim: 28205 17549
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):
Let us see the number of cells per condition and and the available cell types in the scRNA-seq dataset.
table(sce$`Major cell type`,sce$`Tissue`)
##
## healthy tumor
## B cells 939 1874
## Basophils 11 23
## MoMacDC 629 1768
## NK cells 230 400
## Neutrophils 4429 3593
## T cells 490 1491
## pDC 10 52
The expression matrix should be formatted with the genes/features as rows and cells as columns. The two input matrices for splineDV must be stored as two separate count expression matrices or SingleCellExperiment objects. We can isolate the two experimental conditions. To focus solely on changes within a specific cell type and eliminate potential variability arising from shifts in cell type distribution, we select only Neutrophils for further analysis.
# Extract Healthy data
healthyCount <- sce[, sce$Tissue == "healthy"]
# Extract Tumor data
tumorCount <- sce[, sce$Tissue == "tumor"]
# Select a specific cell type - Neutrophils
healthyCount <- healthyCount[, which(healthyCount$`Major cell type` == "Neutrophils")]
print(healthyCount)
## class: SingleCellExperiment
## dim: 28205 4429
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):
tumorCount <- tumorCount[, which(tumorCount$`Major cell type` == "Neutrophils")]
print(tumorCount)
## class: SingleCellExperiment
## dim: 28205 3593
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):
Given our isolation of a single cell type, any observed changes in gene expression variability should be attributed to experimental factors. To mitigate the potential impact of background distribution shifts caused by varying cell type and state proportions, we recommend conducting Spline-DV analysis in a cell state/type-specific manner.
The main function of SplineDV ios the splineDV function. For the analysis, the test data (X) is always use in contrast with the control data (Y). We use smaller QC parameters for the small example data sets to preserve enough cells and genes. We recommend using the default QC parameters for large data sets.
dvRes <- splineDV(X = tumorCount, Y = healthyCount)
## Computing distances for Data 1
## Performing normalization...
## Computing distances...
## HVG computed
## Computing distances for Data 2
## Performing normalization...
## Computing distances...
## HVG computed
head(dvRes)
## DataFrame with 6 rows and 19 columns
## genes mu1 mu2 CV1 CV2 drop1 drop2
## <character> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 S100a6 2.423091 3.3651806 1.069873 0.623584 0.4375388 0.0482293
## 2 S100a9 3.388193 4.1249288 0.734337 0.490492 0.1233686 0.0139795
## 3 Fth1 3.245193 2.5737118 0.677005 0.534658 0.0264139 0.0272600
## 4 Spp1 0.696282 0.0304492 2.006586 2.309741 0.8713487 0.9804287
## 5 S100a8 3.340410 3.9502114 0.700011 0.491530 0.0873213 0.0118826
## 6 Retnlg 1.489806 2.8408711 1.346040 0.800264 0.7268490 0.1817335
## dist1 dist2 X_splinex X_spliney X_splinez Y_splinex Y_spliney
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 0.450291 1.175883 2.088791 0.784693 0.231588 2.1523963 0.625115
## 2 1.213435 1.940386 2.178329 0.768020 0.210077 2.1916912 0.616329
## 3 1.086377 0.402181 2.171898 0.769218 0.211622 2.0816899 0.640936
## 4 0.613273 0.008306 0.679678 1.128363 0.634591 0.0304185 2.200492
## 5 1.170524 1.766433 2.175922 0.768469 0.210656 2.1841718 0.618010
## 6 0.538217 0.677298 1.614046 0.874726 0.347494 2.1047751 0.635768
## Y_splinez vectorDist Direction Pval FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 0.133051 1.020089 -1 1.24843e-97 6.69908e-94
## 2 0.122861 0.729556 -1 3.77682e-49 1.01332e-45
## 3 0.151399 0.690545 1 6.40575e-44 1.14578e-40
## 4 0.976790 0.605396 1 1.47841e-33 1.98329e-30
## 5 0.124811 0.599252 -1 7.27451e-33 7.80700e-30
## 6 0.145405 0.586091 -1 2.08436e-31 1.86411e-28
The output is a DataFrame containing statistical measures for each gene in the differential variability (DV) analysis. Genes with large vectorDist values exhibit substantial changes in expression variability between the two experimental conditions. Genes with FDR values below 0.05 are considered significantly differentially variable. The Direction column indicates whether the variability increased or decreased. By leveraging this information, we can identify biologically relevant genes that not only demonstrate shifts in mean expression but also alterations in expression variability across conditions.
To visualize the differential variability of genes, we can employ a 3D scatter plot. The DVPlot
function allows us to visualize the computed spline for each condition, represented by a distinct color, along with the distance vectors of genes from their respective spline. This plot provides a clear representation of the shift in expression variability between the two conditions.
fig <- DVPlot(dvRes, targetgene='Spp1')
fig
Next, we’ll introduce the Spline HVG algorithm, a feature selection technique designed to identify Highly Variable Genes (HVGs) in scRNA-seq data. This algorithm computes the distance from the spline to identify genes with significant variability. To demonstrate this, we’ll utilize the same sample dataset of healthy neutrophils from mice lungs used in the previous section.
# Healthy neutrophils data
print(healthyCount)
## class: SingleCellExperiment
## dim: 28205 4429
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):
The expression matrix should be formatted with the genes/features as rows and cells as columns. The input matrix for splineHVG
must be stored as a count expression matrices or SingleCellExperiment objects. To focus solely on changes within a specific cell type and eliminate potential variability arising from shifts in cell type distribution, we select only Neutrophils for further analysis. Smaller QC parameters can be used for the small data sets (e.g., cells < 500 and genes < 5000) to preserve enough cells and genes for splineHVG analysis. However, we recommend using the default QC parameters for large data sets, if not more stringent.
HVGRes <- splineHVG(healthyCount, nHVGs = 200)
## QC done
## Performing normalization...
## Computing distances...
## HVG computed
head(HVGRes)
## DataFrame with 6 rows and 14 columns
## Means CV Dropout logMean logCV Distance nearidx
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## S100a9 60.8722 0.633138 0.0139795 4.12507 0.490503 1.932288 16245
## S100a8 50.9572 0.634967 0.0118826 3.95042 0.491622 1.758383 16245
## S100a6 27.9395 0.865402 0.0482293 3.36521 0.623477 1.167742 16245
## Srgn 25.2401 0.589972 0.0263281 3.26729 0.463716 1.082339 16245
## Fos 19.0464 0.587785 0.0163094 2.99805 0.462340 0.819475 16245
## Il1b 18.3003 0.616345 0.0214352 2.96012 0.480168 0.778621 16245
## dvecx dvecy dvecz splinex spliney splinez HVG
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <logical>
## S100a9 1.925290 -0.12444880 -0.1072863 2.19978 0.614952 0.121266 FALSE
## S100a8 1.750639 -0.12332955 -0.1093833 2.19217 0.616646 0.123230 FALSE
## S100a6 1.165425 0.00852479 -0.0730366 2.16000 0.623812 0.131539 FALSE
## Srgn 1.067508 -0.15123556 -0.0949378 2.15177 0.625644 0.133663 FALSE
## Fos 0.798269 -0.15261184 -0.1049564 2.13792 0.628731 0.137243 FALSE
## Il1b 0.760340 -0.13478441 -0.0998306 2.13182 0.630090 0.138818 FALSE
The output is a DataFrame containing statistical measures for each gene in the Spline HVG analysis. Genes with large Distance values exhibit significant deviation from expected behavior, as represented by the 3D spline. The top n HVGs can be identified by sorting the DataFrame by Distance in descending order. These highly variable genes, which are biologically relevant, can be utilized in various downstream analyses, such as for dimensional reduction and PCA.
# Extracting HVG Gene list
HVGList <- rownames(HVGRes)[HVGRes$HVG == TRUE]
head(HVGList)
## [1] "Retnlg" "Lyz2" "Thbs1" "Slfn4" "Rsad2" "Svil"
To visualize the highly variable genes, we can employ a 3D scatter plot. The HVGPlot
function allows us to visualize the computed spline, which represents the expected gene behavior. Each dot on the plot represents a single gene. This plot provides a clear representation of HVGs (highlighted in red), which deviate significantly from the 3D spline, indicating a substantial shift in their gene expression.
fig <- HVGPlot(HVGRes)
fig
SplineDV is a valuable tool for single-cell RNA sequencing (scRNA-seq) analysis, specifically designed to identify differential gene expression variability between conditions. By examining changes in expression dispersion, SplineDV complements traditional differential expression analysis methods. We recommend using SplineDV in conjunction with other tools from Seurat and Bioconductor to gain a more comprehensive understanding of biological processes.
citation("SplineDV")
## To cite package 'SplineDV' in publications use:
##
## Gatlin V, Gupta S, Romero S, Chapkin R, Cai J (2024). "Beyond
## Differential Expression: Embracing Cell-to-Cell Variability in
## Single-Cell Gene Expression Data Analysis." _bioRxiv_.
## doi:10.1101/2024.08.08.607086
## <https://doi.org/10.1101/2024.08.08.607086>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Beyond Differential Expression: Embracing Cell-to-Cell Variability in Single-Cell Gene Expression Data Analysis},
## author = {Victoria Gatlin and Shreyan Gupta and Selim Romero and Robert Chapkin and James Cai},
## journal = {bioRxiv},
## year = {2024},
## doi = {https://doi.org/10.1101/2024.08.08.607086},
## }
This is the output of sessionInfo()
on the system on which this document was compiled:
date()
## [1] "Wed Nov 20 19:10:54 2024"
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SplineDV_0.99.11 scRNAseq_2.21.0
## [3] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.1 IRanges_2.41.1
## [9] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [11] generics_0.1.3 MatrixGenerics_1.19.0
## [13] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 magrittr_2.0.3
## [3] GenomicFeatures_1.59.1 gypsum_1.3.0
## [5] farver_2.1.2 rmarkdown_2.29
## [7] BiocIO_1.17.0 zlibbioc_1.53.0
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] Rsamtools_2.23.0 DelayedMatrixStats_1.29.0
## [13] RCurl_1.98-1.16 htmltools_0.5.8.1
## [15] S4Arrays_1.7.1 AnnotationHub_3.15.0
## [17] curl_6.0.1 Rhdf5lib_1.29.0
## [19] SparseArray_1.7.2 rhdf5_2.51.0
## [21] sass_0.4.9 alabaster.base_1.7.2
## [23] bslib_0.8.0 htmlwidgets_1.6.4
## [25] alabaster.sce_1.7.0 httr2_1.0.6
## [27] plotly_4.10.4 cachem_1.1.0
## [29] GenomicAlignments_1.43.0 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 Matrix_1.7-1
## [33] R6_2.5.1 fastmap_1.2.0
## [35] GenomeInfoDbData_1.2.13 digest_0.6.37
## [37] colorspace_2.1-1 AnnotationDbi_1.69.0
## [39] ExperimentHub_2.15.0 crosstalk_1.2.1
## [41] RSQLite_2.3.8 beachmat_2.23.1
## [43] filelock_1.0.3 fansi_1.0.6
## [45] httr_1.4.7 abind_1.4-8
## [47] compiler_4.5.0 bit64_4.5.2
## [49] withr_3.0.2 BiocParallel_1.41.0
## [51] DBI_1.2.3 HDF5Array_1.35.1
## [53] alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
## [55] rappdirs_0.3.3 DelayedArray_0.33.2
## [57] rjson_0.2.23 tools_4.5.0
## [59] glue_1.8.0 restfulr_0.0.15
## [61] rhdf5filters_1.19.0 grid_4.5.0
## [63] gtable_0.3.6 tidyr_1.3.1
## [65] ensembldb_2.31.0 data.table_1.16.2
## [67] utf8_1.2.4 XVector_0.47.0
## [69] BiocVersion_3.21.1 pillar_1.9.0
## [71] dplyr_1.1.4 BiocFileCache_2.15.0
## [73] lattice_0.22-6 rtracklayer_1.67.0
## [75] bit_4.5.0 tidyselect_1.2.1
## [77] Biostrings_2.75.1 scuttle_1.17.0
## [79] knitr_1.49 bookdown_0.41
## [81] ProtGenerics_1.39.0 xfun_0.49
## [83] UCSC.utils_1.3.0 lazyeval_0.2.2
## [85] yaml_2.3.10 evaluate_1.0.1
## [87] codetools_0.2-20 tibble_3.2.1
## [89] alabaster.matrix_1.7.2 BiocManager_1.30.25
## [91] cli_3.6.3 munsell_0.5.1
## [93] jquerylib_0.1.4 Rcpp_1.0.13-1
## [95] dbplyr_2.5.0 png_0.1-8
## [97] XML_3.99-0.17 parallel_4.5.0
## [99] ggplot2_3.5.1 blob_1.2.4
## [101] AnnotationFilter_1.31.0 sparseMatrixStats_1.19.0
## [103] bitops_1.0-9 alabaster.se_1.7.0
## [105] viridisLite_0.4.2 scales_1.3.0
## [107] purrr_1.0.2 crayon_1.5.3
## [109] rlang_1.1.4 KEGGREST_1.47.0