Contents

1 Introduction

1.1 Beyond averages gene expression changes

One of the most intuitive ways to evaluate a gene expression change is using Differential Expression (DE) analysis. Traditionally, DE analysis focuses on identifying genes that are up- or down-regulated (increased or decreased expression) between conditions, typically employing a basic mean-difference approach. We propose a paradigm shift that acknowledges the central role of gene expression variability in cellular function and challenges the current dominance of mean-based DE analysis in single-cell studies. We suggest that scRNA-seq data analysis should embrace the role of inherent gene expression variability in defining cellular function and move beyond mean-based approaches.

2 Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("Xenon8778/SplineDV")

3 Import libraries

For this vignette, we’ll use SplineDV in conjunction with the scRNAseq package. While standard data preprocessing can be applied beforehand, the package itself handles quality control internally, requiring only a raw count matrix as input.

library(scRNAseq)
library(SplineDV)

4 Overview of the SplineDV workflow

4.1 Input - scRNA-seq expression matrix from two condtions

Spline-DV is designed to compare expression variability between two experimental conditions. To illustrate this, we’ll utilize a publicly available mouse lung single-cell RNA-seq dataset from the scRNAseq Bioconductor package by Zilionis et al. [1]. This dataset includes tumor-infiltrating myeloid cells from both tumor and healthy conditions.

sce <- fetchDataset('zilionis-lung-2019','2023-12-20', path="mouse")
sce
## class: SingleCellExperiment 
## dim: 28205 17549 
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):

Let us see the number of cells per condition and and the available cell types in the scRNA-seq dataset.

table(sce$`Major cell type`,sce$`Tissue`)
##              
##               healthy tumor
##   B cells         939  1874
##   Basophils        11    23
##   MoMacDC         629  1768
##   NK cells        230   400
##   Neutrophils    4429  3593
##   T cells         490  1491
##   pDC              10    52

The expression matrix should be formatted with the genes/features as rows and cells as columns. The two input matrices for splineDV must be stored as two separate count expression matrices or SingleCellExperiment objects. We can isolate the two experimental conditions. To focus solely on changes within a specific cell type and eliminate potential variability arising from shifts in cell type distribution, we select only Neutrophils for further analysis.

# Extract Healthy data
healthyCount <- sce[, sce$Tissue == "healthy"]

# Extract Tumor data
tumorCount <- sce[, sce$Tissue == "tumor"]

# Select a specific cell type - Neutrophils
healthyCount <- healthyCount[, which(healthyCount$`Major cell type` == "Neutrophils")]
print(healthyCount)
## class: SingleCellExperiment 
## dim: 28205 4429 
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):
tumorCount <- tumorCount[, which(tumorCount$`Major cell type` == "Neutrophils")]
print(tumorCount)
## class: SingleCellExperiment 
## dim: 28205 3593 
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):

Given our isolation of a single cell type, any observed changes in gene expression variability should be attributed to experimental factors. To mitigate the potential impact of background distribution shifts caused by varying cell type and state proportions, we recommend conducting Spline-DV analysis in a cell state/type-specific manner.

4.2 Running Spline-DV

The main function of SplineDV ios the splineDV function. For the analysis, the test data (X) is always use in contrast with the control data (Y). We use smaller QC parameters for the small example data sets to preserve enough cells and genes. We recommend using the default QC parameters for large data sets.

dvRes <- splineDV(X = tumorCount, Y = healthyCount)
## Computing distances for Data 1
## Performing normalization...
## Computing distances...
## HVG computed
## Computing distances for Data 2
## Performing normalization...
## Computing distances...
## HVG computed
head(dvRes)
## DataFrame with 6 rows and 19 columns
##         genes       mu1       mu2       CV1       CV2     drop1     drop2
##   <character> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1      S100a6  2.423091 3.3651806  1.069873  0.623584 0.4375388 0.0482293
## 2      S100a9  3.388193 4.1249288  0.734337  0.490492 0.1233686 0.0139795
## 3        Fth1  3.245193 2.5737118  0.677005  0.534658 0.0264139 0.0272600
## 4        Spp1  0.696282 0.0304492  2.006586  2.309741 0.8713487 0.9804287
## 5      S100a8  3.340410 3.9502114  0.700011  0.491530 0.0873213 0.0118826
## 6      Retnlg  1.489806 2.8408711  1.346040  0.800264 0.7268490 0.1817335
##       dist1     dist2 X_splinex X_spliney X_splinez Y_splinex Y_spliney
##   <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1  0.450291  1.175883  2.088791  0.784693  0.231588 2.1523963  0.625115
## 2  1.213435  1.940386  2.178329  0.768020  0.210077 2.1916912  0.616329
## 3  1.086377  0.402181  2.171898  0.769218  0.211622 2.0816899  0.640936
## 4  0.613273  0.008306  0.679678  1.128363  0.634591 0.0304185  2.200492
## 5  1.170524  1.766433  2.175922  0.768469  0.210656 2.1841718  0.618010
## 6  0.538217  0.677298  1.614046  0.874726  0.347494 2.1047751  0.635768
##   Y_splinez vectorDist Direction        Pval         FDR
##   <numeric>  <numeric> <numeric>   <numeric>   <numeric>
## 1  0.133051   1.020089        -1 1.24843e-97 6.69908e-94
## 2  0.122861   0.729556        -1 3.77682e-49 1.01332e-45
## 3  0.151399   0.690545         1 6.40575e-44 1.14578e-40
## 4  0.976790   0.605396         1 1.47841e-33 1.98329e-30
## 5  0.124811   0.599252        -1 7.27451e-33 7.80700e-30
## 6  0.145405   0.586091        -1 2.08436e-31 1.86411e-28

The output is a DataFrame containing statistical measures for each gene in the differential variability (DV) analysis. Genes with large vectorDist values exhibit substantial changes in expression variability between the two experimental conditions. Genes with FDR values below 0.05 are considered significantly differentially variable. The Direction column indicates whether the variability increased or decreased. By leveraging this information, we can identify biologically relevant genes that not only demonstrate shifts in mean expression but also alterations in expression variability across conditions.

4.3 Visualize Gene Expression statistics

To visualize the differential variability of genes, we can employ a 3D scatter plot. The DVPlot function allows us to visualize the computed spline for each condition, represented by a distinct color, along with the distance vectors of genes from their respective spline. This plot provides a clear representation of the shift in expression variability between the two conditions.

fig <- DVPlot(dvRes, targetgene='Spp1')
fig 

5 Highly Variable Genes (HVGs) using Spline-HVG

Next, we’ll introduce the Spline HVG algorithm, a feature selection technique designed to identify Highly Variable Genes (HVGs) in scRNA-seq data. This algorithm computes the distance from the spline to identify genes with significant variability. To demonstrate this, we’ll utilize the same sample dataset of healthy neutrophils from mice lungs used in the previous section.

5.1 Input - scRNA-seq Expression matrix

# Healthy neutrophils data
print(healthyCount)
## class: SingleCellExperiment 
## dim: 28205 4429 
## metadata(0):
## assays(1): counts
## rownames(28205): 0610007P14Rik 0610009B22Rik ... mt-Nd5 mt-Nd6
## rowData names(0):
## colnames: NULL
## colData names(12): Library Barcode ... Major cell type Minor subset
## reducedDimNames(1): SPRING
## mainExpName: NULL
## altExpNames(0):

5.2 Running Spline-HVG

The expression matrix should be formatted with the genes/features as rows and cells as columns. The input matrix for splineHVG must be stored as a count expression matrices or SingleCellExperiment objects. To focus solely on changes within a specific cell type and eliminate potential variability arising from shifts in cell type distribution, we select only Neutrophils for further analysis. Smaller QC parameters can be used for the small data sets (e.g., cells < 500 and genes < 5000) to preserve enough cells and genes for splineHVG analysis. However, we recommend using the default QC parameters for large data sets, if not more stringent.

HVGRes <- splineHVG(healthyCount, nHVGs = 200)
## QC done
## Performing normalization...
## Computing distances...
## HVG computed
head(HVGRes)
## DataFrame with 6 rows and 14 columns
##            Means        CV   Dropout   logMean     logCV  Distance   nearidx
##        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## S100a9   60.8722  0.633138 0.0139795   4.12507  0.490503  1.932288     16245
## S100a8   50.9572  0.634967 0.0118826   3.95042  0.491622  1.758383     16245
## S100a6   27.9395  0.865402 0.0482293   3.36521  0.623477  1.167742     16245
## Srgn     25.2401  0.589972 0.0263281   3.26729  0.463716  1.082339     16245
## Fos      19.0464  0.587785 0.0163094   2.99805  0.462340  0.819475     16245
## Il1b     18.3003  0.616345 0.0214352   2.96012  0.480168  0.778621     16245
##            dvecx       dvecy      dvecz   splinex   spliney   splinez       HVG
##        <numeric>   <numeric>  <numeric> <numeric> <numeric> <numeric> <logical>
## S100a9  1.925290 -0.12444880 -0.1072863   2.19978  0.614952  0.121266     FALSE
## S100a8  1.750639 -0.12332955 -0.1093833   2.19217  0.616646  0.123230     FALSE
## S100a6  1.165425  0.00852479 -0.0730366   2.16000  0.623812  0.131539     FALSE
## Srgn    1.067508 -0.15123556 -0.0949378   2.15177  0.625644  0.133663     FALSE
## Fos     0.798269 -0.15261184 -0.1049564   2.13792  0.628731  0.137243     FALSE
## Il1b    0.760340 -0.13478441 -0.0998306   2.13182  0.630090  0.138818     FALSE

The output is a DataFrame containing statistical measures for each gene in the Spline HVG analysis. Genes with large Distance values exhibit significant deviation from expected behavior, as represented by the 3D spline. The top n HVGs can be identified by sorting the DataFrame by Distance in descending order. These highly variable genes, which are biologically relevant, can be utilized in various downstream analyses, such as for dimensional reduction and PCA.

# Extracting HVG Gene list
HVGList <- rownames(HVGRes)[HVGRes$HVG == TRUE]
head(HVGList)
## [1] "Retnlg" "Lyz2"   "Thbs1"  "Slfn4"  "Rsad2"  "Svil"

5.3 Visualize Highly Variable Genes

To visualize the highly variable genes, we can employ a 3D scatter plot. The HVGPlot function allows us to visualize the computed spline, which represents the expected gene behavior. Each dot on the plot represents a single gene. This plot provides a clear representation of HVGs (highlighted in red), which deviate significantly from the 3D spline, indicating a substantial shift in their gene expression.

fig <- HVGPlot(HVGRes)
fig

6 Conclusion

SplineDV is a valuable tool for single-cell RNA sequencing (scRNA-seq) analysis, specifically designed to identify differential gene expression variability between conditions. By examining changes in expression dispersion, SplineDV complements traditional differential expression analysis methods. We recommend using SplineDV in conjunction with other tools from Seurat and Bioconductor to gain a more comprehensive understanding of biological processes.

7 Appendix

7.1 Citing SplineDV

citation("SplineDV")
## To cite package 'SplineDV' in publications use:
## 
##   Gatlin V, Gupta S, Romero S, Chapkin R, Cai J (2024). "Beyond
##   Differential Expression: Embracing Cell-to-Cell Variability in
##   Single-Cell Gene Expression Data Analysis." _bioRxiv_.
##   doi:10.1101/2024.08.08.607086
##   <https://doi.org/10.1101/2024.08.08.607086>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Beyond Differential Expression: Embracing Cell-to-Cell Variability in Single-Cell Gene Expression Data Analysis},
##     author = {Victoria Gatlin and Shreyan Gupta and Selim Romero and Robert Chapkin and James Cai},
##     journal = {bioRxiv},
##     year = {2024},
##     doi = {https://doi.org/10.1101/2024.08.08.607086},
##   }

7.2 References

  1. Zilionis, R., Engblom, C., Pfirschke, C., Savova, V., Zemmour, D., Saatcioglu, H. D., Krishnan, I., Maroni, G., Meyerovitz, C. V., Kerwin, C. M., Choi, S., Richards, W. G., De Rienzo, A., Tenen, D. G., Bueno, R., Levantini, E., Pittet, M. J., & Klein, A. M. (2019). Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species. Immunity, 50(5), 1317–1334.e10. https://doi.org/10.1016/j.immuni.2019.03.009

8 sessionInfo

This is the output of sessionInfo() on the system on which this document was compiled:

date()
## [1] "Wed Nov 20 19:10:54 2024"
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SplineDV_0.99.11            scRNAseq_2.21.0            
##  [3] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
##  [5] Biobase_2.67.0              GenomicRanges_1.59.1       
##  [7] GenomeInfoDb_1.43.1         IRanges_2.41.1             
##  [9] S4Vectors_0.45.2            BiocGenerics_0.53.3        
## [11] generics_0.1.3              MatrixGenerics_1.19.0      
## [13] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_1.8.9            magrittr_2.0.3           
##   [3] GenomicFeatures_1.59.1    gypsum_1.3.0             
##   [5] farver_2.1.2              rmarkdown_2.29           
##   [7] BiocIO_1.17.0             zlibbioc_1.53.0          
##   [9] vctrs_0.6.5               memoise_2.0.1            
##  [11] Rsamtools_2.23.0          DelayedMatrixStats_1.29.0
##  [13] RCurl_1.98-1.16           htmltools_0.5.8.1        
##  [15] S4Arrays_1.7.1            AnnotationHub_3.15.0     
##  [17] curl_6.0.1                Rhdf5lib_1.29.0          
##  [19] SparseArray_1.7.2         rhdf5_2.51.0             
##  [21] sass_0.4.9                alabaster.base_1.7.2     
##  [23] bslib_0.8.0               htmlwidgets_1.6.4        
##  [25] alabaster.sce_1.7.0       httr2_1.0.6              
##  [27] plotly_4.10.4             cachem_1.1.0             
##  [29] GenomicAlignments_1.43.0  lifecycle_1.0.4          
##  [31] pkgconfig_2.0.3           Matrix_1.7-1             
##  [33] R6_2.5.1                  fastmap_1.2.0            
##  [35] GenomeInfoDbData_1.2.13   digest_0.6.37            
##  [37] colorspace_2.1-1          AnnotationDbi_1.69.0     
##  [39] ExperimentHub_2.15.0      crosstalk_1.2.1          
##  [41] RSQLite_2.3.8             beachmat_2.23.1          
##  [43] filelock_1.0.3            fansi_1.0.6              
##  [45] httr_1.4.7                abind_1.4-8              
##  [47] compiler_4.5.0            bit64_4.5.2              
##  [49] withr_3.0.2               BiocParallel_1.41.0      
##  [51] DBI_1.2.3                 HDF5Array_1.35.1         
##  [53] alabaster.ranges_1.7.0    alabaster.schemas_1.7.0  
##  [55] rappdirs_0.3.3            DelayedArray_0.33.2      
##  [57] rjson_0.2.23              tools_4.5.0              
##  [59] glue_1.8.0                restfulr_0.0.15          
##  [61] rhdf5filters_1.19.0       grid_4.5.0               
##  [63] gtable_0.3.6              tidyr_1.3.1              
##  [65] ensembldb_2.31.0          data.table_1.16.2        
##  [67] utf8_1.2.4                XVector_0.47.0           
##  [69] BiocVersion_3.21.1        pillar_1.9.0             
##  [71] dplyr_1.1.4               BiocFileCache_2.15.0     
##  [73] lattice_0.22-6            rtracklayer_1.67.0       
##  [75] bit_4.5.0                 tidyselect_1.2.1         
##  [77] Biostrings_2.75.1         scuttle_1.17.0           
##  [79] knitr_1.49                bookdown_0.41            
##  [81] ProtGenerics_1.39.0       xfun_0.49                
##  [83] UCSC.utils_1.3.0          lazyeval_0.2.2           
##  [85] yaml_2.3.10               evaluate_1.0.1           
##  [87] codetools_0.2-20          tibble_3.2.1             
##  [89] alabaster.matrix_1.7.2    BiocManager_1.30.25      
##  [91] cli_3.6.3                 munsell_0.5.1            
##  [93] jquerylib_0.1.4           Rcpp_1.0.13-1            
##  [95] dbplyr_2.5.0              png_0.1-8                
##  [97] XML_3.99-0.17             parallel_4.5.0           
##  [99] ggplot2_3.5.1             blob_1.2.4               
## [101] AnnotationFilter_1.31.0   sparseMatrixStats_1.19.0 
## [103] bitops_1.0-9              alabaster.se_1.7.0       
## [105] viridisLite_0.4.2         scales_1.3.0             
## [107] purrr_1.0.2               crayon_1.5.3             
## [109] rlang_1.1.4               KEGGREST_1.47.0