Signature refining tutorial - scRNA-seq Workflow

Riccardo L. Rossi, Ivan Ferrari

04 July 2023

Introduction

Single-cell transcriptomics is a rapidly growing field which is essential to improve our understanding of complex tissues and organisms at an unprecedented resolution. The manual annotation of cell identity, which is the gold standard procedure to determine cell populations, remains a major limitation in most scRNA-seq analysis [1]; more recently, a growing number of resources are being published offering the scientific community many cell type specific gene signature as a reference. Alas, marker gene belonging to gene signatures could be unspecific if taken singularly, specially if that given gene is involved in all the differentiation steps in some cellular population of the dataset (from stem cells to a terminally differentiated population), making it a differentially expressed gene (DEG) in more than one cluster. In this context we decided to enlarge combiroc’s breath to the scRNA-seq analysis field as a signature refiner, in order to find the best few marker genes combinations from an already existing gene signature, regardless the genes’ ranking as differentially expressed. Combiroc is specialized in finding specific combinations, thus its use may allow researchers to find some cells of interest with a high degree of confidence and help in the process of manual annotation by reducing the number of marker genes to consider.

Here we show the complete workflow of combiroc-mediated signature refinement at single-cell resolution in which we used 3 PBMC public datasets showed in Seurat vignettes. All of them have been deeply characterized and scrupulously labelled, so they are accepted and used as reference by the scientific community.

The full story and more comments on this case study can be found in our “Less is more” bioRxiv preprint: Ferrari et al. Combiroc: when ‘less is more’ in bulk and single cell marker signatures. bioRxiv 2022.01.17.476603; doi: https://doi.org/10.1101/2022.01.17.476603

Libraries needed

# main libraries:
library(combiroc)
library(dplyr)
library(tidyr)
library(ggplot2)

and also needed for this specific scRNAseq workflow (install them first, if you didn’t already):

library(devtools) # used to install devel-level packages
library(Seurat) # used to process scRNAseq data
library(SeuratData) # used to install and load scRNAseq datasets
library(SeuratDisk) # used to read h5seurat files

A note on executing this vignette

The code described in this vignette was originally run on a jupyter-based R-kernel notebook backed by a high-performance computing (HPC) server infrastructure. Due to relevant size of some of the objects loaded and/or created, it is unlikely that the whole code can be run on a standard computing equipment (i.e. a laptop). The computationally heavy passages are therefore marked as [[Heavy-Code]]{style=“color: orange;”} and can be skipped by the user by reading/loading pre-computed objects or intermediate results either from rds or csv files. In this way the reader/reviewer will be able to check the code, to download (if needed) the pre-computed object when indicated and follow the whole workflow without having a HPC server.

Selection of NK cells marker combinations

The difficulty to distinguish NK (Natural Killer) cells from CD8+ T cells in many scRNAseq datasets is a good challenge to test combiroc performance in markers signature optimization.
The purpose of this analysis is to find the best combinations of known gene markers that best describe the NK-cell populations starting from a NK-cell single-cell transcriptomic signature, then using the models calculated on such combinations on other scRNAseq datasets.
After performing differential marker expression on an initial “training” scRNAseq dataset we will take into account the top 30 differentially expressed genes (DEGs) specific for NK cells cluster; then we will use combiroc to select the best gene combinations among these top 30 NK cells markers. Regression models from the selected combinations will then be used on other independent datasets (“test” datasets) of the same kind.

As combiroc training dataset we are using the PBMC CITE-seq atlas from Hao et al. 2021 [2]. This dataset can be downloaded as an h5 Seurat data (h5s) from the FredHutch-NYCG atlas page (exact download link: here) [[Heavy-Code]]{style=“color: orange;”}

# read the downloaded h5seurat dataset file (using SeuratDisk functions)
atlas <- LoadH5Seurat("pbmc_multimodal.h5seurat")
# activate the level-1 annotations
Idents(atlas) <- atlas@meta.data$celltype.l1
# overview annotated cell clusters with UMAP
DimPlot(atlas, label = T, repel = T)
plot of chunk unnamed-chunk-5

plot of chunk unnamed-chunk-5

To guarantee the best possible performance of the training dataset (the one we extract the markers combinations from) is important to make sure its value ranges are similar to those we wanto to use as test/validation datasets. For this reason we advise using the raw counts matrices for all the datasets, in order to be able to consistently rescale them in the same way by using the function Seurat::ScaleData().

Extracting the NK markers

The gene markers for NK cells were obtained with a standard Seurat analysis (see here); the Seurat::FindMarkers() function was used, then markers were ordered by fold change: [[Heavy-Code]]{style=“color: orange;”}

# Performing differential expression analysis
nk.de.markers <- FindMarkers(atlas, ident.1 = "NK", ident.2 = NULL)
nk.de.markers <- nk.de.markers[order(-nk.de.markers$avg_log2FC), ]
nk_genes <- rownames(nk.de.markers)[1:30]

By visualising the expression values of the top 4 markers of NK cells cluster it is evident that the majority of them is also higly expressed in other non-NK clusters (e.g. CD8 T), making them too unspecific to allow a well defined cluster annotation if considered individually.

FeaturePlot(atlas, nk_genes[1:4])
plot of chunk unnamed-chunk-8

plot of chunk unnamed-chunk-8

The user not willing (or computationally limited) to run the differential expression analysis above can read the pre-computed object containing the gene signature of the top 30 DEGs (overexpressed in NK cells vs all others):

nk.de.markers <- read.csv('inst/precomp/nk_degs')
nk_genes <- rownames(nk.de.markers)[1:30]
nk_genes
#>  [1] "GNLY"    "NKG7"    "GZMB"    "PRF1"    "FGFBP2"  "GZMA"    "KLRD1"  
#>  [8] "CST7"    "SPON2"   "KLRF1"   "CTSW"    "CD247"   "CCL5"    "CLIC3"  
#> [15] "HOPX"    "GZMH"    "IL2RB"   "KLRB1"   "TRDC"    "CD7"     "GZMM"   
#> [22] "MYOM2"   "FCGR3A"  "ARL4C"   "ABHD17A" "SYNE2"   "CMC1"    "EFHD2"  
#> [29] "ADGRG1"  "JAK1"

Converting Seurat objects in combiroc’s input format

Central to the combiroc workflow applied to single cell data is the function seurat_to_combiroc(). It takes a Seurat object as input and extracts both the selected marker expression values from the @data slot of a given assay and, optionally, the class of any specific celltype (in our case ‘NK’ or ‘Other’); then it directly assembles a dataframe compatible with combiroc workflow either for training (finding combinations and models) or testing purposes (using previously found combinations).

Preparing the training single-cell RNAseq dataset

Once we have a Seurat object obtained from a standard Seurat protocol, it needs to be converted in a combiroc-ready version in order to perform the training procedure, i.e. finding the best markers combinations and models, with combiroc’s functions. Such a train object can be obtained with the seurat_to_combiroc() function: [[Heavy-Code]]{style=“color: orange;”}

train <- seurat_to_combiroc(SeuratObject = atlas, 
                            gene_list = nk_genes, 
                            assay = 'SCT', 
                            labelled_data = T, 
                            case_class = 'NK', 
                            case_label = 'NK', 
                            control_label =  'Other')

The pre-computed train object obtained with the code above can be downloaded from here, then directly read:

train <- readRDS(file = "inst/precomp/train.rds")
head(train)
#>                    ID Class   ABHD17A    ADGRG1    ARL4C      CCL5     CD247
#> 1 L1_AAACCCAAGAAACTCA Other 0.6931472 0.0000000 0.000000 0.0000000 0.0000000
#> 2 L1_AAACCCAAGACATACA Other 0.0000000 0.0000000 1.609438 0.0000000 0.6931472
#> 3 L1_AAACCCACAACTGGTT Other 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
#> 4 L1_AAACCCACACGTACTA    NK 1.3862944 0.6931472 1.609438 2.4849066 1.7917595
#> 5 L1_AAACCCACAGCATACT Other 0.0000000 0.0000000 1.386294 0.6931472 0.6931472
#> 6 L1_AAACCCACATCAGTCA Other 0.0000000 0.0000000 2.302585 2.9957323 1.6094379
#>         CD7    CLIC3     CMC1     CST7     CTSW     EFHD2   FCGR3A   FGFBP2
#> 1 0.0000000 0.000000 0.000000 0.000000 0.000000 1.0986123 0.000000 0.000000
#> 2 1.6094379 0.000000 0.000000 0.000000 0.000000 0.6931472 0.000000 0.000000
#> 3 0.6931472 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000
#> 4 1.6094379 2.197225 1.945910 2.944439 1.945910 1.3862944 1.386294 2.079442
#> 5 0.6931472 0.000000 0.000000 0.000000 1.098612 0.6931472 0.000000 0.000000
#> 6 0.0000000 0.000000 2.302585 2.079442 0.000000 0.0000000 0.000000 0.000000
#>       GNLY     GZMA     GZMB    GZMH      GZMM     HOPX    IL2RB      JAK1
#> 1 0.000000 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 0.6931472
#> 2 0.000000 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 1.3862944
#> 3 0.000000 0.000000 0.000000 0.00000 0.6931472 0.000000 0.000000 0.0000000
#> 4 3.044522 1.386294 2.772589 1.94591 1.3862944 1.609438 1.791759 1.9459101
#> 5 0.000000 0.000000 0.000000 0.00000 1.0986123 0.000000 0.000000 0.6931472
#> 6 0.000000 1.098612 0.000000 0.00000 1.9459101 0.000000 0.000000 1.3862944
#>      KLRB1    KLRD1    KLRF1    MYOM2      NKG7     PRF1    SPON2     SYNE2
#> 1 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
#> 2 2.079442 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
#> 3 0.000000 0.000000 0.000000 0.000000 0.6931472 0.000000 0.000000 0.6931472
#> 4 1.386294 1.609438 1.386294 1.791759 3.9120230 1.791759 2.397895 1.0986123
#> 5 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
#> 6 0.000000 0.000000 0.000000 0.000000 2.9957323 0.000000 0.000000 1.7917595
#>       TRDC
#> 1 0.000000
#> 2 0.000000
#> 3 0.000000
#> 4 1.791759
#> 5 0.000000
#> 6 0.000000

Then, as described in the main combiroc vignette, we used combiroc_long to make the data in long tidy format, fit for further processing:

train_long <- combiroc_long(train)

Finding the best marker combinations and models

Given a list of markers, combiroc assesses the performance of all combinations of such markers. The computational load of this process can be high for more than 10 markers. A list of \(n\) markers generates \({2^{n} - 1 }\), thus for \(n=30\) we have more than a billion possible combinations.
Mathematically, for combinations here we mean combinations without repetition, i.e. a subset of \(k\) distinct elements of a set with \(n\) elements, that can be calculated as the binomial coefficient \(\binom{n}{k}\).

The purpose of combiroc is always finding the best optimized combination of markers, i.e. a subset of the original full signature which, despite the much smaller number of markers, retains the discriminatory power of the original signature or it’s even better. This is particularly important in the field of diagnostics where a smaller set of marker has a bigger translational potential (see our first combiroc paper Mazzara et al. 2017 for a discussion on this).

Similarly, here we will limit the search to combinations composed by no more than 5 markers: to do so we will set the max_length = 5 attribute of the combi() function. With this limitation the number of combinations to compute drops to 174,436, which is computationally more manageable.

$$ Combinations(upTo5Markers) = \sum_{i=1}^5 \binom{n}{i} $$

#>   n_markers n_combs tot_combs
#> 1         1      30        30
#> 2         2     435       465
#> 3         3    4060      4525
#> 4         4   27405     31930
#> 5         5  142506    174436
plot of chunk unnamed-chunk-15

plot of chunk unnamed-chunk-15

The combi() function works on the train dataset computing the marker combinations and counting their corresponding positive samples in each class (once thresholds are selected).
A sample, to be considered positive for a given combination, must have a value higher than a given signal threshold (signalthr) for at least a given number of markers composing that combination (combithr).

As described in the combiroc’s vignette for the standard workflow, the argument signalthr in the combi() function should be set according to the guidelines and characteristics of the methodology used for the analysis or by an accurate inspection of the signal intensity distribution. In the event specific guidelines are missing, one should set the value signalthr as suggested by the distr$Density_plot feature.

Single cell RNA-seq datasets are exactly one of such cases: it is often neither possible nor easy to extrapolate the best signal threshold, due to the highly scattered distribution of the expression values: here we thus take advantage of the package-specific distr$Density_plot feature and will let the package find this value without the user intervention.

Combinatorial analysis

The core of a combiroc analysis are the two marker_distribution() and combi() functions: changing the value of key parameters in these functions we can modulate the stringency and severity of the analysis.

To obtain a performing and stringent signature refinement we set combithr argument to 2 in combi(): this is the minimum number of positively expressed genes (i.e. the minimum number of genes that need to reach the signal threshold) to consider the whole combination as a valid one.

The plots below show expression values (signal) distribution obtained with the markers_distributions() functions on the training dataset in long format (train_long). Differently from two well-defined classes (as seen in the standard combiroc workflow), it’s not intuitive where to put the signal threshold, but the function suggests a signal intensity value of 0.9. Genes expression values, as expected, are distributed in different ways in the two classes:

Please note: the markers_distribution() command naturally triggers a few warnings in which the user is reminded how to use the command arguments

distr <- markers_distribution(train_long,
                              signalthr_prediction = TRUE, 
                              case_class = "NK")
distr$Density_plot
plot of chunk unnamed-chunk-17

plot of chunk unnamed-chunk-17

The suggested threshold is indicated by the vertical dashed line and its value is shown in overlay. The exact value can be retrieved from the distr object by distr$Coord$threshold (see the Standard workflow for further details):

autoThreshold <- distr$Coord[distr$Coord$Youden==max(distr$Coord$Youden), 'threshold'][1]

# the exact value of optimal threshold in this case is 0.8958797, as can be seen with:
distr$Coord[distr$Coord$Youden==max(distr$Coord$Youden),]

Let’s apply now the combi() function on the training dataset train with the suggested threshold. The results is a table (the train_tab dataframe) of all the combinations with 1 (= the individual markers) to 5 genes in them.

In this dataframe are also shown sensitivity (SE) and specificity (SP) for each combination:

[[Heavy-Code]]{style=“color: orange;”} (download and use the pre-computed below)

train_tab <- combi(train, signalthr = autoThreshold, combithr = 2, max_length = 5, case_class = 'NK')

Read the precomputed object (available for download here):

train_tab <- read.table('inst/precomp/train_tab.rds')
tail(train_tab.rds)
#>                                        Markers X.Positives.Other X.Positives.NK
#> Combination 174401 MYOM2-NKG7-PRF1-SPON2-SYNE2             20313          18220
#> Combination 174402  MYOM2-NKG7-PRF1-SPON2-TRDC             12876          18413
#> Combination 174403  MYOM2-NKG7-PRF1-SYNE2-TRDC             20844          18482
#> Combination 174404 MYOM2-NKG7-SPON2-SYNE2-TRDC             19664          18206
#> Combination 174405 MYOM2-PRF1-SPON2-SYNE2-TRDC             12019          17767
#> Combination 174406  NKG7-PRF1-SPON2-SYNE2-TRDC             20866          18512
#>                          SE       SP n_markers
#> Combination 174401 97.62109 85.80503         5
#> Combination 174402 98.65517 91.00210         5
#> Combination 174403 99.02486 85.43396         5
#> Combination 174404 97.54608 86.25856         5
#> Combination 174405 95.19396 91.60098         5
#> Combination 174406 99.18560 85.41859         5

Then we rank the combinations by Youden index (SE+SP-1) and we filter them by their SE and SP values. We will select the top 4 (for demonstration purposes only, others may be also good): what we are using to rank is the Youden index and the number of genes in the combination, where combinations with fewer genes are considered better ones.

rmks <- ranked_combs(combo_table = train_tab, 
                     min_SE = 95, 
                     min_SP = 95)
head(rmks$table)
#>                                           Markers X.Positives.Other
#> Combination 169807    GNLY-IL2RB-KLRF1-SPON2-TRDC              6914
#> Combination 172173    GZMB-IL2RB-KLRF1-SPON2-TRDC              5633
#> Combination 163878  FCGR3A-GNLY-IL2RB-KLRF1-MYOM2              6401
#> Combination 137550 CLIC3-FCGR3A-IL2RB-KLRD1-KLRF1              6777
#> Combination 138800   CLIC3-GNLY-IL2RB-KLRF1-SPON2              6361
#> Combination 164702   FCGR3A-GZMB-IL2RB-KLRF1-TRDC              6351
#>                    X.Positives.NK       SE       SP n_markers    Youden
#> Combination 169807          18356 98.34976 95.16841         5 0.9351818
#> Combination 172173          18186 97.43892 96.06359         5 0.9350251
#> Combination 163878          18276 97.92113 95.52690         5 0.9344804
#> Combination 137550          18325 98.18367 95.26415         5 0.9344782
#> Combination 138800          18268 97.87827 95.55486         5 0.9343313
#> Combination 164702          18264 97.85684 95.56184         5 0.9341868

And with roc_reports() we carry out the computations for the top four combinations (we need to explicitly indicate their number IDs in the selected_combinations argument). Also, we need to indicate the case class (NK):

reports <-roc_reports(train, 
                      markers_table = train_tab, 
                      selected_combinations =  c(169807,172173,163878, 137550),       
                      case_class = 'NK',
                      single_markers= nk_genes[1:4])

Let’s explore the results generated:

reports$Plot
plot of chunk unnamed-chunk-26

plot of chunk unnamed-chunk-26

Zooming in to show the higher performance of the combinations:

rocplot <- (reports$Plot) + coord_cartesian(xlim = c(0, 0.2))
rocplot
reports$Metrics
plot of chunk unnamed-chunk-28

plot of chunk unnamed-chunk-28

#>                      AUC    SE    SP CutOff   ACC     TN    TP   FN    FP   NPV
#> GNLY               0.962 0.959 0.914  0.146 0.920 130861 17908  756 12239 0.994
#> NKG7               0.970 0.982 0.901  0.131 0.911 128972 18336  328 14128 0.997
#> GZMB               0.949 0.928 0.925  0.136 0.925 132375 17316 1348 10725 0.990
#> PRF1               0.965 0.935 0.919  0.137 0.921 131569 17452 1212 11531 0.991
#> Combination 169807 0.995 0.982 0.968  0.107 0.970 138553 18326  338  4547 0.998
#> Combination 172173 0.995 0.977 0.970  0.108 0.971 138769 18229  435  4331 0.997
#> Combination 163878 0.995 0.985 0.962  0.078 0.965 137680 18393  271  5420 0.998
#> Combination 137550 0.995 0.976 0.966  0.101 0.967 138179 18224  440  4921 0.997
#>                      PPV
#> GNLY               0.594
#> NKG7               0.565
#> GZMB               0.618
#> PRF1               0.602
#> Combination 169807 0.801
#> Combination 172173 0.808
#> Combination 163878 0.772
#> Combination 137550 0.787

The overall internal classification performances of the top 4 combinations are better than the top 4 single markers ones, and are very similar with each other.

To know which are the genes in any specific combination we can use the show_markers() function. For combinations 169807 and 172173, use:

show_markers(selected_combinations =c(169807,172713), markers_table = train_tab)

Testing the combinations on independent datasets (validation)

TEST/VALIDATION DATASETS

As independent test/validation we used 2 different datasets on which we will evaluate the expression levels of the previously selected NK related marker combinations in each cluster of the datasets:

Please note: the main reason we choose the above datasets is because they are well annotated. While the cell clusters annotation of the first (training) dataset was used in the computations by combiroc to distinguish NK cells from all other cells, the cells annotations of the other two validation datasets are taken away from the data, and they will be only used ex-post as a ground truth to check the expression of marker combinations.

Testing marker combinations on CBMC dataset

The CBMC test dataset, made of 8617 cord blood mononuclear cells (CBMCs), produced with CITE-seq was directly loaded from the SeuratData package. The dataset has been generated by Stoeckius et al. 2017 [3]. Further description can be found in the Seurat multimodal vignette or by calling the guide with ?cbmc.

We are going to invoke the cbmc dataset from the SeuratData package and preprocess it as needed:

# retrieve cbmc data from SeuratData package
if (!require("SeuratData", quietly = TRUE))
    devtools::install_github('satijalab/seurat-data')
library(SeuratData)
InstallData("cbmc.SeuratData")
data(cbmc)

# process data with standard Seurat protocol
library(Seurat)
cbmc <- NormalizeData(cbmc, verbose = F) %>% 
  FindVariableFeatures(verbose = F) %>% 
  ScaleData(verbose = F) %>% 
  RunPCA(verbose = F)
cbmc <- FindNeighbors(cbmc, dims = 1:30, verbose = F)
cbmc <- FindClusters(cbmc, resolution = 0.8, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:30, verbose = F)

# add cell annotations from CITE seq to identities
Idents(cbmc) <- cbmc@meta.data$protein_annotations
cbmc[['ID']]<- Idents(cbmc)
cbmc
# visualize UMAP clusters
DimPlot(cbmc, repel = T, label = T)
plot of chunk unnamed-chunk-33

plot of chunk unnamed-chunk-33

Please note that the “NA” gray cells cluster is from mouse cells whose presence acts as negative control in the original experiment/dataset. Also, cluster annotations are just superimposed and pulled out as ground truth of cells identity but are not further used in the combiroc procedure.

As previosuly done for the training dataset, we now obtain the first combiroc test dataset using seurat_to_combiroc() function. The test_cbmc object contains cell barcodes as sample IDs and the expression values of the 30 NK signature genes).

test_cbmc <- seurat_to_combiroc(SeuratObject = cbmc, 
                                gene_list = nk_genes, 
                                assay = 'RNA')
head(test_cbmc)

Distribution of individual markers’ expression

Let’s have a look at how the top four single NK-markers expression is distributed across the cell clusters in the CBMC dataset:

VlnPlot(cbmc, features = nk_genes[1:4], group.by = 'ID', pt.size=0, ncol = 2)
plot of chunk unnamed-chunk-36

plot of chunk unnamed-chunk-36

In this dataset, expressions of top individual NK markers is higher in the NK cluster, but also moderate in others: they are expressed in some CD8 T cells, CD34+ and CD16+ Mono cells, even if clusters have been defined and annotated relying on protein-level information (CITE seq). Moreover, GZMB alone is not sufficient to discriminate NK cells from pDCs. Only PRF1 among the top four markers is able to discriminate NK cells alone. Mouse cells (the “NA” cluster), which are kept from the original dataset as negative control, also show a bit of GNLY and NKG7 expression, which is not surprising given the mixed nature of this cluster.

Combiroc NK probability across CBMC cell clusters

We now look at the performance of the marker combination selected with combiroc using the combi-score, which is calculated here for the top Combination 172173 (made by genes GZMB, IL2RB, KLRF1, SPON2 and TRDC), which has the highest accuracy. This score expresses the probability of being NK cells.

# adding combi score for combination 172173
cs_cbmc <- combi_score(test_cbmc, Models = reports$Models, Metrics = reports$Metrics, Positive_class = 'NK', Negative_class = 'Other')
cbmc[['c172173_combi_score']]<- cs_cbmc$`Combination 172173`  

Let’s plot the combi-score for Combination 172173 across CBMC clusters:

p1 <- FeaturePlot(cbmc, features = "c172173_combi_score")
p2 <- VlnPlot(cbmc, features = "c172173_combi_score", group.by = "ID", pt.size = 0)
p1 | p2
plot of chunk unnamed-chunk-39

plot of chunk unnamed-chunk-39

The combi-score, which is the probability of belonging to NK cells, is highly specific for the clusters that are annotated with a “NK” label. pDCs cells do not show any signal and a very small number of mouse cells are colored in the FeaturePlot (left), yet without any relevance to the combi-score as seen in the violin plots (right).

Testing marker combination on PBMC3K dataset

The Peripheral blood mononuclear cells (PBMC3K) test dataset can also be directly invoked from the SeuratData library. The dataset is immediately available as an already processed Seurat object named pbmc3k.final.

# retrieve pbmc3k dataset from SeuratData package
# library(SeuratData)
InstallData("pbmc3k.SeuratData")
data("pbmc3k.final")
pbmc3k.final <- pbmc3k.final

# add cell annotations to identities
pbmc3k.final[['ID']] <- Idents(pbmc3k.final)
pbmc3k.final

The original UMAP embedding of the PBMC3K datasets:

DimPlot(pbmc3k.final, reduction = "umap", label = T, repel = T)
plot of chunk unnamed-chunk-42

plot of chunk unnamed-chunk-42

In the same way as previously done we are using now seurat_to_combiroc() to obtain the second combiroc test dataset (test_pbmc3k). Please note that this dataset does not contain TRDC expression values since this gene (which contributes to the top selected combinations) is not present in the original PBMC3K dataset.

test_pbmc3k <- seurat_to_combiroc(SeuratObject = pbmc3k.final, gene_list = nk_genes, assay = 'RNA')
head(test_pbmc3k)

Distribution of individual markers’ expression

Let’s have a look how the top four single NK-markers expression is distributed in the PBMC-3K dataset:

VlnPlot(pbmc3k.final, features = nk_genes[1:4], group.by = 'ID', pt.size=0, ncol = 2)
plot of chunk unnamed-chunk-45

plot of chunk unnamed-chunk-45

In this PBMC dataset, all top four markers have high expression values in NK cells and in CD8 T cells too, showing that individual markers mRNA level may be not sufficient to clearly identify the cells of interest.

Combiroc NK probability across PBMC3K cell clusters

We now look at the performance of the marker combination selected with combiroc using the combi-score, which is calculated here for the top Combination 137550 (made by genes CLIC3, FCGR3A, IL2RB, KLRD1 and KLRF1), the highest accuracy combination without TRDC gene (which is not expressed in the this dataset). This score also expresses the probability of being NK cells.

sub <- reports$Models
sub$`Combination 169807` <- NULL
sub$`Combination 172173` <- NULL
cs_pbmc3k <- combi_score(test_pbmc3k, Models = sub,
    Metrics = reports$Metrics, Positive_class = "NK", Negative_class = "Other")
pbmc3k.final[["c137550_combi_score"]] <- cs_pbmc3k$`Combination 137550`  # to add combi score of combination 137550

Let’s plot the combi-score for Combination 137550 across PBMC-3K clusters:

p1 <- FeaturePlot(pbmc3k.final, features = "c137550_combi_score")
p2 <- VlnPlot(pbmc3k.final, features = "c137550_combi_score", group.by = "ID", pt.size = 0)
p1 | p2
plot of chunk unnamed-chunk-48

plot of chunk unnamed-chunk-48

Here too, as seen for the previous dataset, the cell cluster labeled as “NK” is the only one displaying a very high combi-score, i.e. showing an actual probability of being NK cells. No other cell in any cluster produces noisy interference.

Combinations selected with combiroc are optimised smaller signatures

While the combi-score described here is defined as the predicted probability obtained while fitting the combinations models to the test datasets (returned by combi_score()), other metrics exist such as the gene-signature-score described in Della Chiara et al. 2021 [5]: the gene-signature-score takes into account both the expression level and co-expression of genes within each single cell.
Given a geneset, the gene-signature-score’s increase is directly proportional to the number of expressed genes in the signature and to the sum of their level of expression.

We are going now to use this gene-signature-score computed on the whole signature and computed on the combiroc-selected combinations. To do so, we reproduced the gene-signature-score computation in the custom R function signature_score.R:

source("inst/external_code/signature_score.R")

First, we compute the “gene-signature-score” for the whole 30 DEGs signature that we assume as reference. Being this a whole signature (30 genes) we expect it to be very specific. We are doing this on both test datasets previously used.

# computing whole signature gene-signature-score for CBMC
NK_sig_cbmc<-signature_score(SeuratObj = cbmc, geneset = nk_genes)
cbmc$NK_signature_score <- NK_sig_cbmc$scaled_combined_score

# computing whole signature gene-signature-score for PBMC-3K
NK_sig_pbmc3k<-signature_score(SeuratObj = pbmc3k.final, geneset = nk_genes)
pbmc3k.final$NK_signature_score <- NK_sig_pbmc3k$scaled_combined_score

Then, we compute the “gene-signature-score” limited to the combinations of five genes previously used:

# computing gene-signature-score of combination 172173 for CBMC
comb_cbmc<-signature_score(SeuratObj = cbmc, geneset = c('GZMB','IL2RB','KLRF1','SPON2','TRDC'))
cbmc[['c172173_signature_score']] <- comb_cbmc$scaled_combined_score  

# computing gene-signature-score of combination 137550 for PBMC-3K
comb_pbmc3k<-signature_score(SeuratObj = pbmc3k.final, geneset = c('CLIC3', 'FCGR3A', 'IL2RB','KLRD1','KLRF1'))
pbmc3k.final[['c137550_signature_score']] <- comb_pbmc3k$scaled_combined_score 

We are now plotting side by side the gene-signature-scores from the whole-signature (30 genes) and the combinations (5 genes).
For the CBMC dataset:

v1 <- VlnPlot(cbmc, features='NK_signature_score', group.by = 'ID', pt.size=0)
v2 <- VlnPlot(cbmc, features='c172173_signature_score', group.by = 'ID', pt.size=0)
v1|v2
plot of chunk unnamed-chunk-53

plot of chunk unnamed-chunk-53

And for the PBMC-3K dataset:

v1 <- VlnPlot(pbmc3k.final, features='NK_signature_score', group.by = 'ID', pt.size=0)
v2 <- VlnPlot(pbmc3k.final, features = "c137550_signature_score", group.by = "ID", pt.size = 0)
v1|v2
plot of chunk unnamed-chunk-55

plot of chunk unnamed-chunk-55

As expected, the gene-signature-score for both datasets correctly discriminates NK-cells from others, but while in CBMC cells the NK cluster is unequivocally selected, in PBMC-3K cells also cytotoxic CD8-T cells are picked with a still relevant score. This does not happen when the gene-signature-score is computed over the combinations selected with combiroc (plots on the right). Overall, the gene-signature-scores on combinations are highly specific and less noisy.

This demonstrates that having more genes as in the 30-genes whole signature does not add discriminatory power compared to combinations with much less genes. The 5-genes combinations selected with combiroc are de-facto optimized gene signatures with equal or even higher discriminatory power compared to the bigger gene signatures they are extracted from.

The combi-score can be used to discriminate among cell clusters

Finally we compare here the complete NK whole gene-signature-score with the combiroc’s NK combi-score to see how they perform in both test datasets.

Comparison in CBMC cells:

v1 <- VlnPlot(cbmc, features='NK_signature_score', group.by = 'ID', pt.size=0)
v2 <- VlnPlot(cbmc, features='c172173_combi_score', group.by = 'ID', pt.size=0)
v1|v2
plot of chunk unnamed-chunk-57

plot of chunk unnamed-chunk-57

Comparison in PBMC-3K cells:

v1 <- VlnPlot(pbmc3k.final, features='NK_signature_score', group.by = 'ID', pt.size=0)
v2 <- VlnPlot(pbmc3k.final, features = "c137550_combi_score", group.by = "ID", pt.size = 0)
v1|v2
plot of chunk unnamed-chunk-59

plot of chunk unnamed-chunk-59

Remarkably, despite the 6 fold reduction in the number of considered features to build the gene score, for both the datasets, the signature score confirmed that the best combination is, at worst, as accurate as the whole signature in indicating the NK cluster.
In addition, combi-score results to be even more precise and further suggests that these combinations can be used as a refined version of the 30-genes NK whole-signature.

The difference between the gene-signature-score and combi-score is not surprising since to compute the first the genes within the combinations are considered equally, while in the latter combiroc models assign a ‘weight’ to each gene, further indicating the power of combiroc combination models in suggesting the identity of a cluster with a modest amount of genes.

As a mean of control, we check the effect of a set of random combinations on the test datasets

combs_length <- c(30, 5)
random_combs <- c(0, 0)
set.seed(1492)
for (i in combs_length) {
      random_list <- sample(rownames(cbmc), i)
      dfc <- signature_score(cbmc, random_list)
      cbmc[[paste0("random_", as.character(i))]] <- dfc$scaled_combined_score
      dfp <- signature_score(pbmc3k.final, random_list)
      pbmc3k.final[[paste0("random_", as.character(i))]] <- dfp$scaled_combined_score
}

Here random combinations do not have a preferential cluster and show a generalized higher background signal noise. For CBMC and PBMC-3K datasets:

VlnPlot(cbmc, features = c('random_30', 'random_5'), group.by = 'ID', pt.size=0)
plot of chunk unnamed-chunk-62

plot of chunk unnamed-chunk-62

VlnPlot(pbmc3k.final, features = c('random_30', 'random_5'), group.by = 'ID', pt.size=0)
plot of chunk unnamed-chunk-64

plot of chunk unnamed-chunk-64

Beyond using scores of random combinations as negative control, they also show the magnitude of background noise associated to each random combination taken into consideration. In fact, random scores show quite homogeneously distributed values across clusters indicating that the cluster-preferential distributions of real combinations scores are not stochastic.
Generally speaking since the higher is the length of random signatures, the higher will be the noise hence the probability to pick genes expressed in all the clusters: this suggests that the higher is the length of a real DEG signature the higher is the probability to include less specific genes and to increase the noise.

Conclusions

This guided tutorial shows how to use the combiroc package to optimize single-cell gene signature with combinations of marker and to identify a population of interest in scRNA-seq datasets:

Identifying NK cells using the expression level of the best markers individually is not efficient since they are too unspecific for classification purposes if taken one by one.
The power of a combined signature of five markers only is comparable - and in some cases superior - to that of a whole scRNA-seq gene signature, typically in the range of tens of genes: scaling down a gene signature while retaining its discriminatory power is highly relevant in both research and diagnostic context.

Back to the top of this docment.

Go to Combiroc standard workflow tutorial

References

  1. Abdelaal et al. 2019. Genome Biology. A comparison of automatic cell identification methods for single-cell RNA sequencing data
  2. Hao et al. 2021.Cell Integrated analysis of multimodal single-cell data
  3. Stoeckius et al. 2017. Nature Methods. Simultaneous epitope and transcriptome measurement in single cells
  4. Satija et al. 2015. Nature Biotechnology. Spatial reconstruction of single-cell gene expression data.
  5. Della Chiara et al. 2021. Nature Communications. Epigenomic landscape of human colorectal cancer unveils an aberrant core of pan-cancer enhancers orchestrated by YAP/TAZ.