The standR
package is short for Spatial transcriptomics analyzes and decoding in R, it aims at providing good practice pipeline and useful functions for users to analyze Nanostring’s GeoMx DSP data, including data with GeoMX protein (~30-60 markers), CTA (cancer transcriptome atlas, ~1800 genes) and WTA (whole transcriptome atlas, ~18000 genes) panels.
The data analysis workflow consists of the following steps:
The purpose of this vignette is to demonstrate how to use standR to analyze Nanostring GeoMx DSP data.
The development version of standR
can be installed from GitHub:
We will load the standR
package.
We designed the package and the analysis workflow to be consistent with the BioConductor’s Orchestrating Spatially Resolved Transcriptomics Analysis (OSTA) framework, namely the SpatialExperiment
.
In this vignette, we will use a publicly available WTA dataset of diabetic kidney disease (DKD) as the case study.
Data we are using here is a publicly available whole transcriptome atlas (WTA) dataset of diabetic kidney disease (DKD) shared by Nanostring.
This is the background for the data:
NanoString GeoMx DSP dataset of diabetic kidney disease (DKD) vs healthy kidney tissue. Seven slides were analyzed, 4 DKD and 3 healthy. Regions of Interest (ROI) were focused two different parts of a kidney’s structure: tubules or glomeruli. Individual glomeruli were identified by a pathologist as either relatively healthy or diseased regardless if the tissue was DKD or healthy. Tubule ROIs were segmented into distal (PanCK) and proximal (neg) tubules. While both distal and proximal tubules are called tubules, they perform very different functions in the kidney.
In order to import the data into a spatial experiment object. We use the readGeoMx
function.
## ExperimentHub with 3 records
## # snapshotDate(): 2022-04-19
## # $dataprovider: Nanostring
## # $species: NA
## # $rdataclass: data.frame
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH7364"]]'
##
## title
## EH7364 | GeomxDKDdata_count
## EH7365 | GeomxDKDdata_sampleAnno
## EH7366 | GeomxDKDdata_featureAnno
countFile <- eh[["EH7364"]]
sampleAnnoFile <- eh[["EH7365"]]
featureAnnoFile <- eh[["EH7366"]]
spe <- readGeoMx(countFile, sampleAnnoFile, featureAnnoFile = featureAnnoFile, hasNegProbe = TRUE)
Next, we inspect the SpatialExperiment object. For more details, see SpatialExperiment.
We can see the dataset has 18503 genes and 231 ROIs (regions of interest).
## class: SpatialExperiment
## dim: 18503 231
## metadata(1): NegProbes
## assays(2): counts logcounts
## rownames(18503): PADI2 CYP24A1 ... FAM166A AGTPBP1
## rowData names(3): RTS_ID ProbeID Negative
## colnames(231): disease3_scan | 001 | PanCK disease3_scan | 001 | neg
## ... disease1B_scan | 023 | Geometric Segment disease1B_scan | 024 |
## Geometric Segment
## colData names(25): SlideName ScanName ... RoiReportY sample_id
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : ROICoordinateX ROICoordinateY
## imgData names(0):
Here we can see two types of count matrix, raw count in the count
assay and logCPM count in the logcount
assay.
## [1] "counts" "logcounts"
Alternatively, standR
provided a function to generate a spatial experiment object from a DGEList object (edgeR
package).
## class: SpatialExperiment
## dim: 18503 231
## metadata(0):
## assays(2): counts logcounts
## rownames(18503): PADI2 CYP24A1 ... FAM166A AGTPBP1
## rowData names(3): RTS_ID ProbeID Negative
## colnames(231): disease3_scan | 001 | PanCK disease3_scan | 001 | neg
## ... disease1B_scan | 023 | Geometric Segment disease1B_scan | 024 |
## Geometric Segment
## colData names(28): group lib.size ... RoiReportY sample_id
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(0) :
## imgData names(0):
The quality control (QC) procedures are consisting of three major steps:
Inspection of the sample metadata: sample metadata can be seen as tabular-like format using the colData
function, however here we aim to visualise the relations among important sample metadata, such as what slides did the ROI come from, which are the control groups and treatment groups, what are the pre-defined tissue types etc. By doing this step, we can have a general idea of some important question such as how the experiment was designed, should we be looking out for batch effect, and what interested comparison can be established.
Gene level QC: at gene level, by default we aim at removing genes that are not expressed in more than 90% of the ROIs, and identifying any ROIs with only few genes that are expressed. This is the similar process as edgeR::filterByExpr
, as genes with consistently low counts are unlikely be identified as DE genes, and only keeping genes with sufficiently large counts in the DE analysis can increase statistical power while reduce multiple testing burden.
ROI level QC: at ROI level, we aim at identify low-quality ROIs with relatively small library size and low cell count. Problematic ROIs that are not removed could show up as separate clusters in PCA/UMAPs and further affect the comparisons in DE analysis.
To visualise sample metadata, we can use the plotSampleInfo
function. In this dataset, slides, diseases status, tissue regions, and different subtypes of the tissues are the important features that we could input into the function.
Based on the description of the data, we know that all glomerulus are classified as abnormal and healthy, and tubule are classified as neg and PanCK.
We therefore merge the region-related annotations to avoid collinearity, which can affect the process of batch correction.
Using the addPerROIQC
function, we can add key statistics to the colData
of the object. In this case, argument rm_genes
is set to TRUE, with the default settings of min_count = 5
and sample_fraction = 0.9
, meaning here we first calculated a expression threshold on logCPM scale (to account for library size differences), then genes that have expression value below the threshold in more than 90% of the ROIs will be removed. The count matrix with the removed gene are stored in the metadata
of the object, along with the calculated expression threshold.
Here we can see that 121 genes are removed.
## [1] 18382 231
## [1] "NegProbes" "lcpm_threshold" "genes_rm_rawCount"
## [4] "genes_rm_logCPM"
Using the plotGeneQC
function, we can have a look at which were the genes removed and the overall distribution of percentage of non-expressed genes in all ROIs. By default, top 9 genes are plotted here (arranging by mean expression), user can increase the number of plotted genes by changing the top_n
parameter.
Moreover, users can order the samples (using parameter ordannots
) and color/shape the dots with annotation to find out if these genes are specifically expressed in some samples (e.g. in some tissue types or in some treatment group) so that we may need to retain them.
In this case we don’t see any specific biological pattern for the samples expressing this genes (figure above).
Using the plotROIQC
function, we are able to visualise some QC statistics in the ROI level, such as library size and cell count (AOINucleiCount) (the default settinngs for this function).
In the ROI level QC, we first aim to identify (if any) ROI(s) that have relatively low library size and low cell count because they are considered as low quality samples due to insufficient sequencing depth or lack of RNA in the chosen region.
In this case, looking at the distribution plots of library size and nuclei count, we don’t see any particular spike in the low ends, rather the distributions are relatively smooth. Looking at the dot plot, library sizes are mostly positively correlate with the nuclei count, with some data have relatively low library size while the nuclei count is reasonable. We therefore can try to draw an filtering threshold at the low end of the library size, in this case 50000. By coloring the dot with their slide names, we find that the ROIs below the threshold are all from slide disease1B, suggesting the reason for this might be some technical issues of slide disease1B.
Since library size of 50000 seems to be a reasonable threshold, here we subset the spatial experiment object based on the library size in colData
.
We can use function plotRLExpr
to visualise the relative log expression (RLE) of the data to inspect the technical variation of the data by looking at the distance from the median of the RLE (the boxplot dot) to zero.
By default, we are plotting RLE of the raw count, where most of the variation are from library size differences.
Using assay = 2
to run RLE on the logCPM data, and we can see that most technical variation from library sizes are removed. And we can also sort the data using the annotation in the object by specifying ordannots
, with color or shape mapping parameter as we usually write in ggplot, so that we can have a look at what factor is heavily contributing to the technical variation.
Here we can see obvious variation from slides to slides, and small variations are also observed within each slide.
Using the drawPCA
function, we can perform principal component analysis (PCA) on the data. The PCA plot can help visualising the variation (both biological and technical) in the data and finding out which are the main factors contributing to the variations.
Here we color the PCA with slide information, and shape by regions (tissue). We can see that PC1 is mainly spread out by regions, especially glomerulus and tubule. And grouping based on slide within each tissue are observed. The subtypes in tubule are clearly separated, but different subtypes of glomerulus is still grouping together. Moreover, diseased tissues and control tissues are mixed as well (disease slides and normal slides).
plotMDS
can be used to visualise the multidimension scaling of the data.
Further more, to be consistent with the scater
package, which is commonly used in single cell data and visium data analysis, we provide the function plotDR
to visualise any kinds of dimension reduction results generated from scater::run*
, such as UMAP, TSNE and NMF, by specifying dimred
.
Here we plot the UMAP of our data. Similar variation can be observed like PCA and MDS above.
As we observed the technical variations in the data in both RLE and PCA plots. It is necessary to perform normalization in the data.
In the standR
package, we offer normalization options including TMM, RPKM, TPM, CPM, upperquartile and sizefactor. Among them, RPKM and TPM required gene length information (add genelength
column to the rowData
of the object). For TMM, upperquartile and sizefactor, their normalized factor will be stored their metadata
.
Here we used TMM to normalize the data.
colData(spe)$biology <- paste0(colData(spe)$disease_status, "_", colData(spe)$regions)
spe_tmm <- geomxNorm(spe, method = "TMM")
Then we use RLE and PCA plot to assess the normalization count.
The RLE plot show most of the median of RLE are close to zero, indicating that lots of technical variation are removed.
However, batch effect from the different slides are still observed, confounding the separation between disease and normal.
In the Nanostring’s GeoMX DSP protocol, due to the fact that one slide is only big enough for a handful of tissue segments (ROIs), it is common that we see the DSP data being confounded by the batch effect introduced by different slides. In order to establish fair comparison between ROIs later on, it is necessary to remove this batch effect from the data.
In the standR
package, we provide two approaches of removing batch effects, including RUV4 and Limma.
To run RUV4 batch correction, we need to provide a list of “negative control genes (NCGs)”.
The function findNCGs
allows identifying the NCGs from the data. In this case, since the batch effect is mostly introduced by slide, we therefore want to identify NCGs across all slides, so here we set the batch_name
to “SlideName”, and select the top 500 least variable genes across different slides as NCGs.
## [1] "NegProbes" "lcpm_threshold" "genes_rm_rawCount"
## [4] "genes_rm_logCPM" "NCGs"
Now we can run RUV4 on the data using function geomxBatchCorrection
. By default this function will use RUV4
to normalize the data. For RUV4 correction, the function is requiring 3 parameters other than the input object, including factors
: the factor of interest, i.e. the biological variation we plan to keep; NCGs
: the list of negative control genes detected using the function findNCGs
; and k
: is the number of unwanted factors to use, in the RUV documentation, it is suggest that we should use the smallest k once we don’t observe technical variation in the data.
Here we can find the best k by using function findBestK
to examine the silhouette score of ruv4-normalized count data clustering according to batch from k = 1 to 10, where the score is closing to zero means we are more likely to successfully remove the batch effect from the data. In this case, k = 5 is the satisfied k.
We therefore will choose k=5 for this dataset.
We can then inspect the PCA of the corrected data with annotations, to inspect the removal of batch effects, and the retaining of the biological factors.
Another option is set the parameter method
to “Limma”, which uses the remove batch correction method from limma
. In this mode, the function is requiring 2 parameters, including batch
: a vector that indicating batches for all samples; and design
: a design matrix which is generated by model.matrix
, in the design matrix, all biologically-relevant factors should be included.
spe_lrb <- geomxBatchCorrection(spe,
batch = colData(spe)$SlideName, method = "Limma",
design = model.matrix(~ 0 + disease_status + regions,
data = colData(spe)))
## Coefficients not estimable: batch6
Again, using PCA to inspect the batch correction process.
Besides looking at the PCA plots of normalized count, another way to evaluate the batch correction is by summarising statistics such as adjusted rand index, jaccard similarity coefficient, silhouette coefficient and etc. We can do this by using the plotClusterEvalStats
function. In the biology section, higher the score is better while in the batch section, lower is better.
We can see that when it comes to stratifying for biology factors (disease status and tissue regions) or measuring batch clustering for this dataset, RUV4 outperform Limma in most statistics.
Moreover, we can also have a look at the RLE plots of the normalized count to determine which batch correction performs better for this dataset.
We can clearly see that RUV4-corrected count have a overall more-closer-to-zero median RLE compared to the limma-corrected data. Therefore, in the downstream differential expression analysis, we would suggest using the RUV4 as the batch correction method for this specific dataset.
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggalluvial_0.12.3 ggplot2_3.3.5
## [3] ExperimentHub_2.4.0 AnnotationHub_3.4.0
## [5] BiocFileCache_2.4.0 dbplyr_2.1.1
## [7] limma_3.52.0 SpatialExperiment_1.6.0
## [9] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.0
## [11] Biobase_2.56.0 GenomicRanges_1.48.0
## [13] GenomeInfoDb_1.32.0 IRanges_2.30.0
## [15] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [17] MatrixGenerics_1.8.0 matrixStats_0.62.0
## [19] standR_1.0.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.4.1 splines_4.2.0
## [3] BiocParallel_1.30.0 scater_1.24.0
## [5] digest_0.6.29 htmltools_0.5.2
## [7] viridis_0.6.2 magick_2.7.3
## [9] fansi_1.0.3 magrittr_2.0.3
## [11] memoise_2.0.1 ScaledMatrix_1.4.0
## [13] cluster_2.1.3 tzdb_0.3.0
## [15] Biostrings_2.64.0 readr_2.1.2
## [17] R.utils_2.11.0 vroom_1.5.7
## [19] colorspace_2.0-3 blob_1.2.3
## [21] rappdirs_0.3.3 ggrepel_0.9.1
## [23] rbibutils_2.2.8 xfun_0.30
## [25] dplyr_1.0.8 crayon_1.5.1
## [27] RCurl_1.98-1.6 jsonlite_1.8.0
## [29] glue_1.6.2 ruv_0.9.7.1
## [31] gtable_0.3.0 zlibbioc_1.42.0
## [33] XVector_0.36.0 DelayedArray_0.22.0
## [35] car_3.0-12 BiocSingular_1.12.0
## [37] DropletUtils_1.16.0 Rhdf5lib_1.18.0
## [39] HDF5Array_1.24.0 abind_1.4-5
## [41] scales_1.2.0 DBI_1.1.2
## [43] edgeR_3.38.0 rstatix_0.7.0
## [45] Rcpp_1.0.8.3 viridisLite_0.4.0
## [47] xtable_1.8-4 dqrng_0.3.0
## [49] bit_4.0.4 rsvd_1.0.5
## [51] httr_1.4.2 FNN_1.1.3
## [53] ellipsis_0.3.2 pkgconfig_2.0.3
## [55] R.methodsS3_1.8.1 farver_2.1.0
## [57] scuttle_1.6.0 sass_0.4.1
## [59] uwot_0.1.11 locfit_1.5-9.5
## [61] utf8_1.2.2 tidyselect_1.1.2
## [63] labeling_0.4.2 rlang_1.0.2
## [65] later_1.3.0 AnnotationDbi_1.58.0
## [67] munsell_0.5.0 BiocVersion_3.15.2
## [69] tools_4.2.0 cachem_1.0.6
## [71] cli_3.3.0 generics_0.1.2
## [73] RSQLite_2.2.12 broom_0.8.0
## [75] evaluate_0.15 stringr_1.4.0
## [77] fastmap_1.1.0 mclustcomp_0.3.3
## [79] yaml_2.3.5 knitr_1.38
## [81] bit64_4.0.5 purrr_0.3.4
## [83] KEGGREST_1.36.0 nlme_3.1-157
## [85] sparseMatrixStats_1.8.0 mime_0.12
## [87] R.oo_1.24.0 compiler_4.2.0
## [89] beeswarm_0.4.0 filelock_1.0.2
## [91] curl_4.3.2 png_0.1-7
## [93] interactiveDisplayBase_1.34.0 ggsignif_0.6.3
## [95] tibble_3.1.6 bslib_0.3.1
## [97] stringi_1.7.6 highr_0.9
## [99] RSpectra_0.16-1 lattice_0.20-45
## [101] Matrix_1.4-1 vctrs_0.4.1
## [103] pillar_1.7.0 lifecycle_1.0.1
## [105] rhdf5filters_1.8.0 BiocManager_1.30.17
## [107] Rdpack_2.3 jquerylib_0.1.4
## [109] BiocNeighbors_1.14.0 cowplot_1.1.1
## [111] bitops_1.0-7 irlba_2.3.5
## [113] httpuv_1.6.5 patchwork_1.1.1
## [115] R6_2.5.1 promises_1.2.0.1
## [117] gridExtra_2.3 vipor_0.4.5
## [119] assertthat_0.2.1 rhdf5_2.40.0
## [121] rjson_0.2.21 withr_2.5.0
## [123] GenomeInfoDbData_1.2.8 mgcv_1.8-40
## [125] parallel_4.2.0 hms_1.1.1
## [127] grid_4.2.0 beachmat_2.12.0
## [129] tidyr_1.2.0 rmarkdown_2.14
## [131] DelayedMatrixStats_1.18.0 carData_3.0-5
## [133] ggpubr_0.4.0 shiny_1.7.1
## [135] ggbeeswarm_0.6.0