Abstract
Symptomatic heterogeneity in complex diseases reflects differences in underlying molecular states that warrant investigation. However, selecting the numerous parameters of an exploratory clustering analysis in RNA profiling studies requires a deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without requiring prior field knowledge do not exist, and downstream gene association analyses must be performed separately. We have developed a suite of tools that automates these processes and makes robust unsupervised clustering of transcriptomic data more accessible through automated, machine-learning-based functions. The efficiency of each tool was tested on four datasets characterised by different expression signal strengths. Our toolkit’s decisions reflected the true number of stable partitions in datasets where the subgroups are discernible. Even in datasets with less clear biological distinctions, stable subgroups with distinct expression profiles and clinical associations were identified.
Loading the library to access the functions and the two toy datasets: gene expressions and cluster memberships.
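The loading step itself is not shown; a minimal chunk would look like the following. `library(omada)` is standard, and `toy_genes` is the expression dataset used in the later examples; the cluster membership dataset is loaded the same way via `data()` (check `data(package = "omada")` for its exact name).

```r
# Attach the package; its toy datasets then become available via data()
library(omada)

# Toy gene expression data frame used in the examples below
data(toy_genes)
```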
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] omada_1.9.0 dplyr_1.1.4 glmnet_4.1-8 Matrix_1.7-1
## [5] clValid_0.7 cluster_2.1.6 genieclust_1.1.6 reshape_0.8.9
## [9] ggplot2_3.5.1 diceR_2.2.0 Rcpp_1.0.13 fpc_2.2-13
## [13] kernlab_0.9-33 pdfCluster_1.0-4
##
## loaded via a namespace (and not attached):
## [1] magic_1.6-1 sass_0.4.9 utf8_1.2.4
## [4] generics_0.1.3 class_7.3-22 robustbase_0.99-4-1
## [7] shape_1.4.6.1 lattice_0.22-6 digest_0.6.37
## [10] magrittr_2.0.3 evaluate_1.0.1 grid_4.5.0
## [13] iterators_1.0.14 fastmap_1.2.0 foreach_1.5.2
## [16] plyr_1.8.9 jsonlite_1.8.9 nnet_7.3-19
## [19] survival_3.7-0 mclust_6.1.1 purrr_1.0.2
## [22] fansi_1.0.6 scales_1.3.0 codetools_0.2-20
## [25] modeltools_0.2-23 jquerylib_0.1.4 abind_1.4-8
## [28] cli_3.6.3 rlang_1.1.4 splines_4.5.0
## [31] munsell_0.5.1 withr_3.0.2 cachem_1.1.0
## [34] yaml_2.3.10 geometry_0.5.0 tools_4.5.0
## [37] flexmix_2.3-19 parallel_4.5.0 colorspace_2.1-1
## [40] vctrs_0.6.5 R6_2.5.1 stats4_4.5.0
## [43] lifecycle_1.0.4 MASS_7.3-61 pkgconfig_2.0.3
## [46] bslib_0.8.0 pillar_1.9.0 gtable_0.3.6
## [49] glue_1.8.0 DEoptimR_1.1-3 xfun_0.48
## [52] tibble_3.2.1 tidyselect_1.2.1 prabclus_2.3-4
## [55] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
## [58] compiler_4.5.0 diptest_0.77-1
To investigate the clustering feasibility of a dataset, this package
provides two stability-assessment functions that simulate a dataset of
given dimensions and calculate its stabilities over a range of cluster
numbers. feasibilityAnalysis() generates an independent dataset for a
specific number of classes, samples and features, while
feasibilityAnalysisDataBased() accepts an existing dataset and extracts
its statistics (means and standard deviations) for a specific number of
clusters. Note that these estimates serve only as an indication of the
dataset’s fitness for the downstream analysis and not as an actual
measure of quality, as they do not account for the real signal in the
data but only for the relation between the numbers of samples, features
and clusters.
# Selecting dimensions and number of clusters
new.dataset.analysis <- feasibilityAnalysis(classes = 4, samples = 50,
features = 15)
# Basing the simulation on an existing dataset and selecting the number of clusters
existing.dataset.analysis <- feasibilityAnalysisDataBased(data = toy_genes,
classes = 3)
# Extract results of either function
average.sts.k <- get_average_stabilities_per_k(new.dataset.analysis)
maximum.st <- get_max_stability(new.dataset.analysis)
average.st <- get_average_stability(new.dataset.analysis)
generated.ds <- get_generated_dataset(new.dataset.analysis)
Using omada()
along with a gene expression data frame and
an upper k (the maximum number of clusters to be considered) we can run
the whole analysis toolkit to automate clustering decision making and
produce the estimated optimal clusters. Removal or imputation of NA
values is required before running any of the tools.
# Running the whole cascade of tools inputting an expression dataset
# and the upper k (number of clusters) to be investigated
omada.analysis <- omada(toy_genes, method.upper.k = 6)
# Extract results
pa.scores <- get_partition_agreement_scores(omada.analysis)
fs.scores <- get_feature_selection_scores(omada.analysis)
fs.optimal.features <-
get_feature_selection_optimal_features(omada.analysis)
fs.optimal.number.of.features <-
get_feature_selection_optimal_number_of_features(omada.analysis)
cv.scores <- get_cluster_voting_scores(omada.analysis)
cv.memberships <- get_cluster_voting_memberships(omada.analysis)
cv.metrics.votes <- get_cluster_voting_metric_votes(omada.analysis)
cv.k.votes <- get_cluster_voting_k_votes(omada.analysis)
sample.memberships <- get_sample_memberships(omada.analysis)
# Plot results
plot_partition_agreement(omada.analysis)
To select the most appropriate clustering technique for our dataset
we compare the internal partition agreement of three different
approaches, namely spectral, k-means and hierarchical clustering, using
the clusteringMethodSelection()
function. We define the
upper k to be considered as well as the number of internal comparisons
per approach. A higher number of comparisons increases robustness but
also run time.
# Selecting the upper k limit and number of comparisons
method.results <- clusteringMethodSelection(toy_genes, method.upper.k = 3,
number.of.comparisons = 2)
# Extract results
pa.scores <- get_partition_agreement_scores(method.results)
# Plot results
plot_partition_agreement(method.results)
This suite also provides a function to calculate the partition
agreement between two specific clustering approaches and parameter sets
individually: partitionAgreement()
, which requires the selection of the two algorithms, their measures and
the number of clusters.
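For example, a direct comparison between spectral and k-means partitions might look like the sketch below. The argument names (`algorithm.1`, `measure.1`, `number.of.clusters`, …) follow the naming pattern of the other tools in this suite but are assumptions to verify against the function’s help page.

```r
# Compare spectral clustering (RBF kernel) against k-means (Lloyd)
# for a fixed number of clusters
agreement.results <- partitionAgreement(toy_genes,
                                        algorithm.1 = "spectral",
                                        measure.1 = "rbfdot",
                                        algorithm.2 = "kmeans",
                                        measure.2 = "Lloyd",
                                        number.of.clusters = 3)

# Extract the agreement scores
agreement.scores <- get_agreement_scores(agreement.results)
```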
To select the features that provide the most stable clusters, the
function featureSelection()
requires the minimum and
maximum number of clusters (k) and the feature step, which dictates how
many features are added to the feature set at each iteration. It is
advised to use the algorithm suggested by the previous tools.
# Selecting minimum and maximum number of clusters and feature step
feature.selection.results <- featureSelection(toy_genes, min.k = 3, max.k = 6,
step = 3)
# Extract results
feature.selection.scores <- get_average_feature_k_stabilities(feature.selection.results)
optimal.number.of.features <- get_optimal_number_of_features(feature.selection.results)
optimal.features <- get_optimal_features(feature.selection.results)
# Plot results
plot_average_stabilities(feature.selection.results)
To estimate the most appropriate number of clusters based on an
ensemble of internal metrics, the function clusterVoting()
accepts the minimum and maximum number of clusters to be considered as
well as the algorithm of choice (“sc” for spectral, “km” for k-means and
“hr” for hierarchical clustering). It is advised to use the feature set
and algorithm suggested by the previous tools.
# Selecting minimum and maximum number of clusters and algorithm to be used
cluster.voting.results <- clusterVoting(toy_genes, 4, 8, "sc")
# Extract results
internal.metric.scores <- get_internal_metric_scores(cluster.voting.results)
cluster.memberships.k <- get_cluster_memberships_k(cluster.voting.results)
metric.votes.k <- get_metric_votes_k(cluster.voting.results)
vote.frequencies.k <- get_vote_frequencies_k(cluster.voting.results)
# Plot results
plot_vote_frequencies(cluster.voting.results)
The previous steps have provided every clustering parameter needed to
perform the final partitioning with
optimalClustering()
. This tool uses the dataset with the most stable feature set, the
selected number of clusters (k) and the appropriate algorithm. It
additionally iterates over the algorithm’s possible parameters and
retains the one yielding the highest stability.
# Running the clustering with specific number of clusters(k) and algorithm
sample.memberships <- optimalClustering(toy_genes, 4, "spectral")
# Extract results
memberships <- get_optimal_memberships(sample.memberships)
optimal.stability <- get_optimal_stability_score(sample.memberships)
optimal.parameter <- get_optimal_parameter_used(sample.memberships)
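As a quick sanity check, the extracted membership table can be tabulated to inspect cluster sizes; the column name `memberships` used below is an assumption and should be checked against the returned data frame.

```r
# Count samples per cluster (column name assumed)
table(memberships$memberships)
```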