Main usage example
Let us consider some real genomic data. We use data from the FactoMineR package; as they are no longer available online, we have added them to this package. The data consist of two types of variables: the first group contains gene expression data and the second contains RNA data. Please note that the following code may take a few minutes to run:
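# Locate and read the data set shipped with the package
comp_file_name <- system.file("extdata", "gene.csv", package = "varclust")
comp <- read.table(comp_file_name, sep = ";", header = TRUE, row.names = 1)
# Reference partition: first 68 variables in group 1, remaining 356 in group 2
benchmarkClustering <- c(rep(1, 68), rep(2, 356))
# Drop the last column and convert to a numeric matrix
comp <- as.matrix(comp[, -ncol(comp)])
set.seed(2)
mlcc.fit <- mlcc.bic(comp, numb.clusters = 1:10, numb.runs = 10, max.dim = 8, greedy = TRUE,
                     estimate.dimensions = TRUE, numb.cores = 1, verbose = FALSE)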
print(mlcc.fit)
## $nClusters: 2
## $segmentation:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## [36] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1
## [71] 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 2 1 2 2 2 1 1 1 1 1
## [106] 2 2 2 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2
## [141] 2 2 1 1 2 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2
## [176] 1 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 2
## [246] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1
## [281] 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## [316] 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 1
## [351] 1 1 1 2 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1
## [386] 1 1 1 1 2 2 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 1
## [421] 2 1 1 2
## $BIC: -20488.45
## $subspacesDimensions:
## 8 2
## [1] 0.251669
## [1] 0.1603774
## [1] 0.8284038 0.6886792
Please note that although we use benchmarkClustering as a reference, it is not an oracle: some variables from the expression data can be highly correlated with the RNA data and act together with them.
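A simple way to see how an estimated segmentation relates to a reference partition is a plain cross-tabulation. The following is a minimal base R sketch on toy label vectors; `estimated` and `reference` are invented for illustration and are not taken from the run above. With the real objects one would presumably pass `mlcc.fit$segmentation` and `benchmarkClustering` instead.

```r
# Toy labels, invented for illustration only (not the gene data)
reference <- c(rep(1, 4), rep(2, 6))
estimated <- c(2, 2, 2, 1, 1, 1, 1, 1, 2, 1)

# Rows: estimated cluster, columns: reference group
confusion <- table(estimated = estimated, reference = reference)
print(confusion)

# Cluster labels are arbitrary, so with two clusters we take the better
# of the two possible label matchings when computing the agreement rate
agreement <- max(mean(estimated == reference),
                 mean((3 - estimated) == reference))
print(agreement)
```

Because cluster numbering carries no meaning, a low raw match rate does not by itself indicate a poor fit; the table shows which estimated cluster absorbs which reference group.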
More details about the method
The algorithm aims to reduce the dimensionality of the data by clustering variables. It assumes that the variables lie in a few low-rank subspaces. Our iterative algorithm recovers their partition and estimates the number of clusters and the dimensions of the subspaces. This kind of problem is called Subspace Clustering. For a reference comparing multiple approaches see here.
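To make the idea of variables lying in a few low-rank subspaces concrete, here is a rough base R sketch. It is not the package's implementation: it simulates two groups of variables, each spanned by its own latent factors, and then runs a k-means-style alternation between fitting principal components per cluster and reassigning variables. The number of clusters `k` and the subspace dimension `d` are fixed in advance here, whereas the package estimates them. All names (`explained`, `truth`, `p_per`) are illustrative assumptions.

```r
set.seed(1)
n <- 100      # observations
k <- 2        # number of clusters (assumed known in this sketch)
d <- 2        # dimension of each subspace (assumed known in this sketch)
p_per <- 20   # variables per cluster

# Simulate variables: each group is spanned by its own d latent factors
X <- do.call(cbind, lapply(seq_len(k), function(j) {
  factors <- matrix(rnorm(n * d), n, d)
  coefs <- matrix(rnorm(d * p_per), d, p_per)
  factors %*% coefs + matrix(rnorm(n * p_per, sd = 0.1), n, p_per)
}))
X <- scale(X)
truth <- rep(seq_len(k), each = p_per)

# Share of each variable's sum of squares captured by an orthonormal basis
explained <- function(basis, X) colSums(crossprod(basis, X)^2) / colSums(X^2)

# Start from a random balanced partition of the variables
segmentation <- sample(rep(seq_len(k), length.out = ncol(X)))
for (iter in 1:20) {
  # Fit step: top d principal directions of each cluster's variables
  scores <- sapply(seq_len(k), function(j) {
    cols <- which(segmentation == j)
    basis <- svd(X[, cols, drop = FALSE], nu = d, nv = 0)$u
    explained(basis, X)
  })
  # Assignment step: each variable goes to the subspace that explains it best
  updated <- max.col(scores, ties.method = "first")
  # Stop if a cluster would become empty or nothing changed
  if (length(unique(updated)) < k || all(updated == segmentation)) break
  segmentation <- updated
}
print(table(segmentation, truth))
```

Like any k-means-style procedure, this alternation can stop in a local optimum that depends on the random start, which is one reason to repeat the fit from several initial partitions.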