varclust package tutorial

Tutorial for varclust package

Introduction

varclust is a package that enables dimension reduction via variables clustering. We assume that each group of variables can be summarized with few latent variables.

It also provides a function to determine number of principal components in PCA.

This tutorial will gently introduce you to usage of package varclust and familiarize with its options.

You can install varclust from github (current development version).

install_github("psobczyk/varclust")

or from CRAN

install.package("varclust")

Main usage example

library(varclust)
library(mclust)

Let us consider some real genomic data. We’re going to use FactoMineR package data. As they are no longer available online we added them to this package This data consists of two types of variables. First group are gene expression data. The second is RNA data. Please note that it may take few minutes to run the following code:

comp_file_name <- system.file("extdata", "gene.csv", package = "varclust")
comp <- read.table(comp_file_name, sep=";", header=T, row.names=1) 
benchmarkClustering <- c(rep(1, 68), rep(2, 356))    
comp <- as.matrix(comp[,-ncol(comp)])
set.seed(2)
mlcc.fit <- mlcc.bic(comp, numb.clusters = 1:10, numb.runs = 10, max.dim = 8, greedy = TRUE, 
                     estimate.dimensions = TRUE, numb.cores = 1, verbose = FALSE)
print(mlcc.fit)

## $nClusters:  2 
## $segmentation:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
##  [36] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1
##  [71] 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 2 1 2 2 2 1 1 1 1 1
## [106] 2 2 2 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2
## [141] 2 2 1 1 2 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2
## [176] 1 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 2
## [246] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1
## [281] 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## [316] 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 1
## [351] 1 1 1 2 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1
## [386] 1 1 1 1 2 2 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 1
## [421] 2 1 1 2
## $BIC:  -20488.45 
## $subspacesDimensions:
##  8 2

plot(mlcc.fit)

mclust::adjustedRandIndex(mlcc.fit$segmentation, benchmarkClustering)

## [1] 0.251669

misclassification(mlcc.fit$segmentation, benchmarkClustering, max(table(benchmarkClustering)), 2)

## [1] 0.1603774

integration(mlcc.fit$segmentation, benchmarkClustering)

## [1] 0.8284038 0.6886792

Please note that although we use benchmarkClustering as a reference, it is not an oracle. Some variables from expression data can be highly correlated and act together with RNA data.

More details about the method

The algorithm aims to reduce dimensionality of data by clustering variables. It is assumed that variables lie in few low-rank subspaces. Our iterative algorithm recovers their partition as well as estimates number of clusters and dimensions of subspaces. This kind of problem is called Subspace Clustering. For a reference comparing multiple approaches see here.

Running algorithm with some initial segmentation

You should also use mlcc.reps function if you have some apriori knowledge regarding true segmentation. You can enforce starting point

mlcc.fit3 <- mlcc.reps(comp, numb.clusters = 2, numb.runs = 0, max.dim = 8, 
                       initial.segmentations = list(benchmarkClustering), numb.cores = 1)
print(mlcc.fit3)

## $segmentation:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [211] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [246] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [281] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [316] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [351] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [386] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [421] 2 2 2 2
## $BIC:  -20342.01 
## $basis:
## List of 2
##  $ : num [1:43, 1:4] 6.054 -2.101 -0.993 -3.061 1.938 ...
##  $ : num [1:43, 1:8] -4.39 -12.18 -1.79 -13.93 -8.31 ...

mclust::adjustedRandIndex(mlcc.fit3$segmentation, benchmarkClustering)

## [1] 0.9413814

misclassification(mlcc.fit3$segmentation, benchmarkClustering, max(table(benchmarkClustering)), 2)

## [1] 0.01179245

integration(mlcc.fit3$segmentation, benchmarkClustering)

## [1] 0.9870291 0.9704146

Execution time

Execution time of mlcc.bic depends mainly on:

Number of clusters (numb.clusters)
Number of variables
Number of runs of k-means algorithm (numb.runs)

For a dataset of 1000 variables and 10 clusters computation takes about 8 minutes on Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz.

Choosing values of parameters

If possible one should use multiple cores for computation. By default all but one cores are used. User can override this with numb.cores parameter
For more precise segmentation one should increase numb.runs. Default value is 20
Parameter max.dim should reflect how large we expect subspaces to be. Default value is 4
If parameter greedy is TRUE (value set by default) the number of clusters is estimated in a greedy way. So program stops after getting first BIC local maximum
If estimate.dimensions is TRUE subspaces dimensions are estimated. Otherwise all subspaces are assumed to be of dimension max.dim