VALIDICLUST
(VALID Inference for CLUsters Separation
Testing) is a package for ensuring valid inference after data
clustering. This problem occurs when clustering forces differences
between groups of observations to build clusters. This leads to an
inflation of the Type I error rate, especially because the data is used
twice: i) to build clusters, i.e., to form hypotheses, and ii) to
perform the inference step. The 3 main functions of the package are:
test_selective_inference()
following the work of Gao
et al. [1] on selective inference for
clustering
merge_selective_inference()
for the merging
selective test. In this method, all adjacent p-values are merged using
the harmonic mean
test_multimod()
uses the DipTest [2] to investigate the presence of a continuum between two
estimated clusters for a given variable
Install the development version from GitHub
::install_github("benhvt/VALIDICLUST") remotes
To illustrate our proposed method, we use the palmerpenguins
dataset from the palmerpenguins
package. To ensure that
there are no truly separate groups of observations, we selected only the
female penguins of the Adelie species, yielding 73 observations. On this
dataset, we apply hierarchical clustering (on Euclidean distances with
Ward linkage) to build 2 clusters. In this negative control dataset, we
know the true, i.e., the non-existing separated group of observations.
We therefore apply our 3 tests to each of the 4 numerical measurements
to test for separation of the 2 clusters, and compare the resulting
p-values to the p-values of the classical t-test.
data("penguins")
<- subset(penguins, species == "Adelie" & sex == "female")
data <- data[,3:6]
data
# Clustering function
<- function(x){
hcl2 <- scale(x, center = T, scale = T)
x <- dist(x, method = "euclidean")
distance <- hclust(distance, method = "ward.D2")
cah return(as.factor(cutree(cah, k=2)))
}
$Cluster <- hcl2(data)
data
ggpairs(data, aes(colour = Cluster, fill = Cluster)) +
scale_colour_manual(values = c("#F1786D","#53888f")) +
scale_fill_manual(values = c("#F1786D","#53888f")) +
theme_minimal()
<- pval.merge <- pval.multimod <- pval.t.test <- rep(NA, 4)
pval.inference.selective for (i in 1:4){
<- test_selective_inference(as.matrix(data[,1:4]),
pval.inference.selective[i] k1=1,
k2=2,
g=i,
cl_fun = hcl2,
cl=data$Cluster)$pval
<- merge_selective_inference(as.matrix(data[,1:4]),
pval.merge[i] k1=1,
k2=2,
g=i,
cl_fun = hcl2,
cl=data$Cluster)$pval
<- test_multimod(as.matrix(data[,1:4]),
pval.multimod[i] g=i,
k1=1,
k2=2,
cl = data$Cluster)$pval
<- t.test(data[data$Cluster == 1,i], data[data$Cluster==2, i])$p.value
pval.t.test[i]
}
<- rbind(pval.inference.selective,
pval.res
pval.merge,
pval.multimod,
pval.t.test)
colnames(pval.res) <- gsub("_", x=colnames(data)[1:4], replacement = " ")
rownames(pval.res) <- c("Selective Inference",
"Merging Selective Inference",
"Multimoality test",
"T-test")
round(pval.res,3) %>% htmlTable::htmlTable()
bill length mm | bill depth mm | flipper length mm | body mass g | |
---|---|---|---|---|
Selective Inference | 0.156 | 0.716 | 0.484 | 0.438 |
Merging Selective Inference | 0.153 | 0.76 | 0.492 | 0.48 |
Multimoality test | 0.806 | 0.193 | 0.044 | 0.599 |
T-test | 0.4 | 0 | 0 | 0 |
[1] Gao, L. L., Bien, J., & Witten, D. (2022). Selective inference for hierarchical clustering. Journal of the American Statistical Association, (just-accepted), 1-27.
[2] HARTIGAN, John A. et HARTIGAN, Pamela M. The dip test of unimodality. The annals of Statistics, 1985, p. 70-84.