The dbscan package [6] includes a fast implementation of Hierarchical DBSCAN (HDBSCAN) and its related algorithm(s) for the R platform. This vignette introduces how to interface with these features. To understand how HDBSCAN works, we refer to an excellent Python Notebook resource that goes over the basic concepts of the algorithm (see the SciKit-learn docs). For the sake of simplicity, consider the same sample dataset from the notebook:
library("dbscan")
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
data("moons")
plot(moons, pch=20)
To run the HDBSCAN algorithm, simply pass the dataset and the (single) parameter value ‘minPts’ to the hdbscan function.
cl <- hdbscan(moons, minPts = 5)
cl
## HDBSCAN clustering for 100 objects.
## Parameters: minPts = 5
## The clustering contains 3 cluster(s) and 0 noise points.
##
## 1 2 3
## 25 25 50
##
## Available fields: cluster, minPts, coredist, cluster_scores,
## membership_prob, outlier_scores, hc
The ‘flat’ results are stored in the ‘cluster’ member. Noise points are given a value of 0, so increment by 1.
plot(moons, col=cl$cluster+1, pch=20)
The results match intuitive notions of what ‘similar’ clusters may look like when they manifest in arbitrary shapes.
The resulting HDBSCAN object contains a hierarchical representation of every possible DBSCAN* clustering. This hierarchical representation is compactly stored in the familiar ‘hc’ member of the resulting HDBSCAN object, in the same format of traditional hierarchical clustering objects formed using the ‘hclust’ method from the stats package.
cl$hc
##
## Call:
## hdbscan(x = moons, minPts = 5)
##
## Cluster method : robust single
## Distance : mutual reachability
## Number of objects: 100
Note that although this object is available for use with any of the methods that work with ‘hclust’ objects, the distance method HDBSCAN uses (mutual reachability distance, see [2]) is not an available method of the hclust function. This hierarchy, denoted the “HDBSCAN* hierarchy” in [3], can be visualized using the built-in plotting method from the stats package
plot(cl$hc, main="HDBSCAN* Hierarchy")
As the name implies, the fascinating thing about the HDBSCAN* hierarchy is that any global ‘cut’ is equivalent to running DBSCAN* (DBSCAN w/o border points) at the tree’s cutting threshold \(eps\) (assuming the same \(minPts\) parameter setting was used). But can this be verified manually? Using a modified function to distinguish noise using core distance as 0 (since the stats cutree method does not assign singletons with 0), the results can be shown to be identical.
cl <- hdbscan(moons, minPts = 5)
check <- rep(FALSE, nrow(moons)-1)
core_dist <- kNNdist(moons, k=5-1)
## cutree doesn't distinguish noise as 0, so we make a new method to do it manually
cut_tree <- function(hcl, eps, core_dist){
cuts <- unname(cutree(hcl, h=eps))
cuts[which(core_dist > eps)] <- 0 # Use core distance to distinguish noise
cuts
}
eps_values <- sort(cl$hc$height, decreasing = TRUE)+.Machine$double.eps ## Machine eps for consistency between cuts
for (i in 1:length(eps_values)) {
cut_cl <- cut_tree(cl$hc, eps_values[i], core_dist)
dbscan_cl <- dbscan(moons, eps = eps_values[i], minPts = 5, borderPoints = FALSE) # DBSCAN* doesn't include border points
## Use run length encoding as an ID-independent way to check ordering
check[i] <- (all.equal(rle(cut_cl)$lengths, rle(dbscan_cl$cluster)$lengths) == "TRUE")
}
print(all(check == TRUE))
## [1] TRUE
The HDBSCAN* hierarchy is useful, but for larger datasets it can become overly cumbersome since every data point is represented as a leaf somewhere in the hierarchy. The hdbscan object comes with a powerful visualization tool that plots the ‘simplified’ hierarchy(see [2] for more details), which shows cluster-wide changes over an infinite number of \(eps\) thresholds. It is the default visualization dispatched by the ‘plot’ method
plot(cl)
You can change up colors
plot(cl, gradient = c("yellow", "orange", "red", "blue"))
… and scale the widths for individual devices appropriately
plot(cl, gradient = c("purple", "blue", "green", "yellow"), scale=1.5)
… even outline the most ‘stable’ clusters reported in the flat solution
plot(cl, gradient = c("purple", "blue", "green", "yellow"), show_flat = TRUE)
Note the stability scores correspond to the labels on the condensed tree, but the cluster assignments in the cluster member element do not correspond to the labels in the condensed tree. Also, note that these scores represent the stability scores before the traversal up the tree that updates the scores based on the children.
print(cl$cluster_scores)
## 1 2 3
## 110.70613 90.86559 45.62762
The individual point membership ‘probabilities’ are in the probabilities member element
head(cl$membership_prob)
## [1] 0.4354753 0.2893287 0.4778663 0.4035933 0.4574012 0.4904582
These can be used to show the ‘degree of cluster membership’ by, for example, plotting points with transparencies that correspond to their membership degrees.
plot(moons, col=cl$cluster+1, pch=21)
colors <- mapply(function(col, i) adjustcolor(col, alpha.f = cl$membership_prob[i]),
palette()[cl$cluster+1], seq_along(cl$cluster))
points(moons, col=colors, pch=20)
A recent journal publication on HDBSCAN comes with a new outlier measure that computes an outlier score of each point in the data based on local and global properties of the hierarchy, defined as the Global-Local Outlier Score from Hierarchies (GLOSH)[4]. An example of this is shown below, where unlike the membership probabilities, the opacity of point represents the amount of “outlierness” the point represents. Traditionally, outliers are generally considered to be observations that deviate from the expected value of their presumed underlying distribution, where the measure of deviation that is considered significant is determined by some statistical threshold value.
Note: Because of the distinction made that noise points, points that are not assigned to any clusters, should be considered in the definition of an outlier, the outlier scores computed are not just the inversely-proportional scores to the membership probabilities.
top_outliers <- order(cl$outlier_scores, decreasing = TRUE)[1:10]
colors <- mapply(function(col, i) adjustcolor(col, alpha.f = cl$outlier_scores[i]),
palette()[cl$cluster+1], seq_along(cl$cluster))
plot(moons, col=colors, pch=20)
text(moons[top_outliers, ], labels = top_outliers, pos=3)
A larger example dataset may be more beneficial in explicitly revealing the usefulness of HDSBCAN. Consider the ‘DS3’ dataset originally published as part of a benchmark test dataset for the Chameleon clustering algorithm [5]. It’s clear that the shapes in this dataset can be distinguished sufficiently well by a human, however, it is well known that many clustering algorithms fail to capture the intuitive structure.
data("DS3")
plot(DS3, pch=20, cex=0.25)
Using the single parameter setting of, say, 25, HDBSCAN finds 6 clusters
cl2 <- hdbscan(DS3, minPts = 25)
cl2
## HDBSCAN clustering for 8000 objects.
## Parameters: minPts = 25
## The clustering contains 6 cluster(s) and 815 noise points.
##
## 0 1 2 3 4 5 6
## 815 1585 617 614 924 1656 1789
##
## Available fields: cluster, minPts, coredist, cluster_scores,
## membership_prob, outlier_scores, hc
Marking the noise appropriately and highlighting points based on their ‘membership probabilities’ as before, a visualization of the cluster structure can be easily crafted.
plot(DS3, col=cl2$cluster+1,
pch=ifelse(cl2$cluster == 0, 8, 1), # Mark noise as star
cex=ifelse(cl2$cluster == 0, 0.5, 0.75), # Decrease size of noise
xlab=NA, ylab=NA)
colors <- sapply(1:length(cl2$cluster),
function(i) adjustcolor(palette()[(cl2$cluster+1)[i]], alpha.f = cl2$membership_prob[i]))
points(DS3, col=colors, pch=20)
The simplified tree can be particularly useful for larger datasets
plot(cl2, scale = 3, gradient = c("purple", "orange", "red"), show_flat = TRUE)
All of the computational and memory intensive tasks required by HDSBCAN were written in C++ using the Rcpp package. With DBSCAN, the performance depends on the parameter settings, primarily on the radius at which points are considered as candidates for clustering (‘eps’), and generally less so on the ‘minPts’ parameter. Intuitively, larger values of eps increase the computation time.
One of the primary computational bottleneck with using HDBSCAN is the computation of the full (euclidean) pairwise distance between all points, for which HDBSCAN currently relies on base R ‘dist’ method for. If a precomputed one is available, the running time of HDBSCAN can be moderately reduced.