Some best practices for anticlustering

Martin Papenberg

This vignette documents some “best practices” for anticlustering using the R package anticlust. In many cases, the suggestions pertain to overriding the default values of arguments of anticlustering(), which seems to be a difficult decision for users. However, I advise you: Do not stick with the defaults; check out the results of different anticlustering specifications; repeat the process; play around; read the documentation (especially ?anticlustering); change arguments arbitrarily; compare the output. Nothing can break.¹
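One way to act on this advice is to run several specifications side by side and inspect the resulting groups. The following is only a minimal sketch: the `iris` data set, the choice of `K = 3`, and the names `specs` and `results` are assumptions made for illustration, not part of the original vignette.

```r
library(anticlust)

features <- iris[, -5]  # four numeric variables; iris ships with base R

# Hypothetical set of specifications to compare; extend as needed
specs <- list(
  default      = list(objective = "diversity", standardize = FALSE),
  standardized = list(objective = "diversity", standardize = TRUE),
  kplus        = list(objective = "kplus", standardize = TRUE)
)

results <- lapply(specs, function(spec) {
  set.seed(42)  # make each run reproducible
  anticlustering(
    features,
    K = 3,
    objective = spec$objective,
    standardize = spec$standardize
  )
})

# Means and standard deviations per group, one table per specification
for (name in names(results)) {
  cat("\n", name, "\n")
  print(mean_sd_tab(features, results[[name]]))
}
```

Looking at the tables next to each other usually makes it easier to judge which specification balances the groups in the way that matters for your application.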

This document uses somewhat imperative language; nuance and explanations are given in the package documentation, the other vignettes, and the papers by Papenberg and Klau (2021; https://doi.org/10.1037/met0000301) and Papenberg (2024; https://doi.org/10.1111/bmsp.12315). Note that deciding which anticlustering objective to use usually requires substantial content considerations and cannot be reduced to “which one is better”. However, some hints are given below.

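If speed is not an issue (it usually is not), use method = "local-maximum" instead of the default method = "exchange". It is unambiguously better with respect to the objective: it repeats the exchange procedure until no swap improves the objective any further.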
If speed is not an issue (it usually is not), use several repetitions (argument repetitions).
Use standardize = TRUE instead of the default standardize = FALSE.²
Do not use the default objective = "diversity" when the group sizes are not equal (preferably, use objective = "kplus" or objective = "average-diversity").
If you only care about similarity in mean values, use objective = "variance".
You should (probably) not only care about similarity in mean values: prefer objective = "kplus" over objective = "variance" (or check out the function kplus_anticlustering()).
If you (only) care about similarity in means and standard deviations, use objective = "kplus" instead of the default objective = "diversity".
With k-plus anticlustering, always use standardize = TRUE.
If you want to apply anticlustering on a large data set, read the vignette “Speeding up Anticlustering”.
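Taken together, several of the hints above might translate into a call like the one below. This is a sketch under stated assumptions rather than a universal recipe: the `iris` data, the group sizes, and `repetitions = 10` are illustrative choices, and the appropriate objective still depends on your application.

```r
library(anticlust)

set.seed(123)  # results depend on a random starting partition
features <- iris[, -5]  # four numeric variables; iris ships with base R

# Equal group sizes: three groups of 50
groups <- anticlustering(
  features,
  K = 3,
  objective = "kplus",       # similarity in means and standard deviations
  method = "local-maximum",  # instead of the default "exchange"
  repetitions = 10,          # illustrative value; the best run is returned
  standardize = TRUE         # recommended with k-plus anticlustering
)
table(groups)
mean_sd_tab(features, groups)  # means and SDs per group

# Unequal group sizes: pass the sizes as a vector and avoid "diversity"
groups_unequal <- anticlustering(
  features,
  K = c(100, 50),
  objective = "kplus",
  method = "local-maximum",
  repetitions = 10,
  standardize = TRUE
)
table(groups_unequal)
```

The second call only changes `K`; the remaining arguments follow the same recommendations as in the equal-size case.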

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301

Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80–102. https://doi.org/10.1111/bmsp.12315

¹ Well, actually your R session can break if you use an optimal method (method = "ilp") with a data set that is too large.
² You might ask why standardize = TRUE is not the default. Actually, there are two reasons. First, the argument was not always available in anticlust, and changing the default behaviour of a function when releasing a new version is often undesirable. Second, it seems like a big decision to me to just change users’ data by default (which is what standardizing does). When in doubt, just compare the results of using standardize = TRUE and standardize = FALSE and decide for yourself which you prefer. Standardization may not be the best choice in all settings.