Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is also available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).
The package is installable from CRAN.
install.packages("rainette")
The development version is installable from R-universe.
install.packages("rainette", repos = "https://juba.r-universe.dev")
Let’s start with an example corpus provided by the excellent quanteda package.
library(quanteda)
library(rainette)
data_corpus_inaugural
First, we’ll use split_segments() to split each document into segments of about 40 words (punctuation is taken into account).
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
Next, we’ll apply some preprocessing and compute a document-term matrix with quanteda functions.
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
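To see what the trimming step does, it can help to compare the matrix before and after. The sketch below rebuilds the objects from the steps above so that it runs on its own; `dtm_full` is a name introduced here for illustration only.

```r
library(quanteda)
library(rainette)

# Rebuild the segments and tokens from the steps above
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))

# Keep the untrimmed matrix under a separate (illustrative) name
dtm_full <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm_full, min_docfreq = 10)

# Trimming drops rare terms; the number of segments is unchanged
c(segments = ndoc(dtm),
  terms_before = nfeat(dtm_full),
  terms_after = nfeat(dtm))

# Most frequent remaining terms
topfeatures(dtm, 10)
```

With `min_docfreq = 10`, only terms that appear in at least 10 segments are kept.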
We can then apply a simple clustering on this matrix with the rainette() function. We specify the number of clusters (k), and the minimum number of forms in each segment (min_segment_size). Segments which do not include enough forms will be merged with the following or previous one when possible.
res <- rainette(dtm, k = 6, min_segment_size = 15)
We can use the rainette_explor() shiny interface to visualise and explore the different clusterings at each k.
rainette_explor(res, dtm, corpus)
The Cluster documents tab lets you browse and filter the documents in each cluster.
We can also directly generate the clusters description plot for a given k with rainette_plot().
rainette_plot(res, dtm, k = 5)
Or cut the tree at a chosen k and add a group membership variable to our corpus metadata.
docvars(corpus)$cluster <- cutree(res, k = 5)
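As a quick check on the result, the membership variable can be tabulated with base R. The sketch below rebuilds the objects from the steps above so that it runs on its own; `sizes` is a name introduced here for illustration, and it is an assumption that segments left out of every cluster, if any, appear as NA.

```r
library(quanteda)
library(rainette)

# Rebuild the objects from the steps above
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
res <- rainette(dtm, k = 6, min_segment_size = 15)

# Cut the tree at k = 5 and store the membership as a document variable
docvars(corpus)$cluster <- cutree(res, k = 5)

# Number of segments per cluster (NA counted separately if present)
sizes <- table(docvars(corpus)$cluster, useNA = "ifany")
print(sizes)
```

Strongly unbalanced sizes can be a hint to try another k.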
In addition to this, we can also perform a double clustering, i.e. two simple clusterings produced with different min_segment_size values which are then “crossed” to generate more robust clusters. To do this, we use rainette2() on two rainette() results:
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 5)
We can then use rainette2_explor() to explore and visualise the results.
rainette2_explor(res, dtm, corpus)
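Outside the shiny interface, a double clustering can presumably be handled much like a simple one. The following is a minimal sketch under two assumptions that should be checked against the package reference: that rainette2_plot() takes the same arguments as rainette_plot(), and that cutree() also accepts a rainette2() result. The name `clusters2` is made up for illustration, and the objects are rebuilt so that the block runs on its own.

```r
library(quanteda)
library(rainette)

# Rebuild the document-term matrix from the steps above
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)

# Two simple clusterings, then the crossed one
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 5)

# Assumption: same arguments as rainette_plot()
rainette2_plot(res, dtm, k = 5)

# Assumption: cutree() also accepts a rainette2() result
clusters2 <- cutree(res, k = 5)
table(clusters2, useNA = "ifany")
```

Since only segments on which both clusterings agree are kept in the crossed clusters, a share of NA values is to be expected here.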
Two vignettes are available:
This clustering method was created by Max Reinert, and is described in several articles, notably:
Thanks to Pierre Ratinaud, the author of Iramuteq, for releasing it as free and open-source software. Although the R code has been almost entirely rewritten, it was an invaluable resource for understanding the algorithms.
Many thanks to Sébastien Rochette for the creation of the hex logo.
Many thanks to Florian Privé for his work on rewriting and optimizing the Rcpp code.