Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA), measuring their similarities with S-CLOP: A procedure to select the LDA run with the highest mean pairwise similarity to all other runs, where similarity is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning). An LDA run is specified by its assignments, which lead to estimators for the distribution parameters. Repeated runs lead to different results, which we counter by choosing the most representative LDA run as the prototype.
Please cite the JOSS paper using the BibTeX entry
@article{<placeholder>,
title = {{ldaPrototype}: A method in {R} to get a Prototype of multiple Latent Dirichlet Allocations},
author = {Jonas Rieger},
journal = {Journal of Open Source Software},
year = {2020},
volume = {5},
number = {51},
pages = {2181},
doi = {10.21105/joss.02181},
url = {https://doi.org/10.21105/joss.02181}
}
which is also obtained by the call citation("ldaPrototype").
This R package is licensed under the GPLv3. For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, …) and feature requests please use the issue tracker. Pull requests are welcome and will be included at the discretion of the author.
install.packages("ldaPrototype")
For the development version use devtools:
devtools::install_github("JonasRieger/ldaPrototype")
Load the package and the example dataset from Reuters, which consists of 91 articles. tosca::LDAprep can be used to convert text data into the format required by ldaPrototype.
library("ldaPrototype")
data(reuters_docs)
data(reuters_vocab)
Run the shortcut function to create an LDAPrototype object. It consists of the LDAPrototype of 4 LDA runs (with specified seeds) with 10 topics each. The LDA selected by the algorithm can be retrieved using getPrototype or getLDA.
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10, seeds = 1:4)
proto = getPrototype(res) #= getLDA(res)
The same result can also be achieved by executing the following lines of code in several steps, which can be useful for interim evaluations.
reps = LDARep(docs = reuters_docs, vocab = reuters_vocab,
  n = 4, K = 10, seeds = 1:4)
topics = mergeTopics(reps, vocab = reuters_vocab)
jacc = jaccardTopics(topics)
sclop = SCLOP.pairwise(jacc)
res2 = getPrototype(reps, sclop = sclop)
proto2 = getPrototype(res2) #= getLDA(res2)
identical(res, res2)
There is also the option to use similarity measures other than the Jaccard coefficient. Currently, cosine similarity (cosineTopics), Jensen-Shannon divergence (jsTopics) and rank-biased overlap (rboTopics) are implemented in addition to the standard Jaccard coefficient (jaccardTopics).
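To make the alternative measures more concrete, the following base R sketch computes a cosine similarity and a Jensen-Shannon based similarity for two toy word-count vectors. The helper names cosine_sim and js_sim, the toy counts and the choice of log base 2 are assumptions made for illustration; they are not the implementations behind cosineTopics and jsTopics.

```r
# Toy word counts of two topics over the same five-word vocabulary (made-up numbers).
topic_a <- c(50, 30, 12, 5, 3)
topic_b <- c(40, 2, 25, 20, 13)

# Cosine similarity: cosine of the angle between the two count vectors.
cosine_sim <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Jensen-Shannon similarity: 1 minus the Jensen-Shannon divergence (log base 2).
# A small epsilon is added to every count so that no probability is exactly zero
# (assumed here to mirror the role of the epsilon parameter mentioned for jsTopics).
js_sim <- function(x, y, epsilon = 1e-06) {
  p <- (x + epsilon) / sum(x + epsilon)
  q <- (y + epsilon) / sum(y + epsilon)
  m <- (p + q) / 2
  kl <- function(a, b) sum(a * log2(a / b))
  1 - (kl(p, m) + kl(q, m)) / 2
}

cos_ab <- cosine_sim(topic_a, topic_b)
js_ab <- js_sim(topic_a, topic_b)
```

Both helpers return 1 for identical count vectors and values close to 0 for topics that share no words, so they can be read on the same scale as the Jaccard coefficient.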
To get an overview of the workflow, the associated functions and getters for each type of object, the following call is helpful:
?`ldaPrototype-package`
Similar to the quick start example, the shortcut of one single call is again compared with the step-by-step procedure. We model 5 LDAs with K = 12 topics, hyperparameters alpha = eta = 0.1 and seeds 1:5. We want to calculate the log likelihoods for the 20 iterations after 5 burn-in iterations, and topic similarities should be based on atLeast = 3 words (see Step 3 below). In addition, we want to keep all interim calculations, which would be discarded by default to save memory space.
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
  n = 5, K = 12, alpha = 0.1, eta = 0.1, compute.log.likelihood = TRUE,
  burnin = 5, num.iterations = 20, atLeast = 3, seeds = 1:5,
  keepLDAs = TRUE, keepSims = TRUE, keepTopics = TRUE)
Based on res we can have a look at several getter functions:
getID(res)
getPrototypeID(res)
getParam(res)
getParam(getLDA(res))
getLDA(res, all = TRUE)
getLDA(res)
est = getEstimators(getLDA(res))
est$phi[,1:3]
est$theta[,1:3]
getLog.likelihoods(getLDA(res))
getSCLOP(res)
getSimilarity(res)[1:5, 1:5]
tosca::topWords(getTopics(getLDA(res)), 5)
In the first step we simply run the LDA procedure five times with the given parameters. This can also be done with the support of batchtools, using LDABatch instead of LDARep, or with parallelMap, by setting the pm.backend and (optionally) ncpus arguments.
reps = LDARep(docs = reuters_docs, vocab = reuters_vocab,
  n = 5, K = 12, alpha = 0.1, eta = 0.1, compute.log.likelihood = TRUE,
  burnin = 5, num.iterations = 20, seeds = 1:5)
The topic matrices of all replications are merged and reduced to the vocabulary given in vocab. By default, the vocabulary of the first topic matrix is used, which assumes for simplicity that all LDAs contain the same vocabulary set.
topics = mergeTopics(reps, vocab = reuters_vocab)
In the third step, we use the merged topic matrix to calculate pairwise topic similarities using the Jaccard coefficient, with parameters that control which words are considered. A word is taken as relevant for a topic if its count passes the thresholds given by limit.rel and limit.abs. A word is considered for the calculation of similarities if it is relevant for the topic or if it belongs to the (atLeast =) 3 most common words in the corresponding topic. Alternatively, the similarities can be calculated using cosine similarity (cosineTopics), Jensen-Shannon divergence (jsTopics, with parameter epsilon to ensure computability) or rank-biased overlap (rboTopics, with parameter k for the maximum depth of evaluation and p as weighting parameter).
jacc = jaccardTopics(topics, limit.rel = 1/500, limit.abs = 10, atLeast = 3)
getSimilarity(jacc)[1:3, 1:3]
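The thresholding described above can be illustrated with a small base R sketch on a made-up topic matrix. The helpers considered_words and jaccard are hypothetical names, and the strict ">" comparisons against limit.abs and against limit.rel times the topic's total count are an assumption about how "passes thresholds" is meant; this is not the package's internal code.

```r
# Toy merged topic matrix: words in rows, topics in columns (made-up counts).
topics_toy <- matrix(c(50, 30, 12,  5,  3,
                       40,  2, 25, 20, 13),
                     nrow = 5,
                     dimnames = list(paste0("w", 1:5), c("t1", "t2")))

# Logical indicator of the words considered for one topic.
considered_words <- function(counts, limit.rel, limit.abs, atLeast) {
  # Relevant: the count passes both the absolute and the relative threshold.
  relevant <- counts > limit.abs & counts > limit.rel * sum(counts)
  # Always keep the atLeast most common words of the topic (ties included).
  top <- counts >= sort(counts, decreasing = TRUE)[atLeast]
  relevant | top
}

# Jaccard coefficient of two word sets given as logical vectors.
jaccard <- function(a, b) sum(a & b) / sum(a | b)

sets <- apply(topics_toy, 2, considered_words,
              limit.rel = 1/500, limit.abs = 10, atLeast = 3)
jacc_toy <- jaccard(sets[, "t1"], sets[, "t2"])
jacc_toy  # 2 shared words out of 5 considered in total: 0.4
```

Here topic t1 keeps w1, w2 and w3, topic t2 keeps w1, w3, w4 and w5, so the two sets share two of five words.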
We can check the number of relevant and considered words using the ad-hoc getters. The difference between n1 and n2 can become larger than (atLeast =) 3 if there are ties in the word counts, which is negligible for large sample sizes.
n1 = getRelevantWords(jacc)
n2 = getConsideredWords(jacc)
(n2-n1)[n2-n1 != 0]
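The effect of ties can be reproduced on a single made-up topic. The sketch below assumes that tied counts at the atLeast-th rank are all kept, which is one plausible reading of the behaviour described above; the variable names and counts are illustrative only.

```r
# Made-up word counts of one topic with a three-way tie at count 7.
counts <- c(w1 = 9, w2 = 7, w3 = 7, w4 = 7, w5 = 1)
atLeast <- 3
limit.abs <- 10

# No word passes the absolute threshold, so no word is relevant.
relevant <- counts > limit.abs
# The third-largest count is 7, and every word with count >= 7 is kept.
cutoff <- sort(counts, decreasing = TRUE)[atLeast]
considered <- relevant | counts >= cutoff

n1_toy <- sum(relevant)    # number of relevant words
n2_toy <- sum(considered)  # number of considered words
n2_toy - n1_toy            # 4, which exceeds atLeast = 3 because of the tie
```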
It is possible to represent the calculated pairwise topic similarities as a dendrogram using dendTopics and related plot options.
dend = dendTopics(jacc)
plot(dend)
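For intuition, a dendrogram of this kind can be built in base R from any similarity matrix by clustering on the dissimilarity 1 - similarity. The sketch below uses a made-up similarity matrix of four topics and complete linkage; the linkage method and all names are assumptions for illustration, not a description of what dendTopics does internally.

```r
# Made-up symmetric similarity matrix of four topics from two LDA runs.
labs <- c("run1.t1", "run1.t2", "run2.t1", "run2.t2")
sim_toy <- matrix(c(1.0, 0.1, 0.8, 0.2,
                    0.1, 1.0, 0.3, 0.7,
                    0.8, 0.3, 1.0, 0.1,
                    0.2, 0.7, 0.1, 1.0),
                  nrow = 4, byrow = TRUE, dimnames = list(labs, labs))

# Hierarchical clustering on the dissimilarities 1 - similarity.
hc_toy <- hclust(as.dist(1 - sim_toy), method = "complete")
dend_toy <- as.dendrogram(hc_toy)

# Cutting into two clusters groups the similar topics of both runs together.
clusters_toy <- cutree(hc_toy, k = 2)
```

In this toy example each cluster contains one topic from each run, which is the kind of structure that indicates two similar LDA runs.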
The S-CLOP algorithm results in a pruning state of the dendrogram, which can be retrieved by calling pruneSCLOP. By default, each topic is colored according to the LDA run it belongs to; cluster membership can instead be visualized by the colors or by vertical lines with freely chosen parameters.
pruned = pruneSCLOP(dend)
plot(dend, pruned)
plot(dend, pruning = pruned, pruning.par = list(type = "both", lty = 1, lwd = 2, col = "red"))
To determine the LDAPrototype, the pairwise S-CLOP similarities of the 5 LDA runs are needed.
sclop = SCLOP.pairwise(jacc)
In the last step the LDAPrototype itself is determined by maximizing the mean pairwise S-CLOP per LDA.
res2 = getPrototype(reps, sclop = sclop)
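The selection rule itself is simple enough to sketch in base R. The matrix below is a made-up example of pairwise similarities between four runs, not output of SCLOP.pairwise, and the variable names are illustrative assumptions.

```r
# Made-up symmetric matrix of pairwise similarities between four LDA runs.
runs <- paste0("LDA", 1:4)
sclop_toy <- matrix(c(1.00, 0.80, 0.60, 0.70,
                      0.80, 1.00, 0.75, 0.85,
                      0.60, 0.75, 1.00, 0.65,
                      0.70, 0.85, 0.65, 1.00),
                    nrow = 4, byrow = TRUE, dimnames = list(runs, runs))

# Ignore each run's similarity to itself.
diag(sclop_toy) <- NA

# Mean similarity of each run to all other runs.
mean_sim <- rowMeans(sclop_toy, na.rm = TRUE)

# The prototype is the run with the highest mean pairwise similarity.
prototype_toy <- names(which.max(mean_sim))
prototype_toy  # "LDA2"
```

The mean similarities are 0.70, 0.80, about 0.67 and about 0.73, so the second run is the most representative one in this toy example.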
There are several possibilities for using shortcut functions to combine steps of the procedure. For example, we can determine the LDAPrototype directly after Step 1:
res3 = getPrototype(reps, atLeast = 3)