The package can be installed from CRAN
or downloaded from the GitHub repository (development version).
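As a rough sketch of the two installation routes: the CRAN call below uses the standard install.packages function, while the GitHub route is left commented out because the repository path shown is an assumption and should be replaced with the repository this document links to. The remotes package is also an assumption here; any GitHub installer would do.

```r
# Stable release from CRAN (skipped if the package is already installed).
if (!requireNamespace("ldatuning", quietly = TRUE)) {
  install.packages("ldatuning")
}

# Development version from GitHub. The repository path is an assumption;
# substitute the actual repository before running.
# install.packages("remotes")
# remotes::install_github("nikita-moor/ldatuning")
```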
The ldatuning package
implements four metrics for selecting the optimal
number of topics for an LDA model.
Load the “AssociatedPress” dataset from the topicmodels
package.
library("ldatuning")
library("topicmodels")
data("AssociatedPress", package = "topicmodels")
# Use only the first 10 documents to keep the example fast
dtm <- AssociatedPress[1:10, ]
The easiest way is to calculate all metrics at once. All existing
methods require training multiple LDA models in order to select the one with the best
performance. This is a computationally intensive procedure, and ldatuning
uses parallelism, so do not forget to set the
correct number of CPU cores in the mc.cores parameter to achieve
the best performance.
All standard LDA methods and parameters from the topicmodels
package can be set with method and control.
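One possible way to choose a value for mc.cores is to ask the machine how many cores it has, using detectCores from the base parallel package. The sketch below also builds a control list for the Gibbs sampler. The names n_cores and gibbs_control are hypothetical, and the burnin and iter values are illustrative assumptions rather than recommendations.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system (a common convention,
# not a requirement of ldatuning).
n_cores <- max(1L, as.integer(available) - 1L)

# Options forwarded to the topicmodels Gibbs sampler; the burnin and
# iter values are illustrative assumptions, not tuned settings.
gibbs_control <- list(seed = 77, burnin = 100, iter = 500)
```

These values could then be supplied as mc.cores = n_cores and control = gibbs_control in a call to FindTopicsNumber.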
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
The result is a table of topic counts and the corresponding metric values:
topics | Griffiths2004 | CaoJuan2009 | Arun2010 |
---|---|---|---|
15 | -15297.82 | 0.5047240 | 15.92711 |
14 | -15338.24 | 0.4927860 | 15.36552 |
13 | -15319.82 | 0.4944709 | 15.80569 |
12 | -15326.94 | 0.4756351 | 15.81278 |
11 | -15293.55 | 0.4347111 | 15.23313 |
10 | -15291.00 | 0.3829542 | 14.93706 |
9 | -15303.87 | 0.3379840 | 14.71664 |
8 | -15256.30 | 0.3061726 | 14.78140 |
7 | -15259.80 | 0.2746812 | 14.82908 |
6 | -15251.04 | 0.2612029 | 15.28425 |
5 | -15226.91 | 0.1875260 | 15.34470 |
4 | -15242.86 | 0.1779016 | 16.29708 |
3 | -15266.66 | 0.1600736 | 16.97832 |
2 | -15349.79 | 0.1169522 | 18.47430 |
A simple approach to analyzing the metrics is to find the extremum; a more complete description is given in the papers the metrics are named after.
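The extremum search can be sketched in base R using the values from the table above. The direction for each metric is an assumption based on how the package documentation groups them, as far as I understand it: Griffiths2004 is maximized, while CaoJuan2009 and Arun2010 are minimized. The data frame is rebuilt by hand here so the snippet runs on its own; in practice the output of FindTopicsNumber would be used directly.

```r
# Values copied from the result table above.
result <- data.frame(
  topics = 15:2,
  Griffiths2004 = c(-15297.82, -15338.24, -15319.82, -15326.94, -15293.55,
                    -15291.00, -15303.87, -15256.30, -15259.80, -15251.04,
                    -15226.91, -15242.86, -15266.66, -15349.79),
  CaoJuan2009 = c(0.5047240, 0.4927860, 0.4944709, 0.4756351, 0.4347111,
                  0.3829542, 0.3379840, 0.3061726, 0.2746812, 0.2612029,
                  0.1875260, 0.1779016, 0.1600736, 0.1169522),
  Arun2010 = c(15.92711, 15.36552, 15.80569, 15.81278, 15.23313,
               14.93706, 14.71664, 14.78140, 14.82908, 15.28425,
               15.34470, 16.29708, 16.97832, 18.47430)
)

# Number of topics at the extremum of each metric
# (assumed directions: maximize Griffiths2004, minimize the other two).
best <- c(
  Griffiths2004 = result$topics[which.max(result$Griffiths2004)],
  CaoJuan2009   = result$topics[which.min(result$CaoJuan2009)],
  Arun2010      = result$topics[which.min(result$Arun2010)]
)
print(best)
```

On these values the three metrics point to 5, 2 and 9 topics respectively, so the extremum is best treated as a starting point rather than a final answer.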
The helper function FindTopicsNumber_plot
can be used for
easy analysis of the results.
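A minimal sketch of calling the plot helper follows. To keep it self-contained, it rebuilds a small data frame from four rows of the table above (an assumption made for illustration; normally the value returned by FindTopicsNumber is passed straight in) and draws the plot only if ldatuning is installed.

```r
# A few rows copied from the result table above, for illustration only.
result <- data.frame(
  topics = c(2, 5, 9, 15),
  Griffiths2004 = c(-15349.79, -15226.91, -15303.87, -15297.82),
  CaoJuan2009 = c(0.1169522, 0.1875260, 0.3379840, 0.5047240),
  Arun2010 = c(18.47430, 15.34470, 14.71664, 15.92711)
)

# Draw the plot only when the package is available.
if (requireNamespace("ldatuning", quietly = TRUE)) {
  ldatuning::FindTopicsNumber_plot(result)
}
```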
Results calculated on the whole dataset (about 10 hours on a quad-core computer) look like this:
From this plot, one can conclude that the optimal number of topics is in the range 90-140. The Deveaud2014 metric is not informative in this situation.