--- title: "Select number of topics for LDA model" author: "Murzintcev Nikita" date: '`r Sys.Date()`' output: html_document: theme: united toc: yes csl: acm-sigchi-proceedings.csl bibliography: references.bib vignette: | %\VignetteIndexEntry{Number of topics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} # library(printr) # knitr::opts_chunk$set(cache=TRUE) # knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` Package can be installed from CRAN ```{r, eval=FALSE} install.packages("ldatuning") ``` or downloaded from the GitHub repository (developer version). ```{r, eval=FALSE} install.packages("devtools") devtools::install_github("nikita-moor/ldatuning") ``` Package `ldatuning` realizes 4 metrics to select perfect number of topics for LDA model. ```{r, message=FALSE} library("ldatuning") ``` Load "AssociatedPress" dataset from the `topicmodels` package. ```{r, message=FALSE} library("topicmodels") data("AssociatedPress", package="topicmodels") dtm <- AssociatedPress[1:10, ] ``` The most easy way is to calculate all metrics at once. All existing methods require to train multiple LDA models to select one with the best performance. It is computation intensive procedure and `ldatuning` uses parallelism, so do not forget to point correct number of CPU cores in `mc.core` parameter to archive the best performance. All standard LDA methods and parameters from `topimodels` package can be set with `method` and `control`. ```{r, eval=TRUE} result <- FindTopicsNumber( dtm, topics = seq(from = 2, to = 15, by = 1), metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"), method = "Gibbs", control = list(seed = 77), mc.cores = 2L, verbose = TRUE ) ``` Result is a number of topics and corresponding values of metrics ```{r, echo=FALSE} knitr::kable(result) ``` Simple approach in analyze of metrics is to find extremum, more complete description is in corresponding papers: - minimization: - Arun2010 [@Arun2010] - CaoJuan2009 [@CaoJuan2009] - maximization: - Deveaud2014 [@Deveaud2014] - Griffiths2004 [@Griffiths2004; @Ponweiser2012] Support function `FindTopicsNumber_plot` can be used for easy analyze of the results ```{r, fig.width=6, fig.height=3, results="hide"} FindTopicsNumber_plot(result) ``` Results calculated on the whole dataset (about 10 hours on quad-core computer) look like ```{r, fig.width=9, fig.height=5, echo=FALSE} result <- read.csv(file = "files/APress.csv", header = TRUE) FindTopicsNumber_plot(result[result$topics < 500, ]) ``` From this plot can be made conclusion that optimal number of topics is in range 90-140. Metric Deveaud2014 is not informative in this situation. # References