Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation
Abstract
Latent Dirichlet Allocation (LDA) has attracted considerable attention from researchers and is increasingly applied to uncover underlying semantic structures in a variety of corpora. However, nearly all researchers use symmetrical Dirichlet priors, often unaware of their practical implications. This research is the first to explore the effect of symmetrical and asymmetrical Dirichlet priors on topic coherence and human topic ranking when uncovering latent semantic structures from scientific research articles. More specifically, we examine the practical effects of several classes of Dirichlet priors on 2000 LDA models created from abstract and full-text research articles. Our results show that, for full-text data, the choice between symmetrical and asymmetrical priors on either the document–topic distribution or the topic–word distribution has little effect on topic coherence scores and human topic ranking. In contrast, for abstract data, asymmetrical priors on the document–topic distribution yield significantly higher topic coherence scores and improved human topic ranking compared with a symmetrical prior. The choice of prior on the topic–word distribution, symmetrical or asymmetrical, shows no real benefit for either abstract or full-text data.
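To make the distinction concrete, the following is a minimal sketch of how a symmetrical and an asymmetrical Dirichlet prior over the document–topic distribution differ. It is not the authors' experimental setup: the helper names `symmetric_prior` and `asymmetric_prior`, the default concentration of 1/K, and the harmonic decay of the asymmetrical parameters are all illustrative assumptions.

```python
import numpy as np


def symmetric_prior(num_topics, concentration=None):
    """Hypothetical helper: every topic gets the same concentration value."""
    if concentration is None:
        concentration = 1.0 / num_topics  # assumed default, not from the paper
    return np.full(num_topics, concentration)


def asymmetric_prior(num_topics, total_mass=1.0):
    """Hypothetical helper: concentration decays harmonically across topics.

    The decay shape is an arbitrary illustrative choice; any vector of
    unequal positive values would count as an asymmetrical prior.
    """
    weights = 1.0 / np.arange(1, num_topics + 1)
    return total_mass * weights / weights.sum()


rng = np.random.default_rng(0)
K = 5  # number of topics (illustrative)

alpha_sym = symmetric_prior(K)
alpha_asym = asymmetric_prior(K)

# Draw document-topic distributions under each prior.
theta_sym = rng.dirichlet(alpha_sym, size=1000)
theta_asym = rng.dirichlet(alpha_asym, size=1000)

# Under the symmetrical prior, no topic is favoured a priori; under the
# asymmetrical prior, early topics are expected to cover more of each document.
print("symmetrical alpha: ", np.round(alpha_sym, 3))
print("asymmetrical alpha:", np.round(alpha_asym, 3))
print("mean theta (sym):  ", np.round(theta_sym.mean(axis=0), 3))
print("mean theta (asym): ", np.round(theta_asym.mean(axis=0), 3))
```

In this sketch both priors carry the same total concentration, so the only difference is how that mass is spread across topics. The expected document–topic proportions are proportional to the prior parameters, which is why the asymmetrical prior lets a few topics absorb more of each document.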