Special Issue — Best Papers from the 12th IEEE International Conference on Semantic Computing (ICSC 2018); Guest Editor: Fabio Persia

Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation

    Latent Dirichlet Allocation (LDA) has gained much attention from researchers and is increasingly being applied to uncover latent semantic structures in a variety of corpora. However, nearly all researchers use symmetrical Dirichlet priors, often unaware of the practical implications these priors carry. This research is the first to explore the effects of symmetrical and asymmetrical Dirichlet priors on topic coherence and human topic ranking when uncovering latent semantic structures from scientific research articles. More specifically, we examine the practical effects of several classes of Dirichlet priors on 2000 LDA models built from the abstracts and full texts of research articles. Our results show that, for full-text data, the choice of symmetrical or asymmetrical priors on the document–topic or topic–word distribution has little effect on topic coherence scores and human topic ranking. In contrast, for abstract data, an asymmetrical prior on the document–topic distribution yields a significant increase in topic coherence scores and improved human topic ranking compared to a symmetrical prior. Priors on the topic–word distribution, whether symmetrical or asymmetrical, show no real benefit for either abstract or full-text data.
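
    The comparison above hinges on how the two Dirichlet hyperparameters are set when fitting LDA: alpha governs the document–topic distribution and eta the topic–word distribution. As a rough illustration of this kind of experiment (a sketch of our own, not the authors' released code), the snippet below uses Gensim to fit two small LDA models that differ only in the alpha prior and scores each with the C_v topic coherence measure; the toy corpus and all parameter values are placeholders.

```python
# Minimal sketch, assuming Gensim's LdaModel/CoherenceModel API; the toy
# corpus below is a placeholder, not the authors' dataset of 2000 models.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy tokenized documents standing in for preprocessed article abstracts.
texts = [
    ["topic", "model", "dirichlet", "prior", "coherence"],
    ["semantic", "structure", "corpus", "topic", "model"],
    ["coherence", "evaluation", "human", "ranking", "topic"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

scores = {}
for alpha in ("symmetric", "asymmetric"):  # prior on the document-topic distribution
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,        # illustrative; the study sweeps far larger settings
        alpha=alpha,         # 'asymmetric' gives each topic its own prior weight
        passes=20,
        random_state=0,
    )                        # eta (topic-word prior) is left at its symmetric default
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v")
    scores[alpha] = cm.get_coherence()

print(scores)  # higher C_v suggests more interpretable topics
```

    On a realistic corpus, the paper's headline result would appear here as a higher C_v score for the asymmetric-alpha model on abstract data, with little separation on full-text data.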
