• Open Access

High-Reproducibility and High-Accuracy Method for Automated Topic Classification

Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, and Luís A. Nunes Amaral
Phys. Rev. X 5, 011007 – Published 29 January 2015

Abstract

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.

  • Received 21 May 2014

DOI:https://doi.org/10.1103/PhysRevX.5.011007

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Authors & Affiliations

Andrea Lancichinetti (1,2), M. Irmak Sirer (2), Jane X. Wang (3), Daniel Acuna (4), Konrad Körding (4), and Luís A. Nunes Amaral (1,2,5,6)

  • (1) Howard Hughes Medical Institute (HHMI), Northwestern University, Evanston, Illinois 60208, USA
  • (2) Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60208, USA
  • (3) Department of Medical Social Sciences, Northwestern University, Feinberg School of Medicine, Chicago, Illinois 60611, USA
  • (4) Department of Physical Medicine and Rehabilitation, Rehabilitation Institute of Chicago, Northwestern University, Chicago, Illinois 60611, USA
  • (5) Department of Physics and Astronomy, Northwestern University, Evanston, Illinois 60208, USA
  • (6) Northwestern Institute on Complex Systems, Northwestern University, Evanston, Illinois 60208, USA

Popular Summary

A significant fraction of the data currently being generated and stored is in the form of unstructured text. Researchers use topic modeling algorithms to automatically classify text documents into topics, with applications in text recommendation systems, digital image analysis, spam filtering, and high-dimensional data mining, where the aim is to record and extract relevant data. To enable intelligent data searches, we study synthetic and real data whose topics are known and evaluate the performance of state-of-the-art algorithms. We show that current optimization techniques often fail to yield accurate and reproducible results, particularly if the topics are heterogeneously distributed. By borrowing methods from graph clustering, we propose a novel optimization method with high accuracy and reproducibility and no computational overhead.

One state-of-the-art algorithm for classifying text-based data is latent Dirichlet allocation, a means of assigning topics to documents. We measure its performance using synthetic data with known topics. We find that the algorithm is not able to detect the ground-truth topics because of the roughness of the likelihood landscape. We propose a novel technique that builds a network of co-occurring words and finds topics starting from clusters of words in that network. We show that our method is more accurate, both in synthetic cases and in a real-world case of documents from the journal Science.
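The general idea of finding topics from clusters in a word co-occurrence network can be illustrated with a toy sketch. This is not the authors' algorithm, which this summary does not specify in detail: here, plain connected components stand in for a proper community detection step, and the `min_count` threshold, the function names, and the six-document corpus are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_graph(docs, min_count=2):
    """Link two words if they share at least `min_count` documents (assumed threshold)."""
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    graph = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def word_clusters(graph):
    """Connected components of the graph: a crude stand-in for community detection."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(graph[word] - component)
        seen |= component
        clusters.append(component)
    return clusters


def assign_topic(doc, clusters):
    """Index of the cluster sharing the most words with `doc`, or None if no overlap."""
    words = set(doc.lower().split())
    overlaps = [len(words & cluster) for cluster in clusters]
    if not overlaps or max(overlaps) == 0:
        return None
    return overlaps.index(max(overlaps))


# Hypothetical toy corpus with two obvious topics.
docs = [
    "protein gene cell",
    "gene cell dna",
    "protein dna cell",
    "galaxy star planet",
    "star planet orbit",
    "galaxy orbit star",
]
graph = cooccurrence_graph(docs, min_count=2)
clusters = word_clusters(graph)
labels = [assign_topic(doc, clusters) for doc in docs]
print(clusters)
print(labels)
```

On this toy corpus the two word clusters separate the biology and astronomy vocabularies, and each document is labeled with the cluster it overlaps most. On real text a simple threshold would likely merge everything into one component, which is why a dedicated graph clustering method is needed in practice.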

We expect that our results will yield new ways of automatically classifying text documents, a process that is particularly relevant given the growth of electronic, searchable data.

Issue

Vol. 5, Iss. 1 — January - March 2015

Reuse & Permissions

It is not necessary to obtain permission to reuse this article or its components as it is available under the terms of the Creative Commons Attribution 3.0 License. This license permits unrestricted use, distribution, and reproduction in any medium, provided attribution to the author(s) and the published article's title, journal citation, and DOI are maintained. Please note that some figures may have been included with permission from other third parties. It is your responsibility to obtain the proper permission from the rights holder directly for these figures.
