High-Reproducibility and High-Accuracy Method for Automated Topic Classification

Open Access

High-Reproducibility and High-Accuracy Method for Automated Topic Classification

Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, and Luís A. Nunes Amaral

Phys. Rev. X 5, 011007 – Published 29 January 2015

Abstract

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.

2 More

Received 21 May 2014

DOI:https://doi.org/10.1103/PhysRevX.5.011007

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Authors & Affiliations

Andrea Lancichinetti^1,2, M. Irmak Sirer², Jane X. Wang³, Daniel Acuna⁴, Konrad Körding⁴, and Luís A. Nunes Amaral^1,2,5,6

¹Howard Hughes Medical Institute (HHMI), Northwestern University, Evanston, Illinois 60208, USA
²Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60208, USA
³Department of Medical Social Sciences, Northwestern University, Feinberg School of Medicine, Chicago, Illinois 60611, USA
⁴Department of Physical Medicine and Rehabilitation, Rehabilitation Institute of Chicago, Northwestern University, Chicago, Illinois 60611, USA
⁵Department of Physics and Astronomy, Northwestern University, Evanston, Illinois 60208, USA
⁶Northwestern Institute on Complex Systems, Northwestern University, Evanston, Illinois 60208, USA

Popular Summary

A significant fraction of data that are currently being generated and stored is in the form of unstructured text. Researchers use topic modeling algorithms to automatically classify text documents into topics in text recommendation systems, digital image analysis, spam filtering, and high-dimensional data mining with the goal of recording and extracting relevant data. With the goal of enabling intelligent data searches, we study synthetic and real data, whose topics are known, to evaluate the performance of state-of-the-art algorithms. We show that current optimization techniques often have trouble yielding accurate and reproducible results, particularly if the topics are heterogeneously distributed. By borrowing methods from graph clustering, we propose a novel optimization method with high accuracy and reproducibility and no computational overhead.

One state-of-the-art algorithm for classifying text-based data is latent Dirichlet allocation, a means of assigning topics to documents. We measure its performance using specific synthetic data with known topics. We find that the algorithm is not able to detect the ground-truth topics because of the roughness of the likelihood landscape. We propose a novel technique that builds a network of co-occurrent words and finds topics starting from clusters of words in that network. We show that our method is more accurate, both in synthetic cases and in a real-world case of documents from the journal Science.

We expect that our results will yield new ways of automatically classifying text documents, a process that is particularly relevant given the growth of electronic, searchable data.

Key Image

Article Text

Click to Expand

Supplemental Material

Click to Expand

References

Click to Expand

Issue

Vol. 5, Iss. 1 — January - March 2015

Subject Areas

Interdisciplinary Physics

Reuse & Permissions

Author publication services for translation and copyediting assistance advertisement

Physical Review X