• Open Access

Stochastic Model for the Vocabulary Growth in Natural Languages

Martin Gerlach and Eduardo G. Altmann
Phys. Rev. X 3, 021006 – Published 14 May 2013

Abstract

We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core words, which have higher frequency and do not affect the probability that a new word is used, and (ii) the remaining, virtually infinite number of noncore words, which have lower frequency and, once used, reduce the probability that a new word is used in the future. Our model relies on a careful analysis of the Google Ngram database of books published in the last centuries, and its main consequence is the generalization of Zipf’s and Heaps’ laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language, not on the database. From the point of view of our model, the main change on historical time scales is in the composition of the specific words included in the finite list of core words, which we observe to change, following an exponential decay, at a rate of approximately 30 words per year for English.
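The following is a minimal, illustrative sketch (in Python) of the mechanism described above, not the authors' actual model: a fixed, finite core whose use leaves the new-word probability untouched, and an open-ended noncore vocabulary in which every distinct word already used lowers the chance of introducing another new word. The functional form of the new-word probability, the repetition rule, and all parameter values (n_core, p_core, a, beta) are assumptions chosen only to make the mechanism concrete; in particular, core words are drawn uniformly here, which ignores the frequency structure within the core.

```python
import random
from collections import Counter

def simulate(n_tokens, n_core=10_000, p_core=0.4, a=1.0, beta=0.5, seed=0):
    """Generate a synthetic word stream with a fixed core vocabulary and an
    open-ended noncore vocabulary (all parameter values are illustrative)."""
    rng = random.Random(seed)
    counts = Counter()        # frequency of every word used so far
    noncore_tokens = []       # every noncore token ever emitted
    n_noncore_types = 0       # number of distinct noncore words used so far
    heaps = []                # (tokens so far, distinct words so far)

    for m in range(1, n_tokens + 1):
        if rng.random() < p_core:
            # Core word: drawn from a fixed, finite list; its use does not
            # change the probability of introducing a new word later.
            word = ("core", rng.randrange(n_core))
        else:
            # Noncore word: every distinct noncore word already in use
            # lowers the chance that this token is a brand-new word.
            p_new = min(1.0, a * (n_noncore_types + 1) ** (-beta))
            if not noncore_tokens or rng.random() < p_new:
                word = ("noncore", n_noncore_types)   # introduce a new word
                n_noncore_types += 1
            else:
                # Repeat a past noncore word with probability proportional to
                # its frequency: a uniform draw over all past noncore tokens.
                word = rng.choice(noncore_tokens)
            noncore_tokens.append(word)
        counts[word] += 1
        heaps.append((m, len(counts)))

    return counts, heaps

# Example: vocabulary growth for a synthetic text of 10^6 tokens.
counts, heaps = simulate(1_000_000)
print("distinct words after 10^6 tokens:", heaps[-1][1])
```

Plotting the second column of heaps against the first on log-log axes shows the vocabulary-growth (Heaps-type) curve produced by such a process; drawing repeats uniformly from the list of past noncore tokens reproduces frequency-proportional (Simon-style) repetition without an explicit weighted sample.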

  • Received 4 December 2012

DOI: https://doi.org/10.1103/PhysRevX.3.021006

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Authors & Affiliations

Martin Gerlach and Eduardo G. Altmann

  • Max Planck Institute for the Physics of Complex Systems, 01187 Dresden, Germany

Popular Summary

How many different words exist on the Internet, in all published books, or in a language as a whole? How are these numbers growing, in time and with database size? Is the finiteness of the vocabulary of any practical significance? The recent availability of large databases opens up possibilities for new analyses, but deeper, more systematic insights and predictive power must come from models that go beyond the data alone. In this paper, we obtain a simple stochastic model that reveals the basic mechanisms governing vocabulary growth and provides new insight into the questions above. The motivation and validation of this model come from a careful analysis of the Google Ngram database, a collection of several million books spanning the last five centuries.

The statistics of word usage share remarkable similarities with other social, physical, and biological systems. The most well-known similarity is the widespread appearance of fat-tailed distributions, e.g., Zipf’s law, which shows that words in a text span a wide range of frequencies. Revisiting some of the statistical features of word frequencies in our larger database, we observe two scaling regimes in the Zipf’s law analysis and also in the functional dependence of the number of different words on the database size. Our two-parameter fitting curve explains, to within 50% accuracy, the number of different words in texts containing from 1000 to 100 000 000 000 words. These findings lead us to propose a model that describes the growth of the vocabulary as a simple stochastic process. Hypothetical, arbitrarily large texts that share the crucial properties observed in our database can be generated from the model. Interpreting the empirical observations with the help of this model, we conclude that (i) the vocabulary is virtually infinite, (ii) there exists a core vocabulary of approximately 10 000 words, and (iii) the composition of the core vocabulary changes at a constant rate of 30 words per year, following an exponential decay.
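Purely as a reading aid, one schematic way to write a two-regime ("double") Zipf’s law and the vocabulary growth it implies is sketched below. The crossover rank b, the exponents gamma_1 and gamma_2, and the crossover text size M_b are placeholder symbols introduced here, not fitted quantities from the paper, although the paper's two free parameters would play an analogous role in a form of this kind.

```latex
% Schematic two-regime ("double") Zipf's law: word frequency F as a function
% of frequency rank r, with a crossover rank b separating the two regimes.
% gamma_1, gamma_2, b, and M_b are placeholder symbols, not fitted values.
F(r) \;\propto\;
\begin{cases}
  r^{-\gamma_1}, & r \le b \quad \text{(first regime)}\\[4pt]
  b^{\,\gamma_2-\gamma_1}\, r^{-\gamma_2}, & r > b \quad \text{(second regime)}
\end{cases}

% The corresponding two-regime vocabulary growth (generalized Heaps' law):
% the number of distinct words N grows nearly linearly with text size M up to
% a crossover scale M_b, then as a slower power law whose exponent is set by
% the tail exponent gamma_2 (> 1).
N(M) \;\sim\;
\begin{cases}
  M, & M \ll M_b\\[4pt]
  M^{1/\gamma_2}, & M \gg M_b
\end{cases}
```

In this schematic form, continuity at r = b fixes the prefactor of the second branch, and the standard Zipf-Heaps relation links the large-M growth exponent to the tail exponent gamma_2.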

Our findings have a direct impact on the study of language dynamics on historical time scales and on applications in computer science, such as search engines. They have additional significance as a remarkable example of how simple models can capture the main statistical features that emerge from the interaction of millions of individuals (or components). The simplicity and generality of our model make us confident that it will also find applications in other complex systems showing similar statistical behavior.

Issue

Vol. 3, Iss. 2 (April–June 2013)


Reuse & Permissions

It is not necessary to obtain permission to reuse this article or its components as it is available under the terms of the Creative Commons Attribution 3.0 License. This license permits unrestricted use, distribution, and reproduction in any medium, provided attribution to the author(s) and the published article's title, journal citation, and DOI are maintained. Please note that some figures may have been included with permission from other third parties. It is your responsibility to obtain the proper permission from the rights holder directly for these figures.
