Abstract
We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and on historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core words, which have higher frequency and do not affect the probability that a new word will be used, and (ii) the remaining, virtually infinite number of noncore words, which have lower frequency and, once used, reduce the probability that a new word will be used in the future. Our model relies on a careful analysis of the Google Ngram database of books published over the past five centuries, and its main consequence is the generalization of Zipf’s and Heaps’ laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language, not on the database. From the point of view of our model, the main change on historical time scales is in the composition of the specific words included in the finite list of core words, which we observe to decay exponentially in time at a rate of approximately 30 words per year for English.
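A two-parameter, two-regime generalization of Zipf’s law of the kind described above can be written schematically as follows (a sketch, not the paper’s fitted expression: the crossover rank b and tail exponent γ stand in for the two free parameters, and the classic Zipf exponent of 1 is assumed for the first regime; continuity at r = b fixes the prefactor):

```latex
% Rank-frequency distribution F(r) with two scaling regimes
% (b = crossover rank, \gamma = second exponent; both illustrative)
F(r) \propto
\begin{cases}
  r^{-1}, & r \le b,\\[2pt]
  b^{\gamma-1}\, r^{-\gamma}, & r > b,
\end{cases}
```

At r = b the two branches agree, since b^{γ-1} b^{-γ} = b^{-1}, so the curve is continuous across the crossover.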
Received 4 December 2012
DOI: https://doi.org/10.1103/PhysRevX.3.021006
This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.
Published by the American Physical Society
Popular Summary
How many different words exist on the Internet, in all published books, or in a language as a whole? How are these numbers growing, in time and with database size? Is the finiteness of the vocabulary of any practical significance? The recent availability of large databases opens up possibilities for new analyses, but deeper, systematic insight and predictive power must come from models that go beyond the data alone. In this paper, we obtain a simple stochastic model that reveals the basic mechanisms governing vocabulary growth and provides new insight into the questions above. The motivation and validation of this model come from a careful analysis of the Google Ngram database, a collection of several million books spanning the past five centuries.
The statistics of word usage share remarkable similarities with other social, physical, and biological systems. The best-known similarity is the widespread appearance of fat-tailed distributions, e.g., Zipf’s law, which shows that the words in a text span a wide range of frequencies. Revisiting some of the statistical features of word frequencies in our larger database, we observe two scaling regimes both in the Zipf analysis and in the functional dependence of the number of different words on the database size. Our two-parameter fitting curve describes, to within 50% accuracy, the number of different words in texts containing from 1000 to 100 000 000 000 words. These findings lead us to propose a model that describes the growth of the vocabulary as a simple stochastic process. Hypothetical, arbitrarily large texts that share the crucial properties observed in our database can be generated from the model. Interpreting the empirical observations with the help of this model, we conclude that (i) the vocabulary is virtually infinite, (ii) there exists a core vocabulary of approximately 10 000 words, and (iii) the composition of the core vocabulary changes at a rate of approximately 30 words per year, following an exponential decay.
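The two-class mechanism described above can be sketched as a toy simulation. All function names, functional forms, and parameter values below are illustrative assumptions, not the paper’s fitted model: each token is either a core word drawn from a fixed finite list (whose use leaves the new-word probability unchanged), or a noncore word, where every distinct noncore word already in use lowers the probability of introducing a brand-new one, and reused noncore words are chosen proportionally to their past frequency.

```python
import random

def simulate(n_tokens, core_size=100, p_core=0.5, alpha=10.0, seed=0):
    """Toy two-class vocabulary-growth process (illustrative parameters).

    Returns the generated text and the number of distinct noncore words.
    """
    rng = random.Random(seed)
    noncore_counts = []   # past usage count per noncore word (index = word id)
    noncore_total = 0     # total number of noncore tokens emitted so far
    text = []
    for _ in range(n_tokens):
        if rng.random() < p_core:
            # Core word: drawn from a fixed, finite list; using it does
            # not change the chance of future new words.
            text.append(("core", rng.randrange(core_size)))
        else:
            v = len(noncore_counts)
            # Each already-used noncore word lowers the probability of
            # introducing a brand-new word (decaying new-word rate).
            p_new = alpha / (alpha + v)
            if v == 0 or rng.random() < p_new:
                noncore_counts.append(1)          # brand-new noncore word
                text.append(("noncore", v))
            else:
                # Reuse an existing noncore word proportionally to its
                # past frequency (rich-get-richer reuse).
                r = rng.randrange(noncore_total)
                acc = 0
                for wid, count in enumerate(noncore_counts):
                    acc += count
                    if r < acc:
                        noncore_counts[wid] += 1
                        text.append(("noncore", wid))
                        break
            noncore_total += 1
    return text, len(noncore_counts)
```

Because the new-word probability decays as more distinct noncore words accumulate, the vocabulary grows sublinearly with text length, qualitatively reproducing the Heaps-like behavior discussed above.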
Our findings have a direct impact on the study of language dynamics on historical time scales and on applications in computer science such as search engines. They are additionally significant as a remarkable example of how simple models are able to capture the main statistical features that emerge from the interaction of millions of individuals (or components). The simplicity and generality of our model make us confident that it will also find applications in other complex systems showing similar statistical behavior.