Stochastic model for the vocabulary growth in natural languages

We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words which have higher frequency and do not affect the probability of a new word to be used; and (ii) the remaining virtually infinite number of noncore-words which have lower frequency and once used reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the google-ngram database of books published in the last centuries and its main consequence is the generalization of Zipf's and Heaps' law to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.


I. INTRODUCTION
Even in our time of big data [1][2][3] there is no indication of a saturation of the vocabulary size (total number of different words) with increasing database size. In order to clarify whether it is meaningful to estimate a vocabulary size in the limit of infinitely large databases, it is essential to understand not only the birth and death of words [4][5][6], but also the process governing the usage of new words and its dependence on database size. The interest in this problem is motivated by fundamental linguistic studies [7,8] as well as by applications in search engines, which require an estimation of the number of different words in a given database [9][10][11].
The scaling between the number of different words, N , and the size of the database in words, M , as N ∼ M λ is known as Heaps' law [12] and has been studied in different linguistic [13][14][15] and non-linguistic [16,17] contexts. The universality and interest of this empirical scaling is surpassed only by Zipf's law [18], which states that the frequency F (r) of the r-th most frequent word in a database decays as F (r) ∼ 1/r. The relation between Heaps' and Zipf's law has been the subject of great recent interest [19][20][21]. Furthermore, it is well known that deviations of the Heaps'-and Zipf's-laws are observed in the tails of Heaps'-and Zipf's-plots (i.e., for large N and r, respectively) [22][23][24]. Similar deviations of fat-tailed distributions appear in a variety of social and physical systems [25,26] and are crucial when extrapolating to the limit of large databases.
In this paper we propose a stochastic growth model whose predictions go beyond the simpler scalings of Heaps' and Zipf's law and are compatible with actual observations in the tail of the corresponding distributions. Our model is in the same spirit of, but differs from, the simpler versions of Yule's-, Simon's-, Gibrat's-, and preferential attachment-growth models [26][27][28][29], because it contains two categories of words and leads to two scaling regimes in the Heaps'-and Zipf's-plots. These findings are supported by a statistical analysis of the googlengram database indicating that the only two free parameters needed in the description of these scalings remain unchanged over centuries, depend only on the language, and that there is a slow change of words belonging to each category. The latter adds to the recent interest in language dynamics as a complex system [30,31].
The paper is organized as follows: in Sec. II we present statistical analysis of the google-ngram database in terms of word frequencies as well as the growth of the vocabulary. This will then lead us to the formulation of our stochastic model for the vocabulary growth in Sec. III. In Sec. IV we investigate dynamical aspects on historical time scales within the framework of our model.

II. DATA ANALYSIS Data
The main motivation for our model comes from empirical observations. As databases, we use the googlengram corpus [1] for English, German, French, Spanish, and Russian, which provide data of the word-frequencies (occurring in printed books) with a yearly resolution for a period of several hundred years (1520-2000). Our main interest in this database stems from its large size (several millions of books with > 10 11 words) and from the long time span it covers (thus enabling us to trace historical changes in the usage of language). We consider as words only the 1-grams consisting uniquely by letters present in the alphabet of the corresponding language. This pragmatic definition reduces the effect of symbolic sequences, foreign words, numbers, or scanning problems in our observations and should be taken into account when interpreting our findings about the vocabulary. For each language we use two different partitions of the database: i) yearly (y), in which case y(t) corresponds to the database of the year t; and ii) cumulative (Y ), in which case Y (t) = t t =to y(t ). We consider only words which appeared at least n = 41 times in order to avoid biases due to the filtering mechanism used in the google-ngram database, see Supplemental Material (SM) Sec. I for further details [32]. Here we show our detailed analysis for the largest database (English, t 0 = 1520, t ∈ [1805,2000]). For the other 4 languages we report the main findings and leave the details for the SM [32].

Zipf 's analysis
Our first empirical analysis focuses on the distribution of word frequencies. In his seminal work, Zipf proposed that the frequency of the r-th most frequent word in a given text is given by F (r) = F (1)/r [18]. It is easy to see that this scaling has to break for large r: due to the divergence of the harmonic series, for sufficiently large databases one arrives at N r=1 F (r) > 1 (sum of frequencies larger than text size). In English F (1) ≈ 0.07 (the frequency of "the") and N r=1 F (r) > 1 for N ≈ 10 6 , meaning that F (r) has to decay faster than 1/r for r 10 6 . This well-known expectation, which is clearly seen in our data shown in Fig. 1(a), motivated numerous different generalization of Zipf's proposal [33][34][35]. While many of these proposals were shown to provide a better account of particular databases, they remain in a great extent unsatisfactory because they lack the simplicity and universality of Zipf's original proposal (e.g., the parameters vary depending on the size, topic or date of publication of the analyzed texts [36,37]).
Motivated by the new magnitude of our large database, we apply rigorous statistical tests to determine which of the previously proposed distributions provide a better account of the data. We select 7 of the most popular previously-proposed heavy-tailed distributions with at most 2 free parameters [8,23,24]: power-law, two power-laws, shifted power-law, log-normal, Weibull, and power-laws with exponential cutoffs (in the tail and beginning, respectively). The parameters for each distribution were obtained numerically by means of Maximum Likelihood (ML) estimation [38]. In addition we i) calculate the probability that the data was generated by that model (χ 2 p-value [39,40]) and ii) compare which model is more likely to describe the data (relative likelihood [41,42]) for each fit (for details see SM-Sec. II A [32]).
The results show that it is extremely unlikely (p < 10 −15 ) that the data was drawn exactly from any of the proposed distributions, a consequence of the large databases which makes any small (true) deviation incompatible with these simple fits. On the other hand, the results show unequivocally that for English the distribution with two power-laws is the best fit (1 − p < 10 −15 ) for all databases with a size larger than 10 9 words. We confirm that the double power-law is also the best fit for the English Wikipedia, a strong indication of the validity of this result in databases of different origin (see SM-Sec. IIB for the detailed analysis on both databases [32] which uses methods reported in Refs. [43]).
We now discuss in detail the best two-parameter model we identify from our data: characterizing a double power-law (dp), where b, and γ are free-parameters, and C = C(γ, b) is the normalization constant [44]. The effect of the threshold n applied to the frequency of words is that, in practice, data of F (r) is limited to F (r) ≥ n/M (M is the observed number of words). The original Zipf's law is recovered for highfrequency words and a critical rank r = b determines a transition to a power-law with exponent γ. Double power-laws were proposed as a generalization of Zipf's law in Ref. [45] and further investigated in Refs. [46,47]. These insightful works used distributions with two powerlaw exponents γ 1 , γ 2 and were motivated by the visual inspection of double logarithmic plots. Our improved statistical analysis confirm and extend these observations for the simpler distribution Eq. (1). Besides the likelihood analysis and visual inspection given in Fig. 1, a third strong evidence in favor of distribution (1) comes from the comparison of the estimated parameters of different corpora shown in Fig. 1(b,c). Very similar values b ∈ [7 · 10 3 , 12 · 10 3 ] and γ ∈ [1.8, 2.5] were obtained for non-overlapping databases (for the English Wikipedia: b = 7830, γ = 1.68), and the fluctuations become smaller for increasing database size. These observations strongly suggest that the same fixed parameters provide a good description of all English texts (e.g., y(1900) and y (2000)). Therefore, hereafter we do not consider individual fits for each database and instead assume that Eq. (1) is valid with b = b * = 7873 and γ = γ * = 1.77, values obtained for our largest database Y (2000). Similar findings also apply to the other languages. In Tab. I we summarize the parameters γ * and b * obtained from a ML-fit of the largest database Y (2000) of the respective language to Eq. (1). French and Spanish are also best described by Eq. (1) for databases exceeding a particular size and yield values for γ * and b * similar to English. For German and Russian Eq. (1) constitutes only the second best model. However, we have strong indications that it provides a better account of the tails (r b * ) and therefore we expect that even larger databases will reveal the double power-law as the best fit also in these languages (see SM-Sec. II B for details [32]). Apart from being the smallest databases among the investigated languages, another feature affecting the fitting in German and especially in Russian is the higher degree of inflection in the morphology of these languages. We recall that no lemmatization was applied in our definition of words and, therefore, inflected words (obtained, e.g., by adding a suffix) are counted as distinct words. This reasoning explains the higher measured values of b * (vocabulary in the r −1 regime). From the fitting perspective, however, the large values of b * in German and Russian require even larger databases to characterize the deviations from the r −1 regime for r b * .

Heaps' analysis
We now turn to our second empirical analysis: the dependence of the number of different words, N , on the size of the database, i.e. total number of words, M . The classical result for this relation is the empirical Heaps' law [12], which states that N ∼ M λ with λ ∈ [0, 1] (A ∼ B indicates that A/B =constant for large B). We start searching for the consequences of our previous observations in the Zipf's analysis to this new problem. A simple and powerful approach is the so-called Zipfian ensemble (ZE) [21], which can be traced [47] back to Mandelbrot [48]. It assumes that the occurrence of every possible word is governed by a Poisson process with an intensity proportional to its frequency (see SM-Sec. III A [32]). It was shown that under this or similar assumptions (e.g., stochastic processes with fixed frequencies for words), asymptotically Heaps' law can be interpreted as a direct consequence of a Zipfian rank frequency distribution F (r) ∼ r −γ [9,13,14,19,21] and vice versa [20,49,50], where γ = 1/λ [48]. Here we want to draw attention to the fact that these observations are not restricted to Zipf's and Heaps' laws, i.e., assuming a stochastic model, the relationship between F (r) and N (M ) can always be established. The expectation of the ZE of Eq. (1) with a threshold n 1 is (see SM-Sec. III B [32]) where M b is the number of words such that N (M b ) = b and the scaling constant C n = C/n [C ≈ F (1) being the frequency of the most common word, as can be seen from Eq. (1)]. Thus, the effect of the threshold n applied to the growth curve of the vocabulary simply amounts to rescaling the constant C. While the expected (average) number of distinct words over many realizations of the stochastic process leads to a sharp transition between the two regimes, the values of N dp (M ≈ M b ) might depend more strongly on the particular realization. In Fig. 2 we show that the data in the google-ngram database obeys the scalings of Eq. (2). In Fig. 2(a) we present the N (M ) curve for English. While for the yearly database y(t) we obtain a set of points for each t, the cumulative database Y (t) builds a curve of vocabulary growth for increasing t. Despite the differences in these databases, all the data lie in a relatively narrow region of the plot which resembles a single curve compatible with the double scaling of Eq. (2). This curve is well described by the N (M ) curve obtained from the combination of the double power-law distribution Eq. (1) with fixed parameters (γ * , b * ) and the assumption of Poisson usage of words, in the spirit of the ZE. Similar observations apply to all considered languages, as shown in Fig. 2(b). On closer inspection, Fig. 2(c), the fine details of the N (M ) curve are not compatible with the fluctuations expected from the strongly simplifying assumptions of the ZE. Nevertheless, it is remarkable that the agreement between model and data remains within 50% for different databases and over 9 orders of magnitude in size.
Here it is worth revisiting the question about the finitude of the vocabulary. Even after more than 10 6 different words the N (M ) data in Fig. 2 does not seem to saturate. To further investigate this point, we perform the ZE with the same rank-frequency distribution from Eq. (1) (fixed b * , γ * ) but varying the maximum possible number of different words N max ZE , e.g., 1, 2, 5 ,10, and 100 times the observed number of distinct words in our largest database Y (2000). It can be seen in Fig. 2(d) that the differences for the predicted growth curves for such different hypothetical vocabulary sizes are negligible compared to the fluctuations of the real data. From this we conclude that given the data accessible so far the possible vocabulary can be regarded for all practical purposes to be infinite (although bounded by combinatorial arguments due to a finite alphabet and word length). The fact that the same distribution Eq. (1) with fixed parameters accounts for the observation across all years shows that the observation of different number of words is driven mainly by the different database size and not by a change in vocabulary richness over time.

III. MODEL
In this section we propose a simple generative model which recovers and allows for an improved interpretation of the double scalings in our empirical findings -Eqs. (1) and (2).
Our approach is different from Zipf's original explanation based on a principle of least effort between speakers and listeners [18,51], but instead is in line with the tradition of Yule-type stochastic growth models explaining fat-tailed distributions [26][27][28][29]. The main novelty in our model is that it contains two classes of word-types: a core vocabulary and a noncore vocabulary [46]. At each step a word (i.e. word-token) is drawn (M → M + 1) and attributed to one of the distinct words (i.e. word-type) depending on probabilities specified below, see Fig. 3 for a sketch of the model. The total number of word-types is given by N = N c + Nc, where (Nc) N c is the number of (non)core-words. The new word-token can either be a new word-type (N → N + 1) with a probability p new or an already existing word-type (N → N ) with probability 1−p new . In the latter case, a (previously used) word-type is attributed to the word-token at random with probability proportional to the number of times this word-type has occurred before. In the former case, the new wordtype can either originate from a finite set of N max c corewords (N c → N c + 1) with probability p c or come from a potentially infinite set of noncore-words (Nc → Nc + 1). In our simplest model we consider p c to be a constant, i.e. p 0 c 1, which becomes zero only if all core-words were drawn (N c = N max c ): ( The final element of our model, which establishes the distinguishing aspect of core-words, is the dependence of p new on N . We choose p new (and p c ) to depend on N and not on M because an increase in N necessarily reflects that fewer undiscovered words exist while an increase in M is strongly affected by repetitions of frequently used words. By definition, we think of core-words as necessary in the creation of any text and, therefore, the usage of a new core-word in a particular text should be expected and thus not affect the probability of using a new (noncore) word-type in the future, i.e., p new = p new (Nc). On the other hand, if a noncore-word is used for the first time (N c → N c + 1) the combination of this word with the previously used (core and non-core) words lead to a combinatorial increase in possibilities of expression of new ideas with the already used vocabulary and thus to a decrease in the marginal need for additional new words [47]. In our model, this argument suggests that p new should decrease with Nc. Taking these factors into account, we propose as an update rule for p new after each occurrence of a new non-core word as with the decay rate α > 0 and the constant s 1 which is introduced simply in order to damp the reduction of p new for small N c (for simplicity, we use s = N max c ). The main justification for the exact functional form in Eq. (4) is that it allows us to recover the empirical observations reported in Sec. II, as shown below. An alternative a posteriori justification will be given at the end of this section and shows that Eq. (4) can be interpreted as a direct consequence of an unlimited non-core vocabulary.
We now show how this model recovers Eqs. (1) and (2). We require that 1−p 0 c 1, which simply means that it is much more likely to draw core-words than noncore-words initially. In this case we can obtain approximately exact solutions for N (M ) in the two limiting cases considered in Eq. (2). When N N max c , which implies N c , Nc N max c , it follows from Eqs. (3) and (4) that p new ≈ const. and therefore we trivially obtain that N ∼ M 1 . This case resembles the very beginning of the vocabulary growth, when most new word-types belong to the set of corewords. In the case N N max c , p c = 0 and N ≈ N c so that Eq. (4) becomes in the continuum limit: from which it follows that p new ∼ N −α .
We now obtain the expected growth curve N (M ). Notice that our model can be considered a biased random walk in N , which, as an approximation, can be mapped onto a binomial random walk by the coordinate transformation N (M ) such that p new (N ) = p new (N (M )). The resulting Poisson-Binomial process [52] can be treated analytically, e.g., the transformation N (M ) is then given by the average of the vocabulary growth: Using p new ∼ N −α , this equation holds (self-consistently) by assuming a sub-linear growth for the vocabulary N ∼ M λ , where the relation λ = (1 + α) −1 is established (for details see SM-Sec. IV [32]). In accordance with Eq. (2), we identify the following relation between the parameters: N max c = b and α = γ − 1. The fitting parameters of Eq. (1) can thus be interpreted as: b is the size of the core vocabulary and γ controls the sensitivity of the probability of using a new word to the number of already used words in Eq. (5).
Since the probability of usage for already used wordtypes is assumed to be proportional to the number of times it occurred before, we guarantee that Eq. (2) implies (1) [20], meaning that the double scaling in the Zipf plot is also recovered from our generative model. While the previous arguments show that the correct scalings are obtained by our model, in order to obtain an agreement with the data it is essential to: (i) use the normalization constant C in order to determine the initial probability of finding a new word in Eq. (4); (ii) re-scale the distribution using the threshold n as M/n; and (iii) account for the disproportionally large weight of the first word-types (in the Zipf plot). Taking these points into account, direct simulations of the model in Fig. 3 with the traditional parameters b = b * and γ = γ * lead to Zipf's and Heaps' curves, which resemble the original fits. See SM-Sec. V for all details [32].
It is worth comparing the generative model with the model of random usage of words with fixed frequency, the ZE model discussed in the previous section. While the ZE model allowed us to obtain Heaps' curves from Zipf's distributions (and vice-versa), in the generative model we simultaneously obtain the double scaling regime in both cases. It is important to stress that individual texts or single databases should not be considered as the output of single realizations of our generative model. Instead, we consider that not only texts but also all databases have a negligible size when compared to the language as a whole and therefore should be thought of as a small subsample (M database M ) of the output of our generative model, retrieved after it achieved its stationary state (M → ∞). In this case, changes in word frequencies become negligible (in the scale of M ) during the creation of the database (in the scale of M database ). Therefore, the vocabulary growth of the created database is well approximated by the ZE model with F dp (r).
Finally, we take profit of our previous calculations and provide an a posteriori justification of the key assumption of our model, Eq. (4). Our starting point is the observation -see Fig. 2(d) -that vocabulary is for all practical purposes infinite. We therefore postulate that and by following (in reverse order) the previous calculations we naturally arrive at Eq. (4). From the first line of Eq. (6) we see that in order to fulfill our postulate (7), p new has to decay at least as slow as p new (M ) ∼ M −δ with δ ≤ 1 for M → ∞. In a minimal model it is reasonable to assume such a power-law decay, in which case the first line of Eq. (6) implies that N (M ) ∼ M λ with λ = 1 − δ. Making a transformation of variables from M to N we obtain In turn this is equivalent to Eq. (5), from which we recover Eq. (4) as a discretized version. Thus we see that Eq. (4) is a minimal assumption for an unbounded vocabulary.

IV. HISTORICAL CHANGES
The model described so far has been shown to give a good account for all databases and all years with the same fixed two parameters N max c = b * = 7, 873 and α = γ * − 1 = 0.77 in the case of English. A natural question is, therefore, what actually changes in historical time scales? Considering two different databases (say two different years), our model does not consider any differences in the actual composition of the core-vocabulary. Even if the value of N max c remains constant this does not mean that the same words are observed for all years. From the point of view of our model, the main change a word can experience is to enter or to leave the group of corewords. For instance, comparing the decades 1891 − 1900 and 1991 − 2000, the most frequent words which left the core-vocabulary were majesty, doubtless, furnished, monsieur, napoleon, and hitherto, while the ones which entered were cultural, context, technology, programs, environmental, and computer [53] In order to quantify this effect, we investigate the replacement of words from the core-vocabulary in the yearly databases y(t) in the time t ∈ [1805, 2000] in Fig. 4. We calculate the fraction f (t, ∆t) of core-words (i.e. with rank r < b * = 7873, fixed for all t) from y(t) that remain in the set of core-words in y(t + ∆t). Figure 4(a) shows that all curves can be qualitatively described by an exponential decay independent of whether forward (∆t > 0) or backward time (∆t < 0) was considered. This is further supported in Fig. 4(b-c), where the parameters f 0 and κ obtained numerically from a least-square fit [38] of Eq. in the fit, for each t we performed a fit with the same number of points min{2000 − t, t − 1805} forwards and backwards in time. On closer inspection, two features connected to the interpretation of the parameters f 0 and κ deserve a more careful discussion. The parameter f 0 < 1 represents the discontinuous change of corewords in two subsequent years. It strongly depends on the different selection of books in the construction of the respective databases and can be attributed to the finite size of the database, which leads to a wrong estimation of the "true" core-words. Consistently with this interpretation, Fig. 4(b) shows that f 0 grows over time, due to the fact that database size increases leading to a better sampling of words. Nevertheless, a value of f 0 ≈ 0.98 indicates that this is still far from being negligible (e.g., for N max c = 7, 873 this means that around 150 words of the set of core-words will be different due to finite sampling). In contrast, the decay rate κ describes the continuous replacement of core-words over time with a rate of κN max c ≈ 30 words per year. The most intriguing observation in Fig. 4(c) is that this change experiences an acceleration over time as κ grows by more than 50% from 1805 to 2000.
Finally, it is worth discussing the implications of these findings on our generative model. The characteristic time scale of the core-vocabulary replacement (≈ 1/κ) is on the order of centuries. This means that on the scale of a few decades our generative model holds with the asumption of a constant core vocabulary. On longer time scales our model has to be refined in order to include: i) a probability of replacement of the words belonging to the core-vocabulary; and ii) a finite memory or a distinction between core-and noncore-words in the preferential attachment part of our model.

V. DISCUSSION
In summary, we have shown that the rank frequency distribution and the vocabulary growth of languages can be best described by simple two-scaling functions. The only two free parameters of the functions are related to each other and remain almost unchanged over centuries as well as databases and depend only on the considered language. We have also shown that these empirical findings can be interpreted as the result of a finite number of words belonging to a core vocabulary, which have different properties from the remaining virtually unlimited number of words, as summarized in Tab. II. This conclusion was achieved based on a simple generative stochastic model for the vocabulary growth. Finally, we found that in English the composition of the core-vocabulary experiences an exponential decay with a rate of 30 words per year, which is, remarkably, steadily accelerating in the past decades.

Core Words
Non-core-words It is worth comparing these findings in view to previous results. As far as we are aware, our analysis provides the first rigorous statistical confirmation of similar previous proposals [45][46][47] of the double-scaling generalizations of Zipf's law -Eq. (1). The consequence of this to vocabulary growth and Heap's law (see also [47]), which we drew based on a Poisson usage of words [21], is that the rate of introduction of new words decays but never vanishes with increasing database size. This is in contrast to recent claims which reported a convergence to a maximum vocabulary size [14]. We note that this previous analysis was based on single books and therefore the database sizes were close to our transition point N max c , which we believe could have been misinterpreted as a systematic decay. A generalization of a Yule's type process to obtain double-scaling degree distribution in a network of words was introduced in Ref. [54]. Two crucial differences to our model are that it yields fixed exponents and cannot be understood as a generative model of texts (word by word). Interestingly, in Ref. [6] an analysis of the network constructed from the thesaurus also showed the existence of a set of core-words almost of the same size as ours.
Our simple model and expression for the vocabulary growth as a function of database size has important practical consequences. Simply knowing the database size (in number of words, M , or potentially in bits), and using the language dependent parameters (C, N max c = b * , α = γ * −1) reported above, from Eq. (2) one can immediately estimate the expected number of different words, N , appearing more than n times. This is crucial for search engines and data mining programs because it allows for an estimation of the memory to be allocated prior to the scanning of an unknown database, e.g., in the construction of the inverted index [9][10][11]. Even the fluctuations around this expectation can be easily computed through our generative model or through the Poisson assumption of word-usage. Of course, this strong assumption ignores correlations and typically underestimates the expected fluctuations, so that our model should be considered as the simplest null model. The existence of a transition between two scalings (which is under the reach of even single large books) shows that simple estimations based only on the traditional Zipf's law have to be generalized. For instance, a commonly used index of vocabulary richness of a text is Herdan's coefficient given by the ratio log N/ log M [8]. In view of our results, the coefficient is highly dependent on which of the two scaling regimes is reached with the given size of the text.
We now compare our observations of change on historical time scales to other historical changes in language usage. For the whole vocabulary, we obtain that the vocaulary size is mainly driven by the available database size. This is in contrast to previous conclusions based on the same google-ngram database which detected a growth of vocabulary in time [1]. Here it is important to note that this previous analysis included a substantially different filtering of the listed 1-grams to achieve valid words in the vocabulary, including a frequency criterion and manual classification. Still, our results show that also in this case a more careful analysis of the role of the database size is needed. For the core vocabulary, we observe a fairly constant number of constituents over centuries. The number of words common to corevocabularies of different databases was found to decay exponentially with the time between publication of the databases, e.g., for English the decay rate is approximately 30 words per year and the half-life of the core vocabulary is ≈ 200 years. It is worth to compare these numbers with recent studies which reported half-lifes for: i) the regularization of verbs (750 to 10 000 years) [5], and ii) a fundamanetal vocabulary of 200 words (300 to 38 000 years) [4]. Perhaps our most intriguing finding is the approximately linear increase of the rate in time, which eventually confirms the overall acceleration of language change and society in general, as propagated in Ref. [1].
Our results can be extended in many directions and open new possibilities of studies of vocabulary change. Directly related to our observations and model, it remains to be explained the specific value of the parameter γ * ≈ 1.77, which is intriguingly similar across different languages. Another important point is to assess the limitations of our estimations due to the role of correlations inside real texts and databases, and how this could be introduced into our model. Furthermore, it remains to be shown whether the transition between two scalings due to the existence of a core vocabulary can be related to the phenomenon of phase transitions in ranking stability of complex systems recently reported in Ref. [55]. Finally, we believe that our model provides the correct null model for normalizations due to database sizes and that therefore future investigations of historical effects on the vocabulary should take this into account. The data obtained from the google-ngram database [1] is filtered in two steps. First, we decapitalize each word (e.g. 'the' and 'The' are counted as the same word) and further restrict ourselves to words consisting uniquely of letters present in the alphabet of the corresponding language and the symbol " ' " (apostrophe). This is meant as a conservative approach in order to minimize the influence of foreign words, numbers (e.g. prices), or scanning problems which are present in the raw data. In the second step, when constructing yearly data y(t), i.e., words present in books published in year t, we include only those words in the database y(t), which appear more than 40 times in that particular year. In the same way, for the cumulative data Y (t) we include only those words, which appeared more than 40 times until time t. In this way we avoid a possible bias due to the filtering applied in the construction of the raw data (words had to appear more than 40 times in all times in order to be included in the database [1]). As an example of possible bias, in case we had not applied this filter, take two words (called '1' and '2') with N 1 (t) = N 2 (t) = 21 occurrences in year t. If now ∀t = t : N 1 (t ) = 0 and ∃t = t : N 2 (t ) > 20, word '2' would be present in the raw data whereas word '1' would be not. As a result we would only include word '2' in the yearly database y(t). With our additional filter neither word '1' nor word '2' appears in the yearly database y(t).
In Fig. S1 we show the resulting database size for the yearly data y(t) and the cumulative data where n is the total number of common elements, n c the number of concordant, and n d the number of disconcordant pairs between the two databases with respect to the ranking of frequencies. Clearly, at t = 1800 a noncontinuous change in τ can be identified, from which we conclude that database composition is dramatically different in the years before and after t = 1800. In order not to be affected by this change the yearly data y(t) is only considered in the period t ∈ [1805, 2000]. However, in order to take advantage of the full size of the database, the cumulative data Y (t) is constructed taking into account all the years prior to t = 1805.

A. Theory
In this section we give account of the distributions proposed for fitting the rank-frequency distribution and present the details of the Maximum likelihood estimation procedure. The procedures are standard [2], but here we fit directly the rank frequency distribution originally proposed by Zipf [3] instead of the word frequency distribution considered in Ref. [4].
In Tab. S1 the proposed descriptive models used to fit the rank-frequency distribution are presented. The notation F (r; Ω) means that the distribution F depends on the rank r, and Ω is the set of parameters. The normalization constant C = C(Ω) is a function of the respective parameters and fixed by ∞ r=1 F (r; Ω) = 1. In practice, this is calculated with the Euler-Maclaurin formula available in the package mpmath [5].
The parameters of each distribution are estimated numerically by minimizing the negative of the log-Likelihood i distribution F (r; Ω) set of parameters Ω 1 Power-Law Cr −γ γ 2 Shifted Power-Law C(r + b) −γ γ, b 3 Power-Law with Exponential cutoff (tail) C exp (−br) r −γ γ, b 4 Power-Law with Exponential cutoff (beginning) where In this expression M is the number of tokens, which implies that the sum goes over each observed token i and its corresponding rank r(i). In practice, the minimization is obtained with a Nelder-Mead simplex algorithm (available in the Scipy library [6]). The quality of the fit was evaluated quantitatively by means of a p-value obtained from a χ 2 -statistics [7]: Here the domain is partitioned into Q cells, such that the expected number of observations per cell n j ≥ 5 [8], with N j being the actual observed number of observations in cell j. A recently proposed alternative strategy [9] involving the comparison of the Kolmogorov-Smirnow statistics of the actual empirical data with randomly generated data is computationally not feasible in this case, because it would require us to draw ≈ 10 15 random numbers (p-value precision 0.01) due to the size of the database of > 10 11 tokens.
In the last step we determine which of the proposed models i = 1...R, where R is the number different models considered, is most likely to describe the data. In order to account for the different number of fitted parameters we calculate the Akaike information criterion (AIC) [10] for each model i where K is the number of parameters estimated in the model. The model which gives the minimum value AIC min = min i {AIC i } is most likely to describe the given data. From this we can calculate the relative likelihood l i [11] which states how likely model i is to describe the data in comparison with the best model. This implies that the probability w i that model i (out of the R models considered) describes the data is given by [11]

Google-ngram
In this section we give a detailed overview of the results obtained from fitting the models in Tab. S1 to the rankfrequency distributions for all languages considered, i.e., English, French, Spanish, German, and Russian. In Fig. S3 -S7(a+b) we plot the AIC from the models in Tab. S1 applied to yearly y(t) and cumulative data Y (t) of the respective language. In Fig. S3 -S7(c) we show explicitly the rank-frequency distribution of the data Y (2000) and the corresponding fits of the three models that yield the best description: the double power-law (i = 7), the power-law with an exponential cutoff in the tail (i = 3), and the log-normal (i = 5).
For English, i = 7 yields the best description of the yearly data for t 1950 and for the cumulative data for t 1810. As the databases y(1950) and Y (1810) can be considered independent datasets and by comparing with Fig. S1(a) we conclude that the size of the database needs to exceed a certain threshold (≈ 10 9 tokens) in order to discriminate competing models like the i = 3 in the tail. This is further corroborated by looking at the inset in Fig. S3(c), where it can be seen that i = 7 outperforms i = 3, 5 especially in the description of the tail of the distribution.
For the other languages except English the AIC of the yearly data y(t) favours i = 3. This comes with no surprise since their size is limited to < 10 9 tokens for all t ∈ [1805, 2000], as can be seen in Fig. S1(a). In contrast, the cumulative data Y (t) shows different results. For French and Spanish the AIC favors i = 7 as the size of the database grows, especially for the largest dataset Y (2000). Again, this becomes clear when looking at the deviations of the fits to the real data in the inset of Fig. S4(c), S5(c), which seem to diverge for i = 3, 5 in the tail of the distribution. For German and Russian the AIC identifies i = 7 only as the second best fit for the cumulative data Y (t). This is most probably due to the fact that the size of the database for those languages is still not large enough in order to discriminate a second power-law regime clearly. Additionally, for these languages the critical rank b * , where a transition between the two power-laws occurs, is shifted towards higher values, possibly due to the different degree of inflection (see main text). This in turn implies that the fraction of tokens belonging to the power-law in the tail is much smaller than in English, which means that a larger database is needed in order to discriminate i = 3, 5. This claim is further supported by the insets of Fig. S6(c), S7(c), where we show that especially in the tail of the distribution i = 7 deviates less from the data than the competing models.
Whereas English, French, and Spanish give approximately the same values for the largest database Y (2000), German and Russian show larger values for b and a different power-law exponent in the tail (see main text). The latter might point towards more subtle differences between the languages besides inflection.

Wikipedia
In this section we want to show that our findings related to the double power-law fit are indeed of general validity and do not originate from peculiarities of the google-ngram database, e.g. scanning problems. For this we choose a complete snapshot of the English Wikipedia [12], because i) it contains a large amount of text, ii) the text does not need to be scanned, and iii) the publishing process is inherently different from that of books.
We filter the Wikipedia database in three steps. First, using the WikiExtractor developed by the University of Pisa Multimedia Lab [13], we store only the plain text, neglecting any additional information or annotation such as images, tables, tags, references, or lists. In a second step we remove all punctuation characters (e.g. ",", ";", or "{") and cut the text into words at the whitespace characters in a similar manner as described in the construction of the google-ngram database [1]. The final step consists of decapitalizing each word and restricting ourselves only to words consisting uniquely of the letters a − z, a filter we also applied to the google-ngram database (see SM-Sec. I). The resulting sequence of words consists of N ≈ 3.7 10 6 types and M ≈ 1.3 10 9 tokens.
Following the recipe in SM-Sec. II A, we show that the results for fitting the models in Tab. S1 to the rank-frequency distribution of the Wikipedia database is consistent with the results from the google-ngram database. In Tab. S2 we show the values for the AIC from which we can see that the double power-law is the best fit among the proposed models with a probability 1 − p < 10 −15 . Additionally, in Fig. S8  Wikipedia data and the corresponding fits of the three models that yield the best description: the double power-law (i = 7), the power-law with an exponential cutoff in the tail (i = 3), and the log-normal (i = 5). This corroborates our claim that the double power-law is the best fit for the rank-frequency distribution. Furthermore, the estimated values for the parameters are γ = 1.68 and b = 7830, which closely matches our observations from the google-ngram database (γ * = 1.77, b * = 7873).

A. Theory
The Zipfian Ensemble (ZE) [14] is a simple approach to model the size of the vocabulary depending on the text length given the rank-frequency distribution F (r), r = 1...N max ZE , where N max ZE ∈ [1, ∞) is the hypothetical (maximum) size of the vocabulary. The occurrence of each word-type with rank r is assumed to be governed independently by a Poisson process with an intensity equal to its frequency, e.g., λ(r) = F (r), where time is measured in units of tokens M (text length). This means that the probability that this word-type occurs at least once in the interval T 1 ∈ [0, M ] is given by [15] From this we can calculate the vocabulary size N (M ) by summing over all word-types, which gives the expected (average over realizations) number of words (out of N max ZE different words in total) that appeared at least once up until time M : The variance of the ZE over the different realizations indicates the expected fluctuation around N (M ) in Eq. (S9) and is given by [14]: Although similar, this framework differs from the usual 'bag-of-words' (or shuffled texts) in the sense that i) the expected time of occurrence of a word need not to be an integer and ii) two words can in principle occur at the same time due to the independence of the Poisson processes. This in turn limits the interpretation of the ZE as a model for the creation of a text token by token. However, it allows for an analytic treatment and the continuous time approximation becomes better in the limit of large databases.

B. ZE in the double power-law
In this section we want to show that a double power-law in the rank frequency distribution (Eq. (1) main text) can lead to the double scaling in the vocabulary growth (Eq. (2) in main text) in the framework of the ZE.
First, we generalize the ZE to cases where words have to appear at least n times before they are considered part of the vocabulary. The introduction of a threshold n means that instead of looking at the probability for the time until its first occurrence T 1 , one considers T n , the time it takes until the word occurs n times and Eq. (S8) becomes From this, Eq. (S9) can be directly extended to In the next step we consider the limit n 1. As the stochastic variable T n is the sum of n times the stochastic variable T 1 , which is distributed according to Eq. (S8), one can conclude that by means of the central limit theorem it follows that P (T n = M/n; r) will approach a Gaussian with vanishing variance, such that by rescaling M → M/n Eq. (S11) asymptotically becomes lim n→∞ P (T n ≤ M/n; r) = Θ (M/n − τ (r)) , (S13) where τ (r) = 1/F (r) is the inverse of the frequency F (r) of the particular word-type and Θ(x) is the Heaviside step function. For the vocabulary growth this yields Thus we obtain a direct relationship between the rank-frequency distribution and the vocabulary growth lim n→∞ N (n) (M = 1/F (r)) = r, (S15) whereM = M/n. In Fig. S9 we show the N (n) (M ) curve obtained from the ZE for the double power-law (Eq.(1) main text) with parameters γ * = 1.77 and b * = 7873 for different thresholds n. One can see that the growth curves for n > 8 are almost indistinguishable from the asymptotic solution Eq. (S15), which can be attributed to the fast convergence implied by the central limit theorem.
From these observation we conclude that Eq. (S15) is already a good approximation for n 1, where in practice this can mean n > 10. As a result we obtain Eq. (2) from the main text. This means that the increase of the threshold n leads to a reduction of the fluctuations of the growth curve of the vocabulary and can be explained as a result of a simple stochastic process. In Fig. S10 we show that this claim holds when applied to real texts of the size of single books, as well as for a collection of several million books, as in Fig. S11. (S17) Noting that N (0) = 0, this gives for Eq. (S16): (S18) We find that this equation holds only if λ = (1 + α) −1 and c 2 = λc 1 λ 1 .

V. NUMERICAL SIMULATION OF THE STOCHASTIC MODEL
In this section we show the results of the direct numerical simulation of the model proposed in Sec. III, main text.

A. Parameters and initialization
In order to simulate the model, apart from fixing a number of parameters (N max c , α, p 0 c ), we need to prescribe how the model is initialized, e.g., what is the initial probability of using a new word p 0 new and how many word types exist at the first iteration of the model. Concerning the parameters, the initial probability of choosing a core-word is set to p 0 c = 0.99, such that 1 − p 0 c 1 (see main text) and the two other parameters are fixed by the fitting parameters (N max c = b * = 7873, α = γ * − 1 = 0.77 in English, see main text). Concerning the initialization of the model, an important point that needs to be taken into account is that we are interested in retrieving the Heaps' plot obtained after re-scaling the number of tokens M by the the threshold n asM = M/n, see Fig. S11 (for simplicity and computational efficiency in our simulations we choose n = 1). This implies that the first word type of our model should on average appear not at the first token but instead approximately atM ≈ 1/F (1) (where F (1) is the frequency of the most frequent word). In view of this requirement, we set p 0 new = C = F dp (1) = 0.0922 (for English, see main text) and we start with an empty list of word types (the tokens used before the appearance of the first word type are counted but not attributed to any word type). The simulations were done with a maximum number of M = 10 9 steps in units of tokens, a restriction imposed by the computational effort required. The reported results were obtained as the average of 100 realizations of the model.

B. Heaps' plot
In order to be able to compare the results with the google-ngram data, where a natural threshold of n = 41 is imposed (see SM-Sec. I), we incorporate the threshold n by using the rescaled coordinateM = M/n, as motivated and discussed in SM-Sec. III B. In Fig. S12 we show the expected vocabulary growth N (M ). We can clearly see that the two scaling regimes of Eq. (2), main text, are recovered from our model. Deviations from the data are within 50% over as much as 7 orders of magnitude. The poorer agreement for largeM can be attributed to a slight overestimation by our model of the point of transition between the two scaling regimes. This could be addressed by modifying our model (e.g. modifying our simple choice of p c in Eq. (3) main text) so that the decay in p new and the transition to the second scaling occurs already for shorterM . For even largerM we do not have data for our model due to computational limitations. However, based on our asymptotic calculations in Sec. III, main text, we expect that the observed agreement will extend over the entire range of available data.

C. Zipf 's plot
In the analysis of the Zipf's plot F (r) obtained by our model it is important to take into account that Yule's type processes (already used words are drawn proportional to their previous occurrences) give a disproportionally large weight to the first word types used in the simulation. This happens because in the beginning of the simulation there are only a few existing word types into which all drawn tokens are attributed. Fig. S13(a) illustrates this effect and shows that it is inversely proportional to p 0 new , which sets the time-scale for the appearance of new word types in the beginning of the simulation. This artifact can be easily addressed by excluding a few word types of smallest rank and re-normalizing the remaining distribution, as shown in Fig. S13(b-d). Alternatively, one can modify the preferential attachment part of the model (right-most branch in Fig. 3, main text) so that the very first used word-types follow a different rule and have a vanishing probability of usage for large M (notice that this would not modify the Heap's plot). For the case of English, Fig. S14 shows that the removal of only one word type (r = 1) is sufficient in order to obtain a good agreement with data (less than 50% of deviation over 7 orders of magnitude). As discussed in the case of Heaps' plot above, the transition to the second scaling appears slightly shifted in comparison to the data.
FIG. S1. Size of the database after filtering. a) Number of tokens for yearly data y(t) (x-symbols) and cumulative data Y (t) (line). b) Number of types for yearly data y(t) (x-symbols) and cumulative data Y (t) (line). Each language is marked by a different color.
FIG. S2. Correlation between data in different years for English. Kendall's rank correlation Eq. (S1) between yearly data y(t), y(t ) for t, t ∈ [1500, 2000] with t ≤ t . The dasehd lines show t = 1800 and t = 1800 where a noncontinuous change in the correlation occurs.
FIG. S3. Discrimination between different models with AIC for English. Value of the AIC for a) yearly data y(t) b) cumulative data Y (t). The inset shows the difference ∆AIC = AICi/M − AIC7/M , i = 1..6 meaning that if ∆AIC > 0 the double powerlaw is the most likely model among the proposed describing the data. Numbers refer to the indices of the model in Tab. S1. c) rank-frequency plot for Y (2000) and the fits of the three best models. The inset shows the ratio F data (r)/F fit (r).  FIG. S8. Rank frequency distribution for the English Wikipedia and the fits of the three best models. The inset shows the ratio F data (r)/F fit (r).
FIG. S13. Influece of the first word types on the rank-frequency distribution of our model. Rank-frequency distribution F (r) from our numerical simulation with different values for p 0 new ∈ {0.1, 0.01, 0.001, 0.0001} after filtering the k most frequent types, where a) k = 0, b) k = 1, c) k = 3, and d) k = 10. In this context, filtering means, that i) we neglect all tokens associated with ranks r = 1...k; ii) the rank of all remaining types is lowered by k, e.g., the rank of the k + 1-th most frequent type becomes r = 1; and iii) the distribution is renormalized such that N −k r=1 F (r) = 1, where N is the number of types before the filtering.