Similarity of symbol frequency distributions with heavy tails

Quantifying the similarity between symbolic sequences is a traditional problem in Information Theory which requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA over music to texts, the distribution of symbol frequencies is characterized by heavy-tailed distributions (e.g., Zipf's law). The large number of low-frequency symbols in these distributions poses major difficulties to the estimation of the similarity between sequences, e.g., they hinder an accurate finite-size estimation of entropies. Here we show analytically how the systematic (bias) and statistical (fluctuations) errors in these estimations depend on the sample size~$N$ and on the exponent~$\gamma$ of the heavy-tailed distribution. Our results are valid for the Shannon entropy $(\alpha=1)$, its corresponding similarity measures (e.g., the Jensen-Shannon divergence), and also for measures based on the generalized entropy of order $\alpha$. For small $\alpha$'s, including $\alpha=1$, the errors decay more slowly than the $1/N$ decay observed in short-tailed distributions. For $\alpha$ larger than a critical value $\alpha^* = 1+1/\gamma \leq 2$, the $1/N$-decay is recovered. We show the practical significance of our results by quantifying the evolution of the English language over the last two centuries using a complete $\alpha$-spectrum of measures. We find that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure to quantify language change.


I. INTRODUCTION
Quantifying the similarity of symbolic sequences is a classical problem in information theory [1] with modern applications in linguistics [2], genetics [3], and image processing [4]. The availability of large databases of texts sparked a renewed interest in the problem of similarity of the vocabulary of different collections of texts [5][6][7][8][9]. For instance, Fig. 1 shows the word-frequency distribution in three large collections of English texts: from 1850, 1900, and 1950. We see that the distribution itself remains essentially the same, a heavy-tailed Zipf distribution [10],

p(r) ∝ r^{−γ},   (1)

where p is the frequency of the r-th most frequent word and γ ≳ 1. Changes are seen in the frequency p (or rank r) of specific words, e.g., "ship" lost and "genetic" won popularity. Measures that quantify such changes are essential to answer questions such as: Is the vocabulary from 1900 more similar to the one from 1850 or to the one from 1950? How similar are two vocabularies (e.g., from different years)? Are two finite-size observations compatible with being finite samples of the same underlying vocabulary? How similar are the vocabularies of different authors or disciplines? How fast is lexical change taking place?
Heavy-tailed and broad distributions of symbol frequencies such as Eq. (1) are typical in natural languages [10, 12–16] and appear also in DNA (n-grams of base pairs for large n) [17], in gene expression [18], and in music [19]. The slow decay observed over a broad range of frequencies implies that there is no typical frequency for words, and therefore relevant changes can occur in different ranges of the p-spectrum, from the few large-frequency words all the way to the many low-frequency words. This imposes a challenge to define similarity measures that are able to account for this variability and that also yield accurate estimations based on finite-size observations.

FIG. 1. The English vocabulary in different years. Rank-frequency distribution p(r) of individual years t for t = 1850, 1900, and 1950 of the Google-ngram database [11], multiplied by a factor of 1, 2, and 4, respectively, for better visual comparison. The inset shows the original un-transformed data (same axes), highlighting that the rank-frequency distributions are almost indistinguishable. Individual words (e.g., "and", "see", "ship", "genetic") show changes in rank and frequency (symbols), where words with larger ranks (i.e., smaller frequencies) show larger change.
In this paper we quantify vocabulary similarity using a spectrum of measures D_α based on the generalized entropy of order α (D_{α=1} recovers the usual Jensen-Shannon divergence). We show how varying α magnifies differences in the vocabulary at different scales of the (heavy-tailed) frequency spectrum, thus providing different information on the vocabulary change. We then compute the accuracy (bias) and precision (variance) of estimations of D_α based on sequences of size N and find that in heavy-tailed distributions the convergence is much slower than in non-heavy-tailed distributions (it often scales as 1/N^β with β < 1). Finally, we come back to the problem of comparing the English vocabulary over the last two centuries in order to illustrate the relevance of our general results.

II. DEFINITION
Consider the probability distribution p = (p_1, p_2, …, p_S) of a random variable over a discrete, countable set of symbols i = 1, …, S (where later we include the possibility S → ∞). From an information-theory standpoint, a natural measure to quantify the difference between two such probability distributions p and q is the Jensen-Shannon divergence (JSD) [20],

D(p, q) = H((p + q)/2) − [H(p) + H(q)]/2,   (2)

where H is the Shannon entropy [21],

H(p) = −Σ_i p_i log p_i.   (3)

This definition has several properties which are useful in the interpretation as a distance: i) D(p, q) ≥ 0, where the equality holds if and only if p = q; ii) D(p, q) = D(q, p) (it is a symmetrized Kullback-Leibler divergence [20]); iii) √D(p, q) fulfills the triangle inequality and is thus a metric [22]; and iv) D(p, q) equals the mutual information of variables sampled from p and q [3], i.e., D(p, q) equals the average amount of information in one randomly sampled word token about which of the two distributions it was sampled from [23]. The JSD is widely used in the statistical analysis of language [2], e.g., to automatically find individual documents that are (semantically) related [5, 6] or to track the rate of evolution in the lexical inventory of a language over historical time scales [7, 8].
Here we also consider the generalization of the JSD in which the Shannon entropy H, Eq. (3), is replaced by the generalized entropy of order α [24],

H_α(p) = (1 − Σ_i p_i^α)/(α − 1),   (4)

yielding a spectrum of divergence measures D_α parameterized by α, first introduced in Ref. [25]. The usual JSD is retrieved for α = 1. In (non-extensive) statistical mechanics, Eq. (4) was first proposed in Ref. [26], and D_α is sometimes called the Jensen-Tsallis divergence.
While similar generalizations can be achieved with other formulations of generalized entropies, such as the Rényi entropy [4, 27], the corresponding divergences can become negative. In contrast, D_α is strictly non-negative, and it was recently shown that √D_α(p, q) is a metric for any α ∈ (0, 2] [28]. For heavy-tailed distributions, Eq. (1), H_α < ∞ for α > 1/γ. We define a normalized version of D_α as

D̃_α(p, q) ≡ D_α(p, q)/D_α^max(p, q),   (5)

where

D_α^max(p, q) = H_α((p ⊕ q)/2) − [H_α(p) + H_α(q)]/2   (6)

is the maximum possible D_α between p and q, obtained by assuming that the sets of symbols in the two distributions (i.e., the supports of p and q) are disjoint; (p ⊕ q)/2 denotes the mixture of p and q evaluated as if their supports were disjoint. The main motivation for using the measure (5) is that D̃_α(p, q) ∈ [0, 1], while the range of admissible values of D_α depends on α. This allows for a meaningful comparison of the divergences D̃_α(p, q) and D̃_{α′}(p′, q′) for α ≠ α′ and therefore also for the full spectrum of α's. In general, the metric properties of D_α are not preserved by D̃_α. An exception is the case in which the rank-frequency distribution p(r) underlying all p's and q's is invariant (see Fig. 1). Noting that Eq. (6) is then independent of the symbols, we obtain that D_α^max(p, q) is a constant for all p's and q's and therefore the metric property is preserved for D̃_α.
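As an illustration, the following minimal Python sketch (ours, not part of the original analysis; NumPy only, function names are our own) implements the plug-in versions of Eqs. (4)-(6) for two distributions given as frequency vectors aligned over a common symbol set.

```python
import numpy as np

def H_alpha(p, alpha):
    """Generalized entropy of order alpha, Eq. (4); alpha -> 1 recovers Shannon."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p**alpha)) / (alpha - 1.0)

def D_alpha(p, q, alpha):
    """Divergence of order alpha; p and q aligned over a common symbol set."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return H_alpha(0.5 * (p + q), alpha) - 0.5 * (H_alpha(p, alpha) + H_alpha(q, alpha))

def D_alpha_max(p, q, alpha):
    """Maximum divergence, Eq. (6): same frequencies but disjoint supports,
    emulated by concatenating the two frequency vectors."""
    m = 0.5 * np.concatenate([np.asarray(p, float), np.asarray(q, float)])
    return H_alpha(m, alpha) - 0.5 * (H_alpha(p, alpha) + H_alpha(q, alpha))

def D_alpha_norm(p, q, alpha):
    """Normalized divergence, Eq. (5); lies in [0, 1] for all alpha."""
    return D_alpha(p, q, alpha) / D_alpha_max(p, q, alpha)
```

As a sanity check, for two distributions with disjoint supports D_alpha_norm returns 1, and for p = q it returns 0, for any α.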

III. INTERPRETATION
In order to clarify the interpretation of D_α, it is useful to consider a toy model. As in Fig. 1, we consider two distributions p and q that have exactly the same rank-frequency distribution p(r) but differ in (a subset of) the symbols they use. For simplicity, we consider that symbols that differ in the two cases appear only in one of the distributions. More precisely, denoting by I_p = {A, B, C, D, E, …} the set of symbols in p, we define the set of replaced symbols as I* ⊂ I_p. The set of symbols in q is chosen as I_q = {i | i ∈ I_p \ I*} ∪ {i† | i ∈ I*} with probabilities q_i = p_i for i ∈ I_p \ I* and q_{i†} = p_i for i ∈ I*; see Fig. 2 for an example.
For a given distribution p and a set of replaced symbols I*, we compute D_α(p, I*) ≡ D_α(p, q) as

D_α(p, I*) = [(1 − 2^{1−α})/(α − 1)] Σ_{i∈I*} p_i^α,   (7)

where the prefactor follows from Eq. (4). The maximum is given by

D_α^max(p) = [(1 − 2^{1−α})/(α − 1)] Σ_{i∈I_p} p_i^α,   (8)

so that the normalized divergence reads

D̃_α(p, I*) = Σ_{i∈I*} p_i^α / Σ_{i∈I_p} p_i^α.   (9)

This shows that each symbol i ∈ I* that is replaced by a new symbol contributes p_i^α to D_α. It is thus clear that by varying α the contributions of different frequencies become magnified (e.g., for α ≫ 1 large frequencies are enhanced, while for α < 0 low frequencies contribute more to D_α than large frequencies).
In particular, for α = 0, D̃_{α=0}(p, I*) = |I*|/|I_p| is the fraction of symbols (types) that are different in p and q. Each symbol i counts the same irrespective of its probability p_i. For |I*|/|I_p| ≪ 1, D̃_{α=0}(p, I*) ≈ [1 − J(I_p, I_q)]/2, where J(I_p, I_q) = |I_p ∩ I_q|/|I_p ∪ I_q| is the Jaccard coefficient between the two sets I_p and I_q, an ad hoc similarity measure widely used in information retrieval [2]. For α = 1, D̃_{α=1}(p, I*) = Σ_{i∈I*} p_i, showing that each replaced symbol is weighted by its probability p_i and thus that D̃_{α=1} measures the distance in terms of tokens.
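A quick numerical check of these limits, in the same sketch style as above (our own code; the probability values are illustrative, not from the paper):

```python
import numpy as np

def D_norm_toy(p, I_star, alpha):
    """Normalized divergence for the toy model, Eq. (9):
    each replaced symbol i in I* contributes p_i^alpha."""
    p = np.asarray(p, dtype=float)
    return np.sum(p[I_star]**alpha) / np.sum(p**alpha)

p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # symbols A..E, illustrative values
I_star = [1]                                # replace symbol B by B-dagger

print(D_norm_toy(p, I_star, 0))  # 1/5: fraction of replaced types
print(D_norm_toy(p, I_star, 1))  # 0.3: replaced token mass, sum of p_i over I*
print(D_norm_toy(p, I_star, 2))  # weights high-frequency symbols more strongly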
The full spectrum D̃_α offers information on changes at all frequencies, a point which is particularly important for the case of heavy-tailed distributions because word frequencies vary over many orders of magnitude. Figure 3 illustrates how different values of α are able to capture changes in different regions of the frequency spectrum. In particular, it shows that D̃_α grows (decays) with α when the modified symbols have high (low) frequency. Furthermore, the comparison between two given changes allows us to conclude which change was more significant in different regions of the word-frequency spectrum. In the example of the figure, both changes (the two lines) are equally significant from the point of view of the modified tokens (D̃_1 is the same), the change on the left affects more types (D̃_0 is larger), and the change on the right affects more frequent words (D̃_α is larger for α > 1).

FIG. 2. Illustration of our toy model, where p (left) and q (right) have the same rank-frequency distribution but differ in the probabilities of individual symbols. In this example, p and q are the same (p_i = q_i) for i ∈ {A, C, D, E, F, G, H}, while the symbol i = B in p is replaced by i = B† in q, with q_{i=B†} = p_{i=B} and p_{i=B†} = q_{i=B} = 0.

IV. FINITE-SIZE ESTIMATION
In this section we turn to the estimation of D̃_α from data. Even though D̃_α is defined with respect to distributions p and q, Eq. (5), in practice these distributions are estimated from sequences of finite size N (total number of symbols or word tokens), yielding finite-size estimates p̂ and q̂ of the distributions. The main obstacle in obtaining accurate estimates of D̃_α is that it requires the estimation of entropies, for which, in general, unbiased estimators do not exist [29]. Accordingly, even if p = q, in practice one measures H_α(p̂) ≠ H_α(q̂) and D̃_α(p̂, q̂) > 0, not only in single realizations but also on average (the bias). Besides the bias, we are also interested in the expected fluctuation (standard deviation) of the estimations of H_α and D̃_α, and in how both depend on the sequence size N for large N. In heavy-tailed distributions such as Eq. (1), these estimations are based on an observed vocabulary V (number of different symbols) that grows sublinearly with N as [30–32]

V ∝ N^{1/γ}.   (10)

This implies that the entropies in Eq. (4) are estimated based on a sum of V → ∞ terms (for N → ∞). In practice, γ and the precise functional form of the heavy-tailed distribution are unknown, and therefore, besides D̃_α, the estimation of H_α is also of interest (see Ref. [33] for the case in which a power-law form of p is assumed to be known a priori).
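The sublinear vocabulary growth of Eq. (10) is easy to observe numerically. A minimal sketch (ours), assuming a pure Zipf distribution with exponent γ = 3/2 and a finite support S (both assumptions, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, S = 1.5, 10**6                       # S: finite support, an assumption
p = np.arange(1, S + 1, dtype=float)**(-gamma)
p /= p.sum()

for N in [10**3, 10**4, 10**5]:
    sample = rng.choice(S, size=N, p=p)     # N word tokens
    V = len(np.unique(sample))              # observed vocabulary
    print(N, V, N**(1 / gamma))             # V grows roughly like N^{1/gamma}
```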

A. Analytical Calculations
Here we extend previous results [34–37] and generalize them to arbitrary α. Given a probability distribution p and the measured probabilities p̂ from a finite sample of N word tokens, we expand H_α(p̂) around the true probabilities p_i up to second order as

H_α(p̂) ≈ H_α(p) + Σ_i (∂H_α/∂p_i)|_p (p̂_i − p_i) + (1/2) Σ_i (∂²H_α/∂p_i²)|_p (p̂_i − p_i)²,   (11)

where we used Eq. (4). We obtain E[H_α(p̂)] by averaging over the different realizations of the random variables p̂_i, assuming that the absolute frequency of each symbol i is drawn from an independent binomial with probability p_i, such that E[p̂_i − p_i] = 0 and E[(p̂_i − p_i)²] = p_i(1 − p_i)/N. This yields for the bias

E[H_α(p̂)] − H_α(p) ≈ −(α/2N) Σ_{i∈V} (p_i^{α−1} − p_i^α),   (12)

which defines the vocabulary size of order α,

V^(α) ≡ Σ_{i∈V} p_i^{α−1}.   (13)

From Eq. (12) we see that the bias in the entropy estimation is a simple function of V^(α) and N. Similar calculations (see Appendix B) show that the large-N behavior of the bias and the fluctuations (variance) of H_α, D_α, and D̃_α can be written as simple functions of V^(α) and N, as summarized in Tab. I.
TABLE I. Bias and fluctuations of the estimations X̂ of X (columns: Ĥ_α; D̂_α and D̃_α for p = q; D̂_α and D̃_α for p ≠ q). The results are valid for large sequence sizes N and depend on the vocabulary of order α, V^(α), as in Eqs. (13) and (14). Results are shown for X = H_α [order-α entropy, Eq. (4)], D_α [generalized divergence], and D̃_α [normalized divergence, Eq. (5)]; see Appendix B for the derivations. For D̃_α, we approximate the normalization D_α^max by its expectation value, Eq. (B13).

We now focus on the dependence of V^(α) on N. The sum over i ∈ V in Eq. (13) indicates that in N samples, on average, V = V(N) ≡ V^(α=1) different symbols are observed. If for N → ∞ the vocabulary V converges to a finite value, V^(α) in Eq. (13) also converges and the bias scales as 1/N. A more interesting scenario happens when V grows with N. For the heavy-tailed case of interest here, V grows as N^{1/γ}, Eq. (10), and we obtain (in Appendix C) that V^(α) scales for large N as

V^(α) ∼ N^{1/γ + 1 − α} for α < 1 + 1/γ,  V^(α) ∼ const for α > 1 + 1/γ,   (14)

where γ > 1 is the Zipf exponent defined in Eq. (1) and α is the order of the entropy in Eq. (4).
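The scaling of Eq. (14) can be checked directly by extending the sampling sketch above (again assuming a pure Zipf form with finite support S; these are our assumptions, not the empirical distribution of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, S = 1.5, 10**6
p = np.arange(1, S + 1, dtype=float)**(-gamma)
p /= p.sum()

def V_alpha(N, alpha):
    """Vocabulary of order alpha, Eq. (13), summed over the observed symbols."""
    observed = np.unique(rng.choice(S, size=N, p=p))
    return np.sum(p[observed]**(alpha - 1.0))

alpha = 1.0   # any alpha < 1 + 1/gamma exhibits the sublinear growth
for N in [10**3, 10**4, 10**5]:
    print(N, V_alpha(N, alpha), N**(1/gamma + 1 - alpha))  # Eq. (14) prediction
```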
From the combination of Eq. (14) and Tab. I we obtain the scalings with sequence size N of the estimators of H_α, D_α, and D̃_α in a heavy-tailed distribution with exponent γ. These scalings are summarized in Tab. II. Three scaling regimes can be identified for the bias and for the fluctuations. (i) For large α, the decay is 1/N (except when p = q, where the fluctuations decay even faster, as 1/N²), as in the case of a finite vocabulary and short-tailed distributions. (ii) For intermediate α, a sublinear decay with N is observed. This regime appears exclusively in heavy-tailed distributions and has important consequences in real applications, as shown below.
From the exponents of the sublinear decay we see that the bias decays more slowly than the fluctuations. (iii) For small α, α < 1/γ, H_α(p) is not defined; thus the estimators of the mean of H_α and D_α diverge. The growth of H_α (and therefore of D_α^max) and of D_α with N have the same scaling and therefore cancel each other in D̃_α, in which case a convergence to a well-defined value is found (the fluctuation of D̃_α still decays in this regime).

B. Numerical Simulations
Here we perform numerical estimations of the normalized divergence spectrum D̃_α that illustrate the regimes derived above, confirm the validity of the approximations used in their derivations, and show that the same scalings are observed for different entropy estimators. We sample twice N symbols (tokens) from the same distribution (p = q), so that D̃_α = 0 and the expected value E[D̃_α(p̂, q̂)] is the bias. (The fact that the bias shows a slower decay with N than the fluctuations makes these two effects distinguishable also in this D̃_α = 0 case, because E[D̃_α(p̂, q̂)] ≫ (V[D̃_α(p̂, q̂)])^{1/2} for large N.) We start with the most important prediction of our analytical calculations above: the existence, in heavy-tailed distributions, of a regime in which the bias and fluctuations of D̃_α decay with N more slowly than 1/N. This holds already for α = 1, i.e., for the usual Jensen-Shannon divergence, as previously shown for the bias of H_{α=1} in Ref. [37]. One potential limitation of our analytical calculations is that they are based on the plug-in estimator obtained from replacing the p_i's in the generalized entropies, Eq. (4), by the measured frequencies (i.e., p_i → p̂_i = N_i/N, with N_i being the number of observed word tokens of type i). To test the generality of our results, in the numerical simulations we use four different estimators of the Shannon entropy (i.e., α = 1): i) the plug-in estimator; ii) Miller's estimator [34], which takes into account the correction obtained from the expansion in Eq. (12); iii) Grassberger's estimator [38]; and iv) a recently proposed Bayesian estimator described in Ref. [39], which is an extension of the approach by Nemenman et al. [40] to the case where the number of possible symbols is unknown or even countably infinite [41].

TABLE II. Summary of finite-size scaling for distributions with heavy tails. Mean (E) and variance (V) of the plug-in estimators of H_α, D_α, and D̃_α for samples p̂ and q̂, each of size N, drawn randomly from p and q with power-law rank-frequency distributions with exponent γ > 1, Eq. (1). The results are obtained by combining Tab. I with Eq. (14) (for details see Appendices B and C). The constant c depends on α and has a different value in each case but is independent of N. The limit γ → ∞ corresponds to the case in which both p and q have short tails.

FIG. 4. Bias E[D̂(p̂, q̂)] between two sequences of size N drawn from the same distribution (i.e., D(p, q) = 0) using four different estimators of the entropy (see text) for three representative distributions: (a) exponential (short-tailed) distribution p_i ∝ e^{−ai} for i = 0, 1, … with a = 0.1; (b) power-law (heavy-tailed) distribution p_i ∝ i^{−γ} for i = 1, 2, … with γ = 3/2; (c) empirical Zipf distribution of word frequencies, i.e., the rank-frequency distribution p(r) from the complete Google-ngram data, p_i = f(i = r) for i = 1, …, 4623568, which is well described by a double power law [10]. (d–f) Same as (a–c) for the fluctuations V[D̂(p̂, q̂)]. The dotted lines show the expected scalings from Tab. II for short-tailed distributions, i.e., N^{−1} (N^{−2}), and power-law distributions, i.e., N^{−1+1/γ} (N^{−2+1/γ}), for the bias (fluctuations). In (c) we show the expected scaling for the bias, V_emp(N)/N, where V_emp(N) is the expected number of different symbols in a random sample of size N from the empirical distribution [32]. Averages are taken over 1000 realizations.

FIG. 5. Finite-size scaling of D̃_α for different values of α (shown for all α in panels b and d). Averages are taken over 1000 realizations.

The numerical results in Fig. 4 show that the different estimators are indeed able to reduce the bias of the estimation, but that the scaling of the bias with N remains the same. In particular, the transition from a short-tailed to a heavy-tailed distribution leads to the predicted transition from the N^{−1} (N^{−2}) decay to the slower N^{−1+1/γ} (N^{−2+1/γ}) decay of the bias (fluctuations) for all estimators. The only exception is the bias of the Bayesian estimator for the exact Zipf's law (1); but since this estimator performs poorly for the fluctuations and for the real data, we conclude that the slower scaling should be expected in general also for this elaborate estimator. These results confirm the generality of our finding that the bias and fluctuations of D̃_{α=1} decay more slowly than 1/N in heavy-tailed distributions. The consequences of this result for applications will be discussed in the next section.
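For concreteness, a sketch (ours) of two of the four estimators listed above; the plug-in estimator and Miller's correction follow the standard formulas, and the correction exactly cancels the leading bias term of Eq. (12) for α = 1. Grassberger's and the Bayesian estimators are omitted here for brevity.

```python
import numpy as np

def shannon_plugin(counts):
    """Plug-in estimator: p_i replaced by the observed frequency N_i/N."""
    N = counts.sum()
    f = counts[counts > 0] / N
    return -np.sum(f * np.log(f))

def shannon_miller(counts):
    """Miller's correction [34]: add the leading bias term (V - 1)/(2N),
    with V the number of observed symbols (natural logarithm)."""
    N = counts.sum()
    V = np.count_nonzero(counts)
    return shannon_plugin(counts) + (V - 1) / (2 * N)

# Example: both estimators on a sample from a short-tailed distribution.
rng = np.random.default_rng(5)
p = np.exp(-0.1 * np.arange(100)); p /= p.sum()
counts = np.bincount(rng.choice(100, size=10**4, p=p), minlength=100)
H_true = -np.sum(p * np.log(p))
print(H_true, shannon_plugin(counts), shannon_miller(counts))
```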
We now consider the estimation of D̃_α for α ≠ 1 in the case of heavy-tailed distributions, Eq. (1). The numerical results in Fig. 5 confirm the existence of the three scaling regimes discussed after Eq. (14) and in Tab. II. Panels (b) and (d) show the relative reduction in the bias and fluctuations achieved when the sequence size is doubled. For many different α's the relative reduction is larger than 0.5 (0.25) for the bias (fluctuations), a consequence of the slow decay with N, which shows the difficulty of obtaining a good estimation of D̃_α. In practice, the exponent γ of the distribution is unknown, such that the critical values of α that separate these regimes (e.g., α^E_1 = 1/γ and α^E_2 = 1 + 1/γ for the bias) cannot be determined a priori. Yet, since γ > 1, we know that 1 + 1/γ < 2, and therefore the bias and fluctuations of D_α for α ≥ 2 decay as 1/N (or 1/N² for the fluctuations in the case p = q). This suggests D_{α=2} as a pragmatic choice for empirical measurements, because any further increase in α will not lead to a faster convergence.
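The doubling diagnostic of Fig. 5(b, d) is easy to reproduce. A minimal sketch (ours), again assuming a pure Zipf distribution with γ = 3/2 and finite support S: under a 1/N decay the ratio of the mean plug-in divergence at 2N to that at N approaches 1/2, while in the sublinear regime it stays above 1/2 (here ≈ 2^{−1/3} ≈ 0.79 for α = 1).

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, S, alpha, N, reps = 1.5, 10**5, 1.0, 10**4, 100
p = np.arange(1, S + 1, dtype=float)**(-gamma)
p /= p.sum()

def H(counts, a):
    f = counts[counts > 0] / counts.sum()
    return -np.sum(f * np.log(f)) if np.isclose(a, 1) else (1 - np.sum(f**a)) / (a - 1)

def D(c1, c2, a):
    # plug-in divergence: counts of the pooled sample give the mixture (p+q)/2
    return H(c1 + c2, a) - 0.5 * (H(c1, a) + H(c2, a))

def mean_D(n):
    vals = []
    for _ in range(reps):
        c1 = np.bincount(rng.choice(S, size=n, p=p), minlength=S)
        c2 = np.bincount(rng.choice(S, size=n, p=p), minlength=S)
        vals.append(D(c1, c2, alpha))
    return np.mean(vals)

print(mean_D(2 * N) / mean_D(N))  # ~0.5 for 1/N decay; > 0.5 signals slower decay
```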

V. APPLICATION TO REAL DATA
In this section, we show the significance of the general results of the previous section for specific problems.
A problem that appears in different contexts is to test whether two finite-size sequences of size N, described by their empirical distributions p̂ and q̂, have a common source (null hypothesis). This involves the computation of a single divergence D̃_α(p̂, q̂), which is then compared to the divergence D̃_α(p̂′, p̂″) between two finite-size (random) samples of a single (properly chosen) distribution p′ (e.g., p′ = 0.5p + 0.5q). The probability of observing D̃_α(p̂′, p̂″) ≥ D̃_α(p̂, q̂) is then reported as a p-value [3]. Besides applications in language, e.g., comparing the distributions of word frequencies, this problem appears in the identification of coding and non-coding regions in DNA [42]. The significance of our results on finite-size estimations in Sec. IV is that, for the case of heavy-tailed distributions, the expected D̃_α(p̂, q̂) of the null model may be much larger than the value predicted based on a 1/N decay (as observed in short-tailed distributions). If the slower convergence in N is ignored, e.g., by applying standard tests [3] to heavy-tailed distributions, one rejects the null hypothesis (low p-value) even if the data are drawn from the same source, because the measured distance will be much larger. The example in Fig. 4(c) shows that, even when the size of both sequences is on the order of N ≈ 10^5, the expected D̃_1 (JSD) is E[D̃_{α=1}(p̂, q̂)] ≈ 10^{−1}. This is two orders of magnitude larger than for the exponential distribution in Fig. 4(a), where E[D̃_{α=1}(p̂, q̂)] ≈ 10^{−3}.

FIG. 6. (a) Spectrum of divergences D̃_α(p_{t1}, p_{t2}) for pairs of years (t1, t2) ∈ {(1850, 1900), (1900, 1950), (1850, 1950)} (solid lines). The dotted lines with the same colors show the results of a null model in which samples of the same size as the ones in t1 and t2 are randomly drawn from the same distribution (obtained from combining the corpora in t1 and t2), mimicking the minimum distance that can be observed due to finite-size effects. The vertical lines show the three regimes α < 1/γ, 1/γ < α < 1 + 1/γ, and α > 1 + 1/γ in the convergence of D̃_α(p_{t1}, p_{t2}) with N (see Sec. IV), obtained using γ = 1.77 [10]. Inset: ratio D̃_α(p_{t12}, p_{t12})/D̃_α(p_{t1}, p_{t2}). (b) Average divergence as a function of Δt ≡ |t2 − t1|, calculated as the average of D̃_α(p_{t1}, p_{t2}) over all pairs of years with |t2 − t1| = Δt.
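The surrogate test described above can be sketched as follows (our own code, for the α = 1 plug-in JSD; function names are hypothetical). The point of Sec. IV is precisely that the null distribution should be simulated rather than taken from an asymptotic 1/N formula.

```python
import numpy as np

rng = np.random.default_rng(3)

def H1(f):                       # plug-in Shannon entropy of a frequency vector
    f = f[f > 0]
    return -np.sum(f * np.log(f))

def jsd(c1, c2):                 # plug-in Jensen-Shannon divergence (alpha = 1)
    f1, f2 = c1 / c1.sum(), c2 / c2.sum()
    return H1(0.5 * (f1 + f2)) - 0.5 * (H1(f1) + H1(f2))

def pvalue(c1, c2, n_surrogates=1000):
    """Fraction of null-model divergences >= the observed one."""
    observed = jsd(c1, c2)
    pooled = 0.5 * (c1 / c1.sum()) + 0.5 * (c2 / c2.sum())  # p' = 0.5 p + 0.5 q
    S = len(pooled)
    null = [jsd(np.bincount(rng.choice(S, c1.sum(), p=pooled), minlength=S),
                np.bincount(rng.choice(S, c2.sum(), p=pooled), minlength=S))
            for _ in range(n_surrogates)]
    return np.mean(np.array(null) >= observed)

# demo: two samples from the same Zipf distribution should give a large p-value
S0 = 1000
pz = np.arange(1, S0 + 1, dtype=float)**(-1.5); pz /= pz.sum()
a = np.bincount(rng.choice(S0, 5000, p=pz), minlength=S0)
b = np.bincount(rng.choice(S0, 5000, p=pz), minlength=S0)
print(pvalue(a, b, n_surrogates=200))
```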
The next problems we consider appear in the analysis of historical data and in the quantification of language change [43]. These problems are representative of problems that involve the comparison of two or more divergences D̃_α(p, q) obtained from different distributions p ≠ q and different orders α ≠ α′. As depicted in Fig. 1, the different distributions are obtained from individual years (t ∈ {1850, 1900, 1950}), and we calculate the normalized spectrum D̃_α(p_{t1}, p_{t2}) between pairs of years (t1, t2). As argued in Sec. II and Appendix A, D̃_α(p, q) is meaningful even if the sequences used to estimate p and q have different sizes N_p ≠ N_q. We summarize our results in Fig. 6, from which different conclusions can be drawn:

a. Temporal change. The change of English from 1850 to 1950 was larger than the change from 1850 to 1900 and from 1900 to 1950, as seen from the fact that the curve of D̃_α(p_{1850}, p_{1950}) in Fig. 6a lies above the two other curves for all α. This intuitive result (evolutionary dynamics show no recurrences) confirms that the divergence spectrum D̃_α(p_{t1}, p_{t2}) is a meaningful quantification of language change. The average dependence of D̃_α(p_{t1}, p_{t2}) on Δt = |t2 − t1|, shown in Fig. 6b, can thus be used as a quantification of the speed of language change. We observe an approximate relationship D̃_α(Δt) ≈ D̃_α^(i) + D̃_α^(ii) Δt² for Δt ≫ 1 (see the inset of Fig. 6b), where D̃_α^(i) and D̃_α^(ii) are constants that can be related to words that change due to fluctuations (finite sampling or topical dependencies), which is independent of Δt, and to words that show a systematic increase or decrease over Δt, respectively (see Appendix D for a detailed discussion).
b. Dependence on α. All observed divergences D̃_α(p_{t1}, p_{t2}) decay with α (e.g., the three curves in Fig. 6a). As discussed in Sec. III, this shows that for words with a high (low) frequency the distributions are more (less) similar and thus the change is slower (faster). This result is consistent with previous works on the evolution of individual words on historical time scales reporting that frequent words tend to be more stable [44, 45]. This dependence on α is essential when comparing the change 1850 → 1900 to the change 1900 → 1950 (Fig. 6a). While the earlier change was smaller if counted on a token basis, D̃_{α=1}(p_{1850}, p_{1900}) < D̃_{α=1}(p_{1900}, p_{1950}), it becomes larger if one focuses on the more frequent words [D̃_{α=2}(p_{1850}, p_{1900}) > D̃_{α=2}(p_{1900}, p_{1950})].
c. Role of finite-size scalings. Our finding that the scalings (of the bias and of the fluctuations) of D̃_α with sample size N depend on α allows for a deeper understanding of the D̃_α(p_{t1}, p_{t2}) measurements discussed above. The expected D̃_α's for random sampling of the same distribution (null model, shown as dotted lines in Fig. 6a) are of the same order as the empirical distance for small α [i.e., D̃_α(p_{t12}, p_{t12}) ≈ D̃_α(p_{t1}, p_{t2})], and it is only for α > 1 that the null-model divergence becomes negligible compared to the empirical divergence [i.e., D̃_α(p_{t12}, p_{t12}) ≪ D̃_α(p_{t1}, p_{t2})]. This implies that even though the size of the individual corpora is of the order of N ≈ 10^9 word tokens, the empirically measured D̃_α is still strongly influenced by finite-size effects over a wide range of values of α, in agreement with our analysis in Sec. IV. In particular, the bias of the Jensen-Shannon divergence (α = 1) is important even for the case of the (extremely large) Google-ngram database (e.g., the inset of Fig. 6a shows that the bias is ≈ 10%).
d. α = 2 as a pragmatic choice. The slow decay of the bias and fluctuations with database size suggests that D̃_{α=2} is a pragmatic choice for reducing such finite-size effects when the exponent γ of the power-law distribution is not known. This conclusion is further corroborated by the analysis of the dependence of D̃_α on Δt (Fig. 6b). While D̃_α(Δt = 0) = 0 by construction, D̃_α does not converge to zero for Δt → 0 when extrapolating from D̃_α(Δt > 0); instead it seems to saturate, i.e., D̃_α(Δt → 0) ≈ D̃_α^(i) > 0. For small values of α, D̃_α^(i) is of the same order of magnitude as the expected bias (e.g., shown as dotted lines in Fig. 6a) and even of the same order of magnitude as the divergence D̃_α(Δt = 100) between two corpora separated by 100 years. For small α and Δt, it is thus difficult to distinguish between finite-size effects (D̃_α^(i)) and actual language change. Results for α = 2 show the largest relative variation with Δt and are therefore statistically better suited to quantify language change over time.

VI. CONCLUSIONS
In summary, we investigated the use of generalized entropies H_α to quantify the difference between symbolic sequences with heavy-tailed frequency distributions. In particular, we introduced a normalized spectrum of generalized divergences, D̃_α(p, q) in Eq. (5), that allows for comparisons between different distributions p and q and also between different α's. With increasing α, D̃_α attributes higher weight to high-frequency symbols. The more complete characterization given by the full spectrum D̃_α is particularly important in the case of heavy-tailed distributions, because in this case symbols do not have a characteristic frequency but instead show frequencies over a broad range of values.
Our main analytical finding is how the systematic (bias) and statistical (fluctuations) errors of finite-size (N) estimations of H_α and D̃_α scale with N, see Tab. II. The existence of regimes in which these scalings decay more slowly than 1/N shows that large uncertainties should be expected in estimates of H_α and D̃_α even for very large databases. This should be taken into account when comparing two or more D̃_α's and when estimating the probability that two sequences have the same source. The fact that for large α we recover the usual scaling (decay with 1/N) suggests D̃_{α=2} as a pragmatic choice in applications involving heavy-tailed distributions. Previous works using information-theoretic measures in language used α = 1 [5–8] and did not take into account the effect of (finite) database size. Our results show that the bias and fluctuations are significant even in the extremely large Google-ngram database. It is therefore essential to clarify the role of finite-size effects in the reported conclusions, in particular in the (typical) case that database sizes change over time.
Our main empirical findings on language change are: i) the least frequent words contribute more to the total vocabulary change; ii) the answer to the question of whether the speed of language change is accelerating depends on the emphasis that is given to either low-frequency or high-frequency words; and iii) the quantification of the speed of vocabulary change in time, Δt, which shows roughly a dependence D̃_α(Δt) ≈ D̃_α^(i) + D̃_α^(ii) Δt², where D̃_α^(i) (D̃_α^(ii)) quantifies the degree to which words change due to fluctuations independent of time (due to a systematic increase/decrease of the frequency over time). More generally, our spectrum D̃_α opens the possibility of studying language change at different resolutions, combining aspects from analyses on the level of individual words (e.g., Refs. [44, 45]) and of the full vocabulary of a language (e.g., Refs. [7, 9]).
Our results are also of interest beyond the cases treated here. First, the finite-size scalings we derive appear already in the entropy, and therefore the same scalings are expected in any entropy-based measure, including those based on conditional entropies such as the Kullback-Leibler divergence [1]. Second, the analysis is not necessarily restricted to the word level; it can be straightforwardly extended to n-grams of words, which also show heavy-tailed distributions [46]. Third, the spectrum of divergences D̃_α(p, q) offers a unifying framework which can be applied to problems involving different partitions of texts by varying the parameter α. For example, while in document classification [2] one tries to identify topical words (suggesting the use of low values of α), in applications of authorship attribution [47] it has been shown that the comparison of the most frequent (function) words yields the best results (suggesting the use of large values of α). Fourth, heavy-tailed distributions appear in different problems involving symbolic sequences (e.g., in DNA [17], in gene expression [18], and in music [19]), and the significance of our results is that they can be applied in all these cases.

Appendix B: Finite-size estimations

The finite sample p̂ is obtained from N identical and independent draws from the distribution p, giving an estimator for H_α:

Ĥ_α ≡ H_α(p̂) = (1 − Σ_i p̂_i^α)/(α − 1).   (B1)

In order to take the corresponding expectation values, we expand p̂_i^α around the true probabilities p_i up to second order,

p̂_i^α ≈ p_i^α + α p_i^{α−1}(p̂_i − p_i) + [α(α−1)/2] p_i^{α−2}(p̂_i − p_i)²,   (B2)

and average over the realizations of the random variables p̂_i, assuming that the absolute frequency of each symbol i is drawn from an independent binomial with probability p_i, such that

E[p̂_i − p_i] = 0,  E[(p̂_i − p_i)²] = p_i(1 − p_i)/N.   (B3)

1. Ĥ_α

Combining Eqs. (B1) and (B3) we obtain for the mean

E[Ĥ_α(p̂)] ≈ H_α(p) − (α/2N) Σ_{i∈V̂_p} (p_i^{α−1} − p_i^α),   (B4)

where we introduce the notation Σ_{i∈V̂_p}, indicating that we sum only over the expected number of observed symbols V̂_p in the sample p̂.
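As a quick consistency check of Eq. (B4) (our own sketch, not part of the derivation): for α = 2 the expansion (B2) is exact and the multinomial marginals are binomial, so the measured and predicted biases should agree closely.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, N, reps = 2.0, 5000, 2000
p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])  # illustrative short-tailed example

def H(f, a):
    f = f[f > 0]
    return (1 - np.sum(f**a)) / (a - 1)

H_true = H(p, alpha)
est = [H(rng.multinomial(N, p) / N, alpha) for _ in range(reps)]
measured_bias = np.mean(est) - H_true
predicted_bias = -(alpha / (2 * N)) * np.sum(p**(alpha - 1) - p**alpha)  # Eq. (B4)
print(measured_bias, predicted_bias)   # should agree to leading order in 1/N
```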
For the variance we get

V[Ĥ_α(p̂)] ≈ [α/(α−1)]² (1/N) Σ_{i∈V̂_p} (p_i^{2α−1} − p_i^{2α}),   (B5)

where we used that two different symbols i ≠ j are drawn independently, thus Σ_{i,j} p̂_i^α p̂_j^α = Σ_{i≠j} p̂_i^α p̂_j^α + Σ_i p̂_i^{2α}.

2. D̂_α
For D_α we have two samples p̂ and q̂, each of size N, randomly sampled from the distributions p and q, such that we can express the mean and the variance through the expectation values of the corresponding individual entropies.
Introducing the notation P ≡ ½(p + q), we get for the mean

E[D̂_α(p̂, q̂)] = E[Ĥ_α(P̂)] − ½ (E[Ĥ_α(p̂)] + E[Ĥ_α(q̂)]),   (B6)

where P̂ ≡ ½(p̂ + q̂) is distributed as a sample of size 2N from P.
For the variance we proceed analogously; in particular, for p = q = P we get

V[D̂_α(p̂, q̂)] ∝ V_P^(2α−1)/N².   (B12)

While for arbitrary p and q the variance of D̂_α contains the variances of the individual entropies (e.g., V_P^(2α)/N) and a covariance term, (only) in the special case p = q all first-order (1/N) terms vanish, yielding the qualitatively different behaviour V_P^(2α−1)/N².

3. D̃_α
The finite-size estimation of D̃_α can be obtained approximately via

D̃_α(p̂, q̂) = D̂_α(p̂, q̂)/D̂_α^max(p̂, q̂) ≈ D̂_α(p̂, q̂)/E[D̂_α^max(p̂, q̂)],   (B13)

such that

E[D̃_α(p̂, q̂)] ≈ E[D̂_α(p̂, q̂)]/E[D̂_α^max(p̂, q̂)],  V[D̃_α(p̂, q̂)] ≈ V[D̂_α(p̂, q̂)]/E[D̂_α^max(p̂, q̂)]².   (B14)