Maximal Diversity and Zipf's Law

Zipf's law describes the empirical size distribution of the components of many systems in natural and social sciences and humanities. We show, by solving a statistical model, that Zipf's law co-occurs with the maximization of the diversity of the component sizes. The law ruling the increase of such diversity with the total dimension of the system is derived and its relation with Heaps' law is discussed. As an example, we show that our analytical results compare very well with linguistics and population datasets.
Diversity is a central concept in ecology, economics, information theory, and other natural and social sciences. It can be quantified by diversity indices [1,2], such as (species) richness, the Gini-Simpson index or the Boltzmann-Shannon entropy, which characterize the system under study from different angles. Loosely speaking, high diversity may represent an advantage in terms of resilience and performance. This is the case, for instance, in ecology, where well differentiated ecosystems are often considered to be more stable [4-6] (see, e.g., Ref. [3] for the debate on this topic), and in economics as well: strong countries have a well diversified production [7].
In most cases diversity is hindered by limiting factors. For an ecosystem the amount of energy and chemical components available does not allow an unbounded increase of the population. Similarly, the number of different items produced by an economy is limited by its strength. The diversity drift is therefore a complex optimization process.
Elaborating on that, in this Letter we take the aforementioned restrictions into account and, among the possible measures of diversity [1], we consider the richness index $D$, which turns out to be particularly suited for a quantitative description of such an optimization tendency in many complex systems. Richness counts the number of different types that are present in a collection of items. For instance, the set of integers {3, 7, 1, 9, 0, 1} is richer than {3, 2, 3, 7, 7, 2}, because there are 5 different figures in the former and only 3 in the latter. Every diversity measure can be rephrased in terms of Rényi [8] (or, equivalently, Tsallis [9]) entropies (see Ref. [1] and Supplemental Material (SM) [10]). Notice, however, that the index $D$ alone is insensitive to the abundance of each type and depends only on its presence or absence.
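The richness of the two example sets can be checked with a minimal sketch. The function name `richness` is an illustrative choice, not notation from the Letter.

```python
def richness(items):
    """Richness index D: the number of distinct values in a collection."""
    return len(set(items))


first = [3, 7, 1, 9, 0, 1]
second = [3, 2, 3, 7, 7, 2]

# Abundances are ignored: only the presence or absence of a value matters.
print(richness(first), richness(second))
```

Repeating a value that is already present leaves the index unchanged, which is the insensitivity to abundance mentioned above.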
We consider situations where types can be identified by quantitative labels $s$, as in the example above. $D$ is the richness of the collection of entities $\{s_1, \dots, s_N\}$, with arbitrary $N$, but subject to the additive constraint $S = \sum_{n=1}^{N} s_n$. Here $s_n$ represents the portion of the total resource $S$ assigned to the $n$-th entity of the ensemble, i.e. its size. Entities can be cities [11] of a country with total population $S$, distinct words [12] occurring with absolute frequencies $\{s_n\}$ in a book of size $S$, or genes [13] expressed with abundances $\{s_n\}$, where $S$ is the total number of proteins synthesized in a cell.
These systems are instances where Zipf's law [14,15] is observed to hold. Other well known examples include [16] the GDP of nations [17], firm sizes [18], species in taxa [19] and fragmentation processes [20]. If ranked according to their size $s$, components obey Zipf's law when
$$s(r) \propto r^{-a}, \tag{1}$$
where $r$ is the rank, with $a \simeq 1$. A representation in terms of the distribution of sizes [15,21], $p(s) \propto s^{-\tau}$ with $\tau = 1 + a^{-1}$, is better suited to our purposes. To explain Zipfian behavior many generative mechanisms have been proposed [22-28] and it has also been framed in a broader statistical perspective [29-31]. For instance, it has been shown to be associated with maximally informative samples in modeling complex systems [30,32].
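The relation $\tau = 1 + a^{-1}$ between the rank exponent and the size-distribution exponent can be recovered with a short heuristic argument. The sketch below assumes a continuous approximation for ranks and sizes. It is not taken from the Letter.

```latex
% Assumed rank-size form: s(r) \propto r^{-a}.
% The rank of a component of size s counts the components at least as large:
r(s) \propto s^{-1/a}
% The size density is minus the derivative of the rank with respect to s:
p(s) \propto -\frac{dr}{ds} \propto s^{-1/a - 1} = s^{-\tau},
\qquad \tau = 1 + a^{-1}
% For a \simeq 1 this gives \tau \simeq 2.
```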
In this Letter we show that the maximization of the diversity index $D$ and the occurrence of Zipf's law in the distribution of the component sizes $\{s_n\}$ are naturally related. This is achieved by deriving, in a statistical model, a diversity law that can be used to estimate the index $D$ of distributions of empirical data. We put our results to the test, showing remarkable agreement with data for quantitative linguistics, taken from the Gutenberg English texts database [33], and for urbanistics, taken from the GeoNames database [34]. Finally, within our approach we also recover in a simple way the expression of Heaps' law [35,36] and discuss its relation with the diversity law. The fact that specifically $D$, among the possible diversity measures, is extremized indicates the prominent role played by this quantity in the many and diverse natural phenomena described by Zipf's law, and represents a different and perhaps profitable rationalization for its occurrence.
arXiv:2103.09143v2 [cond-mat.stat-mech] 4 Oct 2021
The model.-Consider sets of independent and identically distributed integer random variables $\{s_n\}$, sampled from a generic probability distribution $p(s)$. We call $s_n$ the size of the $n$-th component (or entity). $p(s)$ will be denoted as the bare distribution, since the effective (dressed) distribution of the $s_n$ is shaped by the presence of a global constraint $\sum_{n=1}^{N} s_n = S$, where $S$ is the total dimension of the system. $N$ is the fluctuating number of entities that, according to the particular extraction of the $\{s_n\}$, is needed to fulfill the constraint. The probability of a particular configuration $C \equiv [\{s_1, \dots, s_N\}; N]$ is given by
$$P(C) = \frac{1}{Z_S} \prod_{n=1}^{N} p(s_n)\, \delta_{\sum_{n=1}^{N} s_n,\, S}, \tag{2}$$
where the constraint is enforced by the Kronecker delta. The normalizations
$$Z_S = \sum_{N} Z_S(N), \qquad Z_S(N) = \sum_{s_1, \dots, s_N} \prod_{n=1}^{N} p(s_n)\, \delta_{\sum_{n=1}^{N} s_n,\, S} \tag{3}$$
play the role of partition functions in an ensemble where $N$ is fluctuating or fixed, respectively. One obtains the probability of having a number $N$ of entities as $p_S(N) = Z_S(N)/Z_S$. The dressed probability of observing a size $s$ can be written using Eq. (3) as
$$p_S(s) = \frac{p(s)}{\langle N \rangle_S\, Z_S} \sum_{N} N\, Z_{S-s}(N-1), \tag{4}$$
where $\langle N \rangle_S = \sum_N N\, p_S(N)$ is the average number of entities, $Z_0(0) = 1$, and the factor $N$ appears because we do not distinguish among components. If $t_s$ is the number of times the value $s \in [1, S]$ is found in a given configuration $C$, the diversity index $D$ (hereafter also referred to simply as diversity) is defined as
$$D = \sum_{s=1}^{S} \left(1 - \delta_{t_s, 0}\right), \tag{5}$$
namely the number of different values assumed by the entities. The probability $p_S(D)$ of observing a certain value of $D$ is formally given in the SM [10]. We are interested in highly diverse configurations, therefore we consider power law bare probability distributions, which grant access to a wide range of sizes,
$$p(s) = \frac{s^{-\tau}}{\Lambda(\tau, S)} \quad \text{for } 1 \le s \le S, \tag{6}$$
and $p(s) = 0$ otherwise. The normalization $\Lambda(\tau, S) = \sum_{s=1}^{S} s^{-\tau} = \zeta(\tau) - \zeta(\tau, S+1)$ is a generalized harmonic number and can be written in terms of the Riemann and Hurwitz zeta functions, $\zeta(x)$ and $\zeta(x, y)$ respectively. For large $\tau$ the bare distribution decays quickly, so most entities have small sizes: many entities are needed to fill $S$, but they take few distinct values and the diversity is small. In the other limit, small $\tau$, large sizes become more probable but the total number of entities required to fill $S$ is smaller. Consequently, diversity is again small. The diversity is therefore expected to be maximal for an intermediate value of $\tau$.
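The constrained ensemble can be explored numerically with the following minimal sketch. It assumes the recursion $Z_S = \sum_{s=1}^{S} p(s) Z_{S-s}$ with $Z_0 = 1$, which follows from summing the fixed-$N$ partition functions over $N$, and draws the sizes one at a time with the corresponding conditional weights. All function names are illustrative. This is one possible exact sampler and is not necessarily the recursive method of the SM.

```python
import random


def bare_distribution(tau, S):
    """Power-law bare distribution on 1..S; index 0 is an unused placeholder."""
    weights = [s ** (-tau) for s in range(1, S + 1)]
    norm = sum(weights)  # generalized harmonic number Lambda(tau, S)
    return [0.0] + [w / norm for w in weights]


def partition_functions(p, S):
    """Z[t] = sum over N of the fixed-N partition functions for total t."""
    Z = [0.0] * (S + 1)
    Z[0] = 1.0  # the empty configuration
    for total in range(1, S + 1):
        Z[total] = sum(p[s] * Z[total - s] for s in range(1, total + 1))
    return Z


def sample_configuration(p, Z, S, rng):
    """Draw sizes one by one; each draw is weighted by the ways to fill the rest."""
    sizes = []
    remaining = S
    while remaining > 0:
        candidates = range(1, remaining + 1)
        weights = [p[s] * Z[remaining - s] for s in candidates]
        s = rng.choices(candidates, weights=weights)[0]
        sizes.append(s)
        remaining -= s
    return sizes


def richness(sizes):
    """Diversity index D: number of distinct sizes in the configuration."""
    return len(set(sizes))


rng = random.Random(0)
S = 200
p = bare_distribution(2.0, S)
Z = partition_functions(p, S)
config = sample_configuration(p, Z, S, rng)
print(len(config), richness(config))
```

Averaging `richness` over many samples at several values of `tau` gives a direct numerical estimate of the mean diversity. The quadratic cost of `partition_functions` limits this sketch to moderate `S`.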
Our goal is to compute the average diversity $\langle D \rangle_S$ and the value of $\tau$ which maximizes it (see Fig. 1). Given the complicated expression of $p_S(D)$, we directly determine $\langle D \rangle_S$ as follows. We split the range of sizes into $s \le s^*$ and $s > s^*$ [37], where $s^*$ is defined by $\langle N \rangle_S\, p_S(s^*) = 1$; these two sectors contribute to $\langle D \rangle_S$ as
$$\langle D \rangle_S \simeq s^* + \langle N \rangle_S \sum_{s > s^*} p_S(s). \tag{7}$$
Indeed, given an average number of entities $\langle N \rangle_S$, there is at least one of them for each size $s \le s^*$, contributing to the first term on the r.h.s. of Eq. (7). The second term is the average number of entities with $s > s^*$. Since each of these sizes is represented at most once, this also corresponds to their contribution to $\langle D \rangle_S$.
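With Eq. (7), the evaluation of $\langle D \rangle_S$ only depends on the knowledge of $\langle N \rangle_S$ and $p_S(s)$. These quantities can be computed numerically with an exact recursive method, as discussed in the SM [10]. For an analytical treatment of the problem it is possible to approximate the dressed probability distribution with the bare one, i.e. $p_S(s) \simeq p(s)$ (see the SM [10]). This simplification leads to an asymptotic expression for $\langle D \rangle_S$ which is accurate for large $S$. Writing the constraint as $S \simeq \langle N \rangle_S \langle s \rangle$, with $\langle s \rangle = \sum_s s\, p(s) = \Lambda(\tau - 1, S)/\Lambda(\tau, S)$, we obtain
$$\langle N \rangle_S \simeq S\, \frac{\Lambda(\tau, S)}{\Lambda(\tau - 1, S)}, \tag{8}$$
which is in excellent agreement with the exact determination, see the SM [10]. From the definition $\langle N \rangle_S\, p_S(s^*) = 1$, we obtain $s^*(\tau, S) \simeq [S/\Lambda(\tau - 1, S)]^{1/\tau}$ and, substituting in Eq. (7), one arrives at the sought after result for the average diversity:
$$\langle D \rangle_S \simeq s^* + \frac{S}{\Lambda(\tau - 1, S)} \left[\Lambda(\tau, S) - \Lambda(\tau, s^*)\right]. \tag{9}$$
Approximating the Riemann zeta function by $\zeta(x) \simeq (x - 1)^{-1} + \gamma$, where $\gamma \simeq 0.577$ is the Euler constant, and the Hurwitz zeta function by its large-argument form $\zeta(x, y) \simeq y^{1-x}/(x - 1)$, we can write
$$\langle D \rangle_S \simeq \frac{\tau}{\tau - 1}\, s^* - \frac{(s^*)^{\tau} S^{1-\tau}}{\tau - 1}, \qquad s^* \simeq \left[\frac{(\tau - 2)\, S}{1 + \gamma(\tau - 2) - S^{2-\tau}}\right]^{1/\tau}, \tag{10}$$
where the appropriate limits for $\tau = 1$ and 2 are taken.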
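The two-sector estimate can be evaluated numerically in the bare approximation with the minimal sketch below. It assumes that the mean number of entities is estimated as $S$ divided by the mean bare size, and that the generalized harmonic numbers are computed by direct summation. The function names are illustrative and do not come from the Letter or its SM.

```python
import math


def generalized_harmonic(tau, upper):
    """Lambda(tau, upper): sum of s**(-tau) for s = 1..upper."""
    return sum(s ** (-tau) for s in range(1, upper + 1))


def mean_entities(tau, S):
    """Assumed estimate of the mean number of entities: S / <s> for the bare law."""
    return S * generalized_harmonic(tau, S) / generalized_harmonic(tau - 1.0, S)


def diversity_estimate(tau, S):
    """Two-sector estimate: sizes up to s* all present, larger ones at most once."""
    prefactor = S / generalized_harmonic(tau - 1.0, S)
    s_star = prefactor ** (1.0 / tau)
    cutoff = min(S, math.floor(s_star))
    # Unnormalized bare weight of sizes above s*; the normalization cancels
    # against the one contained in the mean number of entities.
    tail = sum(s ** (-tau) for s in range(cutoff + 1, S + 1))
    return s_star + prefactor * tail


S = 10 ** 5
for tau in (1.5, 2.0, 2.5, 3.0):
    print(tau, round(diversity_estimate(tau, S), 1))
```

Scanning `tau` on a finer grid shows where the estimate peaks at finite `S`. The direct sums cost a time linear in `S` per call, which is adequate for the sizes considered here.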
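This determination of $\langle D \rangle_S$ is portrayed in Fig. 2 and compared with the outcome of numerical simulations, finding very good agreement. For large $S$, the leading contribution to Eq. (10) is
$$\langle D \rangle_S \simeq \begin{cases} \dfrac{2 - \tau}{1 - \tau}, & \tau < 1, \\ \dfrac{\tau}{\tau - 1}\, (2 - \tau)^{1/\tau}\, S^{1 - 1/\tau}, & 1 < \tau < 2, \\ 2\, (S/\ln S)^{1/2}, & \tau = 2, \\ \dfrac{\tau}{\tau - 1} \left[\dfrac{(\tau - 2)\, S}{1 + \gamma(\tau - 2)}\right]^{1/\tau}, & \tau > 2. \end{cases} \tag{11}$$
One has $\langle D \rangle_S \sim S^{\alpha(\tau)}$ with $\alpha(\tau) = 0$ for $\tau < 1$, $\alpha(\tau) = 1 - 1/\tau$ for $1 < \tau < 2$ and $\alpha(\tau) = 1/\tau$ for $\tau > 2$, see the inset of Fig. 2. In conclusion, for large $S$, $\langle D \rangle_S$ presents a pronounced peak at $\tau = 2$. This behavior is due to the competition between the abundance of entities $\langle N \rangle_S$, favored by large $\tau$, and the diversity of their sizes, which instead is enhanced by small $\tau$, as shown in Fig. 1. We remark that the upper bound obtained by considering the deterministic partition $S \simeq 1 + 2 + \dots + D$, with $D \sim S^{1/2}$, exceeds the $\tau = 2$ case only by a logarithmic factor.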
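The comparison with the deterministic partition can be made concrete with a short sketch. The largest richness compatible with a total $S$ is the largest $D$ such that $1 + 2 + \dots + D = D(D+1)/2 \le S$. The function names are illustrative, and the $\tau = 2$ curve is the leading-order form $2(S/\ln S)^{1/2}$ quoted in the caption of Fig. 3.

```python
import math


def max_richness(S):
    """Largest D with 1 + 2 + ... + D <= S, i.e. D * (D + 1) / 2 <= S."""
    return (math.isqrt(8 * S + 1) - 1) // 2


def zipf_diversity(S):
    """Leading-order diversity at tau = 2: 2 * sqrt(S / ln S)."""
    return 2.0 * math.sqrt(S / math.log(S))


for S in (10 ** 3, 10 ** 5, 10 ** 7):
    bound = max_richness(S)
    # The ratio grows like sqrt(ln(S) / 2), i.e. only logarithmically in S.
    print(S, bound, round(bound / zipf_diversity(S), 2))
```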
Let us mention that, although we explicitly solved the model for power law distributions, which yield maximum diversity, our calculations can be straightforwardly generalized to different p(s). For instance, in the case of algebraic distributions with a lower cut-off, a case often representative of real situations [38], one recovers similar results provided that the cut-off is independent of S (see the SM [10]).
We also stress that, as shown in the SM [10], among the possible measures of diversity usually considered in the literature, D is the only one to be maximized in connection with Zipf's law.
We notice also that the model considered here is related to the random allocation model [39], where the resource $S$ is distributed among an assigned number $N$ of components. The diversity properties of such a model, however, are very different and, in particular, the special role played by $\tau = 2$ is missing. This is briefly discussed in the SM [10].
Diversity, Zipf's and Heaps' laws.-Since the diversity is determined once an empirical distribution of sizes is given, we can use $\langle D \rangle_S$ given in Eq. (11) to estimate the diversity index $D$ of power law distributed empirical data, regardless of the mechanism whereby they are produced. If the power law assumption holds, on the basis of our analytical arguments one can conclude that if a system displays Zipf's law ($\tau \simeq 2$) it is at the edge of maximal diversity, and vice versa.
As a first example we consider quantitative linguistics, the field in which Zipf's law was originally observed and where it is found in almost every human language [12,40-42]. The regime of validity of the law in this context [43], its deviations [44] and the underlying mechanism(s) are still a matter of dispute. Nonetheless, large scale studies have been performed in order to validate the law. For example, Moreno-Sánchez et al. [42] considered a very large set of English books (more than 30000) from the Project Gutenberg database. They checked how well some simple, one-parameter forms of Zipf's law describe these data on the whole interval of frequencies, finding very good agreement with a distribution of exponents centered on $\tau \simeq 2$.
We use the filtered data of Ref. [42] and, for each book, measure the diversity index $D$. The total number of words a book contains is its total size $S$, the number of distinct words is the number of entities, $N$, and the size $s$ of each entity is its absolute frequency, i.e. how many times that word appears. The diversity $D$ is therefore the number of different frequencies a given text displays. The result of this analysis is shown in Fig. 3 along with Eq. (11) for $\tau = 2$. Notice that there are no free parameters in the plot. The agreement between our theoretical prediction and the empirical points is consistent with the results reported in Ref. [42], showing that a large fraction of the books have $\tau$ close to 2.
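The mapping from a text to the triple $(S, N, D)$ can be sketched as follows. Lowercasing and splitting on whitespace is an assumption made for illustration. It is not the filtering procedure of Ref. [42], and the function name is hypothetical.

```python
from collections import Counter


def text_statistics(text):
    """Return (S, N, D) for a text under a naive whitespace tokenization."""
    words = text.lower().split()
    word_counts = Counter(words)
    S = len(words)                          # total size: number of word tokens
    N = len(word_counts)                    # number of entities: distinct words
    D = len(set(word_counts.values()))      # diversity: distinct frequencies
    return S, N, D


sample = "the cat and the dog and the bird"
print(text_statistics(sample))
```

In the sample, "the" occurs three times, "and" twice and the remaining three words once, so three different frequencies are present among five distinct words.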
As a second example, we consider how the total population $S$ of a country is distributed among its cities. We use data for European countries from the GeoNames database [34], for which Simini and James [45] showed that the size $s$ of cities closely follows a Zipf distribution ($\tau \simeq 2.02$). The diversity index $D$ is shown in Fig. 3 (bottom panel). Since cities cannot be smaller than a certain lower cutoff $s_L$, the analytical prediction to compare with is Eq. (29) of the SM [10] (solid line). Despite the noisy character of the data, there is very good agreement between the data and our theory.
The content of Eq. (8) is Heaps' law, which gives an estimate of the number of components of a system of total size $S$ given that the empirical size distribution follows a power law with exponent $\tau$. Our expression of the law for $\tau > 1$ is in accordance with Ref. [36] and complements that result with the cases $\tau \le 1$ and with the appropriate prefactors. Heaps' law is expected to hold for systems whose component statistics ($p_S(s)$ in our notation) are robust at varying $S$ [36,38]. This is captured in our approach, where Eq. (8) is only arrived at using distributions which have the same form for any $S$ (the same applies to Eq. (11)).
In our approach, Heaps' law (8) and the diversity law (11) imply each other, encoding the dependence on the system size on an equal footing. Notably, however, the latter naturally selects the exponent $\tau = 2$ as a special one. Moreover, our analysis of the Gutenberg dataset shows that the diversity law is obeyed up to the largest sizes considered ($S \simeq 10^7$), whereas it is known [46] that strong deviations from Heaps' law are caused by the finiteness of the vocabulary. Therefore, at least in the context of language, the diversity law appears more robust, which suggests that it could be better suited to interpreting the size dependence of empirical data.
FIG. 3. (Top panel) Diversity index for the books of Ref. [42]. Each green point is one of the more than 30000 English books in the Project Gutenberg database (accessed July 2014), while the black squares are a running average over 20 points. The solid line is the result $\langle D \rangle_S = 2 (S/\ln S)^{1/2}$, from Eq. (11) for $\tau = 2$, which corresponds to maximal diversity. (Bottom panel) Diversity index using data from the GeoNames database [34] for cities. Each green point is a European country and the black squares are the corresponding running average. The solid line is Eq. (29) of the SM [10], where the presence of a lower cutoff $s_L$ is taken into account. $s_L$ is estimated from the average of the smallest city in each country ($s_L \simeq 1313$), see SM [10]. The dashed line is the behavior $\langle D \rangle_S = 2 (S/\ln S)^{1/2}$, which is approached only asymptotically.
Discussion.-The partition of a finite resource $S$ among constituents informs numerous systems in diverse fields of science and humanities. In this Letter, by solving a paradigmatic statistical model, we have shown that a maximally diverse partition is accompanied by Zipf's law. Such co-occurrence is a general property of the empirical distribution, holding irrespective of the specific mechanisms at work in generating Zipfian behavior in given systems.
Diversity and information are fundamental concepts for the description of complex statistical systems, whose formalization led to the definition of a coherent set of quantitative measures, Boltzmann-Shannon entropy above all. Our results show that in the case of systems obeying Zipf's law an important role is played by one such measure, the index $D$. When framed in terms of the extremization of appropriate cost functions, problems are endowed with a complementary description and can be approached with new strategies. Our study suggests that, in some instances where Zipf's law is empirically observed, promoting diversity to the role of a driving force could provide further theoretical insights towards a deeper and more general comprehension.
O.M. is indebted to I. A. Hatton, M. Smerlak