Abstract
It has been observed that the rank statistics of string frequencies in many symbolic systems (e.g., word frequencies of natural languages) follow Zipf’s law to a good approximation. We show that, contrary to claims in the literature, Zipf’s law cannot be realized by the central limit theorem(s). The observation that a log-normal distribution of string frequencies yields approximately Zipf-like rank statistics is actually misleading: Zipf’s law for the rank statistics is strictly equivalent to a power-law distribution of the frequencies. There are two natural ways to take the infinite-size limit for the vocabulary. The first, which is the method of choice in the literature, lets the upper bound on word length tend to infinity; in the case of a multistate Bernoulli process it leads, via a central limit theorem, to a log-normal frequency distribution. An alternative way, which is actually better realizable for text samples, is to let the lower frequency bound tend to zero. This limit procedure leads to a power-law distribution and hence to Zipf’s law, at least for Bernoulli processes and, to a very good approximation, for natural languages, where it passes the test. For the Bernoulli case we give a heuristic proof.
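The Bernoulli-process claim is easy to probe numerically. The following Python sketch is an illustration, not the paper’s construction: it generates a sample from a multistate Bernoulli process (i.i.d. letters with unequal, arbitrarily chosen probabilities, plus a space that terminates words) and estimates the Zipf exponent from the slope of the resulting rank-frequency statistics on a log-log scale.

```python
import math
import random
from collections import Counter

random.seed(0)

# Hypothetical parameters: four letters with unequal probabilities and a
# space acting as word separator (a multistate Bernoulli process).
alphabet = "abcd "
weights = [0.35, 0.25, 0.15, 0.10, 0.15]  # last entry: the separator

# Draw a long character sequence and count word frequencies.
text = "".join(random.choices(alphabet, weights=weights, k=200_000))
counts = Counter(w for w in text.split() if w)
freqs = sorted(counts.values(), reverse=True)  # frequency by rank

# Zipf's law predicts frequency(rank) ~ rank^(-a); estimate a from the
# log-log slope between two widely separated ranks.
r1, r2 = 5, 500
a = math.log(freqs[r1] / freqs[r2]) / math.log(r2 / r1)
print(f"distinct words: {len(freqs)}, estimated exponent a = {a:.2f}")
```

On samples of this size the estimated exponent comes out positive and of order one; for a Bernoulli process the rank curve is only approximately straight on a log-log plot, which is consistent with the abstract’s statement that the power law emerges in the limit of vanishing lower frequency bound.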
Received 23 April 1997
DOI: https://doi.org/10.1103/PhysRevE.57.1347
©1998 American Physical Society