Abstract
Quantifying the similarity between symbolic sequences is a traditional problem in information theory which requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA over music to texts, the distribution of symbol frequencies is characterized by heavy-tailed distributions (e.g., Zipf’s law). The large number of low-frequency symbols in these distributions poses major difficulties to the estimation of the similarity between sequences; e.g., they hinder an accurate finite-size estimation of entropies. Here, we show analytically how the systematic (bias) and statistical (fluctuations) errors in these estimations depend on the sample size N and on the exponent γ of the heavy-tailed distribution. Our results are valid for the Shannon entropy (α = 1), its corresponding similarity measures (e.g., the Jensen-Shannon divergence), and also for measures based on the generalized entropy of order α. For small α’s, including α = 1, the errors decay slower than the 1/N decay observed in short-tailed distributions. For α larger than a critical value α*, the 1/N decay is recovered. We show the practical significance of our results by quantifying the evolution of the English language over the last two centuries using a complete spectrum of measures. We find that frequent words change more slowly than less frequent words and that α = 2 provides the most robust measure to quantify language change.
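The finite-size bias discussed in the abstract can be illustrated with a minimal numerical sketch (not part of the paper; the Zipf exponent of 1 and the vocabulary size of 10,000 symbols are illustrative assumptions): the naive plug-in estimate of the Shannon entropy systematically underestimates the true value, because the heavy tail keeps supplying symbols that have not yet been observed, and the bias shrinks only slowly as the sample size N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf's law: symbol frequency proportional to 1/rank (illustrative choice).
V = 10_000                       # vocabulary size (hypothetical)
ranks = np.arange(1, V + 1)
p = 1.0 / ranks
p /= p.sum()

# True Shannon entropy (in nats) of the parent distribution.
H_true = -np.sum(p * np.log(p))

def plugin_entropy(sample, vocab_size=V):
    """Naive (plug-in) entropy estimate from observed symbol counts."""
    counts = np.bincount(sample, minlength=vocab_size)
    f = counts[counts > 0] / counts.sum()
    return -np.sum(f * np.log(f))

# The plug-in estimator is biased downward: unseen low-frequency symbols
# contribute nothing to the estimate.  Watch the bias shrink with N.
biases = []
for N in (10**3, 10**4, 10**5):
    sample = rng.choice(V, size=N, p=p)
    biases.append(H_true - plugin_entropy(sample))
    print(f"N = {N:>6}:  bias = {biases[-1]:.3f} nats")
```

The printed bias stays positive at every sample size and decreases with N, but slowly, which is the qualitative behavior the paper analyzes for heavy-tailed distributions.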
Received 2 October 2015
DOI: https://doi.org/10.1103/PhysRevX.6.021009
This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.
Published by the American Physical Society
Focus
How to Compare Books or Genomes
Published 15 April 2016
A mathematical technique for comparing large symbol sets suggests that less frequently used words are mainly responsible for the evolution of the English language over the past two centuries.
Popular Summary
How similar are two examples of music or text from different authors or disciplines? How fast is the vocabulary of a language changing over time? How can one distinguish between coding and noncoding regions in DNA? The problem underlying all of these questions is how to quantify the similarity between sequences of symbols (e.g., notes in music, words in texts, or base pairs in DNA) and determine the likelihood that two sequences derive from the same parent population. In this work, we address this problem for the important and particularly difficult case in which the frequencies of symbols show heavy-tailed distributions, as encountered in many social, biological, and natural systems (e.g., the famous Zipf’s law that states that the frequencies of words are inversely proportional to their rank in a frequency table).
Information-theoretic measures provide a natural framework for comparing symbolic sequences. Motivated by the wide variation of symbol frequencies in heavy-tailed distributions (for example, “and” is 10,000 times more frequent in the English language than “genetic”), we analytically show how estimating the similarity between two sequences is prone to errors whose magnitudes depend on the sample size N. Our primary finding is that, for heavy-tailed distributions, (i) the error decay is often much slower than 1/N, illustrating the difficulty in obtaining accurate estimates even for large sample sizes, and (ii) there is a critical value α* of the order α of the generalized entropy for which the normal 1/N dependence is recovered. We emphasize the importance of these findings in the example of quantifying how fast the English vocabulary has changed within the last 200 years, showing that these finite-size effects have to be taken into account even for very large databases.
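The similarity measures of order α mentioned above can be sketched in a few lines, assuming a Tsallis-type generalized entropy and the corresponding generalized Jensen-Shannon divergence (entropy of the mixture minus the mean entropy of the parts); the precise definitions and normalizations used in the paper may differ, so this is an illustrative sketch only.

```python
import numpy as np

def entropy_alpha(p, alpha=1.0):
    """Generalized entropy of order alpha (Tsallis-type assumption);
    reduces to the Shannon entropy (in nats) as alpha -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

def jsd_alpha(p, q, alpha=1.0):
    """Generalized Jensen-Shannon divergence of order alpha:
    entropy of the 50/50 mixture minus the mean entropy of p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mix = 0.5 * (p + q)
    return entropy_alpha(mix, alpha) - 0.5 * (entropy_alpha(p, alpha)
                                              + entropy_alpha(q, alpha))

# Two toy word-frequency distributions over the same vocabulary.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.3, 0.2, 0.1]
print(jsd_alpha(p, q, alpha=1.0))   # Shannon (alpha = 1) case
print(jsd_alpha(p, q, alpha=2.0))   # order-2 case
```

The divergence vanishes for identical distributions and, for α = 1 and completely disjoint distributions, reaches its maximum of ln 2.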
Our results have a direct impact on computer science (e.g., authorship attribution or document classification) and also extend beyond the analysis of natural language because heavy-tailed distributions are ubiquitous in other complex systems.