Abstract
Quantifying the similarity between symbolic sequences is a traditional problem in information theory which requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA over music to texts, the distribution of symbol frequencies is characterized by heavy-tailed distributions (e.g., Zipf’s law). The large number of low-frequency symbols in these distributions poses major difficulties to the estimation of the similarity between sequences; e.g., they hinder an accurate finite-size estimation of entropies. Here, we show analytically how the systematic (bias) and statistical (fluctuations) errors in these estimations depend on the sample size N and on the exponent γ of the heavy-tailed distribution. Our results are valid for the Shannon entropy (α = 1), its corresponding similarity measures (e.g., the Jensen-Shannon divergence), and also for measures based on the generalized entropy of order α. For small α’s, including α = 1, the errors decay slower than the 1/N decay observed in short-tailed distributions. For α larger than a critical value α*, the 1/N decay is recovered. We show the practical significance of our results by quantifying the evolution of the English language over the last two centuries using a complete spectrum of measures. We find that frequent words change more slowly than less frequent words and that α = 2 provides the most robust measure to quantify language change.
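The finite-size bias discussed in the abstract can be illustrated with a minimal numerical sketch (not part of the paper; the Zipf exponent of 1 and the vocabulary size of 10,000 symbols are illustrative assumptions): the naive plug-in estimate of the Shannon entropy systematically underestimates the true value, because the heavy tail keeps supplying symbols that have not yet been observed, and the bias shrinks only slowly as the sample size N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf's law: symbol frequency proportional to 1/rank (illustrative choice).
V = 10_000                       # vocabulary size (hypothetical)
ranks = np.arange(1, V + 1)
p = 1.0 / ranks
p /= p.sum()

# True Shannon entropy (in nats) of the parent distribution.
H_true = -np.sum(p * np.log(p))

def plugin_entropy(sample, vocab_size=V):
    """Naive (plug-in) entropy estimate from observed symbol counts."""
    counts = np.bincount(sample, minlength=vocab_size)
    f = counts[counts > 0] / counts.sum()
    return -np.sum(f * np.log(f))

# The plug-in estimator is biased downward: unseen low-frequency symbols
# contribute nothing to the estimate.  Watch the bias shrink with N.
biases = []
for N in (10**3, 10**4, 10**5):
    sample = rng.choice(V, size=N, p=p)
    biases.append(H_true - plugin_entropy(sample))
    print(f"N = {N:>6}:  bias = {biases[-1]:.3f} nats")
```

The printed bias stays positive at every sample size and decreases with N, but slowly, which is the qualitative behavior the paper analyzes for heavy-tailed distributions.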
Received 2 October 2015
DOI: https://doi.org/10.1103/PhysRevX.6.021009
This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.
Published by the American Physical Society
Focus
How to Compare Books or Genomes
Published 15 April 2016
A mathematical technique for comparing large symbol sets suggests that less frequently used words are mainly responsible for the evolution of the English language over the past two centuries.
Popular Summary
How similar are two examples of music or text from different authors or disciplines? How fast is the vocabulary of a language changing over time? How can one distinguish between coding and noncoding regions in DNA? The problem underlying all of these questions is how to quantify the similarity between sequences of symbols (e.g., notes in music, words in texts, or base pairs in DNA) and determine the likelihood that two sequences derive from the same parent population. In this work, we address this problem for the important and particularly difficult case in which the frequencies of symbols show heavy-tailed distributions, as encountered in many social, biological, and natural systems (e.g., the famous Zipf’s law that states that the frequencies of words are inversely proportional to their rank in a frequency table).
Information-theoretic measures provide a natural framework for comparing symbolic sequences. Motivated by the wide variation of symbol frequencies in heavy-tailed distributions (for example, “and” is 10,000 times more frequent in the English language than “genetic”), we analytically show how estimating the similarity between two sequences is prone to errors whose magnitudes depend on the sample size N. Our primary finding is that, for heavy-tailed distributions, (i) the error decay is often much slower than 1/N, illustrating the difficulty in obtaining accurate estimates even for large sample sizes, and (ii) there is a critical value α* of the order α of the generalized entropy for which the normal 1/N dependence is recovered. We emphasize the importance of these findings in the example of quantifying how fast the English vocabulary has changed within the last 200 years, showing that these finite-size effects have to be taken into account even for very large databases.
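The similarity measures of order α mentioned above can be sketched in a few lines, assuming a Tsallis-type generalized entropy and the corresponding generalized Jensen-Shannon divergence (entropy of the mixture minus the mean entropy of the parts); the precise definitions and normalizations used in the paper may differ, so this is an illustrative sketch only.

```python
import numpy as np

def entropy_alpha(p, alpha=1.0):
    """Generalized entropy of order alpha (Tsallis-type assumption);
    reduces to the Shannon entropy (in nats) as alpha -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

def jsd_alpha(p, q, alpha=1.0):
    """Generalized Jensen-Shannon divergence of order alpha:
    entropy of the 50/50 mixture minus the mean entropy of p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mix = 0.5 * (p + q)
    return entropy_alpha(mix, alpha) - 0.5 * (entropy_alpha(p, alpha)
                                              + entropy_alpha(q, alpha))

# Two toy word-frequency distributions over the same vocabulary.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.3, 0.2, 0.1]
print(jsd_alpha(p, q, alpha=1.0))   # Shannon (alpha = 1) case
print(jsd_alpha(p, q, alpha=2.0))   # order-2 case
```

The divergence vanishes for identical distributions and, for α = 1 and completely disjoint distributions, reaches its maximum of ln 2.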
Our results have a direct impact on computer science (e.g., authorship attribution or document classification) and also extend beyond the analysis of natural language because heavy-tailed distributions are ubiquitous in other complex systems.