Solvable null model for the distribution of word frequencies

J. F. Fontanari and L. I. Perlovsky
Phys. Rev. E 70, 042901 – Published 25 October 2004

Abstract

Zipf’s law asserts that in all natural languages the frequency of a word is inversely proportional to its rank. The significance, if any, of this result for language remains a mystery. Here we examine a null hypothesis for the distribution of word frequencies, a so-called discourse-triggered word choice model, which is based on the assumption that the more a word is used, the more likely it is to be used again. We argue that this model is equivalent to the neutral infinite-alleles model of population genetics and so the degeneracy of the different words composing a sample of text is given by the celebrated Ewens sampling formula [Theor. Pop. Biol. 3, 87 (1972)], which we show to produce an exponential distribution of word frequencies.
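The reuse rule described in the abstract can be illustrated with a short simulation. The abstract does not spell out the update rule, so the sketch below assumes the standard urn scheme known to generate the Ewens sampling formula (often called Hoppe's urn): after i words have been written, a brand-new word appears with probability theta / (theta + i), and otherwise a word already used k times is repeated with probability k / (theta + i). The parameter name theta and the function names are illustrative choices, not taken from the paper.

```python
import random
from collections import Counter


def simulate_text(n_words, theta, seed=None):
    """Generate word counts from a Hoppe-urn style reuse process.

    Assumed rule (not taken from the paper's text): with i tokens already
    written, a new word is coined with probability theta / (theta + i);
    otherwise an earlier token is picked uniformly at random and its word
    is repeated, so a word used k times is reused with probability
    k / (theta + i).
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    rng = random.Random(seed)
    tokens = []      # word id of every token written so far
    next_id = 0      # id to assign to the next brand-new word
    for i in range(n_words):
        if rng.random() < theta / (theta + i):
            tokens.append(next_id)
            next_id += 1
        else:
            tokens.append(rng.choice(tokens))
    return Counter(tokens)


def expected_distinct_words(n_words, theta):
    """Expected number of distinct words under the Ewens sampling formula."""
    return sum(theta / (theta + i) for i in range(n_words))


def rank_frequency(counts):
    """Word counts sorted from the most to the least frequent word."""
    return sorted(counts.values(), reverse=True)


# Example run: compare the simulated vocabulary size with its expectation
# and inspect the counts of the ten most frequent words.
counts = simulate_text(10_000, theta=50.0, seed=1)
print(len(counts), expected_distinct_words(10_000, 50.0))
print(rank_frequency(counts)[:10])
```

Plotting the output of rank_frequency against rank on log-log axes is one way to compare the simulated sample with the inverse-rank behaviour that Zipf's law would predict.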

  • Received 2 April 2004

DOI: https://doi.org/10.1103/PhysRevE.70.042901

©2004 American Physical Society

Authors & Affiliations

J. F. Fontanari (1) and L. I. Perlovsky (2)

  • (1) Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369, 13560-970 São Carlos, São Paulo, Brazil
  • (2) Air Force Research Laboratory, 80 Scott Road, Hanscom Air Force Base, Massachusetts 01731, USA

Issue

Vol. 70, Iss. 4 — October 2004
