Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins

Pedro Carpena, Pedro A. Bernaola-Galván, Concepción Carretero-Campos, and Ana V. Coronado
Phys. Rev. E 94, 052302 – Published 4 November 2016

Abstract

Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.

  • Figure
  • Figure
  • Figure
  • Figure
  • Figure
  • Figure
  • Figure
2 More
  • Received 17 June 2016
  • Revised 4 October 2016

DOI:https://doi.org/10.1103/PhysRevE.94.052302

©2016 American Physical Society

Physics Subject Headings (PhySH)

Interdisciplinary PhysicsPhysics of Living Systems

Authors & Affiliations

Pedro Carpena*, Pedro A. Bernaola-Galván, Concepción Carretero-Campos, and Ana V. Coronado

  • Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain

  • *pcarpena@ctima.uma.es

Article Text (Subscription Required)

Click to Expand

References (Subscription Required)

Click to Expand
Issue

Vol. 94, Iss. 5 — November 2016

Reuse & Permissions
Access Options
Author publication services for translation and copyediting assistance advertisement

Authorization Required


×
×

Images

×

Sign up to receive regular email alerts from Physical Review E

Log In

Cancel
×

Search


Article Lookup

Paste a citation or DOI

Enter a citation
×