Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics

R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley
Phys. Rev. E 52, 2939 – Published 1 September 1995

Abstract

We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 Kbp). We find that, for the three chromosomes we studied, the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of the coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior, while the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates (such as primates and rodents) and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced, and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of zero- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
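
The two tests named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the abstract does not specify how n-tuples are extracted, which rank interval is fitted, or how redundancy is normalized. The sketch assumes overlapping n-tuples over the alphabet A, C, G, T, a least-squares fit over all ranks, and the common definition of n-gram redundancy as 1 - H(n)/(n log2 4). All function names (`ntuple_counts`, `zipf_ranking`, `zipf_exponent`, `ngram_entropy`, `ngram_redundancy`) are hypothetical.

```python
from collections import Counter
import math

ALPHABET = frozenset("ACGT")


def ntuple_counts(seq, n):
    """Count overlapping n-tuples, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if ALPHABET.issuperset(word):
            counts[word] += 1
    return counts


def zipf_ranking(counts):
    """Return (rank, relative frequency) pairs, most frequent n-tuple first."""
    total = sum(counts.values())
    freqs = sorted(counts.values(), reverse=True)
    return [(rank, f / total) for rank, f in enumerate(freqs, start=1)]


def zipf_exponent(ranking):
    """Least-squares slope of log(frequency) vs log(rank), sign-flipped.

    Assumption: the fit uses every rank; a real analysis would choose
    the fitting interval with care.
    """
    if len(ranking) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r, _ in ranking]
    ys = [math.log(f) for _, f in ranking]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx


def ngram_entropy(counts):
    """Shannon entropy (bits) of the empirical n-tuple distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_redundancy(counts, n):
    """Assumed definition: 1 - H(n) / (n * log2(4)); 0 means no redundancy."""
    return 1.0 - ngram_entropy(counts) / (2.0 * n)
```

To compare two regions, one would compute `ntuple_counts(region, n)` for each at the same `n`, then compare `zipf_exponent(zipf_ranking(...))` and `ngram_redundancy(...)`. Estimates become unreliable once the 4^n possible n-tuples are no longer small compared with the sequence length, which is one form of the intrinsic limitation the abstract mentions.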

  • Received 17 April 1995

DOI:https://doi.org/10.1103/PhysRevE.52.2939

©1995 American Physical Society

Authors & Affiliations

R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley

  • Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215
  • Dipartimento di Energetica ed Applicazioni di Fisica, Università di Palermo, Palermo, I-90128, Italy
  • Cardiovascular Division, Harvard Medical School, Beth Israel Hospital, Boston, Massachusetts 02215
  • Department of Biomedical Engineering, Boston University, Boston, Massachusetts 02215 USA
  • Department of Physics, Bar-Ilan University, Ramat Gan, Israel

Issue

Vol. 52, Iss. 3 — September 1995
