Genome-Wide Motif Statistics are Shaped by DNA Binding Proteins over Evolutionary Time Scales

The composition of a genome with respect to all possible short DNA motifs impacts the ability of DNA binding proteins to locate and bind their target sites. Since nonfunctional DNA binding can be detrimental to cellular functions and ultimately to organismal fitness, organisms could benefit from reducing the number of nonfunctional DNA binding sites genomewide. Using in vitromeasurements of binding affinities for a large collection of DNA binding proteins, in multiple species, we detect a significant global avoidance of weak binding sites in genomes. We demonstrate that the underlying evolutionary process leaves a distinct genomic hallmark in that similar words have correlated frequencies, a signal that we detect in all species across domains of life. We consider the possibility that natural selection against weak binding sites contributes to this process, and using an evolutionary model we show that the strength of selection needed to maintain global word compositions is on the order of point mutation rates. Likewise, we show that evolutionary mechanisms based on interference of protein-DNA binding with replication and mutational repair processes could yield similar results and operatewith similar rates. On the basis of these modeling and bioinformatic results, we conclude that genome-wide word compositions have been molded by DNA binding proteins acting through tiny evolutionary steps over time scales spanning millions of generations.


I. INTRODUCTION
DNA sequences encode information that is read and interpreted through molecular binding by proteins including transcription factors (TFs), nucleosomes, the RNA polymerase complex, and DNA replication machinery.A DNA binding protein must discriminate a small number of functional binding sites on DNA (∼1-100) from the much larger set of possible sites genome wide (∼10 6 -10 9 ), a fundamental molecular process that has been the subject of numerous biophysical studies (see, e.g., Refs.[1][2][3][4]).The size of a binding site varies depending on the DNA binding protein, with typical values in the range of 4-10 base pairs (bp) [5], while subsets of DNA nucleotide "words" or k-mers that are specifically bound by the protein are loosely referred to as its binding motifs.Given this challenging recognition problem, DNA binding proteins are not as specific as one would naively expect.Most DNA binding proteins recognize degenerate patterns; i.e., they can bind strongly to tens or hundreds of different possible words and weakly to thousands or more [6,7].Indeed, genome-wide measurements consistently detect proteins bound to DNA at nonfunctional (or ectopic) sites throughout the genome [8,9].Given these empirical observations, a fundamental question is whether the functional demands of DNA binding proteins-which are dictated by organismal physiology and typically require a high degree of discrimination between functional and nonfunctional sites-impose evolutionary constraints on genomic DNA sequences, e.g., to enhance their global capacity for target discrimination.In general, we would like to know how strongly and over what time scales can natural selection act on such constraints, as well as whether and how such a process has shaped genomes.To address these key questions at the nexus of molecular biophysics and evolutionary biology, we develop a new approach for global genomic analysis, and use it to perform wide-ranging analyses of DNA sequence evolution.
We identify two major consequences of nonfunctional protein-DNA binding.First, there are interference effects, where such binding can disrupt various molecular processes [10], including transcription [11], gene regulation [12], replication [13,14], and mutational repair [15,16].For example, ectopic binding within a protein coding region could hinder transcription of the gene, slowing down or preventing synthesis of complete mRNA molecules.Recent analyses of cancer sequencing data indicate that transcription factors bound to DNA can interfere with mutational repair machinery, which leads to elevated mutation rates at DNA binding motifs in certain cell types [15].Second, there is a titration effect, where nonfunctional binding of a regulatory protein titrates copies of the protein away from its functional sites, thereby reducing the efficiency of gene regulation [17].While these deleterious effects may be small for any given protein, each genome encodes a large number of DNA binding proteins (∼10 2 -10 3 ) that are each expressed at ∼10 1 -10 4 copies within each cell.The aggregate effect of nonfunctional DNA binding genome wide could have a significant impact on cellular physiology.To increase specificity, DNA binding proteins often act cooperatively by binding multiple, closely spaced binding motifs on DNA, which enables the formation of stabilizing protein-protein contacts [18].Although molecular cooperativity has important regulatory functions unrelated to specificity, by lowering the free energy of binding at functional sites it also reduces the frequency of ectopic binding.In eukaryotes, DNA is packaged and organized into higher-scale structures that modulate the accessibility of different genomic regions to DNA binding proteins [18].By rendering parts of the genome inaccessible to DNA binding, genome organization can reduce interference and titration effects.Nevertheless, accessible genomic regions constitute a large amount of DNA (4.5 × 10 8 bp in the human genome) and a substantial proportion of transcribed regions [19], while cooperative binding reduces but does not eliminate nonfunctional binding.
If nonfunctional binding constitutes a significant burden on cellular processes, then evolution could lead to a reduction of the number of ectopic binding sites genome wide.Until now, very few examples of genome-wide avoidance of binding motifs have been found.Analyses in bacteria have shown that promoter elements [20,21], transcription and translation boundary signals [22,23], and restriction sites [24][25][26][27] are often statistically avoided in genome sequences, yet these constitute only a handful of binding motifs that impact a tiny portion of DNA sequence space in any given species.A more general finding was made soon after the publication of the E. coli genome, when it was noticed on the basis of motif prediction algorithms that coding regions tend to avoid predicted recognition sequences for a collection of known E. coli DNA binding proteins [28].In eukaryotes, it remains largely unknown whether genome sequences have been substantially shaped by DNA binding-related evolutionary constraints.Importantly, most previous studies focused on strong DNA binding sites that are genetically well characterized, but overlooked weak DNA binding motifs, which are more difficult to define experimentally.Since for any given TF there are exponentially more weak binding k-mers than strong ones [6,7], weak ectopic binding to a large number of sites genome wide is potentially much more detrimental to cellular functions than strong ectopic binding, which may be comparatively rare.
Here, we set out to determine whether signals of evolutionary constraints imposed by protein-DNA binding of arbitrary strength can be detected in genomes, and if so, what are the magnitudes of the relevant evolutionary forces.By correlating in vitro measurements of DNA binding with genome-wide word statistics, we show that genomes have evolved to reduce the occurrence of weak binding motifs.We demonstrate that the distinct set of DNA binding proteins coded in each species' genome imposes a large set of global, evolutionary constraints that have ubiquitously shaped genome-wide motif statistics.We show that a hallmark signal of this process can be detected in all available genomes.We introduce a mathematical model of this process and use it to infer the time scales over which evolution under DNA binding constraints has shaped genomic motif composition across all domains of life.

A. Protein-DNA binding measurements and genome-wide motif statistics
To determine whether genome sequences exhibit a signature of global evolution to avoid nonfunctional protein-DNA binding, we use in vitro measurements of protein-DNA binding specificities.A number of techniques are available that yield DNA binding specificity information [29]; however, for our purposes it is particularly important to assess not only the strongest binding sequences but the full range of binding strengths to all possible DNA motifs for each DNA binding protein.Currently, the only technique that enables an unbiased measurement of this kind is a protein-binding microarray [30], which consists of ∼44 000 spots, each corresponding to a unique 60 bp DNA probe.The set of probes is designed to span the set of all possible 8-mers (total 32896), such that each nonpalindromic 8-mer is present in at least 32 different spots (palindromic 8-mers occur in at least 16 spots).Measurement of protein-DNA binding is performed by purifying a protein, which is applied to the microarray and detected by a fluorophore-conjugated antibody.A value proportional to binding strength is obtained for every possible 8-mer by averaging the fluorescence measurements over all spots that contain a given 8-mer.Such binding measurements are available for several hundred DNA binding domains from several different organisms [31].We initially study the mouse data set [6], which is one of the most extensive, in which binding measurements are available for over 100 TFs.For each TF, the log binding intensity values are centered to the median, and normalized by their dispersion, yielding a binding score b i for every possible 8-mer word i. Figure 1(a) shows the distribution of binding scores for a single TF (Mafk) across all 8-mers (see File S1 in Supplemental Material [32] for all TFs).Words in the positive or negative tails correspond to very strong or very weak binding, respectively.For most TFs, the majority of words lie along a continuum of binding levels without substantial gaps, consistent with previous observations that TFs typically exhibit degeneracy of their target preferences [33,34].
To correlate binding measurements with the genomic motif composition, we analyze the intron regions of the mouse genome, which constitute ∼750 Mb or nearly 30% of the total genomic DNA.Introns are ideal for detecting binding-related evolutionary constraints because (i) they are largely devoid of local selective pressures that would complicate the analysis (e.g., protein coding-related selection in exons) and (ii) they reside in genic regions and are thus generally accessible to binding factors.Simply correlating in vitro binding scores with genomic word frequencies, however, is highly misleading due to two effects.First, the nucleotide composition (G/C content) of the genome is a major predictor of word usage, and it has been argued that G/C content may be determined by both mutational biases (i.e., unequal rates of mutation between the four nucleotide bases of DNA) as well as natural selection [35,36].TFs that bind words composed of more (less) frequent nucleotides will automatically exhibit positive (negative) correlations between binding scores and genomic k-mer frequencies, regardless of whether or not G/C content has been evolutionarily shaped by binding-related global constraints.In our analysis, we therefore use relative word frequencies, i.e., normalized by expectation based on genome nucleotide composition.Second, in vertebrates and plants, the dinucleotide CpG is hypermutable (e.g., in mice its mutation rate is ∼10 times the average point mutation rate) causing genome-wide depletion of words that contain it [37].The distribution of k-mer frequencies in mouse introns is multimodal due to differences in words' CpG content [Fig.1(b)].To account for this large effect, which masks smaller differences among words, we correlate k-mer binding scores versus relative word frequencies separately for words with different CpG content.
The example in Fig. 1(c) shows pronounced negative correlations in each CpG category, indicating that the stronger binding a word, the less frequently it is used in the genome relative to expectation.Correlating binding scores of all words against their relative frequencies reveals highly significant negative correlations in each CpG category for a majority of TFs (63=115), with comparatively few TFs exhibiting positive correlations in each CpG category (9=115) [Fig.1(d)].For most TFs, we observe that words with below-average binding (b i < 0) exhibit highly significant negative correlations of relative frequencies with binding scores (92=115) while very few have positive correlations (5=115) (see Fig. S1 in Supplemental Material [32]).Similar results are obtained in worm (21 TFs), fly (14 TFs), human (41 TFs), and yeast (89 TFs) genomes (see Fig. S2, Table S1, and File S1 in Supplemental Material [32]).We conclude that statistically significant correlations exist between binding scores and genomic relative word frequencies, and that, in general, weak binding words are avoided compared to even weaker binding words.

B. Genomic hallmark of global protein-DNA binding constraints
We sought a more general method that could be applied in the absence of in vitro measurements to detect global sequence constraints due to protein-DNA binding.We noticed that, consistent with the biophysics of protein-DNA interaction [1], TF binding scores of words that differ at a single nucleotide are strongly correlated: for each word i, we plot its binding score b i versus the average binding score bi of its "mutational neighbors"-all words that differ from i at a single nucleotide [see Fig. 1(e) herein and File S1 in Supplemental Material [32]].For all 241 TFs that we analyze, we find a general statistical rule that similar words have similar binding strengths.Therefore, if genomes have evolved under protein-DNA binding constraints, then we should detect a strong correlation between the frequencies of similar words, since similar words would be under similar constraints.In Fig. 1(f), we show the result for the mouse genome, where for each 8-mer word i, its relative frequency f i is plotted versus the average relative frequency fi of its mutational neighbors.Consistent with the hypothesis, we observe a strong correlation of f i and fi (ρ > 0.85 within each CpG group).Below, we refer to such word frequency correlations, observed in a given genome, as word-neighbor correlations.
We test for word-neighbor correlations in a large collection of fully sequenced genomes spanning all domains of life.Word frequencies are measured separately in exon and intron regions, and normalized by using appropriate null models that account for mutational biases and other compositional effects.For exons, we use synonymous codon and dicodon shuffling schemes to construct partially randomized DNA sequences that preserve amino acid sequences, genomic codon biases, and nucleotide base composition.The dicodon shuffling scheme additionally preserves the frequencies of all k-mers for k ≤ 4 [22].The randomized sequences are scanned to determine expected word frequencies for exons.For introns, we compute expected word frequencies based on genome-wide nucleotide (1-mer), dinucleotide (2-mer), or trinucleotide (3-mer) frequencies (see Sec. IV).
All tested genomes (947 bacterial or archaeal genomes, 1304 eukaryotic chromosomes from 75 species) exhibit striking correlation between f i and fi in exons [Fig.2(a) herein and Figs.S3, S4, S7, S23 of Supplemental Material [32]] and in introns (Figs.S5, S6, S6B of Supplemental Material [32]) for each of the null models.Further analysis indicates that RNA-related constraints, such as secondary structures and RNA binding motifs, do not play a significant role in shaping word frequencies in genic regions, and that DNA binding proteins provide the major set of global constraints (see Sec. 2.2 of Supplemental Material [32]).For noncoding portions of the genomes, we analyze word statistics across intergenic regions (Fig. S25 of Ref. [32]), as well as in the DNaseI hypersensitive regions of the human genome, which are experimentally determined binding-accessible regions (Fig. S22 of Ref. [32]).In all cases, we observe similarly high levels of word-neighbor correlation, and additionally, we find that normalized word frequencies correlate strongly across all tested regions (Figs.S8 and S24 of Ref. [32]).Eukaryotic genomes are further analyzed on a chromosome-by-chromosome basis.
In human chromosomes, for example, the overall shape of the f versus f plots from exons is qualitatively similar across chromosomes [Fig.2(b) herein and Fig. S9 of Ref. [32]].Comparing relative word frequencies f i between different chromosomes, we find high correlation coefficients (ρ > 0.9) for most chromosome pairs within each genome (Table S4 of Ref. [32]).Deviations are observed for short or Y chromosomes [Fig.2(b) herein and Fig. S10A of Ref. [32]; see also Sec.II F].

C. Ancient phylogenetic signal of global evolutionary constraints
Evidence of a persistent global evolutionary process acting on k-mers can be found in a principal component analysis (PCA) we perform on relative word frequencies in introns (2-mer null) across eukaryotic species (Fig. 3).
Chromosomes within a single genome form tight clusters that demarcate species.Although major groups of eukaryotes can be clustered to a certain extent using genomewide di-, tri-, and tetranucleotide frequencies [38,39] as well as codon biases [40], we achieve a significantly higher resolution using data exclusively from the intron regions of individual chromosomes.Closely related species are spatially proximate, with plants, invertebrates, and vertebrates following different directions in the PCA space (Fig. S17 of Ref. [32]).
Our analysis indicates that primates experienced a "jump" in global constraints from the rest of the mammals (Fig. S18 of Ref. [32]), while within this group, no species clusters are detected (Fig. 3), indicating that the global constraints have not significantly changed since the last common ancestor.In prokaryotes, phylogenetic signals are generally very weak or nonexistent when assessed across all genomes (Fig. S19 of Ref. [32]), but may persist at the genus and species level (Fig. S20 of Ref. [32]), indicating that over the longest evolutionary time scales global constraints can change so extensively that the most ancient parts of the signal are extremely faint and difficult to detect.3. Cross-species comparison of intron word frequencies using principle components analysis (PCA).Each point in the PCA plots corresponds to a single chromosome projected on the two principal axes.Principal axes are computed using the 6-mer relative word frequency vectors (2-mer null model) of all chromosomes shown in each plot.See Table S3 of Ref. [32] for further species information.

D. Evolutionary model of global selection
To analyze the role of natural selection in shaping global word frequencies, we examine the consequences of a global selective process acting differentially on words across a genome using an evolutionary model.A sequence of length L is represented by a k-mer composition vector p in the set L ¼ fpjp i ≥ 0; P n i¼1 p i ¼ Lg, where n ¼ 4 k is the total number of possible words and p i is the number of occurrences of word i in the sequence.We denote by s the selection vector whose ith component is the selective coefficient s i acting on each word of type i.We consider a large population of evolving sequences, each represented by a composition vector.At each generation, each sequence contributes offspring proportional to its total fitness, expðs • pÞ.The offspring mutate according to a mutational transfer operator Gðpjp 0 Þ, which gives the probability of a mutational transition p 0 → p.Under these dynamics, the expected frequency of p within the population at generation t, denoted by P t ðpÞ, evolves according to the equation where λ t is a normalizing factor, which measures the average population fitness at generation t, given by summing both sides over all values of p.The irreducibility of the transfer operator G guarantees (by the Perron-Frobenius theorem [41]) that the linear mapping of P t → P tþ1 converges to a unique steady-state distribution PðpÞ, which dictates the population structure at mutation-selection balance.
The fitness model assumes that the contribution of each word to the global fitness is independent of all other words, which is a natural assumption given the small effect sizes one expects for individual words.Since each locus along DNA mutates independently, the transfer operator can be decomposed as a product over all loci, which implies that PðpÞ is a multinomial distribution [26].A multinomial random variable x, denoted x ∼ multðf; LÞ, results when sampling L independent events from a discrete probability distribution f i , where x i is the number of outcomes of type i.With this notation, we have p ∼ multðf; LÞ, where f is the single-locus distribution of words at mutation-selection balance.This distribution f is obtained as the unique positive solution of the eigenvalue problem where S ¼ diagðsÞ and A ij is the probability that word j mutates into word i.Once more the Perron-Frobenius theorem guarantees a unique positive eigenvector f with associated eigenvalue c, which determines the equilibrium average fitness, λ t → λ ¼ c L .For known selective coefficients and mutational transition rates, Eq. ( 2) can be solved numerically.
To invert the relationship, and obtain an expression for s i given f, we use the fact that the point mutation rate u (per bp per generation) is extremely small.We express A explicitly as A ¼ e ukðM−IÞ , where I is the identity matrix and M ij is the probability that a single mutation converts word j into word i; M ij is nonzero only for mutational neighbors, i.e., words i and j that differ by exactly one mutation.If we assume unbiased mutation, M ij ¼ 1=ð3kÞ for any pair of mutational neighbors i and j, and expand A in small u to first order, i.e., A ≈ I þ ukðM − IÞ neglecting contributions from double (or higher order) mutants that occur within a single word locus, then Eq. (2) yields where NðiÞ is the set consisting of the 3k mutational neighbors of word i.The second term within the parentheses is fi , the average frequency of mutational neighbors.Dividing both sides by f i and taking logarithms, we find Selective coefficients relative to u are thus determined up to an additive constant by the ratio of a word's frequency and the average frequency of its mutational neighbors.More generally, if words mutate with different rates ku j , and M ij is not necessarily unbiased, a generalization is easily obtained by defining the diagonal matrix U ≡ diagũ and using A ¼ e kðM−IÞU ≈ I þ kðM − IÞU in Eq. (2), which yields where J i ≡ ku i f i is the mutational flux out of word i and Ji ≡ P j∈NðiÞ ku j f j M ij is the mutational flux into word i from its mutational neighbors [42].
Figures 4(a)-4(c) show examples of the equilibrium solution [Eq. ( 2)] when selective coefficients are randomly sampled from a normal distribution with standard deviation σ.Words experiencing similar pressures fall on the lines in the ð f; fÞ plane predicted by Eq. ( 4) [Fig.4(a)].This result encapsulates a basic insight of our analysis: the genome-wide frequency of a word says little if anything about the global pressure it experiences.Words can be under-or overrepresented simply because their mutational neighbors are under negative or positive pressures, respectively.Indeed, selective coefficients can be inferred only when a word is viewed relative to its mutational neighbors.
When selection is weak relative to mutation [σ < u, Fig. 4(b)], the word cloud collapses toward a point in which all words are effectively neutral, while under strong selection [σ > u, Fig. 4(c)] the word with the maximum selective coefficient dominates the distribution.Only in the intermediate regime [σ ≃ u, Fig. 4(a)] does the solution take the form of an extended word cloud, with frequencies varying from approximately twofold avoidance to twofold enrichment.A pronounced positive correlation between f i and fi is seen, despite the fact that all words experience independent random pressures.This correlation results from selection, which modulates the frequency of neighbor words in order to alter mutational fluxes into words under selection (similar results are obtained with a skewed pressure distribution, Fig. S11 of Supplemental Material [32]).The correlation of frequencies can be increased further by introducing positive correlations in the pressures on similar words, resulting in an extended, rotated word cloud [Figs.4(d) and 4(e)], while the equal-pressure lines remain unchanged.

E. Distribution of global selective coefficients in genomes
We apply the evolutionary model (Sec.II D) to measure the global selective coefficients in each genome.Word frequencies f i are counted separately in introns and exons, and expected frequencies f 0 i are obtained using the null models described above.We measure global selective coefficients using the excess pressures Δs i ≡ s i − s 0 i , where s i and s 0 i are separately inferred from f i and f 0 i , respectively, using Eq. ( 4).One can think of the selective coefficients s 0 i as a kind of "external field" that maintains words at their null expectation.For example, in exon regions the requirements of protein coding impose a set of selective constraints on words, which we view as an external field that can be calculated using the word frequencies obtained from synonymous codon shuffling.By using excess pressures Δs i rather than s i , we measure the strength of selection needed to shift word frequencies to their observed values from their expected values based on the null models (see Supplemental Material [32]).In intron regions, mutational biases such as CpG hypermutation and other less pronounced biases can influence the observed frequencies of dinucleotides.Excess pressures relative to the 2-mer null model in introns measure only the global selective pressures that are not already accounted for by dinucleotide composition.
In the context of protein-DNA binding, Δs i corresponds to the average fitness effect due to occupancy by DNA binding proteins at each occurrence of word i genome wide [43].In Fig. 5, the distribution of selective coefficients is shown for several genomes (see Figs. S3-S7 of Ref. [32] for additional genomes and null models).In all cases, the bulk of the distribution has a width comparable to u.The selective coefficients Δs i measured on different chromosomes are strongly correlated, with overall very similar distributions (Fig. S9 of Ref. [32]).Importantly, Eq. ( 4) allows us to determine the pressures on each word in spite of pressures acting on their mutational neighbors.We find that the selective pressures on similar words are indeed highly correlated (Fig. S13 of Ref. [32]).We conclude that if genome-wide word frequencies have been maintained by natural selection acting globally across a genome, the magnitude of the relevant selective pressures would be on the order of point mutation rates.Given that our mathematical model assumes mutation-selection balance, these correspond to the magnitude of selection necessary to maintain genome-wide motif statistics over evolutionary time scales (known as "purifying selection" in evolutionary biology [44]).Since the effect sizes on individual words are tiny, we asked whether the global fitness differences between individuals are sufficiently large to constitute non-negligible selective differences for a finite size population.That is, these selective differences should be larger than the magnitude of demographic fluctuations, which scales as ∼1=N eff , where N eff is the suitably defined effective population size that accounts for fluctuating population sizes [45].If two individuals differ at l sites, or kl words, and each word contributes an average effect size AEs, their global fitness difference ΔS ∼ s ffiffiffiffi kl p .In human populations, pairs of individuals differ at 4.6 × 10 6 sites on average [46], and taking k ¼ 6 and s ∼ u (Fig. 4), with u ¼ 10 −8 [47], we have ΔS ≈ 5 × 10 −5 .Effective global selection thus requires N eff ≳ 10 4 .In E. coli, isolates differ on average at 1%-2% of sites [48], or approximately 7 × 10 4 sites.Taking s ∼ u, and using u ¼ 2.2 × 10 −10 [49], we find ΔS ≈ 1.4 × 10 −7 ; hence, global selection requires N eff ≳ 10 7 .The per-word selective coefficients we measure in these genomes are thus sufficient to maintain genome-wide statistics given the present levels of diversity in these populations and previous estimates of their effective population sizes [45].We note that the total benefit of global adaptation, compared to nonadapted sequences, scales as Lu, and is therefore much larger than the global fitness differences that exist within any globally adapted population.

F. Effect of recombination on global selection
Our evolutionary model [Eq.( 1)] propagates the expected frequencies of different word vectors P t ðpÞ and, thus, carries the implicit assumption of an infinitely large population.For finite asexual populations of size N eff with genomes of size L ≫ 1, deviations from this equilibrium will occur due to Muller's ratchet [50,51].This general effect has been discussed in a statistical physics perspective [52], and we briefly review the same reasoning here.Because of the large genotype space, the maximally fit sequence-i.e., one in which the most favorable word is present at each locus-is rapidly lost from the population, because the rate of mutation away from this sequence, Lu, is much larger than the rate of back mutation u.Within a few generations, due to the finite population size, every sequence has accumulated at least one deleterious word, and the fittest sequence in the population is now one that carries at least one mutation.This process drives the population away from the fittest sequence, reaching an equilibrium distribution that is different from PðpÞ, which depends in a complex manner on N eff , L, and u [52].Recombination, however, can reverse the ratchet effect, by enabling the previously fittest sequence to be recovered.Similarly, for sexual populations, in genomic regions with low recombination rates, ratchetlike effects of various kinds (known as Hill-Robertson effects [53,54]) will oppose global adaptation.
Accordingly, simulations of a nonrecombining finite population initialized at the predicted mutation-selection balance evolve to lower fitness (Fig. S15 or Ref. [32]), while recombination enables the equilibrium genome composition to be efficiently maintained under global In E. coli (bottom panels), an inset shows the bulk of the distribution, since a small number of words including known restriction sites have large negative coefficients.Pressure distributions are shown using a value of c ¼ 1 in Eq. ( 4), which corresponds to taking the average global fitness of a sequence to be one.selection (Fig. S14 of Ref. [32]).Similar behavior in a different simulation model has previously been shown [53].Consistent with this predicted dependence of global selection on recombination, we find that chromosomes whose word frequencies exhibit deviations from the genome-wide average tend to be nonrecombining sex chromosomes or to have smaller genetic length than average (Fig. S10 of Ref. [32]).In the human genome, the genome-wide average number of mutations per crossover per generation is 0.87 [55], while in bacteria homologous recombination replaces small fragments of a genome with homologous fragments from other cells, and occurs with rates per site that are comparable to or greater than u [56], allowing global selection to maintain genome motif composition.Since global adaptation results in a total fitness benefit that scales as Lu, and which is only achieved for sufficiently high recombination rates, the presence of global selection would thus yield a generic selective advantage for recombination.

G. Neutral mechanisms for evolution of word-neighbor correlations
Using a separate set of analyses, we determine whether neutral mechanisms, i.e., ones that operate in the absence of selective differences between individuals, could account for word frequency distributions and word-neighbor correlations.The spontaneous mutation process can exhibit certain species-specific biases, such that mutational transition rates depend on the type of nucleotide and in some cases on its immediately proximate nucleotides (see Supplemental Material [32] for a detailed discussion).To determine whether these types of mutational biases can account for word-neighbor correlations, we run a wide range of tests, which include performing an analysis of variance on k-mer frequencies using their dinucleotide and trinucleotide composition (Table S5 of Ref. [32]), analysis of wordneighbor correlations using regression residuals (Fig. S16 of Ref. [32]), a dicodon shuffling scheme that accounts for mutational biases in bacteria (Fig. S23 of Ref. [32]), and explicit incorporation of mutational biases into evolutionary models (Fig. S12 of Ref. [32]).In each of these tests, mutational biases are unable to account for the observed word frequency distributions and word-neighbor correlations (see Sec. 2.1 of Supplemental Material [32]).
Another potentially relevant neutral mechanism involves repeat-derived sequences in genomes.These are portions of the genome that have resulted from local duplications and transposon-mediated events, which amplify certain types of sequences and could possibly skew word composition as a result.Since large eukaryotic genomes have a substantial amount of repeat-derived sequence, we test whether wordneighbor correlations might arise from a balance between amplification of specific classes of mobile and/or repeatcontaining elements and mutational degradation.We analyze the repeat-masked sequence of the human genome, a procedure that removes approximately 45% of the sequence (Fig. S21 of Ref. [32]), and find that it exhibits strong word-neighbor correlations (r ≥ 0.92 for all null models, Fig. S21C of Ref. [32]); similar results are found across a wide range of species (Fig. S25 of Ref. [32]).Repeat expansion is, therefore, not responsible for word-neighbor correlations, and word frequencies are strongly correlated between repeat-masked and repeat-derived regions (Fig. S21B of Ref. [32]).A different possibility is that, due to ubiquitous small insertions and deletions (indels) occurring in all genomic regions, word composition could be determined by slippage-based mutational mechanisms in which DNA polymerase inserts or deletes short runs of DNA (see, e.g., Ref. [57]).While repeat masking cannot detect this finer-scale process, it is well known from comparative genomics and mutation accumulation studies in different species that indels occur between 0.03 and 0.13 times as frequently as point mutations [35,49,[58][59][60].Our evolutionary model shows that processes that change word frequencies at much slower rates than the point mutation rate cannot yield the observed word-neighbor clouds [Fig.4(b)].These results indicate that neither mutational biases nor neutral processes involving repetitive DNA can account for the ubiquitous word-neighbor correlations we observe.
Lastly, we consider interference by DNA binding proteins with DNA replication and repair machinery as a process that could in principle introduce species-specific mutational biases on k-mers of various sizes.Detecting such binding-related mutational biases in natural populations would require statistical power beyond that available in current data sets.Recent analysis of sequencing data from human cancer cell lines indicates that such interference effects exist for certain DNA binding proteins in specific cell types under UV exposure, although their molecular mechanisms are currently unknown [15].Assuming in our model that words mutate with rates proportional to their probability of being bound q i , we obtain from Eq. ( 5), in the absence of selection (s i ¼ 0), that mutational fluxes must balance and, therefore, f i ∼ m i =q i , where m i is the right eigenvector of M ij .Equilibrium word frequencies would therefore be inversely proportional to binding probability, as expected if the severity of interference increases with binding strength.This relation may implicitly involve nonlinearities, since q i could depend on the genome composition f and other parameters [43].

III. DISCUSSION
Variation in genome composition across species has been extensively studied, from global G/C content differences to codon usage biases [40,[61][62][63][64][65].Early work by Karlin and co-workers found that dinucleotide frequencies differ between species [38,66], while further differences were later noticed at the level of longer k-mer words [39,67].There has been little consensus on the underlying causes of such differences, and a wide variety of mechanisms and explanations have been proposed [35,36,40,61,63,68,69].Several works have demonstrated that mutational biases and neutral mechanisms cannot explain basic differences in bacterial genome composition [36,63,69], in codon usage biases [62], or in genome-wide patterns of variation in flies [44].The possibility of multiple layers of "overlapping codes" that constrain coding sequences was proposed and tested in a wide range of prokaryotic and eukaryotic species in Ref. [22], which along with several other studies revealed constraints on coding regions resulting from translation initiation signals [23], promoter elements [20,21], restriction sites [24][25][26][27], repetitive sequences [70,71], RNA secondary structures [72], and microRNA target sites [73].Viral genomes are known to experience global constraints on motif usage due to host immune mechanisms [74,75], and the effect of such constraints on the coevolution of specific host innate immune genes and viral genomes has also been noticed [76].Here, we propose that a large number of species-specific constraints exist in all genomes due to the ubiquitous molecular process of protein-DNA binding.
We present several lines of evidence indicating that genome-wide word frequencies have been shaped by global protein-DNA binding constraints over long evolutionary time scales.We show that extensive in vitro binding data exhibit statistically significant correlations with genomewide relative word frequencies.These correlations are usually negative, indicating that word avoidance scales with binding strength.We identify a general hallmark of global binding constraints-the strongly correlated frequencies of similar words-which we observe in all genomes.We argue that this correlation results from the biophysics of protein-DNA binding: since similar words have similar binding strengths, their protein-DNA binding constraints should likewise be correlated, resulting in the observed word-neighbor correlations.We analyze a wide range of null models, which demonstrate that wordneighbor correlations cannot be attributed to known mutational biases.Finally, we analyze the phylogenetic signal of global sequence evolution, which we now discuss.
The persistence of a phylogenetic signal in relative word frequencies over the large evolutionary distances seen in Fig. 3 herein and Figs.S17 and S18 of Ref. [32] cannot be explained by an unconstrained neutral divergence process.For example, D. melanogaster and M. musculus diverged over 600 million years ago, yet their relative word frequencies (2-mer null) exhibit a correlation of ρ ¼ 0.43 (Table S4 of Ref. [32]).Billions of generations separate these genomes, and considering that their per bp per generation mutation rates are ∼10 −8 [77], on average every neutral position will have mutated one or more times.Since our analysis is performed in introns, most of the ancient signal would have been destroyed, leaving essentially no correlation.We propose that this phylogenetic signal persists because DNA binding proteins' specificities are conserved over long evolutionary time scales, and they continue to exert similar global constraints on genomes.Consistent with this prediction, the motif specificities of homologous transcription factors from flies and mice have been shown in vitro to be highly conserved [78].
The ability to sensitively resolve eukaryotic species based on word usage arises from two major factors: (a) that global word usage is influenced by a large number of DNA binding protein motifs and thus provides a summary signature of each genome's DNA binding protein repertoire, and (b) that introns are largely free of other evolutionary constraints, such as protein coding pressures, which allows these genomic regions to adapt much more fully to global constraints.Much further analysis would be necessary to determine which specific features of the global word usage signal are responsible for phylogenetic resolution.In the well-resolved portions of the bacterial phylogeny, the γ proteobacteria (Fig. S20 of Ref. [32]), we test whether the extremes of the word usage distribution are necessary for phylogenetic resolution.We remove all known restriction sites from the analysis, since these are generally the most strongly avoided words, and find no change in our ability to resolve these species (data not shown).Likewise, how changes in DNA binding proteins map to divergence of global word usage and how the total number of DNA binding proteins in a genome influences the strength of this divergence remain to be studied.To this end, an interesting case is presented by bacteria with strongly reduced genomes and tiny cell sizes, such as Mycoplasma, which are known to have very few transcriptional regulators per genome [79], yet which show typical word-neighbor correlations (Fig. S4 of Ref. [32]).Since individual DNA binding proteins each exhibit a continuum of binding scores (File S1 of Ref. [32]), the concentration of DNA binding proteins in the cell may be as important as the overall binding repertoire of these proteins in determining the magnitude of DNA binding constraints acting on a genome.
At the level of individual transcription factors, the correlations of binding scores with genomic word frequencies are highly significant, but the magnitudes of the correlations are not particularly high (Fig. 1 herein and Figs.S1 and S2 of Ref. [32]).This finding suggests that a typical word experiences weak global constraints from a number of distinct binding proteins, with each word's frequency reflecting an average over a collection of different binding scores, or, equivalently, that the global word distribution is not driven by a small subset of the genome's DNA binding proteins, but rather reflects an average set of constraints determined by a substantial fraction of them.While these constraints appear to be predominantly due to selection against weak binding motifs, a number of strong positive correlations are observed across the different data sets.Positive correlations are expected when a DNA binding protein functions in a genome-wide process that involves localization to a large number of sites on DNA, e.g., DNA binding factors that mediate chromatin remodeling on a large scale, as occurs during developmental transitions.Intriguingly, the Zic proteins, which correspond to the three largest positive correlations in the mouse data set [Fig.1(d), right edge], were recently shown to be involved in binding a large number of sites in the genome that become accessible to DNA binding during the differentiation of cerebellar granule neurons [80].
We consider the possibility that genomic word composition is maintained by global selection, and show that if this were the case, the magnitude of selective coefficients would be on the order of the point mutation rate u.Likewise, we show that if word composition evolves due to binding-related mutational biases, the rate of this process would be on the order of u.In either case, the relevant evolutionary rates are extremely small, since, e.g., u∼10 −10 per generation (E. coli [49]) or ∼10 −8 (H.sapiens [47]).In both cases, the phylogenetic signal of word composition (Fig. 5 herein and Figs.S17 and S18 of Ref. [32]) is preserved over long evolutionary time scales due to strong conservation of DNA binding proteins' binding preferences.However, the two mechanisms make several different predictions that may become testable as larger DNA sequencing data sets become available in different species.
We show that the efficacy of global selection depends on recombination (Figs.S14 and S15 of Ref. [32]), and consistently we find that chromosomes with low recombination rates deviate in word composition from the rest of the genome (Fig. S10 of Ref. [32]).The binding-related mutational bias mechanism predicts no effect of recombination on word composition.However, low recombination rates might be correlated with low accessibility to DNA binding factors in certain cases [81]; hence, it is not straightforward to rule out binding-related biases as a possible explanation.A much finer-scale analysis of recombination rates genome wide combined with reliable DNA accessibility measurements could in principle be used to determine which mechanism is operative in any given species.
Under both mechanisms, the evolutionary rates scale with u, but the underlying cause of this scaling is potentially quite different.For binding-related biases, such scaling is automatic, since the biased mutation rates are all multiples of the point mutation rate.For global selection, the selective coefficients s i could in principle be arbitrary, and their scaling with u in all genomes we examine would imply some universal set of constraints related to the molecular processes that protein-DNA binding interferes with.While the origin of such scaling is not obvious, neither are the evolutionary forces that have tuned the point mutation rate itself, which varies in nature across 7 orders of magnitude and appears to scale inversely with genome length in microbes [77,82].Indeed, one can consider the possibility that point mutation rates themselves have evolved to be on the order of global selective coefficients, which would enable genomes to gain the fitness benefit conferred by global adaptation.Another possibility is that protein-DNA binding has a tiny impact on DNA replication rates, in which case removing one binding site would decrease replication time by an average time unit scaling proportionally with the binding strength and inversely with the genome length, leading to global selective coefficients scaling with u.It is also possible that protein-DNA binding interferes with mutational repair processes, but does not lead to mutational biases.Instead, it could slow down or impair the cell division process, resulting in global selective coefficients that are proportional to both u and the binding strength.Some of these possibilities are testable by highly sensitive laboratory measurements.
Lastly, the two mechanisms differ in their response to population size.Binding-related mutational biases are not predicted to be sensitive to population size, and would operate identically in large or small populations.The efficiency of global selection in maintaining genome composition, on the other hand, will be sensitive to population size, though the effect may be small and the amount of recombination will determine precisely how much more effective global selection will be in larger populations.It may, therefore, be possible, with sufficiently high resolution data and detailed population genetic modeling, to distinguish the mechanisms using measurements from related species of substantially different population sizes.
Our extensive analysis of genomic data demonstrates that protein-DNA binding constraints constitute a universal evolutionary force that acts on genomes.We propose that this force arises from the functions of a diverse and distinct set of DNA binding processes within each genome that together generate a characteristic set of global constraints.As these protein-DNA binding constraints maintain genome-wide motif compositions over long evolutionary time scales, their hallmark can be detected in the correlated frequencies of words and their mutational neighbors.Our findings introduce a new view on genomic evolution, in which the global composition of a genome is constantly being molded by the complement of DNA binding proteins that it expresses.

A. Genomic sequence data sets
Sequences are downloaded from Ref. [83].We manually curate a large, representative set of genomes across all bacterial and archaeal groups (947 genomes).For eukaryotes, all fully assembled and annotated genomes are included in the intron analysis (75 species, 1304 chromosomes; mitochondria and plastid sequences are not included; see Table S3 of Ref. [32]).Among eukaryotes, exon analysis is performed for 17 species (251 chromosomes; see Table S3 of Ref. [32]).To extract the intron segments, coding sequence coordinates are obtained from the annotation files and then mapped to the full-length genome.For both exon and intron extraction, only one of the splicing isoforms is selected at random.

B. Measuring genomic word statistics
Word frequencies are collected by a sliding window of k bp (k ¼ 2; …; 8) across the sequences.For the coding region analysis, frequencies are collected on each open reading frame (i.e., multiple exons are joined for eukaryotes) and then combined for all open reading frames on a chromosome.Start and stop codons are ignored.For the intron regions, each intron segment is read separately and then the frequencies are combined.Gaps (N tracts in the eukaryotic sequences), if encountered, are replaced randomly by four bases at equal frequencies.Word frequencies are counted on the plus strand of DNA, and the counts of reversecomplement words are combined.

C. Constructing the null models in exon and intron sequences
For the coding region analysis using codon shuffling as the null model, synonymous codons are shuffled across the entire chromosome sequence.Start and stop codons are ignored.Each shuffled sequence is then scanned as above.The expected word frequencies f 0 i are calculated as the average of word counts from 1000 such shuffled sequences.For the intron analysis using nucleotide base composition as the null model (1-mer null), base compositions are calculated on the chromosomal scale.where pðcjabÞ is the probability of observing trinucleotide ðabcÞ conditional on the dinucleotide ðabÞ occupying the first two positions, and p w 1 w 2 is the measured frequency of dinucleotide ðw 1 w 2 Þ.

D. Principle components analysis of genomic word frequencies
Principle component analyses are performed on the correlation matrix between relative word frequencies using the 2-mer null model.For eukaryotic intron regions (Fig. 3 herein and Figs.S17 and S18 of Ref. [32]), relative word frequencies of 1049 chromosomes from 61 species are used for the PCA (Fig. S17 legend and Table S3 of Ref. [32]).Excluded are chromosomes that are shorter than 1 Mb or having <0.8 correlation between the measured relative frequencies or excess pressures of reverse complement words on the plus strand, as well as the chromosomes of the bird Ficedula albicollis, because its word frequencies take on a split distribution in the ð f; fÞ plane.The word frequencies on these chromosomes are so extreme that they dominate the PCA analysis.The plots for subgroups shown in each panel are produced by PCA on each subgroup alone.For bacterial coding regions (Figs.S19 and S20 of Ref. [32]), relative word frequencies are computed using codon shuffling as the null model, and counts are combined on both strands of all analyzed exon sequences.

FIG. 1 .
FIG. 1. Mouse genomic binding landscape and in vitro DNA binding measurements.(a) Distribution of binding scores b i for the Mafk TF over all 8-mer words i.(b) Distribution of 8-mer word frequencies f i in mouse introns (black curve); f i are shown normalized with respect to expectation based on genome-wide nucleotide composition.Words are shown as separate histograms according to their CpG counts (colored bars).(c) Correlation of b i and log f i for the Mafk TF in separate CpG categories.Color indicates density of points.(d) Correlation coefficients (Spearman's ρ) of b i versus f i are shown for each mouse TF separately computed conditioned on word CpG content.Bars are shown only for statistically significant correlations, with p value < 10 −6 .(e) Binding scores of words (b i ) are correlated with the average binding score of their mutational neighborhoods ( bi ); results shown for Mafk, ρðb; bÞ ¼ 0.87.(f) Correlation of f i and fi over all 8-mer words i for mouse introns; words are colored according to CpG content, and frequencies are normalized as in (b) and (c).

FIG. 2 .
FIG.2.Word-neighbor correlations are observed in genomes across all domains of life.(a) Word frequency plots of f i versus fi in exons of representative bacterial and eukaryotic genomes.Points correspond to all possible 6 bp words.Frequencies are shown relative to the null expectation from synonymous codon shuffling, where a value of 1 means the observed frequency is equal to the expected frequency.For eukaryotes, frequencies are computed across all chromosomes.(b) Word frequency plots from exons of individual human chromosomes (see Fig.S9of Supplemental Material[32] for all data).

FIG. 4 .
FIG. 4. Evolutionary model of global selection.(a)-(c) Solution of the model at mutation-selection balance.Plots in the ð f; fÞ plane of the equilibrium word frequencies for independent, normally distributed selective coefficients s i ∼ N ð0; σ 2 Þ, using the indicated values of σ.We numerically solve Eq. (2) to determine the equilibrium word frequencies.Color shows different bins of selective coefficients, and predicted equal-pressure lines [Eq.(4)] are drawn using each interval's average pressure.Inset in (c) shows a zoomed view.Parameters are k ¼ 6 and u ¼ 10 −4 , and frequencies are shown relative to the neutral expectation 4 −k .(d) Equilibrium for strongly correlated pressures on neighboring words.(e) Correlation is reduced by shuffling 50% of words' pressures from (d).

FIG. 5 .
FIG. 5. Distribution of global selective coefficients in different genomes.(a) Selective coefficients of 6-mers in four eukaryotic species.Intron data are used and Δs i values are given with respect to the 2-mer null model.(b) Selective coefficients of 6-mers measured in two bacterial species.Exon data are used and Δs i values are given with respect to the synonymous codon-shuffling null model.In E. coli (bottom panels), an inset shows the bulk of the distribution, since a small number of words including known restriction sites have large negative coefficients.Pressure distributions are shown using a value of c ¼ 1 in Eq. (4), which corresponds to taking the average global fitness of a sequence to be one.
For a word w with n A , n T , n C , n G counts of A, T, C, and G bases, respectively, the expected frequencies are calculated as f 0 w ð1 − merÞ ≡ p n A A p n T T p n C C p n G G , where p A , p T , p C , p G are base compositions measured from the full-length sequences.For analysis using the dinucleotide (2-mer null) or trinucleotide (3-mer null) models, a Markov chain model is used to compute expectations.A k-mer word w ¼ ðw1 w 2 w 3 …w k Þ is composed of an overlapping set of dinucleotides ðw 1 w 2 Þ; ðw 2 w 3 Þ; …; ðw k−1 w k Þ.The probability of observing each dinucleotide ðabÞ is measured across all introns conditional on a being at the first position, and denoted pðbjaÞ.Using these measured values, the expected frequency of the word w is given byf 0 w ð2 − merÞ ≡ p w 1 pðw 2 jw 1 Þpðw 3 jw 2 Þ Á Á Á pðw k jw k−1 Þ;where p w 1 is the measured frequency of nucleotide w 1 .Similarly, for the trinucleotide model, the word is composed of an overlapping set of trinucleotides, and f 0 w ð3 − merÞ ≡ p w 1 w 2 pðw 3 jw 1 w 2 Þpðw 4 jw 2 w 3 Þ Á Á Á × pðw k jw k−2 w k−1 Þ;