Dynamics of transposable elements generates structure and symmetries in genetic sequences

Genetic sequences are known to possess non-trivial composition together with symmetries in the frequencies of their components. Recently, it has been shown that symmetry and structure are hierarchically intertwined in DNA, suggesting a common origin for both features. However, the mechanism leading to this relationship is unknown. Here we investigate a biologically motivated dynamics for the evolution of genetic sequences. We show that a metastable (long-lived) regime emerges in which sequences have symmetry and structure interlaced in a way that matches that of extant genomes.

a. Introduction.Transposable elements (TEs) are DNA sequences that can relocate themselves in new sites of the genome.They were firstly discovered in maize by B. McClintock in the mid-1940s and initially considered as parasites with no functional roles [1].Nowadays TEs are known to be ubiquitous in both prokaryotes and eukaryotes genomes [2,3] and little doubts are left of their prominent role in genome evolution, shaping structure and function in a multitude of ways [4,5].As TEs constitute more than half of the sequence in many higher eukaryotes, a fingerprint of their presence can be quantitatively extracted from the statistical properties of their host DNA.Indeed, TEs properties were shown to be crucial in explaining structural global features of genome sequences [6][7][8][9][10][11].
Recently, Albrecht-Buehler [12] suggested that TEs were the main driving force for the emergence of the second Chargaff parity rule.This rule states that, in each strand of the DNA, the frequencies of a short oligonucleotide w is approximately equal to that of its symmetrically related ŵ, obtained from w by reversing the order of the symbols and substituting each nucleotide with its conjugated A ↔ T and C ↔ G (e.g.w = ACT GGCT , ŵ = AGCCAGT ).It has been first observed by Chargaff in the 1950s [13] and since then detected across different organisms leading to different proposals for its origin and function [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29].The importance of Albrecht-Buehler explanation is that it shows how this symmetry naturally emerges as an asymptotic outcome of the cumulative action of inversions/transpositions, one of the main mechanism of relocation of TEs.As we will show, while the proposed mechanism nicely induces Chargaff symmetry in the asymptotic DNA, it does it at the cost of trivialisation of the structural properties of the sequence: symmetry is obtained because of the complete randomization of the full double-stranded DNA.In view of the ubiquity of complex structures in genomes [30][31][32][33][34][35][36][37][38], this result raises the question whether symmetry can appear without a full randomization of the sequence and in a way that is compatible with the existence of structure.The importance of this question is enhanced by our recent findings [39] that Chargaff symmetry extends beyond the frequencies of short oligonucleotides -remaining valid on scales where non-trivial structure is present -and that an hierarchy of other symmetries exists, nested at different structural scales.This findings are confirmed in Fig. 1, which shows how commonly used indicators of structures, such as recurrence-time distribution (panel a) and correlation functions (panel b), coincide for symmetrically related observables at different scales.
In this work we present a biologically motivated dynamical process that explains the observed relation between symmetry and structure in DNA sequences.In particular, we propose a model that mimics the action (inversion/transpositions) of TEs on DNA and we analytically describe its dynamical behavior.Using indicators to quantify both symmetry and the presence of non-trivial structure in symbolic sequences, we show that the co-occurrence of symmetry and structure is an emergent statistical property in sequences generated by such model, reproducing the same hierarchical relation detected in extant genomes.
b. Quantifying structure and symmetry.We consider symbolic sequences s = {s i } N i=1 of length |s| = N with s i ∈ A = {A, C, G, T }.Given a subsequence a of s (a word) we denote its corresponding reversecomplemented word as â, obtained from a by reversing the order of the symbols and substituting each nucleotide {A, C, G, T } by its complementary one A ↔ T and C ↔ G.We call f x (s) the percentage of the nucleotide x in the sequence s.Finally, we denote by CG(s) := f C (s) + f G (s) (the so called CG-content).In the following, it will be useful to partition the full set We introduce the following simple indicators of the presence of Chargaff Symmetry and of non-trivial structure composition of a given sequence s.
To quantify the compliance of s with Chargaff symmetry, we average the normalized difference of the abundance between a nucleotide and its symmetric one (see [21] where a similar measure was firstly introduced) I sym = 0 indicates a fully Chargaff-symmetric sequence, , and I sym = 0.08 is obtained for a 2% variation of equal frequencies (e.g., For simplicity, we consider I sym > 0.08 to be a violation of Chargaff symmetry. To quantify the presence of non-trivial structures in a given symbolic sequence s we first compute the distribution P (τ ) of distances τ between two successive occurrence of the same nucleotide x.For random sequences, P (τ ) decays exponentially as In contrast, the presence of a fat tail (standard deviation much larger than the mean) is considered a signature of a complex organization.We thus quantify structure as the distance of s from random sequences by where µ τ ≡ τ and σ τ ≡ τ 2 − τ 2 are the mean and standard deviation of the measured P (τ ), and √ 1 − f x is the expected σ τ /µ τ for nucleotide x in a random sequence.For random sequence we thus have I str (s) = 0, while departure from this value mark the presence of nontrivial structure.For simplicity, we consider I str > 0.01 to be a signature of structure.
c. Dynamics.We investigate symmetry and structure of sequences that evolve through the following dynamics, that maps one sequence s(t) ∈ A N into another sequence s(t + 1) ∈ A N by mimicking the action of TEs [12].The dynamics is defined composing two actions: (i) pick a random position j of s and a random size ≥ 0, with = L [? ].
(ii) replace the subsequence b ≡ {s i } j+ −1 i=j of size starting at position j, by its reverse complement b.
The couple (j, ) parametrizes the effect of an inversion/transposition, which we denote by g (j, ) : A N → A N .Its action has interesting properties: g (j, ) is an involution for every (j, ) and the total number of C and G (or, equivalently, of A and T ) is invariant under g: CG(s t ) = CG(s 0 ) ∀t.This implies that the dynamics is restricted to the invariant subspace of sequences with constant CG-content B N (CG(s 0 )).
d. Asymptotic equilibrium.The dynamics can be equivalently described as an ergodic Markov chain over the space of sequences B N (CG(s 0 )).The fact that g (j, ) is an involution forces the transition matrix to be bistochastic and thus in the asymptotic equilibrium all sequences are equiprobable.This means that, for t → ∞ and irrespective of the initial ancient DNA sequence, the evolution asymptotically leads to sequences that can be equivalently considered generated by an independent and identically distributed (iid) process with p(G) = p(C) = CG(s 0 )/2 and p(A) = p(T ) = (1 − CG(s 0 ))/2.Therefore, the expected value of our indicators of symmetry and structure Eqs. .This shows analytically that the TE dynamics asymptotically leads to Chargaff symmetric sequences, in agreement with previous claims [12].However, this symmetric equilibrium is a (trivial) consequence of a full randomization.Therefore our results show also that the current explanations of the second Chargaff parity rule [12] is not satisfactory as it is not compatible with any structure, which is known to remain significant at distances of several thousands of nucleotides [30][31][32][33][34][35][36][37][38] (see also Fig. 1).Next we show that the same TE dynamics is rich enough by showing that symmetric sequences with non-trivial structure are generated pre-asymptotically as long-lived metastable states of TEs dynamics.
e. Symmetry and structure over time -three regimes.We now investigate symmetry and structure of the sequences s(t) by computing how our indicators I sym an I str depend on time t (i.e., their values after t applications of g (j, ) ).We show that Chargaff symmetry emerges much before equilibrium, together with a complex domain-like structure.
We first investigate structural properties of sequences after a finite number t of iterations.We define a domain of s(t) as a subsequence of consecutive sites that have been involved in the same series of reverse/complement events.We then distinguish between domains of type Γ and Γ, depending on whether the number of transformations g they were involved is even or odd, respectively.By definition, the starting sequence is composed by a single domain of type Γ.After one iteration it is split into three domains, two of type Γ and one of type Γ of length 1 , corresponding to the subsequence involved in the first reverse/complement event.We now compute the average sizes Γ (t) and Γ (t) of domains after t iterations.Three regimes can be identified: (i) For short times t, if L N , the probability that the first few iterates all involve different subsequence is very high[?].At each iterate, a subsequence of a domain of type Γ of average size = L is created, cutting a domain of type Γ.Thus we have that in this regime: This regime lasts until iterates start overlapping, which happens when N/t ≈ L and average domain-sizes equalize Γ (t) = Γ (t) = L.This regime is thus valid for 0 < t t metastable = N/L.(ii) For t t metastable = N/L a typical reverse/complement event will overlap with more than one domain.In this case all the domains that lie fully inside the subsequence involved in the reverse/complement event will change type (and position) without changing length; the domains at the border are instead split in two sub-domains of different type.The randomness of this process guarantees that the already reached balance between the number and average length of the two domains types Γ and Γ is not broken while their common average length decreases in time as This second regime ends after a number of iterations t ∼ t equilibrium = N when equilibrium is reached.
(iii) For t > t equilibrium = N the average lengths stabilize at the stationary value and the sequence can be thought as a realization of the asymptotic equilibrium discussed above.We now explain how structure I str (s(t)) and symmetry I sym (s(t)) depend on the domain sizes Γ and Γ and thus on the different regimes.I str (s) : in order to identify the contribution of the dynamics in generating complex structural features, we consider an initial s(0) generated by an iid process (no structure, I str (0) = 0).With this choice, a value I str = 0 signal the construction, under the action of the dynamics, of different domain-types.In particular, at t metastable and for L >> 1, the total variance σ 2 τ can be estimated, using the law of total variance, as the sum of two components: one that measure variability of the mean of returns between domain-types and the other measuring variability of returns within each type.Accordingly I str (t) grows from 0 to the value I str (t metastable ) > 0 at the end of the first regime.In the second regime the domain sizes decay and I str (t) decreases to zero at equilibrium (at t equilibrium ).In terms of regimes we thus expect: (i) I str grows; (ii) I str decays; (iii) I str = 0.I sym (s) : each domain of type Γ is a subsequence of the ancient sequence s(0).If average size of such domains at time t is large enough, the frequency of each nucleotide are approximately the same as their frequency in s(0); similarly for Γ and ŝ(0).No constraints are imposed to the symmetry of the ancient genome.In particular, if the original sequence is not Chargaff symmetric I sym (s(0)) > 0 then the symmetry remains broken for all t t metastable as quantified by I sym (s(t)) t N Γ (t) − Γ (t) I sym (s(0)).In terms of regimes we thus expect: (i) Altogether, the estimations and calculations above lead to the following predictions for the presence of symmetry and structure as a function of time t (regimes i-iii): Structure I str > 0 but no symmetry I sym > 0.
(ii) t metastable = N/L ≤ t ≤ t equilibrium = N Structure I str > 0 and symmetry I sym = 0.
(iii) t equilibrium = N < t; Symmetry I sym = 0 but no structure I str = 0.
In Fig. 2 we confirm these predictions in a numerical simulation.
f.The metastable regime.The crucial feature of the TE dynamics discussed above is that in regime (ii) both non-trivial structure and symmetry co-exists in the generated sequences.The time (measured in number of iterations) for which this regime is valid is orders of magnitude larger than that of the first regime, as the ratio t equilibrium /t metastable = L corresponds to the average size of transposable elements (for example L 10 2 in Homo Sapiens [46]).We thus denote such long-lived regime as metastable and we expect it to be generically observed, even though it does not correspond to the stable equilibrium of our model.
The DNA sequences in the metastable regime are characterized by a symmetric domain-like structure.Domain models have been already introduced in literature to reproduce the complex structure generically observed in extant DNAs [11,[40][41][42][43][44][45][46].In particular if the distribution of domain sizes has a fat tail, this will lead to a long-range correlated sequence [11], signalled by a slow decay of P (τ ).The novelty of our approach is twofold: firstly, the domain-like structure in the metastable regime is an emergent property of the TE dynamics (it is not imposed a priori); secondly, such complex structure is intertwined with symmetry, that itself is an output of the dynamics.In particular, we have shown that sequences in the metastable regime are not only Chargaff symmetric (I sym = 0), they reproduce the hierarchical relation between symmetry and structure that is a distinctive feature of extant genomes (see Fig. 3 ).
g. Different organisms.In Fig. 4 we report I sym and I str computed for genomes of different families, together with the values obtained from our dynamics.It shows that symmetry and structure coexist in most cases.The sequences from Animals shows enhanced structure while the cases of Archaea and Bacteria shows a moderate signatures of structure, in agreement with the temporal behaviour of our model (i.e., associating t with the age of the genomes).Note that symmetry and structure properties are both statistical observations we made on the full DNA sequence.Any evolutionary constraint that pertains a small percentage of an organism genome does not affect these statistical observation in a sensible way.As an example, the protein-coding regions of Homo-Sapiens account for 1.5% of the full sequence.On the other hand, care should be taken when dealing with many different organisms: extensions of the model incorporating additional aspects of DNA evolution will be required for a quantitative comparison with the empirical data.
h. Conclusion.We have shown how a model that captures the action of transposable elements (TEs) is able to reproduce the intricate relation between symmetry and structure present in DNA sequences.We find that symmetry and structure change differently at dif- Istr for different genomes belonging to the families Archaea, Bacteria, Animals [50].Superimposed are the values of the sequences evolved via our model (starting in (Isym, Istr) = (0.39, 0) and evolving to (0, 0); parameters as in Fig. 2a).
ferent time scales (i.e., for different number of actions of TEs).For a large (pre-asymptotic) time interval, the sequences obtained in our model show the same nontrivial structures and an hierarchy of symmetries (including Chargaff) as in actual DNA sequences (confront panels (b) of Fig. 1 and Fig. 3).Our mathematical model is extremely simplified and includes the essential elements to explain the onset of symmetry and structure.In particular, it mimics only a simple action of TEs (reversecomplement), ignoring the fact that TEs are classified in different families, have different properties, and act according to different mechanisms [47][48][49].We expect that incorporating more details of the TE dynamics in our model will refine our understanding of their role in shaping statistical properties of DNA sequences, in particular in an evolutionary viewpoint that would lead to refinements in the data-model comparison presented in Fig. 4.

FIG. 1 .
FIG. 1. Symmetry and structure are intertwined in DNA.Results are shown for Homo-Sapiens chromosome 1 (symbols) and its randomly shuffled version (dashed lines).Each curve corresponds to one observable.Symmetrically related observables appear in the same box in the legend.(a) Distribution P (τ ) of recurrence-times τ (measured in number of basis) between successive occurrences of the same nucleotide.(b) Probability fX A ,X B ( ) that the bigrams XA and XB appear separated by a distance .Plotted is the normalized cross-correlation z [X A ,X B ] ( ) = fX A ,X B /(fX A fX B ) as a function of , for symmetrically related couples [XA, XB](see legend).Different nested symmetries are valid at different scales (see Ref. [39] for further details): for 150 Chargaff z [X A ,X B ] = z [ XB , XA ] , for 150 1500 Chargaff and reverse symmetry z [X A ,X B ] = z [X B ,X A ] , and for 1500 complement z [X A ,X B ] = z [ XA , XB ] and reverse symmetry.

FIG. 2 .
FIG. 2. Temporal evolution of symmetry and structure in the model.(a) Numerical evaluation of the average sizes of domains of the two types Γ (t) and Γ (t) as a function of the number of iterates t of TE's dynamics.(b) Numerical evaluation of the symmetry and structural properties of the sequence generated by the dynamics and quantified by the indicators Istr(t) and Isym(t).The filled symbols in Istr indicate that these values are statistically different from a random sequence (p − value < 0.01, equivalent results are obtained using as an alternative definition of Istr the Jensen-Shannon divergence between the P (τ ) obtained in the model and in random sequences).The sequences s(t) have length N = 10 5 and the size of reverse/complement events is L = 500, thus leading to time-scales t metastable = 10 2 and t equilibrium = 10 5 .The starting sequence is fully random with fA = 0.1, fC = 0.2, fG = 0.3, fT = 0.4.

FIG. 3 .
FIG. 3. Symmetry and structure in the metastable regime.Same observables as in Fig 1 are computed for a sequence in the metastable regime of our dynamics.Data show that this regime is characterized by a similar co-occurrence of symmetry and structure as in extant genomes.Results in panel (a) are for a sequence of length N = 5 × 10 6 initialized as in Fig. 2 and evolved using our model with TE sizes all equals to L = 5000 until t = 2048 t metastable = 1000.Results in panel (b) are for a sequence of length N = 10 5 initialized as in the artificial sequence reported in Ref. [39] and evolved using our dynamic model with fixed L = 500 until t = 256 t metastable = 200.The more generic initial sequence in panel (b) (i.e., Markov chain instead of fully random) allow us to distinguish between the different types of scale-dependent symmetries generated by the dynamics.