Evolutionary Tinkering Enriches the Hierarchical and Nested Structures in Amino Acid Sequences



Introduction
Bioinformatics approaches based on sequencing data have effectively demonstrated that DNA and amino acid sequences are encodable. This encodability has been illuminated by a range of powerful mathematical and statistical techniques, which have revealed its biological significance. Various studies have suggested strong correlations between structural features of sequences (such as regularity and nestedness) and the functional properties of proteins, indicating a profound link between sequence structure and biological function [1][2][3].
One commonly used approach to characterizing sequential features is the Shannon entropy (defined as $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of observing letter $i$) and its variants [4][5][6]. It was originally proposed to describe the uncertainty of a random variable, but was later adopted to characterize sequential randomness, on the idea that a sequence can be thought of as a realization of a sequential array of this random variable. Shannon entropy is often applied to assess sequence divergence and sequence polymorphism [7][8][9][10][11]. It represents a statistical notion of information and is insensitive to the internal structure and pattern of an individual sequence. Shannon entropy can also be extended to analyze the frequency distribution of short subsequences (namely, the k-mer method) instead of individual letters, and to investigate simple and non-overlapping repetitions [5,6,12,13,14,15]. However, this type of approach overlooks the internal hierarchical relationships in a sequence, which were found to be very important at the protein domain level [1,13,16,17] and even in language [18].
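The insensitivity of letter-level Shannon entropy to internal structure can be illustrated with a minimal sketch. The function name `shannon_entropy` and the two toy sequences are illustrative assumptions, not part of the original study; the two sequences share the same letter composition but only one of them is built from a repeated unit.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy H = -sum_i p_i log2 p_i of the letter frequencies in seq."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical toy sequences with identical composition (4 x A, C, D, E each).
repetitive = "ACDE" * 4           # a strict tandem repeat
scrambled = "AACCDDEEACDEEDCA"    # same letters, no such repeat structure

if __name__ == "__main__":
    # Both values coincide because only the letter frequencies enter H.
    print(shannon_entropy(repetitive), shannon_entropy(scrambled))
```

Because only the frequencies of single letters enter the formula, any rearrangement of a sequence leaves H unchanged, which is the limitation discussed above.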
Instead of focusing on the statistical notion, another group of methods used to characterize sequential features and structural information includes approaches such as Kolmogorov complexity and its variants. These methods aim to provide the shortest description for a specific target object. One particularly insightful variant is the "effective complexity" proposed by Gell-Mann [19,20]. It suggests that the complexity of a sequence can be gauged by analyzing its regularities or repetitive subsequences. It has been found that effective complexity is closely related to a sequence's functional features. This perspective on complexity underscores the significance of sequential repetitions and duplications. Intriguingly, considering the abundant presence of repetitive structures in the human genome, one might argue that the genomes of higher eukaryotes, including humans, exhibit greater complexity from this structural viewpoint [21][22][23].
Nonetheless, while describing complexity is essential, offering insights into how this complexity evolves is the other side of the coin. The amino acid sequence of a protein not only embodies information about its thermodynamics, folding, and other properties (Anfinsen's principle) but also encapsulates details of its evolutionary trajectory and history, which could be extracted. In 1977, François Jacob posited the abstract idea that evolution is akin to "tinkering" [24]; more specifically, innovations arise from the opportunistic reuse or recombination of existing elements. Much of this tinkering arises from replication errors, for example point mutations and DNA duplications. The latter are associated with various replication events, such as duplicons and transposable element expansion [22,23], leading to increased complexity in both protein families and genomes [21,25]. Various examples reflect this tinkering process: the length of bacteriophage tails determined by TMP [26], the needle length of the bacterial injectisome determined by YscP [27,28], cytochrome P450 in insects [29], antifreeze glycoprotein in codfish [30], the widespread presence of zinc finger proteins [31,32], and extensive core duplications in primates [33]. Many of these proteins have undergone significant expansion and mutation, either actively or passively. Yet, the challenge of quantifying such a tinkering process remains.
The Ladderpath theory is a recently proposed framework to quantitatively describe the structural information of objects such as sequences, molecules, proteins, and images [34]. It considers the shortest path to generate the target object as the way to characterize it, with the key assumption that the building blocks, once generated, can be reused any number of times in subsequent steps. These reused building blocks are called ladderons, which could also be viewed as modules, as defined in ref. [34]. This aligns with the "tinkering" process proposed by François Jacob [24]. The number of steps required for ab initio generation of a target object indicates its generation difficulty, defined as the ladderpath-index, λ. Hence, when considering a set of amino acid sequences of the same length, one can discern which sequence is more straightforward or easier to generate. Additionally, to characterize the degree of order in sequences of varying lengths, another useful index, called the order-index, is defined as ω := S − λ, where S is the size-index (namely, the length) of the amino acid sequence. By deconstructing the target object into a partially ordered multiset, or equivalently the laddergraph (as shown in Fig. 1a-c), the Ladderpath theory characterizes the structural intricacies rooted in the hierarchical and overlapping relationships formed by the target object's repetitive substructures.
The concept of the Ladderpath theory aligns with several other theories, including Kolmogorov complexity, addition chain, assembly theory, and the "adjacent possible" [35][36][37][38]. While these theories have their own measures of complexity, the Ladderpath theory posits that "complexity" should be assessed using both the ladderpath-index and the order-index [34]. A sequence is not necessarily complex if it only has a high ladderpath-index with a low order-index (Fig. 1c), or vice versa (Fig. 1b). A sequence can be deemed complex if both indices are simultaneously high. Of the three real proteins examined, the one with both a high ladderpath-index and order-index (Fig. 1a) exhibits the most intricate and complex hierarchies. For a more encompassing view, Fig. 1d shows the distribution of human proteins with lengths below 500 amino acids (AA). The Ladderpath theory underscores nature's propensity to innovate through tinkering and reusing existing structures, a trend exemplified in processes like the evolutionary creation of new proteins.
This paper is organized as follows. In Section 2.1, we provide a rigorous definition of the order-rate and ladderpath-complexity, and present a systematic comparison with a commonly used k-mer related method. Sections 2.2 and 2.3 present two statistical observations. The former reveals that human protein sequences exhibit higher ladderpath-complexity. The latter notes that proteins containing a significant portion of intrinsically disordered regions, on average, possess a higher order-rate. Both observations are statistically significant. In Section 2.4, we begin by detailing a statistical observation that there are almost no super long sequences with low order-rate values. We speculate that this might be due to the different frequencies of duplication and mutation across different evolutionary stages. This, in turn, suggests that the evolution of protein complexity follows a zigzag pattern. We offer several examples of protein families to support this speculation. The paper concludes with a discussion and a methods section that describes the algorithm for computing ladderpath-associated information. Open-source code is also available.

Two indicators that characterize amino acid sequences
Firstly, we have developed an efficient algorithm to compute the ladderpath-associated information of sequences, details of which can be found in the Methods Section, with code available on GitHub for immediate use. This algorithm can effectively handle sequences of around or below 10,000 AA (beyond which an approximation can be made), in contrast to the previous algorithm (see ref. [34]) that was limited to sequences of approximately 20 AA. The statistics displayed in Fig. 1d were derived using this new algorithm.
Moving on to Fig. 1d, we note a distinct lower boundary for the order-index ω as the sequence length S increases. This lower boundary stems from the finite number of basic building block types (which in this context are the 20 amino acid types): as the length of an amino acid sequence increases, repetitive subsequences become inevitable, resulting in a non-zero value for ω. This is purely a mathematical property, for which we need to compensate. Hence, we introduce two new indicators, the order-rate η and the ladderpath-complexity κ, to better characterize a system with a finite number of basic building block types.
Order-rate η. We define the order-rate η of a sequence x as
$$\eta(x) = \frac{\omega(x) - \omega_0(S)}{\omega_{\max}(S) - \omega_0(S)},$$
where ω(x) is the order-index of sequence x, S is the size-index of x (namely, the length of x), $\omega_{\max}(S)$ is the maximum order-index among all sequences of length S, and $\omega_0(S)$ is the average order-index of all possible sequences of length S, roughly corresponding to the average level of the least ordered sequences (see Supplementary Information (SI) section 1 for the calculations of $\omega_0$ and $\omega_{\max}$). The order-rate η characterizes the hierarchical and overlapping relationships among the subsequences of a sequence, describing the pattern regularities and repetition in the target sequence. Values of η close to zero mean that the degree of order of the sequence is close to the average level of random sequences, indicating that the sequence does not exhibit any significant pattern. As η grows, the repetitive parts become more dominant and the sequence exhibits more hierarchical structure (see Fig. 1a). η reaches 1 only when the sequence exhibits exponential elongation of a single letter, e.g., T → TT → TTTT → TTTTTTTT.
Ladderpath-complexity κ. Another indicator we put forward to characterize the internal structure of sequences is the ladderpath-complexity κ, defined as
$$\kappa(x) = \lambda(x)\,\eta(x),$$
where λ(x) is the ladderpath-index of sequence x, and η(x) is the order-rate of x. As mentioned, the order-rate η is a relative indicator of regularity (compared with the average level of totally random sequences and with the most ordered sequence), so its relevance might diminish across sequences of disparate lengths. The ladderpath-complexity κ, instead, takes into account the minimum number of steps required to generate the sequence, as characterized by λ, thereby including the length effect. The Ladderpath theory holds that the "complexity" of a sequence should incorporate two aspects: the difficulty of generating the target, and the hierarchical and interlaced relationships within its internal sequential structure [34]. The definition of κ integrates these two aspects, hence its name: ladderpath-complexity.
For a given length (namely, size-index S), the maximum value of the ladderpath-complexity κ can be anticipated (see SI section 2 for the mathematical properties of κ). That is, when $\omega = (S + \omega_0)/2$ and $\lambda = (S - \omega_0)/2$, the ladderpath-complexity κ reaches its maximum value $(S - \omega_0)^2 / [4(\omega_{\max} - \omega_0)]$. In the special case where $\omega_0 = 0$, κ reaches its maximum when ω = λ = S/2 (note that $\omega_0$ appears in the general case because of the baseline effect mentioned above). This indicates that κ can be large only when both ω and λ are large (if only one of ω or λ is large, κ cannot reach its maximum). This is consistent with the notion that complexity incorporates two aspects.
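The stated maximum can be checked numerically with a small sketch. It assumes the forms η = (ω − ω_0)/(ω_max − ω_0) and κ = λ·η with λ = S − ω, which are the forms consistent with the maximum quoted above; the function names and the values S = 100, ω_0 = 10, ω_max = 90 are purely hypothetical and are not taken from the SI.

```python
def order_rate(omega: float, omega_0: float, omega_max: float) -> float:
    """Order-rate eta, assuming eta = (omega - omega_0) / (omega_max - omega_0)."""
    return (omega - omega_0) / (omega_max - omega_0)


def ladderpath_complexity(omega: float, size: float,
                          omega_0: float, omega_max: float) -> float:
    """Ladderpath-complexity kappa, assuming kappa = lambda * eta."""
    lam = size - omega  # ladderpath-index, from omega := S - lambda
    return lam * order_rate(omega, omega_0, omega_max)


# Hypothetical baseline values chosen only for illustration.
S, OMEGA_0, OMEGA_MAX = 100, 10, 90

# Scan all integer order-index values between the baseline and the maximum.
best_omega = max(range(OMEGA_0, OMEGA_MAX + 1),
                 key=lambda w: ladderpath_complexity(w, S, OMEGA_0, OMEGA_MAX))
best_kappa = ladderpath_complexity(best_omega, S, OMEGA_0, OMEGA_MAX)

# Closed-form prediction: omega = (S + omega_0) / 2 and
# kappa_max = (S - omega_0)^2 / [4 (omega_max - omega_0)].
predicted_omega = (S + OMEGA_0) / 2
predicted_kappa = (S - OMEGA_0) ** 2 / (4 * (OMEGA_MAX - OMEGA_0))

if __name__ == "__main__":
    print(best_omega, best_kappa, predicted_omega, predicted_kappa)
```

Under these assumptions κ is a downward parabola in ω, so the scan and the closed form agree; sequences at either end (ω near ω_0 or λ near 0) have κ near zero.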
Examples and comparative analysis. Next, we take a few protein sequences as examples (with diverse η and κ values) to illustrate more clearly and intuitively what η and κ characterize (Tab. 1 and Fig. 2). We can observe that: (1) PO5F1_MOUSE has an order-rate η close to 0, meaning that the characteristic features of its internal structure are indistinguishable from those of random sequences (Fig. 2a shows that it has few hierarchical structures). (2) As the order-rate η increases, the sequence starts to exhibit richer hierarchical and interlaced structures, with diverse and overlapping ladderons (Fig. 2b), while, as η approaches 1, the hierarchy becomes more like a simple layer-by-layer structure (Fig. 2c). (3) Although PO5F1_MOUSE and SDK2_MOUSE have similarly small order-rates η, the latter has a much higher ladderpath-complexity κ, simply because it is much longer. Meanwhile, although SRY_MOUSE is much shorter than SDK2_MOUSE, its ladderpath-complexity κ is even slightly higher because of its greater order-rate η (Fig. 2b shows its much richer hierarchical and interlaced structures). This indicates that length affects complexity but is not the sole determinant.
Fig. 2. Laddergraphs of the four example protein sequences presented in Table 1. Unlike in Fig. 1, space constraints prevent the explicit display of ladderons in this figure. Instead, ellipses are used to symbolize ladderons, with the size of each ellipse corresponding to the length of the ladderon. Subfigures (a), (b), and (c) are scaled identically, as evidenced by the corresponding size of the largest ellipse that represents the target sequence in each. In subfigure (d), due to the excessive length of the protein SDK2_MOUSE, only a zoomed-out version of its laddergraph is displayed. A detailed version can be found in SI section 3.
Now, we will compare the indicators proposed in this study with another commonly used method. As mentioned, a commonly used tool to describe sequential features is the Shannon entropy, which is, however, based on the statistical notion of the frequency and the uncertainty of single letters, rather than the internal structure of a sequence. Nevertheless, the k-mer method has been employed to extend the notion from single letters to substrings of a certain length. Chen et al. introduced a normalized indicator named Informational Complexity (C) to characterize the relative uncertainty of substrings [5]. $C_k$ is calculated based on a sliding window of a fixed length k, and thus the internal sequential structure is taken into account, at least within the range of k. In fact, $C_1$ is the Shannon entropy of the sequence (because a 1-mer is just a single letter), normalized to the maximum Shannon entropy for the same length. To draw a linguistic analogy, the Shannon entropy functions at the alphabet level, while the k-mer version $C_k$ constructs a dictionary comprising words of a certain length k, quantifying the Shannon information conveyed by these fixed-length words. Consequently, the quantity $(1 - C_k)$, denoted as $R_k$, represents the degree of regularity, partially aligning with what the order-rate η describes.
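A rough sketch of a k-mer regularity score in the spirit of $R_k = 1 - C_k$ is given below. The exact normalization used by Chen et al. [5] is not reproduced in this text, so the choice made here (dividing the k-mer entropy by the largest entropy attainable with the available number of windows) is an assumption, as are the function name `regularity` and the toy sequences.

```python
import math
from collections import Counter


def regularity(seq: str, k: int, alphabet_size: int = 20) -> float:
    """Return 1 - C_k, with C_k a normalized Shannon entropy of overlapping k-mers.

    Assumed normalization: the entropy is divided by log2 of the largest
    number of distinct k-mers the sequence could contain, i.e.
    min(alphabet_size ** k, number of sliding windows).
    """
    n_windows = len(seq) - k + 1
    if k < 1 or n_windows < 2:
        raise ValueError("need k >= 1 and at least two sliding windows")
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    entropy = -sum((c / n_windows) * math.log2(c / n_windows)
                   for c in counts.values())
    max_entropy = math.log2(min(alphabet_size ** k, n_windows))
    return 1.0 - entropy / max_entropy


if __name__ == "__main__":
    unit = "ACDEFGHIKL"      # hypothetical 10-residue unit with no repeated 3-mer
    tandem = unit * 3        # the same unit repeated in tandem
    for k in (1, 2, 3, 4):
        print(k, round(regularity(unit, k), 3), round(regularity(tandem, k), 3))
```

With this normalization a sequence whose k-mers are all distinct scores 0 and a homopolymer scores 1; how the score changes with k depends on the repeat structure, which is the behaviour examined in Fig. 3.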
Then, we systematically compare the order-rate η with $R_k$ (Fig. 3). We observe a correlation between η and $R_1$, and the correlation increases as k increases to 2 and 3; for k > 3, the correlation begins to drop sharply (Fig. 3a). The correlation exists when k = 1, 2, 3 because both indicators, η and $R_k$, correctly describe certain aspects of the sequence's regularity. Note that the order-rate η quantitatively describes the hierarchical and interlaced relationships among the substructures of a sequence. Therefore, it has a higher correlation with $R_3$ and $R_2$, while the correlation with $R_1$ is lower. This is because 3-mers and 2-mers take substructures into account, while $R_1$ merely focuses on single letters, neglecting the internal structure. Further, the correlation decreases for k > 3 because the set of all possible k-mers expands exponentially with k, and thus the Shannon information contained in k-mers becomes submerged in the whole set, resulting in $R_k$ becoming less and less informative. Another observation is that while a general correlation exists, different proteins exhibit varying tendencies as k increases. For instance, the proteins represented by the red points in Fig. 3b, which have large η values, tend to retain their position along the x-axis as k increases from 2 to 6; in contrast, proteins represented by the blue points descend rapidly along the x-axis. This suggests that these different protein sequences have distinct internal structures. To further probe the influence of these internal structures, we chose several representative proteins to analyze how $R_k$ changes as k increases up to 50. Figure 3c illustrates this, where red curves correspond to the proteins represented by the red points in Fig. 3b, and similar associations are made for the green and yellow curves (see SI section 4 for the ladderpath-associated indicators of these representative proteins). We observe that: (1) The red proteins are actually those that have large repetitive segments but lack rich hierarchical and interlaced relationships (e.g., Ubiquitins), and thus have a relatively high η but low κ. For them, we observe that $R_k$ remains virtually unchanged as k increases. (2) The green proteins have very "chaotic" sequences (i.e., almost no repetitive subsequences), resulting in a low η. For these proteins, $R_k$ approaches zero for k > 3. (3) The yellow points fall between these two categories, exhibiting a distinct feature: they decrease slowly with k, hinting at intriguing internal structures.
To summarize, for proteins with distinct internal structures (such as the three exemplified categories), the characterizing capability of different $R_k$ varies. As a species likely contains at least these three categories of proteins, it remains largely arbitrary to determine which k should be used to characterize the sequential features of the species as a whole. Our approach, instead, effectively characterizes the internal structure and provides a global indicator without predefining a characterizing range. Intuitively, the Ladderpath-associated indicators liberate "confined length words" (k-mers) to "variable length words" (the so-called ladderons, as defined in ref. [34]), adeptly capturing the hierarchical and interlaced structures within sequences.

Statistical observation: Human protein sequences have higher ladderpath-complexity κ
Here, we present the density distribution of ladderpath-complexity κ for sequences with lengths below 2500 AA across six typical species (Fig. 4a). The statistical differences between distributions reflect species-specific features. We observe that the distribution for human is the flattest, i.e., having the highest proportion of proteins with large κ. In contrast, the distribution for E. coli appears to be more concentrated, i.e., having the highest proportion of proteins with small κ. To put it another way (see Tab. 2), in terms of the density of proteins with large κ (e.g., κ = 80, 60, 40), human and mouse rank at the top, forming the first group, followed by the second group (yeast, mouse-ear cress, and C. elegans), and finally, E. coli. However, for proteins with small κ (e.g., κ = 5, 10), the first group consists of E. coli, C. elegans and mouse-ear cress, followed by yeast, and finally the third group of mouse and human. This observation aligns with previous findings suggesting that more complex species (such as human and mouse) tend to have longer proteins, more segmental repetitions, and more types of cells [39,40], and that E. coli has fewer internal duplications [2] (and hence lower ladderpath-complexity). Considering that $\omega_0$ comes from the average ω of numerous random sequences with homogeneous amino acid content, could the large difference primarily result from the species-specific and inhomogeneous content rather than the internal sequence structure? To test this speculation, we randomly shuffled all sequences (aiming to preserve the amino acid composition but disrupt the internal structures), then recalculated their ladderpath-complexity, and compared the changes before and after, denoted as ∆κ (Fig. 4b). The results indicate that human sequences still have the most significant reduction (the order remains the same), suggesting that human sequences possess the richest hierarchical and interlaced structures overall, with E. coli having the least. So, the statistical differences in ladderpath-complexity arise from the internal sequence structure rather than the inhomogeneous content.
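The shuffling control described above can be sketched as follows. This is a minimal illustration under stated assumptions: `shuffle_delta` accepts any sequence metric as a callable, averaging over several shuffles is a choice made here rather than a detail reported in the text, and `repeated_3mers` is a crude stand-in used only so the example runs; it is not the ladderpath-complexity κ, which requires the algorithm described in the Methods.

```python
import random
from typing import Callable


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Randomly permute the letters: composition is kept, order is destroyed."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def shuffle_delta(seq: str, metric: Callable[[str], float],
                  n_shuffles: int = 100, seed: int = 0) -> float:
    """Metric of the original sequence minus its mean over shuffled copies."""
    rng = random.Random(seed)
    shuffled_mean = sum(metric(shuffle_sequence(seq, rng))
                        for _ in range(n_shuffles)) / n_shuffles
    return metric(seq) - shuffled_mean


def repeated_3mers(seq: str) -> float:
    """Stand-in metric (NOT kappa): number of 3-mer windows that repeat an earlier one."""
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return float(len(windows) - len(set(windows)))


if __name__ == "__main__":
    tandem = "ACDEFGHIKL" * 5  # hypothetical sequence with strong internal repeats
    print(shuffle_delta(tandem, repeated_3mers))
```

A large positive difference indicates that the metric reflects the arrangement of residues rather than their composition, which is the logic behind ∆κ.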
With these results, we provide a quantitative description, at the protein level, of the tendency for the protein sequences of more complex species to possess richer hierarchical and interlaced structures.
Showcase: Top list of large-κ proteins. Let us now examine the list of proteins with the highest κ values, considering only those sequences with lengths below 2500 AA (Tab. 3). Interestingly, despite this length limitation, our κ-selection results show a similarity to the findings of the repeat finder: human proteins dominate [41]. Adjusting this length limit to 2000, 1500, or even 1000 AA does not change this observation (see SI section 5 for more data).
Another notable observation is that the lengths range from 1457 to 2496 AA, indicating that length is not the determining factor for κ; instead, repetition in the sequence plays a significant role. For example, DMBT1, flocculin and mucin in Tab. 3 are protein classes that are well known for tandem repeats [42][43][44].

Statistical observation: Proteins containing intrinsically disordered regions (IDRs) have higher order-rate η
Now, let us consider the relationship between the amino acid sequence and its corresponding 3D structure. Intuitively, duplicated sequences could be expected to adopt identical structures. Therefore, a long sequence with many duplicated subsequences (thereby tending to have a higher order-rate η) may be expected to have a consistent structure comprising explicit identical substructures [45]. For instance, the protein depicted in Fig. 5a exhibits a consistent and regular structure [46]; another notable example is the much larger protein DMBT1_HUMAN, shown in Fig. 5b, which also has a high η value, as shown in Tab. 3. Nevertheless, there are proteins with high η values that are structurally disordered, such as the example depicted in Fig. 5c, which exhibits regions predicted by AlphaFold2 with low confidence, implying structural disorder. To uncover statistical patterns, we utilized data from the DisProt database [47] to calculate the average order-rate η for proteins with a significant proportion of intrinsically disordered regions (IDRs), and compared it with that of proteins without a significant proportion of IDRs. The results are shown in Fig. 5d (the right part with darker colors). It is evident that, generally, proteins with IDRs have higher η values than those without such regions, a difference that is statistically significant for four out of the six species analyzed. Yet, due to the limited data available in DisProt, we also employed the Metapredict software [48] to predict the presence of IDRs in all proteins of the proteomes of these species, and then matched them with their respective η values. The outcomes of this analysis are presented in Fig. 5e. The pattern remains consistent, with clear statistical significance observed for five out of these six species. These findings are consistent with previous studies showing that tandem repeats, especially perfect ones, tend to be structurally disordered [49].
Nevertheless, previous studies show that proteins containing IDRs have a greater amino acid abundance bias [50,51]. It is thus possible that the high η value arises from this bias rather than from the orderliness of the internal sequential structure. To investigate this, we compared the η value before and after shuffling, denoted as ∆η, as shown in Fig. 5d (the left part with lighter colors). We observed that, statistically speaking, ∆η is larger for proteins containing IDRs. From this observation, we can suggest that the internal sequential structure plays a role in the high η values. Therefore, the degree of orderliness may serve as a new feature of disordered regions at the sequence level.

Evolution of the complexity of protein sequences follows a zigzag pattern
We now present statistics that encompass all sequences (Fig. 6), not just those shorter than 2500 AA. Generally, most of these sequences have η values confined below 0.1 (Fig. 6a). However, a closer examination of the η distribution (Fig. 6c) reveals that the proteins of human, mouse and C. elegans exhibit a significant tendency: as the length of the sequence increases, there tends to be a higher number of sequences with larger η, indicating more ordered sequences. This trend is also observable in Fig. 6a, where, for extremely long lengths, sequences with low η values are even absent.
An immediate question is why there are almost no super long but low-η sequences. Later we shall see that this question strongly relates to how protein sequences elongate. Now, imagine there is initially a short sequence or segment, and consider how this sequence elongates and how the order-rate η evolves, via specific biological processes:
• Duplication: It refers to the process where a segment of a sequence, either short or long, is copied onto itself. This creates a repetitive subsequence, corresponding to a ladderon as defined in Ladderpath theory. As a result, η of this sequence increases. The longer the segment, the greater the increase in η.
• Substitution: It refers to the replacement of a base. This does not alter the sequence's length, but it may disrupt a ladderon, thereby slightly decreasing the value of η.
• Insertion: It could be thought of as either the addition of a foreign segment or a single amino acid, or as a duplication of a segment immediately followed by substitutions occurring at every base.
Note that for simplicity, we only consider the processes that do not shorten the sequence, thus neglecting deletion. We now simulate the process of elongation in three cases: (1) completely driven by duplication, (2) completely driven by insertion, or (3) driven by a combination of duplication and substitution. The simulation results are displayed in Fig. 6d (see SI section 6 for details on how the simulation was conducted). Although the simulation focuses solely on the elongation of protein sequences, it provides insight into the question of why there are virtually no extremely long sequences with low η values. The red trajectories in Fig. 6d represent case (1), where η increases the most rapidly during elongation. The green trajectory represents case (2), where the order-rate η remains consistently low. The yellow trajectories, representing case (3), lie in between and closely resemble real-world scenarios where infrequent duplications of relatively large segments heavily increase η, while frequent substitutions consistently reduce η, forming a zigzag pattern. We also see from Fig. 6e that κ increases the most in case (3), namely, the yellow trajectories.
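The three elongation cases can be sketched as a toy simulation. The actual protocol is given in SI section 6 and is not reproduced here, so everything below is an assumption made for illustration: the segment-length cap `max_len`, the number of substitutions per duplication `subs_per_dup`, the uniform choice of positions and residues, and the helper `distinct_kmers`, which is only a crude proxy for repetitiveness and not the order-rate η.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def duplicate(seq: str, rng: random.Random, max_len: int) -> str:
    """Copy a random segment of seq and paste it at a random position."""
    length = rng.randint(1, min(max_len, len(seq)))
    start = rng.randrange(len(seq) - length + 1)
    pos = rng.randrange(len(seq) + 1)
    return seq[:pos] + seq[start:start + length] + seq[pos:]


def substitute(seq: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a random residue."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]


def insert(seq: str, rng: random.Random, max_len: int) -> str:
    """Insert a random (foreign) segment at a random position."""
    length = rng.randint(1, max_len)
    segment = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    pos = rng.randrange(len(seq) + 1)
    return seq[:pos] + segment + seq[pos:]


def elongate(seed_seq: str, target_len: int, mode: str, rng: random.Random,
             max_len: int = 30, subs_per_dup: int = 5) -> str:
    """Grow seed_seq to at least target_len under one of the three cases."""
    seq = seed_seq
    while len(seq) < target_len:
        if mode == "duplication":        # case (1)
            seq = duplicate(seq, rng, max_len)
        elif mode == "insertion":        # case (2)
            seq = insert(seq, rng, max_len)
        elif mode == "mixed":            # case (3): duplication, then substitutions
            seq = duplicate(seq, rng, max_len)
            for _ in range(subs_per_dup):
                seq = substitute(seq, rng)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return seq


def distinct_kmers(seq: str, k: int = 3) -> int:
    """Crude repetitiveness proxy (NOT eta): number of distinct k-mers."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})


if __name__ == "__main__":
    seed_seq = "MKTAYIAKQRQISFVKSHFS"  # hypothetical 20-residue starting segment
    for mode in ("duplication", "insertion", "mixed"):
        grown = elongate(seed_seq, 400, mode, random.Random(1))
        print(mode, len(grown), distinct_kmers(grown))
```

Feeding intermediate sequences of each trajectory to a ladderpath calculator, rather than to the proxy used here, would give η and κ trajectories of the kind described for Fig. 6d and 6e.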
In summary, the evolution of protein sequences follows a zigzag pattern. Specifically, the duplication of segments increases the order-rate of the sequence as it elongates, while this increment in order-rate is gradually counteracted by various mutations, either partially or completely, depending on the relative frequencies of duplications and mutations. Now, we could consider the emergence of a new gene or pseudogene: (1) Occasionally, a replication error leads to the duplication of a segment at a different location within the sequence, resulting in higher η and κ values and contributing richer raw materials for further evolution. (2) Subsequently, this elongated sequence undergoes various "tinkering" processes across generations, reducing η and κ. Over time, this sequence gradually diverges from its ancestor and may eventually become a new gene or a pseudogene.
Examples: Ubiquitin, Titin and NBPF family. Now we can return to the observation mentioned at the beginning of Section 2.4 and ask why there are almost no extremely long but low-η proteins. Here we provide three representative examples to address this question.
The first example is Ubiquitin, which is used to emphasize the effect of duplication. Ubiquitin is a highly conserved, small regulatory protein widely found in eukaryotes, which functions as a post-translational modifier, mainly in protein degradation. Polyubiquitin (UBB and UBC) has an extremely high η value because it contains almost no mutations and has several tandem head-to-tail repeats of ubiquitin, each being 76 AA long [52] (see Fig. 2c for the laddergraph of UBC_HUMAN). The distribution of η and κ values for this protein family is shown in Fig. 6a and 6b. We can see that while some members of this family have an extremely high η value approaching 1, their corresponding ladderpath-complexity, κ, is not particularly high. This observation can be attributed to the nearly error-free duplication events, aligning with case (1) discussed earlier, and to the fact that these proteins are not particularly long.
The second example is another extreme, the ancient protein Titin, which is used to emphasize the effects of mutations during protein elongation. Titin serves as a structural support in muscles and is of immense length (e.g., TITIN_HUMAN contains 364 exons) [53]. This gigantic protein consists of numerous domains, some of which belong to the PEVK region, which is rich in highly repetitive sequences (this PEVK region forms a distinct structure in the center of the protein, functioning as an entropic spring [54]). Nevertheless, the η values of Titin are not very high and vary among species [55,56], as shown in Tab. 4. This suggests that the effects of duplications, which can increase the hierarchical and nested structures of sequences, have been largely counteracted by long-term and consistent mutations.
The third example is an emerging family at the evolutionary scale, the Neuroblastoma BreakPoint Family (NBPF), which lies in between the two extremes mentioned above. NBPF is known for its members having varying numbers of Olduvai repeats, with approximately twenty members in humans, playing a certain role in human brain development and cognition [57,58]. These young proteins seem to be predominantly found in the proteomes of primates, whereas in non-primate mammals, their counterparts exist as single-copy Olduvai. As an amplicon, Olduvai has undergone significant gene amplification within a relatively short time span [57,59]. Thus, η and κ increased significantly, and mutations have not had enough time to substantially lower η and κ and counteract the effect of duplication (Tab. 5). Therefore, from Fig. 6a, we can observe that the NBPF family members form a clear pattern, exhibiting their evolutionary trajectory. The aforementioned classes of proteins illustrate the elongation seen in ancient proteins and the emergent core duplication found in longer proteins, which can be metaphorically described as an Odyssey-like journey. These examples suggest that, over a long duration, there is a certain degree of synchronization between size expansion and increased complexity, while between expansion events, complexity tends to decrease. Thus, the evolution of sequence complexity appears to follow a zigzag pattern. Most long proteins do not exhibit the same level of extremity as Ubiquitin and Titin, but instead fall somewhere in between, e.g., NBPF. For more examples of such proteins, refer to SI Section 7.

On definitions
The newly developed Ladderpath theory aims to decode the information concealed within the hierarchical and interlaced relationships among the recurring subsequences found in a specified set of target sequences.It achieves this by iteratively identifying recurring subsequences (termed the ladderons) and rearranging them into a treelike hierarchical structure (termed the laddergraph), which distills and encodes the evolutionary information.In the context of biological sequences, recurring subsequences, or ladderons, could represent motifs, domains, or signify transposable elements, satellite DNA, microduplications within genome scale, and the like.To better encapsulate the tree-like hierarchical structure, two indices were derived.The first is the order-index η, which, in a normalized manner, quantitatively measures the orderliness of a sequence, ranging from close to 0 (completely disordered, as illustrated in Fig. 2a) to 1 (fully ordered, as illustrated in Fig. 2c).When η sits centrally, the structure exhibits significant order while the ladderons display intricate overlaps and nested relationships (as illustrated in Fig. 2b).At this point the other derived index, κ, reaches its maximum, signifying the utmost complexity.The ladderpath-complexity κ gauges complexity by factoring in both orderliness and the length.While sequence length does contribute to complexity, longer does not necessarily equate to more complex.
Ladderpath differs from Shannon entropy in that the latter primarily focuses on the statistics of individual letters, although extensions such as the k-mer method [5,6] can be adapted to consider substructures. These approaches do not factor in the intricate hierarchical relationships among these substructures. Thus, our order-rate η correlates with R_k, an index derived from the k-mer method, but this correlation varies with the internal sequential patterns. Further, Shannon entropy and its variants operate under the strong assumption that the sequence in question is a realization of a random variable, implying that the sequence should be infinitely long. In reality, however, amino acid sequences invariably have finite lengths. Finite lengths also mean that methods such as Lempel-Ziv lossless compression (which aim to describe a form of absolute information) cannot achieve their optimal, shortest description [60]. This introduces significant variability when trying to deduce genuine evolutionary histories. In contrast, Ladderpath does not rely on assumptions of infinite length.
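The limitation of letter-level statistics can be illustrated with a minimal sketch. The function `kmer_entropy` and the two toy sequences below are hypothetical and serve illustration only; the quantity computed is the plain Shannon entropy of overlapping k-mer frequencies, not the index R_k used in this study.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k=1):
    """Shannon entropy (in bits) of the overlapping k-mer frequency distribution."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Two toy sequences with identical letter composition (four of each letter).
ordered = "ABCD" * 4              # strongly repetitive
scrambled = "ABCDDCBABDACCADB"    # same letters, no obvious repeats

for k in (1, 2):
    print(k, kmer_entropy(ordered, k), kmer_entropy(scrambled, k))
```

At k = 1 both sequences have exactly 2 bits of entropy, so single-letter statistics cannot tell them apart. At k = 2 the repetitive sequence has the lower entropy, but the value still says nothing about how the repeated pieces nest within one another.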

Statistical observations on sequential orderliness and complexity
The first statistical observation, based on our examination of the ladderpath-complexity of proteomes, reveals differing complexity distributions among species. Among the six species analyzed, humans, followed by mice, possess relatively more sequences of high complexity that exhibit richer hierarchical and interlaced structures. We also confirmed using shuffling methods that this complexity does not stem from content differences (e.g., the so-called C-value paradox or enigma [61]) but arises from internal sequential patterns. From the perspective of protein structure, studies have shown that species with higher complexity possess more proteins with larger radii of gyration (signifying increased flexibility) and a higher degree of modularity [62]. Our analysis from the sequential perspective, in turn, implies that the more complex a species is, the stronger its tendency toward high sequence complexity. (It is worth noting that the definition of species complexity remains debated; in practice, biologists often gauge it with varied metrics such as the total number of cell types, genome size, or proteome size, and these metrics are often interrelated [62,63].) Collectively, these results hint at positive correlations between amino acid sequence complexity, protein structural modularity, and overall species complexity.
Another statistical observation is that proteins with a significant proportion of intrinsically disordered regions (IDRs) tend to exhibit a higher order-rate η, with statistical significance. It is crucial to note that these elevated η values usually do not exceed 0.1 (as shown in Fig. 5e). Within this range, a higher η invariably indicates richer hierarchical and interlaced structures, akin to a transition from the proteins exemplified in Fig. 2a to those in Fig. 2b (a transition to Fig. 2c is not possible, since such an extreme hierarchical relationship requires an η of approximately 0.8 or higher). Thus, our findings suggest that, at the sequence level, proteins with IDRs tend to have richer hierarchical and interlaced structures than typical proteins. This correlation between sequence orderliness and structural uncertainty is somewhat unexpected but intriguing, and merits further investigation. Moreover, given that a higher η often originates from segment duplication, another intriguing question arises: does the evolution of intrinsically disordered proteins (IDPs) involve more duplication events? Lastly, building upon the earlier point that more complex species have more proteins with higher modularity, a bold idea might be developed: could IDPs be an essential stage in the evolutionary journey towards increasing protein modularity? Specifically, an amino acid sequence "core", through occasional duplication events, generates repetitive subsequences during elongation (which naturally increases η); the structures thereby become more disordered and flexible, which facilitates the exploration of various interactions and ultimately leads to the fixation of structural modules.

On evolution
Our results suggest that as a protein elongates, its complexity follows a zigzag pattern that originates from the interplay of duplication and mutation (the latter refers to processes such as substitution and insertion). Duplication results in a sharp increase in sequence orderliness and length, while mutation leads to a decline in orderliness with the length remaining more or less unchanged; together they produce a significant diversity in the internal patterns of sequences. Owing to the interplay of these mechanisms and their varying occurrence rates, the internal structure of a sequence can become highly hierarchical and interlaced. This might result in proteins having distinct values of κ, η and S (e.g., leading to different distributions between long and short proteins), potentially promoting a range of structures and functions. Statistically, we did observe that the η distributions diverge when protein length exceeds 2000 AA (Fig. 6c). This hints that different species, or species in different environments, might adopt different elongation strategies or, in other words, different "tinkering" processes. For instance, the trend of evolving into multi-domain proteins is more pronounced in eukaryotes than in prokaryotes [64], suggesting that distinct biological elongation dynamics might be at play. The evolution of human-specific segmental duplications (HSDs) seems to exhibit varied patterns across different periods. During the human-chimpanzee divergence, there was a period of relative quiescence, succeeded by a spike in HSD occurrences and the emergence of new genes [65][66][67]. The previously mentioned NBPF experienced rapid, widespread duplications. The Olduvai domain, in particular, stands out as one of the most extreme and fastest copy number expansions in the human genome (humans have about 300 copies, great apes 90-120, monkeys 30-40, non-primate mammals a single copy or a few copies, and non-mammals none), and it has been strongly linked to human brain evolution and cognitive function [68]. Variations in elongation mechanisms, especially under diverse or rapidly changing environmental stresses, might be advantageous for quick adaptation [69], potentially accelerating the emergence of new structures or inducing dose-dependent effects [70,71], among other outcomes.
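The two operators behind the zigzag pattern can be sketched as follows. This is a toy illustration under stated assumptions, not the simulation used in this study: the function names, the tandem placement of the copied segment, the per-residue substitution rate, and the number of mutation rounds are all hypothetical, and mutation is modeled as substitution only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def duplicate_segment(seq, rng, max_len=20):
    """Copy a random segment and insert the copy right after the original.

    Tandem insertion is an assumption made here for simplicity.
    """
    seg_len = rng.randint(1, min(max_len, len(seq)))
    start = rng.randint(0, len(seq) - seg_len)
    end = start + seg_len
    return seq[:end] + seq[start:end] + seq[end:]


def mutate(seq, rng, rate=0.01):
    """Substitute each residue independently with probability `rate`.

    Only substitutions are modeled, so the length never changes.
    """
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def elongate(core, rounds, rng, mutation_steps=5):
    """Alternate one duplication with several rounds of mutation."""
    seq = core
    lengths = [len(seq)]
    for _ in range(rounds):
        seq = duplicate_segment(seq, rng)   # length and orderliness jump
        for _ in range(mutation_steps):
            seq = mutate(seq, rng)          # repeats are gradually eroded
        lengths.append(len(seq))
    return seq, lengths


rng = random.Random(0)
final_seq, lengths = elongate("MKTAYIAKQR", rounds=10, rng=rng)
print(lengths)
```

In this sketch the length grows only at duplication events and stays fixed during mutation; tracking η or κ along such a trajectory would additionally require running the Ladderpath algorithm on each intermediate sequence.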
Fig. 6c allows a more detailed analysis and offers further insights. For E. coli, the curve breaks off at around 2000 AA. Beyond this length, yeast and mouse-ear cress (both species with cell walls) show no increase in order-rate. Meanwhile, for the mouse and C. elegans, both multicellular species without cell walls, there is an evident rise in order-rate, with their trends aligning closely. At lengths greater than 3000 AA, the order-rate of the human proteome shows a sudden and significant surge. Based on these observations, we make the following speculations: (1) The juncture at which the E. coli length halts could be a pivotal point in the shift from prokaryotes to eukaryotes. Eukaryotes might have developed additional tools for sequence expansion and tools that augment the hierarchical and interlaced structure of sequences, especially those facilitating intragenomic duplication. This transition could be a landmark event differentiating eukaryotes from prokaryotes. (2) The patterns observed in yeast and mouse-ear cress indicate that being unicellular or multicellular may not be a key factor affecting proteome orderliness, whereas having a cell wall might obstruct an increase in order-rate. We speculate that the cell wall might hinder horizontal gene transfer between species, preventing elements with capabilities such as translocation and duplication from integrating and emerging as evolutionary tools. (3) The spike in order-rate seen in humans, relative to other eukaryotic species, raises the question: have humans undergone certain critical events or acquired novel genetic tools? If a stark contrast remains when comparing humans at this point with non-human primates (taking the example of NBPF, as previously discussed), it might explain the profound impact of social development within human evolution. Collectively, these findings call for a deeper exploration of evolutionary data using this approach or similar methodologies. They also underscore that the Ladderpath theory could harbor significant potential for more in-depth applications in evolutionary biology.
In addition, the simulated evolutionary process obtained by alternating segmental duplication and mutation provides a better fit to actual evolutionary data than mutations alone. This phenomenon poses a significant challenge to the neutral theory [72] and constructive neutral evolution [73]. At the very least, it suggests that from the time of Darwin to current theories in evolutionary biology [74], the role of mutations has been overemphasized and the effects of gene duplication and transfer have been neglected. As inferred from the sudden shifts shown above, gene duplication and transfer are likely the main ingredients of significant evolutionary leaps. This resonates with the endosymbiotic theory [75] and horizontal gene transfer [76], both of which have been applied to explain genome expansion. Such observations hint at two important applications of the Ladderpath theory: (1) Identifying critical shifts and branching points in the entire evolutionary tree, examining whether new gene modules have been added, and pinpointing which of these modules have undergone extensive duplication and transfer in subsequent evolutionary bursts. The Ladderpath theory may address the problem of phylogenetic lineages that are obscured by chimeric, symbiotic, or reticulate evolutionary events, and may thereby provide crucial insights into phenomena like the Cambrian explosion [77]. (2) In fields such as synthetic biology and enzyme engineering [78], as well as pharmaceutical engineering [79], the practice of directed evolution is mainly based on point mutations and mutation libraries. Strategies involving extensive gene segmental duplication have seen limited application. Introducing the Ladderpath theory into these fields, and adopting the "alternating segmental duplication and mutation" strategy simulated in this study, may significantly enhance the rate and success of directed evolution. Furthermore, while gene duplication has found many applications in plant and animal breeding, issues like the adaptability of inserted duplicate fragments and their loss in subsequent generations have consistently hampered breeding success rates [80]. Using the Ladderpath theory to determine the optimal ratio and strategy for duplication and mutation might offer improved tools for targeted breeding [81] and related biotechnological endeavors.
The Ladderpath theory provides a theoretical framework and specific computational methods to quantitatively describe the complexity of target objects, such as sequences. It focuses on "how objects are generated" rather than on uncertainty, as Shannon entropy does, or on the efficiency of compression, as lossless compression algorithms like Lempel-Ziv do. The Ladderpath theory embodies the evolutionary tinkering process, highlighting the importance of "reuse" and "modularity". While this paper demonstrates the usefulness of derived indicators such as order-rate and ladderpath-complexity, it is even more important to note that the comprehensive information is stored in the laddergraph, which depicts the hierarchical and interlaced relationships among recurring subsequences that result from the evolutionary tinkering process. In practice, we can learn from the tinkering mechanisms of innovation that nature employs (along with sophisticated and powerful reductionist-like innovation) to help us construct complex targets or systems from simpler ones, e.g., in peptide drug design (to be discussed in an upcoming paper) and synthetic biology. Using the Ladderpath theory as a tool to reverse-engineer species evolution might also offer valuable insights and facilitate the design of more effective directed evolution strategies, which could then be applied to fields such as crop breeding and even the design of bioprocesses.

Algorithm for computing ladderpath-associated information
Here we show how the algorithm works by taking the target sequence CUCGACGACUAUCUCGACAAUGACU as an example (Fig. 7a). Firstly, we search for the longest repetitive subsequence in the target sequence and find CUCGAC, marked in blue. Secondly, we cut the target sequence into pieces so that the repetitive subsequences are isolated. As a result, we obtain a set of shorter sequences: [CUCGAC, GACUAU, CUCGAC, AAUGACU]. In the third step, we place one CUCGAC into a separate bag, which will then be used to construct the ladderpath. After this step, we have a set of sequences [GACUAU, CUCGAC, AAUGACU] remaining. These three steps constitute the module which we call "SEARCH, CUT, and REMOVE", marked in green in Fig. 7a.
Next, we treat the remaining set of sequences [GACUAU, CUCGAC, AAUGACU] as a "target" sequence and apply the "SEARCH, CUT, and REMOVE" module to this target. From this, we obtain another longest repetitive subsequence, GACU, which we place into the separate bag. We continue to apply the module until the original target sequence is completely segmented into its most basic building blocks. Finally, based on the order of removal, we construct the ladderpath as { G, A(3), C(3), U(3) // AU, GAC // CUCGAC, GACU }. It characterizes the hierarchical and interlaced relationships within the original target sequence, and has a one-to-one correspondence with the laddergraph shown in Fig. 7b.
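The "SEARCH, CUT, and REMOVE" loop can be sketched in a few lines. This is a brute-force illustration and not the implementation used in this study; the function names are hypothetical, ties between equally long repeats are broken by first occurrence, and when a repeat occurs more than twice the sketch assumes that exactly one copy is kept in the working set.

```python
from collections import Counter


def longest_repeat(pool):
    """SEARCH: longest substring (length >= 2) with at least two
    non-overlapping occurrences across the pool, or None."""
    max_len = max((len(s) for s in pool), default=0)
    for length in range(max_len, 1, -1):
        seen = set()
        for s in pool:
            for i in range(len(s) - length + 1):
                sub = s[i:i + length]
                if sub in seen:
                    continue
                seen.add(sub)
                # str.count counts non-overlapping occurrences
                if sum(t.count(sub) for t in pool) >= 2:
                    return sub
    return None


def cut(pool, sub):
    """CUT: split every sequence so that each occurrence of `sub` is isolated."""
    pieces = []
    for s in pool:
        for j, part in enumerate(s.split(sub)):
            if j > 0:
                pieces.append(sub)
            if part:
                pieces.append(part)
    return pieces


def ladderpath(target):
    """Return the ladderons in order of removal and the basic-block counts."""
    pool, ladderons = [target], []
    while (sub := longest_repeat(pool)) is not None:
        kept, new_pool = False, []
        for piece in cut(pool, sub):
            if piece == sub:
                if kept:          # REMOVE: drop every copy but one (assumption)
                    continue
                kept = True
            new_pool.append(piece)
        pool = new_pool
        ladderons.append(sub)
    return ladderons, Counter("".join(pool))


ladderons, basic_blocks = ladderpath("CUCGACGACUAUCUCGACAAUGACU")
print(ladderons)      # ['CUCGAC', 'GACU', 'GAC', 'AU']
print(basic_blocks)   # counts: C=3, U=3, A=3, G=1
```

For the example target, the ladderons and basic-block multiplicities reproduce the ladderpath { G, A(3), C(3), U(3) // AU, GAC // CUCGAC, GACU }. The exhaustive search is only practical for short sequences.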

Figure 1. Laddergraphs and a distribution for human proteins. (a) The laddergraph for the protein SPR2B_MOUSE, where the string at the bottom represents this target protein and the shorter strings above are ladderons. The most basic building blocks, namely individual amino acids, are omitted for better visualization. Its S = 98, λ = 50 and ω = 48. (b) The laddergraph for the protein ATX8_HUMAN, with S = 80, λ = 12 and ω = 68. (c) The laddergraph for the protein A0A075B674_MOUSE, with S = 98, λ = 94 and ω = 4. (d) The distribution of order-index ω vs. size-index S, for human proteins with lengths below 500 AA.

Figure 3. Systematic comparison between R_k and the order-rate η. (a) Spearman correlation between η and R_k, as k increases, for six distinct species. (b) Scatter plots of η vs. R_k for k = 1, 2, 3, 4, 5 and 6. Each row corresponds to a different species. Individual dots within the plots represent individual proteins. (c) Several representative proteins are chosen (denoted in red, green, and yellow) to show how R_k changes as k increases up to 50. Note that the red curves in this subfigure correspond to the red dots in (b), and similar associations are made for the green and yellow curves. Each row represents a species, corresponding to (b).

Figure 4. Overview of the ladderpath-complexity of protein sequences across six typical species. (a) Density distribution of protein sequences with lengths below 2500 AA, with respect to ladderpath-complexity κ. (b) The average κ and the change in κ after shuffling.

Table 2.

Figure 5. Statistics related to the order-rate, η, of proteins containing a significant proportion of IDRs. (a) Structure of an artificial protein, DeNovoTIM15, from the Protein Data Bank (PDB: 6wvs). (b) Structure of the protein DMBT1_HUMAN predicted by AlphaFold2. (c) Structure of HORN_MOUSE predicted by AlphaFold2. (d) The right part, with darker colors, shows the average η for proteins containing a significant proportion of IDRs compared to proteins without a significant proportion of IDRs. The left part shows the changes in η after shuffling the sequences. The corresponding data size, n, from the DisProt database is indicated. (e) Similar to subfigure (d), but the data are from calculations using the disorder predictor software Metapredict, for the six proteomes. Note that **** means p < 0.0001, *** means p < 0.001, ** means p < 0.01, and ns means "no significance".

Figure 6. Observations and simulation experiments related to protein elongation. (a) Scatter plot of protein length S vs. order-rate η, for all proteins across the six species. See the legend of subfigure (c) for the six species. (b) Similarly, a scatter plot of S vs. ladderpath-complexity κ. (c) The average η values of proteins vs. protein length S, for the six species. Each dot (S, average η) is calculated using proteins within a sliding window centered at a specific length S. (d) Results of simulation experiments showing how η evolves as the protein sequence elongates, for the three different cases elaborated in the main text. (e) Similarly, simulation experiments showing how κ evolves.

Figure 7. The algorithm for computing ladderpath-associated information. (a) Flowchart illustrating the algorithm with a specific example. (b) The laddergraph of the exemplified target sequence, calculated by this algorithm.

Table 3. Top 25 large-κ protein sequences with lengths below 2500 AA. Footnotes: 1, belongs to the DMBT1 family; 2, mucin protein; 3, belongs to the flocculin family.

Table 5. Ladderpath-associated indicators of the gene family NBPF.