Dynamics-Evolution Correspondence in Protein Structures

The genotype-phenotype mapping of proteins is a fundamental question in structural biology. In this Letter, with the analysis of a large dataset of proteins from hundreds of protein families, we quantitatively demonstrate the correlations between the noise-induced protein dynamics and mutation-induced variations of native structures, indicating the dynamics-evolution correspondence of proteins. Based on the investigations of the linear responses of native proteins, the origin of such a correspondence is elucidated. It is essential that the noise- and mutation-induced deformations of the proteins are restricted on a common low-dimensional subspace, as confirmed from the data. These results suggest an evolutionary mechanism of the proteins gaining both dynamical flexibility and evolutionary structural variability.

The genotype-phenotype mapping of proteins is a fundamental question in structural biology. In this Letter, with the analysis of a large dataset of proteins from hundreds of protein families, we quantitatively demonstrate the correlations between the noise-induced protein dynamics and mutation-induced variations of native structures, indicating the dynamics-evolution correspondence of proteins. Based on the investigations of the linear responses of native proteins, the origin of such a correspondence is elucidated. It is essential that the noise-and mutation-induced deformations of the proteins are restricted on a common low-dimensional subspace, as confirmed from the data. These results suggest an evolutionary mechanism of the proteins gaining both dynamical flexibility and evolutionary structural variability. DOI: 10.1103/PhysRevLett.127.098103 Phenotypic states of all living systems are shaped through dynamics, which are changed through evolution. The possible connection between the changes by dynamics in a short term and those by genetic evolution in a long term has been a crucial topic in evolutionary and developmental biology [1][2][3]. Recent developments in computational, systems, and theoretical biophysics have enabled us to investigate the issue quantitatively in terms of phenotypic variations. The state variables are subjected to both external noises and internal genetic changes. The possible relationships between the variations induced by noise and genetic mutations, i.e., correlations between the changes by dynamics and evolution, have been investigated at the molecular, cellular, and organism levels both theoretically [4,5] and experimentally [6][7][8].
Native structures of globular proteins are also shaped as a result of folding dynamics. Herein, the amino acid sequences are determined by genes. Understanding the genotype-phenotype mapping of proteins is one of the fundamental challenges in structural biology. The conformational dynamics are subjected to thermal noises, which can trigger functional motions [9][10][11][12][13], whereas the mutations in the sequences may lead to deformations in their native structure, resulting in changes in the equilibrium dynamics of the proteins.
In this Letter, with the database of hundreds of thousands of proteins from hundreds of protein families, we analyze the correlated deformations of proteins subjected to both thermal noises and mutations. The data analysis quantitatively demonstrates the correlations between the noiseinduced protein dynamics and mutation-induced variations of native structures, indicating the dynamics-evolution correspondences of proteins. Subsequently, from the study of the linear responses of native proteins, the origin of such correspondence is elucidated, to which dimensional reduction of the noise-or mutation-induced deformations of the proteins is shown to be related.
To characterize the fluctuations of a protein subjected to external noises, we introduce the vector ⃗ r ¼ ½⃗ r 1 ; ⃗ r 2 ; …; ⃗ r N ⊺ as the variable to describe the phenotypic state of the proteins, that is, the three-dimensional Cartesian coordinates of the N amino acid residues (represented by their C α atoms). To describe the variations in the native structure subjected to internal mutations, we take the native structure ⃗ r 0 ¼ ½⃗ r 0 1 ; ⃗ r 0 2 ; …; ⃗ r 0 N ⊺ , which is determined by the gene, as the model parameter. Here, the variable ⃗ r and parameter ⃗ r 0 are a pair of conjugated variables. We denote the noiseinduced deformation vector as Δ⃗ r ¼ ⃗ r − ⃗ r 0 . The covariance matrix C D describing the time-averaged correlated motions of a single protein can be expressed as C D ¼ hΔ⃗ r · Δ⃗ r ⊺ i [34]. Similarly, one can introduce the vector Δ ⃗ r 0 to describe mutation-induced deformation of the native state. By averaging the correlations in the structural variations of a group of proteins belonging to the same family of structurally similar proteins [35,36], one can introduce the covariance matrix C E ¼ hΔ ⃗ r 0 · ðΔ ⃗ r 0 Þ ⊺ i to describe the fluctuations of native structures in the evolution.
Herein, as an example, the structure of the coronavirus main proteinase (3CLpro) is shown in Fig. 1(a). The noise-induced motions in protein dynamics are shown in Fig. 1(b); further details are provided in Ref. [37]. The mutation-induced deformations in protein evolution are illustrated in Fig. 1(c). To find the connections between the dynamics and evolution, we calculate the two covariance matrices: matrix C D is obtained based on the elastic network model (ENM) [66][67][68], and matrix C E is calculated based on the ensemble of 511 structural homologs. We first compare the diagonal elements C D ii and C E ii , which reflect the magnitude of the noise-induced fluctuations (mobility) and mutation-induced deformations, respectively [69]. We standardize the C D ii and C E ii into z scores [70] and obtain the corresponding mobility and structural variability profiles. As shown in Fig. 1(d), the two standardized profiles show high similarities. Further correlations between them are shown in Fig. 1(e). Although there is not a perfect linear correlation between the two profiles, the Spearman correlation coefficient ρ s is rather high [71], indicating that the residues with high dynamical mobility exhibit high structural variability, which is related to evolvability (see Ref. [37], Sec. 2). This result is in line with previous observations on the correlation between sequence evolution and structural dynamics [72][73][74][75][76][77]. Next, we normalize the diagonal elements of C D and C E to unity (see Ref. [37], Sec. 4.1) and obtain the crosscorrelation matricesĈ D (dynamics) andĈ E (evolution), respectively. As shown in Fig. 1(f), matricesĈ D andĈ E share similar patterns. These results indicate a correspondence between the dynamics and evolution of protein structures.
To further validate this correspondence, we employ the DALI server [35] and the PRODY package [78,79] to fetch and align the structures [80] from more protein families. Our database contains protein structures from 415 protein families. In each protein family, there are more than 20 structural homologs (see Ref. [37], Sec. 1). For demonstration, we take four protein families with different chain lengths as examples to analyze the correlation lengths in protein dynamics and evolution. Similar to previous studies [81][82][83], we assume that the entriesĈ D ij andĈ E ij are the functions of the mutual distance r ij between the residue i and j. Then, we introduce the distance-dependent correlation functions ϕ D ðrÞ and ϕ E ðrÞ by averagingĈ D ij (orĈ E ij ) for residue pairs at mutual distance r ij ≈ r. As shown in Figs. 2(a) and 2(b), the correlation functions ϕ D ðrÞ and ϕ E ðrÞ behave similarly. We define their first zero points as the correlation lengths ξ D and ξ E , respectively. As shown in the insets of Figs. 2(a) and 2(b), ϕ D ðrÞ and ϕ E ðrÞ can be scaled by their correlation lengths, respectively. As shown in Fig. 2(c), for protein families at different sizes, the correlation length ξ D (dynamics) is always proportional to ξ E (evolution). Such a relationship suggests the correspondence between the long-range correlations in protein dynamics [81,84] and the epistasis (mutations at one site may alter mutations at a distant site) in protein evolution [26][27][28]85].
Previously, power laws were observed in the vibration spectra of the protein dynamics [86,87]. Herein, we evaluate the eigenvalues (σ E 1 ≥ σ E 2 ≥ Á Á Á ≥ σ E N−1 ) of the evolutionary cross-correlation matrixĈ E . As shown in Fig. 2(d), the rank-size distribution of the largest 20 eigenvalues of the four different protein families is plotted at a double logarithmic scale, and the slopes of the fitted curves are close to −1, demonstrating a power-law-like behavior. Such a distribution, which is known as Zipf's law,  [88,89].
The aforementioned data analysis has clearly demonstrated the correspondence between dynamics and evolution. However, the origin of this correspondence is still unclear. To elucidate that, let us introduce the potential energy Eð⃗ x; ⃗ θÞ, which is the function of the gene-related state ⃗ θ and phenotypic state ⃗ x of the proteins. Here, parameter ⃗ θ represents a group of genetic control parameters that determine the native structure, and variable ⃗ x consists of a set of generalized coordinates describing the conformational dynamics of a protein [90]. We take structure ⃗ r 0 ð ⃗ θÞ as the native state where Eð⃗ x; ⃗ θÞ reaches the minima. With the linear approximations, we neglect the nonlinear interactions related to protein dynamics and evolution. In this way, the energy function Eð⃗ x; ⃗ θÞ can be approximated as is introduced to characterize the energy change by mutation-induced deformations of the native structure, where C E ∼ F −1 . The matrix F is known as the observed Fisher information matrix in statistical inference [91] and information geometry [92] (see Ref. [37], Sec. 3.2). According to Eq. [66], around the native state, we have ð∂ 2 E=∂r i ∂r j Þ ¼ ð∂ 2 E=∂r 0 i ∂r 0 j Þ, that is, H ¼ F. Thus, a direct correspondence between dynamics and evolution around the native structure is shown.
With the correspondence H ¼ F, and referring to the reciprocal relations C D ∼ H −1 and C E ∼ F −1 , one would expect the relation C D ∼ C E . In fact, due to the nonlinearity in the real protein data, such a relation may not be strictly valid. Still, we can prove that the determinant of the dynamical covariance matrix C D is proportional to the determinant of the evolutionary covariance matrix C E [93], that is, det , and similarly for λ E i 's. As illustrated in Fig. 3(a), 1=λ D i (or 1=λ E i ) provides an estimation of the magnitude of fluctuation along the direction of the eigenvector ⃗ v D i (or ⃗ v E i ). Thus, the determinant (product of the nonzero eigenvalues) gives an estimation of the total "volume" of the conformational space accessible by dynamical or evolutional changes in the vicinity of a given native structure (see Ref. [37], Sec. 5.1). The correspondence between the determinants indicates that the total volume of the conformational space accessed in the dynamics is proportional to that accessed in the evolution. The former is related to the sensitivity to variable perturbations [87], and the latter is associated with the robustness to parameter mutations [88,89]. Hence, high sensitivity in variable perturbations  corresponds to high robustness in parameter mutations. A conceptual sketch is provided in Fig. 3(b). For the flat basin, thermal noises can trigger large fluctuations, while mutations can hardly change the energy around the minima. For the sharp basin, thermal noises can only trigger small fluctuations, while mutations can lead to large energy differences. Similar relations were also observed in the evolution-development correspondence [5,94,95] and the generalization of deep learning [96,97].
More generally, one may expect to analyze the dynamics-evolution correspondence with variables ⃗ x and parameters ⃗ θ. For instance, with parameters ⃗ θ, one shall introduce the corresponding Fisher information matrix F ⃗ θ with entries F ⃗ θ ij ¼ ð∂ 2 E=∂θ i ∂θ j Þ. By introducing the Jacobian matrix J with entries J ij ¼ ∂r 0 i =∂θ j , according to the chain rule, we have F ⃗ θ ij ¼ ð∂ 2 E=∂θ i ∂θ j Þ ¼ ð∂r 0 j =∂θ i Þð∂ 2 E=∂r 0 i ∂r 0 j Þð∂r 0 i =∂θ j Þ ¼ ðJ ⊺ FJÞ ij . Thus, Similarly, one may introduce another Jacobian matrix J 0 to reparameterize ⃗ r as ⃗ x. The fluctuations of ⃗ x and ⃗ θ can be described as the covariance matrices C ⃗ x ∼ ðH ⃗ x Þ −1 and C ⃗ θ ∼ ðF ⃗ θ Þ −1 , respectively. If vectors ⃗ x, ⃗ θ, ⃗ r, and ⃗ r 0 are all in the same dimension, then det C ⃗ x ∼ det C ⃗ θ is still valid. For more general cases, when the vectors are not in the same dimension, if covariance matrices C ⃗ x and C [37], Sec. 5.2). These results suggest that the correspondence between the determinants is universal and does not rely on the selection of the coordinates [98].
The correspondence between the determinants reflects the connection between the dynamical and evolutionary conformational spaces in the measure of volume. However, it is not sufficient to explain the high coincidence in the principal components of the dynamics and evolution of proteins. To demonstrate such a coincidence, we first take 3CLpro as an example. By comparing the dynamical modes given by ENM with the evolutionary modes obtained by the principal component analysis, a mode-bymode overlap matrix can be obtained. The overlap matrix of the first six modes is shown in Fig. 4(a). One can find that the first (or the second) dynamical mode v D 1 (or v D 2 ) has high overlap ð> 0.7Þ with the second (or the first) evolutionary mode v E 2 (or v E 1 ). In Fig. 4(b), the scattering plots of the first two evolutionary modes (v E 2 and v E 1 ) vs dynamical modes (v D 1 and v D 2 ) show a high correlation, indicating the correspondence in the subspaces spanned by the first two modes.
To further quantify the overlap between the subspaces spanned by the dynamical and evolutionary modes, let us consider the projection of the mutation-induced deformations Δ ⃗ θ on the dynamical modes ⃗ v D i . We first normalize the Hessian matrix H so that the average eigenvalue hλ D i ¼ 1 [84,99]. Thus, we introduce the average Rayleigh quotient where the deformation from structure p to q is denoted as Δ ⃗ θ qp , and m denotes the total number of q-p pairs.
Notably, when Δ ⃗ θ qp is parallel to the slow dynamical Thus, if the mutation-induced deformation can be well described by a few dynamical modes, hλ D i E yields a small value. The magnitude of hλ D i E is consistent with the measurement of the overlap between the d-dimensional (herein we set d ¼ 6) subspaces [100] spanned by dynamical and evolutionary modes (see Ref. [37], Sec. 6). As shown in Fig. 4(c), for protein families with smaller hλ D i E , there is an increasing overlap between the subspaces spanned by dynamical and evolutionary modes. For our dataset, the average hλ D i E ≈ 0.26, and the mean subspace overlap is approximately 0.41. Compared with random systems or permuted systems with similar degrees of freedom, the dynamical and evolutionary modes of proteins exhibit a significantly high overlap in the low-dimensional subspaces [101]. These results are in line with previous observations on the genotype and phenotype of proteins [25,26,28].
To understand the structural (topological) origin of low dimensionality, we analyze the modularity Q of the elastic network of the proteins (see Ref. [37], Sec. 7). Intuitively, a network that can be divided more easily into modules would have higher modularity Q. As shown in Fig. 4(d), hλ D i E has a negative correlation with Q, indicating that the modularized structures contribute to the low dimensionality in protein dynamics and evolution. Owing to such low dimensionality, the total conformational space can be approximated as the subspace spanned by a few leading dynamical or evolutionary modes. For example, − lnðλ D 1 λ D 2 Þ or − lnðλ E 1 λ E 2 Þ can be assumed to approximate the log volume of the dynamical or evolutionary conformational space, respectively. As shown in Fig. 4(e), − lnðλ D 1 λ D 2 Þ has a negative correlation with hλ D i E , suggesting that proteins maximize their conformational changes by reducing the dimensionality of the dynamics and evolution. In this way, proteins gain both flexibility in dynamics and structural variability in evolution.
Finally, we concentrate on the size (chain length N) dependence of the hλ D i E of different protein families. As shown in Fig. 4(f), hλ D i E decays gradually as the chain length N increases, and a power-law relation hλ D i E ∼ N −ζ (ζ ≈ 0.23) holds. As a comparison, the size dependence of the slowest mode (λ D 1 ∼ N −1 ) is also plotted. These scaling relations reveal an evolutionary constraint for proteins of different sizes. Larger proteins exhibit slower relaxations, and their mutation-induced deformations show higher overlap with slow dynamical modes.
Overall, our data analysis quantitatively demonstrates the dynamics-evolution correspondence in protein structures. Our theoretical study elucidates the origin of this correspondence and establishes the plasticity-robustness relation between dynamics and evolution. These results suggest a universal link between the functionality and evolvability of proteins, which provides a quantitative framework leading toward the rational function-driven protein design [102,103]. In such a framework, the designed function will determine the mobility of the residues, and the information of residue motions will determine the features in the sequences together with evolutionary constraints.