Variational Identification of Markovian Transition States

We present a method that enables the identification and analysis of conformational Markovian transition states from atomistic or coarse-grained molecular dynamics (MD) trajectories. Our algorithm is presented by using both analytical models and examples from MD simulations of the benchmark system helix-forming peptide Ala 5 , and of larger, biomedically important systems: the 15-lipoxygenase-2 enzyme (15-LOX-2), the epidermal growth factor receptor (EGFR) protein, and the Mga2 fungal transcription factor. The analysis of 15-LOX-2 uses data generated exclusively from biased umbrella sampling simulations carried out at the hybrid ab initio density functional theory (DFT) quantum mechanics/ molecular mechanics (QM/MM) level of theory. In all cases, our method automatically identifies the corresponding transition states and metastable conformations in a variationally optimal way, with the input of a set of relevant coordinates, by accurately reproducing the intrinsic slowest relaxation rate of each system. Our approach offers a general yet easy-to-implement analysis method that provides unique insight into the molecular mechanism and the rare but crucial (i.e., rate-limiting) transition states occurring along conformational transition paths in complex dynamical systems such as molecular trajectories.


I. INTRODUCTION
Recent advances in both parallelizable computational software and the development of highly-efficient supercomputers have extended the timescale accessible to atomistic molecular dynamics (MD) with explicit solvent representations to simulations up to the order of milliseconds (1)(2)(3).While this enables state-of-the-art computational modeling of complex molecular processes such as folding and binding (4), the vast amount of complex, high-dimensional data obtained from these simulations requires novel analysis methods, on one hand, to make use of all the available information and, on the other hand, to extract comprehensible and relevant information.The use of Markov State Models (MSMs, (5,6)) allows to directly extract thermodynamic and kinetic information from MD trajectories, such as free energy profiles, equilibrium probabilities and transition rates, and has proven to be a useful tool to analyze data from both experiments and simulations (7)(8)(9)(10)(11).
A variety of methods have been proposed that attempt to aggregate the conformational state space, projected along certain reaction coordinates (RCs) of interest into macrostates, proposing approaches that can be implemented automatically (12)(13)(14).Most of these methods rely on kinetic network descriptions employed by MSMs.However, while clustering algorithms such as PCCA+ (15) and its advances (16,17) are designed to identify metastable states (e.g., in protein folding terms: the folded, the unfolded and longlived intermediate states), none is aimed specifically at the automatic, reliable identification and characterization of the rate-determining transition states (TS), which are crucial for the thorough understanding of underlying molecular mechanisms and which control the overall kinetics of the system.
A method that allows the identification of TS, defining them as unstable states with equal probability to transition to neighboring metastable states on either side along the reaction coordinate, while concurring maximum flux, makes use of the commitment probability of states (18) (i.e. in one dimension, the probability to reach one state before it reaches another).However, in complex processes, it is not clear how many metastable states are important to describe the overall slow dynamics, and where the key kinetic bottleneck is located.
Here we propose a method for the automatic identification and analysis of both metastable conformational states (MS) and TS regions, by optimal and hierarchical construction of MSMs using a set of discretized RCs.The use of key RCs has long been a successful approach in many of the major enhanced sampling methods (19)(20)(21)(22), and it provides a crucial framework to allow the intuitive understanding of the data (23).Our algorithm enumerates all possible clusters that could be formed along an ordered set of states for each RC, and selects the optimal clustering that maximizes the slowest relaxation time of the reduced system.This is both a physically and mathematically meaningful optimization process, as the full system's relaxation time provides an upper bound to the slow relaxation processes of this system, and the better the coarse graining, the closer our objective function will be to this ideal value.Furthermore, our algorithm does not require higher order eigenvectors besides the second eigenvalue (unlike e.g., PCCA+).By increasing the number of coarse-grained states used to describe the system, we systematically identify key MSs, until we exhaust the number of relevant MSs, and the first optimal TS is identified.This approach helps define the minimal required number of MSs relevant for the kinetics of the process.We subsequently combine all the RCs in corresponding direct product states to finally obtain a reduced kinetic model of the system described by the complete set of RCs, and the corresponding TS that is most relevant to the slowest intrinsic relaxation time (24).

II. TIMESCALE OPTIMIZATION CLUSTERING
Based on the fact that any clustering of MSs decreases the measured relaxation times of the intrinsic dynamics of a system (25), and considering our main aim of estimating most accurately the slowest relaxation process, we target our proposed clustering method towards the maximization of the slowest extracted relaxation time (15,24).

A. 1D Clustering
Complex, multidimensional trajectories are often analyzed using their projection on simpler 1D RCs.
Here we start by assuming that a set of degrees of freedom is available, which accounts for the slow dynamics of the system.Any continuous 1D RC can be discretized into adequately small "microstates" (-states), denoted here as , where is their total number, which capture the same intrinsic dynamics as the continuous RC.In practice, a 1D RC, x(t), is often binned in equidistant -state intervals over RC windows of length x  , such that, for a trajectory that samples the interval we have .This discretization has an intrinsic ordering inherited from the continuous RC.
To reduce the dimensionality, the typically large number of -states can be clustered into coarse-grained "macrostates" (M-states).Given the projection of a trajectory on a 1D RC, x(t), for clustering it into M sorted M-states over a finite domain (i.e., with no periodic boundaries), b i cluster boundaries (   , and we are 1 ,..., , xx searching for the one that maximizes the coarse grained relaxation time ( When the exact full rate matrix K is known (i.e., not only the transition probabilities from trajectories), the reduced rates for each choice of a set of state boundaries, b i , along a Markov chain can be obtained using optimal, lagtime-independent expressions (e.g., Eq. 12 in Ref. (24)).When the full rate matrix K is not available, approximate reduced Markov matrices can be built, see for example Fačkovec et al. (10).Here we considered a lag time τ, for which the corresponding transition count matrix C(τ) can be calculated, with elements C ij (τ) corresponding to the number of transitions observed from S i to S j during the observation window τ.Subsequently, the corresponding Markovian transition probability matrix can be built, with entries or using a reversible estimator with an iterative approach (26)(27)(28)(29).Solving the eigenvalue problem of the Markov matrices T(τ) for each possible set of boundaries, the slowest relaxation time is obtained using 22 / log   t , where 2  is the second largest eigenvalue, and 1 1   .Finally, the optimal M-state clustering corresponds to the M−1 boundary positions, for which the slowest relaxation time, t 2 , is maximum.Importantly, by increasing the number of -states one-by-one, we first obtain all the key MSs, corresponding to diagonally-dominated reduced transition probability matrices ( ).Our algorithm automatically evaluates the transition probabilities to neighboring states and these are compared with the probability to remain in the same state.A TS is defined as an unstable state from which the outgoing probabilities to either directions are greater than the probability to stay in the same state, thus in our implementation, as we identify new M states, we test for this property and we stop adding additional M states once a TS is found.In addition, we also verify the nature of our TS by calculating and ensuring that most of the reactant flux goes through these states (see SI Section I) rather than around them, in accordance with the maximum flux criteria proposed by Huo and Straub (30).This characteristic qualitative change in the properties of the last -state allows us to identify the minimal number of key MSs that have to be accounted for in the description of the process along each 1D RC, and more generally in N-D.

B. N-D Clustering
To generalize our approach to the analysis of a trajectory in the N-D state space (i.e., described by N 1D RCs), each dimension is first clustered into M macrostates as described above.Assuming for simplicity that the number of M-states is the same, M, for each dimension (an assumption that is eigenvalue of the Markov model with N M states (15, 18, 24)) for a computationally feasible algorithm analogously to the one presented by Hummer and Szabo (24).We note here that the commitment probability cannot in general be used to define transition states in complex systems with multiple states (see also Figure S2).Instead, we use the second eigenvector here only to order the states, and we define subsequently the boundaries between the coarse-grained states based on this ordering initially.The resulting, sorted M-state trajectory are then analyzed one more time according to the 1D clustering procedure described above, to find a smaller and coarser grained optimal clustering into L larger "states" (i.e., with a maximum corresponding slowest relaxation time, 2 1/ ).The resulting 2 nd -level coarse graining into -states (i.e., with 2  N LM) is denoted here by 1 ,,


L .The ordering according to the second right eigenvector is not optimal in all cases, however, we verified in all examples presented here, that single state variations do not significantly affect the obtained clustering results.We also note that by using an ordering along the second right eigenvector, the continuity of the states is no longer ensured, and therefore kinetically disconnected states might be directly adjacent to one another.This problem may be overcome by using alternative approaches, for example diffusion maps (31) to identify an ordering that is based on kinetic vicinity of the states.This final coarse-graining procedure is presented above by considering all the RCs at the same time, globally.However, importantly, for complex systems with a large number of RCs it may be necessary

C. Analytical Investigation
We first tested our clustering method using Markov state models created for a set of analytical free energy functions that can be tuned to switch continuously and monotonically between 2-state-like and 5  N

3-state-like dynamics. To define the kinetic model underlying the analytical free energy function F(x),
we construct an -state Markov chain.The corresponding rate matrix, K, is given by: (1a) KK .
Here, we calculated the reduced Markov matrix both using Eq. 12 of Hummer and Szabo (24), and also with respect to a lag time τ using (see also SI Section III.): With   0, , 0,1, ,1, 0, , 0   Application of this method to a multitude of free energy profiles (Fig. 1) revealed that the slowest timescale optimization clustering successfully identifies all meta-stable states in a system.Once we exhausted the number of available important MSs as explained in Section IIIA, the next optimal -state is qualitatively different from the metastable macrostates obtained before, and it represents a TS with large probabilities to jump to the neighboring M-states (Figs. 1a-b), and most of the flux going through the TS.

III. APPLICATIONS
We apply our clustering algorithm to the analysis of MD trajectories obtained for two different systems (i) the relatively small helix-forming peptide Ala 5 , and (ii) a larger system, the epidermal growth factor receptor (EGFR) protein (32,33).

A. Conformational Dynamics of Ala 5
We first use our algorithm to study the dynamics of alanine pentapeptide (Ala 5 ) -a commonly used test system for evaluating the intrinsic conformational kineticsusing atomistic MD trajectories (32,34).The Ala 5 system (Fig. 2) has the advantage of being sufficiently small for generating converged MD sampling, even with an explicit representation of water, using relatively modest computational resources.At the same time, it is the smallest peptide that can form a full helical "i,i+4" loop between the amide carbonyl of its first residue and the amide hydrogen of its last residue allowing the study of secondary structure formation in both theoretical (32,(35)(36)(37)(38) and experimental (39,40) studies.We analyzed MD trajectories of Ala 5 with TIP3P (41) water molecules initialized from 4 initial conditions, of 250 ns length each, and performed at two different temperatures T 1 = 300 K, T 2 = 350 K, for a total simulation time of 2 μs.The simulations were performed using GROMACS (42) with the Amber-GSS (43) force field, as described in detail in Ref. (32).Here, the final coarsegraining in -states is performed for all the RCs M-states (though it can also be done sequentially, see text).
To cluster the MD trajectory of the Ala 5 peptide, the five Ramachandran Ψ angles have been chosen as RCs (Fig. S3), as the free energy barriers along the Φ angles are much smaller and thus, they contribute less to the slowest relaxation modes (32).In Fig. 2A   For convenience, we have labelled the S i M-states with 0 if they are in a helical (H) region of Ψ, 1 if they are in a coil (C) region of Ψ, and 2 if they are in a TS region of Ψ.Thus, the all-helical macrostate of Ala 5 is denoted as [00000], while, for example, [02221] will denote an M-state with the N-terminal residue helical, the C-terminal residue coil and the middle residues in their respective TS regions of their Ψ angles.
The 3-state clustering of all 5 Ψ RCs (with boundaries given in Table S1) at 350 K leads to the identification of only 8  macrostates at lag times τ = 10 ps as depicted in Fig. 3B and presented in Table S2 (including the corresponding -state populations and the equilibrium probabilities of their most populated M-states).The macrostates   and   have the largest population and form the "folded" ensemble.In this case,   consists of only one all-helical M-state   0 00000  that has an equilibrium population of ~40.84% (Table S2).The 2 nd most populated state,   , consists of several macrostates that add up to 6.91% of the total equilibrium population, of which 00001 is predominant (~6.66%).Thus, unfolding at the last C-terminal residue of Ala 5 is not sufficient to perturb the overall stability of helical conformations, and it maintains Ala 5 in its folded ensemble.
On the other hand, the unfolded ensemble appears to consist of 5  states (  to   ) connected with noticeably slower rates than those within the folded states, and accounting for more than 52% of the total equilibrium population.The "unfolded" ensemble is separated from the helix-rich "folded" ensemble by a TS,   , that consists of 10 M-states with a total population of only ~0.19%, of which the most populated M-states are [2000*], [0200*] and [0002*], where the "*" symbol means that at that position either 0 (H) or 1 (C) conformations are observed.Thus, the range of transitions near the TS is largely dominated by conformational dynamics at the 1 st , 2 nd , and 4 th residue of Ala 5 .Interestingly, at 300 K the 4 th residue is the only residue that significantly modulates the TS dynamics (Table S3).
As illustrated in Fig. 3B, the "unfolded ensemble" states   to   presents at 350 K (i) a specific connectivity that depends on how many residues can change cooperatively their state, and (ii) transition rates that are lower than in the folded ensemble.
These observations, enabled by our method agree well with previous experimental and computational studies of Ala 5 (32), while offering a more detailed, automatic analysis, also identifying TS states.
. Discretizing x(t) into M macrostates in the interval requires M − 1 boundaries.Any positioning of the boundaries defines a distinct set of Mstates along the RC:


with no loss of generality), a 1D trajectory with N M states is obtained.We can thus define N M unique states (e.g., of the form SSSSS when the trajectory is in the M-state S 0 in all dimensions).The resulting N M macrostates, are relabeled from M-state 0 our 1 st -level coarse-grained trajectory.Finally, to perform a 2 nd -level coarse graining, the M-(e.g., by their commitment probability values, calculated from the right eigenvector corresponding to the 1 1   to perform the final clustering step by hierarchically approximating the coarse graining of the M-states, into the -states, 1 ,,  L , by clustering the M-states, 01 ,,  M SS , of only a few (e.g., two) RCs at a time in a stepwise manner, and include additional RCs sequentially.This avoids the clustering of an unfeasibly large number of M-states left eigenvector of K corresponding to eigenvalue 0. Varying the cluster boundaries b i and maximizing the slowest relaxation time for all reduced Markov  N matrices defines the clustering algorithm for the analytical model, with the parameters  and  N .

Figure 1 .
Figure 1.Illustration of automatic optimal clustering into M=3 macrostates for a set of 1D analytical potentials (panels a to e, see SI Section III.) in which the intrinsic kinetics is tuned continuously as a function of a control parameter from being 2-state-like (a) to 3-state-like (e).The middle region (yellow) between the two boundaries identified by the algorithm (vertical lines) is either a TS, in a and b, or it becomes the third MS in c-e.Vertical lines correspond to the Hummer and Szabo rates (blue, (24)), and to transition probability matrices at 1000  

Figure 2 .
Figure 2. A Free energy for the backbone angle Ψ 1 of Ala 5 calculated for N  200 -states at 350 K and lag time τ=1 ps.Resulting M-states (vertical boundary lines: b1, b2, b3) are shown for 3-state clustering with periodic boundaries ((b1, b2) being identical for 2-state clustering).B All-helical00000, and all-coil, 22222, conformations of Ala 5 .The five Ψ backbone dihedral angles (yellow) are used as RCs.C For N-D clustering (N=5), -state clustering identifies M=3 M-states along each RC (H -red, Cblue, and TSyellow).Here, the final coarse- the clustering result for the 1D profile for Ψ 1 is demonstrated, where cluster boundaries are shown along the free energy profile calculated from N  200 Markov -states at 350 K and lag time τ = 1 ps.For all 5 Ψ angles, the two-state assignment successfully identified the two meta-stable states corresponding to the two free energy basins and the slowest relaxation time is in good accordance with the full 200-state model.As demonstrated in Fig.2A,3-state clustering for meaningful lag times up to the magnitude of the relaxation time (~τ = 500 ps) leads to the detection of the small TS state, which is characterized by a small equilibrium population and survival probability, and similar transition probabilities to either of the H or C states.For example, 3state clustering of the backbone angle Ψ 1 at 350 K (Fig.2A) with N  100 -states leads to a slowest relaxation time of t 2 = 212.45ps, and the equilibrium probabilities are, π 0 (H) = 0.662, π 1 (TS) = 0.001, and π 2 (C) = 0.337.The macrostate boundaries are listed in TableS1, and the transition probabilities are T 22 = 0.0984, T 20 = 0.4756, T 21 = 0.4260.

Figure 3 .
Figure 3.A Transitions near the TS are dominated by conformational dynamics at the 4 th alanine residue of Ala 5 .Representative conformations are shown for helical [00000] (yellow cartoon), and TS states [00020] (blue cartoon), and [00021] (grey cartoon).B Kinetic network for the optimal coarse grained -states for the Ala 5 (350 K, Tables .  eigenvalue of K (SI, Section II.), which generally corresponds to a separation into MSs.However, for 2-state-like systems, the solution of the eigenvalue optimization problem will lead to boundaries that identify a TS.Thus, when the 2 nd state has a TS-like character, this corresponds to