Deep Learning Protein Conformational Space with Convolutions and Latent Interpolations

Open Access


Venkata K. Ramaswamy, Samuel C. Musson, Chris G. Willcocks, and Matteo T. Degiacomi

Phys. Rev. X 11, 011052 – Published 15 March 2021

Abstract

Determining the different conformational states of a protein and the transition paths between them is key to fully understanding the relationship between biomolecular structure and function. This can be accomplished by sampling protein conformational space with molecular simulation methodologies. Despite advances in computing hardware and sampling techniques, simulations always yield a discretized representation of this space, with transition states undersampled proportionally to their associated energy barrier. We present a convolutional neural network that learns a continuous conformational space representation from example structures, and loss functions that ensure intermediates between examples are physically plausible. We show that this network, trained with simulations of distinct protein states, can correctly predict a biologically relevant transition path without being given any example on that path. We also show that features learned from one protein can be transferred to others, which results in superior performance and requires a surprisingly small number of training examples.

Received 8 June 2020
Revised 15 December 2020
Accepted 26 January 2021

DOI: https://doi.org/10.1103/PhysRevX.11.011052

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.


Physics Subject Headings (PhySH)

Conformation changes, Protein structure

Machine learning, Molecular dynamics

Physics of Living Systems, Polymers & Soft Matter, Interdisciplinary Physics

Authors & Affiliations

Venkata K. Ramaswamy ^1,*, Samuel C. Musson ^1,*, Chris G. Willcocks ^2,‡, and Matteo T. Degiacomi ^1,†

¹Department of Physics, Durham University, Stockton Road, Durham DH1 3LE, United Kingdom
²Department of Computer Science, Durham University, Stockton Road, Durham DH1 3LE, United Kingdom

^*These authors contributed equally to this work.
^†Corresponding author. matteo.t.degiacomi@durham.ac.uk
^‡Corresponding author. christopher.g.willcocks@durham.ac.uk

Popular Summary

Proteins carry out a range of essential biological functions including catalysis, sensing, motility, transport, and defense. These molecules function independently or in unison with other molecules, such as DNA, drugs, or other proteins. To perform these functions, a protein must often change its shape (or conformation), but identifying the possible conformations of a protein is not an easy task. We present a methodology that combines molecular simulation and machine learning to discover transition paths between protein conformational states.

Current experimental techniques provide a good picture of the most stable conformations but little to no information on the transition paths or intermediate states between them. Although these intermediate conformations are important for drug targeting, determining them reliably remains a challenging problem.

We overcome this hurdle with a neural network that can generate new protein conformations after being trained with examples from experiments or molecular simulations. The network is capable of generating structures that respect physical laws and features the first universal architecture, capable of handling any protein without having to be modified.

Our network can predict a physically realistic and biologically relevant transition path between different conformations of the protein MurD, a potential antibacterial drug target. The network is also capable of transfer learning, whereby its training with a sparse dataset is facilitated by pretraining on a different, larger dataset. This opens the door to a new class of computational methods capable of characterizing simultaneously every existing protein.


Issue

Vol. 11, Iss. 1 — January - March 2021


Images

Figure 1
Neural network design. (a) The generative architecture is composed of an encoder $f$ and a decoder $d$, and is trained with a collection of protein conformations. The loss function couples a geometric term $L_{MSE}$, which ensures that the original and encoded-decoded structures are similar, with physics-based terms $L_{path}$, which ensure that latent space interpolations $z_{m}$ between any pair of conformations produce protein structures of low energy. (b) Protein atoms can be sorted in a list so that atoms that are adjacent in the list are also adjacent in Cartesian space. The convolutional neural network operates on this list, which can be of arbitrary size. (c) The first 1D convolution layer learns 32 feature detectors, each with a kernel size of 4. The stride is set to 2; therefore, the output sequence is half the input size for any input length. Each subsequent layer further reduces the spatial length of the molecule, warping the input such that it becomes progressively deeper and thicker (more ribbonlike) as well as more abstract.
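The halving described in panel (c) can be checked with the standard output-length formula for a 1D convolution. The sketch below is illustrative only: the caption gives the kernel size (4) and stride (2), while the padding of 1, the input length of 1024, and the depth of four layers are assumptions chosen so that the arithmetic reproduces the stated halving.

```python
def conv1d_output_length(n_in, kernel_size=4, stride=2, padding=1):
    """Output length of a 1D convolution without dilation.

    kernel_size and stride follow the caption; padding=1 is an assumption.
    """
    return (n_in + 2 * padding - kernel_size) // stride + 1


def encoder_lengths(n_in, n_layers):
    """Sequence length after each of n_layers identical stride-2 layers."""
    lengths = [n_in]
    for _ in range(n_layers):
        lengths.append(conv1d_output_length(lengths[-1]))
    return lengths


# Hypothetical input of 1024 list positions passed through four layers.
print(encoder_lengths(1024, 4))  # [1024, 512, 256, 128, 64]
```

With these assumed settings an odd input length is rounded down, for example 7 positions become 3.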
Figure 2
Performance of the neural network trained to minimize MSE alone or in conjunction with physical terms (labeled on the left). At every epoch, each network was asked to generate 20 conformations interpolating from the MurD closed state to the open state. (a) At each epoch, we calculate the DOPE score of each conformation and report the mean values from the 10 repeats on the vertical axes, with color corresponding to the standard deviation. Physics-based loss functions lead to interpolations with better structural quality, reflected by lower DOPE scores. (b) The network-predicted protein conformations of open (left), intermediate (center), and closed (right) states at the last epoch, shown in sausage representation with the thickness and color corresponding to the percentage error at the residue level.
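One simple way to obtain 20 interpolating conformations is to place evenly spaced points on the straight line between the latent encodings of the two end states and decode each point. The sketch below shows only the latent-space step. The linear spacing, the inclusion of both end points, the two-dimensional latent vectors, and the names `interpolate_latent`, `z_closed`, and `z_open` are all assumptions made for illustration; the decoder is not reproduced here.

```python
def interpolate_latent(z_start, z_end, n_points=20):
    """Return n_points latent vectors evenly spaced from z_start to z_end.

    Both end points are included (an assumption, not taken from the caption).
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for i in range(n_points):
        t = i / (n_points - 1)  # runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical latent coordinates for the closed and open states.
z_closed = [0.0, 1.0]
z_open = [1.9, -0.9]
latent_path = interpolate_latent(z_closed, z_open)
print(len(latent_path))  # 20
```

Each vector in `latent_path` would then be passed through the trained decoder to produce a candidate structure.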
Figure 3
Analysis of latent spaces of networks trained on conformations of MurD “open” and “closed” states. (a) The physics-based loss function used to train our neural network correlates with the DOPE score, making it a reasonable estimator of protein structural quality. (b) DOPE scores corresponding to the latent spaces of our neural network trained with two different loss functions. On the left, the network is trained to minimize only the mean square error (MSE) between input and output structures; on the right, the loss function combines MSE and a physics-based loss function (${DNN}_{Phys}$). Yellow regions indicate structures of poor quality; black points show the projection in the latent space of all training examples, including the generated midpoints $z_{m}$ used to train ${DNN}_{Phys}$. The transition path predicted by each network is projected as connected white dots onto its respective latent space. The latent space of the network based solely on MSE features two basins associated with the closed and open states, separated by a region of poor-quality models. In the network trained with both MSE and physics (${DNN}_{Phys}$), the two states are connected by acceptable protein models. See also Fig. S4, which compares the structures from the identified basins (Supplemental Material [27]).
Figure 4
State transition path prediction by the neural network. (a) Side view of MurD showing the transition of its mobile domain from the closed to open state (with the time evolution marked as beads colored from yellow to violet). (b) The conformational change of MurD can be described in terms of spherical coordinates between its three domains (see Methods). We report the opening angle of each conformation in the training set (light green), test set (dark green), intermediate crystal structures (palatinate stars), as well as interpolations generated by PCA (gray circles), a purely geometry-based neural network (${DNN}_{MSE}$, gray triangles), and a neural network combining geometry and physics (${DNN}_{Phys}$, gray squares). The interpolation produced by ${DNN}_{Phys}$ best reproduces the path described by the test set and transits in the vicinity of intermediate crystal structures. (c) Superimposition of intermediate MurD conformation 5A5E (in palatinate) and an intermediate conformation generated by the neural network (in white), with an RMSD of 1.3 Å.
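For reference, the RMSD quoted in panel (c) is the root of the mean squared distance between matched atoms of two structures. The sketch below assumes the two coordinate sets are already superimposed and list the same atoms in the same order; it performs no alignment step, and the function name and toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) tuples.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom is displaced by 1 Å along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
print(rmsd(reference, shifted))  # 1.0
```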
Figure 5
Transfer learning of our best network trained on MurD to four different proteins. (a) Side view of p24, TBE-sE, HSP, and SurA showing example alternative conformations with transparency and the highly mobile domains of p24, TBE-sE, and SurA in color (yellow and green). (b) Moving average (window of 3) of the mean (solid line) and standard deviation (shaded region) of the total loss (log scale), comparing the performance of the pretrained network with its counterpart trained from scratch, using training datasets of size 1000 and 100, all using the same loss function featuring both MSE and physics-based terms.
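The smoothing in panel (b) can be sketched as follows. The caption does not say whether the window is trailing or centered, whether smoothing is applied before or after aggregating repeats, or which standard deviation estimator is used. This sketch therefore assumes that the per-epoch mean and population standard deviation are computed across repeats first, and that each curve is then smoothed over full windows of 3. The function names and loss values are hypothetical.

```python
from statistics import mean, pstdev


def moving_average(values, window=3):
    """Average over every full window of consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def smoothed_mean_and_std(runs, window=3):
    """Smooth the per-epoch mean and standard deviation across repeated runs.

    runs is a list of loss curves, one per repeat, all of equal length.
    """
    per_epoch = list(zip(*runs))  # group the repeats epoch by epoch
    means = [mean(epoch) for epoch in per_epoch]
    stds = [pstdev(epoch) for epoch in per_epoch]  # population std (assumed)
    return moving_average(means, window), moving_average(stds, window)


# Hypothetical total-loss curves from two repeats over four epochs.
runs = [[9.0, 6.0, 3.0, 3.0],
        [3.0, 6.0, 3.0, 1.0]]
smooth_mean, smooth_std = smoothed_mean_and_std(runs)
print(smooth_mean, smooth_std)
```

Because only full windows are averaged, each smoothed curve is two points shorter than the raw one.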
