Propagation of uncertainty in physicochemical data to force ﬁeld predictions

The solvation free energy (SFE) is a key property in the thermodynamics of chemical processes. It can be evaluated using molecular simulations with good statistical accuracy. However, force ﬁeld predictions exhibit systematic errors due to uncertainties in the parametrization. Here we evaluate how the uncertainty in physicochemical data underlying force ﬁelds propagates to SFE predictions. We ﬁnd that the data contribution to the uncertainty in SFE is up to 25 times larger than the statistical uncertainty. The total uncertainty in the SFE in water is higher than in cyclohexane.


I. INTRODUCTION
Accurate prediction of absolute or relative free energies of molecular systems in the liquid phase remains a challenging endeavor.The multistate Bennett acceptance ratio [1] estimator based on equilibrium molecular dynamics (MD) simulations makes it possible to get converged results.Recently, there has been renewed interest in nonequilibrium methods based on the Jarzynski equality [2] or the Crooks Gaussian intersection method [3,4], although both the uncertainties in results and efficiencies of such methods remain under scrutiny [5][6][7].While aleatory uncertainties due to fluctuations and sampling are now largely under control and standard protocols are available [1,8], studies of uncertainty quantification are now being focused on epistemic uncertainties, i.e., those due to a lack of knowledge of the potentials, the simulations methods, and the parameters [9][10][11][12].Here we assess the epistemic uncertainty due to propagation of measurement errors in electric multipole moments and liquid density.
Most research on computing solvation free energies (SFEs) is based on classical empirical potentials because describing the electronic structure of molecules is too expensive in the liquid phase to make it feasible to obtain adequate sampling.However, the accuracy and reliability of classical potentials are largely dependent on the parameters in conjunction with the form of the potential.The parameters are typically finetuned to reproduce specific properties for a limited number of molecules [13][14][15].This means that a unique set of parameters is chosen from a parameter space that is often high dimensional with a rugged landscape [16].As a result, it is important to systematically quantify the uncertainty in the outcome of the molecular simulations that are caused by the uncertainty in the calibrated force field parameters.In this context, we consider here the finite accuracy of reference data for building force fields, a mostly overlooked aspect of predictions from force field simulations.Uncertainties in reference data limit the accuracy that can be obtained in SFE predictions, even in the hypothetical case of infinite sampling and (hypothetical) perfect models, that is, models that reproduce the experimental data exactly.
The level of accuracy in the SFE and other properties obtained from MD simulations depends on the force field chosen as well as on other factors concerning computational procedure such as simulation time and size of the system [17,18].Modification of existing force fields or development of new force fields can in principle improve the level of accuracy in the free energy predictions [19].The SFE in water, or hydration free energy (HFE), is an essential property in the development and assessment of the accuracy of the force fields [20][21][22][23][24][25].The fast convergence and sampling of HFE calculations enables the widespread use of those calculations for the evaluation of the accuracy of sampling methods and comparison of different theoretical approaches [5].
Here we establish how the SFE of 11 small organic molecules in water and cyclohexane vary due to changes in the molecular volume (by scaling the van der Waals radius σ ) and due to changes in the electric moment by scaling the atomic charges q relative to the force fieldcharges q 0 .Both the magnitude of the molecular dipole μ and the quadrupole θ scale linearly with q, which means the relative change in the charges is equal to the relative change in the electric moments.The molecular volume V is a nonlinear function of the van der Waals radii σ and it is computed using the atomic and bond contributions of van der Waals volume method [26].The experimental observables for all solutes are listed in Table I.We then combine the simulation results with experimental data and use error propagation to show how the uncertainty in experimental properties affects the final SFE results.

A. Uncertainty quantification
The variance var( f ) in SFE due to the uncertainty in the molecular van der Waals volume V and in the charge q can be calculated by the error propagation formula [41] var where f is the free energy function, ∂ f ∂V is the derivative of the SFE with respect to V , VV is the variance of V , and ∂ f ∂q is the derivative of the SFE with respect to q.Here qq follows from the experimental uncertainty in the dipole and quadrupole using Eq. ( 1), that is, and similarly for quadrupolar compounds.Here V q is the covariance between V and q and to estimate it we can apply the Cauchy-Schwarz inequality [42], which reads If there is a correlation between the variables then the upper limit is the number on the right-hand side of this equation, while the lower limit (no correlation) is zero by definition.We assume there is no correlation between the values of the electric moments, which are measured in the gas phase, and the liquid density.Force fields are often parametrized on a combination of the liquid density and the enthalpy of vaporization and in such cases the correlation may be important [43].Finally, the variance of the molecular volume V is estimated from the variance in the density ρ of the solute in the liquid phase This notion connects the microscopic variable from the simulation system to macroscopic observables.With this, we can compute the uncertainty

B. Simulation details
All simulations were carried out with the moleculardynamics software package GROMACS [44][45][46][47][48].The generalized AMBER force field (GAFF) [13] topology parameters and the initial coordinates of compounds were taken from FIG. 1. Plot of G solv for all solute-solvent combinations with all combinations of ±q and ±σ as indicated in the benzene-water combination, averaged over three force fields.The Gibbs energy difference plotted is relative to the default q and σ .the FreeSolv database [49,50].We used the CHARMM general force field (CGenFF) server [51] for the CGenFF parameters and the LigParGen server [52,53] for the all-atom optimized potential for liquid simulations (OPLS-AA) parameters with localized bond-charge corrected CM1A charges for condensed-phase simulations [14].In all simulations, we used the transferable intermolecular potential three-point water model [54], the linear constraint solver algorithm [55] for hydrogen bond constraints, and the SETTLE algorithm [56] to keep the water bonds and angle rigid.The leapfrog stochastic dynamics integrator [57] with a time step of 2 fs was used to integrate the equations of motion.Temperature was held constant with Langevin dynamics [57,58] with a coupling constant of 1.0 ps and a reference temperature of 298.15 K.At the equilibrium stage of the simulations the Berendsen barostat [59] was used to establish a pressure of 1 bar and the Parrinello-Rahman barostat [60] was used at production stage of the simulations, with a time constant of 1.0 ps and a compressibility of 4.5 × 10 −5 bar −1 .To treat the electrostatic interactions, the particle mesh Ewald algorithm [61] was used with a cutoff of 1.2 nm, a grid spacing of 0.12 nm, and an interpolation order of 6 for long-range electrostatics.The van der Waals interactions were computed with a cutoff radius of 1.1 nm.

III. RESULTS AND DISCUSSION
The SFE was computed for all combinations of the variables σ and q of the solutes modified by ±1% in both solvents in three different force fields.The MBAR method, which produces good statistical accuracy at a modest computational cost, was used for SFE calculations.The changes in SFE due to changing q and σ are plotted in Fig. 1.Interestingly, the two solvents behave qualitatively different from each other.In cyclohexane, increased molecular volume yields to lower    SFE, while the charge has less influence.This is expected [65] since the cyclohexane model does not include polarizability.
In water, the SFE is dominated by electrostatic interactions and therefore both increasing the charge and reducing the volume leads to lower SFE, except for the large apolar solutes napthalene and anthracene, where increased volume decreases the SFE, likely due to loss of water-water interactions [66].
With equations and data in place, we can now inspect the SFE, its statistical error due to sampling σ SFE , and the uncertainty due to the data.Tables II in water varies from approximately −3 to −37 kJ/mol and in cyclohexane from approximately −3 to −42 kJ/mol, depending slightly on the force field used.The uncertainty due to sampling varies between 0.06 and 0.16 kJ/mol.However, the uncertainty due to the data σ ( f ) used in model derivation is much larger (Fig. 2).For cyclohexane it varies from 0.1 to 2 kJ/mol and in water the data-derived uncertainties σ ( f ) vary between 0.4 and 4 kJ/mol.The σ ( f ) for the three force fields are similar in most cases (Fig. 2).
Figure 3 shows that the two SFE derivatives (numbers in Tables II-IV) are highly correlated (Pearson r > 90%) and that the pattern is very similar for the three force fields.The slope of the fit is about −0.3, which can be interpreted to mean that a 1% change in molecular volume can be compensated by a −3% change in the charges; however, the scatter around the FIG. 3. Correlation between SFE derivatives (Tables II-IV).Circles denote the SFE in water and squares in cyclohexane.The diagonal, plus, and horizontal patterns in gray, orange, and blue symbols refer to CGenFF, GAFF, and OPLS data, respectively.regression lines is significant.The sensitivity of the SFE with respect to the force field parameters is found to be high and it is indeed well established that small changes in charges can affect hydration free energies significantly [24].
Errors in the experimental liquid density are typically smaller and therefore the contribution to σ ( f ) is smaller than that due to errors in experimental multipole moments (it should be noted that in many cases quantum chemistry is used to produce charges, adding an additional error source [40]).Nevertheless, the SFE is very sensitive to changes in the van der Waals radius σ in the Lennard-Jones 12-6 potential (Tables II-IV).Recent force fields employing softer potentials in combination with distributed charges [67,68] may be less sensitive to small changes in σ or the magnitude of the charges.
In the derivation of σ ( f ) we have assumed that the force field reproduces the experimental data perfectly.However, it is well known that this assumption does not hold.Cailliez and Pernot have studied model uncertainty using Bayesian optimization [43].They found that the uncertainty due to parametrization is larger than statistical uncertainty for both thermodynamic and transport properties.Ghahremanpour et al. have determined force field parameters based on the Alexandria library of quantum chemical properties [40].They used bootstrapping to estimate uncertainties in, for instance, atomic polarizabilities and found these to vary between 1% and 4% [69].These examples highlight that additional sources of error are present in force field calculations.

IV. CONCLUSION
We conclude that there are errors in SFE calculations, due to imperfect data, that are significant because of the high sensitivity of the SFE to changes in force field parameters (Tables II-IV).While a great deal of computational effort is typically used to produce well-converged free energy calculations with low error bars [5], the larger error sources discussed here are usually neglected.Finally, we note that the largest uncertainties are found for aromatic compounds, which is cumbersome since these are important in drug design and there is a lack of free energy data for such compounds [70].
FIG. 2. Plot of σ ( f ) corresponding to the SFE in (a) water and (b) cyclohexane.The diagonal, plus, and horizontal patterns in gray, orange, and blue bars refer to CGenFF, GAFF, and OPLS, respectively.

TABLE I
d Solid density.

TABLE II .
CGenFF: compounds with their SFE with statistical uncertainty due to sampling σ SFE , q 0 ∂V , and σ ( f ), all in kJ/mol.

TABLE III .
GAFF: compounds with their SFE with statistical uncertainty due to sampling σ SFE

TABLE IV .
OPLS: compounds with their SFE with statistical uncertainty due to sampling σ SFE , q 0