Small Sample Limit of the Bennett Acceptance Ratio Method and the Variationally Derived Intermediates

Free energy calculations based on atomistic Hamiltonians provide microscopic insight into the thermodynamic driving forces of biophysical or condensed matter systems. Many approaches use intermediate Hamiltonians that interpolate between the two states for which the free energy difference is calculated. The Bennett Acceptance Ratio (BAR) method and the Variationally Derived Intermediates (VI) provide the optimal estimator and intermediate states, respectively, in the sense that they minimize the mean-squared error of free energy calculations based on independent sampling. However, BAR and VI have been derived based on several approximations that do not hold for very few sample points. Analyzing one-dimensional test systems, we show that in such cases BAR and VI are suboptimal and that established uncertainty estimates are inaccurate. Whereas fewer than seven samples per state suffice in all cases for VI to become optimal, for BAR the required number increases unboundedly with decreasing configuration space density overlap of the end states. We show that for BAR the required number of samples is related to the overlap through an inverse power law. Because this relation appears to hold universally and almost independently of other system properties, these findings can guide the proper choice of estimator for free energy calculations.


INTRODUCTION
Free energy differences provide detailed insights into the molecular driving forces of biophysical processes, and their accurate calculation is crucial for successful application, e.g., in pharmaceutical ligand design or materials science [1][2][3][4][5][6][7]. To calculate the free energy difference between, e.g., two potential drug molecules bound to a receptor, alchemical equilibrium techniques [8] based on simulations with atomistic Hamiltonians are amongst the most widely used methods. Aside from the two states of interest, these techniques conduct sampling in intermediate states whose Hamiltonians are constructed from those of the end states. The step-wise summation of the free energy differences between adjacent states then yields the total free energy difference.
Two choices critically affect the accuracy of free energy calculations. The first is the choice of the estimator used to evaluate the free energy differences between the individual states. Whereas a number of estimators exist that have practical advantages in different situations [8][9][10], it has been shown that between two states the Bennett Acceptance Ratio (BAR) method [11] minimizes not only the variance, but also the mean-squared error (MSE) [12]. Remarkably, as will be revisited in the theory section, the Zwanzig formula [9] yields identical MSEs if applied together with an optimally chosen virtual intermediate state in which no sampling is conducted [10,12]. For BAR, the variance and the bias have been extensively analyzed [10][13][14][15][16][17]. As the MSE decomposes into the variance plus the squared bias, it accounts for both quantities; we will therefore focus our analysis on the MSE, which is also the relevant quantity from an application perspective.
The second choice concerns the functional form of the intermediate states, i.e., how these are constructed from the two end state Hamiltonians. Apart from the conventionally used linear interpolation intermediates, various functional forms have been suggested [18][19][20][21], with a particular focus on appearing or vanishing particles in solution [22][23][24][25][26]. In general, when using the Zwanzig formula or BAR as an estimator, and assuming independent samples, the Variationally Derived Intermediates (VI) [12,27,28] have been shown to yield the optimal MSE amongst all possible functional forms of intermediate states.
However, both BAR and VI have been derived using approximations that strictly hold only for large sample numbers. Their small-sample behavior becomes particularly important for free energy calculations of large systems or when using quantum mechanics based methods [29][30][31][32], which are computationally demanding and therefore provide limited sampling. Further, sample points derived from atomistic simulations are time-correlated, such that the effective number of independent sample points is often orders of magnitude smaller than the number of configurations obtained from a simulation. We will therefore analyze how the accuracy of BAR and VI depends on sample size, and show how the obtained scalings provide guidance on their proper use.

THEORY
Several different derivations of BAR have been published [11,33,34], resting on different assumptions. Here we recapitulate the one with the least restrictive assumptions, which also highlights the unexpected relation between estimators and intermediate states [12]. The generalization of this relation to N intermediate states has been used to derive VI. Both approaches rest on the Zwanzig formula [9]. Accordingly, the free energy difference between states A and B with Hamiltonians H_A(x) and H_B(x), respectively, is given by

∆G_{A,B} = −ln ⟨exp(−(H_B(x) − H_A(x)))⟩_A ,    (1)

where x ∈ ℝ^{3M} denotes the position of all M particles of the simulation system. Only sample points from state A are used, where ⟨·⟩_A denotes the ensemble average governed by Hamiltonian H_A(x). For ease of notation, all energies are expressed in units of k_B T. In the following, the free energy estimate that is obtained when the ensemble average in Eq. (1) is calculated from a finite sample of size n will be denoted by ∆G^{(n)}_{A→B}, whereas ∆G_{A,B} denotes the exact free energy difference. For statistically independent samples, the MSE of the free energy calculated via Eq. (1),

MSE[∆G^{(n)}_{A→B}] = ⟨(∆G^{(n)}_{A→B} − ∆G_{A,B})²⟩ ,    (2)

reads [12]

MSE[∆G^{(n)}_{A→B}] ≈ (1/n) ( ∫ ρ_B(x)²/ρ_A(x) dx − 1 ) ,    (3)

where

ρ_A(x) = exp(−H_A(x))/Z_A  and  ρ_B(x) = exp(−H_B(x))/Z_B

denote the configuration space densities and Z_A and Z_B the partition functions of the respective end states. Importantly, the derivation of the MSE of the Zwanzig formula, Eq. (3), and therefore also the optimization thereof leading to BAR and VI, is based on approximations. As a preliminary step, we consider the Hamiltonian H_B(x) − C, i.e., the Hamiltonian of end state B shifted by a constant C. Using this Hamiltonian with the Zwanzig formula, Eq. (1), the free energy difference between A and B is calculated as

∆G_{A,B} = C − ln ⟨exp(−(H_B(x) − C − H_A(x)))⟩_A .    (4)

We now denote the sample-based average from Eq. (4) as

y^{(n)}(C) = (1/n) ∑_{i=1}^{n} exp(−(H_B(x_i) − C − H_A(x_i)))    (5)

and the exact ensemble average as

y(C) = ⟨exp(−(H_B(x) − C − H_A(x)))⟩_A = exp(−(∆G_{A,B} − C)) .    (6)

For large n, using C ≈ ∆G_{A,B} implies y^{(n)}(C) ≈ y(C) ≈ 1. After expanding the MSE, Eq. (2) (for the full derivation see Ref. [12]), the expectation value of the estimate based on finite sampling and its square are approximated using the first order series expansion of the logarithm, ln y^{(n)}(C) ≈ y^{(n)}(C) − 1, around y^{(n)}(C) = 1, i.e.,

∆G^{(n)}_{A→B} ≈ C + 1 − y^{(n)}(C) .    (7)

Along similar lines, the exact difference and its square are approximated as

∆G_{A,B} ≈ C + 1 − y(C) .    (8)

Critically, for small n the averages y^{(n)}(C) and y(C) generally differ, and therefore C cannot be chosen such that both are approximately one. If, as in practice, C is evaluated based on the acquired samples such that y^{(n)}(C) = 1, then y(C) differs from one and, consequently, the first order series expansion of ln y(C) becomes inaccurate. If y^{(n)}(C) and y(C) differ by, e.g., less than 10 %, then the relative error of this approximation of the logarithm remains below 5 %. However, for larger differences the neglected higher order terms will contribute markedly. A similar effect is caused by small configuration space density overlaps of the end states: Due to wider distributions of the exponentially weighted differences H_B(x) − H_A(x), the variance of the sample-based averages y^{(n)}(C) increases, and with it the average absolute deviation from y(C).
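To make the role of the shift C concrete, the Zwanzig estimate and its sample-based exponential average y^(n)(C) can be sketched numerically. The following is a minimal sketch, assuming two shifted harmonic toy states of unit width (so the exact free energy difference vanishes); energies are in units of k_B T as in the text:

```python
import numpy as np

def H_A(x):
    return 0.5 * x**2            # harmonic end state A (toy assumption)

def H_B(x):
    return 0.5 * (x - 1.0)**2    # same well with shifted minimum; exact dG = 0

def zwanzig(samples_A, C=0.0):
    """Zwanzig estimate, Eq. (4), using the shifted Hamiltonian H_B - C."""
    # y^(n)(C): sample-based exponential average over samples from state A
    y_n = np.mean(np.exp(-(H_B(samples_A) - C - H_A(samples_A))))
    return C - np.log(y_n)

rng = np.random.default_rng(0)
samples_A = rng.normal(0.0, 1.0, size=100_000)  # independent samples from rho_A

dG = zwanzig(samples_A)  # close to the exact value 0 for this toy system
```

Note that for the plain Zwanzig estimate the shift C cancels exactly, so the estimate itself is independent of C; the role of C is confined to the error analysis, where it controls how close y^(n)(C) and y(C) are to one.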
In the next step, as shown in Fig. 1(a), an intermediate state I is used to derive the BAR formula via ∆G^{(n)}_{A→I} − ∆G^{(n)}_{B→I}. We refer to I as a virtual intermediate because it only serves as a target state for the Zwanzig formula, without actually being used for sampling. The derivation based on the above approximations [12] yielded an additive MSE in this case, i.e., the MSE of the total estimate is

MSE[∆G^{(n)}_{A→I}] + MSE[∆G^{(n)}_{B→I}] .    (9)

For easier notation, we assume that the same number of samples n is available for the two end states. Minimizing Eq. (9) through a variational approach leads to the Hamiltonian of the optimal virtual intermediate [12],

H_I(x) = ln( exp(H_A(x) + C) + exp(H_B(x)) ) ,    (10)

where the MSE is minimal if C = ∆G_{A,B} and approaches that minimum as C approaches ∆G_{A,B}. The resulting configuration space density is shown in Fig. 1. Let us compare the result using ∆G^{(n)}_{A→I} − ∆G^{(n)}_{B→I} with the intermediate of Eq. (10) to the original approach by Bennett [11], which uses a suitably chosen weighting function w(x),

∆G_{A,B} = ln ⟨w(x) exp(−H_A(x))⟩_B − ln ⟨w(x) exp(−H_B(x))⟩_A .    (11)

Bennett optimized the weighting function with respect to the variance, which yields the widely used BAR result

∆G_{A,B} = ln ( ⟨f(H_A(x) − H_B(x) + C)⟩_B / ⟨f(H_B(x) − H_A(x) − C)⟩_A ) + C ,    (12)

where f(x) = 1/(1 + e^x) is the Fermi function and C ≈ ∆G_{A,B} has to be determined iteratively. From Eq. (11) and ∆G^{(n)}_{A→I} − ∆G^{(n)}_{B→I} with Eq. (1) it follows that the two approaches are equivalent if the weighting function relates to the Hamiltonian of the virtual intermediate state through

w(x) = exp(H_A(x) + H_B(x) − H_I(x)) .    (13)

Therefore, any Hamiltonian of a virtual intermediate state corresponds to a weighting function, and vice versa.
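The iterative determination of the constant C in the BAR result can be sketched as a simple fixed-point iteration. This is a minimal sketch under the assumption of equal sample sizes, using two shifted harmonic toy states rather than the paper's harmonic/quartic test system:

```python
import numpy as np

def fermi(x):
    """Fermi function f(x) = 1/(1 + e^x)."""
    return 1.0 / (1.0 + np.exp(x))

def bar(dH_A, dH_B, tol=1e-10, max_iter=500):
    """Self-consistent BAR estimate for equal sample sizes.
    dH_A: H_B(x) - H_A(x) evaluated on samples drawn in state A.
    dH_B: H_A(x) - H_B(x) evaluated on samples drawn in state B."""
    C = 0.0
    for _ in range(max_iter):
        num = np.mean(fermi(dH_B + C))  # <f(H_A - H_B + C)>_B
        den = np.mean(fermi(dH_A - C))  # <f(H_B - H_A - C)>_A
        C_new = C + np.log(num / den)   # iterate until C reproduces itself
        if abs(C_new - C) < tol:
            return C_new
        C = C_new
    return C

# Toy states H_A = x^2/2 and H_B = (x-1)^2/2, hence the exact dG is 0
rng = np.random.default_rng(1)
xA = rng.normal(0.0, 1.0, 50_000)
xB = rng.normal(1.0, 1.0, 50_000)
dG_bar = bar(0.5 - xA, xB - 0.5)
```

At self-consistency the two Fermi averages coincide, so the update leaves C unchanged; the fixed-point map is contracting for overlapping states, which makes this simple iteration reliable in practice.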
The variance of BAR [11] is given by

Var[∆G^{(n)}] ≈ (2/n) (1/Ω − 1) ,    (14)

with

Ω = 2 ∫ ρ_A(x) ρ_B(x) / (ρ_A(x) + ρ_B(x)) dx ,    (15)

where Ω can be interpreted as an overlap measure. Within the limits of the approximations discussed above, Bennett's variance, Eq. (14), equals the MSE, Eq. (3), of applying the Zwanzig formula in two steps, as is shown in Appendix A. This link between BAR and virtual intermediates, Eq. (13), allows creating different estimators and transforming them between the formalism of an intermediate state and that of a weighting function. Here, we apply this result and compare BAR to the estimator that uses H_I(x) = (H_A(x) + H_B(x))/2 as the virtual intermediate state. Because H_I(x) is a linear interpolation, we will refer to the resulting estimator as the 'linear estimator'; it is also known as the Simple Overlap Sampling method [35,36]. The resulting configuration space density is shown by the green dashed line in Fig. 1. Its MSE follows from Eq. (3) as

MSE ≈ (2/n) ( ( ∫ √(ρ_A(x) ρ_B(x)) dx )^{−2} − 1 ) .    (16)

The term in round brackets of Eq. (16) involves ∫ √(ρ_A ρ_B) dx, which can be interpreted as an overlap measure, different from the one above, that equals one for two identical configuration space densities and zero for disjunct supports. Next, any number of optimal intermediate states can be derived by extending Eq. (9) with the MSEs of additional steps. Here, we focus our analysis on only one intermediate state S for sampling, i.e., calculations of the form A → I ← S → I ← B. The optimization with variational calculus with respect to all intermediate Hamiltonians yields the VI. These consist of, firstly, Eq. (10) (the BAR equivalent) as the optimal Hamiltonian of the virtual intermediates and, secondly, the optimal sampling Hamiltonian H_S(x), which is determined through the solution of a self-consistent equation, Eq. (17). The initially unknown ratios of the partition sums are determined iteratively, similar to the constant C for BAR. The converged VI for the harmonic and quartic end states are shown in Fig. 1(c). In summary, for small n, BAR and VI result from the accurate optimization of an inaccurate MSE.
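Both overlap measures can be evaluated by numerical integration on a grid, which is also how Ω is obtained for the test systems below. The following is a minimal sketch for two shifted harmonic toy states; the normalizations (identical densities give an overlap of one) are assumptions of this sketch and should be checked against Eqs. (15) and (16):

```python
import numpy as np

# Integration grid for a toy one-dimensional system
x = np.linspace(-12.0, 12.0, 48001)
dx = x[1] - x[0]

def boltzmann_density(H):
    """Normalized configuration space density rho ~ exp(-H)."""
    p = np.exp(-H)
    return p / (np.sum(p) * dx)

rho_A = boltzmann_density(0.5 * x**2)          # harmonic state A
rho_B = boltzmann_density(0.5 * (x - 3.0)**2)  # state B, shifted by x0 = 3

# Bennett-type overlap entering the BAR variance, cf. Eq. (15)
Omega = np.sum(2.0 * rho_A * rho_B / (rho_A + rho_B)) * dx

# Overlap entering the MSE of the linear estimator, cf. Eq. (16)
overlap_lin = np.sum(np.sqrt(rho_A * rho_B)) * dx
```

Since the pointwise harmonic mean of two densities never exceeds their geometric mean, Ω is always bounded from above by the linear-estimator overlap.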
Naturally, this does not ensure that no better estimators and intermediate sampling states exist; whether they do is therefore the subject of our test simulations.

METHODS
In the first step, we assess the MSEs of different estimators. To this aim, we consider the one-dimensional system with end states consisting of a harmonic and a quartic Hamiltonian, as shown in Fig. 1(b). Based on n sample points drawn from the configuration space densities of A and B, the free energy estimate ∆G^{(n)}_{A→B} is obtained and compared to the exact difference ∆G_{A,B}. Rejection sampling is used to obtain uncorrelated sample points. The MSE, Eq. (2), is then calculated by averaging over one million such realizations. We use n = 1, 20 and 1000 sample points per end state. For each n, we consider 82 different setups in which the potential of end state B is moved horizontally away from A by varying x0, thereby covering a range of overlaps Ω, which are obtained through numerical integration of Eq. (15).
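The realization-averaging procedure can be sketched as follows. For brevity, this sketch uses two shifted harmonic toy states with direct Gaussian sampling, the linear estimator as an example, and far fewer realizations than the one million used here (all simplifications of the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

def linear_estimator(xA, xB):
    """Two-step Zwanzig estimate via the virtual intermediate
    H_I = (H_A + H_B)/2 for the toy states H_A = x^2/2, H_B = (x-1)^2/2."""
    dH_A = 0.5 - xA   # H_B - H_A on samples from A
    dH_B = xB - 0.5   # H_A - H_B on samples from B
    dG_AI = -np.log(np.mean(np.exp(-0.5 * dH_A)))  # step A -> I
    dG_BI = -np.log(np.mean(np.exp(-0.5 * dH_B)))  # step B -> I
    return dG_AI - dG_BI

def empirical_mse(n, n_realizations=2000, dG_exact=0.0):
    """MSE, Eq. (2): average squared deviation from the exact value over
    many independent realizations of n samples per end state."""
    sq_err = np.empty(n_realizations)
    for k in range(n_realizations):
        xA = rng.normal(0.0, 1.0, n)  # exact sampling from rho_A
        xB = rng.normal(1.0, 1.0, n)  # exact sampling from rho_B
        sq_err[k] = (linear_estimator(xA, xB) - dG_exact) ** 2
    return np.mean(sq_err)

mse_n5 = empirical_mse(n=5)
mse_n100 = empirical_mse(n=100)
```

As expected, the empirical MSE decreases markedly with the sample size n per end state.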
With this procedure, we compare three variants. Firstly, to separate the effects of an inaccurate estimate of C, BAR with C set to the (in practice unknown) exact free energy difference. Secondly, BAR with C determined iteratively from the sample set, as done in practice. Thirdly, the linear estimator.
In the second step, aside from sampling in the end states, sampling is also conducted in one intermediate state S, and a similar procedure as above is used to evaluate the MSEs of different Hamiltonians H_S(x). Separate sample sets in S are used to evaluate the free energy differences to either end state, as using the same sample set would introduce correlations between the two step-wise free energy estimates that would require a different analytic approach than the one described above [27]. Again, three variants are compared: Firstly, the VI, i.e., Eqs. (10) and (17). For simplicity, only exact estimates for C and the ratios of the partition sums are considered. Secondly, as a comparison, two variants with a linearly interpolated sampling Hamiltonian: one using the linear estimator, and another using BAR to evaluate the step-wise free energy differences. Again, the procedure was conducted for n = 1, 20 and 1000 sample points per sample set.

RESULTS AND DISCUSSION
The MSEs of the three estimator variants are shown in Fig. 2(a)-(c) for different configuration space density overlaps Ω between the harmonic and the quartic end state. The panels show this relation for different sample sizes n. As can be seen, for n = 1 both variants of BAR (blue and green) are suboptimal for all Ω, as they yield a larger MSE than the linear estimator (yellow). For n = 20, it depends on Ω whether BAR is suboptimal. Here, a turning point exists, i.e., the linear estimator is only better for approximately Ω < 10^{−1}, whereas both BAR variants yield better MSEs for larger Ω. For n = 1000, this turning point shifts towards smaller Ω. Here, the BAR variants perform better for around Ω > 10^{−3}. Note that as the end states differ in form, the largest achievable overlap is Ω = 0.935, and therefore no MSE of zero can be seen in Fig. 2(a)-(c), as would be expected for Ω = 1.
Unexpectedly, whereas for most n and Ω both BAR variants have very similar MSEs, the variant in blue, where C = ∆G_{A,B} (i.e., the exact free energy difference) was used, yields slightly larger MSEs than the variant that uses a sample-based estimate of C (green). This finding is in contrast to the widespread belief that an estimate of C that deviates from ∆G_{A,B} is a major contribution to the inaccuracy of BAR. The reason for this behavior lies in the first order series expansions of ln y(C) and ln y^{(n)}(C), as shown in the context of Eqs. (7) and (8) in the theory section. For small n, y^{(n)}(C) and y(C) differ, and C can therefore not be chosen such that the requirement is met that both are close to one. As a consequence, even if C = ∆G_{A,B} such that y(C) = 1, the first order series expansion of ln y^{(n)}(C) becomes inaccurate, and the same holds true for the subsequent derivation of BAR.
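The accuracy threshold quoted in the theory section (a 10 % deviation of y from one gives roughly a 5 % error of the expansion) is easy to verify numerically:

```python
import numpy as np

def expansion_rel_error(y):
    """Relative error of the first-order approximation ln(y) ~ y - 1."""
    return abs((y - 1.0) - np.log(y)) / abs(np.log(y))

err_small = expansion_rel_error(1.1)  # 10 % deviation from y = 1
err_large = expansion_rel_error(1.5)  # 50 % deviation: higher orders dominate
```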
The dashed lines in Fig. 2(a)-(c) show the predicted MSEs for BAR, i.e., Eq. (14), whereas the dotted lines show those of the linear estimator, Eq. (16). As can be seen from Fig. 2(a), for n = 1 the predicted MSEs are much too small. Furthermore, BAR is predicted to have a better MSE than the linear estimator, which is not the case for the results of the test simulations. For n = 20, the MSEs start to agree for large Ω, but still deviate substantially for small Ω. For BAR with n = 1000, the MSEs agree well for most Ω. For the linear estimator, the prediction is still mostly only accurate for large Ω.

FIG. 2. (a)-(c) MSEs for the test system of Fig. 1(b) for sample sizes of n = 1, 20 and 1000. The MSEs are shown as a function of the configuration space density overlap Ω, where different Ω were obtained by varying x0 of the quartic end state. The results of two variants of BAR are shown: firstly, using a constant C that equals the exact free energy difference (blue), and secondly, for C that was iteratively determined for each set of samples (green). The MSE of the linear estimator is shown in yellow. The dashed and the dotted lines show the analytical MSEs calculated based on approximations for BAR and the linear estimator, respectively, i.e., Eqs. (14) and (16). (d) Setups used for the test simulations yielding the results shown by the respective Roman numerals in (e). Setup I is identical to the one in Fig. 1(b). (e) The minimum sample size n required such that BAR with an exact C yields a better (i.e., smaller) MSE than the linear estimator, shown as a function of Ω. The solid lines show the function n = b Ω^{−a} fitted to the data points in the respective colors. The fit coefficients a and b are provided in the legend.
Interestingly, unlike at n = 1, Eq. (16) predicts an MSE that is larger than the one obtained from the test simulations for n = 1000. These results show that BAR is only optimal in cases where the predicted MSE is close to the actual one. Where the predicted MSE is inaccurate, BAR, being the optimization thereof, becomes suboptimal.
As the turning point in Ω above which BAR becomes optimal varies with n, the question arises how the required n relates to Ω, and how this relation compares between different systems. Therefore, in the next step we test how many sample points are required for BAR to achieve a smaller MSE than the linear estimator, depending on the configuration space density overlap. To this aim, the first variant is used (exact C). Starting with n = 1, the MSEs of both BAR and the linear estimator are calculated, and n is gradually increased until the turning point is found. In addition to the setup consisting of end states with a harmonic and a quartic Hamiltonian, three other diverse systems are considered, which are shown in Fig. 2(d). Again, for each system different horizontal shifts are used to vary Ω. The definitions and parameters of these systems are described in Appendix C.
The required number of sample points n is shown in dependence of Ω in Fig. 2(e). The four colors indicate the different test systems with the corresponding Roman numerals from Fig. 2(d). The required n closely follow a linear relation in the log-log plot, indicating a relation of the form n = b Ω^{−a}. Fits of this form are shown as solid lines and the fit coefficients are provided in the legend of Fig. 2(e). Interestingly, the relation between n and Ω is very similar for all four test systems, suggesting that Ω and n are almost the sole factors that determine which estimator is superior.

FIG. 3. As in Fig. 2(a)-(c), test simulations with a harmonic and a quartic end state were used, and (a)-(c) show the results for sample sizes of n = 1, 20 and 1000, respectively, in each state as a function of the configuration space density overlap Ω between the end states. The results of two variants using a linear intermediate state are shown: firstly, using the linear estimator (yellow) and secondly, using BAR (red) to evaluate the step-wise free energy differences. The MSE of VI, which includes using virtual intermediate states that correspond to BAR as shown in Fig. 1(c), is shown in blue. The respective analytical MSEs are shown as black dashed, dotted and dash-dotted lines.

Again, for n = 1 the predicted MSEs are much smaller than the actual ones. However, already for n = 20, the actual MSE for VI is only slightly larger than the prediction, and matches perfectly for n = 1000. For the linear intermediate at n = 20, the predictions for both BAR and the linear estimator hold only for larger overlaps. For n = 1000, the prediction for BAR matches the actual MSEs very well, whereas for the linear estimator the prediction reproduces the trend but slightly overestimates the MSEs for small overlaps. We also tested how many sample points n are required per state for VI to be optimal.
Whereas for systems with large Ω, two or three sample points per state suffice, in no case does the required number of sample points exceed seven per state (data therefore not shown).
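The inverse power law n = b Ω^{−a} from Fig. 2(e) corresponds to a straight line in the log-log plot, so its coefficients follow from linear regression of ln n on ln Ω. A minimal sketch on synthetic, noise-free data (the values a = 1.2 and b = 2.0 below are illustrative placeholders, not the fitted coefficients of the test systems):

```python
import numpy as np

def fit_power_law(Omega, n):
    """Fit n = b * Omega**(-a) via linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(Omega), np.log(n), 1)
    return -slope, np.exp(intercept)  # (a, b)

Omega = np.logspace(-4, -1, 12)       # synthetic overlap values
n_required = 2.0 * Omega ** (-1.2)    # placeholder power law

a, b = fit_power_law(Omega, n_required)
```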
These results show that, again, the predicted MSEs are inaccurate for small n. As a consequence, the VI, which have been derived as an optimization thereof, are suboptimal. However, when using an intermediate sampling state, the MSE predictions become accurate, and VI becomes optimal, for much smaller n than for BAR. We attribute this unexpected result mainly to the fact that for VI the sampling intermediate still maintains a large overlap with both end states, even if their configuration space densities are entirely disjunct.

SUMMARY AND CONCLUSION
We have shown that for small sample sizes n the analytically calculated MSEs of free energy estimates based on the Zwanzig formula become increasingly inaccurate due to approximations in their derivation. As a consequence, BAR and VI, which have been derived as an optimization thereof, become suboptimal for small n, as demonstrated through the existence of better alternatives. For BAR, even if the constant C is set to the exact free energy difference, this suboptimality not only remains, but is even slightly more pronounced than when C is estimated based on the samples.
Whether BAR and VI are optimal depends, aside from n, on the configuration space density overlap Ω, because for small Ω the fluctuations of the exponential averages increase. However, whereas BAR is suboptimal even for n > 1000 if Ω < 10^{−3}, VI is already better than all other tested variants for n ≥ 7, independent of Ω, owing to the fact that the overlap between adjacent states is largely increased when using an intermediate sampling state. For BAR, Ω was almost the sole factor that determined how many sample points were required to outperform the linear estimator. The relation follows an inverse power law of the form n = b Ω^{−a}, with very similar coefficients a and b for all four test systems considered.
It should be emphasized again that in atomistic simulations subsequent sample points are correlated, whereas the theory in this work relies on the common assumption of independent sample points. Therefore, the n provided here above which BAR is optimal will typically refer to the effective number of statistically independent sample points, which is often much smaller than the actual sample size. The small-sample effects on the MSE assessed here can therefore be relevant in macromolecular applications even for quite large sample sizes.
For such applications, instead of monitoring the variance or MSE directly (as implemented in many simulation software packages), we recommend, firstly, to consider Ω. Secondly, packages such as alchemical-analysis.py [37] analyze the time correlations between sample points and give an estimate of the number of independent ones. Thirdly, the relation between the required n and Ω from this work then indicates whether BAR is optimal or whether another estimator, such as the linear one, should be used instead. Furthermore, in cases where BAR becomes close to suboptimal, the uncertainty estimates also become inaccurate, and other methods such as bootstrapping should be considered. Whereas BAR will remain the optimal estimator in many cases, these findings can help to ensure that optimal estimators are also employed in challenging applications.
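The three steps above can be condensed into a simple decision sketch. The coefficients a and b below are hypothetical placeholders that would have to be taken from fits such as those in Fig. 2(e), and n_eff denotes the effective number of statistically independent samples obtained, e.g., from a time-correlation analysis:

```python
def recommend_estimator(n_eff, Omega, a=1.2, b=2.0):
    """Decision sketch: prefer BAR only if the effective number of
    independent samples exceeds the power-law threshold b * Omega**(-a).
    The default coefficients a, b are illustrative placeholders."""
    n_required = b * Omega ** (-a)
    return "BAR" if n_eff >= n_required else "linear estimator"

choice_easy = recommend_estimator(n_eff=1000, Omega=0.1)   # ample sampling
choice_hard = recommend_estimator(n_eff=50, Omega=1e-3)    # tiny overlap
```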

Appendix A: Proof of MSE Equivalence to BAR Variance
The Zwanzig formula [9], Eq. (1), is used in two steps, as shown in Fig. 1(a). The MSE of a single step is given through Eq. (3). Therefore, the total MSE is calculated through

MSE[∆G^{(n)}_{A→I}] + MSE[∆G^{(n)}_{B→I}] ≈ (1/n) ( ∫ ρ_I(x)²/ρ_A(x) dx + ∫ ρ_I(x)²/ρ_B(x) dx − 2 ) ,    (A1)

where ρ_I(x) denotes the normalized configuration space density of the virtual intermediate. Inserting the Hamiltonian of the optimal virtual intermediate, Eq. (10), leads to