Multilevel Monte Carlo for quantum mechanics on a lattice

Monte Carlo simulations of quantum field theories on a lattice become increasingly expensive as the continuum limit is approached, since the cost per independent sample grows with a high power of the inverse lattice spacing. Simulations on fine lattices suffer from critical slowing down, the rapid growth of autocorrelations in the Markov chain with decreasing lattice spacing. This causes a strong increase in the number of lattice configurations that have to be generated to obtain statistically significant results. This paper discusses hierarchical sampling methods to tame this growth in autocorrelations. Combined with multilevel variance reduction, this significantly reduces the computational cost of simulations for given tolerances $\epsilon_{\text{disc}}$ on the discretisation error and $\epsilon_{\text{stat}}$ on the statistical error. For an observable with a discretisation error of order $\mathcal{O}(a^{\alpha})$ and an integrated autocorrelation time that grows like $\tau_{\mathrm{int}}\propto a^{-z}$, multilevel Monte Carlo (MLMC) can reduce the cost from $\mathcal{O}(\epsilon_{\text{stat}}^{-2}\epsilon_{\text{disc}}^{-(1+z)/\alpha})$ to $\mathcal{O}(\epsilon_{\text{stat}}^{-2}\vert\log \epsilon_{\text{disc}} \vert^2+\epsilon_{\text{disc}}^{-1/\alpha})$. Even higher performance gains are expected for simulations of quantum field theories in $D$ dimensions. The efficiency of the approach is demonstrated on two model systems, including a topological oscillator that is badly affected by critical slowing down due to freezing of the topological charge. On fine lattices, the new methods are orders of magnitude faster than standard sampling based on Hybrid Monte Carlo. For high resolutions, MLMC can be used to accelerate even the cluster algorithm for the topological oscillator. Performance is further improved through perturbative matching, which guarantees efficient coupling of theories on the multilevel hierarchy.


Introduction
The Euclidean path integral formulation of quantum mechanics [1] allows the calculation of observable quantities as expectation values with respect to infinite-dimensional and highly peaked probability distributions. After discretising the theory on a lattice with finite spacing a, expectation values are computed with Markov Chain Monte Carlo methods (see for example [2] for a highly accessible introduction). This approach is elegant and attractive since it can be extended to quantum field theories, where it allows first-principles predictions for strongly interacting theories such as Quantum Chromodynamics (QCD), see e.g. [3,4]. Ultimately, however, one is interested in the value of observables in the continuum limit of vanishing lattice spacing a → 0. Since the cost of the calculation grows with a high power of $a^{-1}$, efficient Monte Carlo sampling techniques are crucial to obtain precise and accurate numerical predictions. Today state-of-the-art techniques [5] are routinely used to accelerate the Metropolis-Hastings algorithm [6,7], and in particular the Hybrid Monte Carlo (HMC) method [8] has proved to be highly successful in lattice QCD simulations. However, lattice calculations with HMC methods still become prohibitively expensive as the continuum limit is approached. The reasons for this are twofold:

1. For quantum mechanical problems the cost $C_{\text{sample}}$ of generating a single discretised path grows at least in proportion to the number of lattice sites, which increases with $a^{-1}$ if the physical size of the simulation box is kept fixed (for a quantum field theory in D dimensions the growth would be even faster, with a cost of $a^{-D}$ per configuration).
2. As the theory approaches a critical point, subsequent states in the Markov chain are increasingly correlated, which requires the generation of more paths to obtain a given number of statistically independent samples.
Furthermore, the law of large numbers dictates that to reduce the statistical (sampling) error below a given tolerance $\epsilon_{\text{stat}}$, at least $N_{\text{indep}} \propto \epsilon_{\text{stat}}^{-2}$ independent samples have to be generated. While in lattice QCD the continuum limit is usually taken by extrapolating simulations at different lattice spacings a and fixed tolerance $\epsilon_{\text{stat}}$ on the statistical error, in the multilevel Monte Carlo literature (see e.g. the classical paper [9]) it is more common to decrease $\epsilon_{\text{stat}}$ in proportion to the tolerance $\epsilon_{\text{disc}}$ on the discretisation error as the lattice spacing is reduced. Reducing the combined statistical and discretisation error in this way makes optimal use of computational resources to obtain a result with a given total error for a specific fine lattice spacing.
The correlation of subsequent samples in the Markov chain is quantified by the integrated autocorrelation time $\tau_{\text{int}}$, which grows particularly rapidly for some quantities, such as the topological susceptibility $\chi_t$ in QCD, where it has been observed that $\tau_{\text{int}} \propto a^{-z}$ with z = 5 [10]. This is attributed to "freezing" of the topological charge, and can lead to observable effects. Those can be both direct, since e.g. the mass of the η′ meson receives important contributions from the topological susceptibility in a pure Yang-Mills theory [11,12], and more indirect, due to the coupling of slow modes with large autocorrelation times to other observables. The authors of [10] further report a milder but still significant growth with z = 0.5–1.0 for a range of other physically relevant observables. While the rapid growth of the integrated autocorrelation time for the topological susceptibility can be addressed by using open boundary conditions in time [13,14], this introduces additional complications since it requires lattices with a very large extent in the temporal direction.
To estimate the overall growth in cost of a simulation as the lattice spacing a is reduced, consider a quantum mechanical observable with a discretisation error that is $\mathcal{O}(a^{\alpha})$, where values such as α = 1, 2 are typical. To reduce the discretisation error below a tolerance of $\epsilon_{\text{disc}}$ and the statistical error below $\epsilon_{\text{stat}}$ incurs a cost

$$C_{\text{StMC}}(\epsilon_{\text{disc}}, \epsilon_{\text{stat}}) = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2}\,\epsilon_{\text{disc}}^{-(1+z)/\alpha}\right) \qquad (1)$$

with standard Monte Carlo (StMC), since $\epsilon_{\text{disc}} \propto a^{\alpha}$. To get an intuitive understanding of this and subsequent complexity estimates it might be instructive to consider the special case α = 2, z = 0: since the discretisation error decreases quadratically with a, reducing this error by a factor of 4 can be achieved by halving the lattice spacing, which in turn doubles the cost for generating a single sample if the physical size of the simulation box is kept fixed; in other words, the cost per sample grows in proportion to $\epsilon_{\text{disc}}^{-1/2}$.

In this paper, it is shown how this explosion in computational cost can be significantly reduced with hierarchical sampling [15] and Multilevel Monte Carlo (MLMC) [16,9], which has recently been extended to a Markov chain setting [17,18]. To generate samples, a hierarchy of L − 1 coarser lattices with spacings of $2a, 4a, \dots, 2^{L-1}a$ and corresponding coarse-grained versions of the original theory are constructed. Based on this hierarchy, a recursive implementation of the delayed acceptance method in [15] is proposed. Starting on the coarsest level, proposals are successively extended by additional modes and screened with a standard Metropolis-Hastings accept/reject step on increasingly finer lattices. At this point it is important to stress that the coarse lattices are only used to accelerate sampling and do not introduce any additional bias, because ultimately each new sample is accepted or rejected with the correct, original action on the finest lattice. Since evaluating the action on the coarse lattices is substantially cheaper, the cost of generating a single fine level sample is not substantially higher than if a single-level sampler was used. In fact, when compared to a method such as HMC, it may be smaller, since the cost of generating an HMC trajectory can be shifted to the coarsest level, where the trajectory is substantially shorter. Since on each level proposals are screened with a Metropolis-Hastings accept/reject step, the method samples from the correct distribution on the original lattice with spacing a and does not introduce any additional bias, cf. [15]. Due to the convergence of the lattice theories on subsequent levels of the hierarchy as a → 0, hierarchical sampling eliminates the growth in autocorrelation time, reducing the computational cost to

$$C_{\text{Hier}}(\epsilon_{\text{disc}}, \epsilon_{\text{stat}}) = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2}\,\epsilon_{\text{disc}}^{-1/\alpha}\right). \qquad (2)$$

MLMC is a variance reduction technique, which uses the fact that the expectation value of an observable (or quantity of interest) Q on a lattice with spacing a can be written as a telescoping sum. For this assume that there is some integer $L \in \mathbb{N}$ and a constant $a_0$ such that $a = 2^{-L+1} a_0$. Further, let $Q_\ell$ be the observable measured on a lattice with spacing $2^{-\ell} a_0$. Then

$$\mathbb{E}[Q] = \mathbb{E}[Q_{L-1}] = \mathbb{E}[Q_0] + \sum_{\ell=1}^{L-1} \mathbb{E}[Q_\ell - Q_{\ell-1}] \approx \widehat{Y}_0 + \sum_{\ell=1}^{L-1} \widehat{Y}_\ell, \qquad (3)$$

where

$$\widehat{Y}_0 = \frac{1}{N_0} \sum_{j=1}^{N_0} Q_0^{(j)}, \qquad \widehat{Y}_\ell = \frac{1}{N_\ell} \sum_{j=1}^{N_\ell} \left( Q_\ell^{(j)} - Q_{\ell-1}^{(j)} \right) \quad \text{for } \ell = 1, \dots, L-1.$$

Here, the sums in $\widehat{Y}_\ell$ are taken over independent samples, labelled by the superscript "(j)". The key observation is that, except for the very coarsest level, MLMC
estimates differences of the observable instead of the quantity of interest itself. Provided that theories on subsequent levels can be coupled efficiently and the variance of the difference $Q_\ell - Q_{\ell-1}$ decreases sufficiently rapidly as the lattice spacing a is reduced, significantly lower numbers of samples $N_\ell$ are sufficient on the finer levels of the grid hierarchy. The majority of the cost can be shifted to the coarser levels, where sampling is substantially cheaper. Due to the exactness of the telescoping sum (i.e. the first equality in Eq. (3)), MLMC does not introduce any additional bias if the individual MC estimators $\widehat{Y}_\ell$ are unbiased. The algorithms described in the paper allow the construction of estimators $\widehat{Y}_\ell$ which have an arbitrarily small bias. In the numerical results presented below the size of this bias is comparable to the discretisation error on the original, fine level lattice. Compared to Eqs. (1) and (2), MLMC further reduces the computational complexity to

$$C_{\text{MLMC}}(\epsilon_{\text{disc}}, \epsilon_{\text{stat}}) = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2} \vert \log \epsilon_{\text{disc}} \vert^2 + \epsilon_{\text{disc}}^{-1/\alpha}\right), \qquad (4)$$

see below. Similar estimates have been derived in [9,17,18] and it has been demonstrated numerically that MLMC leads to a significant reduction in computational complexity and overall runtime for a range of applications, e.g. in uncertainty quantification (UQ) for sub-surface flow [19,17], inverse problems [20] or material simulation [21].
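To make the telescoping-sum idea concrete, here is a minimal sketch in Python. The interfaces `sampler` and `quantity` are hypothetical stand-ins for the coupled per-level samplers and observables; the toy example in the usage note simply couples levels through a shared random draw, so this is an illustration of the estimator structure, not the paper's algorithm:

```python
import numpy as np

def mlmc_estimate(sampler, quantity, L, num_samples, rng):
    """Estimate E[Q_{L-1}] via the telescoping sum
    E[Q_0] + sum_l E[Q_l - Q_{l-1}], cf. Eq. (3).
    `sampler(l, rng)` draws one coupled sample usable on levels l and l-1;
    `quantity(l, x)` evaluates the level-l observable on it."""
    estimate = 0.0
    for l in range(L):
        acc = 0.0
        for _ in range(num_samples[l]):
            x = sampler(l, rng)
            if l == 0:
                acc += quantity(0, x)               # coarsest level: Y_0 = Q_0
            else:
                acc += quantity(l, x) - quantity(l - 1, x)  # Y_l = Q_l - Q_{l-1}
        estimate += acc / num_samples[l]
    return estimate
```

Because the levels share the same underlying draw, the differences $Q_\ell - Q_{\ell-1}$ have small variance and most samples can be spent on the coarsest level, mimicking the MLMC cost shift.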
While this paper focuses on the application of these new methods in quantum mechanics, the ultimate goal is to apply them in D-dimensional quantum field theories, such as lattice QCD with D = 4 and α = 2. For D > α, the expected gains are even larger, since the cost to generate a single configuration grows like $a^{-D}$ instead of $a^{-1}$, while the accuracy still improves no faster than $a^{2}$. The predicted improvement in computational performance is summarised in the following diagram, generalising Eqs. (1), (2) and (4) to D dimensions:

$$C_{\text{StMC}} = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2}\,\epsilon_{\text{disc}}^{-(D+z)/\alpha}\right) \;\longrightarrow\; C_{\text{Hier}} = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2}\,\epsilon_{\text{disc}}^{-D/\alpha}\right) \;\longrightarrow\; C_{\text{MLMC}} = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2} \vert \log \epsilon_{\text{disc}} \vert^2 + \epsilon_{\text{disc}}^{-D/\alpha}\right). \qquad (5)$$

To discuss this further, consider first the relative advantage of MLMC over standard Monte Carlo in the continuum limit $\epsilon_{\text{disc}} \to 0$ for fixed $\epsilon_{\text{stat}}$. MLMC only requires the generation of a small number of samples on the finest lattice for small $\epsilon_{\text{disc}}$ (eventually only one for very small $\epsilon_{\text{disc}}$), whereas the number of configurations that have to be generated with a standard Monte Carlo method is proportional to $\epsilon_{\text{stat}}^{-2}$. As can be seen from the final two entries of Eq. (5), MLMC is a factor of $\kappa\,\epsilon_{\text{stat}}^{-2}$ faster than standard Monte Carlo with hierarchical sampling for $\epsilon_{\text{disc}} \to 0$. This argument holds for general α and D; the coefficient κ depends on the relative cost of generating independent samples on the finest level and the coarser levels. Provided those costs are proportional to the number of unknowns on each level (and the constant of proportionality is independent of $\epsilon_{\text{disc}}$) we expect κ to lie between 1 and 2.
If $\epsilon_{\text{stat}}$ is kept fixed as the continuum limit is taken, eventually the statistical error will dominate the discretisation error. To avoid this, one might consider the case where $\epsilon_{\text{disc}} = \epsilon_{\text{stat}} = \epsilon/\sqrt{2}$ and the combined root mean square error is reduced below some given tolerance ε. This is the common choice in the multilevel Monte Carlo literature (see e.g. [9]). In that case, the complexity estimates in Eq. (5) become

$$C_{\text{StMC}} = \mathcal{O}\!\left(\epsilon^{-2-(D+z)/\alpha}\right) \;\longrightarrow\; C_{\text{Hier}} = \mathcal{O}\!\left(\epsilon^{-2-D/\alpha}\right) \;\longrightarrow\; C_{\text{MLMC}} = \mathcal{O}\!\left(\epsilon^{-2} \vert \log \epsilon \vert^2 + \epsilon^{-D/\alpha}\right). \qquad (6)$$

In quantum field theories, coarse grained actions are naturally obtained by integrating out high-frequency modes in a renormalisation group (RG) transformation, which results in an effective theory with fewer degrees of freedom. In practice, the RG transformation can be carried out either non-perturbatively (e.g. through a block spin transformation) or through perturbative matching. The latter would, in fact, be sufficient for MLMC, as long as the variance of $Y_\ell$ decays sufficiently rapidly, since the coarse levels are only used to accelerate sampling on the original, fine level. For asymptotically free theories, such as lattice QCD, Symanzik-improved actions [22,23] can be constructed by systematically adding suitable terms which are proportional to powers of the lattice spacing a and which are multiplied by appropriate, so-called 'improvement coefficients'. These coefficients can be tuned non-perturbatively [24,25], or they can be computed using perturbation theory for sufficiently small lattice spacing [22,23]. In the MLMC approach, perturbatively calculated improvement coefficients on the different levels of the lattice hierarchy are in fact sufficient, since the differences of these coefficients between subsequent levels are small on fine lattices.
As a proof-of-concept, hierarchical sampling and multilevel Monte Carlo are applied to two problems in quantum mechanics (D = 1): a non-symmetric double-well potential and the topological oscillator studied in [26]. The latter case is particularly interesting since it has a topological quantum number, which freezes in the continuum limit (a → 0). This results in a rapid growth of the autocorrelation time of the topological susceptibility if standard HMC sampling is used. Hierarchical sampling all but eliminates this growth, resulting in a dramatic reduction in runtime. Furthermore, for this problem the coarse-grained theories can be improved using a perturbative matching technique, which further increases the efficiency of the hierarchical approach. As demonstrated in [26], for the topological oscillator the so-called 'cluster algorithm' [27] almost entirely eliminates autocorrelations through long-range spin updates. However, this method can be further accelerated with MLMC, leading to a reduction in computational complexity and in absolute runtime for high resolutions. Similar gains are observed for the non-symmetric double-well potential problem with MLMC.
In summary, the main achievements of this work are the hierarchical sampling and multilevel Monte Carlo algorithms described above, together with their cost analysis and their numerical validation on two quantum mechanical model systems. It is stressed again that the additional gains due to MLMC acceleration are expected to be significantly larger in high-dimensional theories, such as lattice QCD (see Eq. (5)). The present paper therefore aims to lay the foundation for further work on extending the described methods to quantum field theories on a lattice.
Structure. This paper is organised as follows: after briefly reviewing the literature on related approaches in Section 1.1, the application of hierarchical sampling and multilevel Monte Carlo to the path integral formulation of quantum mechanics is discussed in Section 2. The quantum mechanical model problems that are used in this work are described in Section 3, including the construction of coarse-grained actions for those problems. Numerical results for the non-symmetric double-well potential and the topological oscillator are presented in Section 4; in particular, we compare the runtime of all considered algorithms for fixed $\epsilon_{\text{stat}}$. Section 5 contains the conclusion and outlines directions for future work. More technical topics, such as a detailed cost analysis of MLMC and a discussion of how the methods can be extended to higher dimensional problems, are relegated to the appendices, where we also show additional results.

Relationship to previous work
While hierarchical sampling techniques have been suggested previously (see e.g. [28,29,30,31]), the variance reduction techniques from MLMC significantly improve on this. Eqs. (2) and (4) show that the additional acceleration will lead to a further dramatic reduction in computational complexity. The presented methods are therefore expected to be superior to the approach in [32], which uses a hierarchical method to initialise the simulation, but not for the Monte Carlo sampling. Earlier work in [30,31] uses renormalisation group techniques to sample close to the critical point of the Ising model, where the theories on the coarser levels become self-similar. Similarly, collective cluster-update algorithms [33,27,34] have been applied to models in solid state physics close to phase transitions (see e.g. [35]). However, the application of all those techniques is limited to spin systems. The approach here applies to general systems and delivers significant additional speedup through multilevel Monte Carlo variance reduction.

Path integral formulation of quantum mechanics
For completeness and to introduce the discretised path integral for non-experts, we recapitulate the key principles here. The path integral formulation of quantum mechanics [1] expresses the expectation value of physical observables as the infinite-dimensional sum over all possible configurations or paths {x(t)}, where $x(t) \in D \subset \mathbb{R}$ for all times $t \in \mathbb{R}$. In this sum each path x(t) is weighted by a complex amplitude $e^{i S(x(t))/\hbar}$, where S(x(t)) is the action, the integral over the Lagrangian L of the system. This formulation is very elegant since it allows the direct quantisation of any system which can be described by a Lagrangian. In the limit $\hbar \to 0$ fluctuations around the classical path which minimises the action cancel out, and the Euler-Lagrange equations are recovered. However, for simplicity from now on we will work in atomic units where $\hbar = 1$. To make the evaluation of the path integral tractable, two approximations are made: (1) time is restricted to a finite interval $t \in [0, T)$ and (2) the time interval is divided into d intervals of size a = T/d, which is known as the lattice spacing. Conditions have to be imposed on the paths at t = 0 and t = T; here we use periodic boundary conditions x(T) = x(0). Each path x(t), which is defined for all times $t \in [0, T)$, is replaced by a vector $x = (x_0, x_1, \dots, x_{d-1}) \in \Omega = D^d$. For each $j = 0, 1, \dots, d-1$ the quantity $x_j$ approximates the position $x(t_j)$ of the particle at the time $t_j = aj$. Those two approximations turn the infinite dimensional integral over all paths into an integral over a finite, but high-dimensional domain $D^d$. Evaluating the integral in Euclidean time converts it to the canonical ensemble average of a statistical system at a finite temperature. More specifically, the expectation value of an observable (commonly known as "quantity of interest", QoI, in the UQ literature), which assigns a value Q(x) to each discrete path x, can be written as the ratio

$$\mathbb{E}[Q] = \frac{\int_{\Omega} Q(x)\, e^{-S(x)}\, dx}{\int_{\Omega} e^{-S(x)}\, dx} = \int_{\Omega} Q(x)\, \pi^*(x)\, dx, \qquad (7)$$

with the d-dimensional probability density $\pi^*$ given by

$$\pi^*(x) = Z^{-1} e^{-S(x)}, \qquad (8)$$

with normalisation constant $Z = \int_{\Omega} e^{-S(x)}\, dx$. The action S(x) is an approximation of the continuum action

$$S_{\text{cont}}(x) = \int_0^T L\big(x(t), \dot{x}(t)\big)\, dt,$$

where L is the Lagrangian. Physically meaningful predictions, which can be compared to experimental measurements, are obtained by extrapolating to the continuum limit a → 0 and infinite volume T → ∞. As d is inversely proportional to the lattice spacing, the integrals in Eq. (7) become very high dimensional in the continuum limit. In this paper we do not discuss finite volume errors (due to finite values of T). In other words, we take the continuum limit $Q_{\text{exact}} = \lim_{a\to 0} \mathbb{E}[Q]$ for finite T as the "true" value for any observables studied here.

Standard Monte Carlo
Since the distribution $\pi^*$ in Eq. (8) is highly peaked, the expectation value in Eq. (7) is usually computed with importance sampling. For this, the Metropolis-Hastings algorithm [6,7] is used to iteratively generate a sequence of samples $x^{(0)}, x^{(1)}, \dots, x^{(N-1)} \sim \pi^*$. The expectation value can then be approximated as the sample average

$$\widehat{Q}_{\text{StMC}} = \frac{1}{N} \sum_{t=0}^{N-1} Q\big(x^{(t)}\big). \qquad (9)$$

A single Metropolis-Hastings step for computing $x^{(t+1)}$, given $x^{(t)}$, is written down in Alg. 1.
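For illustration, a minimal random-walk Metropolis sampler for $\pi^*(x) \propto e^{-S(x)}$ can be sketched as follows. The Gaussian proposal and the function names are our choices; the paper itself generates proposals with HMC, which would slot into the same accept/reject structure:

```python
import numpy as np

def metropolis_hastings(S, x0, n_samples, step, rng, n_burnin=1000):
    """Minimal random-walk Metropolis sampler targeting pi*(x) ~ exp(-S(x)).
    Mirrors the accept/reject logic of a single Metropolis-Hastings step,
    discarding the first n_burnin states."""
    x = np.asarray(x0, dtype=float)
    chain = []
    for t in range(n_burnin + n_samples):
        y = x + step * rng.standard_normal(x.shape)        # symmetric proposal
        # accept with probability min(1, exp(-(S(y) - S(x))))
        if rng.uniform() < np.exp(min(0.0, S(x) - S(y))):
            x = y
        if t >= n_burnin:
            chain.append(x.copy())
    return np.array(chain)
```

For a quadratic action $S(x) = \tfrac{1}{2}\sum_j x_j^2$ the chain reproduces unit-variance Gaussian statistics, which gives a quick correctness check.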
The Markov chain $x^{(0)}, x^{(1)}, x^{(2)}, \dots$ is generated by starting from some $x^{(0)}$, which is either a given vector or drawn at random. Since this $x^{(0)}$ is not drawn from the correct distribution, all subsequent samples $x^{(t)}$ are distributed according to some distribution $\pi^{*(t)}$ with $\lim_{t\to\infty} \pi^{*(t)} = \pi^*$. In practice, the first $n_{\text{burnin}}$ samples are discarded, and throughout this paper we implicitly assume that $n_{\text{burnin}} \gg 1$ is chosen such that for all subsequent samples the error due to the difference between $\pi^{*(t)}$ and $\pi^*$ is much smaller than the discretisation and sampling errors.
The law of large numbers states that in the limit of a large number of samples $N \gg 1$ the sample average $\widehat{Q}_{\text{StMC}}$ in Eq. (9) is distributed according to a Gaussian $\mathcal{N}(\mu, \sigma)$ with mean $\mu = \mathbb{E}[Q]$ and variance

$$\sigma^2 = \frac{\tau_{\text{int}}}{N} \operatorname{Var}[Q]. \qquad (10)$$

In this expression $\tau_{\text{int}}$ is the integrated autocorrelation time, defined as

$$\tau_{\text{int}} = 1 + 2 \sum_{s=1}^{\infty} \frac{\operatorname{Cov}\!\left[Q\big(x^{(t_{\text{meas}})}\big),\, Q\big(x^{(t_{\text{meas}} + s)}\big)\right]}{\operatorname{Var}[Q]},$$

where $t_{\text{meas}} \gg n_{\text{burnin}}$ is an arbitrary point in time. As can be seen from Eq. (10), the number of samples required to reduce the statistical error below a given tolerance grows with $\tau_{\text{int}}$, and it is therefore important to reduce the correlation between subsequent samples as far as possible. This can be achieved by carefully choosing the proposal y in line 1 of Alg. 1. In lattice QCD with dynamical fermions, Hybrid Monte Carlo [8] is very popular since it generates global updates. We therefore choose to use this method here, being aware that other algorithms, such as heat bath sampling, might be more efficient for particular applications. We nevertheless believe that HMC is representative, since $\tau_{\text{int}}$ grows with a large power of the inverse lattice spacing as the continuum limit is approached also with other sampling approaches. The only exceptions are some problem-specific samplers, such as the cluster algorithm [27] for the topological oscillator, which we therefore also consider in this work.
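A simple windowed estimator of the integrated autocorrelation time can be sketched as follows, using the normalisation in which $\tau_{\text{int}} = 1$ for uncorrelated samples. The fixed summation window is our simplification; production codes typically use automatic windowing in the style of Wolff:

```python
import numpy as np

def tau_int(q, window=None):
    """Estimate the integrated autocorrelation time of a chain of
    measurements q[0..N-1] via tau = 1 + 2 * sum_s rho(s), truncating
    the sum at lag `window`."""
    q = np.asarray(q, dtype=float)
    n = len(q)
    dq = q - q.mean()
    c0 = np.dot(dq, dq) / n                     # variance (lag-0 covariance)
    if window is None:
        window = n // 10
    # normalised autocorrelation function rho(s) for s = 1, ..., window-1
    rho = [np.dot(dq[:n - s], dq[s:]) / ((n - s) * c0)
           for s in range(1, window)]
    return 1.0 + 2.0 * sum(rho)
```

For an AR(1) chain $x_{t+1} = \varphi x_t + \eta_t$ this convention gives $\tau_{\text{int}} = (1+\varphi)/(1-\varphi)$, e.g. $\tau_{\text{int}} = 3$ for $\varphi = 0.5$.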

Multilevel Monte Carlo
We now describe hierarchical methods for overcoming the growth in autocorrelations and reducing the variance of the measured observable.

Lattice hierarchy
Recall that a path describes the position of the particle at the discrete points $t_j = aj$ with $j = 0, 1, 2, \dots, d-1$. More formally, define a lattice $\mathcal{T}_\ell$ as the set of points

$$\mathcal{T}_\ell = \left\{ t_j^{(\ell)} = j\, a_\ell \;:\; j = 0, 1, \dots, d_\ell - 1 \right\}, \qquad a_\ell = 2^{L-1-\ell}\, a, \quad d_\ell = 2^{\ell-L+1}\, d.$$

Note that the lattices are nested, and the points of the lattice $\mathcal{T}_{\ell-1}$ are a subset of the points of $\mathcal{T}_\ell$, namely the points with even indices. A path on a particular level stores values at the odd and even lattice points, where the latter are also present on the next-coarser lattice. Formally this can be expressed as $\Omega_\ell = \widetilde{\Omega}_\ell \times \Omega_{\ell-1}$, such that all $x \in \Omega_\ell$ can be written as $x = [\widetilde{x}, x']$ with $\widetilde{x} \in \widetilde{\Omega}_\ell$, $x' \in \Omega_{\ell-1}$ and

$$x_j = \begin{cases} x'_{j/2} & \text{for even } j, \\ \widetilde{x}_{(j-1)/2} & \text{for odd } j. \end{cases} \qquad (13)$$
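The even/odd decomposition of a path can be transcribed directly; the function names below are illustrative:

```python
import numpy as np

def split_path(x):
    """Split a level-l path into its odd sites and its even sites;
    the even sites form the corresponding coarse (level l-1) path."""
    x = np.asarray(x)
    return x[1::2], x[0::2]       # (odd sites x~, even sites x')

def merge_path(x_tilde, x_coarse):
    """Inverse of split_path: interleave even and odd sites again,
    so that x_j = x'_{j/2} for even j and x~_{(j-1)/2} for odd j."""
    out = np.empty(len(x_tilde) + len(x_coarse), dtype=float)
    out[0::2] = x_coarse
    out[1::2] = x_tilde
    return out
```

Applying `split_path` followed by `merge_path` recovers the original path, which is exactly the bijection between $\Omega_\ell$ and $\widetilde{\Omega}_\ell \times \Omega_{\ell-1}$.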
On each lattice we define an action $S_\ell : \Omega_\ell \to \mathbb{R}$ such that $S_{L-1} = S$ is the original action. In the simplest case the coarse-level actions are obtained by re-discretising the original action S with the appropriate lattice spacings, but other choices are possible and will be discussed below. On each level the action induces a probability distribution $\pi_\ell$ such that

$$\pi_\ell(x) = Z_\ell^{-1}\, e^{-S_\ell(x)},$$

where $Z_\ell$ is the normalisation constant. The probability distribution $\pi_{L-1}$ on the finest level is identical to $\pi^*$ defined in Eq. (8). Further, introduce a conditional probability distribution $\widetilde{\pi}_\ell(\cdot|x')$ for the values at the odd points on level ℓ, given the values $x'$ at the even points on the same level, namely

$$\widetilde{\pi}_\ell(\widetilde{x}\,|\,x') = \widetilde{Z}_\ell^{-1}(x')\, e^{-\widetilde{S}_\ell(\widetilde{x}|x')} \qquad (14)$$

for all $x = [\widetilde{x}, x] \in \Omega_\ell$ with $x \in \Omega_{\ell-1}$. The action $\widetilde{S}_\ell$ should be some approximation to $S_\ell$, such that it is possible to sample from $\widetilde{\pi}_\ell$ for a given $x$. For the quantum mechanical model problems considered in this work the construction of $\widetilde{S}_\ell$ is described in Sections 3.1.1 and 3.2.1.
We stress that although in this paper we assume that the lattice can be partitioned into sets of mutually independent even and odd sites, the ideas developed here can be generalised to higher dimensions.This is outlined in Appendix A.

Hierarchical sampling
Similar to the delayed-acceptance approach in [15], we next introduce a hierarchical algorithm to efficiently construct a Markov chain on a given level ℓ using coarser levels. First we define the two-level Metropolis-Hastings step in Alg. 2. Writing $x^{(t)} = [\widetilde{x}^{(t)}, x^{(t)}_{\ell-1}]$, this algorithm assumes that on a given level ℓ there is a coarse level proposal distribution $q_{\ell-1}(\cdot|x^{(t)}_{\ell-1})$. Based on this, it proposes a new fine-level state which is either accepted and returned as the new state $x^{(t+1)}$ or rejected; in the latter case the previous state $x^{(t)}$ is returned as $x^{(t+1)}$. It was shown in [15] that this defines a correct Metropolis-Hastings algorithm targeting $\pi_\ell$. Let $q^{(\text{TL})}_\ell(x^{(t+1)}|x^{(t)})$ be the transition kernel for the process $x^{(t)} \to x^{(t+1)}$ implicitly defined by Alg. 2. The key idea is now to use the algorithm recursively, with $q^{(\text{TL})}_{\ell-1}$ as the proposal distribution $q_{\ell-1}$ on level ℓ − 1. The process of picking $y_{\ell-1}$ from $q_{\ell-1}(\cdot|x^{(t)}_{\ell-1})$ in the first line of Alg. 2 then corresponds to a recursive call to the same algorithm on the next-coarser level. On the coarsest level (ℓ = 0), $y_0$ is drawn with the standard Metropolis-Hastings step in Alg. 1 with corresponding transition kernel $q_0(\cdot|x^{(t)}_0)$; here we always assume that the proposal in this Metropolis-Hastings step is generated with a symmetric method such as HMC.
Algorithm 2 Two-level Metropolis-Hastings step.

(The pseudocode listings for Alg. 2 and for the recursive hierarchical sampler, Alg. 3, are not reproduced here.)

One might suspect that the accumulated accept/reject steps imply that the overall acceptance probability of Alg. 3 drops as the number of levels increases, and that subsequent samples are highly correlated. However, this turns out not to be the case if the theories on subsequent levels converge with ℓ → ∞: in this case the proposal from the two-level step in Alg. 2 is accepted with high probability on the finer levels. Our numerical experiments confirm this observation.
In practice, it is more convenient to implement Alg. 3 iteratively, starting from the coarsest level. As discussed in Appendix B, the cost of executing Alg. 3 on level ℓ can be bounded by a constant times the number of unknowns $d_\ell$ on this particular level. Observe also that setting ℓ = L − 1 in Alg. 3 allows drawing a new sample $x^{(t+1)} \sim \pi^*$ from the original fine level probability distribution defined in Eq. (8).
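The structure of the two-level step can be sketched as follows. This is a simplified variant in which the coarse proposal is drawn exactly from a coarse distribution $\propto e^{-S_{\text{coarse}}}$, whereas Alg. 2 uses a coarse Metropolis-Hastings kernel; all function names and interfaces here are our own illustrative choices:

```python
import numpy as np

def two_level_step(x, S_fine, S_tilde, S_coarse, draw_coarse, draw_tilde,
                   split, merge, rng):
    """One delayed-acceptance-style step: propose a coarse state, fill in
    the odd sites from the conditional density ~ exp(-S_tilde), and screen
    the combined proposal against the fine action. Assumes the normalisation
    of the conditional density does not depend on the coarse state (true in
    the test below; in general the ratio of normalisations must be included)."""
    x_tilde, x_coarse = split(x)
    y_coarse = draw_coarse(rng)              # coarse-level proposal
    y_tilde = draw_tilde(y_coarse, rng)      # fill in odd sites
    y = merge(y_tilde, y_coarse)
    log_alpha = (-(S_fine(y) - S_fine(x))
                 + (S_tilde(y_tilde, y_coarse) - S_tilde(x_tilde, x_coarse))
                 + (S_coarse(y_coarse) - S_coarse(x_coarse)))
    if rng.uniform() < np.exp(min(0.0, log_alpha)):
        return y
    return x
```

Even when the coarse action only approximates the fine one, the accept/reject step guarantees that the chain targets the exact fine-level distribution, which the test below checks on a two-site toy problem.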
Relationship to the literature. The two-level step in Alg. 2 is closely related to similar algorithms in [15,17]. If the coarse level sample is drawn with an arbitrary Metropolis-Hastings kernel $q_{\ell-1}(\cdot|x^{(t)}_{\ell-1})$, then Alg. 2 above is a variant of the delayed-acceptance method in [15, Alg. 1] with proposal distribution $q(y_\ell|x^{(t)}_\ell) = \widetilde{\pi}_\ell(\widetilde{y}_\ell|y_{\ell-1})\, q_{\ell-1}(y_{\ell-1}|x^{(t)}_{\ell-1})$, where $x^{(t)}_\ell = [\widetilde{x}^{(t)}_\ell, x^{(t)}_{\ell-1}]$ and $y_\ell = [\widetilde{y}_\ell, y_{\ell-1}]$. On the other hand, if the coarse level sample is drawn from the exact coarse level distribution, i.e. if $q_{\ell-1}(\cdot|x^{(t)}_{\ell-1}) = \pi_{\ell-1}(\cdot)$, then Alg. 2 corresponds to the two-level sampler in [17].

Multilevel Monte Carlo algorithm
As discussed in the introduction, the multilevel Monte Carlo algorithm computes the quantity of interest $Q_0$ on the coarsest level and adds corrections to this by computing the differences $Y_\ell$ of the observable on subsequent levels ℓ = 1, 2, ..., L − 1, according to the telescoping sum in Eq. (3). Since those differences $Y_\ell$ have a smaller variance, this allows shifting the cost to the coarser levels, where samples can be generated cheaply. The original MLMC algorithm described in [9] assumes that it is possible to draw independent identically distributed (i.i.d.) samples from a distribution on each level. For the Markov chain Monte Carlo setting considered here this is not possible, since subsequent samples in the chain are correlated and, as discussed in [17], this introduces an additional bias. This bias can be reduced by constructing sequences $z_\ell^{(0)}, z_\ell^{(1)}, z_\ell^{(2)}, \dots$ of samples for each level ℓ = 0, ..., L − 1 with Alg. 3 and sub-sampling those sequences with sufficiently large sub-sampling rates $t_\ell$. The typical rule in statistics is to use twice the integrated autocorrelation time $\tau_{\text{int},\ell}$ to achieve (sufficient) independence. In our numerical experiments, we set $t_\ell = 2\tau_{\text{int},\ell}$ and observe that the additional bias due to the coarse level samples being only approximately independent is comparable to the discretisation error.
The multilevel Monte Carlo algorithm which we use in this work is presented in Alg. 4 and visualised in Fig. 2. It is similar to the multilevel algorithm in [17], but with the recursive independent sampler in [17, Alg. 3] replaced by the (suitably sub-sampled) hierarchical delayed-acceptance sampler in our Alg. 3 above. Multilevel Monte Carlo computes

$$\widehat{Q}_{\text{MLMC}} = \sum_{\ell=0}^{L-1} \widehat{Y}_\ell, \qquad \widehat{Y}_\ell = \frac{1}{N_\ell^{\text{eff}}} \sum_{j=1}^{N_\ell^{\text{eff}}} Y_\ell^{(j)}, \qquad (15)$$

which is an unbiased estimator for the expectation $\mathbb{E}[Q]$ in Eq. (3). On each level ℓ, the number of samples is chosen to be

$$N_\ell^{\text{eff}} \propto \sqrt{\frac{V_\ell}{C_\ell^{\text{eff}}}}, \qquad (16)$$

where $C_\ell^{\text{eff}}$ is the effective cost of generating an independent sample on level ℓ (taking into account autocorrelations) and $V_\ell = \operatorname{Var}[Y_\ell]$ is the variance of the quantity $Y_\ell$ on level ℓ, which converges to zero as ℓ → ∞. In Alg. 4, a new sample $x_0$ is created with a standard Metropolis-Hastings method on the coarsest level for each $j = 1, \dots, N_0^{\text{eff}}$.

(From the listing of Alg. 4: on each level ℓ ≥ 1, compute $Y_\ell$ and create a new sample $x^{(t+1)}$ from $x^{(t)}$ with Alg. 2 and $q_{\ell-1}(\cdot|x^{(t)}_{\ell-1})$; in practice, use $t_{\ell-1}$ steps of Alg. 3 to compute an approximately independent coarse level sample $z_{\ell-1}$.)
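The per-level sample allocation of Eq. (16) can be sketched as follows; this is the standard MLMC allocation, normalised so that the statistical error $\sum_\ell V_\ell / N_\ell$ stays below $\epsilon_{\text{stat}}^2$ (the constant-of-proportionality convention is ours):

```python
import math

def mlmc_sample_numbers(V, C, eps_stat):
    """Per-level sample numbers N_l proportional to sqrt(V_l / C_l),
    scaled such that sum_l V_l / N_l <= eps_stat**2.
    V: per-level variances V_l, C: per-level costs C_l per sample."""
    s = sum(math.sqrt(v * c) for v, c in zip(V, C))
    return [max(1, math.ceil(eps_stat**-2 * math.sqrt(v / c) * s))
            for v, c in zip(V, C)]
```

Because rounding up with `ceil` only decreases each term $V_\ell / N_\ell$, the statistical-error constraint is satisfied exactly, while the $\sqrt{V_\ell/C_\ell}$ profile minimises the total cost $\sum_\ell N_\ell C_\ell$ for that constraint.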
We now discuss how the cost of Alg. 4 increases as the tolerance on the total error is tightened, assuming for simplicity i.i.d. samples on all levels. Let the exact value of the observable in the continuum limit be $Q_{\text{exact}}$. The total mean square error of the multilevel Monte Carlo estimator defined in Eq. (15) can be expanded as

$$\mathbb{E}\!\left[\big(\widehat{Q}_{\text{MLMC}} - Q_{\text{exact}}\big)^2\right] = \sum_{\ell=0}^{L-1} \frac{V_\ell}{N_\ell^{\text{eff}}} + \big(\mathbb{E}[Q_{L-1}] - Q_{\text{exact}}\big)^2, \qquad (17)$$

where the first term in the final line of Eq. (17) is the squared statistical error, whereas the second term is the squared discretisation error. An easy calculation shows that choosing $N_\ell^{\text{eff}}$ as in Eq. (16) guarantees that $\sum_{\ell} V_\ell / N_\ell^{\text{eff}} \le \epsilon_{\text{stat}}^2$. To analyse the complexity we assume that (i) the discretisation error is of order $\mathcal{O}(a^{\alpha})$, (ii) $V_\ell$ converges with order $\mathcal{O}(a_\ell^{\beta})$ for some β > 0, and (iii) the integrated autocorrelation times of $Y_\ell$, and thus also the sub-sampling rates $t_\ell$, can be bounded by a constant independent of ℓ, such that the cost $C_\ell^{\text{eff}}$ of generating an independent sample does not grow faster than the number of unknowns $d_\ell$ for all ℓ.
As shown in more detail in Appendix B, it is then possible to choose the number of levels L such that the discretisation error in Eq. (17) does not exceed $\epsilon_{\text{disc}}$. As a consequence, the cost $C_{\text{MLMC}}(\epsilon_{\text{disc}}, \epsilon_{\text{stat}})$ of computing the MLMC estimator in Eq. (15) with a statistical error less than $\epsilon_{\text{stat}}$ and a discretisation error less than $\epsilon_{\text{disc}}$ has the following computational complexity:

$$C_{\text{MLMC}}(\epsilon_{\text{disc}}, \epsilon_{\text{stat}}) = \mathcal{O}\!\left(\epsilon_{\text{stat}}^{-2} \vert \log \epsilon_{\text{disc}} \vert^2 + \epsilon_{\text{disc}}^{-1/\alpha}\right). \qquad (18)$$

For the choice $\epsilon_{\text{disc}} = \epsilon_{\text{stat}} = \epsilon/\sqrt{2}$, the total mean square error in Eq. (17) does not exceed $\epsilon^2$ and Eq. (18) becomes

$$C_{\text{MLMC}}(\epsilon) = \mathcal{O}\!\left(\epsilon^{-2} \vert \log \epsilon \vert^2 + \epsilon^{-1/\alpha}\right), \qquad (19)$$

which is a special case of the well-known estimate in [9]. However, the samples created by Alg. 4 on each of the levels are generated with a Markov chain and are thus only asymptotically distributed according to $\pi_\ell$. As discussed in [17], the complexity analysis can be modified to address this issue, leading to an additional factor $\vert \log \epsilon_{\text{disc}} \vert$ in Eqs. (18) and (19). This factor is, however, not visible in the numerical results below or in [17] (at least pre-asymptotically).

Quantum mechanical model systems
To demonstrate the performance of the methods discussed in the previous section we consider two non-trivial quantum mechanical problems.

Non-symmetric double-well potential
The first system describes a particle with mass $m_0$ moving subject to a non-symmetric double-well potential $V(x)$, where $x(t) \in \mathbb{R}$; the corresponding Lagrangian is given in Eq. (20). Fig. 3 shows this potential for the choice of parameters that were used in our numerical experiments.

Figure 3: Double well potential used for numerical experiments.
For a given path $x = (x_0, x_1, \dots, x_{d-1}) \in \mathbb{R}^d$ the discretised lattice action is

$$S(x) = \sum_{j=0}^{d-1} \left[ \frac{m_0}{2a} (x_{j+1} - x_j)^2 + a\, V(x_j) \right], \qquad (21)$$

with periodic boundary conditions $x_d = x_0$. The observable we consider is the average squared displacement

$$Q(x) = \frac{1}{d} \sum_{j=0}^{d-1} x_j^2. \qquad (22)$$

Note that since points on the lattice are correlated with a correlation length which is constant in physical units, the variance of this observable does not go to zero in the continuum limit. In other words, the sampling error is not automatically reduced on finer lattices.
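The discretised double-well action and the average squared displacement observable translate directly into code. The double-well parameters below are placeholders, not the values used in the experiments:

```python
import numpy as np

def V(x, lam=1.0, mu2=1.0, kappa=0.1):
    """A generic non-symmetric double well lam/4 * (x^2 - mu2)^2 + kappa*x;
    the parameter values are illustrative placeholders."""
    return 0.25 * lam * (x * x - mu2)**2 + kappa * x

def lattice_action(x, a, m0=1.0):
    """Discretised action: kinetic term m0/(2a) * (x_{j+1} - x_j)^2 with
    periodic boundary conditions (x_d = x_0), plus a * V(x_j), summed
    over all d sites."""
    dx = np.roll(x, -1) - x            # x_{j+1} - x_j, wrapping around
    return float(np.sum(0.5 * m0 / a * dx**2 + a * V(x)))

def avg_sq_displacement(x):
    """Average squared displacement Q(x) = (1/d) * sum_j x_j^2."""
    return float(np.mean(x * x))
```

For the constant path $x \equiv 0$ the kinetic term vanishes and the action reduces to $a\, d\, V(0)$, which gives a quick sanity check.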

Coarse level action
Coarse grained versions $S_\ell$ of the action in Eq. (21) are obtained by re-discretising the Lagrangian in Eq. (20) on the lattice $\mathcal{T}_\ell$ with $d_\ell = 2^{\ell-L+1} d$ points and lattice spacing $a_\ell = 2^{L-1-\ell} a$ on level ℓ. To construct the action $\widetilde{S}_\ell$ defined in Eq. (14), observe that $\pi_\ell$ can be factorised as in Eq. (23), where $\pi_\ell^{\text{even}}$ is the marginal distribution of the even points and the conditional distribution of the odd points is given in Eq. (24). Here the function $W_\ell$, which couples each odd point to its two neighbouring even points, is defined for arbitrary values $x_-, x_+$, and $Z_{\ell-1,j}$ is a normalisation constant. The distribution in Eq. (24) can be approximated by a Gaussian, centred at a point $\zeta_\ell$ which satisfies the non-linear equation (25). In the code, $\zeta_\ell$ is found by a small number of fixed point iterations of Eq. (25), using $\overline{x} = (x_- + x_+)/2$ as a starting guess. Further, $2m_0 \sigma_\ell / a_\ell$ is the curvature of the function $W_\ell$. Now write $x = [\widetilde{x}, x]$ as in Eq. (13). Given $\zeta_\ell(x_j, x_{j+1})$ and $\sigma_\ell(x_j, x_{j+1})$ for all $j = 0, 1, \dots, d_{\ell-1} - 1$, we can then construct $\widetilde{S}_\ell$. The resulting probability density $\widetilde{\pi}_\ell(\cdot|x)$ defined in Eq. (14) is a multivariate normal distribution with diagonal covariance matrix, which can be easily sampled; the normalisation constant in Eq. (14) is then available in closed form.

Topological oscillator
The second model system is the topological oscillator, described for example in [26]. This is an interesting problem since it has a topological quantum number which can only take on integer values. The Lagrangian is given in Eq. (26), where now crucially x ∈ [−π, π), i.e. the particle is confined to a finite interval. The Lagrangian in Eq. (26) can be obtained from the action of a free particle with mass m_0 confined to a circle with radius R. As above, we use periodic boundary conditions for the discretised action. For a given path x(t) the topological charge q(x) of the system describes the number of complete revolutions during the time period T; for the discretised system it is the integer winding number obtained by summing the increments of x, each mapped back to [−π, π). The observable we consider is the topological susceptibility defined in Eq. (27). Defining ξ := T/I_0 and z := a/I_0, a tedious but straightforward calculation shows that the expectation value of χ_t for finite a, T is given by Eq. (28), where for any p ∈ N, ξ > 0 the function Σ_p is defined as in Eq. (29). Eq. (28) allows the calculation of the constant ∆_0 in the Taylor expansion of the topological susceptibility. In other words, we can work out the bias for a given lattice spacing. This will also allow us to balance the discretisation and statistical errors in the MLMC estimator if we choose ε_disc = ε_stat.
In the continuum limit (a → 0) the variance of χ_t can be shown to approach the expression given in Eq. (30).
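A minimal sketch of the discretised topological charge and the susceptibility estimator χ_t = E[q²]/T; the increment-wrapping convention is an assumption consistent with the winding-number definition above:

```python
import numpy as np

def topological_charge(x):
    """Winding number of a discretised path x_j in [-pi, pi) on the circle.
    Each increment is mapped back to [-pi, pi) before summing, so the sum is
    2*pi times an integer for closed (periodic) paths."""
    dx = np.diff(x, append=x[:1])                # periodic differences
    dx = (dx + np.pi) % (2.0 * np.pi) - np.pi    # wrap to [-pi, pi)
    return np.rint(np.sum(dx) / (2.0 * np.pi)).astype(int)

def topological_susceptibility(paths, T):
    """chi_t = E[q^2]/T, estimated from an ensemble of paths."""
    q = np.array([topological_charge(x) for x in paths], dtype=float)
    return np.mean(q**2) / T
```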

Coarse level action
For the topological oscillator the coarse level action has the same form, but with a level dependent moment of inertia I_0^{(ℓ)}. In the simplest case one could simply set I_0^{(ℓ)} = I_0 for all ℓ = 0, 1, …, L−1. However, as will be shown below, performance can be improved significantly by using a perturbative matching procedure to construct I_0^{(ℓ)} on the coarser levels. To obtain S̃_ℓ, rewrite π_ℓ as in Eqs. (23) and (24). Again write x = [x̃, x′] as in Eq. (13), given ζ_ℓ(x_j, x_{j+1}) and σ_ℓ(x_j, x_{j+1}) for all j. The normalisation constant in Eq. (14) involves B_0, the zeroth-order modified Bessel function of the first kind. The resulting probability density π̃_ℓ is the product of one-dimensional densities of the form p_{σ,δx} given in Eq. (31), which can be easily sampled for arbitrary values of σ and δx. In our code we find that rejection sampling with a suitable Gaussian envelope (as described in Appendix C) gives good results.

Coarse level matching
Ideally, the coarse level actions should be obtained by recursively integrating out the modes that can be represented on a given lattice, but not on the next coarser one.
In other words, S_{ℓ−1} is an effective action obtained from S_ℓ. While for an arbitrary action this cannot be done exactly, an approximate effective action can be constructed by a perturbative renormalisation group transformation or through (approximate) matching. Here we follow the latter procedure for the topological oscillator to adjust the moment of inertia I_0^{(ℓ)} on the coarser levels, starting from the physical value I_0 = I_0^{(L−1)} on the finest lattice. Let χ_t(a, I_0, T) be the topological susceptibility calculated for a given I_0, T and lattice spacing a, and recall that we can compute χ_t(a, I_0, T) up to corrections of O((a/I_0)²). We now require that the susceptibility agrees between subsequent levels up to corrections of O((a_ℓ/I_0)²) for all ℓ = 1, …, L−1. Using Eq. (28), this determines I_0^{(ℓ)}, with Σ_p as defined in Eq. (29). As the following numerical results show, computing I_0^{(ℓ)} with this approximate coarse level matching procedure significantly improves performance both for the hierarchical sampler in Alg. 3 and the MLMC method in Alg. 4.

Results
We now quantify the performance gains of the numerical algorithms described above. All results were obtained with a C++ code developed by the authors which is freely available at https://bitbucket.org/em459/mlmcpathintegral/. The code was compiled with version 18.5.274 of the Intel C compiler and run on a single core of an Intel E5-2650 v2 (2.60 GHz) CPU. For all numerical results we set T = 4; as remarked above, we do not consider finite-volume errors here, i.e. we assume that the exact value is the expectation value of the observable in the limit a → 0 at a given T. As can be seen from Eq. (28), finite-volume errors are exponentially suppressed for the topological oscillator. For the double-well potential the mass is set to m_0 = 1.0, whereas the moment of inertia for the topological oscillator is I_0 = 0.25.

Autocorrelations
To quantify the significant reduction of autocorrelations which is achieved by hierarchical sampling, we measure the integrated autocorrelation time τ_int for the single level Metropolis-Hastings algorithm (Alg. 1) when either a simple HMC algorithm or the hierarchical delayed acceptance sampler in Alg. 3 is used. We refer to the first method as "StMC" from now on, whereas the latter is denoted as "HSMC". In the latter case the number of levels is chosen such that the coarsest level is fixed and always has d_0 = 16 points for the double-well potential and d_0 = 32 for the topological oscillator (corresponding to lattice spacings of a_0 = 0.25 and a_0 = 0.125 respectively). An HMC sampler is used to generate proposals on the coarsest level. In all cases (i.e. either on the fine level for the StMC method or on the coarsest level for HSMC) 100 HMC steps are carried out and the size of the HMC timestep is tuned such that the acceptance probability of the HMC sampler is close to 80%. We implemented a simple HMC method based on a symplectic leapfrog integrator. The integrated autocorrelation time defined in Eq. (11) is estimated from measurements of the QoI for N = 10^5 samples, using a windowed estimator; as described in [36], the size of the window W is chosen such that systematic and statistical errors on τ_int are balanced. Fig. 4 shows the integrated autocorrelation time of the quantity of interest defined in Eq. (22) for the double well potential. As can be seen from this plot, τ_int increases in proportion to a^{−z} with z ≈ 2.35 for small lattice spacings, whereas it is completely flat for the hierarchical sampler.
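The windowed estimator can be sketched as follows; the window constant c = 4 and the stopping rule are illustrative choices in the spirit of [36], not necessarily the exact procedure used in the paper's code:

```python
import numpy as np

def integrated_autocorrelation_time(q, c=4.0):
    """Windowed estimator tau_int = 1/2 + sum_{t=1}^{W} rho(t) for a chain
    of measurements q, where rho(t) is the normalised autocorrelation
    function. The window W is chosen automatically as the first lag with
    t >= c * tau_int(t); the constant c = 4 is an illustrative choice that
    balances truncation bias against statistical noise."""
    q = np.asarray(q, dtype=float)
    n = len(q)
    dq = q - q.mean()
    var = np.dot(dq, dq) / n
    tau = 0.5
    for t in range(1, n // 2):
        rho = np.dot(dq[:-t], dq[t:]) / ((n - t) * var)
        tau += rho
        if t >= c * tau:                  # automatic windowing condition
            break
    return max(tau, 0.5)
```

For an AR(1) chain with autocorrelation ρ per step, the exact value is τ_int = (1 + ρ)/(2(1 − ρ)), which the estimator reproduces up to a small truncation bias.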
For the topological oscillator the observable is the topological susceptibility defined in Eq. (27). Here two different setups are considered for the hierarchical sampler: in the first setup the value of I_0^{(ℓ)} on the coarse levels is adjusted with the perturbative matching procedure described in Section 3.2.2. For comparison we also consider the case where I_0^{(ℓ)} = I_0 = 0.25 is kept fixed on all levels; we refer to this as the "not renormalised" setup in the plots. As Fig. 5 shows, for the topological susceptibility the integrated autocorrelation time increases very rapidly, approximately as τ_int ∝ a^{−z} with z = 8.77, for small lattice spacings if a standard HMC sampler is used. In fact, the measured τ_int is larger than 1000 for lattice spacings smaller than 0.03, and the single level method becomes practically unusable if a is reduced further. This is consistent with the results shown in [26, Fig. 1] and can be attributed to freezing of the integer-valued topological charge q: for small lattice spacings tunnelling between sectors with different values of q becomes increasingly unlikely. If the hierarchical sampler is used, this problem is dramatically reduced: τ_int is around 10 and grows only weakly for small lattice spacings. Perturbative matching reduces τ_int by a factor of approximately two.
The slow growth of the integrated autocorrelation time for the hierarchical sampler is related to the acceptance probability of Alg. 3. Recall that a proposal on the finest level is only accepted (i.e. x^{(t+1)}_{L−1} ≠ x^{(t)}_{L−1}) if all coarse level proposals have been accepted. In other words, the overall acceptance probability p_acc = P(x^{(t+1)}_{L−1} ≠ x^{(t)}_{L−1}) is the probability of accepting the proposal generated with HMC on the coarsest level (this probability is tuned to around 80%), times the probabilities of accepting the proposals generated with the two-level step in Alg. 2 on all levels ℓ = L−1, L−2, …, 1.
Fig. 6 shows this overall acceptance probability p_acc as the number of levels L increases, for both the double-well potential and the topological oscillator. For the double-well potential the overall acceptance rate does not drop below 75%, which implies that the acceptance probability of an individual two-level Metropolis-Hastings step approaches 100% on the finer levels. For the topological oscillator a similar behaviour can be observed, although the curve flattens more slowly and the total acceptance rate approaches a smaller value for small lattice spacings. As expected, the acceptance probability is higher for the renormalised action. This is not surprising since in the two-level Metropolis-Hastings step the coarse level proposal is then a better approximation of the even modes on the next-finer level. Although this explains the smaller absolute value of the autocorrelation time in Fig. 5, measurements of the runtime (see Tab. 3 below) show that using the renormalised action for the HSMC method has a smaller impact on the overall runtime, since the average cost per sample grows as the acceptance probability increases. This can be seen immediately from Alg. 2: if the proposal is already rejected on one of the coarser levels, it is no longer necessary to carry out the more expensive two-level Metropolis-Hastings steps on the finer levels.

Discretisation error and variance decay
To quantify the discretisation error ∆_disc(a) as a function of the lattice spacing, we assume that it has the asymptotic form ∆_disc(a) ≈ ∆_0 a^α. For the double-well potential the parameters ∆_0 and α are obtained by calculating Q(a) with N = 4·10^8 samples (using the hierarchical method in Alg. 3) for a range of lattice spacings a = 1/64, 1/32, 1/16, 1/8. As shown in Fig. 7, the measured data is consistent with α = 1. The coefficient ∆_0 is estimated by approximating E[Q(a = 0)] by Q(a_fine) with a_fine = 1/512 and fitting a function of the form log ∆_0 + log a to log(Q(a) − Q(a_fine)), which gives ∆_0 = 0.17298. Based on this result we use the relationship ε_disc = ∆_0 a to relate the lattice spacing to the tolerance on the discretisation error in the following. For the topological oscillator the asymptotic form of the discretisation error can be deduced from Eq. (28), which implies that at leading order the error is again linear in the lattice spacing (α = 1). For our choice of numerical values we find ∆_0 = 0.21567. For the performance of the multilevel Monte Carlo method the behaviour of the variance V_ℓ of the difference of the quantity of interest between subsequent levels is important. Recall in particular that the computational complexity of the MLMC algorithm given in Eq. (18) depends on the value of β which bounds V_ℓ/V_{ℓ−1} ≤ 2^{−β}. Fig. 8 shows V_ℓ for the double well potential as well as the variance of the quantity of interest itself. As can be seen from this plot, β = 1, and hence (since α = 1, as discussed above) we expect the computational complexity of MLMC to be O(ε_stat^{−2} |log ε_disc|² + ε_disc^{−1}). This assumes that the sub-sampling rates and integrated autocorrelation times can be bounded, which appears plausible given the results shown in Figs. 4 and 5. The variance decay for the topological oscillator is shown in Fig.
9, both for the perturbatively renormalised action and the unrenormalised action. Renormalising the action reduces the absolute value of V_ℓ. In both cases it is safe to assume that β ≥ 1, and hence again we expect the computational complexity to be no worse than O(ε_stat^{−2} |log ε_disc|² + ε_disc^{−1}), provided the sub-sampling rates and integrated autocorrelation times can be bounded as a → 0.

Figure 9: Variance of difference estimators Y_ℓ and the quantity of interest Q for the topological oscillator. The lattice spacing on level ℓ is a_ℓ = 2^{L−1−ℓ} a. The continuum limit as given in Eq. (30) is shown as a red dashed line. The data points are labelled with the number of dimensions d_ℓ for each lattice spacing.
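The fitting procedure described above can be sketched as a linear regression in log-log space; function and variable names are illustrative:

```python
import numpy as np

def fit_discretisation_error(a, Q, Q_fine):
    """Fit Delta_disc(a) ~ Delta0 * a^alpha by linear regression of
    log|Q(a) - Q_fine| on log(a); Q_fine approximates E[Q(a = 0)]
    (in the text, Q measured at a_fine = 1/512)."""
    y = np.log(np.abs(np.asarray(Q) - Q_fine))
    alpha, log_delta0 = np.polyfit(np.log(a), y, 1)   # slope, intercept
    return np.exp(log_delta0), alpha
```

On synthetic data of the exact form Q(a) = Q_0 + ∆_0 a, the fit recovers ∆_0 and α = 1 to machine precision.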

Total runtime
Finally, we compare the total runtime for three different setups:

StMC: The standard single-level Monte Carlo method in Alg. 1 with an HMC sampler.
HSMC: The standard Monte Carlo method in Alg. 1 with the hierarchical delayed-acceptance sampler written down in Alg. 3.
MLMC: The multilevel Monte Carlo method in Alg. 4.

The configuration of StMC and HSMC is described in Section 4.1. For the multilevel method the coarsest level has d_0 = 16 points for the double-well potential and d_0 = 32 points for the topological oscillator. The sub-sampling rates t_ℓ in Alg. 4 are set to t_ℓ = 2 τ_int,ℓ, where τ_int,ℓ is the estimated integrated autocorrelation time of the quantity of interest on level ℓ obtained with the hierarchical sampler. We confirmed that this choice of sub-sampling rate is sufficient to generate approximately independent samples and that any additional bias in the final MLMC estimator due to imperfect sub-sampling is comparable to the discretisation error. In all cases we generated and discarded a sufficiently large number of samples before computing estimators to ensure that the Markov chains are equilibrated on all levels; the runtimes reported here do not include the time spent in this equilibration phase. As can be seen from Figs. 10 and 11, the runtime of the StMC method grows rapidly, in proportion to ε_disc^{−1−z}, which corresponds to a high power a^{−(1+z)} of the inverse lattice spacing since the discretisation error is first order in all cases. Here a factor a^{−1} arises since the cost of generating a path is O(a^{−1}), and the remaining power a^{−z} can be explained by the growth in τ_int discussed in Section 4.1. As the results in Figs. 10 and 11 show, by taming autocorrelations the HSMC method reduces the growth of computational cost. In fact, for the lattice spacings considered here the cost grows more slowly than predicted by the theoretical O(ε_disc^{−1}) complexity bound. This is a pre-asymptotic effect and the reason for it is twofold: firstly, the (fixed) cost of the expensive coarse level HMC sampler still contributes significantly to the overall cost of the hierarchical sampler, which is not yet dominated by the evaluation of the action on the finer levels. Secondly, as can be seen from the initial drop of the total acceptance probability in Fig.
6, the probability of accepting a proposed sample in the two-level Metropolis-Hastings step on a given level is smaller than 1 on the coarser levels, before it approaches 1 on the finer levels. As a consequence, in a significant proportion of cases generating a hierarchical sample does not require the evaluation of the fine-level action, since the proposal is already rejected on a coarser level.
MLMC reduces the asymptotic rate of growth further, and for the double-well potential MLMC is significantly faster than HSMC for the smallest tolerance ε_disc considered here.
Tab. 1 summarises the speedup of MLMC over StMC for both problems. The relative gain of MLMC over HSMC is shown in Tab. 2; the gap between the runtimes of the two methods also narrows for the topological oscillator for the tolerances considered here. While we kept the tolerance ε_stat = 0.01 fixed, in Appendix D we also show the runtime as a function of the tolerance on the total root mean square error, i.e. for ε_disc = ε_stat = ε/√2.
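Although Eq. (16) is not reproduced here, the standard MLMC sample-size allocation that such estimators build on (see e.g. Giles) can be sketched as follows; it ignores the autocorrelation corrections which the paper folds into effective costs:

```python
import numpy as np
from math import ceil

def mlmc_sample_sizes(V, C, eps_stat):
    """Standard MLMC allocation: minimise the total cost sum_l N_l C_l
    subject to the variance constraint sum_l V_l / N_l <= eps_stat^2,
    which gives N_l = eps_stat^{-2} sqrt(V_l/C_l) * sum_m sqrt(V_m C_m).
    V_l is the variance of the difference estimator Y_l and C_l the cost
    per independent sample on level l; autocorrelation corrections
    (the effective costs of Eq. (16)) are assumed folded into C_l."""
    V, C = np.asarray(V, dtype=float), np.asarray(C, dtype=float)
    s = np.sum(np.sqrt(V * C))
    return [max(1, ceil(n)) for n in eps_stat**-2 * np.sqrt(V / C) * s]
```

Note how levels with rapidly decaying variance V_ℓ receive far fewer samples, which is the source of the MLMC cost reduction.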

Breakdown of MLMC cost
For the multilevel method it is instructive to break down the total computational cost into the time spent on the individual levels of the lattice hierarchy. To do this, we computed the estimated fraction of the runtime spent on each level ℓ, which is plotted in Fig. 12. As can be seen from this plot, relatively more time is spent on the coarser levels of the lattice hierarchy, which can be explained by the fact that the variance decays slightly faster than linearly.

Gains from coarse level matching
Finally, we quantify the gains from coarse-level matching for the topological oscillator. For this the HSMC and MLMC runs were repeated without coarse level matching, i.e. with I_0^{(ℓ)} = I_0 = 0.25 for all ℓ = 0, 1, …, L−1. Tab. 3 shows that coarse level matching results in a relatively modest reduction of the runtime for the HSMC sampler. As already discussed at the end of Section 4.1, this can be explained by the fact that renormalising the coarse level action leads to a reduction of the integrated autocorrelation time, but this effect is largely compensated by the increased cost per sample. As the corresponding speedups for MLMC in Tab. 4 show, the gain is significantly larger for the multilevel method, where coarse level matching more than halves the runtime. This is because for MLMC matching the actions has the additional effect of reducing the absolute value of the variance of the difference estimators, as can be seen from Fig. 9.

Multilevel accelerated cluster algorithm
For the topological oscillator the cluster algorithm [27] can be used to generate Monte Carlo updates with extremely small autocorrelations for arbitrarily small lattice spacings. We implemented a variant of Alg. 4 in which the new samples x_0^{(t+t_0)} (line 4) and the coarse level samples z_{ℓ−1}^{(t+t_{ℓ−1})} (line 7) are generated with the (single-update) cluster algorithm instead of the hierarchical sampler in Alg. 3. Again the sub-sampling rates are set to t_ℓ = 2 τ_int,ℓ, and we find numerically that τ_int,ℓ ≈ 3 for all levels considered here. The number of unknowns on the coarsest level was fixed to d_0 = 512, while the fine-level problem size increases from d = 1024 to d = 16384. The performance of this MLMC algorithm is compared to the standard, single-level cluster algorithm, again updating a single cluster in each Metropolis-Hastings step. As the numerical results in Fig. 13 show, MLMC is around 40% faster than the standalone cluster algorithm for the smallest lattice spacing considered here. The speedup of the MLMC accelerated cluster algorithm over the single-level cluster algorithm for all lattice spacings is shown in Tab. 5. More importantly, the numerical experiments show that the runtime of MLMC increases roughly as O(|log ε_disc|²) and thereby grows significantly more slowly than the runtime of the cluster algorithm, which shows the expected O(ε_disc^{−1}) growth. Again we also show the corresponding results for varying ε_stat = ε_disc = ε/√2 in Fig. 17 in Appendix D.
It should be stressed at this point that, while the cluster algorithm proved to be highly efficient for the topological oscillator, its applicability is highly problem dependent: it cannot, for example, be directly used for the double well potential problem considered in this work.

Conclusion
In this paper we have described a hierarchical sampling algorithm and applied it to simulations in quantum mechanics. We demonstrated that this can overcome the rapid growth of autocorrelations as the continuum limit is approached. In particular, we considered the anharmonic oscillator with a non-symmetric double well potential and the quantum mechanical topological oscillator model described in [26]. Empirically we find that in both cases the integrated autocorrelation time does not show any significant increase as the lattice spacing approaches zero. This result is particularly significant for the susceptibility of the topological oscillator, which suffers from freezing of the topological charge if a single level method with a standard HMC sampler is used.
Combining this new hierarchical sampling technique with multilevel Monte Carlo acceleration results in a dramatic reduction of the computational complexity and a significant reduction of the overall runtime. For the finest lattice spacings considered, the additional speedup from MLMC (compared to hierarchical sampling for the non-symmetric double well potential, or to the cluster algorithm for the topological oscillator) is around 1.4× to 2.0×. We find that the accurate construction of coarse level theories with an approximate matching procedure is important to achieve optimal performance.
In this paper we have concentrated on reducing the time spent in the sampling phase of the Markov Chain Monte Carlo simulation and did not include burn-in times in the reported runtimes. However, burn-in can also be accelerated with hierarchical sampling, since the reduction in autocorrelation time allows chains to equilibrate much faster.
While here we have demonstrated the methods for quantum mechanical systems, the same techniques can be used in lattice field theory simulations. In fact, as explained in the introduction, we expect the speedup to be more significant in this case since the relative cost of computations on coarser levels is further suppressed. A crucial step will be to construct suitable coarse grained theories, which could be achieved analytically in perturbation theory or by adopting the framework of Symanzik's effective theory, where the improvement coefficients can be computed perturbatively. Since many physically interesting theories, and in particular lattice QCD, are asymptotically free, this is expected to work increasingly well as the continuum limit is approached. Of course, finding non-perturbative methods to construct the coarse-grained theory would be even better.
Recently, multilevel Monte Carlo has received significant attention in other areas, which has led to further innovations. While here the method is described in the most natural setup, where coarse levels are constructed by increasing the lattice spacing, coarsening in other categories is also possible and potentially leads to further performance gains, e.g. by using the multi-index Monte Carlo technique [37]. For example, the complexity of the theory could be reduced on coarser levels, or the physical volume of the lattices could be increased with ℓ, thus aiming to approach the continuum limit a → 0 and the large volume limit T → ∞ simultaneously. In lattice QCD one might increase the dynamical quark masses on the coarser levels, which simplifies the computation of the fermion determinant.
In summary, the success of the benchmark computations presented in this paper suggests that applying MLMC techniques to higher dimensional theories, e.g. to gauge theories, is indeed a promising approach which we plan to follow in the future.

A Extension to higher dimensions
To illustrate how the methods in this paper, and in particular the two-level Metropolis-Hastings step in Alg. 2, can be extended to higher dimensions, consider a discretised two-dimensional theory for which the degrees of freedom are located at the vertices of a uniform lattice with spacing a. If Ω is the state space and S: Ω → R is the lattice action, the probability density π* which is sampled with a Monte Carlo method is defined by π*(Φ) ∝ exp(−S(Φ)) for all Φ ∈ Ω.
Starting from the finest lattice T, construct a hierarchy of L lattices T = T_{L−1}, T_{L−2}, …, T_0 by doubling the lattice spacing simultaneously in both dimensions; this is shown for two subsequent levels T_ℓ, T_{ℓ−1} of the hierarchy in Fig. 14 (left). On level ℓ the lattice spacing is a_ℓ = 2^{L−1−ℓ} a, and we assume that there is an action S_ℓ: Ω_ℓ → R with associated probability density π_ℓ ∝ exp(−S_ℓ). Again, the coarse level theories are naturally obtained as (approximate) effective theories of the original fine-level theory on level L−1, where S_{L−1} = S and π_{L−1} = π*. While the coarse level actions are constructed by starting from the original lattice, the hierarchical sampler in Alg. 2 constructs new fine level samples by generating a proposal on the coarsest lattice and successively adding fine-level modes. On each level this requires a mechanism for filling in the values of unknowns in the fine-level space Ω_ℓ for a given state Φ′ ∈ Ω_{ℓ−1} in the coarse level space Ω_{ℓ−1}. We use two iterations of the construction described for the Ising model in [31] to achieve this. As illustrated in Fig. 1(a) there, the key idea is to use a rotated lattice with a lattice spacing that is reduced by a factor √2. The values at the additional sites that are generated in each rotation are drawn from a distribution which depends only on the values at the already existing sites.
To explain this process in more detail, observe that on each level the state space Ω_ℓ can be written as the direct sum of three spaces Ω_ℓ^{(2)}, Ω_ℓ^{(1)} and Ω_{ℓ−1}, with Ω_ℓ := Ω_ℓ^{(2)} ⊕ Ω_ℓ^{(1)} ⊕ Ω_{ℓ−1}, which should be compared to the decomposition in Eq. (12). To see this, and to define Ω_ℓ^{(1)} and Ω_ℓ^{(2)}, separate the unknowns Φ ∈ Ω_ℓ into three different classes, depending on which topological entity of a coarse grid cell on level ℓ−1 they are associated with, namely:

1. coarse-level unknowns associated with coarse level vertices, collected in a vector Φ′ ∈ Ω_{ℓ−1} and shown as empty black circles in Fig. 14,

2. fine-level unknowns associated with the interior of coarse level cells, collected in Φ^{(1)} ∈ Ω_ℓ^{(1)} and shown as empty blue squares (although sampling is particularly simple if the action only contains nearest-neighbour interactions, as for the Ising model in [31], so that the value at each new site can be drawn independently, this is not a necessary condition), and
3. fine-level unknowns associated with edges of coarse level cells, collected in Φ^{(2)} ∈ Ω_ℓ^{(2)} and shown as solid red circles in the figure.

Given Φ^{(1)} ∈ Ω_ℓ^{(1)} and Φ^{(2)} ∈ Ω_ℓ^{(2)}, write Φ = [Φ^{(2)}, Φ^{(1)}, Φ′], which should be compared to Eq. (13) in the main text. Assume that there is a conditional probability density π̃^{(1)}(•|Φ′) on the state space Ω_ℓ^{(1)}, given the values of the coarse level unknowns Φ′ ∈ Ω_{ℓ−1}. Since the empty black circles and empty blue squares in the top right of Fig. 14 define a rotated lattice with spacing √2 a_ℓ, the density π̃^{(1)} can be constructed by writing down an action S̃^{(1)} on this rotated lattice, namely π̃^{(1)}(Φ^{(1)}|Φ′) ∝ exp(−S̃^{(1)}([Φ^{(1)}, Φ′])).
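The classification into the three classes above can be sketched by a simple index-based decomposition; this is a purely illustrative toy for a periodic d × d lattice with even d:

```python
def classify_sites(d):
    """Partition the sites (i, j) of a periodic d x d fine lattice (d even)
    by the topological entity of the coarse cell they belong to: coarse
    vertices (both indices even), cell interiors (both indices odd) and
    edge midpoints (exactly one odd index), mirroring the decomposition
    of Omega_l into Omega_l^(2), Omega_l^(1) and Omega_{l-1}."""
    vertices, interiors, edges = [], [], []
    for i in range(d):
        for j in range(d):
            if i % 2 == 0 and j % 2 == 0:
                vertices.append((i, j))       # coarse level vertices
            elif i % 2 == 1 and j % 2 == 1:
                interiors.append((i, j))      # interiors of coarse cells
            else:
                edges.append((i, j))          # midpoints of coarse edges
    return vertices, interiors, edges
```

On a d × d lattice this yields d²/4 vertices, d²/4 cell interiors and d²/2 edge midpoints, consistent with the factor-four coarsening in two dimensions.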
This action could for example be obtained by a renormalisation group transformation on S_ℓ, followed by some approximations that guarantee that it is possible to effectively generate states in Ω_ℓ^{(1)} for a given Φ′. Similarly define a conditional probability density π̃^{(2)}(•|Φ̃) on Ω_ℓ^{(2)}, given the values of the unknowns Φ̃ ∈ Ω̃_ℓ.

B Cost of the multilevel algorithm

The cost of generating a new sample with the standard Metropolis sampler on the coarsest level is proportional to the number of unknowns d_0, and does not increase as the number of levels increases (while keeping a_0 fixed). More specifically, we assume that this cost C_coarse can be bounded by C_coarse ≤ A_0 d_0 for some constant A_0. Furthermore, given the coarse level sample y_{ℓ−1}, the cost of executing Alg. 2 is proportional to d_ℓ and can be bounded by C_{2−level,ℓ} ≤ B_0 d_ℓ. In our code we measured C_coarse, C_{2−level,ℓ} and C_ℓ during the setup phase of each run, and then used Eq. (36) to compute the C_ℓ^eff required in Eq. (16). The integrated autocorrelation times were updated on-the-fly in the multilevel Monte Carlo algorithm, generating additional samples if this increased the N_ℓ^eff in Eq. (16).
To quantify the cost of the multilevel Monte Carlo algorithm in Alg. 4, further assume that the sub-sampling rates t_ℓ are bounded by some t_max ≥ t_ℓ for all ℓ = 1, …, L−2. By definition, this is also an upper bound on the integrated autocorrelation times on all levels, i.e. τ_int,ℓ ≤ t_max for ℓ = 0, 1, …, L−1. Then there exists a constant C̃_0 such that C_ℓ^eff ≤ C̃_0 d_ℓ = 2^ℓ C̃_0 d_0; in other words, the cost for generating an independent measurement Y_ℓ^{(j)} on level ℓ does not grow at a faster rate than the number of unknowns d_ℓ = 2^ℓ d_0 = 2^{ℓ−L+1} d on this particular level. More generally, to make the following derivation applicable for field theories in D > 1 dimensions (where D = 1 corresponds to quantum mechanics), we assume that there is a C_0 > 0 such that C_ℓ^eff ≤ 2^{Dℓ} C_0 for all ℓ = 0, 1, …, L−1.
We now show how the cost of the MLMC algorithm depends on the tolerances ε_disc and ε_stat as ε_disc, ε_stat → 0.
Using the definition of N_ℓ^eff in Eq. (16) and the fact that max{A, B} ≤ A + B, the total cost of MLMC with a tolerance ε_stat on the statistical error and a given number of levels L can be bounded as in Eq. (37). To obtain a bound on the number of levels L, we further assume that the discretisation is of order α, i.e. for a given lattice spacing a the discretisation error ∆_disc(a) can be bounded by ∆_disc(a) ≤ ∆_0 a^α for some constants α ≥ 1 and ∆_0. If we set a = 2^{−L_max+1} a_0 with L_max = 1 + ⌈α^{−1} log_2(∆_0 a_0^α / ε_disc)⌉, the discretisation error will be smaller than ε_disc. Hence it is not necessary to use more than L_max levels, and L in Eq. (38) can be bounded by L ≤ L_max = O(|log ε_disc|). Using this bound in Eqs. (38) and (39) implies that the cost in Eq. (37) has the computational complexity given in Eq. (40) as a function of ε_disc and ε_stat. For the quantum mechanical problems considered in this paper we have D = 1, which leads to the computational complexity in Eq. (18); Eq. (4) in the introduction is a special case of this for α = β = 1. In fact, as explained in [17], α = β holds more generally for the Markov Chain variant of the multilevel Monte Carlo algorithm. Hence, for quantum field theories in higher dimensions with D > α = β the third case in Eq. (40) applies, which results in Eq. (5) in the introduction. Finally, setting ε_stat = ε_disc = ε/√2 gives Eq. (6).
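The case distinction in Eq. (40) arises from a geometric series; a sketch of the calculation, under the stated assumptions V_ℓ ≤ 2^{−βℓ} V_0 and C_ℓ^eff ≤ 2^{Dℓ} C_0 (the constants are schematic):

```latex
\sqrt{V_\ell\, C_\ell^{\mathrm{eff}}}
  \le \sqrt{V_0 C_0}\; 2^{(D-\beta)\ell/2},
\qquad
\sum_{\ell=0}^{L-1} 2^{(D-\beta)\ell/2} =
\begin{cases}
\mathcal{O}(1), & \beta > D,\\[2pt]
L = \mathcal{O}\big(\vert\log \epsilon_{\mathrm{disc}}\vert\big), & \beta = D,\\[2pt]
\mathcal{O}\big(2^{(D-\beta)L/2}\big)
  = \mathcal{O}\big(\epsilon_{\mathrm{disc}}^{-(D-\beta)/(2\alpha)}\big), & \beta < D.
\end{cases}
```

Squaring this sum and multiplying by ε_stat^{−2} gives the statistical part of the cost; adding the unavoidable O(2^{DL}) = O(ε_disc^{−D/α}) cost of touching the finest level reproduces, for β = D = 1, the bound O(ε_stat^{−2} |log ε_disc|² + ε_disc^{−1/α}) quoted in the abstract.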

C Rejection sampling
To draw samples from the distribution p_{σ,δx} defined in Eq. (31) we use rejection sampling with a Gaussian envelope, as described in Algorithm 6: draw a sample x from the Gaussian distribution g_σ with g_σ(x) = √(2σ/π³) exp(−(2σ/π²) x²), and accept it with probability given by the ratio of the target density to the suitably scaled envelope; otherwise repeat.
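Since the explicit form of Eq. (31) is not reproduced above, the sketch below assumes a von Mises-type target p_{σ,δx}(x) ∝ exp(σ(cos(x − δx) − 1)) on [−π, π); under this assumption Jordan's inequality sin t ≥ 2t/π guarantees cos u − 1 = −2 sin²(u/2) ≤ −2u²/π², so the stated Gaussian envelope dominates the target:

```python
import numpy as np

def sample_p(sigma, dx, rng):
    """Rejection sampling with the Gaussian envelope
        g_sigma(u) = sqrt(2*sigma/pi^3) * exp(-2*sigma*u^2/pi^2)
    for an assumed von Mises-type target
        p(x) proportional to exp(sigma*(cos(x - dx) - 1)) on [-pi, pi).
    Since cos(u) - 1 <= -2*u^2/pi^2 on [-pi, pi], the acceptance
    probability exp(sigma*(cos(u) - 1 + 2*u^2/pi^2)) lies in (0, 1]."""
    std = np.pi / (2.0 * np.sqrt(sigma))          # envelope standard deviation
    while True:
        u = rng.normal(0.0, std)
        if abs(u) >= np.pi:
            continue                              # outside the domain
        p_acc = np.exp(sigma * (np.cos(u) - 1.0 + 2.0 * u**2 / np.pi**2))
        if rng.random() < p_acc:
            return (dx + u + np.pi) % (2.0 * np.pi) - np.pi   # wrap to [-pi, pi)
```

For moderate σ the acceptance rate is high (around two thirds for σ = 4), so only a few envelope draws are needed per sample.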

D Fixed tolerance on the total error
While for the results presented in the main text we fixed ε_stat = 0.01 and varied the tolerance on the discretisation error, in the following we also show the runtime as a function of the tolerance on the total root mean square error. For this we set ε_stat = ε_disc = ε/√2, as is common in the multilevel Monte Carlo literature. Figs. 15 and 16 show the runtime of the single-level HMC method, the hierarchical sampler and the multilevel method as a function of the tolerance on the total error; they should be compared to Figs. 10 and 11. Finally, Fig. 17 shows the runtime of the standard cluster sampler and the multilevel-accelerated variant of the method; the corresponding plot in the main text is Fig. 13.
For example, consider the prediction of the topological susceptibility (z = 5) in lattice QCD (D = 4) with an improved action (α = 2); in this case hierarchical sampling reduces the cost of a Monte Carlo simulation accordingly.

T = {t_j = j a : j = 0, 1, …, d − 1}. Paths x on this lattice are objects in the domain Ω = D^d ⊂ R^d. We introduce a hierarchy of L lattices T_ℓ for ℓ = 0, 1, …, L−1, such that lattice T_ℓ has d_ℓ = 2^{ℓ−L+1} d points and a lattice spacing of a_ℓ = T/d_ℓ = 2^{L−1−ℓ} a, i.e. T_ℓ = {t_j = j a_ℓ : j = 0, 1, …, d_ℓ − 1}. Here T_{L−1} = T is the original lattice with d_{L−1} = d points and a spacing of a_{L−1} = a. Paths on lattice T_ℓ are represented by vectors in the domain Ω_ℓ = D^{d_ℓ} ⊂ R^{d_ℓ}, where obviously Ω_{L−1} = Ω.

Figure 5: Integrated autocorrelation time for the topological oscillator. Results are shown both for a standard HMC and the hierarchical sampler.

Figure 6: Acceptance probability p_acc = P(x^{(t+1)}_{L−1} ≠ x^{(t)}_{L−1}) of standard Monte Carlo with hierarchical sampling (HSMC). Results are shown both for the double well potential and the topological oscillator action.

Figure 7: Discretisation error ∆_disc(a) as a function of the lattice spacing a for the double well potential. The fit takes the form ∆_0 a. Statistical errors are shown as vertical bars, and the data points are labelled with the number of dimensions d for each lattice spacing.

Figure 8: Variance of difference estimators Y_ℓ and the quantity of interest Q for the double well potential. The lattice spacing on level ℓ is a_ℓ = 2^{L−1−ℓ} a. The data points are labelled with the number of dimensions d_ℓ for each lattice spacing.

Figure 10: Runtime of different Monte Carlo sampling algorithms for the double well potential with a fixed tolerance ε_stat = 0.01 on the statistical error. Results are shown in seconds and as a function of the tolerance ε_disc. The data points are labelled with the number of dimensions d for each lattice spacing.

Figure 11: Runtime of different Monte Carlo sampling algorithms for the topological oscillator with a fixed tolerance $\epsilon_{\text{stat}} = 0.01$ on the statistical error. Results are shown in seconds and as a function of the tolerance $\epsilon_{\text{disc}}$. The data points are labelled with the number of dimensions $d$ for each lattice spacing.

Figure 12: Estimated breakdown of cost per level for MLMC. Results are shown for the finest lattice spacing, with $d = 2048$ for the double well potential and $d = 1024$ for the topological oscillator.

Figure 13: Runtime of the single-level (StMC) and multilevel (MLMC) cluster algorithm for the topological oscillator with a tolerance $\epsilon_{\text{stat}} = 0.001$ on the statistical error. Results are shown in seconds and as a function of the tolerance $\epsilon_{\text{disc}}$. The data points are labelled with the number of dimensions $d$ for each lattice spacing.

Figure 14: Fill-in of fine-level unknowns on a hierarchical two-dimensional lattice as required in lines 5 and 6 of Alg. 5. Starting from the coarse-level unknowns on a rotated lattice (upper left), first the unknowns in $\Omega^{(1)}$ are filled in.

$C^{2\text{-level}}_\ell \le B_0 d_\ell = 2^{\ell-L+1} B_0 d$ for some other constant $B_0$. A straightforward calculation shows that the cost of obtaining a new sample $x^{(t+1)}_\ell$ with Alg. 3 can be bounded by $C_\ell \le (A_0 + B_0) d_\ell = 2^{\ell-L+1}(A_0 + B_0) d$, i.e. it does not grow more than linearly with the number $d_\ell$ of unknowns on level $\ell$. Taking into account the subsampling rates $t_\ell$, the cost of obtaining an independent measurement of $Y^{(j)}_\ell$ on level $\ell$ in Alg. 4 is therefore
$$C^{\text{eff}}_\ell = \begin{cases} \tau_{\text{int},\ell}\left(C^{2\text{-level}}_\ell + t_{\ell-1} C_{\ell-1}\right) & \text{for } \ell = 1, \ldots, L-1 \\ \tau_{\text{int},0}\, t_0\, C^{\text{coarse}} & \text{for } \ell = 0 \end{cases} \qquad (36)$$
where $\tau_{\text{int},\ell}$ is the integrated autocorrelation time on level $\ell$. Assuming that $V_\ell \le 2^{-\beta\ell} V_0$, a straightforward calculation shows that $\sigma(L)$ can be bounded as follows, depending on whether $\beta$ is larger than, equal to, or smaller than $D$: $\sigma(L) \le \kappa_0 \ldots$

Figure 15: Runtime of different Monte Carlo sampling algorithms for the double well potential. Results are shown in seconds and as a function of the tolerance on the total error. The data points are labelled with the number of dimensions $d$ for each lattice spacing.

Figure 16: Runtime of different Monte Carlo sampling algorithms for the topological oscillator. Results are shown in seconds and as a function of the tolerance on the total error. The data points are labelled with the number of dimensions $d$ for each lattice spacing.

Figure 17: Runtime of the single-level (StMC) and multilevel (MLMC) cluster algorithm for the topological oscillator. Results are shown in seconds and as a function of the tolerance on the total error. The data points are labelled with the number of dimensions $d$ for each lattice spacing.

Table 1: Comparison of runtime for standard, single-level Monte Carlo (StMC) and multilevel Monte Carlo (MLMC). All times were obtained with $\epsilon_{\text{stat}} = 0.01$ and are given in seconds.

…burn-in phase of the simulation. Figs. 10 and 11 show the total runtime for a given tolerance $\epsilon_{\text{stat}} = 0.01$ on the statistical error and different lattice spacings $a$, corresponding to different values of $\epsilon_{\text{disc}}$: as discussed in Section 4.2, for both considered problems we bound the discretisation error by $\epsilon_{\text{disc}} = \Delta_0 a$, with $\Delta_0 = 0.17298$ for the double-well potential and $\Delta_0 = 0.21567$ for the topological oscillator. The times reported for HSMC and MLMC in Figs. 10 and 11 were obtained with the renormalised coarse-level action.

Table 2: Comparison of runtime for Monte Carlo with hierarchical sampling (HSMC) and multilevel Monte Carlo (MLMC). All times were obtained with $\epsilon_{\text{stat}} = 0.01$ and are given in seconds.

…considered here HSMC is still faster than MLMC.

Table 3: …MLMC, respectively. All times were obtained with $\epsilon_{\text{stat}} = 0.01$ and are given in seconds.