Infinite Variance in Monte Carlo Sampling of Lattice Field Theories

In Monte Carlo calculations of expectation values in lattice quantum field theories, the stochastic variance of the sampling procedure that is used defines the precision of the calculation for a fixed number of samples. If the variance of an estimator of a particular quantity is formally infinite, or in practice very large compared to the square of the mean, then that quantity cannot be reliably estimated using the given sampling procedure. There are multiple scenarios in which this occurs, including in Lattice Quantum Chromodynamics, and a particularly simple example is given by the Gross-Neveu model, where Monte Carlo calculations involve the introduction of auxiliary bosonic variables through a Hubbard-Stratonovich (HS) transformation. Here, it is shown that the variances of HS estimators for classes of operators involving fermion fields are divergent in this model and in an even simpler zero-dimensional analogue. To correctly estimate these observables, two alternative sampling methods are proposed and numerically investigated.


I. INTRODUCTION
* cyunus@mit.edu
† wdetmold@mit.edu

Quantum field theories (QFTs) at strong coupling are interesting in many contexts in particle, nuclear, and condensed matter physics, but in many cases can only be quantitatively investigated using numerical approaches. One such approach involves discretising the theory on a spacetime lattice with a Euclidean metric. The functional integrals corresponding to measurable quantities can then be approximated using an importance-sampling Monte Carlo method. In such a calculation, the probability of sampling a given configuration of the field degrees of freedom is determined by the Euclidean action and, depending on the parameters in the action, it is possible that field configurations enter with probability weights arbitrarily close to zero. If this is the case, certain random variables (observables corresponding to field operators) will have arbitrarily large (infinite) variance. As will be discussed below, quantities with infinite variance in standard sampling algorithms occur in phenomenologically-relevant theories such as Quantum Chromodynamics (QCD) due to zero-modes of the lattice Dirac operator, as well as in other contexts. A particularly clear example is provided by correlation functions constructed from large numbers of fermion fields, as will be the focus of this work. (Observables with infinite variances in fermionic theories have been analysed using a different approach in Ref. [1].) In applying Monte Carlo methods to QFTs, the Central Limit Theorem (CLT) is used to construct confidence intervals for the expectation value (mean) of the random variable from the corresponding variance over the samples. However, a random variable with infinite variance does not satisfy the conditions for the CLT, and the sample variance of such a random variable is not meaningful because it does not converge to a particular value with
increasing sample size. Moreover, the CLT is valid only in the limit that the sample size approaches infinity, and hence similar deficiencies will appear for random variables with finite but very large variances compared to the squares of their means. Despite these issues, there are physically interesting quantities in QCD and other field theories that formally have finite mean but infinite variance under standard sampling methods. To address these cases, alternative sampling schemes are required for reliable Monte Carlo estimates. In this work, two methods will be introduced to address specific occurrences of infinite variance. The first method is applicable in the context of fermionic lattice field theories that are typically approached using the (continuous) Hubbard-Stratonovich (HS) transformation, such as theories whose actions involve powers of fermion bilinear operators. A class of discrete HS transformations is introduced which generates discrete auxiliary bosonic variables. The variance of an estimator constructed from these discrete bosonic variables is then manifestly finite, although it may still be very large compared to the square of its mean. This discrete sampling scheme is investigated in a toy model and in the 2D Gross-Neveu (GN) model. While the approach is seen to be useful in some contexts, it becomes impractical in the limit of large spacetime volumes in its current implementation. The second method that is considered is a sequential reweighting procedure that is suitable for the analysis of non-negative stochastic variables. With this method, the mean of such a non-negative variable with infinite variance can be written as a product of the means of several non-negative random variables, each having finite variance. This approach is also investigated in the toy model and in the 2D GN model but can be applied in more complicated theories.
The structure of this work is as follows. In Sec. II, the way in which random variables with infinite variances arise in lattice calculations of field theories such as QCD is outlined as a motivation for subsequent studies of related phenomena in simple models. In Sec. III, the main statistical concepts that are used in our analysis are introduced and interpreted. In Sec. IV, simple models are introduced that cleanly exhibit the features that lead to observables with infinite variance. In Sec. V, a novel discrete Hubbard-Stratonovich transform is presented that provides estimators with manifestly finite variance. This method is tested for the toy models introduced in Sec. IV. In Sec. VI, a new reweighting method that can be applied to non-negative stochastic variables is introduced, and this method is then tested for the toy model introduced in Sec. IV. Finally, Sec. VII summarises the results of this work and provides an outlook for future directions of investigation. A number of important statistical results that support our main analysis are proven in Appendix A, while Appendices B and C present further technical details.

II. INFINITE VARIANCE IN EUCLIDEAN FIELD THEORY
One can construct illustrative examples of infinite variance in phenomenologically-relevant theories such as lattice QCD. In this case, the partition function is given by

Z = ∫ DU DΨ̄ DΨ e^{−S[U] − Ψ̄ D[U] Ψ} = ∫ DU e^{−S[U]} det D[U],

where U represents the gauge field and Ψ and Ψ̄ represent the fermions. Here, S[U] is the bosonic part of the action of lattice QCD, D[U] is the N_D × N_D Dirac matrix, the determinant of which arises from integration over the fermion degrees of freedom, and σ(D[U]) is the spectrum of D[U], which accounts for multiplicities of the eigenvalues. It is assumed that the Dirac matrix D[U] is diagonalizable for each U and can be expressed as

D[U] = Q_U Λ_U Q_U^{-1},

where Λ_U is a diagonal matrix consisting of the eigenvalues λ^a_U ∈ σ(D[U]), and Q_U is not necessarily unitary.
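The decomposition above is easy to verify numerically. The following sketch (an illustration with a random non-symmetric matrix standing in for D[U]; not code from this work) checks the right- and left-eigenvector relations and the spectral form of the propagator used below:

```python
import numpy as np

rng = np.random.default_rng(5)

# A generic non-normal matrix is diagonalizable as D = Q Lambda Q^{-1};
# the columns of Q are right eigenvectors, the rows of Q^{-1} are left
# eigenvectors, and Q is generally not unitary.
d = rng.standard_normal((4, 4))
lam, q = np.linalg.eig(d)
q_inv = np.linalg.inv(q)

# Right eigenvectors: D v_a = lambda_a v_a (column a of Q).
assert np.allclose(d @ q, q * lam)
# Left eigenvectors: w_a^T D = lambda_a w_a^T (row a of Q^{-1}).
assert np.allclose(q_inv @ d, lam[:, None] * q_inv)
# Spectral form of the propagator: D^{-1}_ij = sum_a (v_a)_i (w_a)_j / lambda_a.
d_inv = (q / lam) @ q_inv
assert np.allclose(d_inv, np.linalg.inv(d))
print("eigen-decomposition checks passed")
```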
With this definition, the columns v^(a)_U of Q_U are the right eigenvectors of D[U] and the rows (w^(a)_U)^T of Q_U^{-1} are the corresponding left eigenvectors,

D[U] v^(a)_U = λ^a_U v^(a)_U,   (w^(a)_U)^T D[U] = λ^a_U (w^(a)_U)^T,   Σ_i (w^(a)_U)_i (v^(b)_U)_i = δ^{ab},

where a and b label the eigenvalues and i and j index the components of the corresponding eigenvectors. It must be noted that one can permute and (independently) scale the columns of Q_U freely. Furthermore, Q_U cannot generically be chosen continuously in U, and consequently the quantities λ^a_U, v^(a)_U and w^(a)_U depend implicitly on the choice of Q_U. In terms of these quantities, the components of the inverse of the Dirac operator for field U can be expressed as

D^{-1}[U]_{ij} = Σ_a (v^(a)_U)_i (w^(a)_U)_j / λ^a_U.

For certain values of the couplings that define the theory, there may be an "exceptional configuration", that is, a bosonic field configuration U* such that, for simplicity, strictly one of the eigenvalues, λ* ∈ σ(D[U*]), vanishes. In what follows, the corresponding left and right eigenvectors of U* will be denoted by (w*)^T and v* respectively. If such exceptional configurations exist, it can be seen that the standard estimators of physical quantities, such as fermion propagators, diverge. To illustrate this, consider a fermion field bilinear denoted as

V^1_{ij} ≡ Ψ_i Ψ̄_j,

and choose a particular combination of these bilinears weighted by the left and right eigenvectors at the exceptional configuration,

O ≡ Σ_{i,j} (w*)_i V^1_{ij} (v*)_j.

After the fermions are integrated out, for each sample size N_S, a standard estimator for the expectation value of V^1_{ij} in a Monte Carlo calculation is

V̂^1_{ij} = (1/N_S) Σ_{t=1}^{N_S} D^{-1}[U_t]_{ij},   (7)

where U_t for t ∈ {1, · · · , N_S} are assumed to be independently and identically generated samples. The corresponding estimator for O is

Ô_{N_S} = Σ_{i,j} (w*)_i V̂^1_{ij} (v*)_j.

The mean of Ô_{N_S} is given as

⟨Ô_{N_S}⟩ = (1/Z) ∫ DU e^{−S[U]} det D[U] Σ_{i,j} (w*)_i D^{-1}[U]_{ij} (v*)_j.   (9)

As one of the eigenvalues, λ*, for the field configuration U* vanishes, the integration measure in Eq. (9) is such that U* will have vanishing probability of being sampled, and consequently the singularity of D^{-1}[U*]_{ij} will not cause the expectation value to diverge.
Nevertheless, configurations in a neighbourhood of U*, which will be sampled with a very small frequency governed by det D[U], will make large individual contributions to the sample mean. The expectation value itself remains finite: near U*, det D[U] ∝ λ* cancels the single factor of (λ*)^{-1} in the observable by construction. However, since it was assumed that λ* = 0 is the only vanishing eigenvalue of D[U*], the variance of Ô_1 is divergent, as det D[U] (λ*)^{-2} ∝ (λ*)^{-1} diverges as U → U*. It must be stressed that, in an actual Monte Carlo calculation, exceptional configurations will not be sampled, so the sample variance will remain finite for any finite sample size, but it will not be bounded from above as the sample size increases.
The above example of a single fermion propagator illustrates the way in which infinite variance manifests but is not of physical relevance. However, correlation functions involving hadrons and nuclei in a theory such as QCD involve many propagators that arise from products of k fermion bilinears. In this context, it is useful to consider the more general product

V^k_{{i},{j}} ≡ Π_{l=1}^{k} Ψ_{i_l} Ψ̄_{j_l},

where {i} ≡ {i_1, · · · , i_k} and {j} ≡ {j_1, · · · , j_k} label the fermions that enter in an ordered manner. A family of estimators for V^k_{{i},{j}} analogous to Eq. (7) for each N_S is

V̂^k_{{i},{j}} = (1/N_S) Σ_{t=1}^{N_S} Σ_{π∈S_k} s_π Π_{l=1}^{k} D^{-1}[U_t]_{i_l j_{π(l)}},

unless {i} or {j} contain repeated indices, in which case V̂^k_{{i},{j}} = 0 due to the anti-commutativity of fermions. Here, S_k is the symmetric group of degree k, and s_π is the sign of the permutation π. Again choosing N_S = 1, one observes that near an exceptional configuration the estimator is dominated by the inverse powers of the vanishing eigenvalues. If N_0 > 1 eigenvalues of D[U] vanish, then it suffices to focus on a product of N_0 fermion bilinears,

R_{N_0} ≡ Π_{s=1}^{N_0} Σ_{i,j=1}^{N_D} (w_s)_i Ψ_i Ψ̄_j (v_s)_j,

where N_D is the size of the Dirac matrix and (w_s)_i and (v_s)_j are the components of the left and right eigenvectors of D[U], respectively, with vanishing eigenvalues λ_s = 0, for s ∈ {1, · · · , N_0}. For the estimator

R̂_{N_0} = Σ_{π∈S_{N_0}} s_π Π_{s=1}^{N_0} Σ_{i,j} (w_s)_i D^{-1}[U]_{ij} (v_{π(s)})_j,

where the first sum is over permutations π in the symmetric group S_{N_0}, the same arguments as for Ô_1 show that R̂_{N_0} has infinite variance.
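By the Leibniz formula, the signed permutation sum appearing in these estimators is precisely the determinant of the k × k submatrix of D^{-1} selected by the index sets {i} and {j}, which is how such contractions are evaluated in practice. A small numpy sketch (illustrative names, not from this work) confirms the equivalence:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(4)

def vhat_k(d_inv, rows, cols):
    """Antisymmetrized product of propagators = det of a k x k minor of D^{-1}."""
    if len(set(rows)) < len(rows) or len(set(cols)) < len(cols):
        return 0.0  # repeated indices vanish by anticommutativity
    return np.linalg.det(d_inv[np.ix_(rows, cols)])

def vhat_k_slow(d_inv, rows, cols):
    """Explicit sum over permutations with signs, for cross-checking."""
    total = 0.0
    for perm in permutations(range(len(rows))):
        sign = np.linalg.det(np.eye(len(rows))[list(perm)])  # sign of the permutation
        total += sign * np.prod([d_inv[rows[l], cols[perm[l]]] for l in range(len(rows))])
    return total

d_inv = rng.standard_normal((6, 6))
fast = vhat_k(d_inv, [0, 2, 4], [1, 3, 5])
slow = vhat_k_slow(d_inv, [0, 2, 4], [1, 3, 5])
print(fast, slow)  # the two evaluations agree
```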
The above arguments illustrate how infinite variances of estimators of physically-relevant quantities can arise in Monte Carlo calculations of theories including lattice QCD. We note that the situation is exacerbated in quenched QCD, where the fermion determinant is taken to be unity, and in partially-quenched or mixed-action QCD, where the Dirac operators entering the measure and the observables are different. In these cases, the determinant in the measure no longer cancels the singularities of the observables, and fermionic observables can even have infinite expectation values. Without knowing that an observable in such a theory is free of the problem illustrated above, standard sampling methods result in estimates of observables whose statistical behaviours are not governed by the CLT at any sample size and are unreliable.

III. STATISTICAL SAMPLING
In this section, important results for stochastic variables that will be needed in the following analysis are introduced. A review of the relevant aspects of probability theory and proofs of the results presented here are provided in Appendix A.

A. A Natural Indicator of Infinite Variance
For a sequence of independent and identically distributed (i.i.d.) random variables {X_n}, a sequence of random variables {s_n} can be defined such that

s_n = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄_n)²,   X̄_n = (1/n) Σ_{i=1}^{n} X_i.

Each s_n is an unbiased estimator of the variance of the X_n when it is finite. The n → ∞ behaviour of s_n provides empirical evidence as to whether the system has a finite variance or not. In particular:
• Let {X_n} be a sequence of i.i.d. random variables with finite variance σ². Then, as n → ∞, s_n → σ² a.s.
• Let {X_n} be a sequence of i.i.d. random variables with finite mean μ and infinite variance. Then, for any given δ > 0, the number of random variables s_n that satisfy s_n > δ is infinite a.s.
The former statement follows from the Strong Law of Large Numbers, while the latter statement is proven in Appendix A as Theorem 1.
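Both behaviours are straightforward to observe numerically. In the sketch below (an illustration, not code from this work), the running sample variance s_n settles for a normal distribution but keeps jumping for a Pareto distribution with tail index 1.5, which has finite mean but infinite variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def running_sample_variance(x):
    """Unbiased sample variance s_n of x[:n] for n = 2..len(x)."""
    n = np.arange(1, len(x) + 1)
    csum = np.cumsum(x)
    csum2 = np.cumsum(x**2)
    mean = csum / n
    # s_n = (sum x_i^2 - n mean^2) / (n - 1); undefined at n = 1, so drop it.
    with np.errstate(invalid="ignore", divide="ignore"):
        s = (csum2 - n * mean**2) / (n - 1)
    return s[1:]

# Finite-variance case: the indicator settles near sigma^2 = 1.
s_normal = running_sample_variance(rng.standard_normal(200_000))

# Infinite-variance case: s_n exhibits arbitrarily large jumps forever.
s_pareto = running_sample_variance(rng.pareto(1.5, 200_000))

print(s_normal[-1])      # close to 1
print(s_pareto[-5:])     # still drifting and jumping
```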

B. Empirical Bias of the Sample Average for Finite Systems with Exceptional Configurations
In systems that contain exceptional configurations, the convergence of the sample mean to the mean is slow, and it is not straightforward to estimate uncertainties as the sample variance does not converge. These issues resurface as empirical biases in systems with finite configuration spaces in which some configurations are sufficiently infrequently sampled. To explore this, let Ω be a finite sample space with |Ω| elements. To this space, we associate the σ-algebra F = 2^Ω, that is, the set of subsets of Ω, and a family of probability distributions P_t : F → [0, 1] for t ∈ (0, 1]. Here, t corresponds to a parameter describing the system from which the samples are drawn, such as a coupling constant or a mass. For a finite system, the knowledge of P_t({ω}) for all ω ∈ Ω completely determines P_t : F → [0, 1] through the requirement P_t(A ∈ F) = Σ_{ω∈A} P_t({ω}). Therefore, it is enough to consider P_t({ω}), and for brevity we define P_t(ω) ≡ P_t({ω}). In the following, it is assumed that P_t is continuous in the sense that P_t(ω) is a continuous function of t for t ∈ (0, 1] for all ω ∈ Ω, and that X_t is a non-negative random variable which is continuous in t in the same sense. Furthermore, the set of exceptional configurations is defined as E ⊂ Ω such that lim_{t→0} P_t(ω) = 0 and lim_{t→0} P_t(ω) X_t(ω) ≠ 0 for all ω ∈ E. An element ω ∈ E is referred to as an exceptional configuration, and it should be noted that this definition depends implicitly on the choice of X.
The mean of X_t, μ^X_t, can be written as a sum of contributions from the exceptional configurations and from the non-exceptional configurations,

μ^X_t = μ^{X,e}_t + μ^{X,/e}_t,

where

μ^{X,e}_t = Σ_{ω∈E} P_t(ω) X_t(ω),   μ^{X,/e}_t = Σ_{ω∈E^c} P_t(ω) X_t(ω),

and E^c = Ω \ E. For a Monte Carlo estimate of the mean μ^X_t with a fixed sample size N_S, the contribution from the exceptional configurations will be missing for t sufficiently close to 0, resulting in a "gap" denoted by ∆_X ≡ lim_{t→0} μ^{X,e}_t. That is, denoting the actual mean of the observable by μ_X ≡ lim_{t→0} μ^X_t, the sample mean will underestimate this value by ∆_X for ensembles that are large, but not sufficiently large that the CLT applies, as will be discussed below.
Consider the product space Ω^{N_S} corresponding to the set of all ensembles of size N_S; that is, every element ω^{[N_S]} ∈ Ω^{N_S} will correspond to a sequence (ω_1, · · · , ω_{N_S}) of elements from Ω. A new random variable, which should be interpreted as the ensemble average for each ensemble, can be defined by

X̄^{N_S}_t(ω^{[N_S]}) = (1/N_S) Σ_{k=1}^{N_S} X_t(ω_k).

Now let p^e_min(t) = min_{ω∈E} P_t(ω) and p^{/e}_min(t) = min_{ω∈E^c} P_t(ω). As t → 0, p^e_min(t) → 0 while p^{/e}_min(t) remains bounded away from zero. Therefore, for small enough t one will have [p^{/e}_min(t)]^{-1} ≪ N_S ≪ [p^e_min(t)]^{-1} for a wide range of sample sizes N_S; ensembles of such sizes will typically contain all of the non-exceptional configurations with representative frequencies while containing none of the exceptional configurations. For practical purposes, these results can be summarized by saying that for small t, with very high probability, X̄^{N_S}_t first approaches μ_X − ∆_X and then eventually converges to μ_X as N_S is further increased. The above statements are made precise and proven as Theorem 2 in Appendix A. It should be noted that if E includes more than one element, X̄^{N_S}_t may demonstrate a series of plateaus before eventually converging to μ_X. Figure 1 schematically demonstrates the expected behaviour.
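The plateau structure can be mimicked with a hypothetical two-configuration system (a minimal illustration, not a model from this work), in which the exceptional configuration is sampled with probability t but carries a contribution t · X_t that stays finite as t → 0:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state system: an ordinary configuration with X = 1 and an
# "exceptional" one sampled with probability t whose contribution
# t * X stays finite as t -> 0, giving a gap Delta = 0.5.
t = 1e-4
x_ordinary, x_exceptional = 1.0, 0.5 / t
true_mean = (1 - t) * x_ordinary + t * x_exceptional  # -> 1.5 as t -> 0

def sample_mean(n_samples):
    hits = rng.random(n_samples) < t
    return np.where(hits, x_exceptional, x_ordinary).mean()

small = sample_mean(1_000)        # N_S << 1/t: typically plateaus near mu - Delta = 1
large = sample_mean(10_000_000)   # N_S >> 1/t: converges toward mu ~ 1.5
print(small, large, true_mean)
```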

C. Non-Asymptotic Estimators
While the CLT is of utmost importance in statistical analysis, it is only valid asymptotically and for random variables with finite variance (see Appendix A). Therefore, the CLT is not applicable when dealing with random variables with infinite variance, and the standard methods of estimation cannot be utilized. Similar issues are also expected for a random variable with finite variance that has infinite variance in a certain limit, as such a variable is expected to be extremely non-Gaussian and to require impractically large sample sizes for the CLT to apply.
To address these situations, non-asymptotic estimators are important, and in this work the Median of Means (MoM) estimator will be used. The MoM is an estimator for which one is able to define confidence intervals that remain valid for random variables with infinite variance. After including the possibility of autocorrelations between samples, the MoM estimator can be defined as follows. Let {μ̂_1, · · · , μ̂_K} be the means of the random variable X on each of K independent batches of B samples of X obtained from the same stationary (thermalised) discrete time process. Then the median of means estimator is μ̂_MoM = median({μ̂_1, · · · , μ̂_K}). Confidence intervals can be defined using

P( |μ̂_MoM − μ_X| ≥ σ_X √( 8 τ_{int,X}(B) / B ) ) ≤ e^{−K/8},   (18)

where μ_X is the expectation value of X, σ_X is the standard deviation of X, and τ_{int,X}(B) is the integrated autocorrelation time of the discrete time process. Further details and a proof of the above relation are provided in Appendix B.
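A minimal implementation of the batched MoM estimator (illustrative code, not the implementation used in this work) might look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)

def median_of_means(x, n_batches):
    """Split x into n_batches equal batches, average each batch,
    and return the median of the batch means."""
    batch = len(x) // n_batches
    means = x[: batch * n_batches].reshape(n_batches, batch).mean(axis=1)
    return float(np.median(means))

# Heavy-tailed test case: numpy's pareto(a) samples have mean
# 1/(a - 1) = 2 for a = 1.5 but infinite variance.
x = rng.pareto(1.5, 1_000_000)
mom = median_of_means(x, n_batches=50)
print(mom)  # should land near the true mean of 2.0
```

The batch size B and the number of batches K trade off against each other: larger K tightens the exponential confidence level e^{−K/8}, while larger B shrinks the width of the interval.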

IV. SIMPLE EXAMPLES WITH INFINITE VARIANCE
In this section, two simple models are introduced and exemplar correlation functions are investigated to illustrate the problem of infinite variance in Monte Carlo sampling. Numerical explorations of these models are presented in Secs. V and VI below.

A. Toy Model
The first model considered is a zero-dimensional (Euclidean) theory of 2N_f interacting fermions represented by the variables Ψ_i and Ψ̄_i, where Ψ_i and Ψ̄_i are independent Grassmannian variables. The Lagrangian of this toy model is defined as

L = m Σ_{i=1}^{2N_f} Ψ̄_i Ψ_i − (g/2) ( Σ_{i=1}^{2N_f} Ψ̄_i Ψ_i )²,

where it is assumed that g is positive. As shown in Ref. [2], positivity of g is required for the unitarity of realistic theories with four-fermion interactions. The partition function of this theory coupled to (Grassmannian) sources η̄ and η is given by

Z[η̄, η] = ∫ Π_i dΨ̄_i dΨ_i exp( −L + Σ_i ( η̄_i Ψ_i + Ψ̄_i η_i ) ).

To calculate quantities derived from this partition function, one needs to remove the quartic term so that the Grassmannian integrations can be performed exactly.
The standard way to do this is to introduce an auxiliary field through a (continuous) Hubbard-Stratonovich transformation [3,4]. It is straightforward to see that, up to a multiplicative constant, the partition function is equivalent to

Z[η̄, η] ∝ ∫ dφ Π_i dΨ̄_i dΨ_i exp( −φ²/2 − (m + √g φ) Σ_i Ψ̄_i Ψ_i + Σ_i ( η̄_i Ψ_i + Ψ̄_i η_i ) ),

where φ is a real-valued scalar field. The fermions can now be integrated exactly, leading to

Z[η̄, η] ∝ ∫ dφ e^{−φ²/2} (m + √g φ)^{2N_f} exp( Σ_i η̄_i η_i / (m + √g φ) ).

Here, the Boltzmann weight

P(φ) ∝ e^{−φ²/2} (m + √g φ)^{2N_f}

is common to the partition function and all correlation functions derived from it and therefore acts as the probability weight in importance-sampling Monte Carlo calculations. Now suppose that one is interested in calculating the expectation value of the observable

O = Π_{i=1}^{2N_f} Ψ̄_i Ψ_i,

which is determined by differentiating Z[η̄, η] with respect to the sources and setting η = η̄ = 0. Using the auxiliary field, this is given by

⟨O⟩ = ∫ dφ e^{−φ²/2} (m + √g φ)^{2N_f} (m + √g φ)^{−2N_f} / ∫ dφ e^{−φ²/2} (m + √g φ)^{2N_f},

which is clearly finite. Difficulties arise if this quantity is naively estimated through a Monte Carlo calculation. The standard estimator for this expectation is

Ô_{N_S} = (1/N_S) Σ_{t=1}^{N_S} Õ(φ_t),

where N_S is the sample size and

Õ(φ) = (m + √g φ)^{−2N_f}

is the representation of the observable in terms of the auxiliary field. This quantity has a singularity at φ* = −m/√g. While one will never sample this point because P(φ*) = 0, with sufficiently many samples one will sample nearby points, and these will cause large fluctuations in the estimation of the observable. In fact, the variance of this estimator is divergent, as the second moment (and all higher moments) of the bosonic operator Õ diverges:

⟨Õ²⟩ ∝ ∫ dφ e^{−φ²/2} (m + √g φ)^{−2N_f} = ∞.   (29)
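For the toy model, both the exact mean and the erratic behaviour of the naive estimator can be demonstrated directly. The sketch below (illustrative, with N_f = 2, m = 1.73, g = 1; not the code used in this work) evaluates ⟨O⟩ in closed form from Gaussian moments and compares it with a naive Metropolis estimate of Õ, whose fluctuations are uncontrolled because its variance is infinite:

```python
import numpy as np

rng = np.random.default_rng(3)
m, g, n_f = 1.73, 1.0, 2

def weight(phi):
    # P(phi) propto exp(-phi^2/2) (m + sqrt(g) phi)^(2 n_f)
    return np.exp(-phi**2 / 2) * (m + np.sqrt(g) * phi) ** (2 * n_f)

# Exact mean: <O> = int exp(-phi^2/2) dphi / Z.  For n_f = 2 the
# Gaussian moments give Z / sqrt(2 pi) = m^4 + 6 m^2 g + 3 g^2.
exact = 1.0 / (m**4 + 6 * m**2 * g + 3 * g**2)

# Cross-check the closed form against a dense numerical integral.
phi_grid = np.linspace(-12.0, 12.0, 200_001)
dphi = phi_grid[1] - phi_grid[0]
z = weight(phi_grid).sum() * dphi
assert abs(exact - np.sqrt(2 * np.pi) / z) < 1e-6

# Naive estimator: Metropolis sampling of P(phi), then average
# O~(phi) = (m + sqrt(g) phi)^(-2 n_f), which has infinite variance.
phi, chain = 0.0, []
for _ in range(100_000):
    prop = phi + rng.normal()
    if rng.random() * weight(phi) < weight(prop):
        phi = prop
    chain.append((m + np.sqrt(g) * phi) ** (-2 * n_f))
print(exact, np.mean(chain))  # the sample mean fluctuates strongly around exact
```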

B. Gross-Neveu Model
To further explore the ideas introduced above, it is useful to consider the N_f-flavour Gross-Neveu (GN) model [2], which resembles QCD in a number of ways. In particular, it is asymptotically free and exhibits chiral symmetry breaking. (The version of the model introduced here has a discrete chiral symmetry, but it is simple to modify the action to obtain a theory with a continuous chiral symmetry [2].) Here, the Gross-Neveu model is defined in two dimensions on a discretised lattice geometry with Wilson fermions [5]. Consider a rectangular lattice described by the points {(s, t) | 1 ≤ s ≤ L, 1 ≤ t ≤ T}, where s, t, L and T are positive integers and lattice units are assumed throughout. Periodic (anti-periodic) boundary conditions are implemented in space (time). In this work, the two-dimensional Euclidean Dirac matrices γ_μ are taken to satisfy {γ_μ, γ_ν} = 2δ_{μν}. Denoting the masses by m_i and the coupling constant by g, the partition function of the GN model is given by

Z = ∫ Π_i dΨ̄_i dΨ_i exp( −Σ_i Ψ̄_i D_w(m_i) Ψ_i + (g/2) Σ_x ( Σ_i Ψ̄_i(x) Ψ_i(x) )² ),

where 1 ≤ i ≤ N_f and D_w(m_i) is the Wilson Dirac operator with bare mass m_i. In the current work, N_f = 2 flavours of fermions are used everywhere, with m_1 = m_2 = m. By utilizing a Hubbard-Stratonovich transformation, the exponential in Eq. (31) can be made bilinear in the fermion fields as in Sec. IV A.
Indeed, the toy model in Sec. IV A is an approximation to the Gross-Neveu model in which the kinetic terms in the action are ignored. The set of exceptional configurations in the GN model is more complicated than in the toy model discussed in the previous subsection. In particular, the exceptional configurations correspond to a union of surfaces of codimension 1 (and higher). For L × T = 2 × 2, the set of exceptional configurations can be found algebraically by solving the characteristic equation of the Dirac operator for a given set of parameters, and is composed of two- and three-dimensional surfaces in the four-dimensional field-space. For larger lattice geometries, determination of these surfaces can in principle be performed numerically.

V. DISCRETE HUBBARD-STRATONOVICH TRANSFORMATION
The failure of sampling for some quantities with the standard HS transformation is tied to the continuous values taken by the auxiliary field, necessitating the existence of exceptional configurations in the models of the previous section. To avoid this, a family of discrete HS sampling schemes is introduced in this section and their utility in ameliorating the infinite variance problem is investigated numerically.
As introduced above, the continuous Hubbard-Stratonovich transformation is given by

e^{(g/2) χ²} = (1/√(2π)) ∫_{−∞}^{∞} dφ e^{−φ²/2 + √g φ χ}.   (33)

This expression is valid for all commuting variables χ. However, if χ is constructed out of fermion bilinears as in the models in Sec. IV, Eq. (33) need only be satisfied up to terms of O(χ^{2N_f + 1}) (where 2N_f is the number of fermion fields for theories with spinor dimension 2), since higher powers of χ vanish identically.
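The identity in Eq. (33) is the standard Gaussian integral and can be checked numerically for commuting χ (a quick illustration, not code from this work):

```python
import numpy as np

# Check exp(g chi^2/2) = (2 pi)^(-1/2) * int dphi exp(-phi^2/2 + sqrt(g) phi chi)
g, chi = 1.3, 0.7
phi = np.linspace(-20.0, 20.0, 400_001)
dphi = phi[1] - phi[0]
f = np.exp(-phi**2 / 2 + np.sqrt(g) * phi * chi)
# Trapezoid rule; the integrand decays fast enough that the truncation
# at |phi| = 20 is negligible.
integral = (f.sum() - 0.5 * (f[0] + f[-1])) * dphi / np.sqrt(2 * np.pi)
print(integral, np.exp(g * chi**2 / 2))  # the two values agree
```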
To find additional solutions, solutions of

e^{(g/2) χ²} = Σ_{a∈A} w_a e^{√g ξ_a χ} + O(χ^{2N_f + 1})   (34)

are required, where the index a takes values in a finite index set A that is to be determined. The weights w_a are required to be non-negative to have a probabilistic representation, and the ξ_a are required to be real to avoid a sign problem. χ is assumed to satisfy χ^{2N_f + 1} = 0. After a change of variables χ → iχ, solving the above equation is equivalent to solving

e^{−(g/2) χ²} = Σ_{a∈A} w_a e^{i √g ξ_a χ} + O(χ^{2N_f + 1}),

where χ is considered as a real variable. That is, the above equation may be interpreted as the equality of two real power series in χ up to the 2N_f-th order in χ. The series on the left and right sides of the above equation can be viewed as the characteristic functions of two probability densities in a conjugate variable ξ (the characteristic function of a random variable X being defined as E[e^{iχX}]), where these densities are

p(ξ) = (1/√(2π)) e^{−ξ²/2}

and

p_d(ξ) = Σ_{a∈A} w_a δ(ξ − ξ_a),

respectively. Eq. (34) can thus be rephrased as finding weights {w_a} and nodes {ξ_a} such that, for every polynomial f(ξ) of degree at most 2N_f,

∫ dξ p(ξ) f(ξ) = Σ_{a∈A} w_a f(ξ_a).

Written in this form, the {ξ_a} and {w_a} can be found through the method of Gaussian quadrature. Denoting the (probabilists') Hermite polynomials by He_n(ξ), the N_f + 1 roots of He_{N_f+1}(ξ) give the ξ_a, and the w_a are the corresponding Gaussian quadrature weights, as shown in Appendix C.
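Concretely, the nodes and weights are those of Gauss–Hermite quadrature with the probabilists' weight e^{−ξ²/2}, which numpy provides directly. The sketch below (illustrative, not code from this work) builds the N_f = 2 scheme and checks that the discrete density reproduces the Gaussian moments up to order 2N_f:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

n_f = 2
xi, w = hermegauss(n_f + 1)  # nodes = roots of He_{N_f+1}, quadrature weights
w = w / w.sum()              # normalize to a probability distribution

# The discrete moments must match those of the standard normal
# (0 for odd k, (k-1)!! for even k) up to order 2 N_f.
for k in range(2 * n_f + 1):
    discrete = np.sum(w * xi**k)
    gaussian = 0.0 if k % 2 else float(np.prod(np.arange(1, k, 2)))
    assert abs(discrete - gaussian) < 1e-12

print(xi)  # for N_f = 2: the roots of He_3, i.e. -sqrt(3), 0, sqrt(3)
```

For this three-point scheme the normalized weights are {1/6, 2/3, 1/6}, which should match the He_3 entries of Table I.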
Having defined the sets {ξ_a} and {w_a}, a Monte Carlo calculation can be performed for a Euclidean field theory as follows. Assume that the theory has a partition function

Z = ∫ DU Π_x dΨ̄_x dΨ_x exp( −S[U] − Ψ̄ D[U] Ψ + (g/2) Σ_x χ_x² ),   χ_x = Σ_{α,β} Ψ̄_{α,x} C_{αβ} Ψ_{β,x},

where {α, β} correspond to all fermion indices except the spacetime location x = (s, t), and C_{αβ} is a complex matrix chosen such that powers of χ_x higher than χ_x^k will vanish. Then, applying the discrete HS transformation independently at each site, the partition function can be expressed as

Z = Σ_{D[a]} ∫ DU Π_x dΨ̄_x dΨ_x ( Π_x w_{a_x} ) exp( −S[U] − Ψ̄ D[U] Ψ + √g Σ_x ξ_{a_x} χ_x ),

where D[a] ≡ Π_x {a_x ∈ A} runs over the configurations of the discrete auxiliary variables and A indexes the set of roots of He_N with N > k. Note that N can be chosen to be any integer greater than k.
After integrating over the fermion fields, one obtains

Z = Σ_{D[a]} ∫ DU e^{−S[U]} det( D'[U, a] ) Π_x w_{a_x},

where D'[U, a]_{α,x;α',x'} is given as

D'[U, a]_{α,x;α',x'} = D[U]_{α,x;α',x'} − √g ξ_{a_x} C_{αα'} δ_{x,x'}.

Consequently, one can perform a Monte Carlo calculation using e^{−S[U]} det( D'[U, a] ) Π_x w_{a_x} as the probability weight.
The family of discrete HS transformations introduced here generalises the transformation first proposed by Hirsch [6] and used extensively in the context of Quantum Monte Carlo simulations [7,8]. The form used in that work is equivalent to the N f = 1 case of the transformation introduced above.

A. Discrete Sampling vs. Continuous Sampling for the Toy Model
In this section, the toy model discussed in Sec. IV is used to compare estimators based on discrete HS transformations to each other and to the standard estimator based on the continuous HS transformation. The operator O of Eq. (24), which combines fermion bilinears for each type of fermion in the model, provides a concrete example on which to focus. N_f = 2 will be used in the numerical studies. The behaviour of the different estimators is determined by the model parameters m and g in Eq. (19). The behaviour of the continuous estimator has been discussed above. For the discrete HS-based estimators, the choice of m, g and the order N of the Hermite polynomial He_N control the magnitude and probability of the least probable configuration. The roots and the corresponding weights for the first few Hermite polynomials are given in Table I. In what follows, the numerical data are analysed in N_p = 10³ steps by adding 10⁵ samples at each step. Precisely, at step k, the samples that are included are the set {1, · · · , k · 10⁵}. For each step, the data are analysed disregarding the samples not included, and all metrics, including the autocorrelation times, are calculated independently for each step.
In order to compare methods, the behaviours of the mean and the standard deviation of the continuous and discrete HS estimators are considered as a function of the sample size. Figure 2 shows this comparison for He_N for N ∈ {3, . . . , 9} at m = 1.73 and g = 1.0. These couplings are chosen such that the exceptional point φ* = −m/√g is very close to one of the configurations in the He_3 estimator (φ = −√3). As can be seen from the behaviour of the mean, most of the discrete HS estimators rapidly converge to the exactly calculable value that is used to normalize the Monte Carlo results. However, the continuous HS estimator shows significant jumps as the number of samples increases, which occur whenever a sample sufficiently close to φ* = −1.73 is chosen, as expected from the general arguments in Sec. III. Note that the binning of results in steps of N_p = 10³ has a smoothing effect on the mean; unbinned results show more frequent and larger jumps. The He_3 discrete sampling rapidly converges, but is biased even for 10⁸ samples. The He_8 estimator also samples configurations close to φ* (but not as close as for He_3), and correspondingly individual samples of these points significantly modify the mean, leading to the discontinuous jumps shown in the figure.
The logarithm of the standard deviation shows the expected 1/√N_S behaviour for most of the discrete HS estimators; however, the continuous HS estimator, and to some extent the He_8 estimator, exhibits non-asymptotic scaling arising from samples close to φ*. As the number of samples increases, the continuous HS estimator will sample configurations arbitrarily close to φ* and the non-asymptotic behaviour will persist indefinitely: the mean is not guaranteed to converge to the true value for any finite sample set and the variance will not monotonically decrease. This behaviour is anticipated by Theorem 1 in Appendix A, which shows that the large jumps observed in the variance will never cease.
The behaviour seen for the He_3 and He_8 estimators is in line with expectations, given the configurations that are sampled and their respective probabilities. He_8 has a root t_a ≈ −1.63652 that is close to the exceptional configuration φ* = −1.73, and consequently the He_8 results show many jumps. This root of He_8 is sampled with a probability p ≈ 3 × 10⁻⁷ and is thus sampled about 30 times for a sample size of N_S = 10⁸. Supporting this expectation, it is observed that the first jump emerges around N_S ∼ 1/p, with the subsequent jumps less marked. For He_3 discrete sampling, the variance is apparently behaving asymptotically, falling as 1/N_S, despite the empirical bias observed in the mean. For this sampling, the root t_a = −√3 is sampled with probability p ≈ 10⁻¹³. Since this root has not been chosen in the N_S = 10⁸ samples used in Fig. 2, the mean is significantly underestimated. For N_S ∼ 10¹³, the sample mean will begin to converge to the true value and the variance will exhibit jumps (as seen for He_8). For N_S ≫ 10¹³, t_a = −√3 will be sampled representatively, the mean will converge to the correct value, and the variance will decrease asymptotically. While for this case the empirical bias would be observed with a very high probability if the same numerical experiments were repeated, it is not strictly a bias: with a very low probability, the mean will be overestimated enormously, making the estimator unbiased.
As this particular example shows, in the case of random variables with very large variance, asymptotic scaling of the variance is no guarantee of correctness. If the model parameters m and g are chosen such that the exceptional configuration is one of the roots of a given discrete HS sampling, the corresponding configuration will never be sampled, just as in the case of continuous HS sampling. Under these circumstances, the variance will decrease as 1/N S but the mean will be biased.
In Fig. 3, the mean and standard deviation of the same observable are studied for the He_3 discrete HS sampling at g = 1.0 and for a range of values of m ∈ [1.03, 2.43]. As can be seen, for masses such that the exceptional value φ* = −m/√g is not close to one of the roots t_a ∈ {−√3, 0, √3}, the calculations converge quickly to the correct value as the number of samples is increased and display the expected asymptotic 1/N_S scaling of the variance. However, as φ* moves closer to the root at t_a = −√3 from either above (m = 1.63) or below (m = 1.83), the convergence to the true value is much slower and large jumps are seen in the variance each time this root is sampled. For m = 1.73, the results apparently converge rapidly with 1/N_S scaling, but to an incorrect result at this number of samples (as in the previous figure).
To obtain a concrete value, we choose ε = 3 × 10⁻⁷, corresponding to 5 standard deviations for the standard normal distribution. Then, N(δ, ε) can be chosen in terms of r = δ/(μ − ∆), the ratio of the deviation δ to the biased mean μ − ∆, and Φ(x), the cumulative distribution function of the standard normal distribution. For r = 0.01, N(δ, ε) = 6 × 10⁵ satisfies the required conditions.
To further investigate how the $\mathrm{He}_3$ discrete sampling behaves as the exceptional point of the theory moves towards one of the roots, the convergence of the sample average is studied. In this simple toy model, the expected deviation of the sample mean arises from the contribution of just one root, the least probable, and is straightforward to determine. Figure 4 presents the results and shows that as the exceptional point approaches a root, the number of samples needed to remove the empirical bias increases, scaling approximately as the inverse probability of the least probable root.
where the second product is over the fermion spin components. This quantity is evaluated at a single site, chosen to be $(s, t) = (1, 1)$, and is constructed from all spin and flavour components of the fermion field at that site. While the slope converges to $-0.5$ for the discrete sampling scheme, the standard deviation of the continuous scheme exhibits large jumps over the entire $N_S = 10^8$ samples. Fig. 6 displays results for the same quantities calculated using a larger lattice of extent $L = T = 8$. It is clear that over the same range of sample sizes, even the discrete sampling scheme does not conclusively show the variance decreasing as $1/N_S$.
The lack of convergence seen for the larger lattice can be understood by considering the spectrum of the logarithm of $Q(1, 1)$. Fig. 7 shows this spectrum on each configuration as a function of the logarithm of the probabilities of the configurations for $L \times L$ lattices with $L \in \{2, 3, 4\}$ (since the number of configurations grows exponentially with $L$, results for $L > 4$ are not shown). As can be seen, in each case there are a significant number of rare but important configurations. As $L$ increases, the number of these configurations increases rapidly.
The operator $Q(1, 1)$ is explicitly constructed such that for the continuous HS sampling scheme $P(\phi^*) = 0$ and $\tilde{Q}(\phi^*) = \infty$ for at least one exceptional configuration $\phi^*$, while $\sum_\phi P(\phi)\tilde{Q}(\phi) < \infty$ (here, $\tilde{Q}$ is the HS representation of $Q(1, 1)$ after the fermions are integrated out). Since the Gaussian weight is nonzero for $|\phi| < \infty$, $P(\phi^*) = 0$ occurs only when the determinant vanishes. For a valid discrete sampling scheme, $\phi^*$ will not be in the domain of the discrete variable $\xi$. However, for $\xi$ close to $\phi^*$, $P(\xi)$ will be small, since $w(\xi) > 0$ and the determinant has the same functional dependence on either the continuous or discrete HS field. Similarly, $\tilde{Q}(\xi)$ will be large for $\xi$ near $\phi^*$, as both the continuous and discrete HS transforms result in the same functional form for $\tilde{Q}$ after the fermion fields are integrated out. As a consequence of this behaviour, configurations of smaller and smaller probability contribute larger and larger amounts to $\langle Q(1, 1)\rangle$. From Fig. 7, it is clear that this issue is exacerbated for larger lattices: since the set of exceptional configurations grows with the volume, the number of nearby configurations in discrete HS sampling with small probability and large contribution to $\langle Q(1, 1)\rangle$ grows rapidly. While $Q(1, 1)$ has finite variance under the discrete sampling scheme, in practice one needs a sample size on the order of the inverse of the smallest probability to obtain a reliable estimate of $\langle Q(1, 1)\rangle$. The smallest probability for a lattice with volume $V$ and $c$ degrees of freedom per site has an upper bound $\sim O(c^{-V})$, although the smallest probabilities will typically be much smaller. Consequently, for observables that have formally infinite variance, one needs a sample size greater than $O(c^{V})$ to properly estimate the mean. As a comparison, Fig. 8 shows the logarithm of the absolute value of an observable with finite variance for $L = 4$, $N_f = 2$, $m = -1.5$, $\sqrt{g} = 2.0$ using the $\mathrm{He}_3$ sampling scheme.
In terms of the auxiliary variable $\xi$, after integrating out the fermions, this operator takes a form $\xi \to O_{\mathrm{a.v.}}(\xi)$, so that $\langle O \rangle = \sum_\xi P(\xi)\, O_{\mathrm{a.v.}}(\xi)$. The subscript a.v. indicates the "absolute value of the condensate", which refers to the random variable $\xi \to O_{\mathrm{a.v.}}(\xi)$. Note that this definition depends on the particular auxiliary variable chosen. In contrast to $Q(1, 1)$, $O$ involves only one fermion bilinear in each term in the sum and is thus less singular around exceptional configurations; although for small probabilities $\log O \sim -\log(\mathrm{prob})$, this growth is not as severe as in the case of $Q(1, 1)$. This is made clear in Fig. 9, which compares the behaviour of $O$ and $Q(1, 1)$ directly. While the behaviour of observables with infinite variance in regions of low probability is controlled by the exceptional configurations, the more general structure of the log-count plots above is specific to the particular observable. For each observable, the probabilities of the configurations that correspond to a given observable range (as in Figs. 7c and 8, respectively) are summed over, providing the probability of finding the observable in the given range.

C. Summary
The discrete sampling schemes that have been proposed have manifestly finite variance provided the roots in the scheme do not contain an exceptional configuration. In the case where the roots do contain an exceptional configuration, the variance is still finite; however, the sample mean will be biased due to the missing contribution of the exceptional configuration to the mean. These discrete sampling schemes are effective for calculating observables for small lattice volumes and provide interesting testing grounds for investigation of the fundamental issues of infinite variance. However for quantities with infinite, or very large, variance under a continuous HS sampling, the discrete sampling schemes do not practically overcome the issues of large variance for large volumes.

VI. REWEIGHTING
In this section, a method for sampling non-negative observables with infinite variance is proposed that constructs the target observable through a series of discrete reweighting steps or through a continuous reweighting procedure. In each case, samplings are performed using probability measures that incorporate part of the observable.

A. Discrete reweighting
Consider an unnormalised probability distribution $P(x)$ and an observable $T(x)$ that is non-negative everywhere. The expectation value with the standard estimator for this quantity is given by
$$\hat{T} = \frac{1}{N_S}\sum_{i=1}^{N_S} T(x_i),$$
where the $x_i$ are sampled with respect to the probability weight $P(x)$. As in the previous sections, the variance of the standard estimator is not well-defined if the second moment of $T$ under the unnormalised probability weight $P(x)$ is infinite. To surmount this problem, a set of unnormalised probability weights $P_s(x) = P(x)\,T(x)^s$ is introduced, with $P_0(x) = P(x)$. Since $T(x)$ is non-negative, this forms a probability distribution for real $s$. We denote the expectation value of an observable with respect to $P_s(x)$ by $\langle \cdot \rangle_s$. It is straightforward to see that
$$\langle T \rangle = \prod_{r=0}^{N-1} \left\langle T^{1/N} \right\rangle_{r/N}, \qquad (52)$$
where $N$ is a positive integer. Based on the breakup in Eq. (52), an alternative estimator of $\langle T \rangle$ can be defined as follows. Consider a set of $N$ configurations such that $x_r$ is sampled with respect to $P_{r/N}$, and denote the set by $\mathbf{x} \equiv (x_0, \cdots, x_{N-1})$. Let $\mathbf{x}^{(k)}$ for $k \in \{1, \cdots, N_s\}$ be i.i.d. sets of configurations. In terms of these sets, a valid estimator is given by
$$\tilde{T} = \frac{1}{N_s}\sum_{k=1}^{N_s} \prod_{r=0}^{N-1} \left[T\!\left(x_r^{(k)}\right)\right]^{1/N}, \qquad (53)$$
where the total number of samples is $N_S = N \times N_s$. It is easy to check that this estimator is unbiased. Except in pathological cases, the random variables $T(x_r)^{1/N}$, where the $x_r$ are sampled with respect to $P_{r/N}(x)$, will have finite variance for $r \in \{0, \cdots, N-1\}$ for large enough $N$ (each quantity is less singular near an exceptional configuration than the original observable). If this is the case, then the estimator $\tilde{T}$ will also have finite variance. Fig. 10 presents results for $Q(1, 1)$ defined in Eq. (45) for the Gross-Neveu model on an $L = 2$ lattice using this discrete reweighting sampling scheme, Eq. (53). For comparison with Section V B, Fig. 10 also presents the corresponding results for the $L = 8$ lattice.
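Under the stated assumptions, Eqs. (52) and (53) can be sketched in a one-dimensional toy case: take $P(x)$ to be a standard normal and $T(x) = e^{x^2/4}$, so that $\langle T \rangle = \sqrt{2}$ is finite while the variance of $T$ under $P$ is infinite; the intermediate weights $P_{r/N}(x) \propto e^{-x^2(1 - r/(2N))/2}$ are then normal distributions that can be sampled directly (this example is ours, not from the text):

```python
import numpy as np

# Sketch of the discrete-reweighting estimator, Eqs. (52)-(53), for a toy case:
# P(x) = standard normal, T(x) = exp(x^2/4).  T has finite mean sqrt(2) but
# infinite variance under P, so the standard estimator is unreliable.
# The intermediate weights P_{r/N}(x) = P(x) T(x)^{r/N} are normals with
# variance 1/(1 - r/(2N)) and can be sampled directly.
rng = np.random.default_rng(1)

N = 8          # number of reweighting steps
Ns = 20_000    # i.i.d. sets of configurations

def T(x):
    return np.exp(0.25 * x**2)

# factors[r, k] = T(x_r^(k))^(1/N) with x_r sampled from P_{r/N};
# each factor now has finite variance.
factors = np.empty((N, Ns))
for r in range(N):
    sigma = 1.0 / np.sqrt(1.0 - r / (2.0 * N))
    x = rng.normal(0.0, sigma, size=Ns)
    factors[r] = T(x) ** (1.0 / N)

T_tilde = factors.prod(axis=0).mean()   # unbiased estimator of <T> = sqrt(2)
print(T_tilde)
```

In this toy case each factor has a second moment that can be checked analytically to be finite, so the product estimator concentrates around $\sqrt{2}$ even at modest sample sizes.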
Note that the exact result for the latter case is not shown, as calculating it with the $\mathrm{He}_n$ discrete sampling scheme requires the generation of $n^{64}$ configurations.

B. Continuous Reweighting
There is a natural extension of this sequential reweighting method to a continuous version of the procedure. To arrive at this version, note that for any $s \in [0, 1]$ and positive observable $\mathcal{P}$,
$$\langle \mathcal{P} \rangle = Z_s \left\langle \mathcal{P}^{1-s} \right\rangle_s,$$
as long as $Z_s \equiv \langle \mathcal{P}^s \rangle_0$ is finite. $Z_s$ is naturally interpreted as the partition function for the probability weight $P_s(x) = P(x)\mathcal{P}(x)^s$. Since the left-hand side of this equation is $s$-independent, one obtains a differential equation in $s$. This differential equation is straightforward to solve, noting that $Z_s^{-1}\, \frac{dZ_s}{ds} = \langle \log \mathcal{P} \rangle_s$ and $\log Z_0 = 0$. Therefore, under the assumption that $Z_s$ and $\langle \log \mathcal{P} \rangle_s$ are both finite for $s \in [0, 1]$, one finds that
$$\log \langle \mathcal{P} \rangle = \int_0^1 ds\, \langle \log \mathcal{P} \rangle_s. \qquad (56)$$
Utilizing Gauss-Legendre quadrature and Eq. (56), an estimator for $\log \langle \mathcal{P} \rangle$ can be defined. Let $N$ be a positive integer. Then an integral over $[0, 1]$ can be approximated as
$$\int_0^1 f(s)\, ds \approx \sum_{i=1}^{N} \tilde{w}_i f(s_i),$$
where the nodes $s_i = (z_i + 1)/2$ and weights $\tilde{w}_i = w_i/2$ are determined by the roots, $z_i$, of the $N$th Legendre polynomial and the corresponding weights, $w_i$, associated with Gauss-Legendre quadrature.$^{9}$ Defining the set of configurations $\mathbf{x} = (x_1, \cdots, x_N)$, where $x_i$ is sampled with respect to $P_{s_i}(x)$, let $\mathbf{x}^{(k)}$ for $k \in \{1, \cdots, N_s\}$ be i.i.d. sets of configurations. Then the following is an estimator for $\log \langle \mathcal{P} \rangle$:
$$\widetilde{\log \mathcal{P}} = \frac{1}{N_s}\sum_{k=1}^{N_s} \sum_{i=1}^{N} \tilde{w}_i \log \mathcal{P}\!\left(x_i^{(k)}\right), \qquad (57)$$
where $N_S = N \times N_s$ is the total number of samples, divided evenly between each of the $N$ Gauss-Legendre quadrature points.
9 If $f(x)$ is a polynomial of degree at most $2N - 1$, then the Gauss-Legendre quadrature is exact.
As a very simple demonstration of this method, consider a one-dimensional example with the standard normal distribution $P(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and observables $\mathcal{P}_p(x) = e^{p x^2}$ for real $p < \frac{1}{2}$. The expectation values of these observables are given by $\langle \mathcal{P}_p \rangle = 1/\sqrt{1 - 2p}$. To calculate $\log \langle \mathcal{P}_p \rangle$ using Eq. (57) and Gauss-Legendre quadrature, a proposal distribution for each $s_i$ is required. A simple and straightforward choice is to use $P_0(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ for all $s_i$. The data labelled as "basic" in Fig. 11 are generated in this way. For $1/p \gtrsim 3$, this provides an accurate estimate that agrees with the exact value within uncertainties. However, for a fixed sample size per Gauss-Legendre node, the estimates deviate from the exact value as $p$ approaches $\frac{1}{2}$. A possible cause of this is that for $p = \frac{1}{2} - \epsilon$, the probability weight is proportional to $e^{-\epsilon x^2}$ for $s = 1$ and important contributions will be due to $|x| \lesssim 1/\sqrt{\epsilon}$. Therefore, more and more samples will be needed as $\epsilon \to 0$, and the "basic" method suffers from an overlap problem. To improve the algorithm in this simple example, one can instead sample directly from the $P_{s_i}$, which are normal distributions. Results generated in this way are labelled as "improved" in Fig. 11 and are seen to agree perfectly with the exact results for all $p$. Systematic errors due to the finite number of nodes are negligible compared to the statistical errors in both cases.
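A minimal sketch of the "improved" scheme for this Gaussian example, assuming NumPy's `leggauss` routine for the quadrature nodes and weights (the parameter choices are illustrative):

```python
import numpy as np

# "Improved" continuous-reweighting estimate of log<P_p> for the toy example:
# P(x) standard normal, P_p(x) = exp(p x^2), exact log<P_p> = -0.5*log(1-2p).
# Each P_s(x) ∝ exp(-x^2/2) exp(s p x^2) is a normal with variance
# 1/(1 - 2 s p), so it can be sampled directly (the "improved" scheme).
rng = np.random.default_rng(2)

p = 0.4
N = 8            # Gauss-Legendre nodes
Ns = 50_000      # samples per node

z, w = np.polynomial.legendre.leggauss(N)   # nodes/weights on [-1, 1]
s_nodes = 0.5 * (z + 1.0)                   # map nodes to [0, 1]
weights = 0.5 * w                           # rescale weights accordingly

estimate = 0.0
for s, wt in zip(s_nodes, weights):
    sigma = 1.0 / np.sqrt(1.0 - 2.0 * s * p)
    x = rng.normal(0.0, sigma, size=Ns)
    estimate += wt * np.mean(p * x**2)      # w_i * <log P_p>_{s_i}

exact = -0.5 * np.log(1.0 - 2.0 * p)
print(estimate, exact)
```

Here $\langle \log \mathcal{P}_p \rangle_s = p\,\langle x^2 \rangle_s = p/(1 - 2sp)$ is smooth on $[0, 1]$, so a small number of quadrature nodes already renders the quadrature error negligible relative to the statistical noise.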
To test this method on a more realistic system, estimates of log Q(1, 1) for the Gross-Neveu model on the L = 2 and L = 8 lattices with m = −1.5, √ g = 2.0 and N f = 2 are presented in Fig. 12. For L = 2, the exact value is reproduced within uncertainties while for L = 8, results from continuous reweighting agree with those from discrete reweighting.

VII. CONCLUSIONS
Large statistical variance in Monte Carlo sampling severely limits the precision with which many important quantities in quantum field theories can be determined. In this work, quantities that have formally infinite variance under standard sampling schemes have been considered. In the context of fermionic theories, a family of discrete sampling schemes has been presented that surmounts the issue of infinite variance. Nevertheless, the variances in these schemes can be very large (compared to their means squared) and hence sampling may be inefficient. An alternative sampling scheme has also been developed which can be applied to any non-negative random variable that can be sampled with the Monte Carlo method. While the method has been proposed in order to estimate observables with infinite variances, it is likely to be effective for non-negative random variables that have finite but large noise-to-signal ratios. There are potentially interesting connections between the investigations presented here and the large time-separation behaviour of two-point correlation functions for quantities that possess global charges, such as baryons and nuclei in QCD, that will be explored in subsequent work.

Appendix A: Background on the Central Limit Theorem

The standard method of estimating an observable (a random variable) is based on the Central Limit Theorem. We therefore begin by reviewing the background for the Central Limit Theorem. We refer to Ref. [10] for further details.
A probability space is a triplet $(\Omega, F, P)$ where:
• $\Omega$ is the sample space.
• $F$ is the space of events and is required to be a σ-algebra.
• $P : F \to [0, 1]$ is a probability measure.
Every element $\omega$ in the sample space $\Omega$ is called an outcome. An event $A \in F$ is said to occur if $\omega$ is the outcome and $\omega \in A$. $F$ is required to be a σ-algebra, which means that $\Omega \in F$, that $A \in F$ implies $\Omega \setminus A \in F$, and that $F$ is closed under countable unions. The probability measure $P$ is required to satisfy:
• $P(\emptyset) = 0$ and $P(\Omega) = 1$.
• If $\{A_i\}_{i \geq 1} \subset F$ is a countable sequence of pairwise disjoint elements of $F$, it follows that $P(\cup_{i\geq 1} A_i) = \sum_{i\geq 1} P(A_i)$.
It must be noted that in general one cannot choose $F = 2^\Omega$, the set of all subsets of $\Omega$. Therefore, the choice of $F$ is essential, and elements of $F$ are said to be measurable.
A random variable $X : \Omega \to \mathbb{R}$ is a real-valued function on the sample space such that $X^{-1}(A) \in F$ for all $A \in B_{\mathbb{R}}$, where $B_{\mathbb{R}}$ is the Borel σ-algebra on $\mathbb{R}$, the smallest$^{10}$ σ-algebra containing all open subsets of $\mathbb{R}$. An equivalent condition is $X^{-1}((-\infty, a]) \in F$ for all $a \in \mathbb{R}$. This condition allows one to define another probability distribution $P_X$ on the real line through the formula $P_X(A) = P(X^{-1}(A))$. By the celebrated Carathéodory extension theorem, $P_X$ can be extended to $B_{\mathbb{R}}$. If $A \in B_{\mathbb{R}}$, $P_X(A)$ should be interpreted as the probability that $X$ takes a value in $A$. We further define the cumulative distribution function $F_X(t) = P_X((-\infty, t])$, which gives the probability that $X \leq t$.
For a stochastic physical system represented by the probability space $(\Omega, F, P)$, one can consider the space of repeated outcomes, denoted by $\Omega^\infty = \prod_{i=1}^\infty \Omega$, associated with the σ-algebra $F^\infty$, the smallest σ-algebra containing all sets of the form $\prod_{i=1}^\infty A_i$ where only finitely many of the $A_i \in F$ are different from $\Omega$. An element $\omega^\infty \in \Omega^\infty$ is given by $\omega^\infty = \{\omega_1, \omega_2, \cdots\}$ where $\omega_i \in \Omega$ for all $i \in \mathbb{N}^+$. Then, $\Omega^\infty$ can be identified with the set of all samples of the physical system with infinite sample size. By Kolmogorov's extension theorem, there exists a unique probability measure $P^\infty$ on $(\Omega^\infty, F^\infty)$ satisfying $P^\infty(\prod_{i=1}^\infty A_i) = \prod_{i=1}^\infty P(A_i)$ when only finitely many of the $A_i \in F$ are different from $\Omega$. Given a random variable $X$ on $(\Omega, F, P)$, we define the random variable $X_n$ on $(\Omega^\infty, F^\infty, P^\infty)$ by $X_n(\omega^\infty) = X(\omega_n)$.
For our purposes, there are three important types of convergence for random variables. One says that a sequence of random variables $X_n$ converges to a random variable $X$
• almost surely/everywhere$^{11}$ if $P(X_n \to X) = 1$. It is denoted by $X_n \xrightarrow{a.s.} X$.
10 The smallest σ-algebra containing a given set of sets is defined as the intersection of all σ-algebras that contain the given set of sets, which can be shown to be a σ-algebra.
11 To be precise, there is a set $A \in F$ such that for all $\omega \in A$, $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ and $P(A) = 1$. It is possible that the set of all elements $\omega \in \Omega$ satisfying $X_n(\omega) \to X(\omega)$ is not measurable, but this distinction is not relevant for our discussion.
• in probability if for every $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$. It is denoted by $X_n \xrightarrow{p} X$.
• in distribution if $F_{X_n}$ converges pointwise to $F_X$ at every continuity point of $F_X$. It is denoted by $X_n \xrightarrow{d} X$.
These three types of convergence are related as follows: almost sure convergence implies convergence in probability, which in turn implies convergence in distribution.

Strong Law of Large Numbers Let $\{X_n\}$ be a sequence of independent and identically distributed random variables with finite mean $E[X_n] = \mu$, and define the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $\bar{X}_n \xrightarrow{a.s.} \mu$.
An important consequence of the SLLN is the Weak Law of Large Numbers.
Weak Law of Large Numbers Let $\{X_n\}$ be a sequence of independent and identically distributed random variables with finite mean $E[X_n] = \mu$. If one defines the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ for $n \geq 1$, it follows that for any $\epsilon > 0$: $\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$. Although the SLLN says any sequence of sample means will eventually converge to the mean, in practice it does not say anything about how close a sample mean is to the mean for a given sample size $N$. A similar statement also applies to the WLLN. On the other hand, the CLT gives a measure of how close the sample mean is to the mean.
Central Limit Theorem Let $\{X_n\}$ be a sequence of independent and identically distributed random variables with finite mean $E[X_1] = \mu$ and finite variance $\mathrm{Var}[X_1] = \sigma^2$. Then $\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} N(0, \sigma^2)$, where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$.
It follows that for large enough $n$, $\sqrt{n}\left(\bar{X}_n - \mu\right) \approx N(0, \sigma^2)$, and therefore $\bar{X}_n \approx N(\mu, \frac{\sigma^2}{n})$. Although one can derive a similar expression for the estimation of $\sigma^2$ in the case $E[X_1^4] < \infty$, it is usually enough to estimate $\sigma^2$ by the (unbiased) estimator $s_n = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$.
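This approximate normality is easy to verify numerically; the following sketch (our illustration, with an exponential distribution chosen arbitrarily) standardizes sample means and checks that they have mean near 0 and variance near 1:

```python
import numpy as np

# Numerical illustration of the CLT: standardized sample means of an
# exponential distribution (mu = 1, sigma = 1) are approximately N(0, 1).
rng = np.random.default_rng(3)

n, reps = 200, 20_000
means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (means - 1.0)    # should be close to a standard normal

print(z.mean(), z.var())          # near 0 and 1, respectively
```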

Theorems under infinite variance
Two theorems are particularly important for the analysis of random variables with infinite variance.
Theorem 1. Let $\{X_n\}_{n\geq 1}$ be a sequence of independent and identically distributed random variables with finite mean $\mu$ and infinite variance. Then, for any given $L > 0$, the number of the random variables $s_n$ that satisfy $s_n > L$ is infinite almost surely.
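Theorem 1 can be observed numerically. The sketch below (our illustration; the Pareto tail index 1.5 is chosen so the mean is finite but the variance is not) tracks the running sample variance and counts its large upward jumps:

```python
import numpy as np

# Illustration of Theorem 1 with Pareto-distributed samples (tail index 1.5:
# finite mean, infinite variance): the running sample variance never settles
# down but keeps jumping above any fixed level as rare huge draws arrive.
rng = np.random.default_rng(4)

n = 1_000_000
x = rng.pareto(1.5, size=n) + 1.0         # classical Pareto on [1, inf), mean 3

k = np.arange(1, n + 1)
mean_n = np.cumsum(x) / k
var_n = np.cumsum(x**2) / k - mean_n**2   # running (biased) sample variance

# count large upward jumps of the running variance after a burn-in of 1000
jumps = int(np.sum(np.diff(var_n[1000:]) > 10.0))
print(var_n[-1], jumps)
```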
We need a bit of preparation before we can prove Theorem 1.
Lemma 1 (Second Borel-Cantelli). Let $\{E_n\}$ be a sequence of independent events. If $\sum_n P(E_n) = \infty$, then$^{12}$ $P(\limsup E_n) = 1$.
Lemma 2. Let $Z$ be a non-negative random variable with infinite mean. Then, for any given $L > 0$, $\sum_{n=1}^\infty P(\{Z \geq nL\}) = \infty$.
12 Here, $\limsup E_n \equiv \cap_{n\geq 1} \cup_{m\geq n} E_m$ and is equal to the set of all outcomes $\omega$ such that $\omega \in E_k$ for infinitely many $E_k$. Therefore, $P(\limsup E_n)$ can be interpreted as the probability that infinitely many of the events $E_k$ happen.
Proof.
Since $E[Z] = \infty$, it follows that $\sum_{n=1}^\infty n P(\{nL \leq Z < (n+1)L\}) = \infty$. Then:
$$\sum_{n=1}^\infty P(\{Z \geq nL\}) = \sum_{n=1}^\infty \sum_{m=n}^\infty P(\{mL \leq Z < (m+1)L\}) = \sum_{m=1}^\infty m\, P(\{mL \leq Z < (m+1)L\}) = \infty. \qquad (A5)$$
Corollary 1. Let $\{Z_n\}$ be a sequence of independent and identically distributed non-negative random variables with infinite mean. Then, for any given $L > 0$, the number of the random variables $Z_n$ that satisfy $Z_n \geq nL$ is infinite almost surely.
Proof. We define the events E n = {ω : Z n (ω) ≥ nL}. The corollary then follows from Lemma 2 and the second Borel-Cantelli lemma.
Proof of Theorem 1. We first define the random variable $s_n = \frac{1}{n-1}\sum_{i=1}^n (X_i - \mu)^2$ for $n \geq 2$. Next, we are going to show that, almost surely, infinitely many elements of the sequence $\{s_n\}$ satisfy $s_n > L$. Let $T_n = (X_n - \mu)^2$. Then, Corollary 1 applies and there are infinitely many $T_n$ that satisfy $T_n \geq nL$. Let $n_k \geq 2$ for $k \geq 1$ be an increasing sequence that satisfies $T_{n_k} \geq n_k L$. Then for each $k \geq 1$, we have $s_{n_k} \geq \frac{1}{n_k - 1} T_{n_k} > L$. Let $\Omega_{s,L}$ be the set of outcomes such that $\{s_n > L\}$ is satisfied for infinitely many $n$. We have shown that $P(\Omega_{s,L}) = 1$.
Let $\Omega$ be a finite sample space associated with the σ-algebra $F = 2^\Omega$, the set of all subsets of $\Omega$, and a family of probability distributions $P_t : F \to [0, 1]$ for $t \in (0, 1]$. We assume that $P_t$ is continuous in the sense that $P_t(\omega)$ is a continuous function of $t$ for $t \in (0, 1]$ for all $\omega \in \Omega$. We consider a non-negative random variable $X_t$ which is continuous in $t$ in the same sense. We further assume that there is a set $E \subset \Omega$ such that $\lim_{t\to 0} P_t(\omega) = 0$ while $\lim_{t\to 0} P_t(\omega) X_t(\omega) \neq 0$ for all $\omega \in E$.
Theorem 2. Let $\delta, \epsilon > 0$. There is an integer $N(\delta, \epsilon)$ such that for all $N \geq N(\delta, \epsilon)$:
$$\lim_{t\to 0} P_t^N\!\left(\left|\bar{X}_N - (\mu - \Delta)\right| \leq \delta\right) \geq 1 - \epsilon.$$
Proof of Theorem 2. We first define another probability measure on $\Omega$, which we will denote by $P_0$, by $P_0(\omega) = \lim_{t\to 0} P_t(\omega)$ for all $\omega \in \Omega$. We also define $X_0$ similarly: $X_0(\omega) = \lim_{t\to 0} X_t(\omega)$. Effectively, this definition ignores the exceptional configurations. It follows that the expectation value of $X_0$ is $\mu - \Delta$. Now, given $\delta, \epsilon > 0$, by the WLLN there is an integer $N(\delta, \epsilon)$ such that for all $N \geq N(\delta, \epsilon)$:
$$P_0^N\!\left(\left|\bar{X}_N - (\mu - \Delta)\right| \leq \delta\right) \geq 1 - \epsilon.$$
Now we consider $(E^c)^N$, the set of ensembles of sample size $N$ that do not include any exceptional configurations, where $E^c \equiv \Omega \setminus E$. For $P_0$, exceptional configurations can be ignored, and therefore effectively $\Omega^N \equiv (E^c)^N$, so it follows that:
$$P_0^N\!\left((E^c)^N \cap \left\{\left|\bar{X}_N - (\mu - \Delta)\right| \leq \delta\right\}\right) \geq 1 - \epsilon. \qquad (A9)$$
Now we make the following observation. Let $(E^c)^N_c \equiv \Omega^N \setminus (E^c)^N$ be the subset of $\Omega^N$ that includes at least one element from $E$, the set of the exceptional configurations. The probability that $(E^c)^N_c$ occurs is a polynomial in the variables $\{P_t(\omega) : \omega \in E\}$ with vanishing constant term. Since $\lim_{t\to 0} P_t(\omega) = 0$ for all $\omega \in E$, we have:
$$\lim_{t\to 0} P_t^N\!\left((E^c)^N_c\right) = 0. \qquad (A10)$$
Now we complete the proof of Theorem 2 by combining Eqs. (A9) and (A10).

Appendix B: Median of Means
In this section we will prove Eq. (18) for the median of means by modifying the arguments given in Ref. [11] to include correlations between samples. Consider a random variable $X$ with mean $\mu_X$. Given an $\epsilon > 0$, we aim to find a lower bound for the probability that $|\hat{\mu}_{\mathrm{MoM}} - \mu_X| < \epsilon$, where $\hat{\mu}_{\mathrm{MoM}}$ is defined in Sec. III C. If there are $K$ batches of size $B$, for this to happen fewer than $\frac{K}{2}$ of the batch means $\hat{\mu}_i$ must be outside the range $(\mu_X - \epsilon, \mu_X + \epsilon)$. Let us define the indicator random variables $I_i$ for $i = 1, \cdots, K$: $I_i$ is defined to be 1 if $\hat{\mu}_i \in (\mu_X - \epsilon, \mu_X + \epsilon)$ and 0 otherwise. Since the batches are independent, we can use Hoeffding's inequality [10] to bound the resulting probability, where the first indicator function $I_1$ is chosen for convenience. Now we define $\sigma_1$ to be the standard deviation of $\hat{\mu}_1$ and use Chebyshev's inequality [10] to obtain: $P(|\hat{\mu}_1 - \mu_X| \geq \epsilon) \leq \sigma_1^2/\epsilon^2$. By choosing $\epsilon = 2\sigma_1$, we obtain: $P(|\hat{\mu}_1 - \mu_X| \geq 2\sigma_1) \leq \frac{1}{4}$. To estimate $\sigma_1$, we note that $\hat{\mu}_1 = \frac{1}{B}\sum_{n=1}^B X_n$. Then one obtains:
$$\sigma_1^2 = \frac{1}{B^2}\left[\sum_{n=1}^B \mathrm{Var}(X_n) + 2\sum_{m<n} \mathrm{Cov}(X_m, X_n)\right] = \frac{2\sigma^2}{B}\, \tau_{X,\mathrm{int}}(B),$$
where we have defined the autocorrelation function $\Gamma_X(t) \equiv \frac{1}{\sigma^2}\mathrm{Cov}(X_n, X_{n+t})$ and the integrated autocorrelation time $\tau_{X,\mathrm{int}}(B) = \frac{1}{2} + \sum_{t=1}^{B-1}\left(1 - \frac{t}{B}\right)\Gamma_X(t)$ (since the sequence $\{X_n\}$ is assumed to be stationary, $\Gamma_X(t)$ is independent of $n$). Eq. (18) then follows by combining the above inequality with (B4).
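The median-of-means construction itself is simple to sketch. The example below is our own illustration with independent, contaminated data rather than the correlated samples treated in this appendix; it shows the robustness that motivates the estimator:

```python
import numpy as np

# Sketch of a median-of-means estimator (batch means, then their median).
# The contaminated-data setup is illustrative, not from the paper: 20 huge
# outliers ruin the plain sample mean but corrupt only one of the batches.
rng = np.random.default_rng(5)

def median_of_means(x, K):
    """Split x into K batches, average each batch, return the median."""
    return float(np.median([b.mean() for b in np.array_split(x, K)]))

x = rng.normal(3.0, 1.0, size=100_000)
x[:20] += 1.0e6                      # contiguous outliers -> one bad batch

mom = median_of_means(x, K=64)       # robust: stays near the true mean 3
plain = x.mean()                     # ruined: shifted far above 3
print(mom, plain)
```

Since the corrupted batch mean is an extreme order statistic, it cannot move the median as long as fewer than half of the batches are affected.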