Variational Microcanonical Estimator

We propose a variational quantum algorithm for estimating microcanonical expectation values in models obeying the eigenstate thermalization hypothesis. Using a relaxed criterion for convergence of the variational optimization loop, the algorithm generates weakly entangled superpositions of eigenstates at a given target energy density. An ensemble of these variational states is then used to estimate microcanonical averages of local operators, with an error whose dominant contribution decreases initially as a power law in the size of the ensemble and is ultimately limited by a small bias. We apply the algorithm to the one-dimensional mixed-field Ising model, where it converges for ansatz circuits of depth roughly linear in system size. The most accurate thermal estimates are produced for intermediate energy densities. In our error analysis, we find connections with recent works investigating the underpinnings of the eigenstate thermalization hypothesis. In particular, the failure of energy-basis matrix elements of local operators to behave as \textit{independent} random variables is a potential source of error that the algorithm can overcome by averaging over an ensemble of variational states.


I. INTRODUCTION
Calculating the ground state and thermal equilibrium properties of large and complex quantum systems remains a central task in contemporary quantum physics. While for integrable systems analytical techniques can often solve the problem, in generic nonintegrable systems such methods do not apply. In the last two decades however, efficient numerical methods have been developed to calculate ground-state and thermal properties in settings where the target state is only modestly entangled. Tensor network (TN) methods exploit the locality of physical Hamiltonians, in particular their area-law entangled ground states [1], to find efficient representations of the wavefunction via truncated matrix product states on classical hardware [2]. Additionally, these efficient representations can be extended to Gibbs states at finite temperature via matrix product operators [3]. Examples of algorithms based on TNs include the minimally entangled typical thermal state (METTS) algorithm [4] for estimating canonical averages, and an algorithm for estimating microcanonical averages using time evolving block decimation (TEBD) [5]. In higher than one spatial dimension however, the TN contraction step becomes hard [6], so that classical algorithms may not be sufficient for the simulation of even weakly entangled quantum systems.
It has long been believed that quantum computers are the natural platform to simulate quantum systems [7], but to exploit their full power it is likely that deep quantum circuits and error correction will be required. Currently, we have noisy intermediate-scale quantum (NISQ) devices that cannot yet implement deep circuits with high fidelity, but which can still demonstrate the potential for quantum computing in cases where low-depth circuits are sufficient [8]. There is thus a significant need to develop algorithms that can take advantage of these NISQ devices.

* iadecola@iastate.edu

FIG. 1. The VME algorithm. In step (0), the QPU is initialized in a random product state |ψ⁰_r⟩ (r = 1, . . . , R). The VQA repeats steps (1) and (2), which optimize the cost function C(θ) in Eq. (1.1) to "squeeze" the state onto a microcanonical window of size δ, as shown in step (3). Steps (0)-(3) are repeated to produce a pseudo-random ensemble of states |ψ_r⟩ which, for large N and R, can be used to approximate microcanonical averages of local operators A as in step (4), where ρ_R = (1/R) Σ_r |ψ_r⟩⟨ψ_r|.
Originating with the variational quantum eigensolver (VQE) [9], one class of algorithms that can potentially achieve this goal in some cases is the hybrid quantum-classical variational quantum algorithms (VQAs) [10][11][12], which employ a digital quantum computer aided by a classical optimizer. Although generic VQAs suffer from the well-known barren plateau problem [13][14][15], which suggests unscalability in full generality, there is evidence that VQAs can calculate the ground state of certain Hamiltonians using only polynomial quantum resources, e.g., by using the Hamiltonian variational ansatz for the transverse-field Ising model [16]. Recent works have also considered using VQAs to prepare Gibbs states using cost functions such as the relative entropy or relative free energy between the current state and target state [17]; strategies to overcome the costly evaluation of the entropic term have also been proposed [18,19]. Other finite-temperature VQAs prepare thermofield-double (TFD) states, which require doubling the number of qubits in the physical system being simulated; for example, the algorithm of Ref. [20] can prepare the TFD state of free fermions efficiently at any inverse temperature. Alternative quantum algorithms for preparing thermal states include a quantum version of the minimally-entangled typical thermal states algorithm (QMETTS) that involves imaginary time evolution on quantum hardware [21], and an algorithm based on random quantum circuits [22].
In this work, we task a VQA with calculating microcanonical averages of local observables in a one-dimensional (1D) nonintegrable spin model. Our work is partially inspired by analog quantum simulation [23] and classical tensor network [24] algorithms for estimating microcanonical properties. The algorithm takes advantage of the eigenstate thermalization hypothesis (ETH), in particular the "diagonal" ETH, which states that in a nonintegrable model the energy-basis diagonal matrix elements ⟨E|A|E⟩ of an observable A approach a smooth function A(E) in the thermodynamic limit [25,26].
The algorithm, which we call the variational microcanonical estimator (VME), works as follows (see Fig. 1). We initialize the QPU in a random product state [step (0)] |ψ⁰_r⟩, whose energy variance is typically extensive in N (the number of sites) [24]. Given a target energy λ and microcanonical window size δ, a classical optimizer is then tasked with minimizing the cost function [steps (1) and (2)]

C(θ) = ⟨ψ(θ)|(H − λ)²|ψ(θ)⟩,   (1.1)

originally proposed in [9]. However, instead of trying to reach a local or global minimum, we stop the optimization as soon as Var(H) = ⟨(H − ⟨H⟩)²⟩ ≤ δ². This produces states whose energy support is roughly limited to the microcanonical window of interest [step (3)], and the resulting variational states |ψ_r⟩ are then used to compute the expectation of a local observable A by averaging ⟨ψ_r|A|ψ_r⟩ over R variational states [step (4)]. The ensemble average in step (4) enables a parametric reduction in the error and is essential to the algorithm's performance.
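The workflow above can be sketched end to end in a small exact-diagonalization toy. In the sketch below (all names and parameter choices are our own illustration, not the authors' code), the variational optimizer of steps (1)-(2) is replaced by an exact Gaussian filter e^(−t(H−λ)²) whose strength t is increased until the same stopping criterion Var(H) ≤ δ² is met; steps (0), (3), and (4) are as described in the text.

```python
import numpy as np

rng = np.random.default_rng(7)

# Pauli matrices and a dense mixed-field Ising builder (Sec. IV parameters); small N only
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def site_op(P, j, N):
    out = np.array([[1.0]])
    for k in range(N):
        out = np.kron(out, P if k == j else np.eye(2))
    return out

N = 8
H = sum(1.0 * site_op(Z, j, N) @ site_op(Z, (j + 1) % N, N)
        - 1.05 * site_op(X, j, N) + 0.5 * site_op(Z, j, N) for j in range(N))
E, V = np.linalg.eigh(H)

lam = -0.2 * N               # target energy (intermediate energy density)
delta = 3.0 / np.sqrt(N)     # window size, using bandwidth/N ~ 3 and alpha = -1/2

def squeezed_state():
    # step (0): random real product state
    psi = np.array([1.0])
    for phi in rng.uniform(0.0, np.pi, N):
        psi = np.kron(psi, [np.cos(phi), np.sin(phi)])
    c = V.T @ psi            # energy-basis coefficients (real)
    # steps (1)-(3): stand-in for the variational loop -- a Gaussian filter
    # exp(-t (H - lam)^2), strengthened until Var(H) <= delta^2
    t, w = 0.0, c
    while True:
        mu = np.sum(w**2 * E)
        if np.sum(w**2 * (E - mu) ** 2) <= delta**2:
            return w
        t += 0.05
        w = c * np.exp(-t * (E - lam) ** 2)
        w /= np.linalg.norm(w)

# step (4): ensemble-average a local observable (X on the middle site)
A_eb = V.T @ site_op(X, N // 2, N) @ V     # energy-basis matrix elements
R = 24
est = np.mean([w @ A_eb @ w for w in (squeezed_state() for _ in range(R))])

# Gaussian "broadened" microcanonical reference for comparison
g = np.exp(-((E - lam) ** 2) / (2 * delta**2))
mc = np.sum(g * np.diag(A_eb)) / np.sum(g)
print(abs(est - mc))
```

Replacing the exact filter with a parametrized circuit and a gradient-based optimizer recovers the actual hybrid algorithm; the stopping rule and the ensemble average in step (4) are unchanged.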
We benchmark the VME algorithm on a nonintegrable Ising chain by comparing its estimates for local observables to corresponding Gaussian microcanonical ensemble predictions obtained from exact diagonalization (ED). Using numerical evidence in combination with the phenomenology of ETH, we conjecture that for local operators A and target energies λ in the bulk of the spectrum, the absolute error in the VME algorithm scales as

ϵ_R ≲ c + σ/√R + O(δ/N) + O(D^{-1/2}(λ)).

Here, D(λ) is the density of states at the target energy λ, δ is the microcanonical window width, N is the system size, σ = O(1) sets the state-to-state fluctuations of the variational estimates, and c ≪ 1 is a small empirical constant whose magnitude depends on A and other problem parameters. The last two terms in this formula are predicted by ETH; for the first two terms we give a phenomenological argument that we substantiate with numerical evidence.
We then generalize the problem to the reduced state of small subsystems of the chain and find numerically that when choosing R = O(N²) and for certain λ, the VME appears to approach the corresponding microcanonical state in the thermodynamic limit. The states prepared by the VME are consistent with area-law entanglement for a fixed N, and require roughly linearly deep quantum circuits to prepare. We find that every random initial product state is able to converge, which we attribute to the fact that the algorithm does not seek global minima of the cost function. An additional distinction from other current VQAs for preparing mixed states is that we prepare pure states one at a time, thus avoiding storage of a large ensemble of pure quantum states in a quantum memory. The smallness of the trace distance when choosing R = O(N²) implies that the microcanonical ensemble, which (via ETH) involves at least one highly entangled (i.e., volume-law) eigenstate, is approximately indistinguishable by local operators from a polynomially large ensemble of weakly entangled variational states.
The paper is organized as follows. In Sec. II, we introduce (i) the statement of ETH and (ii) a class of states which might be called microcanonical superposition states, into which our converged variational states fall. We then review related works attempting to use these states to estimate thermal averages and the relationship of this problem to ETH. In Sec. III we discuss how averaging over an ensemble of these microcanonical superposition states could significantly improve how well they can estimate microcanonical averages, and then we detail the VME algorithm which can produce these states. Finally, in Sec. IV we present the numerical results for the form of the variational ensemble, the error in the algorithm for various local operators, the observable-independent trace distance, and finally the quantum resources like circuit depth and entanglement.
FIG. 2. The energy-basis diagonal matrix elements ⟨E|A|E⟩ of various local observables A acting in the middle of the chain, plotted against energy density in the nonintegrable 1D mixed-field Ising model, Eq. (4.1) with parameters J = 1, hx = −1.05, and hz = −0.5. Lighter blue colored points are for system size N = 9 and darker blue points are for N = 13. Orange curves are coarse grained versions of the N = 13 scatter plots which define the "smooth" function A(E) in the thermodynamic limit.

A. Eigenstate Thermalization Hypothesis
Here we review relevant aspects of the ETH and some recent works which attempt to exploit it to estimate thermal averages. We assume a nonintegrable (i.e. chaotic) Hamiltonian H which has a non-degenerate energy spectrum, so that its eigenstates |E⟩ are uniquely labeled by their energies E. Furthermore, we will assume that all operators and states of interest are real in the energy basis for simplicity. The variant of ETH we consider was formulated in Ref. [27] and proposes that in a quantum chaotic system, the energy-basis matrix elements of observables have the form

⟨E|A|E′⟩ = A(Ē) δ_{EE′} + D^{-1/2}(Ē) f(Ē, ω) R_{EE′},   (2.1)

where Ē = (E + E′)/2, ω = E − E′, D(Ē) is the density of states at energy Ē, A(Ē) and f(Ē, ω) approach smooth functions in the thermodynamic limit, and R_{EE′} are order-one fluctuations. Examples of such functions A(Ē) are shown in Fig. 2, which demonstrates this for local spin operators in the 1D mixed-field Ising model (defined in Sec. IV).
The ansatz (2.1) captures several features of such matrix elements that have been observed in numerical studies. Firstly, because the density of states is exponentially large in system size, the off-diagonal matrix elements are exponentially small. Secondly, the smooth function A(E) is related to the statistical mechanical prediction for ⟨A⟩ at average energy E; this function will play a central role in our algorithm. Finally, the function f (Ē, ω) controls the approach to thermal equilibrium and is related to other spectral properties of the observable [28]; this function figures less prominently in our analysis.
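The D^{-1/2} suppression of the off-diagonal elements can be seen directly in a random-matrix toy, where the eigenbasis of a GOE "Hamiltonian" plays the role of the chaotic energy basis (this is our own illustration, not the model of Sec. IV):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (128, 256, 512):
    # GOE Hamiltonian: its eigenvectors are pseudorandom orthogonal directions
    M = rng.normal(size=(D, D))
    Hm = (M + M.T) / np.sqrt(2.0 * D)
    _, V = np.linalg.eigh(Hm)
    A = np.diag(rng.choice([-1.0, 1.0], size=D))   # "observable" with tr(A^2) = D
    A_eb = V.T @ A @ V                              # energy-basis matrix elements
    off = A_eb[~np.eye(D, dtype=bool)]              # strip the diagonal
    print(D, off.std() * np.sqrt(D))                # ~1: off-diagonals are O(D^{-1/2})
```

The rescaled spread stays near unity as D grows, i.e. the raw off-diagonal elements shrink as D^{-1/2}, mirroring the second term of the ansatz (2.1).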
To see how A(E) is related to a thermal average, consider for example a broadened microcanonical ensemble ρ_{λ,δ} centered on energy E = λ and of width O(δ), which we will define more precisely at the beginning of Sec. IV. Under certain assumptions about the density of states of the model and away from λ = 0 (which corresponds to infinite temperature), and assuming the ETH ansatz (2.1), we have in the thermodynamic limit that (see Appendix C for details)

A(λ) ≈ ⟨A⟩_mc,   (2.2)

where ⟨A⟩_mc = tr(ρ_{λ,δ} A). The ETH thus suggests that, if one could prepare even a single eigenstate |λ⟩ of the Hamiltonian with energy λ, then one could accurately estimate thermal averages in sufficiently large systems. However, for a nonintegrable Hamiltonian, a generic excited eigenstate is volume-law entangled, and thus cannot efficiently be prepared by classical algorithms nor by VQE-type algorithms [29]. Thus, this feature of ETH does not appear practically useful, except perhaps in the case of an error-corrected quantum computer.

B. Microcanonical Superpositions
An alternative approach to using exact eigenstates for computing thermal averages is using pure states of the form

|ψ⟩ = Σ_E c_E |E⟩,   (2.3)

where either the c_E are exactly zero outside the energy window defined by |E − λ| ≤ δ, or the state satisfies the weaker condition that ⟨ψ|(H − λ)²|ψ⟩ = O(δ²). We refer to states of this type as "microcanonical superposition states"; they have been studied in the context of thermal pure quantum (TPQ) states [30], the foundations of quantum statistical mechanics [31,32], algorithms for analog quantum simulators [23], and tensor network algorithms [24]. The practical reason for considering these states is that they appear to be significantly less entangled than exact eigenstates. In fact, there exist MPS-based numerical constructions of them such that the maximum entanglement entropy across any cut scales as k/δ + log₂ √N for some constant k [24], with N being the system size. Thus, by choosing δ = O(1/log₂ N), such states can have only O(log₂ N) entanglement, whereas a single excited eigenstate of a nonintegrable system is expected to have O(N) entanglement. In this work, by choosing δ = O(N^{-1/2}) (for the values of N studied in this paper, N^{-1/2} ≈ 1/log₂ N) we find that a VQA can generate these states using roughly linear circuit depth, and that the states have area-law entanglement for fixed N. In Ref. [24] and in our findings it is clear that generically a smaller δ requires more computational effort.
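As a minimal illustration of the two defining conditions, the snippet below builds a hard-window state of the form (2.3) on a synthetic spectrum (our own illustration) and checks that it automatically satisfies the weaker variance criterion ⟨ψ|(H − λ)²|ψ⟩ ≤ δ².

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic spectrum standing in for the many-body levels E
E = np.sort(rng.normal(0.0, 3.0, size=500))
lam, delta = 0.5, 0.3

# hard-window coefficients: c_E = 0 outside |E - lam| <= delta
c = np.zeros_like(E)
inside = np.abs(E - lam) <= delta
c[inside] = rng.normal(size=inside.sum())
c /= np.linalg.norm(c)

# the second moment about lam is bounded by delta^2 term by term,
# so the weaker criterion holds automatically
moment = np.sum(c**2 * (E - lam) ** 2)
print(inside.sum(), moment, delta**2)
```

The converse is not true: a state obeying only the variance condition may have small tails outside the window, which is the case for the variational states prepared later.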
It is known that if the coefficients c E are generic, and δ is sub-extensive in N , then when a state of the form (2.3) is evolved under H, it approaches a state in which small subsystems are approximately thermal [27,33]. Given this fact, one may wonder if a relation like (2.2) holds with A(λ) replaced by ⟨ψ|A|ψ⟩, just as it did for |λ⟩.
A key issue, however, is that although the off-diagonal elements of a generic operator are exponentially small, the quantity

⟨ψ|A|ψ⟩ = Σ_E |c_E|² ⟨E|A|E⟩ + Σ_{E≠E′} c_E c_{E′} ⟨E|A|E′⟩   (2.4)

involves summing exponentially many off-diagonal matrix elements, so long as δ itself is not exponentially small [23]. The reason that the long-time evolved state is locally thermal is that the off-diagonal terms become dephased just enough to counteract the exponentially large sum [33]. As a result, the off-diagonal contribution scales as O(D^{-1/2}(λ)), like the off-diagonal matrix elements themselves. Without the additional pseudorandom phases e^{i(E−E′)t} appearing in the time-evolved expectation value, for arbitrary δ there is no a priori reason to expect ⟨ψ|A|ψ⟩ to closely approximate ⟨A⟩_mc.
On the other hand, in discussions of ETH the off-diagonal fluctuations R_{EE′} are usually stated to behave as random variables [28]. If we take them to be actual independent random variables, then the total off-diagonal contribution scales as a random walk and will therefore remain typically of the order O(D^{-1/2}(λ)), as shown in Appendix B. However, the validity of such an independence assumption on R_{EE′} is known to depend on the energy scale δ. Sometimes this scale is quoted as δ = O(N^{-2}), which is the scale of |ω| below which |f(0, ω)| reaches a plateau, so that the ETH ansatz (at infinite temperature) becomes structureless and reduces to the random-matrix prediction [28]. More recently, however, Refs. [34] and [35] found numerical and analytical evidence that "true" random matrix behavior with effectively independent matrix elements emerges only on the parametrically smaller scale δ = O(N^{-3}).
Regardless of whether the matrix elements can be treated as independent random variables, it is in general an open question how δ must scale in order for ⟨ψ|A|ψ⟩ to converge to the thermal value in the thermodynamic limit. Ref. [24] argued that δ = O(1/log₂ N) is sufficient for a slow convergence, but Ref. [23] argued that O(N^{-1}) is needed. Finally, in Ref. [36] it is proposed that in a quantum chaotic system, for a fixed operator A of interest, every state of the form (2.3) is thermal with a worst-case error x obeying the relation δ(x) = poly(x). However, for δ much larger than the random matrix theory scale O(N^{-2}) defined above [28], the behavior (and in particular the N scaling) of this polynomial was not completely settled in that work.
In summary, some source of randomness is needed to make the off-diagonal contribution small. In the theory of canonical typicality [31,32], it is the state coefficients; under time evolution it is the effectively random phases; if the window δ is small enough, it is the matrix elements themselves that are effectively random. In this work the source of randomness arises from averaging over an ensemble of variational states |ψ r ⟩ that are prepared by starting from random product states |ψ 0 r ⟩. Since we will ultimately use shallow quantum circuits to prepare these variational states, we do not expect them to be typical states on the target microcanonical subspace, nor do we expect them to be typical states in the sense of TPQ states, as in both cases it is likely that deep circuits would be needed to approximate Haar random states [37,38]. On the other hand, we will see that the states are "random enough" for a certain dephasing mechanism to significantly reduce the off-diagonal contribution in the ensemble averaged version of Eq. (2.4). Thus we will henceforth refer to the states as pseudorandom, reserving "random" for Haar-random states.

III. VARIATIONAL MICROCANONICAL ESTIMATOR
A. Ensemble of microcanonical superpositions and error analysis

As discussed in Sec. II, we do not necessarily expect a microcanonical superposition state of the form in Eq. (2.3) to closely approximate thermal values when the microcanonical window size δ is too large. Here we adapt known results from the theory of ETH to our problem, and then propose a heuristic mechanism which allows us to capture thermal expectation values using a large pseudo-random ensemble of microcanonical superposition states with "large" (but still subextensive) δ.
For a fixed target energy λ and microcanonical window δ, consider an ensemble of states {|ψ_r⟩}_{r=1}^R, each of the form (2.3), and let ρ_R = (1/R) Σ_r |ψ_r⟩⟨ψ_r| be their equal-weight mixture. We discuss how to prepare these states using a variational algorithm in Sec. III B. In this section, we discuss the error between the actual microcanonical expectation value tr(ρ_mc A) and its average in the ensemble ρ_R. Considering a local operator A, this error can be expressed as

ϵ_R = |tr(ρ_mc A) − tr(ρ_R A)|.   (3.1)

Here, we have introduced a microcanonical density matrix ρ_mc. The microcanonical ensemble could be defined via the standard sharp microcanonical window, or via a smoothed Gaussian version thereof; we will ultimately compare our variational estimates to the latter in Sec. IV. The results of this section do not depend on the exact form, but for now let us take ρ_mc = P_W/n, with P_W a projector onto the microcanonical window W defined by |E − λ| ≤ δ and n = tr P_W the number of states in the window. For our purposes, the error (3.1) is best understood as a sum of two distinct types, and we therefore decompose it via the triangle inequality as

ϵ_R ≤ ϵ_R^diag + ϵ_R^off,   (3.2a)
ϵ_R^diag = |tr(ρ_mc A) − Σ_E ⟨E|ρ_R|E⟩ ⟨E|A|E⟩|,   (3.2b)
ϵ_R^off = |tr(ρ_R A) − Σ_E ⟨E|ρ_R|E⟩ ⟨E|A|E⟩|,   (3.2c)
Ã = A − Σ_E |E⟩ ⟨E|A|E⟩ ⟨E|.   (3.2d)

The "diagonal error" ϵ_R^diag captures the difference between the expectation value of A in the microcanonical ensemble and the diagonal ensemble [33] associated with ρ_R. It depends only on diagonal energy-basis matrix elements of A and ρ_R. The "off-diagonal error" ϵ_R^off captures error due to the fact that ρ_R is not diagonal in the energy basis. Plugging Eq. (3.2d) into Eq. (3.2c) yields

ϵ_R^off = |Σ_{E≠E′} ⟨E′|ρ_R|E⟩ ⟨E|A|E′⟩|,   (3.3)

which involves only off-diagonal matrix elements of A and ρ_R. We now consider the dependence of the error ϵ_R on the ensemble size R, window size δ, and system size N, assuming the ETH matrix-element ansatz (2.1).
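The decomposition of the error (3.1) into diagonal and off-diagonal pieces is straightforward to verify numerically. The toy below (random matrices standing in for A and ρ_R in the energy basis; our own illustration) computes both pieces and checks the triangle inequality.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60                                   # number of states in the toy window W

# a random real "observable" written in the energy basis
M = rng.normal(size=(n, n))
A = (M + M.T) / 2.0

# a toy ensemble rho_R built from a few random pure states
psis = rng.normal(size=(8, n))
psis /= np.linalg.norm(psis, axis=1, keepdims=True)
rhoR = psis.T @ psis / 8.0

p_mc = np.full(n, 1.0 / n)               # sharp microcanonical ensemble P_W / n

eps = abs(p_mc @ np.diag(A) - np.trace(rhoR @ A))      # total error
eps_diag = abs((p_mc - np.diag(rhoR)) @ np.diag(A))    # diagonal error
Atil = A - np.diag(np.diag(A))                         # off-diagonal part of A
eps_off = abs(np.trace(rhoR @ Atil))                   # off-diagonal error
print(eps, eps_diag, eps_off)
```

By construction eps ≤ eps_diag + eps_off; the two pieces probe entirely different matrix elements of A, which is why the text treats them separately.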

1. Diagonal error ϵ_R^diag
We begin with the diagonal contribution ϵ_R^diag. Plugging the ETH ansatz into the expression for the diagonal error gives [39]

ϵ_R^diag ≤ |tr[(ρ_mc − ρ_R^diag) A(H)]| + O(D^{-1/2}(λ)),   (3.4)

where ρ_R^diag = Σ_E ⟨E|ρ_R|E⟩ |E⟩⟨E| is the diagonal ensemble associated with ρ_R, the O(D^{-1/2}(λ)) term collects the fluctuation contribution since both density matrices have unit trace, and we have made the first term more compact by writing A(H) = Σ_E A(E) |E⟩⟨E|. The density of states is evaluated at λ since this is a typical energy in W. Now we expand the smooth ETH function A(E) near the target energy λ:

A(E) = A(λ) + (E − λ) A′(λ) + . . . .   (3.5)

Repeated use of the triangle inequality yields

ϵ_R^diag ≤ |A(λ)| |tr(ρ_mc − ρ_R^diag)| + |A′(λ)| |tr[(ρ_mc − ρ_R^diag)(H − λ)]| + . . . ,   (3.6)

where the ellipsis signifies higher-order derivatives. Both density matrices have unit trace, so the first term vanishes. But then, using the fact that the k-th derivative of A(E) with respect to E is proportional to N^{-k}, which follows from A(E) = a(E/N) with a(x) becoming N-independent in the thermodynamic limit, we see that

ϵ_R^diag ≲ |a′(λ/N)| χ_R / N,   (3.7)

where

χ_R = |tr[(ρ_mc − ρ_R^diag)(H − λ)]|   (3.8)

and a′(x) = da/dx. If both ρ_R and ρ_mc have support only on the microcanonical window W, then χ_R ≤ 2δ. If instead they have support on a larger energy interval which is still O(δ), then χ_R is still O(δ) and we can still make the rough estimate

ϵ_R^diag = O(δ/N).   (3.9)

In a very large system, we would not concern ourselves with the difference between Eq. (3.9) and the more accurate Eq. (3.7). However, at the relatively small system sizes we consider in this work, we can expect that for large R, χ_R, and therefore ϵ_R^diag, will be smaller than the bound 2δ|a′(λ/N)|/N if we compare the variational estimate ⟨A⟩_R^diag to the expectation value of A in a microcanonical ensemble that is similar to the diagonal variational ensemble. In Sec. IV we numerically confirm this for a Gaussian microcanonical ensemble. It is interesting to note that any subextensive choice of δ will lead to vanishing diagonal error in the thermodynamic limit; this is simply a manifestation of statistical-mechanical ensemble equivalence from the perspective of ETH.

2. Off-diagonal error ϵ_R^off
We now turn to the off-diagonal error ϵ_R^off. First, we observe that if the states |ψ_r⟩ have exactly zero energy weight outside the microcanonical window W, the off-diagonal error (3.2c) can be expressed as

ϵ_R^off = |tr(ρ_R Ã)|,   (3.10)

with Ã understood as the off-diagonal part of A restricted to the window W. If the variational states have some nonzero weight outside W, we can expect that the error is still approximately expressible this way, by slightly expanding W. In Sec. IV, we suitably modify the expression ϵ_R^off = |tr(ρ_R Ã)| in this case. Since we are interested in understanding how averaging over an R-state ensemble can reduce this error, we write the average explicitly as

ϵ_R^off = |(1/R) Σ_{r=1}^R ⟨ψ_r|Ã|ψ_r⟩|.   (3.11)

Following Refs. [35,36], let x_r = ⟨ψ_r|Ã|ψ_r⟩. For each r ∈ {1, 2, . . . , R} we have that

λ_min(Ã) ≤ x_r ≤ λ_max(Ã),   (3.12)

where the limits are the minimum and maximum eigenvalues of the (purely off-diagonal) operator Ã. We find numerically in Appendix E (Fig. 13) that the maximum/minimum eigenvalues of various operators are always above/below zero, respectively, which is expected since Ã is traceless by construction. If the ensemble of microcanonical superposition states is "random enough," we can expect that for each r, x_r fluctuates between these limits according to some distribution. Should the algorithm perform ideally, the states |ψ_r⟩ would sample Ã in an unbiased way; i.e., in the limit of infinite samples, ϵ_R^off → tr(Ã)/n = 0. This would happen, for example, if the |ψ_r⟩ were drawn from the Haar measure on the microcanonical subspace. However, let us allow for biased sampling by modelling the x_r as independent and identically distributed (IID) random variables with covariance E(x_r x_s) − E(x_r)E(x_s) = σ² δ_{rs} and mean E(x_r) = c, which would ideally be zero.
With the x_r modelled this way, a simple statistical measure of the size of the off-diagonal error is its mean-square value, which has the functional form

E[(ϵ_R^off)²] = c² + σ²/R.   (3.13)

For large R, a good approximation to the actual variance σ² should be given by the finite-size estimate

σ_R² = (1/R) Σ_{r=1}^R (x_r − x̄)²,   x̄ = (1/R) Σ_{r=1}^R x_r,   (3.14)

which is bounded from above by the largest square singular value of Ã, which is in turn shown in Appendix E to scale down weakly with N. Under such assumptions, the off-diagonal error in the VME algorithm should thus behave typically as

ϵ_R^off ≈ (c² + σ²/R)^{1/2} ≤ |c| + σ/√R,   (3.15)

where we have assumed that σ ≈ σ_R = O(1). In Sec. IV A we demonstrate that the x_r are indeed well modelled by this phenomenological description, with some small nonzero |c| always present. This implies that the variational states indeed do not sample the microcanonical Hilbert space perfectly uniformly, which is consistent with them also not being close to Haar-random states.
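The IID model predicts a mean-square error c² + σ²/R; a quick Monte Carlo check (with arbitrary illustrative values of c and σ) reproduces both the 1/R decay and the bias floor.

```python
import numpy as np

rng = np.random.default_rng(3)
c, sigma = 0.02, 0.5          # bias and spread of the single-state estimates x_r

for R in (10, 100, 1000):
    # many independent R-state ensembles; each row averages R draws of x_r
    means = rng.normal(c, sigma, size=(5000, R)).mean(axis=1)
    ms = np.mean(means**2)                 # empirical mean-square off-diagonal error
    pred = c**2 + sigma**2 / R             # the functional form above
    print(R, round(ms, 6), round(pred, 6))
```

For small R the σ²/R term dominates; once R ≳ (σ/c)², the error saturates at the bias c², which is the "small bias" floor referred to in the abstract.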

Summary
In summary, the basic idea behind our algorithm is as follows. To prepare an approximation to ρ mc , instead of preparing an ensemble of one or more eigenstates |E⟩, which each individually require exponential quantum resources to generate, we will prepare a polynomially large number R of states |ψ r ⟩ which each hopefully require only polynomial quantum resources to generate, such that ρ R ≈ ρ mc as measured by local observables.
The error in the approximation can be understood in terms of two pieces. The first is the diagonal error, which is ultimately about statistical-mechanical ensemble equivalence as it manifests for an isolated quantum system via the diagonal part of the ETH ansatz (2.1). The leading contribution to this type of error scales as O(δ/N), so any sub-extensive window width δ will in principle work. The more prohibitive error is the off-diagonal error, which, for a single variational state, the ETH alone cannot guarantee to be small at the scales of δ and N practically accessible to the VME algorithm discussed in Sec. III B. To remedy this, we propose to insert randomness by averaging over variational states which have been prepared by initializing the VQA with random product states. We have so far focused on a particular observable A and discussed the error in the context of its matrix elements. The claim that ρ_R ≈ ρ_mc as measured by local observables can be made more precise by introducing the trace distance between certain reduced density matrices, which we examine numerically in Sec. IV.
B. VME algorithm

Algorithm 1: prepare |ψ_r(θ*)⟩
  Data: Hamiltonian H, target energy λ, tolerance δ, and random product state |ψ⁰_r⟩
  Result: converged variational state |ψ_r⟩

We now describe the VME algorithm for preparing the microcanonical superposition states |ψ_r⟩ discussed in the previous section. At the beginning we fix a target energy λ and a microcanonical window size

δ = (∆E/N) N^α,   (3.16)

where ∆E is the full energy bandwidth of the Hamiltonian H. In the mixed-field Ising model we later consider, we find ∆E/N ≈ 3 independent of N. We focus our numerical studies mainly on the case α = −1/2. We initialize the QPU (see Fig. 1) in a random product state

|ψ⁰_r⟩ = ⊗_{j=1}^N |φ_r^j⟩,   (3.17)

with |φ_r^j⟩ = cos(φ_r^j)|0⟩ + sin(φ_r^j)|1⟩ and φ_r^j drawn from the uniform distribution on [0, π), which we can expect to have extensive energy variance [24,33]. We then minimize the "folded-spectrum" cost function [9]

C(θ) = ⟨ψ(θ)|(H − λ)²|ψ(θ)⟩   (3.18)

until Var(H) = ⟨(H − ⟨H⟩)²⟩ ≤ δ², obtaining the converged variational state |ψ_r⟩. Note that C(θ) penalizes both large energy variance and deviation of the average energy from the target energy, since

C(θ) = Var(H) + ⟨H − λ⟩²,   (3.19)

but the convergence criterion only concerns Var(H). We find in practice that ⟨H − λ⟩² is comparatively small when N is large, so it is also possible to think of C(θ) ≲ δ² as the convergence criterion. The optimization is repeated R times for different initial random product states to generate the variational ensemble. Notice that this cost function C(θ) is zero if and only if |ψ(θ)⟩ = |λ⟩, the eigenstate with energy λ [40]. Unlike previous explorations [12,29,41] with this cost function, we do not seek local or global minima, since we minimize the cost function only until Var(H) ≤ δ², which is a constraint not on the gradient but on the value of the cost function itself. Furthermore, the unique global minimum is a state completely different from the one we target.
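The identity C(θ) = Var(H) + ⟨H − λ⟩² used above is easy to verify numerically; the toy check below uses a dense random real symmetric matrix as a stand-in Hamiltonian and a random vector as the variational state (our own illustration).

```python
import numpy as np

rng = np.random.default_rng(4)
D = 32
M = rng.normal(size=(D, D))
H = (M + M.T) / 2.0                       # toy real symmetric Hamiltonian
psi = rng.normal(size=D)
psi /= np.linalg.norm(psi)                # toy normalized variational state
lam = 0.3

shifted = H - lam * np.eye(D)
cost = psi @ shifted @ shifted @ psi      # C = <(H - lam)^2>
mean = psi @ H @ psi                      # <H>
var = psi @ H @ H @ psi - mean**2         # Var(H)
print(cost, var + (mean - lam) ** 2)
```

Expanding the square, ⟨(H − λ)²⟩ = ⟨H²⟩ − 2λ⟨H⟩ + λ² = Var(H) + (⟨H⟩ − λ)², which is why stopping at C(θ) ≲ δ² nearly coincides with the variance criterion once ⟨H⟩ is close to λ.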
For simplicity we restrict our variational states to be real in the computational basis (CB). Because we are interested in the minimal circuit depth needed to prepare the variational states, we employ a periodic structure ansatz (PSA) [14] circuit for which the number of "layers" will be adaptively chosen by the algorithm. The PSA with p layers is defined as

U(θ) = V(θ_p) · · · V(θ_2) V(θ_1),   (3.20)

where each layer V(θ_l) (see Fig. 3) consists of a "brickwall" pattern of two-qubit rotations generated by Y_j Z_{j+1} followed by single-qubit rotations generated by Y_j, θ stands for the 2Np real parameters {θ_j^l, ϕ_j^l, φ_j^l}_{jl}, and Y_j, Z_j are Pauli operators acting on qubit j. The brickwall form of this ansatz breaks down for odd N, so in this case we add an additional gate to entangle the ends of the chain; that is, for odd N, the even-sublattice product of two-qubit gates is supplemented by a gate acting on qubits N and 1. The operators appearing in the single-layer unitary V(θ_l) are chosen based on the findings of Ref. [42]. There it is argued that the pool of 2N − 2 operators P = {iY_j Z_{j+1}}_{j=1}^{N−1} ∪ {iY_j}_{j=1}^{N−1} is "complete" in the sense that for any state |ψ⟩, the set of states {A_k|ψ⟩}_k forms a complete basis, where the A_k are nested commutators of operators in P, i.e., elements of the dynamical Lie algebra [43] of P. We have added the extra gates Y_N and (for odd N) Y_1 Z_N to the pool, but clearly the enlarged pool is still complete in the above sense.
Since our convergence criterion is based on the value of the cost function, while the native convergence criterion of a gradient-based optimizer is based on the size of the gradient, we "wrap" the optimizer in a simple loop (see Algorithm 1) in which we repeatedly interrupt the optimizer to check whether the convergence criterion is satisfied. We accomplish this by having the optimizer minimize only until the gradient norm falls below a relatively large value ϵ, which starts at 10^1 and can only be decreased down to 10^{-3}. If the algorithm cannot achieve convergence using a p-layer ansatz by decreasing ϵ to 10^{-3}, it adds another layer, p → p + 1, and repeats the procedure.
For the classical optimizer we employ the gradient-based Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer. We therefore take advantage of the "parameter-shift rule" [44] for computing analytic gradients of the cost function. If the generators are Pauli strings (hence squaring to I), then

∂C/∂θ_j = C(θ + (π/4) e_j) − C(θ − (π/4) e_j),

with e_j a unit vector in the j-th direction. Thus, during the optimization both the cost function and its derivatives can be measured using the QPU.
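For a gate generated by a Pauli operator P (so e^{−iθP} with P² = I), the parameter-shift rule above can be checked against a finite difference on a one-qubit toy cost function (our own illustration, not the paper's circuit):

```python
import numpy as np

Y = np.array([[0.0, -1.0j], [1.0j, 0.0]])   # Pauli generator, squares to identity
Hc = np.diag([1.0, -1.0])                   # stand-in cost Hamiltonian

def cost(theta):
    # gate exp(-i theta Y) acting on |0>, then measure Hc
    U = np.cos(theta) * np.eye(2) - 1.0j * np.sin(theta) * Y
    psi = U @ np.array([1.0, 0.0])
    return float(np.real(psi.conj() @ Hc @ psi))

th = 0.37
shift = cost(th + np.pi / 4) - cost(th - np.pi / 4)    # parameter-shift gradient
fd = (cost(th + 1e-6) - cost(th - 1e-6)) / 2e-6        # central finite difference
print(shift, fd)
```

Because the cost is a trigonometric polynomial of degree two in θ for such generators, the π/4 shift gives the derivative exactly (up to floating-point error), not merely approximately.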

IV. NUMERICAL RESULTS
We test the VME algorithm on the 1D mixed-field Ising model (MFIM) with Hamiltonian

H = Σ_{j=1}^N (J Z_j Z_{j+1} + h_{x,j} X_j + h_z Z_j)   (4.1)

and periodic boundary conditions, so that site N + 1 refers to site 1. To ensure there are no accidental degeneracies, we consider weakly nonuniform transverse fields h_{x,j} = h_x + r_j, where the r_j ∈ [−0.01, 0.01] are drawn randomly from the uniform distribution. We use a single fixed configuration of the transverse fields for our numerics. We fix parameters J = 1, h_z = 0.5, and h_x = −1.05 such that the system is strongly nonintegrable [45]. In this section we discuss the performance of the VME with respect to this particular model.
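As a quick sanity check of the quoted bandwidth ∆E/N ≈ 3, one can diagonalize Eq. (4.1) at small N. The script below (dense exact diagonalization; our own illustrative code) builds the MFIM with the stated parameters and weak transverse-field disorder.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 10
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def site_op(P, j):
    # embed single-site operator P at site j of the N-site chain
    out = np.array([[1.0]])
    for k in range(N):
        out = np.kron(out, P if k == j else np.eye(2))
    return out

J, hx, hz = 1.0, -1.05, 0.5
hxj = hx + rng.uniform(-0.01, 0.01, size=N)   # weakly nonuniform transverse field
H = sum(J * site_op(Z, j) @ site_op(Z, (j + 1) % N)
        + hxj[j] * site_op(X, j) + hz * site_op(Z, j) for j in range(N))

E = np.linalg.eigvalsh(H)
print("bandwidth per site:", (E[-1] - E[0]) / N)   # close to the quoted value of 3
```

Dense diagonalization is only feasible here because N is small; the paper's own numerics use N up to 13, which is the practical ceiling for this brute-force construction.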

A. Diagonal variational ensemble
Here we characterize the nature of the converged variational states in the energy basis. To do so, we first define a "broadened" microcanonical ensemble as was done in Ref. [5]. This ensemble is of the form

ρ_{λ,δ} = G_δ(H − λ) / D_δ(λ),   (4.2)

where G_δ(x) = (2πδ²)^{-1/2} e^{-x²/(2δ²)} is a normalized Gaussian function, and D_δ(λ) = tr G_δ(H − λ) is the "broadened" density of states evaluated at energy λ. Here, δ corresponds to the convergence criterion for the VME algorithm, i.e., Eq. (3.16) with α = −1/2. We will from now on treat D_δ(λ) as a good approximation to the density of states in the thermodynamic limit (see Appendix A for further justification). We claim that Algorithm 1 generates diagonal energy-basis matrix elements ρ_R(E) = ⟨E|ρ_R|E⟩ approximating a broadened microcanonical ensemble. Fig. 4(a) shows the variational-ensemble diagonal energy-basis matrix elements, to which we fit the curve ρ_{µ,σ}(E) = D^{-1}(µ) G_σ(E − µ) with fitting parameters µ and σ (shown in solid black). Up to fluctuations from eigenstate to eigenstate, we can see that the variational diagonal ensemble is well described by the Gaussian best fit. More precisely, the coarse-grained version ρ̄_R(E) of the variational ensemble diagonal elements (where the fluctuations are eliminated) agrees quite well with the Gaussian best fit. The coarse-grained curve is computed as follows: for each E in the spectrum of H, ρ̄_R(E) is defined as the average of ⟨E′|ρ_R|E′⟩ over the K eigenenergies E′ nearest to E. We set the "resolution" K = 64, except for E near the edges of the spectrum, where 1 ≤ K < 64. At the largest system size of N = 13 and for an R = 288-state ensemble, we list the best-fit parameters µ and σ in Table I.
Note that away from λ = 0, the variational ensembles converge with µ slightly different from the target λ. This is because the variational ensemble is generated by minimizing the first two moments of the operator (H − λ), but the density of states of the underlying model is non-uniform. In fact, if we assume the Gaussian density of states discussed in Appendix A, we have tr(ρ_{µ,σ}H) ≈ µ − O(σ²µ/N), where we have assumed that σ decreases with N so that higher-order terms can be neglected in the thermodynamic limit. We can see that when µ < 0, the ensemble ρ_{µ,σ} actually has average energy larger than µ. In Appendix D we confirm this statement more quantitatively. As a consequence of this analysis, we conclude that the deviations of µ from λ are a finite-size effect due to the non-constant density of states, and not due to the fluctuations around the coarse-grained average curve ρ̄_R(E).
In the last row of Table I, we see that all the σ are at least about 15% smaller than δ. This is due to a simplification we have made in the preceding analysis. Note the form of ρ_R(E) in Fig. 4(a) at the edges of the window; the orange curve has more weight away from the window than the Gaussian best fit (in black). A more accurate characterization of the ensemble might be, for example, a sum of two Gaussian curves. We nonetheless opt to treat the ensemble as roughly a single Gaussian peak for simplicity. We discuss the quantitative consequences of this simplification in Appendix C and explain why the values of σ/δ shown in Table I are not closer to unity.
Up to fluctuations around the coarse-grained behavior ρ̄_R(E) and the slight oversimplification of the single-peak Gaussian fit, we have thus established the form of the diagonal part of the variational ensemble. In the following sections we will study the error in the VME estimate of various observables. Clearly, we will need to choose an appropriate microcanonical ensemble to compare to. This ensemble is arguably ρ_{µ,σ}, because it closely approximates the (coarse-grained) diagonal ensemble ρ̄_R(E). One could also imagine comparing to the diagonal ensemble itself and focusing solely on the off-diagonal error, as was done in Ref. [46]. However, viewing the VME as a practical algorithm for computing broadened microcanonical ensemble averages, one could not know ahead of time what µ and σ were. Furthermore, we have shown that the distinction between λ/N and µ/N is a finite-size effect that is already small at N = 13. We will henceforth compare our numerical results to the ensemble ρ_{λ,δ}, where λ and δ are precisely the parameters that were chosen before running the algorithm, i.e. we set ρ_{mc} = ρ_{λ,δ} in Eq. (3.2b).

B. Diagonal error
We now briefly discuss the diagonal error with respect to the broadened microcanonical estimates. Fig. 4(b) shows, for N = 11, 12, 13, the diagonal error versus R ≥ 12 for the operator A = Z_{⌊N/2⌋} at λ/N = −0.5. For comparison we plot a simple estimate of the error, (δ/N)|a′(λ/N)|, at N = 13, as well as the more accurate estimate χ_R defined via Eq. (3.7), which we calculate numerically using the N = 13 variational ensemble. We calculate a′(λ/N) using a fourth-order polynomial fit to the (coarse-grained) graph of ⟨E|A|E⟩ versus E/N, shown as the solid black line in panel (c) of Fig. 4. The coarse-grained curve A_EE is computed in the same way as was ρ̄_R(E) in Sec. IV A, but here we use a resolution of K = 32.
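The derivative a′(λ/N) entering the error estimate can be extracted from the quartic fit just as described; a minimal sketch using numpy's polynomial routines (the function name is illustrative) is:

```python
import numpy as np

def a_prime(E_over_N, A_EE, x0, deg=4):
    """Estimate a'(x0) from a degree-`deg` polynomial fit to the
    coarse-grained eigenstate expectation values A_EE versus E/N."""
    coeffs = np.polyfit(E_over_N, A_EE, deg)     # fit a(x) by a quartic
    return np.polyval(np.polyder(coeffs), x0)    # evaluate its derivative
```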
For this particular operator and energy density, it is clear that the diagonal error decays with N and is an order-one fraction of the rough estimate O(δ/N). Furthermore, for large R its behavior is well captured by the expected estimate χ_R.
In Appendix F we present further numerical results for the four local operators Z, ZZ, X, XX acting on the central one or two sites of the chain at the energy densities λ/N = −0.75, −0.5, −0.25, 0.0. At λ/N = −0.5, all operators have the property that ϵ_R^{diag} decays with N for large R, and the N = 13 values are consistent with the estimate χ_R/N. At other energy densities the scaling with N is not always so well established, but the error for large R is always smaller than (δ/N)|a′(λ/N)|. The value of χ_R also appears to generally be on the correct scale of ϵ_R^{diag}, except for the operator XX, for which χ_R/N undershoots the value of the diagonal error at the higher energy densities. These various deviations are likely due to ETH not yet strongly setting in at such small system sizes. In particular, the N = 13 value of D^{−1/2}(λ) is never smaller than 0.04 for the energy densities we consider, so that the ETH fluctuations, i.e. the third term in Eq. (3.7), could be comparable to δ/N ≈ 0.06 at N = 13, depending on the relative size of A(E) and the ETH function f(E, 0).
In any case, the diagonal error is always quite small across all operators and energy densities, when compared for example against the scale of the microcanonical fluctuations themselves. For example, see Fig. 5(a) where in purple we can see that even for a single variational state, the diagonal ensemble estimate is highly accurate. We observe this across every considered operator and energy density, as shown in Appendix G.

C. Off-diagonal error
FIG. 5. … can only be computed with ED, where off-diagonal contributions can be discarded by hand. In panel (b) we show the finite-sample mean-square off-diagonal error (ϵ_R^{off})² with increasing R for N = 11, 12, 13, where increasing N corresponds to a darker blue curve as in Fig. 4. The orange curve, ⟨(ϵ_R^{off})²⟩_N, is the N-averaged value of (ϵ_R^{off})² as discussed in the text. The dashed black line is a two-parameter best fit to ⟨(ϵ_R^{off})²⟩_N, and for comparison we plot the "theoretical" error σ_R²/R in red. Panel (c) shows probability density functions of the N = 13 eigenvalues spec(Â) and spec(Ã), with Â and Ã as defined in the text. Panel (c) also shows a probability density function of the collection {x_r}_{r,N}, where N runs over a few system sizes as discussed in the text.

We now turn to the numerical details of the off-diagonal error and how ensemble averaging reduces it. First, Fig. 5(a) illustrates the main problem with using a single variational state with a large δ. We take as an example the operator A = X_{⌊N/2⌋} at the energy density λ/N = −0.5 and compare a single variational state estimate ⟨A⟩_1 = ⟨ψ_1|A|ψ_1⟩ to the smooth microcanonical average ⟨A⟩_{λ,δ} = tr(ρ_{λ,δ}A). The estimate is poor even for the largest system size. In Fig. 5(b) we examine how this error is reduced by averaging. It is clear that averaging always reduces the off-diagonal error; however, since we anticipate it to behave as a random variable, we need to measure a statistical quantity to make the analysis precise. Given the finite dataset {x_r}_{r=1}^{288} of samples, we calculate a finite-sample estimate, as a function of R, of the mean-square error of the underlying distribution from which the x_r are drawn, as follows. For each fixed R ∈ 𝓡 = {1, 2, …, 288} we calculate

(ϵ_R^{off})² = (1/S) Σ_{k=1}^{S} [ (1/R) Σ_{r ∈ P_k(𝓡)|_R} x_r ]²,

where P_k(𝓡)|_R is the first R elements of the k-th random permutation of the ordered index set 𝓡.
Choosing S = 100, we plot (ϵ_R^{off})² versus R for N = 11, 12, 13 on a log-log scale, in anticipation of observing an R^{−1} scaling of the off-diagonal error. Across various operators and energy densities (shown in Appendix G), we observe an initial R^{−1} scaling with a prefactor approximately independent of N. However, the large-R value can be either larger or smaller depending on N in a non-systematic way. To remove this dependence we focus our analysis on a system-size-averaged version, shown in orange. The data is well described by a two-parameter best-fit function of the form y_{c,σ}(R) = σ²/R + c², which corresponds to the x_r being effectively IID random variables as discussed in Sec. III. The values of |c| and σ are shown in panel (b). We can see that for the operator X at λ/N = −0.5, shown here in the main text, the value |c| = 0.014 is quite small. In other cases, in particular for XX, it can be slightly larger: |c| = 0.071. Data for all operators and energy densities are shown in Appendix G. We also find that the best-fit parameter σ is on the same scale as the finite-sample estimate σ_R, as demonstrated by the collapse of the red curve σ_R²/R onto the best-fit line in the small-R regime. Here σ_R is calculated for Ã having non-zero energy-basis matrix elements only on an energy window of half-width 3δ [see Eq. (3.10)]. This approximation of computing σ_R based only on the matrix elements of ρ_R and A in energy eigenstates near λ is numerically justified in Appendix E. These results imply that for small R, the statistics of the off-diagonal error are controlled by Ã.
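The finite-sample estimator described above can be implemented efficiently by noting that, for each permutation, the running means of the permuted samples give the first-R averages for all R at once (a minimal numpy sketch with illustrative names, assuming the x_r are supplied as an array):

```python
import numpy as np

def mean_square_off_diag_error(x, S=100, seed=0):
    """For each R = 1..len(x), average over S random permutations the
    squared mean of the first R entries: a finite-sample estimate of
    the mean-square off-diagonal error (eps^off_R)^2."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    acc = np.zeros(n)
    for _ in range(S):
        perm = rng.permutation(x)
        means = np.cumsum(perm) / np.arange(1, n + 1)  # first-R averages, all R
        acc += means ** 2
    return acc / S                                     # entry R-1 is (eps^off_R)^2
```

For IID mean-zero samples this estimator decays as σ²/R, which is exactly the small-R behavior fitted in the text.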
Ref. [34] inferred correlations between energy-basis matrix elements of local operators A from the form of the eigenvalue statistics of certain sub-matrices of A. To help understand the nature of the off-diagonal error, in Fig. 5(c) we also examine the eigenvalue distribution of the operator Ã defined on an energy window of half-width 3δ centered on energy λ, i.e. as in Eq. (3.10) but with W slightly expanded to accommodate the tails of the roughly Gaussian variational states. See Appendix E for a graphical representation of how this energy window is defined. For comparison we also show in Fig. 5(c) the eigenvalue distribution of the operator Â, which is simply the operator A with its diagonal energy-basis elements deleted. These eigenvalue distributions display a number of interesting qualitative properties relevant to the off-diagonal error in the VME.
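Given exact eigendata on a small system, the operators Â and Ã can be formed directly; a minimal numpy sketch (illustrative names; the window projection follows the description above, restricting to eigenstates with |E − λ| ≤ 3δ) is:

```python
import numpy as np

def off_diagonal_ops(H, A, lam, delta, width=3.0):
    """Energy-basis operators used in the off-diagonal analysis:
    A_hat   = A with its diagonal energy-basis elements deleted;
    A_tilde = A_hat restricted to the window |E - lam| <= width*delta."""
    E, V = np.linalg.eigh(H)
    A_eb = V.conj().T @ A @ V                    # A in the energy eigenbasis
    A_hat = A_eb - np.diag(np.diag(A_eb))        # delete diagonal elements
    mask = np.abs(E - lam) <= width * delta
    P = np.diag(mask.astype(float))              # projector onto the window
    A_tilde = P @ A_hat @ P                      # zero outside the window
    return A_hat, A_tilde
```

The eigenvalue distributions discussed in the text are then simply histograms of `np.linalg.eigvalsh(A_hat)` and of the nonzero part of `np.linalg.eigvalsh(A_tilde)`.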
Firstly, we observe that the eigenvalue distribution of Ã does not appear to qualitatively change shape as N is varied, apart from a slight reduction of its total width with increasing N, as demonstrated in Appendix E. Since it is Ã which roughly determines the off-diagonal error, this qualitative lack of N dependence agrees with the fact that the off-diagonal error does not depend on N in a systematic way at the system sizes we examine.
Interestingly, we observe a correlation between the single-state variational estimates in Fig. 5(a) as N is varied and the eigenvalue distribution of Ã. When ⟨A⟩_1 over/under-estimates the microcanonical value across many system sizes, the eigenvalue distribution is biased to the right/left of zero. For further evidence that this correlation is not an artifact of this energy density or of the choice of operator, see Appendix G. To demonstrate this further, we there also plot a histogram of the off-diagonal error present in many individual variational state samples, including samples across a window of system sizes: specifically, we show a normalized histogram of the values in the set {x_r^N}_{r,N}, where x_r^N is the off-diagonal error in variational state r when the system size is N. Using the statistics across multiple system sizes is justified here since the off-diagonal error varies erratically with the system size. On a qualitative level, this latter histogram confirms that the variational states sample spec(Ã) uniformly enough that a bias of spec(Ã) to the left or right of zero is reflected in the statistics of x_r. This fact is not too surprising, since roughly speaking Ã determines the off-diagonal error via

x_r ≈ Σ_ã ã |⟨ã|ψ_r⟩|²,

with ã and |ã⟩ the eigenvalues and eigenvectors of Ã. However, it is not obvious that |⟨ã|ψ_r⟩|² is a uniform distribution, and furthermore this approximation should only be understood statistically, since in actuality it is the properties of Â which precisely determine the error:

x_r = Σ_â â |⟨â|ψ_r⟩|², (4.7)

and the truncation of Â to Ã is only shown in Appendix E to hold rigorously for the quantity σ_R, as opposed to individual realizations x_r.
At all energy densities, the eigenvalue distribution of Ã for A = XX is much flatter than for the other three operators considered, and the histogram of x_r is also qualitatively different for XX; these two observations could be related to the fact that the off-diagonal error for A = XX generally saturates sooner, and to a larger value, than for the other operators. However, we leave an identification of the precise underlying mechanism to future work.
We conclude the discussion of the off-diagonal error with some observations about the eigenvalue statistics of the full-spectrum off-diagonal operator Â. Even though the variational states in principle have support on the entire energy spectrum, we can see that it is the statistics of Ã, and not of Â, that are correlated with the off-diagonal error, further justifying the truncation to a local energy window. Interestingly, Â still has a spectral form similar to that of a Pauli string with eigenvalues ±1, but the otherwise highly degenerate peaks have been smeared out by the removal of the diagonal energy-basis elements. In Appendix G we show the eigenvalue distributions of Â for other A, and note that the distribution for XX looks the most similar to that of a Pauli string, i.e. its peaks have been broadened the least. When the window is reduced to the scale δ, the distribution becomes less similar to that of a Pauli operator, and we can expect that as δ → 0 the spectrum approaches that of a random matrix, i.e. the semi-circle law [34,35,47]. The fact that the eigenvalue distribution is so far from a semi-circle law on the scale δ = O(N^{−1/2}) provides further confirmation that the ⟨E|A|E′⟩ are not effectively independently distributed, and thus we cannot rely on randomness of the matrix elements alone to make the off-diagonal error small.

D. Explicit microcanonical estimates and trace distance
Having examined in some detail the scaling of the absolute diagonal and off-diagonal errors, we now take a step back and consider the overall statistics of the variational estimates when compared to the microcanonical averages and their associated microcanonical fluctuations. For example, in Fig. 6 we show, for various system sizes, the variational estimate tr(ρ_R A) for R = 288 along with its standard error. This is compared against the broadened microcanonical average calculated from ED, with error bars indicating one microcanonical standard deviation, ∆A_{λ,δ} = [tr(ρ_{λ,δ}A²) − tr(ρ_{λ,δ}A)²]^{1/2}, which is another scale against which the error can be compared. Running the VME two different times yields two different variational ensembles ρ_R and ρ′_R; the orange error bars measure how much tr(ρ_R A) and tr(ρ′_R A) would differ at R = 288. We find that the energy densities λ/N = −0.75 and λ/N = −0.5 generally have more accurate estimates than λ/N = −0.25 and λ/N = 0; see Appendix H for further results. For a fixed R, the error certainly does not systematically decrease with N. In some cases it appears to increase with N, for example for X at λ/N = −0.5, shown in Fig. 6, or in some cases for XX, as discussed in Appendix H. We do not consider these fixed-R observations at odds with the off-diagonal error analysis, where we stated that over a large range of R the error does not appear to depend systematically on N. We note that the operator XX generally appears to deviate at the larger system sizes more than Z, X, ZZ across various energy densities, which agrees with the previous observations that XX behaves differently from the other operators.
To see whether converged ensembles tend towards the microcanonical state with increasing N in an observable-independent way, we check whether, for a subsystem S of the chain, the state ρ_R^S = tr_{S̄}(ρ_R) approaches the subsystem broadened microcanonical state ρ_{λ,δ}^S = tr_{S̄}(ρ_{λ,δ}), where S̄ denotes the complement of S. As a distance measure we use the trace distance, T(ρ, ρ′) = (1/2)||ρ − ρ′||_1, which measures the distinguishability of ρ and ρ′ in an operationally meaningful way [48]. More importantly for our purposes, however, it also bounds the difference in the expectation value of any observable A as

|tr(ρA) − tr(ρ′A)| ≤ 2 T(ρ, ρ′) σ_max(A), (4.10)

where σ_max(A) is the maximum singular value of A. This can be derived using von Neumann's trace inequality, |tr(XY)| ≤ Σ_i σ_i(X)σ_i(Y). In particular, for a Pauli string operator we have σ_max(A) = 1. We plot T(ρ_R^S, ρ_{λ,δ}^S) versus N, averaged over all contiguous subsystems S of fixed size |S|. We observe that the trace distance for a single-site subsystem |S| = 1 is always smaller (by about a factor of 1/2) than for a subsystem of two nearest-neighbor sites, |S| = 2. In light of the previous analysis, where the off-diagonal error did not appear to depend systematically on N, we choose R to scale with N as R(N) = O(N²), and further average over 20 ensemble realizations in each case. Since we have only 288 samples in total, we compute the statistics of 20 random (possibly overlapping) subsets of size R.
The reason one may want to increase R with N is that the trace distance satisfies the inequality

T(ρ_R^S, ρ_{λ,δ}^S) ≤ (1/R) Σ_{r=1}^{R} T(ρ_r^S, ρ_{λ,δ}^S),

with ρ_r^S = tr_{S̄} |ψ_r⟩⟨ψ_r|, so that a large-R variational ensemble can only do better, on average, than single pure states. Since we are interested in the behavior of the algorithm in the thermodynamic limit, and in light of our results above suggesting that the off-diagonal error does not depend strongly on N, we consider the case R = O(N²), for which we observe a continual decrease of the trace distance with N at the lower energy densities. It is possible, however, that for a larger number of samples the spatially averaged trace distance will also eventually saturate to a finite value, as it did for A = (XX)_{⌊N/2⌋} in particular.
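For small N these subsystem trace distances can be evaluated directly; a minimal numpy sketch (illustrative function names; the partial trace over the complement of the block is implemented by axis reshuffling) is:

```python
import numpy as np

def reduced_density_matrix(psi, N, sites):
    """Reduced state of the qubits in `sites`, tracing out the rest of
    an N-qubit pure state given as a length-2^N vector."""
    psi = np.asarray(psi).reshape([2] * N)
    psi = np.moveaxis(psi, list(sites), list(range(len(sites))))
    M = psi.reshape(2 ** len(sites), -1)        # block vs. rest bipartition
    return M @ M.conj().T

def trace_distance(rho, sigma):
    """T(rho, sigma) = (1/2) * sum of singular values of rho - sigma."""
    return 0.5 * np.linalg.norm(rho - sigma, ord='nuc')
```

The bound of Eq. (4.10) then amounts to checking |tr(ρA) − tr(σA)| against 2·`trace_distance(rho, sigma)` times the largest singular value of A.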
The trace distance results in Fig. 7 are consistent with the estimates for particular local operators. For example, comparing with Fig. 6 at N = 13, we checked numerically that the deviation of the average estimate (corresponding to R = 288) from the microcanonical one never exceeds twice the trace distance at the corresponding energy density. Note, however, that the N = 13 values of the trace distances shown correspond to R = 253, whereas the observables correspond to R = 288. Furthermore, the trace distance is averaged over all contiguous subsystems of size |S|, so in principle this value no longer exactly upper bounds the observable deviations as in Eq. (4.10), which are obtained for the particular site j = ⌊N/2⌋; in this case the results are nonetheless consistent.

E. Quantum resources
In Fig. 8(a) we plot the ensemble-averaged von Neumann entanglement entropy of a contiguous subsystem S of the chain versus |S| at N = 13. Here, the von Neumann entanglement entropy of a pure state |ψ⟩ is S_vN(|ψ⟩) = −tr(ρ_S ln ρ_S) with ρ_S = tr_{S̄} |ψ⟩⟨ψ|. We find that on average the variational states appear to be area-law entangled, because the values saturate with increasing |S|. Regardless of the scaling with |S|, we note that the entanglement of the variational states is much smaller than the average entanglement of eigenstates within the broadened microcanonical window. In particular, in Fig. 8(b) we compute the broadened microcanonical average of the von Neumann entropy, Σ_E D_δ^{−1}(λ) G_δ(E − λ) S_vN(|E⟩). Comparing Figs. 8(a) and (b), we observe that the variational states have an order of magnitude less entanglement than the eigenstates of which they are superpositions. Panel (b) also contains the Page entropy, i.e., the average entanglement of states drawn Haar-randomly from the full Hilbert space [49]. We find that the λ = 0 microcanonical average of the entanglement entropy is quite close to the Page value, providing an additional check that the MFIM is highly nonintegrable for the chosen parameters. The low entanglement of the variational states can be attributed to the fact that they are generated by low-depth circuits, which makes them atypical. In Fig. 8(c) we plot the average number of layers p_* yielding convergence, which is a proxy for the circuit depth needed to prepare the converged variational state (the depth of a p-layer ansatz circuit is 3p, see Fig. 3). We show the behavior for λ/N = −0.5, but we find a linear scaling in N at the other energy densities as well (with slightly different slopes), so that they require a similar circuit depth. Since the total number of variational parameters is 2Np_*, the total number of gates scales roughly quadratically in N. The actual number of variational parameters at N = 13 and λ/N = −0.5 is only about 78.
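For a pure state on a small chain, S_vN of a contiguous block follows directly from the Schmidt (singular) values of the block/rest bipartition; a minimal numpy sketch (illustrative names) is:

```python
import numpy as np

def entanglement_entropy(psi, N, block):
    """S_vN of a block of qubits: -tr(rho_S ln rho_S), computed from the
    singular values of the block/rest bipartition of the state vector."""
    psi = np.asarray(psi).reshape([2] * N)
    psi = np.moveaxis(psi, list(block), list(range(len(block))))
    M = psi.reshape(2 ** len(block), -1)
    s = np.linalg.svd(M, compute_uv=False)   # Schmidt coefficients
    p = s ** 2
    p = p[p > 1e-15]                         # drop numerical zeros
    return float(-np.sum(p * np.log(p)))
```

Averaging this quantity over the R variational states, or over eigenstates weighted by the broadened microcanonical weights, reproduces the two averages compared in Fig. 8(a) and (b).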
We also briefly consider how the entanglement and the number of layers in the variational ansatz depend on the tolerance δ, i.e. on the exponent α in Eq. (3.16). The additional optimization cost of converging at larger |α| limits the numerically accessible system sizes to around N = 8, similar to Ref. [29]. Fig. 9 shows that both the entanglement and the circuit depth increase with |α|. For example, for |α| = 1, already at N = 8 we need p_* ≈ 5 layers for convergence, making this scaling prohibitive for the classical optimizer. It appears that p_*(N) grows much faster with N at |α| = 1 than it does for |α| ≤ 0.75. The fact that larger entanglement and higher circuit depths are needed to prepare variational states with smaller energy variance is consistent with other studies, e.g. Ref. [24].

V. CONCLUSION AND OUTLOOK
In this work we propose a VQA for estimating Gaussian microcanonical averages of local operators at intermediate energy density. Given the target average energy λ and a microcanonical width of O(N^{−1/2}), the variational algorithm evolves random product states into weakly entangled states whose diagonal ensembles are approximately Gaussian-microcanonical on average. We have systematically examined what we call the diagonal and off-diagonal contributions to the error in this estimation, and found that the latter is the dominant source of error. The mean-square off-diagonal error is on the one hand parametrically reduced by ensemble averaging as R^{−1} for small R, but on the other hand saturates for large R to a small value |c| whose size depends principally on the observable under consideration. We have left the identification of the mechanism behind this bias, and methods to remove it, for future work.
We have also examined the performance of the algorithm in an observable-independent way by computing the trace distance between the subsystem variational ensemble and the subsystem microcanonical one, which appears to continually decrease with N when we take R = O(N²) and λ/N = −0.75, −0.5 (though we cannot rule out saturation), whereas for λ/N = −0.25, 0 there is no consistent decay. We find this result interesting because for other finite-temperature VQA methods, intermediate temperatures (as opposed to infinite temperature) are more difficult to simulate [17,19]. Since the number of variational parameters appears to scale roughly as O(N²), a classical optimizer could plausibly handle the optimization at larger system sizes. The main bottleneck in the classical simulation of our proposed VQA is the repeated evaluation of the cost function; it would be useful to implement a more sophisticated classical simulation of the QPU, for example by tensor network methods, so that larger systems could be reached. This would help decide whether the trend in the trace distance continues at larger N.
Before concluding, let us briefly discuss the complexity of the VME algorithm. Neglecting the non-zero bias constant |c| on the basis that it is small, we note that the computational time complexity of VME is O(M RGS) where M is the number of times the cost function is requested during the optimization, R is the number of states in the variational ensemble, G is the number of gates in the variational circuit, and S is an upper bound on the number of shots needed to estimate the cost function during each evaluation.
Let Ω = (H − λ)². In the beginning of the optimization, when the state |ψ(θ)⟩ is essentially a product state, we have Var_{ψ(θ)}(Ω) = O(N²) (neglecting the small non-Gaussian tails of ρ_R(E) that were discussed in Sec. IV A). Thus, the most "shot costly" part of the optimization is the beginning. To simply upper bound the resources, we thus take S = O(N²). Now, we have observed empirically that G = O(N²), and that the off-diagonal error scales as R^{−1/2} (again neglecting |c|). Therefore, since the diagonal error is O(N^{−1}), we may take R = O(N²) and conclude that the time complexity of VME is O(MN⁶) in achieving a statistical error of O(N^{−1}). Whether or not the number of calls M is exponentially large in N remains an open question for all variational quantum algorithms.
In Sec. IV A we saw that the diagonal error vanishes as O(δ/N) with increasing system size, so that even product states [whose typical energy width is O(√N)] with target energy λ would suffice for a vanishing diagonal error, provided the ETH holds. An ensemble of R random product states would also yield an off-diagonal error proportional to R^{−1/2}, but it is unclear whether one can variationally squeeze those states onto a microcanonical window without violating the unbiased condition E[x_r] = 0.
As far as variational algorithms are concerned, the VME could be considered as an example of a broader class of VQAs whose convergence criterion is based on the value of the cost function rather than its gradient. Furthermore, the smallness of the cost function at convergence is only O[1/poly(N)], while all initial random product states were able to converge, so it would be interesting to study whether the well-known barren-plateau problem [13,14] is less significant in this setting.

FIG. 10. The broadened density of states of the MFIM plotted against N^{−1/2}E, so that the form of the curves is N-independent. At N = 13, a Gaussian best fit yields parameters γ = ∆²/N = 2.47 and Ē/∆ = −0.03. The theoretical value calculated directly from H, with γ_th = ∆²_th/N = 2.35, is also plotted and agrees well. The prefactor k depends on the curve and is chosen so that the maximum value of each curve is 1 [the factor is (2π∆²)^{1/2} when the width is ∆].

… over K eigenstates near |E⟩, and |⟨E′|A|E′⟩| ≤ 1 for Pauli strings. Now, since A(E) = a(E/N), it follows that a(x) = O(1) and a′(x) = O(1), and similarly for higher derivatives. Now consider a broadened microcanonical ensemble at energy λ, i.e. ρ_{λ,δ} = D^{−1}(λ)G_δ(H − λ). Then the expectation value of the smooth ETH function in this ensemble is obtained by going to the continuum and combining Eqs. …

… The coarse-grained variational states have some excess energy weight beyond the Gaussian best-fit curve. Furthermore, the variational ensembles are not exactly centered on λ. Thus, we justify replacing Â with an appropriately chosen Ã numerically as follows. In Fig. 12 we compare σ_R when computed on a window of size 2s × 2s (see Fig. 11 for clarification) with σ̃_R, which is computed from the entire spectrum. We show on the plots the fraction of σ̃_R that is captured by σ_R at s = 3, i.e. roughly three standard deviations, which we consider to be sufficiently large to capture essentially all of σ̃_R. For the 3δ-truncated operators Ã just discussed, in this Appendix we also consider how the maximum singular value of Ã, i.e. the larger of |λ_max(Ã)| and |λ_min(Ã)|, scales with N. The results are shown in Fig. 13 and provide evidence that the numerical prefactor σ_R appearing in the statistical description of the off-diagonal error is O(1) in system size. In this Appendix we also consider how the eigenvalue statistics of Ã qualitatively vary with N; see Fig. 14, demonstrating that the distribution is qualitatively independent of N, except for the slight decrease of the distribution's width with N reflected in Fig. 13.
Appendix F: Additional numerical data for diagonal error

Fig. 15 in this Appendix shows the diagonal error in the VME estimate for the operators Z, ZZ, X, XX acting in the middle of the chain. As stated in the main text, the results for energy density λ/N = −0.5 agree best with the prediction of ETH that the error should decay as 1/N, and agree with χ_R/N [see Eq. (3.8)] for large R. While in general such a clear scaling with N is missing, all cases show that the N = 13 diagonal error is never larger than some order-one fraction of the rough estimate (δ/N)|a′(λ/N)|, with XX at λ/N = 0 being the case where the diagonal error comes closest to the estimate. We also note that the ETH prediction χ_R/N for the diagonal error is always on the correct scale of the error for N = 13. The operator XX is also an outlier in this sense: χ_R/N significantly underestimates the actual error except at λ/N = −0.5.

FIG. 15. Diagonal error in the VME. Curves are interpreted identically to those in Fig. 4(b) in the main text, except that to bring the estimate (δ/N)|a′(λ/N)| down to scale we plot instead, in some cases, (δ/4N)|a′(λ/N)|. The latter are shown in orange instead of green to indicate the use of a constant scale factor.

We also examine whether, for fixed R, the accuracy of the VME estimates systematically depends on N. Here we can see that beyond N ∼ 9 it does not, except again for XX, where the error appears to increase with N, consistent with the R = 288, N = 13 value of the off-diagonal error for XX being the largest in the plots in Appendix G. We see that the microcanonical estimates can be quite good, as for Z at λ/N = −0.5, or quite poor, as for XX at λ/N = 0. The better results are generally for the operators Z, ZZ, and X at the lower target energy densities λ/N = −0.75 and λ/N = −0.5.