Minimizing Estimation Runtime on Noisy Quantum Computers

The number of measurements demanded by hybrid quantum-classical algorithms such as the variational quantum eigensolver (VQE) is prohibitively high for many problems of practical value. For such problems, realizing quantum advantage will require methods that dramatically reduce this cost. Previous quantum algorithms that reduce the measurement cost (e.g., quantum amplitude and phase estimation) require error rates that are too low for near-term implementation. Here we propose methods that take advantage of the available quantum coherence to maximally enhance the power of sampling on noisy quantum devices, reducing the measurement number and runtime compared to the standard sampling method of the VQE. Our scheme derives inspiration from quantum metrology, phase estimation, and the more recent "alpha-VQE" proposal, arriving at a general formulation that is robust to error and does not require ancilla qubits. The central object of this method is what we call the "engineered likelihood function" (ELF), used for carrying out Bayesian inference. We show how the ELF formalism enhances the rate of information gain in sampling as the physical hardware transitions from the regime of noisy intermediate-scale quantum computers to that of quantum error-corrected ones. This technique speeds up a central component of many quantum algorithms, with applications including chemistry, materials, finance, and beyond. Similar to the VQE, we expect small-scale implementations to be realizable on today's quantum devices.


I. INTRODUCTION
Which quantum algorithms will deliver practical value first? A recent flurry of methods that cater to the limitations of near-term quantum devices have drawn significant attention. These methods include the variational quantum eigensolver (VQE) [1][2][3][4], the quantum approximate optimization algorithm [5] and variants [6], the variational quantum linear systems solver [7][8][9], other quantum algorithms leveraging the variational principles [10], and quantum machine learning algorithms [11][12][13]. In spite of such algorithmic innovations, many of these approaches have appeared to be impractical for commercially relevant problems owing to their high cost in terms of the number of measurements [2,14] and hence runtime of the expectation value estimation subroutine. Through an extensive benchmarking study of the VQE algorithm [15], it was shown that, for a set of molecules having industrial relevance, the VQE is very unlikely to yield an advantage over state-of-the-art quantum chemistry methods. Unfortunately, methods offering a quadratic speedup over the VQE in the runtime of the expectation value estimation subroutine, such as phase estimation, demand quantum resources that are far beyond the reach of near-term devices for moderately large problem instances [16].
Expectation value estimation is also a building block for many nonvariational quantum algorithms that have high-impact applications. Unfortunately, the standard versions of these algorithms lie out of reach for near-term quantum computers, in part due to the coherence requirements needed to implement estimation subroutines. Such techniques include quantum algorithms for Monte Carlo estimation [17] and quantum algorithms for solving linear systems of equations [18,19]. These algorithms find application in finance, engineering, and machine learning. Thus, there is strong motivation for developing estimation methods that make these techniques more amenable to near-term implementation.
Recently, the method of "α-VQE" [20] was proposed for interpolating between the VQE and phase estimation in terms of the asymptotic trade-off between sample count and quantum coherence. The basic idea is to start from the general framework of the VQE, namely the iterative optimization of the energy expectation that is a sum of individual operator expectations, and proceed to estimate each individual operator with a Bayesian variant of the overlap estimation algorithm [21] that shares the same statistical inference backbone with known Bayesian parameter estimation schemes [22][23][24][25]. While phase estimation is commonly regarded as a quantum algorithm intended for fault-tolerant quantum computers, previous works [20,25] have demonstrated that in a noisy setting, Bayesian phase estimation can still yield quantum advantages in sampling. For instance, in Ref. [25,Sec. III A] it is shown that depolarizing noise reduces but does not eliminate the ability of the likelihood function to distinguish between different possible values of the parameter to be estimated.
This motivates the central question of our work: with realistic, noisy quantum computers, how do we maximize information gain from the coherence available to speed up expectation value estimation, and in doing so, speed up algorithms such as the VQE that rely on sampling? We note that this question is not only urgently relevant in the current era of noisy quantum devices without quantum error correction, but remains relevant for error-corrected quantum computation.
In this work, we investigate the impact of gate and measurement-readout error on the performance of quantum estimation tasks such as amplitude estimation and expectation value estimation. Note that the standard formulation of amplitude estimation [26] is equivalent to estimating the expectation value of an observable with eigenvalues ±1 with respect to a quantum state generated by a given circuit. We introduce a simple noise model and show that the typical sample-generation scheme is hindered by a phenomenon we refer to as "dead spots" in the likelihood function. Motivated by these findings, we develop the framework of engineered likelihood functions (ELFs), in which signal processing techniques are used to boost the information gain per sample during the inference process of estimation. We develop several algorithms for the framework of engineered likelihood functions and investigate their performance with simulations. Finally, we develop a model for the runtime performance of these estimation algorithms and discuss the implications for near-term and far-term quantum computing.
The remainder of the paper is organized as follows. The remaining subsections of the Introduction review relevant prior work on quantum estimation and describe our main results in more detail. In Sec. II, we present a concrete example of our scheme for readers who wish to glean only a first impression of the key ideas. Subsequent sections then expand on the general formulation of our scheme. In Sec. III we describe in detail the general quantum circuit construction for realizing ELFs, and analyze the structure of the ELF in both noisy and noiseless settings. In addition to the quantum circuit scheme, our scheme also involves (1) tuning the circuit parameters to maximize information gain, and (2) Bayesian inference for updating the current belief about the distribution of the true expectation value. In Sec. IV we present heuristic algorithms for both. We then show numerical results in Sec. V, comparing our approach with existing methods based on Chebyshev likelihood functions (CLFs). In Sec. VI, we derive a mathematical model for the runtimes of our algorithms on noisy devices. We conclude in Sec. VII with implications of our results from a broad perspective of quantum computing.

A. Prior work
Evaluating the expectation value of an operator P with respect to a quantum state |A⟩ is a fundamental element of quantum information processing. In the simplest setting, where samples are drawn repeatedly by measuring the same operator P on the same quantum state |A⟩, the measurement process is equivalent to a sequence of independent Bernoulli experiments. Obtaining an estimate of the expectation value within error ε (with high probability) requires the number of samples to scale as O(1/ε²). We highlight the following key points regarding quantum estimation that are relevant to the context of this work.
1. Cost scaling improvement using phase estimation. Quantum effects introduce the opportunity to asymptotically accelerate the measurement process. In particular, there is a set of schemes [21] based on quantum phase estimation that is able to reduce the sample complexity to O(log(1/ε)). This saturates the information-theoretic lower bound for the number of samples, since in order to determine a bounded quantity to resolution ε one must be able to distinguish O(1/ε) different values, requiring at least Ω(log(1/ε)) bits [24]. However, such optimal sample complexity comes at the cost of O(1/ε) many coherent quantum operations. This trade-off between sample complexity and quantum coherence is also well understood in quantum metrology [27,28].

2. Amplitude estimation and generalized reflections. Phase estimation is closely related to the task of amplitude estimation [26], with many of the performance guarantees of the former applying to the latter. In its original definition, amplitude estimation is the problem of evaluating the quantity ⟨A|P₊|A⟩, where P₊ is a projection operator and |A⟩ = A|0ⁿ⟩ is an ansatz state. This problem is essentially equivalent to estimating the expectation value ⟨A|P|A⟩, where P = 2P₊ − I is a reflection operator. Brassard et al. [26] showed that amplitude estimation can be solved by running phase estimation on the Grover iterate U = (2|A⟩⟨A| − I)P with initial state |A⟩. Namely, the desired amplitude information is encoded in the eigenvalues of the unitary operator U. Subsequent works demonstrated [29] that using generalized reflection operators [i.e., those with a phase parameter ϕ such that R_ϕ = (1 − e^{iϕ})|A⟩⟨A| − I or R_ϕ = (1 − e^{iϕ})P₊ − I], one can realize a much larger set of SU(2) rotations in the subspace span{|A⟩, P|A⟩} than with only common reflection operators. The set of SU(2) rotations realizable with such generalized constructions has also been rigorously characterized [30] and later used for some of the most advanced Hamiltonian simulation algorithms such as qubitization [31] and signal processing [32].
3. Bayesian inference perspective. The problem of expectation value estimation can be framed as a parameter estimation problem, common in statistical inference. In fact, previous work (for example, Ref. [3, Sec. IV A]) has already pointed out a Bayesian perspective for considering the standard sampling process for VQE algorithms. The general setting is first to treat the operator expectation Π = ⟨A|P|A⟩ as the parameter to be estimated. Then, a parameterized quantum circuit V(θ⃗) that may be related to |A⟩ and P is constructed. The ability to execute the circuit and collect measurement outcome d translates to the ability to sample from a likelihood function p(d|θ⃗, Π).
For a given prior p(Π) representing the current belief of the true value of Π, Bayesian inference uses a measurement outcome d to produce (or update the prior to) a posterior distribution p(Π|d) = p(d|Π, θ⃗)p(Π)/[∫ p(d|Π, θ⃗)p(Π)dΠ]. For the settings considered in this paper, as well as in previous works [22][23][24][25], the prior and posterior distributions are maintained on the classical computer, while sampling from the likelihood function involves using a quantum device.
The combination of phase estimation and the Bayesian perspective gives rise to Bayesian phase estimation techniques [24,25,33] that are more suitable for noisy quantum devices capable of realizing limited-depth quantum circuits than earlier proposals [34]. The goal of Bayesian phase estimation is to estimate the phase θ = arccos(Π) in an eigenvalue e^{iθ} of the unitary U. The quantum circuits used in this algorithm yield measurement outcomes with likelihoods given by

P(d|Π) = [1 + (−1)^d T_m(Π)]/2,    (1)

where d ∈ {0, 1} and T_m(Π) = cos[m arccos(Π)] is the mth-degree Chebyshev polynomial, found in many settings beyond Bayesian phase estimation (cf. Refs. [22, Eq. 2], [23, Eq. 1], [25, Eq. 2], and [20, Eq. 4]). In Ref. [23] the exponential advantage of Bayesian inference with a Gaussian prior over other nonadaptive sampling methods is established by showing that the expected posterior variance σ decays exponentially in the number of inference steps. Such exponential convergence comes at a cost of O(1/σ) quantum coherence required at each inference step [23]. This scaling is also confirmed in Ref. [25] in the context of Bayesian phase estimation.
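To make this inference loop concrete, the following sketch simulates grid-based Bayesian updates under the likelihood P(d|θ) = [1 + (−1)^d cos(mθ)]/2. It is a simplified stand-in for the Gaussian treatments of Refs. [23,25]: all parameter values are illustrative, and the degree m is cycled so that the posterior stays unimodal.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative setup: true value Pi_true, phase theta = arccos(Pi_true).
Pi_true = 0.3
theta_true = np.arccos(Pi_true)

theta_grid = np.linspace(1e-3, np.pi - 1e-3, 2001)
posterior = np.ones_like(theta_grid)       # flat prior over (0, pi)
posterior /= posterior.sum()

for step in range(200):
    m = (step % 3) + 1                     # cycle degrees m = 1, 2, 3
    p0 = 0.5 * (1 + np.cos(m * theta_true))
    d = int(rng.random() > p0)             # simulate one measurement outcome
    likelihood = 0.5 * (1 + (-1) ** d * np.cos(m * theta_grid))
    posterior *= likelihood                # Bayes' rule on the grid
    posterior /= posterior.sum()

theta_hat = (theta_grid * posterior).sum()
print(abs(np.cos(theta_hat) - Pi_true))    # estimation error shrinks with steps
```

Including m = 1 rounds breaks the degeneracy of cos(mθ) for m > 1, while the higher-degree rounds supply most of the information gain.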
Combining the above observations, one may devise a Bayesian inference method for expectation value estimation that smoothly interpolates between the standard sampling regime and the phase estimation regime. This is proposed in Ref. [35] as "α-VQE," where the asymptotic scaling is O(1/ε^α), with the extremal value α = 2 corresponding to the standard sampling regime (typically realized in the VQE) and α = 1 corresponding to the quantum-enhanced regime where the scaling reaches the Heisenberg limit (typically realized with phase estimation). By varying the parameters of the Bayesian inference one can also achieve α values between 1 and 2. The lower the α value, the deeper the quantum circuit needed for Bayesian phase estimation. This realizes the trade-off between quantum coherence and asymptotic speedup for the measurement process (point 1 above).
It is also worth noting that phase estimation is not the only paradigm that can reach the Heisenberg limit for amplitude estimation [36][37][38]. In Ref. [36] the authors considered the task of estimating the parameter θ of a quantum state ρ θ . A parallel strategy is proposed where m copies of the parameterized circuit for generating ρ θ , together with an entangled initial state and measurements in an entangled basis, are used to create states with the parameter θ amplified to mθ . Such amplification can also give rise to likelihood functions that are similar to that in Eq. (1). In Ref. [37] it is shown that with randomized quantum operations and Bayesian inference one can extract information in fewer iterations than classical sampling even in the presence of noise. In Ref. [38] quantum amplitude estimation circuits with varying numbers m of iterations and numbers N of measurements are considered. A particularly chosen set of pairs (m, N ) gives rise to a likelihood function that can be used for inferring the amplitude to be estimated. The Heisenberg limit is demonstrated for one particular likelihood function construction given by the authors. Both works [37,38] highlight the power of parameterized likelihood functions, making it tempting to investigate their performance under imperfect hardware conditions. As will become clear, although the methods we propose can achieve Heisenberg-limit scaling, they do not take the perspective of many previous works that consider interfering many copies of the same physical probe.

B. Main results
In this work we focus on estimating the expectation value Π = ⟨A|P|A⟩, where the state |A⟩ can be prepared by a circuit A such that |A⟩ = A|0ⁿ⟩ for some integer n ≥ 1. We consider a family of quantum circuits such that, as the circuit deepens with more repetitions of A, it allows for likelihood functions that are polynomial in Π of ever higher degree. As we demonstrate in the next section with a concrete example, a direct consequence of this increase in polynomial degree is an increase in the power of inference, which can be quantified by the Fisher information gain at each inference step. After establishing this "enhanced sampling" technique, we further introduce parameters into the quantum circuit and render the resulting likelihood function tunable. We then optimize the parameters for maximal information gain during each step of inference. See Table I for a comparison of our scheme with relevant proposals that appear in previous literature. The following lines of insight emerge from our efforts.

TABLE I. Comparison of our scheme with relevant proposals that appear in the literature. The features listed include whether the quantum circuit used in the scheme requires ancilla qubits in addition to qubits holding the state for amplitude estimation or phase estimation, whether the scheme uses Bayesian inference, whether any noise resilience is considered, whether the initial state is required to be an eigenstate, and whether the likelihood function (LF) is fully tunable like the ELFs proposed here or restricted to CLFs.
1. The role of noise and error in amplitude estimation. Previous works [20,25,33,37] have revealed the impact of noise on the likelihood function and the estimation of the Hamiltonian spectrum. Here we investigate the same for our scheme of amplitude estimation. Our findings show that, while noise and error do increase the runtime needed for producing an output that is within a specific statistical error tolerance, they do not necessarily introduce systematic bias in the output of the estimation algorithm. Systematic bias in the estimate can be suppressed by using active noise-tailoring techniques [39] and calibrating the effect of noise. We have performed simulations using realistic error parameters for near-term devices and discovered that the enhanced sampling scheme can outperform the VQE in terms of sampling efficiency. Our results also reveal a perspective on tolerating error in quantum algorithm implementation in which higher circuit fidelity does not necessarily lead to better algorithmic performance: for fixed gate fidelity, there appears to be an optimal circuit fidelity in the range of 0.5-0.7 at which the enhanced scheme yields the maximum amount of quantum speedup.
2. Runtime model for noisy devices. We derive bounds on the runtime needed to reach a target accuracy ε in terms of a noise parameter λ. The model interpolates between the O(1/ε) scaling and the O(1/ε²) scaling as a function of λ. Such bounds also allow us to make concrete statements about the extent of quantum speedup as a function of hardware specifications such as the number of qubits and two-qubit gate fidelity, and therefore to estimate runtimes using realistic parameters for current and future hardware.

II. A FIRST EXAMPLE
There are two main strategies for estimating the expectation value A|P|A of some operator P with respect to a quantum state |A . The method of quantum amplitude estimation [40] provides a provable quantum speedup with respect to certain computational models. However, to achieve precision ε in the estimate, the circuit depth needed in this method scales as O(1/ε), making it impractical for near-term quantum computers. The variational quantum eigensolver uses standard sampling to carry out amplitude estimation. Standard sampling allows for low-depth quantum circuits, making it more amenable to implementation on near-term quantum computers. However, in practice, the inefficiency of this method makes the VQE impractical for many problems of interest [2]. In this section we introduce the method of enhanced sampling for amplitude estimation. This technique draws inspiration from quantum-enhanced metrology [41] and seeks to maximize the statistical power of noisy quantum devices. We motivate this method by starting from a simple analysis of standard sampling as used in the VQE. We note that, although the subroutine of estimation is a critical bottleneck, other aspects of the VQE algorithm also must be improved, including the optimization of the parameters in parameterized quantum circuits [42][43][44][45].
The energy estimation subroutine of the VQE estimates amplitudes with respect to Pauli strings. For a Hamiltonian decomposed into a linear combination of Pauli strings H = Σ_j μ_j P_j and an "ansatz state" |A⟩, the energy expectation value is estimated as a linear combination of Pauli expectation value estimates,

Ê = Σ_j μ_j Π̂_j,

where Π̂_j is the (amplitude) estimate of ⟨A|P_j|A⟩. The VQE uses the standard sampling method to build up Pauli expectation value estimates with respect to the ansatz state, which can be summarized as follows. Prepare |A⟩ and measure the operator P, receiving outcome d ∈ {0, 1}. Repeat this M times, receiving k outcomes labeled 0 and M − k outcomes labeled 1.
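To make the bookkeeping concrete, here is a minimal simulation of this procedure; the Hamiltonian coefficients μ_j and "true" Pauli expectations Π_j below are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-term Hamiltonian H = sum_j mu_j P_j with assumed true
# Pauli expectations Pi_j = <A|P_j|A> (values chosen for illustration).
mu = np.array([0.5, -1.2, 0.8])
Pi = np.array([0.3, -0.7, 0.1])
M = 100_000                        # shots per Pauli term

# Standard sampling: each shot of P_j yields +1 or -1 with mean Pi_j.
Pi_hat = np.empty_like(Pi)
for j in range(len(mu)):
    shots = rng.choice([1.0, -1.0], size=M,
                       p=[(1 + Pi[j]) / 2, (1 - Pi[j]) / 2])
    Pi_hat[j] = shots.mean()

E_hat = mu @ Pi_hat                # energy estimate E_hat ~ sum_j mu_j Pi_j
E_true = mu @ Pi
print(E_hat, E_true)               # agree to ~1/sqrt(M) per term
```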
The estimate is then Π̂ = (2k − M)/M. We can quantify the performance of this estimation strategy using the mean squared error of the estimator as a function of time t = TM, where T is the time cost of each measurement. Because the estimator is unbiased, the mean squared error (MSE) is simply the variance of the estimator,

MSE(Π̂) = (1 − Π²)/M = T(1 − Π²)/t.

For a specific mean squared error MSE(Π̂) = ε², the runtime of the algorithm needed to ensure this mean squared error is

t = T(1 − Π²)/ε².

The total runtime of energy estimation in the VQE is the sum of the runtimes of the individual Pauli expectation value estimations. For problems of interest, this runtime can be far too costly, even when certain parallelization techniques are used [46]. The source of this cost is the insensitivity of the standard sampling estimation process to small deviations in Π: the expected information gain about Π contained in the standard-sampling measurement outcome data is low.
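Plugging representative numbers into this runtime expression shows how demanding standard sampling can be; all values below are illustrative.

```python
# Runtime for standard sampling to reach MSE = eps^2: t = T (1 - Pi^2) / eps^2.
T = 1e-3      # time per measurement: 1 ms (illustrative)
Pi = 0.5      # true expectation value (illustrative)
eps = 1e-3    # target root-mean-square error
M = (1 - Pi ** 2) / eps ** 2   # number of shots needed
t = T * M                      # total runtime in seconds
print(M, t)                    # ~7.5e5 shots, ~750 s, for a single Pauli term
```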
Generally, we can measure the information gain of an estimation process of M repetitions of standard sampling with the Fisher information

I_M(Π) = E_D[(∂_Π ln P(D|Π))²],

where D = {d₁, d₂, …, d_M} is the set of outcomes from the M repetitions of standard sampling. The Fisher information identifies the likelihood function P(D|Π) as being responsible for information gain. We can lower bound the mean squared error of an (unbiased) estimator with the Cramér-Rao bound

MSE(Π̂) ≥ 1/I_M(Π).

Using the fact that the Fisher information is additive in the number of samples, we have I_M(Π) = M I(Π), where

I(Π) = 1/(1 − Π²)

is the Fisher information of a single sample drawn from the likelihood function P(d|Π) = [1 + (−1)^d Π]/2. Using the Cramér-Rao bound, we can find a lower bound for the runtime of the estimation process,

t = TM ≥ T/(ε² I(Π)) = T(1 − Π²)/ε²,

which shows that in order to reduce the runtime of an estimation algorithm, we should aim to increase the Fisher information.
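The single-sample Fisher information expression can be checked directly from the definition (the function name below is ours):

```python
import numpy as np

def fisher_standard(Pi):
    # Single-sample Fisher information of P(d|Pi) = [1 + (-1)^d Pi]/2,
    # computed directly as I(Pi) = sum_d (dP/dPi)^2 / P(d|Pi).
    P0, P1 = (1 + Pi) / 2, (1 - Pi) / 2
    return 0.25 / P0 + 0.25 / P1   # |dP/dPi| = 1/2 for both outcomes

for Pi in [0.0, 0.3, -0.8]:
    assert np.isclose(fisher_standard(Pi), 1 / (1 - Pi ** 2))
print("I(Pi) = 1/(1 - Pi^2) verified")
```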
The purpose of enhanced sampling is to reduce the runtime of amplitude estimation by engineering likelihood functions that increase the rate of information gain. We consider the simplest case of enhanced sampling, which is illustrated in Fig. 1. To generate data, we prepare the ansatz state |A⟩, apply the operation P, apply a phase flip about the ansatz state, and then measure P. The phase flip about the ansatz state can be achieved by applying the inverse of the ansatz circuit A⁻¹, applying a phase flip about the initial state, R₀ = 2|0ⁿ⟩⟨0ⁿ| − I, and then reapplying the ansatz circuit A. In this case, the likelihood function becomes

P(d|Π) = [1 + (−1)^d T₃(Π)]/2.

The bias T₃(Π) = 4Π³ − 3Π is a degree-3 Chebyshev polynomial in Π. We refer to such likelihood functions as Chebyshev likelihood functions.
In order to compare the Chebyshev likelihood function of enhanced sampling to that of standard sampling, we consider the case of Π = 0. Here, P(0|Π = 0) = P(1|Π = 0) = 1/2, and so the Fisher information is proportional to the square of the slope of the likelihood function,

I(0) = 4[∂_Π P(0|Π)|_{Π=0}]².

As seen in Fig. 1, the slope of the Chebyshev likelihood function at Π = 0 is steeper than that of the standard sampling likelihood function. The single-sample Fisher information evaluates to I(0) = 1 for standard sampling and I(0) = 9 for enhanced sampling, demonstrating how a simple variant of the quantum circuit can enhance information gain. In this example, using the simplest case of enhanced sampling can reduce the number of measurements needed to achieve a target error by at least a factor of 9. As we discuss later, we can further increase the Fisher information by applying L layers of P · A† · R₀ · A before measuring P. In fact, the Fisher information at Π = 0 then grows as (2L + 1)². We have yet to propose an estimation scheme that converts enhanced sampling measurement data into an estimate. One intricacy that enhanced sampling introduces is the option to vary L as we are collecting measurement data. In this case, given a set of measurement outcomes from circuits with varying L, the sample mean of the 0 and 1 counts loses its meaning. Instead of using the sample mean, we use Bayesian inference to process the measurement outcomes into information about Π. In Sec. III B we describe the use of Bayesian inference for estimation.
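The Fisher-information values quoted above at Π = 0 can be checked from the likelihood slopes; the finite-difference helper below is ours.

```python
import numpy as np

def clf_fisher_at_zero(L):
    # At Pi = 0 both outcomes have probability 1/2, so the single-sample
    # Fisher information reduces to I(0) = 4 * [dP(0|Pi)/dPi at 0]^2,
    # with P(0|Pi) = [1 + T_{2L+1}(Pi)]/2 for the Chebyshev likelihood.
    m = 2 * L + 1
    h = 1e-6
    T = lambda x: np.cos(m * np.arccos(x))    # Chebyshev polynomial T_m
    slope = (T(h) - T(-h)) / (2 * h) / 2      # central difference of P(0|Pi)
    return 4 * slope ** 2

print(clf_fisher_at_zero(0))   # ~1:  standard sampling
print(clf_fisher_at_zero(1))   # ~9:  one enhanced layer
print(clf_fisher_at_zero(2))   # ~25: grows as (2L+1)^2
```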
At this point, one may be tempted to point out that the comparison between standard sampling and enhanced sampling is unfair, because only one query to A is used in the standard sampling case while the enhanced sampling scheme uses three queries to A. It seems that if one considers a likelihood function that arises from three standard sampling steps, one could also yield a cubic polynomial form in the likelihood function. Indeed, suppose one performs three independent standard sampling steps, yielding results x₁, x₂, x₃ ∈ {0, 1}, and produces a binary outcome z ∈ {0, 1} classically by sampling from a distribution P(z|x₁, x₂, x₃). Then the likelihood function takes the form

P(z = 0|Π) = Σ_{h=0}^{3} α_h (3 choose h) [(1 − Π)/2]^h [(1 + Π)/2]^{3−h},    (12)

where each α_h ∈ [0, 1] is a parameter that can be tuned classically through changing the distribution P(z|x₁, x₂, x₃), and h is the Hamming weight of the bit string x₁x₂x₃. Suppose that we want P(z = 0|Π) to be equal to P(d = 0|Π) in Eq. (9). This implies that α₀ = 1, α₁ = −2, α₂ = 3, and α₃ = 0, which is clearly beyond the classical tunability of the likelihood function in Eq. (12). This evidence suggests that the likelihood function arising from the quantum scheme in Eq. (9) is beyond classical means. As the number of circuit layers L is increased, the time per sample T grows linearly in L, while the Fisher information grows quadratically. This linear growth in circuit time, combined with the quadratic growth in Fisher information, leads to a lower bound on the expected runtime, assuming a fixed-L estimation strategy with an unbiased estimator. In practice, the operations implemented on the quantum computer are subject to error. Fortunately, Bayesian inference can incorporate such errors into the estimation process. As long as the influence of errors on the form of the likelihood function is accurately modeled, the principal effect of such errors is only to slow the rate of information gain. Error in the quantum circuit accumulates as we increase the number of circuit layers L.
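The classical-tunability claim above can be checked directly: solving a small linear system for the α_h that would be needed to match the degree-3 Chebyshev bias yields values outside [0, 1].

```python
import numpy as np
from math import comb

# Classical post-processing of three standard-sampling bits x1 x2 x3 gives
#   P(z=0|Pi) = sum_h alpha_h * C(3,h) * [(1-Pi)/2]^h * [(1+Pi)/2]^(3-h),
# where alpha_h = P(z=0 | Hamming weight h) must lie in [0, 1].
# Solve for the alpha_h that would reproduce the quantum likelihood
# [1 + T_3(Pi)]/2, with T_3(Pi) = 4 Pi^3 - 3 Pi.
Pis = np.array([-0.6, -0.2, 0.2, 0.6])          # any four distinct points
A = np.array([[comb(3, h) * ((1 - p) / 2) ** h * ((1 + p) / 2) ** (3 - h)
               for h in range(4)] for p in Pis])
b = (1 + 4 * Pis ** 3 - 3 * Pis) / 2
alpha = np.linalg.solve(A, b)
print(np.round(alpha, 6))                       # alpha = (1, -2, 3, 0)
```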
Consequently, beyond a certain number of circuit layers, we receive diminishing returns with respect to gains in Fisher information (or the reduction in runtime). The estimation algorithm should then seek to balance these competing factors in order to optimize the overall performance. The introduction of error poses another issue for estimation. Without error, the Fisher information gain per sample in the enhanced sampling case with L = 1 is greater than or equal to 9 for all Π. As shown in Fig. 2, with the introduction of even a small degree of error, the values of Π where the likelihood function is flat incur a dramatic drop in Fisher information. We refer to such regions as estimation dead spots. This observation motivates the concept of ELFs, which tune the circuit to increase its statistical power around such dead spots. By promoting the P and R₀ operations to generalized reflections U(x) = exp(−ixP) and R₀(y) = exp(−iyR₀), we can choose rotation angles such that the information gain is boosted around such dead spots. We find that, even for deeper enhanced sampling circuits, engineering likelihood functions allows us to mitigate the effect of estimation dead spots.
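The dead spots are visible in a small calculation. Assuming a simple noise model in which the bias is rescaled by an overall circuit fidelity f (f = 0.9 below is an illustrative value), the per-sample Fisher information of the L = 1 Chebyshev likelihood vanishes wherever T₃(Π) is flat:

```python
import numpy as np

m, f = 3, 0.9                      # degree-3 CLF, assumed circuit fidelity

def noisy_fisher(Pi):
    # Noisy CLF bias is f*T_m(Pi); the per-sample Fisher information is
    #   I(Pi) = f^2 T_m'(Pi)^2 / (1 - f^2 T_m(Pi)^2).
    T = np.cos(m * np.arccos(Pi))
    dT = m * np.sin(m * np.arccos(Pi)) / np.sqrt(1 - Pi ** 2)
    return f ** 2 * dT ** 2 / (1 - f ** 2 * T ** 2)

print(noisy_fisher(0.0))   # near the slope maximum: large information gain
print(noisy_fisher(0.5))   # T_3'(1/2) = 0: a dead spot, information -> 0
```

Without noise (f = 1) the same expression stays at or above 9 for all Π, because the vanishing slope is exactly compensated by the bias approaching ±1.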

III. ENGINEERED LIKELIHOOD FUNCTIONS
In this section, we propose the methodology of engineering likelihood functions for amplitude estimation. We first introduce the quantum circuits for drawing samples that correspond to engineered likelihood functions, and then describe how to tune the circuit parameters and carry out Bayesian inference with the resultant likelihood functions.

A. Quantum circuits for engineered likelihood functions
Our objective is to design a procedure for estimating the expectation value

Π = ⟨A|P|A⟩ = cos(θ),

where |A⟩ = A|0ⁿ⟩, in which A is an n-qubit unitary operator, P is an n-qubit Hermitian operator with eigenvalues ±1, and θ = arccos(Π) is introduced to facilitate Bayesian inference later on. In constructing our estimation algorithms, we assume that we are able to perform the following primitive operations. First, we can prepare the computational basis state |0ⁿ⟩ and apply an ansatz circuit A to it, obtaining |A⟩ = A|0ⁿ⟩. Second, we can implement the unitary operator U(x) = exp(−ixP) for any angle x ∈ ℝ. Finally, we can perform the measurement of P that is modeled as a projection-valued measure {(I + P)/2, (I − P)/2} with respective outcome labels {0, 1}. We also make use of the unitary operator V(y) = A R₀(y) A†, where R₀(y) = exp[−iy(2|0ⁿ⟩⟨0ⁿ| − I)] and y ∈ ℝ. Following convention (see, e.g., Ref. [30]), we call U(x) and V(y) the generalized reflections about the +1 eigenspace of P and about the state |A⟩, respectively, where x and y are the angles of these generalized reflections. We use the ancilla-free [47] quantum circuit in Fig. 3 to generate the ELF, that is, the probability distribution of the outcome d ∈ {0, 1} given the unknown quantity θ to be estimated. The circuit consists of a sequence of generalized reflections. Specifically, after preparing the ansatz state |A⟩ = A|0ⁿ⟩, we apply 2L generalized reflections V(x_{2L})U(x_{2L−1}) ··· V(x₂)U(x₁) to it, varying the rotation angle x_j in each operation. For convenience, we call V(x_{2j})U(x_{2j−1}) the jth layer of the circuit for j = 1, 2, …, L. The output state of this circuit is Q(x)|A⟩, where Q(x) = V(x_{2L})U(x_{2L−1}) ··· V(x₂)U(x₁) and x = (x₁, x₂, …, x_{2L}) is the vector of tunable parameters. Finally, we perform the projective measurement {(I + P)/2, (I − P)/2} on this state, receiving an outcome d ∈ {0, 1}. As in Grover's search algorithm, the generalized reflections U(x_{2j−1}) and V(x_{2j}) ensure that the quantum state remains in the two-dimensional subspace S := span{|A⟩, P|A⟩} [48] for any j.
Let |A⊥⟩ be the state (unique, up to a phase) in S that is orthogonal to |A⟩, i.e., P|A⟩ = cos(θ)|A⟩ + sin(θ)|A⊥⟩. To help the analysis, we view this two-dimensional subspace as a qubit, writing |A⟩ and |A⊥⟩ as |0̄⟩ and |1̄⟩, respectively.

FIG. 3. An illustration of the operations used for generating samples that correspond to an engineered likelihood function. Here A is the state preparation circuit, P is the observable of interest, and R₀(x_{i+1}) is a generalized reflection about the state |0ⁿ⟩. The blocks represent unitary transformations, while the caps at the left and right indicate state preparation and measurement, respectively. The outcomes of the measurement of P yield information about the expectation value Π = ⟨A|P|A⟩. The case of L = 0 simply prepares |A⟩ and measures P. This corresponds to the standard sampling method used in the VQE. Even with an error-prone implementation, we can enhance the information gain rate by applying a sequence of generalized reflections before the measurement. In such enhanced sampling, the likelihood of outcomes depends more sensitively on Π. These circuit elements are color coded to highlight the commonalities in the way the features P (blue), A (red), and |0ⁿ⟩ (green) enter in the likelihood function.

Let X̄, Ȳ, Z̄, and Ī be the Pauli operators and identity operator on this virtual qubit, respectively. Then, focusing on the subspace S = span{|0̄⟩, |1̄⟩}, we can rewrite P as

P(θ) = cos(θ)Z̄ + sin(θ)X̄,

and rewrite the generalized reflections U(x_{2j−1}) and V(x_{2j}) as

U(x_{2j−1}) = cos(x_{2j−1})Ī − i sin(x_{2j−1})[cos(θ)Z̄ + sin(θ)X̄]    (18)

and

V(x_{2j}) = cos(x_{2j})Ī − i sin(x_{2j})Z̄,    (19)

where x_{2j−1}, x_{2j} ∈ ℝ are tunable parameters. Then the unitary operator Q(x) implemented by the L-layer circuit becomes

Q(θ; x) = V(x_{2L})U(x_{2L−1}) ··· V(x₂)U(x₁),    (20)

where the factors U(x_{2j−1}) of Q(θ; x) depend on the unknown quantity θ. It turns out to be more convenient to design and analyze the estimation algorithms in this "logical" picture than in the original "physical" picture. Therefore, we stick to this picture for the remainder of this paper. The engineered likelihood function (i.e., the probability distribution of the measurement outcome d ∈ {0, 1}) depends on the output state ρ(θ; x) of the circuit and the observable P(θ). Precisely, it is

P(d|θ; x) = [1 + (−1)^d Δ(θ; x)]/2,    (21)

where

Δ(θ; x) = ⟨A|Q†(θ; x) P(θ) Q(θ; x)|A⟩    (22)

is the bias of the likelihood function [from now on, we use P′(d|θ; x) and Δ′(θ; x) to denote the partial derivatives of P(d|θ; x) and Δ(θ; x) with respect to θ, respectively]. In particular, if x = (π/2, π/2, …, π/2, π/2), then we have

Δ(θ; x) = cos[(2L + 1)θ] = T_{2L+1}(Π).

Namely, the bias of the likelihood function for this x is the Chebyshev polynomial of degree 2L + 1 (of the first kind) in Π. For this reason, we call the likelihood function for this x the Chebyshev likelihood function. In Sec. V we explore the performance gap between CLFs and general ELFs.
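The logical-qubit picture lends itself to direct numerical verification. The sketch below (helper names are ours) builds Q(θ; x) from 2×2 matrices and confirms that the all-π/2 choice of angles reproduces the degree-(2L + 1) Chebyshev bias.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def bias(theta, xs):
    # Logical-qubit picture: |A> = |0>, P(theta) = cos(theta) Z + sin(theta) X.
    # xs = (x1, ..., x_2L); odd positions rotate about P, even ones about Z.
    P = np.cos(theta) * Z + np.sin(theta) * X
    Q = I2
    for j, x in enumerate(xs):
        G = P if j % 2 == 0 else Z
        Q = (np.cos(x) * I2 - 1j * np.sin(x) * G) @ Q   # later gates on the left
    v = Q @ np.array([1.0, 0.0])                        # Q(theta; x)|A>
    return np.real(np.conj(v) @ (P @ v))                # Delta = <A|Q' P Q|A>

theta, L = 0.7, 2
assert np.isclose(bias(theta, []), np.cos(theta))       # L = 0: standard sampling
assert np.isclose(bias(theta, [np.pi / 2] * (2 * L)),
                  np.cos((2 * L + 1) * theta))          # CLF bias T_{2L+1}
```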
In reality, quantum devices are subject to noise. To make the estimation process robust against errors, we incorporate the following exponential decay noise model into the likelihood function [25,49]. Recently, this exponential decay noise model of the likelihood function has been used in several related works [50,51] and validated in small-scale experiments [52]. It has also been used in an 18-qubit experiment on spectrum estimation [53]. Furthermore, it is closely related to the noise analysis carried out in Ref. [54]. Validation at large scales remains an important line of research, which we emphasize in Sec. VII and leave to future work. Letting p be the exponential decay factor, we have

P(d|θ; p̄, p, x) = [1 + (−1)^d p̄ p^L Δ(θ; x)]/2,

where p̄ accounts for state preparation and measurement error (cf. Appendix A) and Δ(θ; x) is the bias of the ideal likelihood function as defined in Eq. (22). From now on, we use f = p̄ p^L as the fidelity of the whole process for generating the ELF, and call p the layer fidelity. Moreover, for convenience, we write P(d|θ; p̄, p, x) simply as

P(d|θ; x) = [1 + (−1)^d f Δ(θ; x)]/2.

Note that the effect of noise on the ELF is that it rescales the bias by a factor of f. This implies that the less errored the generation process, the steeper the resultant ELF, as one would expect. Before moving on to the discussion of Bayesian inference with ELFs, it is worth mentioning the following property of engineered likelihood functions, as it will play a pivotal role in Sec. IV. In Ref. [55], we introduced the concepts of trigono-multilinear and trigono-multiquadratic functions. Basically, a multivariable function f: ℝᵏ → ℂ is trigono-multilinear if, for any j ∈ {1, 2, …, k}, we can write it as

f(x₁, …, x_k) = C_j(x_{¬j}) cos(x_j) + S_j(x_{¬j}) sin(x_j)

for some (complex-valued) functions C_j and S_j of x_{¬j} := (x₁, …, x_{j−1}, x_{j+1}, …, x_k), and we call C_j and S_j the cosine-sine-decomposition (CSD) coefficient functions of f with respect to x_j. Similarly, a multivariable function f: ℝᵏ → ℂ is trigono-multiquadratic if, for any j ∈ {1, 2, …, k}, we can write it as

f(x₁, …, x_k) = C_j(x_{¬j}) cos(2x_j) + S_j(x_{¬j}) sin(2x_j) + B_j(x_{¬j})

for some (complex-valued) functions C_j, S_j, and B_j of x_{¬j}, and we call C_j, S_j, and B_j the cosine-sine-bias-decomposition (CSBD) coefficient functions of f with respect to x_j.
The concepts of trigono-multilinearity and trigono-multiquadraticity can be naturally generalized to linear operators. Namely, a linear operator is trigono-multilinear (or trigono-multiquadratic) in a set of variables if each entry of this operator (written in an arbitrary basis) is trigono-multilinear (or trigono-multiquadratic) in the same variables. Now Eqs. (18)–(20) imply that Q(θ; x) is a trigono-multilinear operator of x. Then it follows from Eq. (22) that Δ(θ; x) is a trigono-multiquadratic function of x. Furthermore, we show in Sec. IV A that the CSBD coefficient functions of Δ(θ; x) with respect to any x_j can be evaluated in O(L) time, and this greatly facilitates the construction of the algorithms in Sec. IV A for tuning the circuit parameters.
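Because Δ(θ; x) is trigono-multiquadratic, its CSBD coefficients with respect to any x_j can be recovered from just three evaluations of the bias, at x_j ∈ {0, π/4, π/2}. The sketch below is a numerical check of the decomposition, not the O(L) procedure of Appendix B, and it reuses the reconstructed logical-picture bias introduced earlier (an assumption of ours):

```python
import numpy as np

I2, X = np.eye(2), np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

def bias(theta, x):
    # Reconstructed logical-picture bias Delta(theta; x) (see earlier caveats)
    P = np.cos(theta) * Z + np.sin(theta) * X
    Q = I2.astype(complex)
    for j, xj in enumerate(x):
        G = P if j % 2 == 0 else Z
        Q = (np.cos(xj) * I2 - 1j * np.sin(xj) * G) @ Q
    a = Q[:, 0]
    return (a.conj() @ P @ a).real

def csbd(theta, x, j):
    # Recover C_j, S_j, B_j in Delta = C_j cos(2 x_j) + S_j sin(2 x_j) + B_j
    # from evaluations at x_j = 0, pi/4, pi/2:
    #   f(0) = C + B,  f(pi/4) = S + B,  f(pi/2) = -C + B.
    vals = []
    for xj in (0.0, np.pi / 4, np.pi / 2):
        y = x.copy()
        y[j] = xj
        vals.append(bias(theta, y))
    f0, fq, fh = vals
    B = (f0 + fh) / 2
    return (f0 - fh) / 2, fq - B, B

rng = np.random.default_rng(0)
theta, x = 0.9, rng.uniform(0.0, np.pi, 4)   # L = 2 layers, random parameters
C, S, B = csbd(theta, x, 1)
lhs = bias(theta, x)
rhs = C * np.cos(2 * x[1]) + S * np.sin(2 * x[1]) + B
print(lhs, rhs)   # the two values agree
```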

B. Bayesian inference with engineered likelihood functions
With the model of (noisy) engineered likelihood functions in place, we are now ready to describe our methodology for tuning the circuit parameters x and performing Bayesian inference with the resultant likelihood functions for amplitude estimation.
Let us begin with a high-level overview of our algorithm for estimating Π = cos(θ) = ⟨A|P|A⟩. For convenience, our algorithm mainly works with θ = arccos(Π) instead of Π. We use a Gaussian distribution to represent our knowledge of θ and make this distribution gradually concentrate around the true value of θ as the inference process proceeds. We start with an initial distribution of Π (which can be generated by standard sampling or domain knowledge) and convert it to the initial distribution of θ. Then we repeat the following procedure until a convergence criterion is satisfied. At each round, we first find the circuit parameters x that maximize the information gain from the measurement outcome d in a certain sense (based on our current knowledge of θ). Then we run the quantum circuit in Fig. 3 with the optimized parameters x and receive a measurement outcome d ∈ {0, 1}. Finally, we update the distribution of θ by using Bayes' rule, conditioned on the received outcome d. Once this loop is finished, we convert the final distribution of θ to the final distribution of Π, and set the mean of this distribution as the final estimate of Π. See Fig. 4 for a conceptual diagram of this algorithm.
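The loop described above can be sketched as follows. For simplicity, this toy version uses a fixed Chebyshev likelihood rather than re-optimized circuit parameters, a grid-based Bayes update with a Gaussian refit, and an initial estimate assumed to come from a round of standard sampling; the measurement outcome is synthesized classically from the likelihood, as in the experiments of Sec. V:

```python
import numpy as np

rng = np.random.default_rng(1)
L, f = 3, 0.9                      # circuit layers, process fidelity
theta_true = np.arccos(0.4)        # true theta, so Pi* = 0.4

def bias(theta):
    # Chebyshev likelihood bias (fixed x = (pi/2, ..., pi/2)) for illustration
    return np.cos((2 * L + 1) * theta)

def p_outcome(d, theta):
    # P(d | theta; f, x) = [1 + (-1)^d f Delta(theta; x)] / 2
    return 0.5 * (1.0 + (-1) ** d * f * bias(theta))

# Initial Gaussian belief on theta (assumed to come from standard sampling)
mu, sigma = theta_true + 0.02, 0.05
grid = np.linspace(0.0, np.pi, 4001)
for _ in range(200):                                   # inference rounds
    d = int(rng.random() > p_outcome(0, theta_true))   # simulated outcome
    posterior = np.exp(-(grid - mu) ** 2 / (2 * sigma ** 2)) * p_outcome(d, grid)
    posterior /= posterior.sum()
    mu = (grid * posterior).sum()                          # refit a Gaussian
    sigma = np.sqrt(((grid - mu) ** 2 * posterior).sum())  # with same moments
print(np.cos(mu))   # final estimate of Pi
```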
Next, we describe each component of the above algorithm in more detail. Throughout the inference process, we use a Gaussian distribution to keep track of our belief of the value of θ. Namely, at each round, θ has prior distribution

p(θ) = N(θ; μ, σ²)

for some prior mean μ ∈ ℝ and prior variance σ² ∈ ℝ₊. After receiving the measurement outcome d, we compute the posterior distribution of θ by using Bayes' rule,

p(θ|d; x) = P(d|θ; f, x) p(θ) / P(d; x),

where the normalization factor, or model evidence, is defined as P(d; x) = ∫ P(d|θ; f, x) p(θ) dθ (here f is the fidelity of the process for generating the ELF).
[FIG. 4. High-level flowchart of the algorithm for estimating Π = cos(θ) = ⟨A|P|A⟩ (nodes: "Choose circuit parameters," "Run circuit and measure," "Converged?"). Here f is the fidelity of the process for generating the ELF. This algorithm mainly works with θ instead of Π, and there are conversions between the distributions of θ and Π at the beginning and end of the algorithm. The final estimate of Π is μ̂_K. Note that only the "Run circuit and measure" step involves a quantum device.]
Although the true posterior distribution will not be a Gaussian, we approximate it as such. Following the methodology in Ref. [56], we replace the true posterior with a Gaussian distribution of the same mean and variance [57], and set it as the prior of θ for the next round. We repeat this measurement-and-Bayesian-update procedure until the distribution of θ is sufficiently concentrated around a single value.
Since the algorithm mainly works with θ but we are eventually interested in Π, we need to make conversions between the estimators of θ and Π. This is done as follows. Suppose that at round k the prior distribution of θ is N(μ_k, σ²_k) and that the prior distribution of Π is N(μ̂_k, σ̂²_k) (note that μ_k, σ_k, μ̂_k, and σ̂_k are random variables, as they depend on the history of random measurement outcomes up to round k). The estimators of θ and Π at this round are μ_k and μ̂_k, respectively. Given the distribution N(μ_k, σ²_k) of θ, we compute the mean μ̂_k and variance σ̂²_k of cos(θ), and set N(μ̂_k, σ̂²_k) as the distribution of Π.
This step can be done analytically: if X ∼ N(μ, σ²), then

E[cos(X)] = e^{−σ²/2} cos(μ),
Var[cos(X)] = (1/2)(1 − e^{−σ²})[1 − e^{−σ²} cos(2μ)].

Conversely, given the distribution N(μ̂_k, σ̂²_k) of Π, we compute the mean μ_k and variance σ²_k of arccos(Π) (clipping Π to [−1, 1]), and set N(μ_k, σ²_k) as the distribution of θ. This step is done numerically. Even though the cos or arccos function of a Gaussian variable is not truly Gaussian, we approximate it as such and find that this has negligible impact on the performance of the algorithm.
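These two moment formulas are easy to sanity check by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.1, 0.2
xs = rng.normal(mu, sigma, 2_000_000)

# Analytical moments of cos(X) for X ~ N(mu, sigma^2)
mean_an = np.exp(-sigma**2 / 2) * np.cos(mu)
var_an = 0.5 * (1 - np.exp(-sigma**2)) * (1 - np.exp(-sigma**2) * np.cos(2 * mu))

print(np.cos(xs).mean(), mean_an)   # agree to Monte Carlo accuracy
print(np.cos(xs).var(), var_an)
```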
Our method for tuning the circuit parameters x is as follows. Ideally, we want to choose them carefully so that the MSE of the estimator μ_k of θ decreases as fast as possible as k grows. In practice, however, it is hard to compute this quantity directly, and we must resort to a proxy of its value. The MSE of an estimator is the sum of the variance of the estimator and the squared bias of the estimator. We find that, for large enough k, the squared bias of μ_k is small compared with its variance, i.e., Bias(μ_k)² = [E(μ_k) − θ*]² ≪ Var(μ_k), where θ* is the true value of θ. We also find that, for large enough k, the variance σ²_k of θ is often close to the variance of μ_k, i.e., σ²_k ≈ Var(μ_k) with high probability (see Appendix E for evidence of these claims). Combining these facts, we know that, for large enough k,

MSE(μ_k) ≈ Var(μ_k) ≈ σ²_k

with high probability. So we find the parameters x that minimize the variance σ²_k of θ instead. Specifically, suppose that θ has prior distribution N(μ, σ²). Upon receiving the measurement outcome d ∈ {0, 1}, the expected posterior variance [55] of θ is

E_d[Var(θ|d; x)] = Σ_{d∈{0,1}} P(d; x) Var(θ|d; x),

where the likelihood P(d|θ; f, x) = [1 + (−1)^d f Δ(θ; x)]/2 enters through Bayes' rule, in which Δ(θ; x) is the bias of the ideal likelihood function as defined in Eq. (22), and f is the fidelity of the process for generating the likelihood function. We introduce an important quantity for engineering likelihood functions that we refer to as the variance reduction factor,

V(μ, σ; f, x) := [σ² − E_d[Var(θ|d; x)]]/σ⁴.

Then we have

E_d[Var(θ|d; x)] = σ²[1 − σ²V(μ, σ; f, x)].

The larger V is, the faster the variance of θ decreases on average. Furthermore, to quantify the growth rate (per time step) of the inverse variance of θ, we introduce the quantity

R(μ, σ; f, x) := (1/T(L))[1/E_d[Var(θ|d; x)] − 1/σ²] = V/[T(L)(1 − σ²V)],   (34)

where T(L) is the duration of the L-layer circuit in Fig. 3 (recall that x ∈ ℝ^{2L}). Note that R is a monotonic function of V for σ²V ∈ (0, 1). Therefore, when L is fixed, we can maximize R (with respect to x) by maximizing V. In addition, when σ is small, R is approximately proportional to V, i.e., R ≈ V/T(L).
For the remainder of this work, we assume that the ansatz circuit A and its inverse A† contribute most significantly to the duration of the overall circuit. So we take T(L) to be proportional to the number of times A or A† appears in the circuit, i.e., T(L) = 2L + 1, where time is measured in units of the duration of A. Ideally, we would like to find the parameters x that maximize the variance reduction factor V(μ, σ; f, x). This optimization problem turns out to be difficult to solve in general. Fortunately, in practice, we may assume that the prior variance σ² of θ is small (e.g., at most 0.01), and in this case, V(μ, σ; f, x) can be approximated by the Fisher information of the likelihood function P(d|θ; f, x) at θ = μ, as shown in Appendix F, i.e.,

V(μ, σ; f, x) ≈ I(μ; f, x),

where

I(θ; f, x) = f²Δ′(θ; x)²/[1 − f²Δ(θ; x)²]

is the Fisher information of the two-outcome likelihood function P(d|θ; f, x) as defined in Eq. (23). Therefore, rather than directly optimizing the variance reduction factor V(μ, σ; f, x), we optimize the Fisher information I(μ; f, x), which can be done efficiently by the algorithms in Sec. IV A 1. Furthermore, when the fidelity f of the process for generating the ELF is low, we have

V(μ, σ; f, x) ≈ f²Δ′(μ; x)²   when σ and f are small.   (37)

So in this case, we can simply optimize |Δ′(μ; x)|, which is proportional to the slope of the likelihood function P(d|θ; f, x) at θ = μ, and this task can be accomplished efficiently by the algorithms in Sec. IV A 2. Finally, we make a prediction on how fast the MSE of the estimator μ̂_k of Π decreases as k grows, under the assumption that the number L of circuit layers is fixed during the inference process. Note that MSE(μ̂_k) = Θ(1/k) as k → ∞ in this case. Let θ* and Π* be the true values of θ and Π, respectively. Then, as k → ∞, we have μ_k → θ*, σ_k → 0, μ̂_k → Π*, and σ̂_k → 0 with high probability. When this event happens, we obtain, for large k,

σ̂²_k ≈ (1 − Π*²)σ²_k,

since μ̂_k ≈ cos(μ_k) and |d cos(θ)/dθ| = |sin(θ*)| = √(1 − Π*²) near θ = θ*. Consequently, by Eq. (29), we know that, for large k,

MSE(μ̂_k) ≈ (1 − Π*²) MSE(μ_k).

This means that the asymptotic growth rate (per time step) of the inverse MSE of μ̂_k should be roughly

R̂_0 = max_x I[arccos(Π*); f, x]/[(1 − Π*²)T(L)],

where x ∈ ℝ^{2L} is chosen such that I[arccos(Π*); f, x] is maximized. We compare this rate with the empirical growth rate (per time step) of the inverse MSE of μ̂_k in Sec. V.
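The approximation V ≈ I for small σ can be checked numerically. The sketch below uses the Chebyshev bias cos[(2L+1)θ] as the likelihood's bias, computes the expected posterior variance by numerical integration over a Gaussian prior, and compares the resulting variance reduction factor with the Fisher information formula above:

```python
import numpy as np

L, f = 3, 0.8
m = 2 * L + 1

def bias(t):
    return np.cos(m * t)           # Chebyshev bias as the example

def dbias(t):
    return -m * np.sin(m * t)

def fisher(t):
    # I(theta; f, x) = f^2 Delta'^2 / (1 - f^2 Delta^2)
    return f**2 * dbias(t)**2 / (1.0 - f**2 * bias(t)**2)

def var_reduction(mu, sigma):
    # V = (sigma^2 - E_d[posterior variance]) / sigma^4, by numerical integration
    t = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
    prior = np.exp(-(t - mu) ** 2 / (2 * sigma ** 2))
    prior /= prior.sum()
    expected_var = 0.0
    for d in (0, 1):
        lik = 0.5 * (1.0 + (-1) ** d * f * bias(t))
        w = prior * lik
        pd = w.sum()               # model evidence P(d)
        post = w / pd
        pm = (t * post).sum()
        expected_var += pd * ((t - pm) ** 2 * post).sum()
    return (sigma**2 - expected_var) / sigma**4

mu, sigma = 1.0, 0.01
print(var_reduction(mu, sigma), fisher(mu))   # close for small sigma
```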

IV. EFFICIENT HEURISTIC ALGORITHMS FOR CIRCUIT PARAMETER TUNING AND BAYESIAN INFERENCE
In this section, we present heuristic algorithms for tuning the parameters x of the circuit in Fig. 3 and describe how to efficiently carry out Bayesian inference with the resultant likelihood functions.

A. Efficient maximization of proxies of the variance reduction factor
Our algorithms for tuning the circuit parameters x are based on maximizing two proxies of the variance reduction factor V: the Fisher information and the slope of the likelihood function P(d|θ; f, x). All of these algorithms require efficient procedures for evaluating the CSBD coefficient functions of Δ(θ; x) and Δ′(θ; x) with respect to x_j for j = 1, 2, . . . , 2L. Recall that we have shown in Sec. III A that Δ(θ; x) is trigono-multiquadratic in x. Namely, for any j ∈ {1, 2, . . . , 2L}, there exist functions C_j, S_j, and B_j of x_¬j such that

Δ(θ; x) = C_j(θ; x_¬j) cos(2x_j) + S_j(θ; x_¬j) sin(2x_j) + B_j(θ; x_¬j).

It follows that Δ′(θ; x) is also trigono-multiquadratic in x, with CSBD coefficient functions given by the partial derivatives of C_j(θ; x_¬j), S_j(θ; x_¬j), and B_j(θ; x_¬j) with respect to θ, respectively. It turns out that, given θ and x_¬j, all of these coefficient functions can be evaluated efficiently.

Lemma 1. Given θ and x_¬j, one can evaluate C_j(θ; x_¬j), S_j(θ; x_¬j), B_j(θ; x_¬j), and their partial derivatives with respect to θ in O(L) time.

Proof. See Appendix B.

1. Maximizing the Fisher information of the likelihood function
We propose two algorithms for maximizing the Fisher information of the likelihood function P(d|θ; f, x) at a given point θ = μ (i.e., the prior mean of θ). Namely, our goal is to find x ∈ ℝ^{2L} that maximizes I(μ; f, x). The first algorithm is based on gradient ascent. Namely, it starts with a random initial point, and keeps taking steps proportional to the gradient of I at the current point, until a convergence criterion is satisfied. Specifically, let x^(t) be the parameter vector at iteration t. We update it as

x^(t+1) = x^(t) + δ(t) ∇I(μ; f, x^(t)),

where δ: ℤ≥0 → ℝ₊ is the step size schedule [58]. This requires the calculation of the partial derivative of I(μ; f, x) with respect to each x_j, which can be done as follows. We first use the procedures in Lemma 1 to compute C_j := C_j(μ; x_¬j), S_j := S_j(μ; x_¬j), B_j := B_j(μ; x_¬j), and their partial derivatives with respect to θ, for each j. From these quantities we obtain Δ(μ; x), Δ′(μ; x), and their partial derivatives with respect to x_j [e.g., ∂Δ(μ; x)/∂x_j = −2C_j sin(2x_j) + 2S_j cos(2x_j)]. Knowing these quantities, we can compute the partial derivative γ_j := ∂I(μ; f, x)/∂x_j by the chain rule. Repeating this procedure for j = 1, 2, . . . , 2L, we obtain ∇I(μ; f, x) = (γ_1, γ_2, . . . , γ_{2L}). Each iteration of the algorithm takes O(L²) time. The number of iterations in the algorithm depends on the initial point, the termination criterion, and the step size schedule δ. See Algorithm 1 for more details.
The second algorithm is based on coordinate ascent. Unlike gradient ascent, this algorithm does not require step sizes, and allows each variable to change dramatically in a single step. As a consequence, it may converge faster than the previous algorithm. Specifically, this algorithm starts with a random initial point, and successively maximizes the objective function I(μ; f, x) along coordinate directions, until a convergence criterion is satisfied. At the jth step of each round, it solves the following single-variable optimization problem for the coordinate x_j:

x_j ← argmax_{z∈ℝ} I(μ; f, (x_1, . . . , x_{j−1}, z, x_{j+1}, . . . , x_{2L})).

This single-variable optimization problem can be tackled by standard gradient-based methods, and we set x_j to be its solution. Repeating this procedure for j = 1, 2, . . . , 2L completes one round. The algorithm produces a sequence x^(0), x^(1), x^(2), . . . such that I(μ; f, x^(0)) ≤ I(μ; f, x^(1)) ≤ I(μ; f, x^(2)) ≤ · · ·. Namely, the value of I(μ; f, x^(t)) increases monotonically as t grows. Each round of the algorithm takes O(L²) time. The number of rounds in the algorithm depends on the initial point and the termination criterion. See Algorithm 2 for more details.
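The following sketch illustrates coordinate ascent on I(μ; f, x). It reuses the reconstructed logical-picture bias (an assumption, as noted earlier), evaluates the Fisher information with a finite-difference derivative rather than the O(L) CSBD procedure, and solves each single-variable subproblem by a grid scan over [0, π), which suffices because the objective depends on x_j only through cos(2x_j) and sin(2x_j). Starting from the Chebyshev point guarantees that the returned Fisher information is at least that of the CLF:

```python
import numpy as np

I2, X = np.eye(2), np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

def bias(theta, x):
    # Reconstructed logical-picture bias (same assumptions as before)
    P = np.cos(theta) * Z + np.sin(theta) * X
    Q = I2.astype(complex)
    for j, xj in enumerate(x):
        G = P if j % 2 == 0 else Z
        Q = (np.cos(xj) * I2 - 1j * np.sin(xj) * G) @ Q
    a = Q[:, 0]
    return (a.conj() @ P @ a).real

def fisher(theta, x, f, eps=1e-5):
    # I = f^2 Delta'^2 / (1 - f^2 Delta^2), derivative by finite differences
    D = bias(theta, x)
    dD = (bias(theta + eps, x) - bias(theta - eps, x)) / (2 * eps)
    return f**2 * dD**2 / (1.0 - f**2 * D**2)

def coordinate_ascent(theta, x0, f, sweeps=3, grid=200):
    # Scan each coordinate over [0, pi); Delta has period pi in every x_j
    x = x0.copy()
    cand = np.linspace(0.0, np.pi, grid, endpoint=False)
    for _ in range(sweeps):
        for j in range(len(x)):
            vals = []
            for c in cand:
                x[j] = c
                vals.append(fisher(theta, x, f))
            x[j] = cand[int(np.argmax(vals))]
    return x

theta, f, L = 1.0, 0.8, 3
x_clf = np.full(2 * L, np.pi / 2)        # start from the Chebyshev point
x_elf = coordinate_ascent(theta, x_clf, f)
print(fisher(theta, x_clf, f), fisher(theta, x_elf, f))   # ELF >= CLF
```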
We have used Algorithms 1 and 2 to find the parameters x_θ ∈ ℝ^{2L} that maximize I(θ; f, x) [59] for various θ ∈ (0, π) (fixing f) and obtained Fig. 5. This figure indicates that the Fisher information of the ELF is larger than that of the CLF for the majority of θ ∈ (0, π). Consequently, the estimation algorithm based on the ELF is more efficient than that based on the CLF, as will be demonstrated in Sec. V.

2. Maximizing the slope of the likelihood function
We also propose two algorithms for maximizing the slope of the likelihood function P(d|θ; f, x) at a given point θ = μ (i.e., the prior mean of θ). Namely, our goal is to find x ∈ ℝ^{2L} that maximizes |Δ′(μ; x)|. These algorithms are similar to Algorithms 1 and 2 for Fisher information maximization, in the sense that they are also based on gradient ascent and coordinate ascent, respectively. They are formally described in Algorithms 3 and 4 in Appendix C, respectively. We have used them to find the parameters x_θ ∈ ℝ^{2L} that maximize |Δ′(θ; x)| for various θ ∈ (0, π) and obtained Fig. 6. This figure implies that the slope-based ELF is steeper than the CLF for the majority of θ ∈ (0, π) and hence has more statistical power than the CLF (at least) in the low-fidelity setting by Eq. (37).

B. Approximate Bayesian inference with engineered likelihood functions
With the algorithms for tuning the circuit parameters x in place, we now describe how to efficiently carry out Bayesian inference with the resultant likelihood functions. In principle, we can compute the posterior mean and variance of θ directly after receiving a measurement outcome d. But this approach is time consuming, as it involves numerical integration. By taking advantage of certain properties of the engineered likelihood functions, we can greatly accelerate this process.
Suppose that θ has prior distribution N(μ, σ²), where σ ≪ 1/L, and that the fidelity of the process for generating the ELF is f. We find that the parameters x = (x_1, x_2, . . . , x_{2L}) that maximize I(μ; f, x) [or |Δ′(μ; x)|] satisfy the following property: when θ is close to μ,

Δ(θ; x) ≈ sin[r(θ − μ) + b]

for some r, b ∈ ℝ. Namely, Δ(θ; x) can be approximated by a sinusoidal function in this region of θ. Figure 7 illustrates one such example.
We can find the best-fitting r and b by solving a least-squares problem over a set Θ of points near μ; since arcsin[Δ(θ; x)] is approximately linear in θ in this region, the problem

min_{r,b} Σ_{θ∈Θ} {r(θ − μ) + b − arcsin[Δ(θ; x)]}²

has the standard analytical (linear-regression) solution for r and b. In Fig. 7 we present an example of the true and fitted likelihood functions.
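One simple way to obtain such a fit in closed form is sketched below, assuming the fit window stays on a single branch of arcsin (this is our reconstruction of the least-squares step, illustrated on a Chebyshev bias, which is exactly sinusoidal, so the fit is exact here):

```python
import numpy as np

m, mu, sigma = 7, 0.2, 0.01    # bias cos(m*theta) stands in for the ELF

def bias(t):
    return np.cos(m * t)

# Fit Delta(theta) ~ sin(r (theta - mu) + b) near mu. Since
# arcsin(Delta(theta)) is linear in theta here, ordinary linear least
# squares recovers (r, b) in closed form.
thetas = mu + sigma * np.linspace(-1.0, 1.0, 11)
y = np.arcsin(bias(thetas))        # valid while m*theta stays on one branch
A = np.vstack([thetas - mu, np.ones_like(thetas)]).T
(r, b), *_ = np.linalg.lstsq(A, y, rcond=None)

fit = np.sin(r * (thetas - mu) + b)
print(r, b)                        # here r = -m and b = pi/2 - m*mu
print(np.max(np.abs(fit - bias(thetas))))
```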
Once we obtain the optimal r and b, we can approximate the posterior mean and variance of θ by those of the posterior obtained from the fitted likelihood function, for which analytical formulas exist. Specifically, suppose that θ has prior distribution N(μ_k, σ²_k) at round k. Let d_k be the measurement outcome, and let (r_k, b_k) be the best-fitting parameters at this round. Then we approximate the posterior mean μ_{k+1} and variance σ²_{k+1} of θ by the corresponding analytical expressions [Eqs. (62) and (63)].

After that, we proceed to the next round, setting N(μ_{k+1}, σ²_{k+1}) as the prior distribution of θ for that round. Note that, as Fig. 7 illustrates, the difference between the true and fitted likelihood functions can be large when θ is far from μ, i.e., |θ − μ| ≫ σ. But, since the prior distribution p(θ) = e^{−(θ−μ)²/(2σ²)}/(√(2π)σ) decays rapidly in |θ − μ|, such θ have little contribution to the computation of the posterior mean and variance of θ. So Eqs. (62) and (63) give highly accurate estimates of the posterior mean and variance of θ, and their errors have negligible impact on the performance of the whole algorithm.

V. SIMULATION RESULTS
In this section, we present the simulation results of Bayesian inference with engineered likelihood functions for amplitude estimation. These results demonstrate the advantages of engineered likelihood functions over unengineered ones, as well as the impacts of circuit depth and fidelity on their performance.

A. Experimental details
In our experiments, we assume that the ansatz circuit A and its inverse A† contribute most significantly to the duration of the circuit (in Fig. 3 or 21) for generating the likelihood function. So, when the number of circuit layers is L, the time cost of an inference round is roughly 2L + 1, where time is in units of A's duration. Moreover, we assume that there is no state preparation and measurement error, i.e., p̄ = 1, in the experiments.
Suppose that we aim to estimate the expectation value Π = cos(θ) = ⟨A|P|A⟩. Let μ̂_t be the estimator of Π at time t. Note that μ̂_t is a random variable, since it depends on the history of random measurement outcomes up to time t. We measure the performance of a scheme by the root-mean-squared error (RMSE) of μ̂_t, which is given by

RMSE_t = √(E[(μ̂_t − Π*)²]),

where Π* is the true value of Π. We investigate how fast RMSE_t decreases as t grows for various schemes, including the ancilla-based Chebyshev likelihood function (AB CLF), ancilla-based engineered likelihood function (AB ELF), ancilla-free Chebyshev likelihood function (AF CLF), and ancilla-free engineered likelihood function (AF ELF). In general, the distribution of μ̂_t is difficult to characterize, and there is no analytical formula for RMSE_t. To estimate this quantity, we simulate the inference process M times, and collect M samples μ̂_t^(1), μ̂_t^(2), . . . , μ̂_t^(M), where μ̂_t^(i) is the estimate of Π at time t in the ith run for i = 1, 2, . . . , M. Then we use the quantity

√((1/M) Σ_{i=1}^{M} (μ̂_t^(i) − Π*)²)

to approximate the true RMSE_t. In our experiments, we set M = 300 and find that this leads to a low enough empirical variance in the estimate to yield meaningful results.
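The empirical RMSE estimator is straightforward; here it is applied to stand-in data in place of actual inference runs (the synthetic estimates and their spread are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
Pi_true, M = 0.4, 300

# Stand-in for the M final estimates mu_hat_t^(i) from independent runs
estimates = Pi_true + rng.normal(0.0, 0.01, M)

# Empirical RMSE approximating sqrt(E[(mu_hat_t - Pi*)^2])
rmse = np.sqrt(np.mean((estimates - Pi_true) ** 2))
print(rmse)
```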
In each experiment, we set the true value of Π, choose a prior distribution of Π, and run the algorithm in Fig. 4 to test its performance for estimating this quantity. The only quantum component of this algorithm, i.e., executing the circuit in Fig. 3 and measuring the outcome d ∈ {0, 1}, is simulated by an efficient classical procedure, since the probability distribution of the outcome d is characterized by Eq. (23) or (D3), depending on whether the scheme is ancilla-free or ancilla-based. Namely, we synthesize the outcome d by a classical pseudorandom number generator [based on Eq. (23) or (D3)] instead of simulating the noisy quantum circuit. This not only greatly accelerates our simulation, but also makes our results applicable to a wide range of scenarios (i.e., they are not specific to a certain ansatz circuit A or observable P).

[FIG. 6. Slopes of the CLF and ELF for various θ ∈ (0, π), when the number of circuit layers is L = 6. For the ELF, x_θ is a global maximum point of |Δ′(θ; x)| for given θ. For the CLF, x_θ = (π/2, π/2, . . . , π/2) is fixed. This figure implies that the slope-based ELF is steeper than the CLF for the majority of θ ∈ (0, π). Furthermore, the slope of the CLF changes dramatically for different θ [in fact, it is exactly 0 when θ = jπ/(2L + 1) for j = 0, 1, . . . , 2L + 1], whereas the slope of the ELF is less sensitive to the value of θ.]
For engineering the likelihood function, we use Algorithm 2 or 6 to optimize the circuit parameters x, depending on whether the scheme is ancilla-free or ancilla-based. We find that Algorithms 1 and 2 generate the same likelihood function (as they both find the optimal parameters within a reasonable number of trials), and the same holds for Algorithms 5 and 6. So the simulation results in Sec. V would not change if we had used Algorithms 1 and 5 instead, as the AF ELF and AB ELF would remain intact. Meanwhile, we do not use Algorithm 3, 4, 7, or 8 in our experiments, because these algorithms are used to maximize the slope of the AF or AB likelihood function, which is a good proxy of the variance reduction factor V only when the fidelity f is close to zero [see Eqs. (37) and (D5)]. In our experiments, f is mostly between 0.5 and 0.9, and in such a case, the slope is not a good proxy of the variance reduction factor and we need to maximize the Fisher information by Algorithm 1, 2, 5, or 6 instead. Our estimation algorithm needs to tune the circuit parameters at each round, and since it can have tens of thousands of rounds, this component can be quite time consuming. We use the following method to mitigate this issue. Given the number L of circuit layers and the fidelity f of the process for generating the ELF, we first construct a "lookup table" that consists of the optimal parameters x for all the Π in a discrete set S = {Π_1, Π_2, . . . , Π_N} ⊂ [−1, 1]. Then at any stage of the estimation algorithm, we find the Π_j ∈ S that is closest to the current estimate of Π, and use the optimal parameters for Π_j to approximate the optimal parameters for Π. In our experiments, we set S = {−1.0, −0.9995, −0.999, . . . , 0.999, 0.9995, 1.0} (i.e., S contains 4001 uniformly distributed points in [−1, 1]) and find that this greatly accelerates the estimation algorithm without deteriorating its performance much.
For Bayesian update with an ELF, we use the method in Sec. IV B or Appendix D 2 to compute the posterior mean and variance of θ, depending on whether the scheme is ancilla-free or ancilla-based. In particular, during the sinusoidal fitting of the ELF, we set Θ = {μ − σ, μ − 0.8σ, . . . , μ + 0.8σ, μ + σ} (i.e., Θ contains 11 uniformly distributed points in [μ − σ, μ + σ]) in Eq. (58) or (D41). We find that this is sufficient for obtaining a high-quality sinusoidal fit of the true likelihood function.

B. Comparing the performance of various schemes
To compare the performance of various quantum-generated likelihood functions, including the AB CLF, AB ELF, AF CLF, and AF ELF, we run Bayesian inference with each of them, fixing the number of circuit layers L = 6 and layer fidelity p = 0.9 (note that this layer fidelity corresponds to a 12-qubit experiment with a two-qubit gate depth of 12 and a two-qubit gate fidelity of 99.92%, which is almost within reach for today's quantum devices). In Figs. 8-12 we illustrate the performances of different schemes with respect to various true values of Π. These results suggest the following. (a) In both the ancilla-based and ancilla-free cases, the ELF performs better than (or as well as) the CLF. This means that by tuning the generalized reflection angles, we do enhance the rate of information gain and make the estimation of Π more efficient. (b) The AF ELF always performs better than the AB ELF, whereas the AF CLF may perform better or worse than the AB CLF, depending on the true value of Π, but on average, the AF CLF outperforms the AB CLF. So overall the ancilla-free schemes are superior to the ancilla-based ones. (c) While RMSE_t → 0 as t → ∞ for the AB ELF and AF ELF, the same is not always true for the AB CLF and AF CLF. In fact, the performances of the AB CLF and AF CLF depend heavily on the true value of Π, while the performances of the AB ELF and AF ELF are not much affected by this value.
We may compare the above results with the R̂_0 factors of these schemes in the same setting. (a) The R̂_0 factor of the AB ELF is equal to or larger than that of the AB CLF, and the same is true for the AF ELF versus the AF CLF. This explains why the ELF outperforms the CLF in both the ancilla-based and ancilla-free cases. (b) The R̂_0 factor of the AF ELF is larger than that of the AB ELF. Meanwhile, the R̂_0 factor of the AF CLF can be larger or smaller than that of the AB CLF, depending on the value of Π, but on average the AF CLF has a larger R̂_0 factor than the AB CLF. This explains the superiority of the ancilla-free schemes over the ancilla-based ones. (c) The R̂_0 factor of the AB CLF is 0 when Π = cos(jπ/L) for j = 0, 1, . . . , L, and the R̂_0 factor of the AF CLF is 0 when Π = cos[jπ/(2L + 1)] for j = 0, 1, . . . , 2L + 1. This means that if the true value of Π is close to one of these "dead spots," then its estimator will struggle to improve and hence the performance of the AB or AF CLF will suffer (see Figs. 10-12 for examples). The AB ELF and AF ELF, on the other hand, do not have this weakness.

C. Understanding the performance of Bayesian inference with ELFs
Having established the improved performance of ELFs over CLFs, we now analyze the performance of Bayesian inference with ELFs in more detail. Note that Figs. 8-12 suggest that the inverse MSEs of both AB-ELF-based and AF-ELF-based estimators of Π grow linearly in time when the circuit depth is fixed. By fitting a linear model to the data, we obtain the empirical growth rates of these quantities, which are shown in Tables II and III, respectively. We also compare these rates with the R̂_0 factors of the AB ELF and AF ELF in the same setting. It turns out that the empirical growth rates are close to the corresponding R̂_0 factors.

1. Analyzing the impact of layer fidelity on the performance of estimation
To investigate the influence of layer fidelity on the performance of estimation, we run Bayesian inference with an AB or AF ELF for fixed circuit depth but varying layer fidelity. Specifically, we set the number L of circuit layers to be 6, and vary the layer fidelity as p = 0.75, 0.8, 0.85, 0.9, 0.95. In Figs. 14 and 15 we illustrate the simulation results in the ancilla-based and ancilla-free cases, respectively. As expected, higher layer fidelity leads to better performance of the algorithm. Namely, the less noisy the circuit, the faster the RMSE of the estimator of Π decreases. This is consistent with the fact that the R̂_0 factors of the AB ELF and AF ELF are monotonically increasing functions of f = p^L, as demonstrated in Fig. 16.

2. Analyzing the impact of circuit depth on the performance of estimation
To investigate the influence of circuit depth on the performance of estimation, we run Bayesian inference with an AB or AF ELF for fixed layer fidelity but varying circuit depth. Specifically, we set the layer fidelity p to be 0.9, and vary the number L of circuit layers from 1 to 5. In Figs. 17 and 18 we illustrate the simulation results in the ancilla-based and ancilla-free cases, respectively. These results indicate that a larger L (i.e., a deeper circuit) does not necessarily lead to better performance. The optimal choice of L is indeed a subtle issue. This can be intuitively understood as follows. As L increases, the likelihood function becomes steeper [61] and hence gains more statistical power, if the circuit for generating it is noiseless. But, on the other hand, the true fidelity of the circuit decreases exponentially in L and the implementation cost of this circuit grows linearly in L. So one must strike a balance among these factors in order to maximize the performance of the estimation algorithm.
The above results are consistent with those of Fig. 19 in which we illustrate theR 0 factors of the AB ELF and AF ELF in the same setting. Note that a larger L does not necessarily lead to a largerR 0 factor of the AB or AF ELF. One can evaluate this factor for different L and choose the one that maximizes this factor. This often enables us to find a satisfactory (albeit not necessarily optimal) L. It remains an open question to devise an efficient strategy for determining the L that optimizes the performance of estimation given the layer fidelity p and a prior distribution of .

VI. A MODEL FOR NOISY ALGORITHM PERFORMANCE
Our aim is to build a model for the runtime needed to achieve a target mean squared error in the estimate of Π as the algorithm is scaled to larger systems and run on devices with better gate fidelities. This model is built on two main assumptions. The first is that the growth rate of the inverse mean squared error is well described by half the inverse-variance rate expression [cf. Eq. (34)]. The variance contribution to the MSE is the variance of the estimator, which is not necessarily the same as the posterior variance; in Appendix E we show that the posterior variance closely tracks the variance of the estimator (cf. Fig. 26), so we use the posterior variance σ²_{i+1} in place of the estimator variance. The factor of one half comes from the conservative estimate that the variance and squared bias contribute equally to the mean squared error. In Appendix E we show evidence supporting the assumption of a small bias (cf. Fig. 25). The second assumption is an empirical lower bound on the variance reduction factor, which is motivated by numerical investigations of the Chebyshev likelihood function.
We carry out the analysis for the MSE with respect to the estimate of θ, and then convert it to the MSE with respect to Π. Our strategy is to integrate upper and lower bounds for the rate expression R(μ, σ; f, m) in Eq. (34) to arrive at bounds for the inverse MSE as a function of time. To help our analysis, we make the substitution m = T(L) = 2L + 1 and reparameterize the way noise is incorporated by introducing λ and α such that

f² = e^{−λm−α},

i.e., λ = ln(1/p) and α = 2 ln(1/p̄) − λ. The upper and lower bounds on this rate expression are based on findings for the Chebyshev likelihood functions, where x = (π/2)^{⊕2L} := (π/2, π/2, . . . , π/2) ∈ ℝ^{2L}. Since the Chebyshev likelihood functions are a subset of the engineered likelihood functions, a lower bound on the Chebyshev performance gives a lower bound on the ELF performance. We leave as a conjecture that the upper bound for this rate in the case of the ELF is a small multiple (e.g., 1.5) of the upper bound we have established for the Chebyshev rate.
The Chebyshev upper bound is established as follows. For fixed σ, λ, and m, one can show [62] that the variance reduction factor achieves a maximum value of V = m² exp(−m²σ² − λm − α), occurring at μ = π/2. This expression is less than m²e^{−m²σ²}, which achieves a maximum of (eσ²)^{−1} at m = 1/σ. Thus, the factor 1/(1 − σ²V) cannot exceed 1/(1 − e^{−1}) ≈ 1.582. Putting this all together, for fixed σ, λ, and m, the maximum rate is upper bounded as R(μ, σ; λ, α, m) ≤ em exp(−m²σ² − λm − α)/(e − 1). This follows from the fact that R is monotonic in V and that V is maximized at μ = π/2. In practice, we aim to choose a value of L that maximizes the inverse-variance rate. The rate achieved by discrete L cannot exceed the value we obtain when optimizing the above upper bound over continuous values of m. This optimal value is realized for 1/m = (1/2)(√(λ² + 8σ²) + λ). We define R̂(σ; λ, α), the upper bound on the Chebyshev rate, by evaluating R(π/2, σ; λ, α, m) at this optimal value of m. We do not have an analytic lower bound on the Chebyshev likelihood performance, but we can establish an empirical lower bound based on numerical checks. For any fixed L, the inverse-variance rate is zero at the 2L + 2 points μ ∈ {0, π/(2L + 1), 2π/(2L + 1), . . . , 2Lπ/(2L + 1), π}. Since the rate is zero at the end points μ = 0 and μ = π for all L, the global lower bound on the Chebyshev rate R*_C is zero. However, we are not concerned with the poor performance of the inverse-variance rate near these end points: when we convert the estimator from θ̂ to Π̂ = cos(θ̂), the information gain near these end points actually tends to a large value. For the purpose of establishing useful bounds, we restrict μ to the range [0.1π, 0.9π]. In numerical tests [63] we find that, for all μ ∈ [0.1π, 0.9π], there is always a choice of L for which the inverse-variance rate is above (e − 1)²/e² ≈ 0.40 times the upper bound.
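The claimed optimizer 1/m = (1/2)(√(λ² + 8σ²) + λ) can be verified numerically; the constant factor e^{−α} does not shift the argmax, so it is dropped here:

```python
import numpy as np

sigma, lam = 0.05, 0.02    # prior std sigma and noise exponent lambda

# The rate upper bound is proportional to m * exp(-m^2 sigma^2 - lam * m)
m = np.linspace(0.1, 100.0, 200001)
rate = m * np.exp(-m**2 * sigma**2 - lam * m)
m_numeric = m[np.argmax(rate)]

# Claimed optimum: 1/m* = (sqrt(lam^2 + 8 sigma^2) + lam) / 2
m_formula = 2.0 / (np.sqrt(lam**2 + 8 * sigma**2) + lam)
print(m_numeric, m_formula)   # agree to grid resolution
```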
Putting these together, we have

[(e − 1)²/e²] R̂(σ; λ, α) ≤ R*_C(μ, σ; λ, α) ≤ R̂(σ; λ, α)   for μ ∈ [0.1π, 0.9π].   (68)

It is important to note that, by letting m be continuous, certain values of σ and λ can lead to an optimal m for which L = (m − 1)/2 is negative. Therefore, these results apply only in the case that λ ≤ 1, which ensures that m ≥ 1. For now, we assume that the rate tracks the geometric mean of these two bounds, i.e., R*_C ≈ [(e − 1)/e] R̂(σ; λ, α), keeping in mind that the upper and lower bounds are small constant factors off of this. We assume that the inverse variance grows continuously in time at a rate given by the difference-quotient expression captured by the inverse-variance rate, R* = (d/dt)(1/σ²). Letting F = 1/σ² denote this inverse variance, the rate equation above can be recast as a differential equation for F,

dF/dt = R̂(F^{−1/2}; λ, α).

Through this expression, we can identify both the Heisenberg-limit behavior and the shot-noise-limit behavior. For F ≪ 1/λ², the optimal m is set by σ and the rate scales as 1/σ = √F, so the differential equation becomes dF/dt ∝ √F, which integrates to a quadratic growth of the inverse squared error, F(t) ∼ t². This is the signature of the Heisenberg-limit regime. For F ≫ 1/λ², the optimal m is set by λ and the rate approaches a constant, so dF/dt ∝ 1, yielding a linear growth of the inverse squared error, F(t) ∼ t, indicative of the shot-noise-limit regime. In order to make the integral tractable, we can replace the rate expression with integrable upper- and lower-bound expressions (to be used in tandem with our previous bounds), reexpressed in terms of x = λ²F. From the upper bound we can establish a lower bound on the runtime, by treating time as a function of x and integrating; similarly, we can use the lower bound to establish an upper bound on the runtime. Here we introduce our assumption that, in the worst case, the MSE of the phase estimate ε²_θ is twice the variance (i.e., the variance equals the squared bias), so the variance must reach half the MSE: σ² = ε²_θ/2 = λ²/x. In the best case, we assume that the bias in the estimate is zero and set ε²_θ = λ²/x.
We combine these bounds with the upper and lower bounds of Eq. (68) to arrive at bounds on the estimation runtime as a function of the target MSE. At this point, we can convert our phase estimate θ̂ back into the amplitude estimate Π̂ = cos θ̂. The MSE with respect to the amplitude, ε²_Π, can be approximated in terms of the phase MSE as ε²_Π ≈ (1 − Π²)ε²_θ, where we have assumed that the distribution of the estimator is sufficiently peaked about θ to ignore higher-order terms. This leads to ε²_θ ≈ ε²_Π/(1 − Π²), which can be substituted into the above expressions for the bounds; the results hold for Π ∈ [cos 0.9π, cos 0.1π] ≈ [−0.95, 0.95]. Dropping the estimator subscripts (as they contribute only constant factors), we can establish the runtime scaling in the low-noise and high-noise limits, recovering the Heisenberg-limit scaling t ∼ 1/ε and the shot-noise-limit scaling t ∼ λ/ε², respectively, up to constant factors. We arrived at these bounds using properties of Chebyshev likelihood functions. As we have shown in the previous section, by engineering likelihood functions, in many cases we can reduce estimation runtimes. Motivated by our numerical findings on the variance reduction factors of engineered likelihood functions (see, e.g., Fig. 13), we conjecture that engineered likelihood functions raise the worst-case inverse-variance rate to the upper bound in Eq. (68), i.e., that R̄(σ; λ, α) ≤ R*(μ, σ; λ, α) when the likelihood function is engineered. In order to give more meaning to this model, we refine it in terms of the number of qubits n and the two-qubit gate fidelity f_2Q. We consider the task of estimating the expectation value of a Pauli string P with respect to the state |A⟩. Assume that Π = ⟨A|P|A⟩ is very near zero, so that ε² = ε²_Π ≈ ε²_θ. Let the two-qubit gate depth of each of the L layers be D. We model the total layer fidelity as p = f_2Q^{nD/2}, where we have ignored errors due to single-qubit gates. From this, we have λ = ½nD ln(1/f_2Q) and α = 2 ln(1/p̄) − ½nD ln(1/f_2Q), where p̄ is the SPAM fidelity. Putting these together and using the lower-bound expression in Eq. (74), we arrive at the runtime expression as a function of n, D, f_2Q, and the target accuracy ε.

Finally, we put some meaningful numbers into this expression and estimate the required runtime in seconds as a function of the two-qubit gate fidelity. To achieve quantum advantage, we expect that the problem instance will require of the order of n = 100 logical qubits and that the two-qubit gate depth is of the order of the number of qubits, D = 200. Furthermore, we expect that target accuracies will need to be of the order of ε = 10⁻³ to 10⁻⁵. The runtime model measures time in units of the ansatz circuit duration. To convert this into seconds, we assume that each layer of two-qubit gates takes time G = 10⁻⁸ s, which is an optimistic assumption for today's superconducting qubit hardware. In Fig. 20 we show this estimated runtime as a function of the two-qubit gate fidelity.
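The hardware-dependent quantities in this model are easy to tabulate. The sketch below (Python; the function names and the "full-Heisenberg" threshold λ ≤ ε are our own framing of the model above, not expressions from the text) computes λ = ½nD ln(1/f_2Q) for the n = 100, D = 200 example, along with the gate fidelity at which the entire estimation down to target error ε stays in the Heisenberg-limit regime (F ≤ 1/λ² throughout requires λ ≤ ε).

```python
import math

def layer_lambda(n, D, f2q):
    # lambda = (1/2) n D ln(1/f_2Q), the per-layer error exponent
    return 0.5 * n * D * math.log(1.0 / f2q)

def heisenberg_saturation_fidelity(n, D, eps):
    # Smallest f_2Q with lambda <= eps, so that F = 1/sigma^2 reaches the
    # target 1/eps^2 without ever leaving the F <= 1/lambda^2 regime.
    return math.exp(-2.0 * eps / (n * D))

def nines(f):
    # Number of leading nines in a fidelity, e.g. nines(0.9999) ~ 4.
    return -math.log10(1.0 - f)

f_sat = heisenberg_saturation_fidelity(100, 200, 1e-3)
```

For ε = 10⁻³ this threshold sits near seven nines of two-qubit gate fidelity, consistent with the observation in Sec. VII that fidelity improvements beyond that point stop reducing runtimes for the 100-qubit example.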
The two-qubit gate fidelities required to reduce runtimes into a practical region will most likely require error correction. Performing quantum error correction requires an overhead that increases these runtimes. In designing quantum error-correction protocols, it is essential that the improvement in gate fidelities not be outweighed by the accompanying increase in gate times. The proposed model gives a means of quantifying this trade-off: the product of gate infidelity and (error-corrected) gate time should decrease as useful error correction is incorporated. In practice, there are many subtleties that should be accounted for to make a more rigorous statement. These include considering the variation in gate fidelities among gates in the circuit and the varying time costs of different types of gates. Nevertheless, the cost analyses afforded by this simple model may be a useful tool in the design of quantum gates, quantum chips, error-correcting schemes, and noise mitigation schemes.

VII. OUTLOOK
This work is motivated by the impractical runtimes required by many near-term quantum algorithms. We aim to improve the performance of estimation subroutines that have relied on standard sampling, as used in the VQE. Drawing on the recent alpha-VQE [20] and quantum metrology [41], we investigate the technique of enhanced sampling to explore the continuum between standard sampling and quantum amplitude (or phase) estimation. In this continuum, we can make optimal use of the quantum coherence available on a given device to speed up estimation. Similar to standard sampling in the VQE, enhanced sampling does not require ancilla qubits. Quantum advantage for tasks relying on estimation will likely occur within this continuum rather than at one of the extremes.
Our central object of study is the quantum-generated likelihood function, relating measurement outcome data to a parameter of interest encoded in a quantum circuit. We explore engineering likelihood functions to optimize their statistical power. This leads to several insights for improving estimation. First, we should incorporate a well-calibrated noise model directly into the likelihood function to make inference robust to certain errors. Second, we should choose a circuit depth (reflected in the number of enhanced sampling circuit layers) that balances the gain in statistical power against the accrual of error. Finally, we should tune the generalized reflection angles to mitigate the effect of "dead spots" during the inference process.
We use engineered likelihood functions to carry out adaptive approximate Bayesian inference for parameter estimation. Carrying out this process in simulation requires us to build mathematical and algorithmic infrastructures. We develop mathematical tools for analyzing a class of quantum-generated likelihood functions. From this analysis, we propose several optimization algorithms for tuning circuit parameters to engineer likelihood functions. We investigate the performance of estimation using engineered likelihood functions and compare this to estimation using fixed likelihood functions. Finally, we propose a model for predicting the performance of enhanced sampling estimation algorithms as the quality of quantum devices is improved.
These simulations and the model lead to several insights. As highlighted in Sec. V B, for the degree of device error expected in the near term (two-qubit gate fidelities of approximately 99.92%), we show that enhanced sampling and engineered likelihood functions can be used to outperform standard sampling used in the VQE. Furthermore, these simulations suggest a nonstandard perspective on the tolerance of error in quantum algorithm implementations. We find that, for fixed gate fidelities, the best performance is achieved when we push circuit depths to a point where circuit fidelities are around the range of 0.5-0.7. This suggests that, compared to the logical circuit fidelities suggested in other works (e.g., 0.99 in Ref. [16]), we can afford a 50-fold increase in circuit depth. We arrive at this balance between fidelity and statistical power by taking estimation runtime to be the cost to minimize.
The runtime model developed in Sec. VI sheds light on the trade-off between gate times and gate fidelities for estimation. For gate times that are one thousand times slower, the gate fidelities must have three more nines to achieve the same estimation runtimes. The runtime model also gives insight into the role of quantum error correction in estimation algorithms. Roughly, we find that, for quantum error correction to be useful for estimation, the factor of increase in runtime from error-correction overhead must be less than the factor of decrease in logical gate error rates. Additionally, the runtime model predicts that, for a given estimation task, there is a level of logical gate fidelity beyond which further improvements effectively do not reduce runtimes (especially if time overhead is taken into account). For the 100-qubit example considered, seven nines in two-qubit gate fidelity sufficed.
We leave a number of questions for future investigation. In the VQE, a set of techniques referred to as "grouping" is used to reduce the measurement count [46,64-67]. These grouping techniques allow sampling of multiple operators at once, providing a type of measurement parallelization. The grouping method introduced in Refs. [65,67] decomposes a Pauli Hamiltonian into sets of mutually anticommuting Pauli strings, which ensures that the sum within each set is a Hermitian reflection. This method of grouping is compatible with enhanced sampling, as the Hermitian reflections can be both measured and implemented as generalized reflections (i.e., an example of the operator P). However, it remains to explore whether the additional circuit depth incurred by implementing these reflections is worth the variance reduction in the resulting estimators. Beyond existing grouping techniques, we anticipate opportunities for parallelizing measurements that are dedicated to enhanced sampling.
Our work emphasizes the importance of developing accurate error models at the algorithmic level. Efficiently learning the "nuisance parameters" of these models, or likelihood function calibration, will be an essential ingredient to realizing the performance gain of any enhanced sampling method in the near term. The motivation is similar to that of randomized benchmarking [68], where measurement data are fit to models of gate noise. An important problem that we leave for future work is to improve methods for likelihood function calibration. Miscalibration can introduce a systematic bias in parameter estimates. A back-of-the-envelope calculation predicts that the relative error in estimation due to miscalibration bias is inversely proportional to the number of Grover iterates L (which is set in proportion to 1/λ) and proportional to the absolute error in the likelihood function. Assuming that this absolute error grows sublinearly in L, we would expect a degree of robustness in the estimation procedure as L is increased. In future work we will explore in more detail to what precision we must calibrate the likelihood function so that the bias introduced by miscalibration (or model error) is negligible. Finally, we leave investigations into analytical upper and lower bounds on estimation performance to future work.
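The back-of-the-envelope claim above can be checked with a toy calibration model. In the sketch below (Python; the cosine bias form and all numerical values are illustrative assumptions of ours, not the paper's calibration procedure), an estimate is produced by inverting a mean signal of the form f cos(mθ) using a miscalibrated fidelity. Holding the Grover phase mθ fixed, the resulting bias in θ falls off as 1/m, so deeper circuits are more forgiving of a fixed absolute error in the likelihood function.

```python
import math

def miscalibration_bias(phi, m, f_true, f_model):
    # True phase theta = phi/m, so the "Grover phase" m*theta is held fixed.
    theta = phi / m
    signal = f_true * math.cos(m * theta)          # ideal mean signal
    # Invert the signal assuming the (wrong) model fidelity f_model.
    est = math.acos(max(-1.0, min(1.0, signal / f_model))) / m
    return abs(est - theta)

b3 = miscalibration_bias(2.0, 3, 0.70, 0.75)   # shallow circuit
b9 = miscalibration_bias(2.0, 9, 0.70, 0.75)   # three times deeper circuit
```

The ratio b3/b9 equals the depth ratio, illustrating the 1/L falloff of the miscalibration bias at fixed absolute likelihood error.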
We have aimed to present a viable solution to the "measurement problem" [15] that plagues the VQE. It is likely that these methods will be needed to achieve quantum advantage for problems in quantum chemistry and materials. Furthermore, such amplitude estimation techniques may help to achieve quantum advantage for applications in finance and machine learning tasks as well. We hope that our model for estimation performance as a function of device metrics is useful in assessing the relative importance of a variety of quantum resources including qubit number, two-qubit gate fidelity, gate times, qubit stability, error-correction cycle time, readout error rate, and others.

ACKNOWLEDGMENTS
We would like to thank Pierre-Luc Dallaire-Demers, Amara Katabarwa, Jhonathan Romero, Max Radin, Peter Love, Yihui Quek, Jonny Olson, Hannah Sim, and Jens Eisert for insightful conversations and valuable feedback, and Christopher Savoie for grammatical tidying.

APPENDIX A: A SUFFICIENT MODEL OF CIRCUIT NOISE
Here we describe one circuit noise model that yields the likelihood function noise model of Eq. (23). We expect that this circuit noise model is too simplistic to describe the output density matrix of the circuit. However, we also expect that physically realistic variants of this circuit noise model will likewise lead to an exponentially decaying likelihood function bias. In other words, this circuit noise model is sufficient, but not necessary, for yielding the likelihood function noise model used in this work. The circuit noise model assumes that the noisy version of each circuit layer V(x_2j)U(θ; x_{2j−1}) implements a mixture of the target operation and the completely depolarizing channel acting on the same input state, i.e., ρ ↦ p V(x_2j)U(θ; x_{2j−1}) ρ U†(θ; x_{2j−1})V†(x_2j) + (1 − p) I/2ⁿ, where p is the fidelity of this layer. Under composition of such imperfect operations, the output state of the L-layer circuit becomes a mixture of the ideal output state, with weight p^L, and the maximally mixed state, with weight 1 − p^L. This imperfect circuit is preceded by an imperfect preparation of |A⟩ and followed by an imperfect measurement of P. In the context of randomized benchmarking, such errors are referred to as state preparation and measurement (SPAM) errors [69]. We also model SPAM error with a depolarizing model, taking the noisy preparation of |A⟩ to be a mixture of |A⟩⟨A| and the maximally mixed state, with combined SPAM fidelity p̄. The resulting likelihood function is P(d|θ; f, x) = ½[1 + (−1)^d f Δ(θ, x)], where f = p̄p^L is the fidelity of the whole process for generating the ELF, and Δ(θ, x) is the bias of the ideal likelihood function as defined in Eq. (22).
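The claim that per-layer depolarizing noise composes into a single global fidelity p^L multiplying the bias can be verified directly on small examples. The sketch below (Python with NumPy; the random single-qubit layers are stand-ins for V(x_2j)U(θ; x_{2j−1}), not the actual circuit) propagates a density matrix through L depolarizing-mixture layers and checks that the expectation of a traceless observable is exactly p^L times its ideal value.

```python
import numpy as np

def depolarizing_layer(rho, U, p):
    # Mixture of the target operation and the completely depolarizing channel.
    d = rho.shape[0]
    return p * U @ rho @ U.conj().T + (1.0 - p) * np.eye(d) / d

def random_unitary(rng, d=2):
    # Haar-ish random unitary from the QR decomposition of a Ginibre matrix.
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q @ np.diag(np.diag(r) / np.abs(np.diag(r)))

rng = np.random.default_rng(0)
p, L = 0.9, 6
layers = [random_unitary(rng) for _ in range(L)]
noisy = ideal = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)  # |0><0|
for U in layers:
    noisy = depolarizing_layer(noisy, U, p)
    ideal = U @ ideal @ U.conj().T
Z = np.diag([1.0, -1.0])                 # traceless observable, standing in for P
noisy_ev = np.trace(Z @ noisy).real
ideal_ev = np.trace(Z @ ideal).real
```

Since the maximally mixed state contributes nothing to a traceless observable, noisy_ev = p^L times ideal_ev, which is the exponentially decaying bias f Δ(θ, x) with f = p^L (and f = p̄p^L once the same mixing is applied to state preparation and measurement).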

APPENDIX B: PROOF OF LEMMA 1
In this appendix, we prove Lemma 1, which states that the CSBD coefficient functions of Δ(θ; x) and its θ-derivative Δ′(θ; x) with respect to x_j can be evaluated in O(L) time for any j ∈ {1, 2, …, 2L}.
With this notation, Eq. (20) gives an expression for Δ(θ; x) in terms of the layer operators. Moreover, taking the partial derivative of Eq. (20) with respect to θ introduces ∂U(θ; α)/∂θ, the partial derivative of U(θ; α) with respect to θ, which in turn is written in terms of the derivative of P(θ) with respect to θ. It follows that Δ′(θ; x) admits an analogous decomposition. The following facts will be useful.
The following fact will also be useful: taking the partial derivative of Eq. (22) with respect to θ yields the corresponding expression for Δ′(θ; x). In order to evaluate C_j(θ; x_¬j), S_j(θ; x_¬j), B_j(θ; x_¬j), C′_j(θ; x_¬j), S′_j(θ; x_¬j), and B′_j(θ; x_¬j) for given θ and x_¬j, we consider the case of even j and the case of odd j separately.
Next, we describe how to compute C′_j(θ; x_¬j), S′_j(θ; x_¬j), and B′_j(θ; x_¬j); using Eq. (B7) together with Eqs. (B1) and (B16), these quantities can likewise be assembled in O(L) time.

APPENDIX C: MAXIMIZING THE SLOPE OF THE ANCILLA-FREE LIKELIHOOD FUNCTION

In this appendix, we present two algorithms for maximizing the slope of the ancilla-free likelihood function P(d|θ; f, x) at a given point θ = μ (i.e., the prior mean of θ). Namely, our goal is to find x ∈ R^2L that maximizes |P′(d|μ; f, x)| = f|Δ′(μ; x)|/2. Similar to Algorithms 1 and 2 for Fisher information maximization, our algorithms for slope maximization are also based on gradient ascent and coordinate ascent, respectively. Both need to call the procedures in Lemma 1 to evaluate C′_j(μ; x_¬j), S′_j(μ; x_¬j), and B′_j(μ; x_¬j) for given μ and x_¬j. However, the gradient-ascent-based algorithm uses these quantities to compute the partial derivative of [Δ′(μ; x)]² with respect to x_j, while the coordinate-ascent-based algorithm uses them to directly update the value of x_j. These algorithms are formally described in Algorithms 3 and 4, respectively.
APPENDIX D: ENGINEERED LIKELIHOOD FUNCTIONS FOR THE ANCILLA-BASED SCHEME

Assuming the circuit in Fig. 21 is noiseless, the AB ELF is given by P(d|θ; x) = ½[1 + (−1)^d Δ̄(θ; x)], where Δ̄(θ; x) is the bias of the likelihood function. In particular, if x = (π/2, π/2, …, π/2), we get Δ̄(θ; x) = (−1)^L cos(Lθ) and call the corresponding likelihood function the ancilla-based Chebyshev likelihood function. It turns out that most of the argument in Sec. III A still holds in the ancilla-based case, except that we need to replace Δ(θ; x) with Δ̄(θ; x). So we use the same notation (e.g., |0⟩, |1⟩, X̄, Ȳ, Z̄, Ī) as before, unless otherwise stated. In particular, when we take the errors in the circuit in Fig. 21 into account, the noisy likelihood function is given by P(d|θ; f, x) = ½[1 + (−1)^d f Δ̄(θ; x)], where f is the fidelity of the process for generating the ELF. Note, however, that there does exist a difference between Δ(θ; x) and Δ̄(θ; x): the former is trigono-multiquadratic in x, while the latter is trigono-multilinear in x.
We tune the circuit parameters x and perform Bayesian inference with the resultant ELFs in the same way as in Sec. III B. In fact, the argument in Sec. III B still holds in the ancilla-based case, except that we need to replace Δ(θ; x) with Δ̄(θ; x). So we use the same notation as before, unless otherwise stated. In particular, we also define the variance reduction factor V(μ, σ; f, x) as before, under the same assumptions. Since the direct optimization of V is hard in general, we tune the parameters x by optimizing proxies of V instead.

Efficient maximization of proxies of the variance reduction factor
Now we present efficient heuristic algorithms for maximizing two proxies of the variance reduction factor V: the Fisher information and the slope of the likelihood function P(d|θ; f, x). All of these algorithms make use of the following procedures for evaluating the CSD coefficient functions of the bias Δ̄(θ; x) and its θ-derivative Δ̄′(θ; x) with respect to x_j for j = 1, 2, …, 2L.
Our optimization algorithms require efficient procedures for evaluating C_j(θ; x_¬j), S_j(θ; x_¬j), C′_j(θ; x_¬j), and S′_j(θ; x_¬j) for given θ and x_¬j. It turns out that these tasks can be accomplished in O(L) time.
In this case, W_2t = V(x_j), from which we obtain the expressions for C_j(θ; x_¬j) and S_j(θ; x_¬j).

Fig. 22. Fisher information of the AB ELF and the AB CLF for various θ ∈ (0, π). For the AB ELF, x_θ is chosen to maximize the Fisher information at θ; for the AB CLF, x_θ = (π/2, π/2, …, π/2) is fixed. One can see that the Fisher information of the AB ELF is larger than that of the AB CLF for the majority of θ ∈ (0, π). Furthermore, the Fisher information of the AB CLF changes dramatically with θ (in fact, it is exactly 0 when θ = jπ/L for j = 0, 1, …, L), whereas the Fisher information of the AB ELF is less sensitive to the value of θ.
Then we compute A_t and B_t by Eqs. (D15) and (D16). After that, we calculate C_j(θ; x_¬j) and S_j(θ; x_¬j) by Eqs. (D19) and (D20). Overall, this procedure takes O(L) time.
In this case, W_{2t+1} = U(θ; x_j), from which we obtain the corresponding expressions for the coefficient functions. Given θ and x_¬j, we first compute P_{0,2t} and P_{2t+2,2L−1} in O(L) time. Then we calculate C_j(θ; x_¬j) and S_j(θ; x_¬j) by Eqs. (D23) and (D24). This procedure takes only O(L) time.
Then we compute A_t and B_t by Eqs. (D26) and (D27). After that, we calculate C_j(θ; x_¬j) and S_j(θ; x_¬j) by Eqs. (D30) and (D31). Overall, this procedure takes O(L) time.
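The O(L) evaluations above all rest on the same device: precomputing prefix and suffix products of the circuit layers so that, for each j, the product of all layers other than W_j is available without re-multiplying the whole circuit. A generic sketch of this bookkeeping (Python with NumPy; the function and variable names are ours, not the P_{0,2t} notation of the text):

```python
import numpy as np

def prefix_suffix_factors(mats):
    # For Q = mats[n-1] @ ... @ mats[0], return lists (suffix, prefix) with
    # Q = suffix[j] @ mats[j] @ prefix[j] for every j, using O(n) matrix
    # products in total instead of O(n) products per index j.
    n, d = len(mats), mats[0].shape[0]
    prefix = [np.eye(d, dtype=complex)]
    for j in range(n - 1):
        prefix.append(mats[j] @ prefix[-1])      # prefix[j] = M_{j-1}...M_0
    suffix = [None] * n
    suffix[n - 1] = np.eye(d, dtype=complex)
    for j in range(n - 2, -1, -1):
        suffix[j] = suffix[j + 1] @ mats[j + 1]  # suffix[j] = M_{n-1}...M_{j+1}
    return suffix, prefix
```

With these factors in hand, substituting a modified layer at position j (e.g., a trigonometric component of W_j) costs only O(1) additional products per coefficient.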

b. Maximizing the Fisher information of the likelihood function
We propose two algorithms for maximizing the Fisher information of the likelihood function P(d|θ; f, x) at a given point θ = μ (i.e., the prior mean of θ). Namely, our goal is to find x ∈ R^2L that maximizes I(μ; f, x). These algorithms are similar to Algorithms 1 and 2 for Fisher information maximization in the ancilla-free case, in the sense that they are also based on gradient ascent and coordinate ascent, respectively. The main difference is that now we invoke the procedures in Lemma 2 to evaluate C_j(μ; x_¬j), S_j(μ; x_¬j), C′_j(μ; x_¬j), and S′_j(μ; x_¬j) for given μ and x_¬j, and then use them either to compute the partial derivative of I(μ; f, x) with respect to x_j (in gradient ascent) or to define a single-variable optimization problem for x_j (in coordinate ascent). These algorithms are formally described in Algorithms 5 and 6. We have used Algorithms 5 and 6 to find the parameters x_θ ∈ R^2L that maximize I(θ; f, x) for various θ ∈ (0, π) (fixing f), obtaining Fig. 22. This figure indicates that the Fisher information of the AB ELF is larger than that of the AB CLF for the majority of θ ∈ (0, π).

Fig. 23. |Δ̄′(θ; x_θ)| for the slope-based AB ELF and the AB CLF for various θ ∈ (0, π) when the number of circuit layers is L = 6. For the AB ELF, x_θ is a global maximum point of |Δ̄′(θ; x)| for given θ; for the AB CLF, x_θ = (π/2, π/2, …, π/2) is fixed. This figure implies that the slope-based AB ELF is steeper than the AB CLF for the majority of θ ∈ (0, π). Furthermore, the slope of the AB CLF changes dramatically with θ (in fact, it is exactly 0 when θ = jπ/L for j = 0, 1, …, L), whereas the slope of the AB ELF is less sensitive to the value of θ.

Consequently,
the estimation algorithm based on the AB ELF is more efficient than that based on the AB CLF, as demonstrated in Sec. V.
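The "dead spots" of the Chebyshev likelihood function quoted above are easy to reproduce. For a binary-outcome likelihood of the form P(d|θ) = ½[1 + (−1)^d f Δ̄(θ)], the Fisher information works out to I(θ) = f²Δ̄′(θ)²/(1 − f²Δ̄(θ)²). The sketch below (Python; the parameter values L = 6 and f = 0.8 are illustrative) evaluates this for the AB CLF bias Δ̄(θ) = (−1)^L cos(Lθ) and recovers the exact zeros at θ = jπ/L.

```python
import math

def fisher_ab_clf(theta, L, f):
    # Fisher information of the AB Chebyshev likelihood function.
    sign = (-1) ** L
    delta = sign * math.cos(L * theta)         # bias
    ddelta = -sign * L * math.sin(L * theta)   # d(bias)/d(theta)
    return (f * ddelta) ** 2 / (1.0 - (f * delta) ** 2)
```

At θ = π/L the Fisher information vanishes (a dead spot of the CLF), while midway between zeros it reaches (fL)², which is why tuning x away from (π/2, …, π/2) pays off near the dead spots.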

c. Maximizing the slope of the likelihood function
We also propose two algorithms for maximizing the slope of the likelihood function P(d|θ; f, x) at a given point θ = μ (i.e., the prior mean of θ). Namely, our goal is to find x ∈ R^2L that maximizes |P′(d|μ; f, x)| = f|Δ̄′(μ; x)|/2.
These algorithms are similar to Algorithms 3 and 4 for slope maximization in the ancilla-free case, in the sense that they are also based on gradient ascent and coordinate ascent, respectively. The main difference is that now we invoke the procedures in Lemma 2 to evaluate C′_j(μ; x_¬j) and S′_j(μ; x_¬j) for given μ and x_¬j. Then we use these quantities either to compute the partial derivative of [Δ̄′(μ; x)]² with respect to x_j (in gradient ascent) or to directly update the value of x_j (in coordinate ascent). These algorithms are formally described in Algorithms 7 and 8.
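Because the ancilla-based bias is trigono-multilinear, the slope is, as a function of any single x_j with the others fixed, of the form C′ cos x_j + S′ sin x_j, so each coordinate-ascent update is available in closed form: x_j ← atan2(S′, C′). The sketch below (Python) demonstrates this update rule on a toy multilinear objective standing in for Δ̄′(μ; x); the coefficient oracle plays the role of the Lemma 2 procedures and is not the actual circuit computation.

```python
import math, random

def coordinate_ascent(coeffs, x, sweeps=20):
    # coeffs(j, x) returns (C, S) such that, with the other coordinates
    # fixed, the objective equals C*cos(x[j]) + S*sin(x[j]).
    # The exact per-coordinate maximizer is x[j] = atan2(S, C).
    for _ in range(sweeps):
        for j in range(len(x)):
            C, S = coeffs(j, x)
            x[j] = math.atan2(S, C)
    return x

# Toy trigono-multilinear objective: a product of per-coordinate sinusoids.
alphas, betas = [1.0, 0.5, -0.8], [0.2, 1.5, 0.3]

def objective(x):
    v = 1.0
    for a, b, xi in zip(alphas, betas, x):
        v *= a * math.cos(xi) + b * math.sin(xi)
    return v

def toy_coeffs(j, x):
    rest = 1.0
    for k, (a, b) in enumerate(zip(alphas, betas)):
        if k != j:
            rest *= a * math.cos(x[k]) + b * math.sin(x[k])
    return alphas[j] * rest, betas[j] * rest

random.seed(7)
x_opt = coordinate_ascent(toy_coeffs, [random.uniform(0.0, math.pi) for _ in range(3)])
```

For this product structure the global maximum is the product of the per-coordinate amplitudes sqrt(a² + b²), which the sweep reaches; in the gradient-ascent variant the same (C, S) pair instead feeds the partial derivative of the squared slope.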
We have also used Algorithms 7 and 8 to find the parameters x_θ that maximize |Δ̄′(θ; x)| for various θ ∈ (0, π), obtaining Fig. 23. This figure implies that the slope-based AB ELF is steeper than the AB CLF for the majority of θ ∈ (0, π) and hence has more statistical power than the AB CLF, at least in the low-fidelity setting, by Eq. (D5).

Approximate Bayesian inference with engineered likelihood functions
With the algorithms for tuning the circuit parameters x in place, we now briefly describe how to perform Bayesian inference efficiently with the resultant likelihood functions. The idea is similar to that in Sec. IV B for the ancilla-free scheme.
Suppose that θ has prior distribution N(μ, σ²), where σ ≪ 1/L, and that the fidelity of the process for generating the ELF is f. We find that the parameters x = (x_1, x_2, …, x_2L) that maximize I(μ; f, x) (or |Δ̄′(μ; x)|) yield a likelihood function that, when θ is close to μ, is well approximated by the fitted form with parameters r and b described below.

Fig. 25. Bias(μ̂_t)² and Var(μ̂_t) during the inference process for the AB ELF and AF ELF. Here Π has true value 0.6 and prior distribution N(0.64, 0.0009), the number of circuit layers is L = 6, each layer has fidelity p = 0.9, and there is no SPAM error (i.e., p̄ = 1). Each plot is generated by simulating the inference process 300 times. Note that Bias(μ̂_t)² < Var(μ̂_t) as soon as t becomes sufficiently large. The variance of μ̂_t increases at the early stage of the algorithm, because we always start with the same μ̂_0 and it takes a certain number of Bayesian updates for μ̂_t to become sufficiently dispersed.
In Fig. 24 we present an example of the true and fitted likelihood functions.
Once we obtain the optimal r and b, we approximate the posterior mean and variance of θ with those of the fitted likelihood function, which have analytical formulas. Specifically, suppose that θ has prior distribution N(μ_k, σ_k²) at round k. Let d_k be the measurement outcome, and let (r_k, b_k) be the best-fitting parameters at this round. Then we approximate the posterior mean and variance of θ by Eqs. (D45) and (D46). After that, we proceed to the next round, setting N(μ_{k+1}, σ_{k+1}²) as the prior distribution of θ for that round. The approximation errors incurred by Eqs. (D45) and (D46) are small and have negligible impact on the performance of the whole algorithm, for the same reason as in the ancilla-free case.
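To make the round-by-round update concrete, the sketch below (Python) implements a Gaussian-approximate Bayesian update for a fitted likelihood of the assumed form P(d|θ) = ½[1 + (−1)^d f sin(rθ + b)], our illustrative stand-in for the fitted ELF. Under a Gaussian prior, the posterior mean of such a likelihood has a closed form (our own derivation via the Gaussian moments of e^{irθ}, not quoted from the text), which we cross-check against direct numerical integration; all parameter values are illustrative.

```python
import math

def posterior_moments_numeric(mu, sigma, r, b, f, d, n=20001, w=8.0):
    # Brute-force posterior moments under prior N(mu, sigma^2) and
    # likelihood P(d|theta) = (1/2)[1 + (-1)^d f sin(r*theta + b)].
    s = (-1) ** d
    lo, h = mu - w * sigma, 2.0 * w * sigma / (n - 1)
    z = m1 = m2 = 0.0
    for i in range(n):
        th = lo + i * h
        wgt = math.exp(-0.5 * ((th - mu) / sigma) ** 2) \
              * 0.5 * (1.0 + s * f * math.sin(r * th + b))
        z += wgt; m1 += wgt * th; m2 += wgt * th * th
    mean = m1 / z
    return mean, m2 / z - mean * mean

def posterior_mean_closed_form(mu, sigma, r, b, f, d):
    # Gaussian-sinusoid update: uses E[e^{i r theta}] for theta ~ N(mu, sigma^2).
    s = (-1) ** d
    damp = math.exp(-0.5 * (r * sigma) ** 2)
    phi = r * mu + b
    return mu + s * f * r * sigma ** 2 * damp * math.cos(phi) \
              / (1.0 + s * f * damp * math.sin(phi))
```

Setting the new prior for the next round to a Gaussian with these moments is exactly the round-to-round recursion described above.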

APPENDIX E: BIAS AND VARIANCE OF THE ELF-BASED ESTIMATOR OF Π
In this appendix, we state the following two facts about the ELF-based estimator μ̂_t of Π during the inference process [recall that we use the Gaussian distribution N(μ̂_t, σ̂_t²) to represent our knowledge of Π at time t, where μ̂_t and σ̂_t are random variables depending on the history of random measurement outcomes up to time t].
(a) In both the ancilla-based and ancilla-free cases, the squared bias of μ̂_t is smaller than its variance, i.e., Bias(μ̂_t)² = (E[μ̂_t] − Π)² < Var(μ̂_t) for large enough t. See Fig. 25 for examples. (b) In both the ancilla-based and ancilla-free cases, the perceived variance σ̂_t² of Π is often close to the variance of μ̂_t, i.e., σ̂_t² ≈ Var(μ̂_t) with high probability for large enough t. See Fig. 26 for examples.
Combining these facts, we know that, for large enough t, MSE(μ̂_t) = E[(μ̂_t − Π)²] ≤ 2σ̂_t² with high probability. Similar statements hold for the ELF-based estimator μ̂_t of θ during the inference process.

Fig. 26. E[σ̂_t²] and Var(μ̂_t) during the inference process for the AB ELF and AF ELF. Here Π has true value 0.6 and prior distribution N(0.64, 0.0009), the number of circuit layers is L = 6, each layer has fidelity p = 0.9, and there is no SPAM error (i.e., p̄ = 1). Each plot is generated by simulating the inference process 300 times. Note that E[σ̂_t²] ≈ Var(μ̂_t) once t becomes sufficiently large. Furthermore, we discover that σ̂_t² does not vary much among different runs of the algorithm, i.e., σ̂_t² ≈ E[σ̂_t²] with high probability, so σ̂_t² ≈ Var(μ̂_t) with high probability for large enough t. The variance of μ̂_t increases at the early stage of the algorithm, because we always start with the same μ̂_0 and it takes a certain number of Bayesian updates for μ̂_t to become sufficiently dispersed.

APPENDIX F: COMPARISON OF EXACT OPTIMIZATION WITH OPTIMIZATION OF PROXIES
In this appendix, we compare the maximization of the variance reduction factor with the maximization of the proxies used in Sec. IV. We start with motivating the use of these proxies by studying the limiting behavior of V(μ, σ ; f , x) as σ → 0 or as f → 0 or 1.

Limiting behavior of the variance reduction factor
Consider the following three limiting situations: (i) the variance vanishes, i.e., σ² → 0; (ii) the fidelity is close to zero, i.e., f ≈ 0; and (iii) the fidelity is equal to one, i.e., f = 1. We derive expressions for the variance reduction factor under these conditions (see Fig. 27 for a flowchart summarizing the results below). For simplicity, we assume in this section that either (i) 0 ≤ f < 1 or (ii) f = 1 and |Δ(μ; x)| ≠ 1. While the results here are stated in terms of the ancilla-free bias (22), they also hold for the ancilla-based bias (D2) (with Δ replaced by Δ̄). In the next part, we show how these approximations give us good proxies for maximizing the variance reduction factor. To investigate the performance of the proxies (F10) and (F11), we numerically optimize the expected posterior variance and the proxies (F8) and (F9) for small problem sizes L. The results of this optimization are presented in Figs. 28-31. For L = 1, it turns out that the optimization problem (F9) for the ancilla-free bias can be solved analytically. We present this analytical solution in Appendix F 3.

Analytical expressions for an L = 1 slope proxy
In this appendix, we present an analytical solution to optimization problem (F9) for the ancilla-free bias when L = 1. In this case, the bias (22) can be written as the