Universal noise-precision relations in variational quantum algorithms

Variational quantum algorithms (VQAs) are expected to become a practical application of near-term noisy quantum computers. Although the effect of noise crucially determines whether a VQA works or not, the heuristic nature of VQAs makes it difficult to establish analytic theories. Analytic estimations of the impact of noise are urgently needed in the search for quantum advantage, since numerical simulations of noisy quantum computers on classical computers are computationally heavy and limited to small-scale problems. In this paper, we establish analytic estimations of the error in the cost function of VQAs due to noise. The estimations are applicable to any typical VQA under Gaussian noise, which is equivalent to a class of stochastic noise models. Notably, depolarizing noise is included in this model. As a result, we obtain estimations of the noise level required to guarantee a desired precision. Our formulae show how the Hessian of the cost function, the spectrum of the target operator, and the geometry of the ansatz affect the sensitivity to the noise. This insight implies trade-off relations between the trainability and the noise resilience of the cost function. We also obtain rough estimations which can be easily computed without detailed information about the cost function. As a highlight of the applications of the formulae, we propose a quantum error mitigation method which is different from extrapolation and probabilistic error cancellation.


I. INTRODUCTION
To make use of noisy intermediate-scale quantum (NISQ) devices in the near future [1], we have to seek a classically intractable task that hundreds of qubits can resolve without error correction. A promising framework to realize this is hybrid quantum-classical algorithms, where most of the processing is done on a classical computer, which receives the output of a quantum circuit computing some classically intractable function. In particular, variational quantum algorithms (VQAs) have attracted much attention; there, the cost function of a variational problem is computed utilizing low-depth quantum circuits, and the optimization of the variational parameters is done on a classical computer. For example, the variational quantum eigensolver (VQE) [2][3][4] is a VQA to obtain an approximation of the ground state of a Hamiltonian, with many extensions [5][6][7][8][9][10][11][12][13][14][15]. The quantum approximate optimization algorithm (QAOA) [16][17][18] is another attractive VQA, aimed at combinatorial optimization problems. Quantum machine learning algorithms [19,20] for NISQ devices have also been proposed in various settings [21][22][23][24][25][26].
Noise is one of the most crucial obstacles to overcome toward achieving quantum advantage via VQAs. The heuristic nature of VQAs makes it difficult to analytically assess the effects of noise on their performance. To go beyond heavy numerical simulations of noisy quantum computers on classical computers, which are limited to small-scale problems, analytic estimations of the impact of noise are urgently needed for obtaining knowledge about intermediate-scale problems with potential quantum advantage. In fact, this issue has been actively studied in recent years, and some analytic results have been obtained, for example, on the characterization of the impact of local noise in QAOA [27], noise resilience of the optimization results [28,29], noise-induced barren plateaus [30], noise-induced breaking of symmetries [31], and effects of noise on the convergence of the optimization in VQAs [32].
In this work, we establish analytic estimation formulae for the error in the cost function of VQAs due to noise, which are applicable to any typical VQA under Gaussian noise. In particular, we focus on the effect of the noise on the expectation value in order to investigate ultimately achievable and unachievable precision, aside from the statistical error due to the finite number of measurements. Gaussian noise is equivalent to a class of stochastic noise models given in Eq. (4). Notably, depolarizing noise can be decomposed into this form of stochastic noise channels and is hence included in this model. The correspondence from the stochastic noise model to the Gaussian model is given by introducing virtual parametric gates associated with the noise. Our formulae essentially come from the expansion of the cost function with respect to the fluctuations in the parameters due to the noise. This fact implies that a picture of the noise based on fluctuations of parameters of virtual parametric gates can serve as a powerful tool for performance analysis of VQAs. In fact, we propose a quantum error mitigation method based on this expansion including the virtual parameters, which is different from existing error mitigation methods such as extrapolation [33,34] and probabilistic error cancellation [34,35].
Applying our formulae, we can estimate the order of magnitude of both the sufficient and the necessary noise levels to achieve a desired precision. Moreover, we can gain insight into which properties of the problem affect the sensitivity of the cost function to the noise. More concretely, our formulae imply that the sensitivity to the noise is governed by the Hessian of the cost function, or by the spectrum of the target operator and the geometry of the ansatz. Trade-off relations between the trainability and the noise resilience of the cost function are repeatedly implied in several forms as a result of our formulae.
We also obtain computable rough upper and lower bounds of the precision of the noisy cost function for a VQA task, whose usefulness is verified in numerical simulations of the Heisenberg spin chain and a toy model.
The rest of the paper is organized as follows. In Sec. II A, we describe the setup of the VQA under the Gaussian noise. The correspondence between the Gaussian and the stochastic noise models, including the depolarizing noise model, is shown in Sec. II B. In Sec. III A, the main theorem (Theorem 2) is shown, followed by estimations of a sufficient order of the smallness of the noise to achieve a given precision. In Sec. III B, we propose an error mitigation method based on the main theorem. Next, in Sec. IV we establish upper and lower bounds on the error in the cost function, which show how the spectrum of the target operator and the geometric structure of the ansatz affect the sensitivity of the cost function to the noise, followed by an estimation of a necessary order of the smallness of the noise to achieve a required precision. We provide rough estimations which can be easily calculated without detailed information about the cost function at the tail of Sec. IV. In Sec. V, we demonstrate the usefulness of the rough estimations by numerical simulations of the Heisenberg spin chain and a toy model. The conclusion is drawn in Sec. VI. A summary table of important notations is also provided.

II. SETUP

A. VQA under the Gaussian noise model

We consider a parameterized quantum circuit of the form
$U(\vec\theta) = U_M(\theta_M) W_M \cdots U_2(\theta_2) W_2\, U_1(\theta_1) W_1,$
where $U_i(\theta_i) = \exp[-i\theta_i A_i/2]$ satisfies $A_i^2 = I$ with the identity operator $I$, and $W_i$ is a generic non-parametric gate. Typical parameterized quantum circuits such as the hardware efficient ansatz [4,26] satisfy the above requirements. We focus on a VQA to minimize the cost function $C(\vec\theta)$ given by the sum of the expectation values of the target Hermitian operators $H_l$ $(l = 1, 2, \cdots, L)$ as
$C(\vec\theta) = \sum_{l=1}^{L} \langle \varphi_l |\, U^\dagger(\vec\theta)\, H_l\, U(\vec\theta)\, | \varphi_l \rangle,$
where $|\varphi_l\rangle$ $(l = 1, 2, \cdots, L)$ are the input states. As a model of the noise, we consider independent Gaussian noise in the parameters, where each parameter $\theta_i$ independently fluctuates as $\theta_i + \eta_i$ with a Gaussian random variable $\eta_i$ with zero mean and variance $\sigma_i^2$.
In other words, the Gaussian noise channel $\mathcal{G}_{A_i,\sigma_i}$ defined below is inserted after each $U_i(\theta_i)$:
$\mathcal{G}_{A_i,\sigma_i}(\rho) := \int_{-\infty}^{\infty} f_{\sigma_i}(\eta)\, \mathcal{U}_{A_i,\eta}(\rho)\, d\eta,$
where $f_\sigma(\eta) = e^{-\eta^2/(2\sigma^2)}/(\sqrt{2\pi}\sigma)$ is the probability density function of the zero-mean Gaussian distribution with variance $\sigma^2$, $\rho$ is any density operator, and $\mathcal{U}_{A_i,\eta}(\rho) := e^{-i\frac{\eta}{2}A_i}\,\rho\, e^{i\frac{\eta}{2}A_i}$. The incompleteness of the control and the statistical error in the parameters obtained as a result of an optimization (e.g., in stochastic gradient descent [36]) may result in such fluctuations in the parameters. Moreover, considering Gaussian noise in "virtual parameters", we can also treat stochastic noise models within the Gaussian noise model, as shown in the next section.
In this paper, we only focus on the effect of the noise on the expectation value in order to investigate ultimately achievable and unachievable precision, aside from the statistical error due to the finiteness of the number of measurements.

B. Correspondence to the stochastic noise model
Here, we show the correspondence relation between the Gaussian and the stochastic noise models along the same lines as Nielsen and Chuang's textbook [37]. We consider the case where $M_{\mathrm{SNC}}$ stochastic noise channels with respect to operators $B_\nu$ $(\nu = 1, 2, \cdots, M_{\mathrm{SNC}})$, $B_\nu^2 = I$, are inserted in the circuit:
$\mathcal{E}_{B_\nu,p_\nu}(\rho) = (1 - p_\nu)\rho + p_\nu B_\nu \rho B_\nu, \quad (4)$
where $\rho$ denotes a density operator and $0 < p_\nu < 1/2$ is the error probability. Then we have the following correspondence between the Gaussian and the stochastic noise models:
$\mathcal{E}_{B_\nu,p_\nu} = \mathcal{G}_{B_\nu,\sigma_{\mathrm{SNC},\nu}} \quad (5)$
holds with the corresponding variance
$\sigma_{\mathrm{SNC},\nu}^2 = -2\ln(1 - 2p_\nu). \quad (6)$
Hence, if we consider the stochastic $B_\nu$-noise $(\nu = 1, 2, \cdots, M_{\mathrm{SNC}})$, it can be treated as Gaussian noise with respect to the virtually inserted parametric gate $V_\nu(\xi_\nu) := \exp[-i\xi_\nu B_\nu/2]$ at the place where the noise occurs, where $\xi_\nu \equiv 0$ throughout the optimization. Therefore, the cost function $C_{\mathrm{noisy}}(\vec\theta)$ evaluated under the stochastic noises and the fluctuations in the optimizing parameters is given as
$C_{\mathrm{noisy}}(\vec\theta) = \int \prod_{i=1}^{M} f_{\sigma_i}(\eta_i) \prod_{\nu=1}^{M_{\mathrm{SNC}}} f_{\sigma_{\mathrm{SNC},\nu}}(\eta'_\nu)\; C(\vec\theta + \vec\eta,\ \vec\eta\,')\; d\vec\eta\, d\vec\eta\,'. \quad (7)$
In the following, $C(\vec\theta)$ denotes the abbreviation of $C(\vec\theta, \vec\xi = \vec 0)$. In particular, the partial derivative of the cost function with respect to a virtual parameter $\xi_\nu$ at $(\vec\theta, \vec\xi) = (\vec\theta, \vec 0)$ is denoted by $\frac{\partial}{\partial\xi_\nu} C(\vec\theta)$. Hereafter, $\nu$ is used only to denote the indices of the stochastic noise channels and their corresponding virtual parameters, in distinction from those of the optimized parametric gates. A benefit of introducing the virtual parameters is that stochastic noises can be treated mathematically in the same way as fluctuations in the parameters. However, it should be noted that the virtual parameters $\vec\xi$ are just fixed to zero and have nothing to do with the optimization. We call $\theta_j$ an optimizing parameter in distinction from a virtual parameter. Nevertheless, there are cases where a virtual parameter $\xi_\nu$ is equivalent to an optimizing parameter $\theta_j$, namely when $B_\nu$-stochastic noise occurs alongside the parametric gate $U_j(\theta_j)$ with $A_j = B_\nu$.
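The correspondence above can be checked numerically for a single qubit: averaging the rotation $e^{-i\eta B/2}$ over a zero-mean Gaussian $\eta$ reproduces the stochastic $B$-channel with $p = (1 - e^{-\sigma^2/2})/2$, i.e., $\sigma^2 = -2\ln(1-2p)$. The following sketch is our own illustration (plain NumPy, with the Gaussian average approximated by a Riemann sum over a truncated grid):

```python
import numpy as np

# Generator B of the noise, with B^2 = I (here: Pauli X on one qubit).
B = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def gaussian_channel(rho, sigma, n_grid=4001, width=8.0):
    """G_{B,sigma}(rho): Gaussian average of U_{B,eta}(rho) = e^{-i eta B/2} rho e^{i eta B/2}."""
    etas = np.linspace(-width * sigma, width * sigma, n_grid)
    weights = np.exp(-etas**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    out = np.zeros_like(rho)
    for eta, w in zip(etas, weights):
        U = np.cos(eta / 2) * I2 - 1j * np.sin(eta / 2) * B
        out = out + w * (U @ rho @ U.conj().T)
    return out * (etas[1] - etas[0])

def stochastic_channel(rho, p):
    """E_{B,p}(rho) = (1 - p) rho + p B rho B."""
    return (1 - p) * rho + p * (B @ rho @ B)

p = 0.05
sigma = np.sqrt(-2 * np.log(1 - 2 * p))  # correspondence: sigma^2 = -2 ln(1 - 2p)
rho = np.array([[0.7, 0.2 + 0.1j], [0.2 - 0.1j, 0.3]])  # an arbitrary density matrix
diff = np.max(np.abs(gaussian_channel(rho, sigma) - stochastic_channel(rho, p)))
print(diff)  # quadrature error only; the two channels agree
```

The agreement follows from $e^{-i\eta B/2} = \cos(\eta/2) I - i \sin(\eta/2) B$ and $\mathbb{E}[\sin\eta] = 0$, $\mathbb{E}[\cos\eta] = e^{-\sigma^2/2}$ for the zero-mean Gaussian.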
One example is when the generator of the parametric gate is a Pauli operator and a depolarizing channel is applied after this gate. In this case, the depolarizing channel is decomposed into stochastic noise channels with respect to all the Pauli operators, so one of the virtual parameters $\xi_\nu$ coincides with the optimizing parameter $\theta_j$ of the Pauli rotation used in the gate. In particular, the relation $\frac{\partial}{\partial\xi_\nu} = \frac{\partial}{\partial\theta_j}$ gives a connection between the trainability of the optimizing parameters and the sensitivity to the stochastic noise, as seen later. We also remark that a correspondence similar to Eq. (5) holds not only for Gaussian noise but for any noise in the parameter whose probability density function is even, since only this property is used to show Eq. (5).
Depolarizing noise is one of the most basic and serious error sources for noisy quantum computers. A key feature of the Gaussian noise model is its capability of treating depolarizing noise via the above correspondence. Depolarizing noise is described by the depolarizing channel
$\mathcal{D}_{k,p}(\rho) = (1 - p)\rho + \frac{p}{4^k - 1} \sum_{i=1}^{4^k - 1} P_i\, \rho\, P_i,$
where $P_i$ runs over all k-qubit Pauli operators except for the identity $I =: P_0$, and p is the error probability. Since we can decompose the depolarizing channel into multiple stochastic noise channels with respect to each single Pauli operator, the above correspondence works.
Lemma 1. Let $p < (4^k - 1)/4^k$. The k-qubit depolarizing channel $\mathcal{D}_{k,p}$ can be decomposed as
$\mathcal{D}_{k,p} = \prod_{i=1}^{4^k - 1} \left[(1 - \tilde p)\,\mathcal{I} + \tilde p\, \mathcal{U}_{P_i}\right],$
i.e., into stochastic Pauli noise channels with respect to the k-qubit Pauli operators $P_i$, where $\mathcal{I}$ is the identity channel, $\mathcal{U}_{P_i}(\rho) = P_i \rho P_i$ for an arbitrary state $\rho$, and the corresponding error probability $\tilde p$ is given as
$\tilde p = \frac{1}{2}\left[1 - \left(1 - \frac{4^k p}{4^k - 1}\right)^{2/4^k}\right].$
Equivalently, the k-qubit depolarizing channel $\mathcal{D}_{k,p}$ can be decomposed into the Gaussian noise channels as
$\mathcal{D}_{k,p} = \prod_{i=1}^{4^k - 1} \mathcal{G}_{P_i,\tilde\sigma}, \qquad \tilde\sigma^2 = -2\ln(1 - 2\tilde p). \quad (10)$
A proof of Lemma 1 is given in Appendix C. We remark that Eq. (10) implies that $\tilde\sigma^2 \approx 4p/(4^k - 1)$ for small error probability p, in the same way as Eq. (17).
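For $k = 1$, the decomposition of Lemma 1 can be checked directly at the level of superoperators: the single-qubit relation reduces to $\tilde p = (1 - \sqrt{1 - 4p/3})/2$, and composing the three stochastic Pauli channels with this $\tilde p$ reproduces the depolarizing channel exactly. The snippet below is our own verification sketch:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def super_conj(U):
    """Superoperator of rho -> U rho U^dagger (row-major vectorization)."""
    return np.kron(U, U.conj())

def stochastic(P, p):
    # E_{P,p}(rho) = (1-p) rho + p P rho P as a 4x4 superoperator.
    return (1 - p) * super_conj(I2) + p * super_conj(P)

def depolarizing(p):
    # D_{1,p}(rho) = (1-p) rho + (p/3) (X rho X + Y rho Y + Z rho Z).
    S = (1 - p) * super_conj(I2)
    for P in (X, Y, Z):
        S = S + (p / 3) * super_conj(P)
    return S

p = 0.1
p_tilde = (1 - np.sqrt(1 - 4 * p / 3)) / 2  # Lemma 1 specialized to k = 1
composed = stochastic(X, p_tilde) @ stochastic(Y, p_tilde) @ stochastic(Z, p_tilde)
print(np.max(np.abs(composed - depolarizing(p))))  # agreement up to floating point
```

Each stochastic Pauli channel scales the two orthogonal Bloch components by $(1 - 2\tilde p)$, so the composition scales every component by $(1 - 2\tilde p)^2$, matching the depolarizing factor $1 - 4p/3$.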

III. UNIVERSAL ERROR ESTIMATIONS
A. Estimation of the leading-order term of the error

First, we show an estimation of the error in the cost function due to fluctuations in the parameters following a general probability measure, not restricted to the Gaussian distribution. Because the fluctuations in the virtual parameters associated with the stochastic noises can be treated in exactly the same way as those in the optimizing parameters, we treat the optimizing and virtual parameters together in the same notation as $\theta_{M+\nu} = \xi_\nu$ $(\nu = 1, 2, \cdots, M_{\mathrm{SNC}})$, $M_{\mathrm{tot}} := M + M_{\mathrm{SNC}}$, and $\sigma_{M+\nu} := \sigma_{\mathrm{SNC},\nu}$. We state the following main theorems in this notation for brevity. We will use the same notation also in the subsequent sections when it is convenient to treat the optimizing and virtual parameters together in the same manner.
Theorem 1. Let us assume that each parameter independently fluctuates as $\theta_i + \eta_i$, where $\eta_i$ is a zero-mean random variable with probability measure $P_i$. We assume that the moment generating function (mgf) $g_i(t) = \int \exp(\eta_i t)\, dP_i(\eta_i)$ of each $\eta_i$ exists and is analytic in a region including 0 and 1. We also assume that every odd moment is nonnegative: $\int \eta_i^{2\alpha+1}\, dP_i(\eta_i) \ge 0$, where $\alpha$ is any positive integer. Then, the noisy cost function $\tilde C(\vec\theta)$ with respect to this noise is estimated as follows:
$\left|\tilde C(\vec\theta) - C(\vec\theta) - \frac{1}{2}\sum_{i=1}^{M_{\mathrm{tot}}} \sigma_i^2\, \frac{\partial^2 C}{\partial\theta_i^2}(\vec\theta)\right| \le \sum_{l=1}^{L}\left(E_{\max,l} - E_{0,l}\right) \sum_{i=1}^{M_{\mathrm{tot}}} \left(g_i(1) - 1 - \frac{\sigma_i^2}{2}\right), \quad (12)$
where $E_{0,l}$, $E_{\max,l}$ are the minimum and the maximum eigenvalues of $H_l$, respectively, and $\sigma_i^2$ is the variance of $\eta_i$.
A proof of Theorem 1 is given in Appendix B 2. Theorem 1 bounds the precision of the approximation of the error $\tilde C(\vec\theta) - C(\vec\theta)$ by $\frac{1}{2}\sum_i \sigma_i^2\, \partial^2 C/\partial\theta_i^2$. We remark that the Taylor expansion of the mgf reads
$g_i(1) = 1 + \frac{\sigma_i^2}{2} + \sum_{\alpha \ge 3} \frac{\mu_{i,\alpha}}{\alpha!},$
where $\mu_{i,\alpha}$ denotes the $\alpha$-th moment of $P_i$; the first-order moment is canceled out inside the bracket on the right hand side of Eq. (12) since $\eta_i$ has zero mean. Hence, whether this approximation is effective or not depends on the behavior of the third- and higher-order moments of $P_i$. We leave the detailed analysis of noise with general probability distributions for future work. In the following, we return to focusing on Gaussian noise.
We can apply Theorem 1 to the Gaussian distribution because every odd moment of the zero-mean Gaussian distribution is zero, and its mgf $\exp(\sigma_i^2 t^2/2)$ obviously satisfies the assumptions of Theorem 1. Therefore, we obtain the following leading-order approximation of the error $\epsilon(\vec\theta) := C_{\mathrm{noisy}}(\vec\theta) - C(\vec\theta)$ in the cost function due to Gaussian noise from Theorem 1:

Theorem 2. We have the following estimation of the deviation of the cost function due to the fluctuations in the parameters following the Gaussian distribution:
$\epsilon(\vec\theta) = \frac{1}{2}\sum_{i=1}^{M_{\mathrm{tot}}} \sigma_i^2\, \frac{\partial^2 C}{\partial\theta_i^2}(\vec\theta) + O\!\left(\sum_{l=1}^{L}(E_{\max,l} - E_{0,l}) \sum_{i=1}^{M_{\mathrm{tot}}} \sigma_i^4\right). \quad (15)$

We remark that a similar analysis to Theorem 2 appears in Ref. [40]. Theorem 2 implies that the error $\epsilon(\vec\theta)$ is well approximated by $\frac{1}{2}\sum_i \sigma_i^2\, \partial^2 C/\partial\theta_i^2$ whenever the remainder is negligible. Suppose that the spectral range of the $H_l$ is of polynomial order in the number of qubits n, i.e., $\sum_{l=1}^{L}(E_{\max,l} - E_{0,l}) = O(n^r)$ with a positive number r (e.g., r = 1 for locally interacting spin systems, r = 4 for the Jordan-Wigner transformed full configuration interaction Hamiltonian of molecules [2,38,39]). Then, if all the variances are of the same order $\sigma_i^2 = O(\sigma^2)$ $(i = 1, 2, \cdots, M_{\mathrm{tot}})$, this approximation is valid when the remainder $O(M_{\mathrm{tot}}\, n^r \sigma^4)$ is negligible compared to the leading term. Now, we explicitly apply Theorem 2 to the virtual parameters associated with the stochastic noises, and rewrite the estimation in terms of the error probability. Eq. (6) implies that
$\sigma_{\mathrm{SNC},\nu}^2 \approx 4 p_\nu \quad (17)$
for small error probability $p_\nu$, from the Taylor expansion $-2\ln(1 - 2p_\nu) = 4p_\nu + O(p_\nu^2)$. Then, applying Theorem 2 to the virtual parameters with the relations (5) and (17), we obtain the following corollary:

Corollary 1. Let the stochastic noise channels $\mathcal{E}_{B_\nu,p_\nu}(\rho) = (1 - p_\nu)\rho + p_\nu B_\nu \rho B_\nu$ $(\nu = 1, 2, \cdots, M_{\mathrm{SNC}})$ with error probabilities $0 < p_\nu < 1/2$ be inserted in the circuit with fluctuating optimizing parameters due to Gaussian noise, so that the noisy cost function is given by Eq. (7). Then, we have the following approximation of the error:
$\epsilon(\vec\theta) \approx \frac{1}{2}\sum_{i=1}^{M} \sigma_i^2\, \frac{\partial^2 C}{\partial\theta_i^2}(\vec\theta) + 2\sum_{\nu=1}^{M_{\mathrm{SNC}}} p_\nu\, \frac{\partial^2 C}{\partial\xi_\nu^2}(\vec\theta), \quad (18)$
where $\xi_\nu$ is the virtual parameter associated with $\mathcal{E}_{B_\nu,p_\nu}$ introduced in Sec. II B to give the correspondence between the stochastic noise and the Gaussian noise models.
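The content of Theorem 2 can be seen in a one-parameter toy model (our own illustration, not from the paper): for $C(\theta) = \cos\theta$, the Gaussian average has the closed form $\mathbb{E}[\cos(\theta + \eta)] = e^{-\sigma^2/2}\cos\theta$, and the deviation matches $\frac{1}{2}\sigma^2 C''(\theta)$ up to $O(\sigma^4)$:

```python
import numpy as np

theta, sigma = 0.7, 0.05

C = np.cos(theta)                                # noiseless cost
C_noisy = np.exp(-sigma**2 / 2) * np.cos(theta)  # exact Gaussian average E[cos(theta + eta)]
eps_leading = 0.5 * sigma**2 * (-np.cos(theta))  # (1/2) sigma^2 * d^2C/dtheta^2

print(C_noisy - C)   # exact error
print(eps_leading)   # leading-order estimate; the two agree up to O(sigma^4)
```

Here the agreement is exact to leading order because $(e^{-\sigma^2/2} - 1)\cos\theta = -\frac{\sigma^2}{2}\cos\theta + O(\sigma^4)$.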
As a typical model, we consider a local depolarizing noise model in which the depolarizing channel $\mathcal{D}_{k,q_k}$ is inserted after each k-qubit gate, where we set $q_k = (4^{k-1} - 4^{-1})\, c_k\, q$ with q being the scale of the error probability and $c_k$ a constant factor characterizing the difference in the error rates between gates acting on different numbers of qubits. We let the fluctuations of the optimizing parameters themselves be negligible, $\sigma_i = 0$ $(i = 1, 2, \cdots, M)$, in this case. Under this local depolarizing noise model, the following proposition holds:

Proposition 2. Suppose that $\sum_{l=1}^{L}(E_{\max,l} - E_{0,l}) = O(n^r)$ holds with a positive number r. Under the above local depolarizing noise model, we can achieve a given desired precision $\epsilon^*$, in the sense that $\epsilon(\vec\theta) = O(\epsilon^*)$, when the error probability has the scaling
$q = O\!\left(\frac{\epsilon^*}{M n^r}\right). \quad (20)$

Proof. Applying Corollary 1 to the local depolarizing noise model in combination with Lemma 1, we obtain
$\epsilon(\vec\theta) \approx \frac{q}{2} \sum_{\nu=1}^{M_{\mathrm{DP}}} c_{k_\nu}\, \frac{\partial^2 C}{\partial\xi_\nu^2}(\vec\theta), \quad (21)$
where each $\xi_\nu$ denotes the virtual parameter associated with each stochastic Pauli noise channel in the decomposition of one of the $k_\nu$-qubit depolarizing channels in the circuit, and the total number $M_{\mathrm{DP}}$ of the stochastic Pauli noise channels satisfies $M_{\mathrm{DP}} = O(M)$.
Since the second derivatives are bounded as $\left|\partial^2 C/\partial\xi_\nu^2\right| \le \frac{1}{2}\sum_{l=1}^{L}(E_{\max,l} - E_{0,l}) = O(n^r)$,
$\epsilon(\vec\theta) = O(q M n^r) \quad (22)$
holds. Hence, if q satisfies Eq. (20), we obtain
$\epsilon(\vec\theta) = O(\epsilon^*). \quad (23)$
For example, when r = 1, to achieve $\epsilon(\vec\theta) \sim 10^{-3}$ (i.e., we set $\epsilon^* = 10^{-3}$) with $n \sim 100$ qubits and the number of gates $M \sim 100$, the error probability $q \sim 10^{-7}$ is sufficient according to this order estimation. As we will show in Sec. III B, a simple error mitigation method utilizing Theorem 2 can relax this stringent estimation. We also remark that this order estimation does not mean that Eq. (20) is required to achieve the precision $\epsilon^*$; it only shows that Eq. (20) is sufficient. Hence, a larger error probability than this estimation might be acceptable in practice. Another estimation, giving a necessary error level, will be shown in Sec. IV via the lower bound (41).
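The order estimation in the example above is trivial arithmetic; the helper below is just our restatement of the scaling $q \sim \epsilon^*/(M n^r)$ of Eq. (20), with the constant factor suppressed:

```python
def sufficient_error_rate(eps_star, M, n, r):
    """Order-of-magnitude sufficient error probability, q ~ eps*/(M * n^r), cf. Eq. (20)."""
    return eps_star / (M * n**r)

# The example in the text: r = 1, eps* = 1e-3, n = 100 qubits, M = 100 gates.
print(sufficient_error_rate(1e-3, 100, 100, 1))  # ~1e-7, matching the text
```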
From another point of view, the coefficients $\partial^2 C/\partial\theta_i^2$ and $\partial^2 C/\partial\xi_\nu^2$ in Eq. (18) quantify the sensitivity to the noise. In particular, low sensitivity to the fluctuations in the optimizing parameters requires small diagonal components of the Hessian of the cost function. At a minimal point $\vec\theta^*$, this means that the trace norm of the Hessian should be small for low sensitivity to the fluctuations, since the Hessian is positive semidefinite there, which implies a flat landscape of the cost function around the minimum. However, optimization in a flat landscape tends to be hard, e.g., due to the precision of the gradient required in gradient descent methods, which increases the required number of measurements. Hence, Eq. (15) implies a trade-off relation between the sensitivity to the fluctuations and the trainability of the cost function.
The above argument can be extended to stochastic noise models when a part of the virtual parameters coincides with some optimizing parameters, in the sense that $B_\nu = A_j$ stochastic noise occurs next to the $U_j(\theta_j)$ gate. For example, when all $A_j$ are Pauli operators and depolarizing noise $\mathcal{D}_{k,q_k}$ acting on the same number of qubits k as $A_j$ is inserted after each $U_j(\theta_j)$, one of the stochastic Pauli noise channels composing the depolarizing noise is the stochastic $A_j$-channel. In such a case, a part of the effects of the stochastic noise can be regarded as fluctuations in the optimizing parameters. To proceed with our analysis, we have to separately estimate the derivatives with respect to the virtual parameters which do not coincide with any optimizing parameter, since these virtual parameters have nothing to do with the optimization landscape. We call such virtual parameters proper virtual parameters. We define the noiseless precision $\delta(\vec\theta^*)$ of the minimization as $\delta(\vec\theta^*) := C(\vec\theta^*) - E_0$, which is attributed to the limited expressive power of the parameterized quantum circuit $U(\vec\theta)$ and to the non-globality of the minimization (i.e., $\vec\theta^*$ may be a local minimum). We assume that the parameters giving the minima of the noisy cost function do not significantly deviate from the noiseless ones [28].
For concreteness, we consider the case where all $A_j$ are Pauli operators. The noise model is the local depolarizing noise model, i.e., depolarizing noise $\mathcal{D}_{k,q_k}$ acting on the same number of qubits k as $A_j$ is inserted after each $U_j(\theta_j)$. We again set $q_k = (4^{k-1} - 4^{-1})\, c_k\, q$ with the scale q and the constant factor $c_k$ depending on k. In this case, recall that one of the stochastic Pauli noise channels composing the depolarizing noise is the stochastic $A_j$-channel. Then, the derivative with respect to the virtual parameter associated with this channel is equivalent to the derivative with respect to the optimizing parameter $\theta_j$. We exclude such non-proper virtual parameters and only consider proper virtual parameters $\vec\xi = (\xi_1, \xi_2, \cdots, \xi_{M_{\mathrm{DP,prop}}})$, where $M_{\mathrm{DP,prop}} < M_{\mathrm{DP}}$ is the number of the proper virtual parameters. We note that the number $M_{\mathrm{DP,prop}}$ of the proper virtual parameters is again of order O(M). Let $m_i$ be the number of qubits $A_i$ acts on. We also define $k_\nu$ in the same way as in Eq. (21); hence $m_i = k_\nu$ holds for $\nu$ with $\xi_\nu$ associated with the depolarizing channel next to the i-th parametric gate. For convenience, we rescale the parameters as $\theta_i = \sqrt{c_{m_i}}\, \tilde\theta_i$. Let $c := \max_\nu c_{k_\nu}$. Then, we can prove the following proposition (see Appendix B 3 for its proof):

Proposition 3. Let all $A_j$ $(j = 1, \cdots, M)$ be Pauli operators. Under the above local depolarizing noise model, the following inequality holds:
$\epsilon(\vec\theta^*) \ge \frac{q}{2}\, \mathrm{Tr}\!\left[\mathrm{Hess}_{\vec{\tilde\theta}}\, C(\vec\theta^*)\right] - \frac{c\, q}{4}\, M_{\mathrm{DP,prop}}\, \delta(\vec\theta^*) + O(q^2), \quad (24)$
where $\mathrm{Hess}_{\vec{\tilde\theta}}\, C$ denotes the Hessian of the cost function with respect to the rescaled optimizing parameters $\vec{\tilde\theta}$.

For a successful minimization, $\delta(\vec\theta^*)$ should be small, and hence the term proportional to $\delta(\vec\theta^*)$ is negligible, as is the $O(q^2)$ term for sufficiently small error probability q. Then, Eq. (24) implies that the trace norm of the Hessian of the cost function should be small if the error probability q is not sufficiently small compared to the required level $\epsilon(\vec\theta^*) < \epsilon^*$ of the error due to the noise. This fact implies the hardness of the optimization due to the flat landscape in the vicinity of the minima.
Moreover, in this case, optimization algorithms utilizing the Hessian become hard, since high precision in the estimation of the Hessian is required when the Hessian is small. Conversely, we need at least $q = O(\epsilon^*)$ to achieve $\epsilon^* > \epsilon(\vec\theta^*)$ while avoiding such hardness.

B. An error mitigation method
We can apply Theorem 2 to derive an error mitigation method. We can cancel the error by subtracting the leading term $\frac{1}{2}\sum_i \sigma_i^2\, \partial^2 C/\partial\theta_i^2$ of the error, given that we know the error model and that $\sigma_i^2$ is small enough for the sub-leading order terms to be negligible. An advantage of this method is that we only use the noisy estimation of the derivatives of the cost function to mitigate the error; we need neither to change the noise strength as in the extrapolation method [33,34], nor to sample various circuits as in the probabilistic error cancellation [34,35]. Using the parameter shift rule (B9), we can calculate the second derivatives from noisy evaluations of the cost function. The effect of the noise on this noisy estimation $h_i(\vec\theta)$ of the second derivative is estimated by applying Theorem 2 again, which reads
$h_i(\vec\theta) = \frac{\partial^2 C}{\partial\theta_i^2}(\vec\theta) + O\!\left(\sum_j \sigma_j^2\right).$
Therefore, the error-mitigated cost function $C_{\mathrm{mitigated}}(\vec\theta)$ defined as
$C_{\mathrm{mitigated}}(\vec\theta) := C_{\mathrm{noisy}}(\vec\theta) - \frac{1}{2}\sum_{i=1}^{M_{\mathrm{tot}}} \sigma_i^2\, h_i(\vec\theta) \quad (27)$
approximates the noiseless cost function up to the sub-leading order. Eq. (27) is verified by observing that
$C_{\mathrm{mitigated}}(\vec\theta) = C(\vec\theta) + \frac{1}{2}\sum_i \sigma_i^2\, \frac{\partial^2 C}{\partial\theta_i^2}(\vec\theta) - \frac{1}{2}\sum_i \sigma_i^2\, h_i(\vec\theta) + O\!\left(\Big(\sum_i \sigma_i^2\Big)^{\!2}\right) = C(\vec\theta) + O\!\left(\Big(\sum_i \sigma_i^2\Big)^{\!2}\right).$
Hence, in this way, we can mitigate the error up to the sub-leading order $O((\sum_i \sigma_i^2)^2)$. This method is also applicable to stochastic noise, including depolarizing noise, by applying Corollary 1. The overhead of this protocol is the evaluation of the noisy cost function at the π-shift of every parameter, including the virtual parameters. The π-shift of a virtual parameter $\xi_\nu$ can be implemented by actually applying its generator $B_\nu$ at the place where the error occurs. In the case of depolarizing noise, each Pauli rotation gate is inserted to calculate the second derivative with respect to each virtual parameter. Although extra noise is added as a byproduct of this inserted gate, the order estimation is not affected, since at most a single gate is inserted for each evaluation. We again consider the same local depolarizing noise model, with the scaling of the error probability q, as the one used to obtain Eq. (21). We also assume that $\sum_{l=1}^{L}(E_{\max,l} - E_{0,l}) = O(n^r)$. Then, in order to achieve a given precision $\epsilon^*$, it is sufficient to have
$q = O\!\left(\frac{1}{M}\sqrt{\frac{\epsilon^*}{n^r}}\right) \quad (29)$
by applying this error mitigation. In comparison to Eq.
(20), the order estimation of the sufficient noise level is relaxed by a factor of $\sqrt{n^r/\epsilon^*}$ via this error mitigation. For example, when r = 1, to achieve the precision $\sim 10^{-3}$ with $n \sim 100$ qubits and the number of gates $M \sim 100$, the error probability $q \sim 3\times 10^{-5}$ is sufficient, which is about $10^2$ times larger than the one without the error mitigation shown below Eq. (23), although it is still stringent. However, we again remark that this estimation is only the sufficient order of the error probability to achieve a given precision, not the necessary one. Moreover, we can take into account the next-leading order in expansion (B8) to improve the error mitigation if the overhead is acceptable. Further analysis of the practical effectiveness of this error mitigation method, including the finiteness of the sampling and the comparison with different error mitigation techniques, will be done in successive work.
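The mitigation step can be carried out in closed form in the one-parameter toy model $C(\theta) = \cos\theta$ (our own illustration): the noisy cost is $e^{-\sigma^2/2}\cos\theta$, the π-shift difference gives the noisy second derivative, and subtracting $\frac{1}{2}\sigma^2 h$ reduces the error from $O(\sigma^2)$ to $O(\sigma^4)$:

```python
import numpy as np

theta, sigma = 0.7, 0.1

def C_noisy(t):
    """Noisy toy cost: Gaussian average of cos(t + eta)."""
    return np.exp(-sigma**2 / 2) * np.cos(t)

# Noisy second derivative via the pi-shift rule (exact for generators with A^2 = I):
# d^2C/dtheta^2 = (C(theta + pi) - C(theta)) / 2.
h = (C_noisy(theta + np.pi) - C_noisy(theta)) / 2

C_mitigated = C_noisy(theta) - 0.5 * sigma**2 * h

exact = np.cos(theta)
print(abs(C_noisy(theta) - exact))  # unmitigated error, O(sigma^2)
print(abs(C_mitigated - exact))     # mitigated error, O(sigma^4): much smaller
```

Note that the π-shift difference is evaluated on the noisy cost itself, exactly as in the protocol: no noise-free quantity is needed.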

IV. LOWER AND UPPER BOUNDS OF THE PRECISION
In this section, we focus on the deviation $\epsilon_0(\vec\theta) := C_{\mathrm{noisy}}(\vec\theta) - E_0$ of the noisy cost function $C_{\mathrm{noisy}}(\vec\theta)$ from the minimum eigenvalue $E_0$ as the error of the noisy VQA task of estimating $E_0$. We show upper and lower bounds on $\epsilon_0(\vec\theta)$. The bounds reveal how the spectrum of the target operator and the geometric structure of the ansatz affect the sensitivity of the cost function to the noise. In particular, from the lower bound, we can estimate how small an error probability is required to achieve a given precision under reasonable assumptions. We also derive rough estimations of the bounds which are easy to check, instead of calculating the Hessian of the cost function in Eq. (15), which would be too expensive to calculate just for the error estimation.
In the following, we focus on the case where all the input states $|\varphi_l\rangle$ $(l = 1, 2, \cdots, L)$ are the same, $|\varphi_l\rangle = |\varphi\rangle$. In this case the cost function reduces to the expectation value of the single Hermitian operator $H := \sum_{l=1}^{L} H_l$:
$C(\vec\theta) = \langle \varphi |\, U^\dagger(\vec\theta)\, H\, U(\vec\theta)\, | \varphi \rangle.$
We denote the smallest, the second smallest, and the maximum eigenvalues of H by $E_0$, $E_1$, and $E_{\max}$, respectively. We assume that the eigenspace of the minimum eigenvalue $E_0$ of H is nondegenerate.
In the following analysis, we proceed based on the stochastic noise model. In particular, Gaussian fluctuations in the optimizing parameters can also be modeled as the stochastic noise $\mathcal{E}_{A_i,p_i}$ with respect to the generator $A_i$ of $U_i(\theta_i)$ through the correspondence shown in Sec. II B, where the error probability $p_i$ is given as
$p_i = \frac{1 - e^{-\sigma_i^2/2}}{2}.$
Moreover, the action of an $A_i$-error in the circuit on the cost function is the same as the shift of $\theta_i$ by π. A $B_\nu$-error of the stochastic noise can likewise be represented as the shift of the virtual parameter $\xi_\nu$ by π. Hence, for convenience, we treat the optimizing and virtual parameters together in the same notation, in the same way as in Sec. III A. Now, we introduce the quantity $G_{i_1,i_2,\cdots,i_k}(\vec\theta)$, which describes the sensitivity of the state $|\varphi(\vec\theta)\rangle := U(\vec\theta)|\varphi\rangle$ to the π-shift of the parameters $\theta_{i_1}, \cdots, \theta_{i_k}$, as follows:
$G_{i_1,i_2,\cdots,i_k}(\vec\theta) := 1 - \left|\langle \varphi(\vec\theta + \pi(\vec e_{i_1} + \cdots + \vec e_{i_k}))\,|\,\varphi(\vec\theta)\rangle\right|^2,$
with $k = 1, 2, \cdots, M_{\mathrm{tot}}$, where $\vec e_i$ denotes the unit vector of the i-th parameter. In particular, $G_i(\vec\theta)$ $(i = 1, \ldots, M)$ corresponds to the diagonal components of the Fubini-Study metric of the ansatz states through the relation $G_i(\vec\theta) = 4 g_{i,i}(\vec\theta)$. Then, we obtain the following lower and upper bounds on the error, which are proved in Appendix B 4:

Theorem 3. Let all the error probabilities satisfy $p_i < 1$ $(i = 1, \cdots, M_{\mathrm{tot}})$. Then, the error $\epsilon_0(\vec\theta)$ is lower bounded as
$\epsilon_0(\vec\theta) \ge (E_1 - E_0) \sum_{k=1}^{M_{\mathrm{tot}}} \sum_{i_1 < \cdots < i_k} \Big(\prod_{a=1}^{k} p_{i_a}\Big) \Big(\prod_{j \notin \{i_1,\cdots,i_k\}} (1 - p_j)\Big)\, G_{i_1,\cdots,i_k}(\vec\theta) + R_L(\vec\theta), \quad (31)$
where $R_L(\vec\theta)$ is a remainder which vanishes as the noiseless precision $\delta(\vec\theta) = C(\vec\theta) - E_0$ tends to zero. Similarly, the error is upper bounded as
$\epsilon_0(\vec\theta) \le (E_{\max} - E_0) \sum_{k=1}^{M_{\mathrm{tot}}} \sum_{i_1 < \cdots < i_k} \Big(\prod_{a=1}^{k} p_{i_a}\Big) \Big(\prod_{j \notin \{i_1,\cdots,i_k\}} (1 - p_j)\Big)\, G_{i_1,\cdots,i_k}(\vec\theta) + R_U(\vec\theta), \quad (33)$
where $R_U(\vec\theta)$ is a remainder containing the contribution $\prod_i (1 - p_i)\, \delta(\vec\theta)$ of the error-free circuit, which also vanishes as $\delta(\vec\theta) \to 0$.

In particular, let us apply Theorem 3 at a minimal point $\vec\theta^*$. Here, we again assume that the parameters giving the minima of the noisy cost function do not significantly deviate from the noiseless ones [28]. Then, the noiseless precision $\delta(\vec\theta^*)$ should be small enough for a successful minimization, and hence the terms $R_L(\vec\theta^*)$ and $R_U(\vec\theta^*)$ are negligible.
Then, the bounds (31) and (33) are characterized by the spectrum of H and the sensitivity $G_{i_1,i_2,\cdots,i_k}(\vec\theta^*)$ of the ansatz as
$(E_1 - E_0)\, S(\vec\theta^*) \lesssim \epsilon_0(\vec\theta^*) \lesssim (E_{\max} - E_0)\, S(\vec\theta^*), \qquad S(\vec\theta^*) := \sum_{k=1}^{M_{\mathrm{tot}}} \sum_{i_1 < \cdots < i_k} \Big(\prod_{a=1}^{k} p_{i_a}\Big) \Big(\prod_{j \notin \{i_1,\cdots,i_k\}} (1 - p_j)\Big)\, G_{i_1,\cdots,i_k}(\vec\theta^*). \quad (35)$
It is noted that $E_{\max}$ in the upper bound can be replaced by the maximum eigenvalue of the eigenspace accessible by the ansatz, according to the derivation of the bound. Hence, the upper bound shows how the strategy of restricting the expressiveness of the ansatz can be beneficial for reducing the error due to the noise. The lower bound shows that a larger gap $E_1 - E_0$ implies a larger error. Since a larger spectral gap is considered to be a key to relaxing the computational complexity of estimating the ground state energy in general [42], which is actually the case in some settings [43][44][45], this fact implies a trade-off between the hardness of the optimization and the sensitivity to the noise.
Although it is impractical to calculate all the $2^{M_{\mathrm{tot}}}$ terms with $G_{i_1,i_2,\cdots,i_k}(\vec\theta^*)$ $(k = 1, 2, \cdots, M_{\mathrm{tot}})$ in Eq. (35), we obtain the following rough bounds keeping only the terms with k = 1, because $0 \le G_{i_1,i_2,\cdots,i_k}(\vec\theta) \le 1$ holds for any $\vec\theta$:
$(E_1 - E_0) \sum_{i=1}^{M_{\mathrm{tot}}} p_i \prod_{j \ne i} (1 - p_j)\, G_i(\vec\theta^*) \lesssim \epsilon_0(\vec\theta^*) \lesssim (E_{\max} - E_0)\left[\sum_{i=1}^{M_{\mathrm{tot}}} p_i \prod_{j \ne i}(1 - p_j)\, G_i(\vec\theta^*) + 1 - \prod_{i=1}^{M_{\mathrm{tot}}}(1 - p_i) - \sum_{i=1}^{M_{\mathrm{tot}}} p_i \prod_{j \ne i}(1 - p_j)\right]. \quad (36)$
We remark that calculating the coefficients $G_i(\vec\theta)$ is not so expensive. In fact, if we assume that the (virtual and optimizing) parameters $\theta_i$ are sorted in ascending order of application of their corresponding parametric gates, then $|\langle \varphi(\vec\theta + \pi \vec e_i)|\varphi(\vec\theta)\rangle|^2$ can be calculated with the shallower circuit truncated at the i-th gate, because the gates after it cancel in the inner product. Collecting the first-order terms with respect to the error probabilities $p_i$ in Eq. (35), we obtain the respective leading-order terms $(E_1 - E_0)\sum_i p_i G_i(\vec\theta^*)$ and $(E_{\max} - E_0)\sum_i p_i G_i(\vec\theta^*)$ of the lower and the upper bounds in Eq. (35). We remark that the rough bounds (36) include these leading-order terms and only drop a part of the higher-order terms of Eq. (35). Hence, the rough bounds (36) capture the main part of Eq. (35) for small error probabilities. As the coefficients $G_i(\vec\theta^*)$ of the leading-order terms coincide with the diagonal components of the Fubini-Study metric, the geometric structure of the ansatz is connected to the sensitivity of the cost function to the noise. In particular, the bounds imply the following trade-off relation. Although small $G_i(\vec\theta^*)$ is better for the noise sensitivity, it then becomes hard to calculate the metric itself, which implies the hardness of metric-aware optimization methods such as the natural gradient [46][47][48]. On the other hand, it was shown that the average convergence speed, in terms of the optimization steps of SGD, can be faster for a smaller metric [32]. This result implies a possibility that a small metric simultaneously improves both the sensitivity to the noise and the convergence speed of the optimization. However, the flat landscape due to the small metric may rather have a bad effect on the optimization due to the high precision required to determine the gradient.
In fact, the measurement number and the variance are not taken into account in the analysis of the convergence speed in Ref. [32].
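The overlap-based coefficients are indeed cheap to evaluate by statevector simulation. The sketch below is a minimal two-qubit example of our own (the gate choices are hypothetical, not from the paper); it computes $G_{i_1,\cdots,i_k}(\vec\theta) = 1 - |\langle\varphi(\vec\theta + \pi\sum_a \vec e_{i_a})|\varphi(\vec\theta)\rangle|^2$ directly from the definition:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

# Generators A_i of the two parametric gates (A_i^2 = I).
gens = [np.kron(X, I2), np.kron(I2, X)]

def rot(A, t):
    """exp(-i t A / 2) for A with A^2 = I."""
    return np.cos(t / 2) * np.eye(4) - 1j * np.sin(t / 2) * A

def state(thetas):
    psi = np.zeros(4, dtype=complex)
    psi[0] = 1.0                      # |00>
    psi = rot(gens[0], thetas[0]) @ psi
    psi = CNOT @ psi                  # non-parametric gate W
    psi = rot(gens[1], thetas[1]) @ psi
    return psi

def G(thetas, idx):
    """G_{i_1,...,i_k}(theta) = 1 - |<phi(theta + pi sum_a e_{i_a}) | phi(theta)>|^2."""
    shifted = np.array(thetas, dtype=float)
    shifted[list(idx)] += np.pi
    return 1 - abs(np.vdot(state(shifted), state(thetas)))**2

thetas = [0.3, 1.1]
print(G(thetas, [0]), G(thetas, [1]), G(thetas, [0, 1]))
```

For the last gate, the π-shift reduces to applying its generator, so $G_1 = 1 - \langle A_1\rangle^2$ on the output state, consistent with the Fubini-Study relation $G_i = 4 g_{i,i}$.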
We also have the following rougher upper bound:
$\epsilon_0(\vec\theta^*) \lesssim (E_{\max} - E_0)\left(1 - \prod_{i=1}^{M_{\mathrm{tot}}} (1 - p_i)\right) \le (E_{\max} - E_0) \sum_{i=1}^{M_{\mathrm{tot}}} p_i, \quad (38)$
which is derived by further applying $G_i(\vec\theta) \le 1$ to the rightmost side of Eq. (36) and observing that $1 - \prod_i (1 - p_i) \le \sum_i p_i$. Eq. (38) is also available with $E_{\max} - E_0$ replaced by the upper bound $2\|H\|$ if that is accessible. In this way, we can use Eq. (38) for an easy check of the impact of the noise under given error probabilities $p_i$, using only accessible quantities.
As another approach to roughly estimating the lower bound, it is reasonable to assume that most of the G_{i_1,i_2,…,i_k}(θ) are not close to zero, since the state should change considerably as k gates are inserted into the circuit, unless some specific structure exists. Then, assuming that G_{i_1,i_2,…,i_k}(θ*) > c holds for some constant c > 0, we obtain a corresponding lower bound. In particular, if we further assume c ≈ 1, we arrive at the roughly estimated lower bounds (40) and (41) of the precision. If we know some estimate or lower bound of E_1 − E_0, we can use Eq. (41) for an easy estimation of a lower bound of the precision. It should be noted that Eq. (41) is not always true, even if δ(θ*) is small, since it is based on the assumption G_{i_1,i_2,…,i_k}(θ*) ≈ 1, unlike upper bound (38). From the roughly estimated lower bound (40), in order to achieve a given precision ε*, the error probabilities need to satisfy condition (42).
In practice, the error in a part of the gates dominates that of the others; e.g., the error in the two-qubit gates usually dominates the single-qubit error. In such a case, the following proposition holds:
Proposition 4. Let the error probability p_j of M_dom stochastic noise channels out of M_tot dominate the others, and scale with p as p_j = Θ(p). Then, in order to achieve a given precision ε* with ε* < E_1 − E_0, the scale p of the error probability must satisfy Eq. (43).
Proof. Under the assumptions, condition (42) for achieving the precision ε* reduces to a condition on the scale p alone. Hence, for ε* < E_1 − E_0, the scale of the error rate p must satisfy Eq. (43).
Eq. (43) gives an order estimation of the error level necessary to achieve a desired precision ε*. In particular, if E_1 − E_0 = Ω(1), achieving the precision ε* requires p = O(ε*/M_dom). This analysis can be straightforwardly applied to the local depolarizing noise model, with the error probability q scaled in the same way as in the derivation of Eq. (21). For example, an error rate q ∼ 10^{-5} or less is required when M_dom ∼ 100 and ε* ∼ 10^{-3}, which is 10² times larger than the order shown below Eq. (23) to be sufficient for the same precision. This stringent requirement seems reasonable in the absence of any error mitigation.
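The necessary error scale can be solved for explicitly if one assumes the lower bound takes the closed form (E_1 − E_0)(1 − (1 − p)^{M_dom}) for M_dom equally dominant channels. This closed form is our hypothetical reading, consistent with the quoted numbers (q ∼ 10⁻⁵ for M_dom ∼ 100 and ε* ∼ 10⁻³); the paper itself only states the Θ/O scaling.

```python
def necessary_error_scale(eps_star, gap, m_dom):
    # Solve gap * (1 - (1 - p)**m_dom) = eps_star for p: the largest error
    # probability per dominant channel compatible with precision eps_star
    # (hypothetical closed form; only the order of magnitude is meaningful).
    assert eps_star < gap
    return 1.0 - (1.0 - eps_star / gap) ** (1.0 / m_dom)

# Example from the text: eps* ~ 1e-3, M_dom ~ 100, gap = Omega(1) -> p ~ 1e-5.
p = necessary_error_scale(1e-3, 1.0, 100)
```

For small ε*/gap this reduces to p ≈ ε*/(M_dom · gap), matching the O(ε*/M_dom) scaling for gaps of order one.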

V. NUMERICAL SIMULATION
We demonstrate our results by numerical simulations using Qulacs [49]. In particular, we focus on the rough bounds (36), (38) and (41), which are practically accessible. We remark that none of these bounds is exact, because they are obtained by neglecting the terms R_L(θ) and R_U(θ) in the exact bounds of Theorem 3. That approximation is based on the assumption that the noiseless precision δ(θ) is small enough. To reflect this fact, we call the upper and lower bounds in Eq. (36) the "rough" upper and lower bounds, respectively. We call upper bound (38) the "rougher" upper bound, as it is rougher than the rough upper bound in Eq. (36). On the other hand, we call lower estimation (41) the "extremely rough" lower bound, as it can be violated even when the noiseless precision is small enough, because it rests on the additional rough assumption G_{i_1,i_2,…,i_k}(θ*) ≈ 1. These names distinguish the bounds by their degree of roughness.
FIG. 1. The 4-qubit alternating layered circuit used in our numerical simulations. Each layer of the circuit has separated entangling blocks. A single layer is composed of single-qubit Pauli X, Y, Z rotations followed by entanglers composed of controlled-Z gates acting on adjacent qubits within a single entangling block. Each entangling block moves to the next pair of qubits when moving to the next layer, where the boundary qubits are connected.
We implement our simulations for VQE tasks of a 4-qubit Heisenberg antiferromagnetic spin chain and a toy model Hamiltonian associated with a variational compiling task [41]. For both simulations, we use the alternating layered ansatz (ALT) [50] shown in Fig. 1 as our parameterized quantum circuit U(θ), where R_P(θ_i) = exp[−iθ_i P/2] is a single-qubit Pauli rotation gate with P = X, Y, Z. As explained in the caption of Fig. 1, the ALT has entangling blocks in each layer which alternate layer by layer. It has been shown that the ALT has both good expressibility and trainability [51], which motivates our choice of the ansatz. For both models, we implement our simulations for the circuit with 4 layers. As for the noise model, a single(two)-qubit depolarizing channel is inserted after every single(two)-qubit gate; we call these errors single(two)-qubit errors. A single-qubit depolarizing channel on every qubit is also inserted after the final layer as a model of the imperfection of the measurements, which we call the readout error. In the following, the single(two)-qubit or readout error probability p refers to the error probability p of the corresponding depolarizing channel D_{k,p} defined in Eq. (8).
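The ALT ansatz of Fig. 1 can be sketched as a plain-numpy statevector simulation. The block pattern below, {(0,1),(2,3)} alternating with {(1,2),(3,0)}, is our reading of the caption, not the authors' code:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
n = 4  # qubits

def op_on(qubit, gate):
    # Embed a single-qubit gate at position `qubit` of an n-qubit register.
    out = np.array([[1.0 + 0j]])
    for q in range(n):
        out = np.kron(out, gate if q == qubit else np.eye(2))
    return out

def cz_on(q1, q2):
    # Controlled-Z between qubits q1 and q2 (diagonal in the computational basis).
    U = np.eye(2 ** n, dtype=complex)
    for b in range(2 ** n):
        if (b >> (n - 1 - q1)) & 1 and (b >> (n - 1 - q2)) & 1:
            U[b, b] = -1.0
    return U

def alt_state(thetas):
    # thetas: shape (layers, n, 3), angles of the X, Y, Z rotations per qubit.
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    for ell in range(thetas.shape[0]):
        for q in range(n):
            for P, th in zip((X, Y, Z), thetas[ell, q]):
                A = op_on(q, P)  # R_P(th) = exp(-i th P / 2) on qubit q
                psi = (np.cos(th / 2) * np.eye(2 ** n) - 1j * np.sin(th / 2) * A) @ psi
        # Entangling blocks, shifted by one qubit on alternate layers.
        pairs = [(0, 1), (2, 3)] if ell % 2 == 0 else [(1, 2), (3, 0)]
        for q1, q2 in pairs:
            psi = cz_on(q1, q2) @ psi
    return psi

rng = np.random.default_rng(0)
psi = alt_state(rng.uniform(0, 2 * np.pi, size=(4, n, 3)))
```

In an actual experiment one would build this circuit in Qulacs; the dense-matrix version above is only meant to make the layer structure concrete.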
To calculate bounds (36), (38) and (41), we decompose the k-qubit depolarizing channels into 4^k − 1 stochastic channels and use the corresponding probabilities given by Eq. (9). For simplicity, we do not treat fluctuations in the optimizing parameters. In our simulations, we exactly calculate the noisy cost function by computing the density matrix of the noisy circuits. To obtain a minimal point of the noisy cost function, we first find a good minimizer of the noiseless cost function by the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm [52][53][54][55] via SciPy [56], starting from randomly chosen initial parameters. We repeat this optimization until a good solution is reached, to avoid becoming stuck in local minima. Minimization of the noisy cost function is then done by the BFGS algorithm using this good parameter as the initial parameter. Although this approach of course does not work in practice, we use it because our purpose is to demonstrate our bounds for estimating the precision of the noisy cost function in approximating E_0 at an in-principle-achievable good parameter.
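As an illustration of such a decomposition, the single-qubit depolarizing channel in the common "white noise" convention can be rewritten as a stochastic Pauli channel over the 4¹ − 1 = 3 non-identity Paulis. Whether this convention matches the definition in Eq. (8) is an assumption here; the identity itself is standard.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# A random single-qubit density matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = A @ A.conj().T
rho /= np.trace(rho).real

p = 0.01
# "White noise" form of the single-qubit depolarizing channel.
rho_dep = (1 - p) * rho + p * np.eye(2) / 2
# Equivalent stochastic form: apply X, Y or Z, each with probability p/4
# (3p/4 in total over the 3 non-identity Paulis).
q = 3 * p / 4
rho_stoch = (1 - q) * rho + (q / 3) * (X @ rho @ X + Y @ rho @ Y + Z @ rho @ Z)
```

The equality follows from the Pauli twirl identity XρX + YρY + ZρZ = 2 tr(ρ) I − ρ.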

A. Heisenberg spin chain
We consider the VQE of the 4-qubit Heisenberg antiferromagnetic spin-1/2 chain with periodic boundary conditions, whose Hamiltonian is given as

H = Σ_{i=1}^{4} (X_i X_{i+1} + Y_i Y_{i+1} + Z_i Z_{i+1}),

where P_i (P = X, Y, Z) is the Pauli operator acting on the i-th qubit, with the identification P_5 = P_1. Our task is to obtain a good parameter θ* such that C_noisy(θ*) approximates the ground-state energy E_0 of this H. We used the ALT in Fig. 1 with 4 layers as our ansatz. This ALT can actually achieve a very good noiseless precision δ(θ*) < 10^{-6}, which is verified by optimizing the noiseless cost function with the BFGS algorithm. Fig. 2 shows the dependence of the error ε_0(θ*) on the error rate q in comparison with the rough bounds (36), (38) and (41). Here, the error probabilities of the two-qubit error and the readout error are q, and the error probability of the single-qubit error is 10^{-1}q. To calculate the bounds, we used the exact values E_1 − E_0 = 4.000 and E_max − E_0 = 12.000 for the 4-qubit Heisenberg spin chain. More practically, we can instead use an accessible upper bound of 2‖H‖ to compute upper bounds (36) and (38); the bounds obtained from this upper bound of ‖H‖ are shown by red lines in Fig. 2. Our rough bounds indeed capture the scaling of the true error with the error rate well. In particular, the extremely rough lower bound (41) works well as a lower bound of ε_0(θ*), despite the fact that Eq. (41) is not always true, being based on the rough assumption G_{i_1,i_2,…,i_k}(θ*) ≈ 1, which can be violated. This behavior is expected from the fact that the deviation of Eq. (41) from the true lower bound (35), caused by the deviation of G_{i_1,i_2,…,i_k}(θ*) from 1, is suppressed when the error probabilities p_i are small and the gap E_1 − E_0 is not too large. We also remark that bounds (36), which are always true under small δ(θ*), work similarly well.
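The exact spectral values quoted above can be reproduced by direct diagonalization of this Hamiltonian. A small self-contained numpy check (our own, not the authors' code):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
n = 4

def bond(P, i):
    # P_i P_{i+1} with periodic boundary conditions (qubits 0..n-1).
    ops = [np.eye(2, dtype=complex)] * n
    ops[i] = P
    ops[(i + 1) % n] = P
    out = np.array([[1.0 + 0j]])
    for o in ops:
        out = np.kron(out, o)
    return out

# H = sum_i (X_i X_{i+1} + Y_i Y_{i+1} + Z_i Z_{i+1})
H = sum(bond(P, i) for P in (X, Y, Z) for i in range(n))
evals = np.linalg.eigvalsh(H)
E0, Emax = evals[0], evals[-1]
E1 = evals[evals > E0 + 1e-8].min()
gap, spread = E1 - E0, Emax - E0
```

The diagonalization gives E_1 − E_0 = 4 and E_max − E_0 = 12, matching the values used for the bounds.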

B. Toy model
Next, in a similar way to Ref. [41], we consider a toy target operator which has an exact solution parameter θ*, selected uniformly at random. That is, we generate artificial eigenstates |ψ_i⟩ := U(θ*)|i⟩ (i = 0, 1, …, 2^n − 1) for a given ansatz circuit U(θ), where |i⟩ = |i_1⟩ ⊗ |i_2⟩ ⊗ ··· ⊗ |i_n⟩ with the binary expansion i = i_1 i_2 ··· i_n of i. We fix the smallest, the second smallest and the largest eigenvalues E_0, E_1 and E_max = E_{2^n−1}, respectively; the remaining eigenvalues between E_1 and E_{2^n−1} are randomly selected. Then, the target operator is

H = Σ_{i=0}^{2^n−1} E_i |ψ_i⟩⟨ψ_i|.

In this way, the noiseless precision δ(θ*) = 0 is always satisfied. In our simulation, we consider the case E_0 = 1.0 and E_max = 100. We use the same ansatz shown in Fig. 1 with 4 qubits and 4 layers. Fig. 3 shows the dependence of the error ε(θ*) in the optimized noisy cost function on the error rate q in comparison with the rough bounds (36), (38) and (41). Here, the error probabilities of the two-qubit error and the readout error are q, and the error probability of the single-qubit error is 10^{-1}q. E_1 − E_0 is set to 50. All of our rough bounds capture well the scaling of the true error with the error rate, similarly to the Heisenberg spin chain.
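The toy-model construction can be sketched with a Haar-random unitary standing in for the optimized ansatz U(θ*) (a stand-in, since the actual ALT circuit is not reproduced here); by construction the noiseless precision at the exact solution is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
dim = 2 ** n

# Stand-in for the optimized ansatz unitary U(theta*): Haar-random via QR.
A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
Q, R = np.linalg.qr(A)
U = Q * (np.diag(R) / np.abs(np.diag(R)))  # fix column phases

# Fix E0, E1, Emax (here E1 - E0 = 50) and draw the rest uniformly in (E1, Emax).
E0, E1, Emax = 1.0, 51.0, 100.0
middle = np.sort(rng.uniform(E1, Emax, size=dim - 3))
energies = np.concatenate(([E0, E1], middle, [Emax]))

# H = sum_i E_i |psi_i><psi_i| with |psi_i> = U|i>.
H = (U * energies) @ U.conj().T

ground = U[:, 0]  # |psi_0> = U|0...0>
delta = np.real(ground.conj() @ H @ ground) - E0  # noiseless precision, = 0
```

Because the ansatz state at θ* is an exact eigenvector, δ(θ*) vanishes up to floating-point error, so any residual error in the noisy cost function is attributable to the noise alone.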
On the other hand, Fig. 4 shows the dependence of the error ε(θ*) in the optimized noisy cost function on the spectral gap E_1 − E_0, in comparison with the rough bounds (36), (38) and (41). Here, the error probability of the single-qubit error is set to 10^{-3}, and the error probabilities of the two-qubit error and the readout error are 10^{-2}. According to Fig. 4, up to moderate sizes of the gap, all the bounds, including the extremely rough lower bound (41), work well. However, for large gaps, the extremely rough lower bound (41) breaks down and overestimates the error, because the impact of the error in the rough approximation G_{i_1,i_2,…,i_k}(θ*) ≈ 1 is amplified by the large value of the gap E_1 − E_0. Fig. 4 also shows that the error tends to increase as the gap grows, as implied by the lower bound.

VI. CONCLUSION
We have established analytic formulae for estimating the error in the cost function of VQAs due to Gaussian noise. We can also apply our formulae to a wide class of stochastic noise, including the depolarizing noise, via its equivalence with Gaussian noise. The first main result, Theorem 2, gives the leading-order approximation of the error ε(θ) in the cost function due to the noise. The appearance of the Hessian of the cost function as the coefficient of the noise effect implies a trade-off relation between the hardness of optimizing the parameters and the noise resilience of the cost function. Based on this formula, we have derived an order estimation of the error probability sufficient to achieve a given precision. This estimation demands a stringently small error probability if no error mitigation is taken into account, partially because it is nothing but a sufficient condition to achieve the given precision. On the other hand, an estimation of the order of the error probability necessary to achieve a given precision is provided, based on the lower bound Eq. (41). Though this estimation indeed allows a larger error probability, it is still stringent without any error mitigation.
Theorem 3 gives upper and lower bounds on the error ε_0(θ) in approximating E_0. Especially at a minimal point, these bounds show how the spectrum of the target operator and the geometry of the ansatz affect the sensitivity of the cost function to the noise. The bounds also imply further trade-off relations between the hardness of the optimization and the noise resilience of the cost function, attributed to the spectral gap or the smallness of the Fubini–Study metric of the ansatz. Although it is impractical to calculate the full expression Eq. (35) of the bounds, we have also shown rough bounds which are easier to calculate. The numerical simulations of the VQE of the Heisenberg spin chain and the toy model Hamiltonian have demonstrated the usefulness of our rough bounds. These rough estimations may be utilized as a simple inspection of the order of magnitude of the impact of the noise.
A highlight of the applications of our formula is the proposal of a quantum error mitigation method, shown in Sec. III B. The essence of this method is the cancellation of the error based on the expansion of the error with respect to the fluctuations of the parameters, including the virtual parameters. An advantage of this method is that we only use noisy estimations of the derivatives of the cost function to mitigate the error: we need neither to change the noise strength as in the extrapolation method [33,34], nor to sample various circuits as in the probabilistic error cancellation [34,35]. Although the effectiveness of this method is still inconclusive, since we have only an estimation of the sufficient order of the error probability for it to work, the method may well turn out to be efficient. It may also be possible to improve it by taking higher-order expansions into account. In future work, we will further analyze this error mitigation method, including the effect of finite sampling and a comparison with other error mitigation methods. To take the statistical error due to finite sampling into account, the effect of the noise on the variance of the cost function should also be considered.

VII. CODE AVAILABILITY
Code to reproduce the numerical simulations in this work is available at [62]. A Python module to compute the rough bounds (36), (38) and (40) is also provided.
G_{i_1,…,i_k}(θ): sensitivity of the state |φ(θ)⟩ to the π-shift of the parameters θ_{i_1}, …, θ_{i_k}, defined in Eq. (30)

Target operators and cost function
H_l, |φ_l⟩ (l = 1, 2, …, L): L target Hermitian operators and input states, respectively
E_{0,l}: minimum eigenvalue of H_l
E_{max,l}: largest eigenvalue of H_l
H, |φ⟩: single target Hermitian operator and single input state, respectively
E_0: minimum eigenvalue of H
E_1: second smallest eigenvalue of H
E_max: largest eigenvalue of H
C(θ, ξ): cost function with explicit dependence on the virtual parameters ξ
C_noisy(θ): noisy cost function
θ*: minimal point of the cost function

Error and precision
ε(θ): error due to the noise in the cost function evaluated at θ
ε_0(θ): deviation of the noisy cost function from the minimum eigenvalue E_0 at θ
δ(θ): noiseless precision C(θ) − E_0 of the cost function at θ for the minimization task
ε*: given desired precision
R_L(θ): terms related with the noiseless precision δ(θ) in lower bound (31)

In this section, we present proofs of the main theorems and propositions in the main text. Here, we prove Proposition 1.
Proof. We define a map as in Eq. (B1). Using the relation (B2), we obtain Eq. (B3). From Eq. (B3), we obtain the equivalence between the Gaussian noise channel G_{B_ν,σ_{SNC,ν}} with respect to B_ν, with variance σ²_{SNC,ν}, and the given stochastic noise channel E_{B_ν,p_ν}.

Proof of Theorem 1
Here, we give a proof of Theorem 1.
Proof. Let us introduce the multi-index notation for α ∈ N^{M_tot} and θ ∈ R^{M_tot}: θ^α := Π_i θ_i^{α_i}, |α| := Σ_i α_i, and α! := Π_i α_i!. The partial derivatives of a function f are denoted as D^α f := ∂^{|α|} f / (∂θ_1^{α_1} ⋯ ∂θ_{M_tot}^{α_{M_tot}}). By Taylor expanding the integrand C(θ + η), we obtain from the definition (11) of the noisy cost function

C_noisy(θ) = ∫ C(θ + η) dP(η) = Σ_α (D^α C(θ)/α!) ∫ η^α dP(η),

where we denote Π_i dP_i(η_i) by dP(η).
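As a sanity check of this moment expansion in a solvable single-parameter case: for C(θ) = cos θ and Gaussian noise η ∼ N(0, σ²), the integral has the closed form e^{−σ²/2} cos θ, which a direct quadrature reproduces (our own illustrative example).

```python
import numpy as np

# E_eta[cos(theta + eta)] for eta ~ N(0, sigma^2) equals exp(-sigma^2/2) cos(theta);
# compare the closed form against a fine Riemann sum over the Gaussian density.
sigma, theta = 0.3, 0.8
etas = np.linspace(-8 * sigma, 8 * sigma, 200001)
gauss = np.exp(-etas**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
noisy_quad = np.sum(gauss * np.cos(theta + etas)) * (etas[1] - etas[0])
noisy_closed = np.exp(-sigma**2 / 2) * np.cos(theta)
```

The damping factor e^{−σ²/2} is exactly what the moment series resums to for a cosine-shaped cost.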
Because A_i² = 1, the second derivatives read

∂²C(θ)/∂θ_i² = [C(θ + π e_i) − C(θ)]/2, (B9)

where e_i denotes the vector whose i-th component is 1 and whose other components are 0. A similar relation is used in Refs. [57][58][59][60]. By recursively applying relation (B9), it turns out that the derivatives D^{2α}C(θ) take the form of linear combinations of the cost function evaluated at certain shifted parameters θ^{(2)}_{i,1}, …. Since Σ_{l=1}^{L} E_{0,l} ≤ C(θ) ≤ Σ_{l=1}^{L} E_{max,l} holds for any parameter θ, we obtain the bound (B11) [61]. Then, applying Eq. (B11) to Eq. (B8), we obtain the desired estimate, where the last equality follows from the assumed nonnegativity of the moments ∫ η^{α_i} dP_i(η_i) ≥ 0 (the even moments are always positive). Moreover, rewriting the result in terms of μ_i^{(α_i)}, the α_i-th moment of P_i defined in the main text, and using the Taylor expansion (13) of the mgf in the last equality, we obtain Eq. (14).
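The π-shift relation for the second derivative, valid for any gate generated by A with A² = 1, can be checked numerically on a one-qubit toy cost (our own example, not from the paper): the finite-difference second derivative coincides with half the difference under a π shift.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def cost(theta):
    # C(theta) = <0| R(theta)^dag Z R(theta) |0> with R = exp(-i theta X / 2)
    R = np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X
    psi = R @ np.array([1.0, 0.0], dtype=complex)
    return np.real(np.vdot(psi, Z @ psi))

theta, h = 0.7, 1e-4
# Central finite-difference second derivative ...
second_fd = (cost(theta + h) - 2 * cost(theta) + cost(theta - h)) / h**2
# ... versus the exact pi-shift rule (C(theta + pi) - C(theta)) / 2.
second_shift = (cost(theta + np.pi) - cost(theta)) / 2
```

Here C(θ) = cos θ, so both expressions evaluate to −cos θ up to discretization error.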

Proof of Theorem 3
We define the shifting map S_i, which maps the cost function to the shifted one as S_i C(θ) = C(θ + π e_i). Then, the noisy cost function can be written as

C_noisy(θ) = Π_{i=1}^{M_tot} (1 − p_i + p_i S_i) C(θ). (B19)

Based on Eq. (B19), we can expand the error ε_0(θ) as in Eq. (B20). Notably, in this expansion the precision δ(θ) of the noiseless cost function is separated from the error due to the noise. Furthermore, the difference of the shifted cost function Π_{l=1}^{k} S_{i_l} C(θ) from E_0 can be estimated by the relation stated in Lemma 2. Combining Eq. (B20) with Lemma 2, we obtain Theorem 3.
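The product form over π-shift channels can be verified on a toy cost function by expanding the operator product channel by channel and comparing it with the direct probabilistic mixture (a self-contained illustration with two channels; the cost function is an arbitrary smooth stand-in).

```python
import numpy as np
from itertools import product

def cost(theta):
    # Arbitrary smooth stand-in cost function of two parameters.
    return np.cos(theta[0]) + 0.5 * np.cos(theta[0]) * np.sin(theta[1])

p = [0.1, 0.03]  # pi-shift probabilities of two stochastic channels
theta = np.array([0.4, -1.2])

# Direct mixture: average the cost over all shift patterns with their weights.
noisy = 0.0
for bits in product([0, 1], repeat=2):
    w = np.prod([p[i] if b else 1 - p[i] for i, b in enumerate(bits)])
    noisy += w * cost(theta + np.pi * np.array(bits))

# Product form prod_i (1 - p_i + p_i S_i) C(theta): expand channel by channel,
# tracking the coefficient of each shift pattern.
terms = {(0, 0): 1.0}
for i in range(2):
    new = {}
    for pattern, coef in terms.items():
        new[pattern] = new.get(pattern, 0.0) + coef * (1 - p[i])
        shifted = tuple(b + (j == i) for j, b in enumerate(pattern))
        new[shifted] = new.get(shifted, 0.0) + coef * p[i]
    terms = new
product_form = sum(c * cost(theta + np.pi * np.array(pat)) for pat, c in terms.items())
```

The two evaluations agree exactly, which is the content of the product-form expansion underlying Eq. (B20).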
To prove Lemma 2, we first establish the following relation between the fidelity and the expectation value of the target operator.