Efficient estimation of trainability for variational quantum circuits

,


I. INTRODUCTION
Inspired by the success of machine-learning methods, variational quantum algorithms [1][2][3] have emerged as a promising way to harness the power of quantum computing in various domains ranging from quantum chemistry [4][5][6] to combinatorial optimization problems [7][8][9].These algorithms use the output of parameterized quantum circuits as variational ansätze, whose parameters are classically optimized through gradient-based methods.
Variational quantum circuits can suffer from trainability issues caused by the existence of barren plateaus [10], a limitation that has been extensively studied in the recent literature .It is characterized by an exponential vanishing of the cost function's gradient with the system size that makes training variational quantum circuits impossible for a large number of qubits.Barren plateaus can originate from various and fundamentally different phenomena.Their emergence was first shown in Ref. [10] for 2-designs (random unitary transformation matching the Haar distribution up to the second moment).Recent works linked barren plateaus to the expressibility of the ansatz [11] as well as noise [12] and entanglement.In particular, the authors of Ref. [13] showed that for architectures that can be split into a hidden and a visible subsystem, such as quantum Boltzmann machines or feed-forward quantum neural networks, an excess of entanglement between the two subsystems would result in a highly mixed state for the visible subsystem.This can lead to a flat landscape for the cost function.The effect of the structure of the cost function on the appearance of barren plateaus was also investigated in other works [14,15], and it was shown that global cost functions are more prone to exhibit barren plateaus.Note that shallow models such as quantum kernel machines [39][40][41][42][43] and reservoir computing models [44][45][46][47], while often easier to train than variational quantum algorithms, might also suffer from trainability issues of a similar nature [48].
Numerous investigations have proposed strategies to address the barren-plateau issue.In the context of entanglement-induced barren plateaus, most strategies rely on limiting the amount of entanglement [16][17][18][19][20].
In this paper, we propose an alternative approach to the problem by providing an efficient method to estimate the average gradients and their variance for a wide class of variational quantum algorithms.By studying the quantum channel associated to a random single-qubit rotation, we prove that under some simple conditions the first and second moments can be expressed as mixedunitary channels [49] composed of Clifford gates [50].Upon some additional general assumptions for the random angles distribution, we demonstrate that this allows us to exactly map randomly initialized circuits composed of Clifford gates and parametrized rotations to an ensemble of Clifford circuits.Moreover, we prove that the obtained ensemble can be efficiently sampled to compute quantities of interest, such as the variance of the gradient or the average of the cost function over the initial random parameters.Making use of the celebrated Gottesman-Knill theorem [51,52], we analytically prove the efficiency of our method that can be implemented on a classical computer with a complexity scaling polynomially in both the number of variational parameters and the system size.In addition, we show some numerical experiments to illustrate our method on examples of random circuits and faithfully reproduce the exponential suppression of the variance first found in Ref. [10] with polynomial resources and for circuits acting on up to 100 qubits.
Figure 1.A schematic representation of the mapping from a parameterized quantum circuit with random parameters to Clifford approximant circuits for first-order quantities (quantities that requires only the knowledge of the rotation 1-fold channels to be computed, see App.B).For a circuit with M parameters, a sample size of the order of M/ϵ 2 is enough to get an approximation of the average on the initial circuit with a precision ϵ on the observables mean values (see App. C).

II. THEORETICAL FRAMEWORK A. Variational problem
In variational quantum algorithms, a parameterized unitary transformation Û (θ) acting on n qubits is used as a variational ansatz to achieve a task expressed as the minimization of a cost function for some observable Ô and some initial n-qubit state ρ.This formulation is general and encompasses typical tasks, such as the preparation of a target state |ψ⟩ (setting Ô = − |ψ⟩⟨ψ|) or ground state search for some Hamiltonian Ĥ (setting Ô = Ĥ).The considered parameterized unitaries are typically composed of a succession of parameterized gates and fixed layers.Here we consider a generic ansatz of the form where each unitary Ûi (θ i ) = e −i θ i 2 Pi is a single qubit rotation associated to a given Pauli operator Pi ∈ { X, Ŷ , Ẑ}, while the Ŵk are fixed layers composed of a sequence of unparameterized gates that can act on multiple qubits.Upon absorbing Clifford gates in the fixed layers, we will consider the rather general class of circuits based on Z rotations [53].The unitary transformation Û (θ) depends on M continuous parameters gathered in the vector θ = (θ 0 , . . ., θ M −1 ).These rotation parameters can be optimized using classical gradient-descent techniques.The gradient of the cost function with respect to the k-th parameter can be conveniently estimated using the parameter-shift rule [54,55]: where e k is the canonical vector along the component k.
It is worth noticing that the ±π/2 shifts in the parameter θ k can be factored out and seen as an extra Clifford gate added to the fixed layer Ŵk .In fact, remarking that the phase gate Ŝ can be written Ŝ = e i π 4 e −i π 2 Ẑ and assuming Pk = Ẑ, we have: We define Ŵk,± = e ∓i π 4 Ŝ Ŵk such that we can write We denote the shifted unitaries appearing in the parameter-shiftrule.From what precedes we have with

B. Unitary ensembles and t-fold channels
To start the optimization process the rotation angles are randomly initialized according to some probability distribution p(θ).The initialized circuit can then be represented by a unitary ensemble U = { Û , P( Û )}, where P is a probability measure on U.One is often interested in computing averages of quantities that are polynomial of a given order t in the entries of Û .Such quantities can be completely determined by the knowledge of the t-fold channel [56] where ρ is an initial state of t copies of the original nqubit system.As an example, the expected value of the square of the cost function at the initialization can be evaluated using θ (ρ ⊗2 ) Ô⊗2 , where we denote Φ (t) θ the t-fold channel associated to the unitary ensemble { Û (θ), R M ∋ θ ∼ p(θ)}.In App.B we define second-order quantities (respectively first-order quantities) as quantities that can be obtained from the knowledge of the 2-fold (respectively 1-fold) channel.To give a concrete example, E θ C(θ) 2 is a second-order quantity.
More generally, one can characterize the expressivity of a given ansatz by comparing its t-fold channels to the ones obtained for a Haar (uniform) distribution over the whole unitary group [11,57,58].Unitary ensembles whose t-fold channels match the t-fold channels for the Haar measure, the so-called t-designs [59,60], have played a crucial role in the original discovery of the barren plateaus phenomenon [10].Moreover, in multiple cases random quantum circuits are approximate t-designs [61][62][63].For instance, the authors of [61] showed that quantum circuit composed of a polynomial number of gates randomly drawn from a universal set of two-qubit gates and applied to random pairs of qubits are approximate 2-designs.This result have been extended to cases where the gates are applied to nearest-neighbor qubits in [62] and [63].

C. Barren Plateaus
For a unitary ensemble U that describes parameterized ansätze Û (θ) with random continuous parameters θ and a possibly random architecture, a cost function C( Û (θ)) is said to exhibit a barren plateau if the probability of obtaining a gradient that deviates from zero by some ϵ > 0 vanishes exponentially with the system size n.More precisely, P U (|∂ k C| > ϵ) ≤ O(exp(−αn)) for some α > 0 [11].As mentionned earlier, barren plateaus were first found for unitary ensembles forming 2-designs [10] and connections to expressivity [11], noise [12], entanglement [13,[16][17][18][19][20] and the degree of locality of the cost function [14,15] were later discovered.In many cases, the average value of the gradient vanishes exactly, for instance when the rotation parameters are initialized uniformly in [−π, π].However, this does not imply a vanishing of the gradient amplitude on average, and thus does not tell much about the trainability of the model.In this unbiased case, the variance is a relevant quantity.Due to the Chebyshev inequality, one has so that a vanishing variance implies the existence of a barren plateau.On the other hand, a non-vanishing variance guarantees large fluctuations of the initial gradient and thus a good initial trainability, independently of the gradient bias.
For variational quantum algorithms, the gradient is to be estimated through measurements realised on an hardware platform.As argued in Ref. [64], probing an exponentially small gradient requires an exponential precision on the measurements and thus an exponential number of experiments, which is prohibitive.Barren plateaus are often seen as a quantum version of the vanishing gradient phenomenon in classical machine learning.The impact of vanishing gradients on the trainability of classical deep neural network is discussed in Ref. [65].The exact nature of barren plateaus was investigated further by the authors of Ref. [66] in a quantum machine learning context, using a least square loss function and relying on a neural tangent kernel formalism.In particular, the authors discussed the fundamental differences between barren plateaus and classical vanishing gradient, and they show that in a certain over-parameterized regime the training procedure might be robust to noise.Let us notice that variational quantum algorithms can suffer from other issues beyond barren plateaus, related for instance to the number of local minima of the loss landscape [67] or to the complexity associated with the classical optimization procedure [68].Also, higher-order moments may help diagnose the trainability of variational quantum algorithms [69].

III. RESULTS AND DISCUSSION
The main finding of this theoretical work is that under some rather general assumptions on the distribution of the rotation parameter θ, it is possible to map the 1-fold and 2-fold channels of a random rotation RZ (θ) to a finite unitary ensemble of Clifford gates.Moreover, we prove that such mapping allows us to estimate quantities of interest such as the gradient variance using only Clifford circuits.Finally, we illustrate our rigorous proofs through numerical experiments.The detailed mathematical proofs are presented in the Appendix.

A. Exact mapping and efficient sampling
As mentioned earlier, we will focus on the class of variational quantum circuits composed of fixed Clifford gates alternated with single qubit parameterized rotations along the X, Y or Z directions, such as the one depicted in Fig. 1.As explained in Sec.II A, we will restrict our study to rotations along Z, as we can obtain the cases of rotations along Y and X by adding extra Clifford gates to the different fixed layers of the considered ansatz.Let us consider a rotation along the Z-axis with a distribution that is symmetric about the θ = 0 angle [70].We show in App.A that the 1-fold channel corresponding to a first order average can be written as a convex sum of the unitary channels associated to the identity and the Pauli Z gates, as schematically represented on the upper part of Fig. 2. Note that this result has been derived and used in [71] in the case of a uniform probability distribution for θ in order to analyze a variational ansatz through the lens of ZX-calculus.
To compute the 1-fold channel for the randomly initialized ansatz of the form given in Eq. (2) with independent rotation parameters, one can simply compose the 1-fold channels associated to each rotation, interwined with the unitary channels associated to the fixed gates Ŵk .We find that the 1-fold channel of the ansatz is a convex sum of 2 M Clifford unitary channels, where M is the number of rotations.One can view this convex sum as an average over a finite ensemble of Clifford approximant circuits.Examples of such circuits are provided in App.F for a simple architecture similar to the one in Fig. 1.Although the number of Clifford approximant circuits in this ensemble is exponential in the number of parameters, we show in App.C 2 that a number of samples polynomial in M/ϵ 2 is sufficient to approximate the average of an observable expectation value (or more generally of any first-order quantity) to any desired precision ϵ.This result relies on a classical concentration argument, and is schematically represented in Fig. 1 for a simple circuit at the first order.From this, one can estimate the expectation value of the gradient, as it suffices to replace Û (θ) by Û± (θ) (as defined in Sec.II A) in the 1-fold channel definition to obtain the expectation of C(θ ± π/2).This gives the expectation of the gradient thanks to the parameter-shift rule.
We prove in App.A that the 2-fold channel associated to a random Z-rotation is also a linear combination of Clifford channels, provided that the probability distribution is an even function of θ.This result is de-picted in Fig. 2(b).To obtain this mapping, we use the Choi representation of quantum channels [49].The Choi operators representing unitary quantum channels given by tensor products of Z-rotations gates are diagonal.Hence, one can represent these channels by the diagonal coefficients of their associated Choi operators.Using this representation and the linearity of the expectation, we obtain a tractable representation of the average two-fold channel of a Z-rotation.The decomposition is then derived by solving a linear system of equations obtained by identifying the coefficients of the previous representation for the different channels involved.When the inequalities E θ [f + (θ)] ≥ 0 and E θ [f − (θ)] ≥ 0 with f ± (θ) = cos θ(cos θ ± 1) are satisfied, the previous linear combination is in fact a convex sum.Equivalently, the 2-fold channel of a Z-rotation with an even angular probability distribution is a Clifford mixed-unitary channel if The zeros of f + , f − are the angles {kπ/2, k ∈ Z} for which RZ (θ) matches a Clifford gate (see App. A).Indeed, if the distribution of θ is a convex sum of Dirac distributions at these angles, the average over θ becomes a discrete average over the corresponding Clifford unitaries.Hence, the associated 2-fold channel is indeed a convex sum of Clifford channels.One can also verify that the previous conditions are satisfied for distributions that are both even with respect to the angle θ and π-periodic.For example, the uniform distribution is included.In the case of a centered Gaussian distribution, the previous conditions are satisfied if and only if the corresponding width is large enough.
Provided the distributions of the rotation angles satisfy the conditions discussed above, the scheme can be extended to the second order, allowing to approximate second-order quantities, such as the average of the squared cost function E θ C(θ) 2 , using a set of Clifford approximant circuits.By the parameter-shift rule, the expectation of the squared gradient can be calculated from the knowledge of four quantities of the form . The latter can be estimated with Clifford approximants by replacing the Û ⊗2 term in the definition of the 2-fold channel by Û± ⊗ Û± .Hence the scheme covers the estimation of the gradient variance.Note that at the second order, the approximant circuits are obtained by replacing the rotation 2-fold channels by one of the four 2-qubit Clifford gates depicted on Fig. 2, yielding an ensemble of 4 M possible Clifford circuits.As for first-order quantities, a number of samples scaling linearly in M is enough to guarantee convergence.These rigorous results are summarized in the following theorem, whose detailed proof is shown in Appendices A and C.
Theorem III.1.For a variational ansatz composed of fixed Clifford gates and of M parameterized rotations along the X,Y or Z direction, if the random variational parameters (θ 1 , . . ., θ M ) are independent and symmetric with respect to one of the Clifford angles, i.e. ∈

{0, π
2 , π, 3π 2 }, then for any error ϵ > 0 and a probability 1 − δ to meet such accuracy, any first-order quantity can be computed using Clifford approximant circuits.Moreover, if the distribution of θ i satisfies the inequality then the same holds for any second-order quantity.
Finally one makes use of the Gottesman-Knill theorem, which states that for a Clifford unitary Û and an observable Ô acting non-trivially on N O qubits, the expectation value Tr |0⟩⟨0| ⊗n Û † Ô Û can be classically computed with a polynomial complexity in both n and N O .
Our method inherits this complexity, and in particular we can classically estimate the gradient expectation and variance with a polynomial complexity in n, N O and M .In App.D we extend the scheme presented above for the 2-fold channels to the case where the distribution of the random angle does not satisfy the convex condition of Eq. (10).In that case, it is still possible to use Clifford approximant circuits to estimate second-order quantities, but this comes at the price of an exponential complexity in the number of variational parameters M .This result is based on a sampling scheme proposed by the authors of [72].In that context, the method allows to trade an exponential complexity in the system size for an exponential complexity in the number of variational parameters M .
In App.E, we also present the extension of our scheme to the general case of N -fold channels.We prove that the N -fold channel associated to random Z-rotations can be decomposed as a real sum of Clifford unitary channels.From this decomposition, we derive a condition for the We emphasize that derivatives with respect to the other angles θ k give similar results (not shown).The results are for random circuits composed of a single layer of gates, with one rotation per qubit.Such rotations are randomly chosen among RX , RY , RZ .The rotation layer is followed by a layer of alternated CZ gates (note that this is the same type of architecture as that represented on Fig. 1).
The random rotation angles are independent and follow the uniform probability distribution on the interval [0, 2π].In order to get the estimation, we have randomly sampled 500 different circuit architectures.For each gates architecture, we have computed the average of the squared gradient assuming a uniform distribution of the rotation angles, using both a direct estimation and our method based on the mapping to Clifford approximant circuits.In particular, we have sampled 500 vectors of angle parameters for the direct estimation and 500 Clifford circuits for our method.Note that for the uniform distribution the average gradient vanishes, thus estimating the squared gradient is equivalent to estimate the gradient variance.
N -fold channel to be a convex sum of Clifford unitaries by imposing the coefficients of the combination to be positive.However we show that the obtained decomposition is not unique, so that the derived condition is sufficient but not necessary.Finding a sufficient and necessary condition on the distribution of a random angle that guarantees that the corresponding N -fold channels are Clifford mixed unitary channels remains an open problem.

B. Numerical simulations
To illustrate the applications of our exact mapping and the ensuing estimation method, we have performed numerical experiments on concrete examples.Let us consider a simple variational quantum circuit composed of layers of single-qubit rotations along either the X, Y or Z axes, alternated with fixed layers of Control-Z gates.Such an ansatz is shown for three qubits in Fig. 1.We further assume that the rotation angles are independent and identically distributed according to the uniform law . Squared statistical bias of the estimator considered in Fig. 3 for random circuits with n = 5 qubits versus the number K of Clifford approximant circuits.The results have been obtained with 500 randomly drawn circuit architectures.For each sample size K, we consider a bootstrap batch of 100 estimators (each estimator is obtained by sampling K circuits from a set of 2000 Clifford approximant circuits for each choice of the rotation directions).Then for each K, the statistical bias is derived from the bootstrap batch.The estimator true expected value is provided by the direct estimation of the average squared gradient with 4000 samples.The shaded area corresponds to the interval between the 20 and 80 percentiles of the estimated biases for the 500 random architectures.Var ÊK Variance of the estimator of the expected squared gradient with respect to the first parameter θ0 versus the number K of Clifford approximant circuits.Same type of random circuits as in Fig. 3 with n = 5 qubits.We have used the same bootstrap procedure as in Fig. 4. The shaded area corresponds to the interval between the 20 and 80 percentiles of the estimated biases for 500 random architectures.
We consider these architectures with random directions of the rotation gates.Up to a different fixed first layer, such random circuits have been showed to exhibit barren plateaus in Ref. [64].Note that in this particular case the averaging was done on both the rotations an- Clifford circuits.The circuits and cost function used are the same as the for Fig. 3.
gles and the rotations directions.Here we reproduce this result using Clifford approximants.To do so we sample both the exact circuit architecture by randomly selecting the rotation directions uniformly from {X, Y, Z}, and then we either sample the rotation angles directly or we sample a Clifford approximant circuit.For a uniform distribution we have ∀k ∈ Z, E θ [cos kθ] = 0 so the sampling of the replacement Clifford gates is uniform (as represented on Fig. 2 for r 1 = r 2 = 0).Moreover, by the parameter-shift rule (Eq. 3) it is clear that for uniformly distributed rotations the average gradient is analytically zero, thus it suffices to estimate the average of the squared gradient as Var In Fig. 3 the estimations of the average squared gradient using either direct evaluations or by sampling Clifford approximants are shown.Note that the average is taken over both the random rotation angles and the variable architecture (i.e. the random direction of the rotation gates).The estimation obtained from Clifford approximants accurately matches the direct estimation and the average squared gradient vanishes exponentially with the number of qubits, as expected.In addition, the evolution of the bias of the Clifford estimation with the number of approximant circuits K is shown in Fig. 4. The bias decreases polynomially with K.As appears in Fig. 5, the same trend holds for the variance of the Clifford estimators.These results are in agreement with the analytical scaling derived in App. C.
Using our method, we also reproduced the results showing the exponential suppresion of the gradient variance presented in Ref. [64] for up to 100 qubits.The results are presented on Fig. 6.
Finally, we illustrate the impact of an ansatz architecture on its trainability by evaluating the gradients variances for a set of randomly drawn ansatze and for a given cost function Hamiltonian.We consider random circuits acting on 40 qubits, and a Hamiltonian composed of a Each circuit is composed of 10 layers.For each layer l, a number m l is randomly chosen in the interval [0, n], and the layer is then built by first applying m l random single qubit rotation gates to randomly chosen qubits, and then applying a serie of entangling gates of the same type and with the same arrangement as the ones considered in Fig. 3.For each circuit and each parameter θi, the gradients are estimated using 10 4 Clifford approximant circuits.The Hamiltonian of the cost function is a sum of 10 random Pauli strings.Note that each curve corresponds to a different architecture.
sum of 10 randomly chosen Pauli strings.As for the results presented in Fig. 3, the considered variational circuits are composed of layers of rotations alternated with fixed entangling layers.Here we considered circuits with 10 layers.For each rotation layer, a random subset of rotations is replaced by identity gates (see Fig. 7 for details).The variances of the cost function partial derivatives with respect to the ansatz parameters are shown on Fig. 7, and Fig. 8 shows the average variance of the gradients versus the the mean value of the cost function for the different random circuits.These results show that for a given cost function, modifying the ansatz architecture has a strong effect on the trainability.We therefore believe that our method may be used to systematically examine such effects for large systems with a reasonable cost, hence guiding the design of better variational ansätze.

IV. CONCLUSIONS AND PERSPECTIVES
In this paper we presented a classically efficient method to estimate first and second-order expectation values for a large class of randomly initialized variational quantum circuits.This includes estimating the average gradient of the cost function and its variance, which can be used to estimate the trainability.Our method ap-0.0020.004 0.006 0.008 0.010 0.012 plies to the large class of circuits whose architecture is composed of fixed Clifford gates and single-qubit parameterized rotations, provided that the rotation angles are independent and that their distributions are symmetric with respect to an angle θ 0 ∈ {kπ/2, k ∈ Z} and satisfy The method relies on an exact mapping of randomly initialized variational quantum circuits to ensembles of Clifford circuits and on the Gottesman-Knill theorem.We provide rigorous convergence guarantees, and in particular we show that the complexity of the method scales polynomially in both the system size and the number of parameters of the considered ansatz.We investigated the generalization of the proposed scheme to the case of N -fold channels, and showed that the N -fold average of random Z-rotations can be expressed as a real combination of Clifford unitaries.However, such a decomposition is not unique, and finding a sufficient and necessary condition for the considered N -fold channel to be a Clifford mixed-unitary channel remains an open problem.Solving this problem is of great interest as it could allow to generalize the scheme presented in this work to ansätze with correlated variational parameters.
We believe that such a tool will prove very useful in future applications, as it could be employed to conduct classical optimization of architectures and initialization of large scale variational quantum circuits.As the absence of barren plateaus can be guaranteed by a large enough variance of the gradient, regardless of the exact origin of the potential barren plateaus, this method could be used to certify trainability for system with a very large number of qubits.

ACKNOWLEDGMENTS
This work was supported by Region Île-de-France in the framework of the Domaine d'Intérêt Majeur (DIM) Science et Ingénierie en Région Île-de-France pour les Technologies Quantiques (SIRTEQ).This work was granted access to the High Performance Computing (HPC) resources of Très Grand Centre de Calcul (TGCC) under Allocation No. 2022-A0120512462 made by Grand Equipement National de Calcul Intensif (GENCI).We would like to thank Zakari Denis for helpful discussions during the early stages of this work.Here we give the expression of the 1-fold channel for a single-qubit rotation around the Z axis.The rotations around the X and Y axis can then be obtained by combination with Hadamard and phase gates.Let us define Π0 := |0⟩⟨0| and Π1 := |1⟩⟨1|: Thus We recognize the characteristic function of the distribution of θ, namely Assuming this probability distribution is even in θ, we have ϕ(t) ∈ R, ∀t and we can define r 1 = ϕ(1) = ϕ(1) * = ϕ(−1).As we have 1 = Π0 + Π1 and Ẑ = Π0 − Π1 , we get and hence This is indeed a convex sum of Clifford channels under the condition that r 1 ∈ [−1, 1], which is always satisfied.
For distributions that are symmetric with respect to a Clifford angles ∈ {kπ/2, k ∈ {0, 1, 2, 3}}, we can factor out the corresponding rotation, which is (up to a phase) a Clifford gate.This way we can fall back to the case of an unbiased even distribution, i.e. symmetric with respect to the zero angle.Note that in the particular case of the uniform distribution over [0, 2π], we have r 1 = 0.

2-fold channel
In this section we will make use of the Choi representation of quantum channels, which allows to represent channels acting on two-qubits states by 16 × 16 matrices.For a quantum channel (i.e. a completely positive trace preserving or CPTP map) E, the Choi operator is defined by: Its corresponding matrix entries are: In the following, we will write for the quantum channel associated to a unitary transformation Û .We assume that Û is diagonal in the computational basis, so that we can write where we define the projectors Πij := Πi ⊗ Πj .For Û unitary, we have Û Û † = 1 = i,j λ ij λ * ij Πij , and hence λ ij = e iθij , ∀i, j.Therefore we have Thus the Choi matrix of E[ Û ] is diagonal whenever Û is of the form given in Eq. (A8).We can represent it by a 4 × 4 matrix M , whose entries are defined by M (ij),(kl) := Λ (ijkl),(ijkl) . (A10) Note that the matrix M is Hermitian and that its diagonal entries are always equal to one, due to Eq. (A9).In the following we will represent each channel by its associated matrix M in the basis (00), ( 01), (10), (11).
As done earlier, we will focus on rotations around the Z axis.We have we can write Φ (2) For the uniform distribution of θ in [0, 2π], we have E θ e ±iθ = E θ e ±2iθ = 0, and thus: We can represent Φ Z by its associated matrix One can verify that the following channels also have a diagonal Choi matrix, and we can use the same represen-tation of their diagonals, giving with Ŝ = Π 0 +iΠ 1 the phase gate.Gathering all together, we have The final result in the main text then follows by linearity and uniqueness of the Choi matrix.

b. Even distribution
Let us consider an even probability distribution of θ (i.e. a distribution for which θ has the same law as −θ).For such distributions we again have that ϕ Defining r 1 = ϕ θ (1) and r 2 = ϕ θ (2), we can write Hence we obtain: We can express M (Φ Z ) as a linear combination of the matrices of Eq. (A18), giving (A22) The coefficients a, b, c can be found by solving the linear system and one finds Therefore, the associated channel is Φ Remark.Defining CZ = Π0 ⊗ 1 + Π1 ⊗ Ẑ the control-Z gate and CZ X = ( X ⊗ X)CZ( X ⊗ X), we have and thus Therefore the decomposition of Φ Z into a convex sum of Clifford channels of Eq. (A25) is not unique.
The decomposition obtained in Eq. (A25) is a convex sum if one assume that (1 + r 2 − 2r 1 ) ≥ 0 and (1 + r 2 + 2r 1 ) ≥ 0. This condition holds if and only if namely if and only if This condition is fulfilled for the distributions that are π-periodic as in that case we have E θ [cos θ] = 0. Another example of distribution that satisfy this constraint is a Gaussian distribution with a large enough variance.In fact for a centered Gaussian distribution of variance σ 2 , we have r 1 = e −σ 2 /2 and r 2 = e −2σ 2 , so that the condition becomes One can show that this condition is equivalent to σ 2 ≥ σ 2 0 for some specific σ 0 ∈ R, yielding a requirement on the width of the gaussian.

Appendix B: First and second-order quantities
In this appendix, we define the notion of first and second-order quantities as quantities that can be obtained from the knowledge of respectively the 1-fold and the 2-fold channels for each of the random rotations appearing in a given ansatz.We also show that the average cost function and the average gradient are first-order quantities, while the average of the squared cost function and of the squared gradient are second-order quantities.

First-order quantities
Let us consider the ansatz defined by and denote the unitary channels associated to the different layers of the circuit.The whole circuit unitary transformation then reads The cost function is then given by and its expectation with respect to θ is θ (ρ) Ô .

(B5)
Here, we used both the linearity of the expectation and the definition of the 1-fold channel from Eq. ( 9).The cost function expectation can thus be obtained from the knowledge of the complete 1-fold channel Φ θ .Assuming that the angles {θ i } are independent from each other, the expectation against θ can be factored in expectations against the θ i 's, which allows to write: As explained in the main text, we can consider without loss of generality all the rotations to be Z-rotations.Then the channels E θi [U i (θ i )] are exactly 1-fold channels associated to a Z-rotation acting on a single qubit, and hence can be computed from the results of App. A. As stated earlier, we refer to quantities that can be obtained from the knowledge of the 1-fold channels associated to each rotations of the ansatz as first-order quantities.Hence the average cost function E θ [C(θ)] is a first-order quantity.
Another example of an interesting first-order quantity is the average of the gradient.From the parameter-shift rule and using the linearity of the expectation, we have: Here e k is the unit vector along the k-th component.The ±π/2 shifts in the parameter θ k can be factored out and seen as an extra Clifford gate added to the fixed layer Ŵk .In fact, assuming Pk = Ẑ and denoting Ŝ the phase gate, we have Ûk (θ k + π/2) Ŵk = e −i θ k 2 Pk e −i π 2 Ẑ Ŵk = e −i π 4 Ûk (θ k ) Ŝ Ŵk .Defining Ŵk,± = e ∓i π 4 Ŝ Ŵk , we get Ûk (θ k + π/2) Ŵk = Ûk (θ k ) Ŵk,+ .We can proceed likewise to define Ŵk,− .In the following, we can write ∀i ̸ = k Vi,± = Ŵi and Vk,± = Ŵk,± the modified fixed layers that include the considered shift.We have where . The average gradient is therefore a first-order quantity, namely depending on 1fold channels only.

Second-order quantities
We now turn our attention to the mean value of the squared cost function.This is given by For every state ρ of a system of 2n qubits (i.e., a doubled version of the original system where the copy is not connected by gates to the original circuit), we define (B10) Likewise we can define the doubled version of the circuit layers as giving Thus for independent rotations we have As for first-order quantities, we refer to quantities that can be obtained from the knowledge of the average 2-fold channels of the rotations layers E θi U i (θ i ) as secondorder quantities.
The average of the squared cost function is thus a second-order quantity, and as for the first order case, we can show that the squared gradient is also a second-order quantity.In fact, by making use of the parameter-shift rule, we see that to obtain the average of the squared gradient we have to compute the following four terms As done before, it suffices to replace the W (2) i in Eq. (B13) with Finally, the gradient variance can be computed as which is the sum of a first and a second-order quantity.
Appendix C: Proof of the sampling efficiency In this appendix we prove that to obtain an estimation of any first or second-order quantity for a given ansatz up to a precision ϵ and probability δ ∈ [0, 1] to meet this precision, it suffices to sample a number of Clifford approximant circuits K ∼ log(2δ)M/ϵ 2 .By invoking the Gottesman-Knill theorem, we obtain an estimation of any of the previous quantities with a complexity polynomial in both the size of the system and the number of variational parameters of the considered ansatz.

Details on the mapping
Here we give details on the mapping of the randomly initialized parameterized circuit to Clifford approximants.
Remark.We use the notations adapted to first-order quantities.The generalization to the second order and the shifted versions is straightforward as it suffices to replace each channel by its doubled and/or shifted version, as done in App.B.
Assuming that the θ i are independent from each other, averaging U(θ) over θ amount to replace each rotation channel U i (θ i ) by a convex sum of m Clifford unitary channels U ij with associated weight p ij .Thus E θ [U(θ)(ρ)] is replaced by a discrete average over m M Clifford unitary channels (with m = 2 for the 1-fold channel and m = 4 for the 2-fold channel): As we want to sample from that sum, we can define for each i a discrete random variable X i taking values in {1, . . ., m} such that P(X i = j) = p ij .This represents a choice of a given unitary in the previous convex sum.
Gathering these for all k we get a random vector X = (X 1 , . . ., X M ) ∈ {1, . . ., m} M that completely defines a unique unitary U(X) through: Thus we have: The main idea is now to approximate the k-fold channels by an empirical average over K samples of the previous Clifford circuits, namely:

Sampling efficiency
Our result relies on classical arguments for the sampling of bounded functions depending on a set of random variables using the McDiarmid's concentration inequality [73,74], which we remind below.
Definition C.1 (Bounded difference property).A function f : X M → R satisfies the bounded difference property if and only if ∃{c 1 , . . ., c M } such that ∀i ∈ {1, . . ., M }, ∀ (x 1 , . . ., x M ): Theorem C.1 (McDiarmid's inequality).Let f : X M → R satisfy the bounded difference property with bounds {c 1 , . . ., c M }, and a random vector X = (X 1 , . . ., X M ) taking values in X M , then ∀ϵ > 0 We will show that the quantities we want to estimate satisfy the bounded difference property and apply the McDiarmid's inequality to prove that our previous sampling is efficient.In the following we define where Ô is the cost function observable defined in the main text and, as in the previous section, U(x) the unitary channel associated to a given Clifford approximant circuit that is completely specified by a discrete vector x = (x 1 , . . ., x i , . . ., x M ) ∈ {1, . . ., m} M .By the Hölder inequality [49,75], f is upper-bounded: where ∥A∥ 1 , ∥A∥ ∞ are respectively the Schatten-1 norm and the spectral norm [49].Remark that ∥ρ∥ 1 = 1 for ρ is a density operator.Defining a second vector for which only the i-th component is changed x ′ = (x 1 , . . ., x ′ i , . . ., x M ), we get by using the triangle inequality Hence f satisfies the bounded difference property with c i = c = 2∥ Ô∥ ∞ , and we can apply McDiarmid's inequality, which gives almost the desired result.To go further, we define Clearly, f K satisfies the bounded difference property with the same bound c [to see this, we take all x ij equal except for x kl , and it follows that the difference . Thus McDiarmid's inequality applies to f K , which is a function of KM parameters: Therefore, choosing a precision ϵ > 0 and a probability 1 − δ ∈ [0, 1] to meet this precision, we get whenever the number of sampled Clifford circuits K is Note that in Eq. (C9), replacing the observable Ô by its normalized counterpart Ô/∥ Ô∥ with an associated precision ε gives the same scaling for K, as in that case ε = ϵ/∥ Ô∥.Hence we can always work with a normalized observable.However, if one is interested in the scaling with the system size n, we have to consider a sequence of observables Ôn , whose norms can present a particular scaling in n, so the presence of the norm of Ô in Eq. (C11) allows to keep track of this effect.In many situations of interest, the observables considered scale polynomialy in the system size, and so does K.Finally, one can use the Gottesman-Knill theorem which states that for a Clifford unitary Û and an observable Ô acting non-trivially on N O qubits, the expectation value Tr |0⟩⟨0| ⊗n Û † Ô Û can be classically computed with a complexity polynomial in both N O and the number of qubits n [24].Our scheme inherits this scaling and we can estimate the gradient variance Var θ [∂ k C(θ)] for each k with a classical computer in a complexity in O (n p N q O M ) with M the number of parameters in the variational quantum circuit.

Appendix D: Sampling efficiency in the general case
In this section we extend the previous scheme to more general distributions.We first discuss in App.D 1 the scaling of the sampling complexity with the convexity condition relaxed, i.e.where we no longer require the decomposition of the 2-fold channel [Eq.(A25)] to be a convex sum and only assume that the distribution of θ is even.Then, we study in App.D 2 the case of an arbitrary distribution of the rotation angles, which is not necessarily symmetrically distributed.Finally, we show that our previous scheme still applies at the price of an exponential factor in the number of variational parameters M in the number of Clifford approximant circuits to be sampled.Compared to a brute-force simulation, this method can be used to trade an exponential complexity in the system size for an exponential complexity in the number of variational parameters.

Sampling efficiency in the nonconvex case
Here we consider distributions of rotation angle θ that are even, but do not satisfy the convexity condition of Eq. (A28).In this case, our decomposition of the 1-fold channel remains convex while the 2-fold channel becomes a nonconvex sum, hence the coefficients for the Clifford channels can no longer be interpreted as probabilities.
We first show how one can still estimate such nonconvex sums via probabilistic sampling [72].Denoting we hereby assume Defining Eq. (D1) can be rewritten in terms of convex sums: Similar to App.C 1, we now define the random vector X = ( X1 , . . ., XM ) ∈ {1, . . ., m} M , with probabilities P( Xk = j) = pkj , and the rescaled random unitary channel Ũ( X) through Therefore, we recover the form of an expectation value similar to Eq. ( C3): This allows us to apply the same arguments as in App.C 2 by considering the function instead of f (x) defined in Eq. (C5).The function bound (C6) should be rescaled accordingly: where the scaling factor is defined as The number of sampled Clifford circuits previously derived in Eq. (C11) should therefore be scaled with the same factor: Note that the factor γ k ≥ 1 can be regarded as a measure of "nonconvexity" in the decomposition of the k-th channel.In the case of a convex sum, where q kj > 0, ∀k, j, the scaling factor is simply γ = 1 M = 1 and we recover the previous results.We now show that γ k is upper-bounded.Following our discussion in App.A, it suffices to consider the 2fold channel for a single-qubit Z-rotation, where the decomposition can be possibly nonconvex.Without loss of generality, let us rewrite Eq. (A25) as for some k, where (1 + cos 2θ + 2 cos θ) , Defining the non-negative function φ(θ) := 1 4 (1 + cos 2θ + 2 cos θ) We then get

(D14)
Here the function φ(θ) reaches its maximum for θ = ± π 3 , ± 2π 3 .Therefore, the factor γ k reaches its upper bound 5  4 if the distribution of θ is a sum of Dirac-delta distributions peaked at θ = ± π 3 and/or θ = ± 2π 3 , in which case we obtain the worst-case scaling of the number of sampled Clifford circuits (D10): Combining the above result with the Gottesman-Knill theorem, for a cost-function observable Ô acting nontrivially on N O qubits, our scheme implies a complexity of at most O n p N q O ( 5 4 ) M M for the estimation of gradient variance Var θ [∂ k C(θ)] for each k on a classical computer in the general scenario, where n is the number of qubits, M is the number of parameters in the variational quantum circuit and p, q are some constants inherited from the Gottesman-Knill theorem.

Sampling efficiency for the general case
In this section, we extend our scheme to the most generic case, by considering an arbitrary probability distribution for the rotation angles θ, and derive the corresponding sampling complexity.As before, one needs only to consider the one-and two-fold channels for a single-qubit Z-rotation gate.In what follows, let us denote E θ e iθ := r 1 + is 1 and E θ e 2iθ := r 2 + is 2 .Note that r 1 = E θ [cos θ] and r 2 = E θ [cos 2θ] are defined in the same way as for the symmetric case before, while s 1 = E θ [sin θ] and s 2 = E θ [sin 2θ] are in general nonzero since we no longer assume the distribution of θ to be even.

a. 1-fold channel
The expression of the 1-fold channel for a single-qubit Z-rotation is given by Eq. (A2), which we develop below without assuming an even distribution in θ.We get: where Ŝ = Π0 +i Π1 is the phase gate, and one can use this definition together with Eq. (A3) to verify the equation above.
Here, the parameter s 1 can be understood as a measure of asymmetry in the probability distribution of θ.In the symmetric case, we have s 1 = 0 and the sum reduces to the convex one given by Eq. (A4).Following the same procedure as in App.D 1, this (possibly nonconvex) linear combination of Clifford channels can be estimated via sampling, and the number of required samples should be scaled, according to the nonconvexity of the sum, by a factor γ = Π M k=1 γ k [see definition in Eqs.(D1)-(D3) and Eq.(D9)].We now derive an upper bound for γ (1) k , the scaling factor associated to a single (the k-th) 1-fold Zrotation channel that can be decomposed in the form of Eq. (D16) in general.We proceed by applying the same argument as in Eqs.(D11)-(D14): This implies that the number of samples K (1) required for the estimation of the generic 1-fold channel [see Eq. (D15)] scales as Note that the bound derived above depends on the specific choice of the Clifford channels in the decomposition.
As the Clifford group does not form a linearly independent set, it should be possible to find a different decomposition that yields a different upper bound and further optimize the complexity.

b. 2-fold channel
The 2-fold channel for a single-qubit Z-rotation is given by Eq. (A14): For a generic probability distribution of θ, we have (D20) As one can verify, the Choi representation of the above terms are all diagonal, so that their sum can be represented via the M matrix as before: This can again be decomposed as a weighted sum of the channels E in Eq. (A18) and of the following Clifford channels: Note that the channels listed above are all diagonal in the Choi representation and hence the M matrices capture all their nonzero entries.Following the same reasoning as in App.A, we solve a linear system to obtain the following decomposition: Φ Remark.Denoting CN OT = Π0 ⊗1+ Π1 ⊗ X the CNOT gate and CN OT X := ( X ⊗ X) CN OT ( X ⊗ X) its conjugation by the X ⊗ X gate, we have Again, by solving a linear system one finds another decomposition of the two-fold channel that involves the above channels, namely: Φ Similar to our treatment with the 1-fold channel, let us derive an upper bound for γ (2) k , the scaling factor for the number of samples required for the estimation of the generic 2-fold k-th Z-rotation channel: (D26) This implies that the number of samples K (2) required for the estimation of the generic 2-fold channel scales as which is dominant over the complexity of the estimation of the 1-fold channel [Eq.(D18)] since 1 + √ 2 > 2. Again, combining the above result with the Gottesman-Knill theorem, for a cost-function observable Ô acting non-trivially on N O qubits, our scheme implies a complexity of no more than O n p N q O (1 + √ 2) M M for the estimation of gradient variance Var θ [∂ k C(θ)] for each k on a classical computer in the most generic case, where n is the number of qubits, M is the number of parameters in the variational ansatz and p, q are some constants inherited from the Gottesman-Knill theorem.
Appendix E: N -fold channel for a random Z rotation In this section we give a decomposition of the N -fold channel as a real sum of Clifford unitary channels.This allows us to extend our scheme to the estimation of N −th order quantities with a complexity scaling polynomialy in both the number of variational parameter M and the system size n when the decomposition is convex, and exponential in M otherwise.We give a sufficient condition on the distribution of the random angle θ for the decomposition to be a convex one.
Recall that for any unitary Û we defined E Û (ρ) := Û ρ Û † .In Eq. (D16) we obtained a decomposition of the 1-fold channel of a Z-rotation in terms of Clifford unitary channels for a generic distribution of the random angle, namely More generally, we have that for any θ ∈ R.This can be seen as a consequence of Eq. (E1) for a Dirac probability measure centered at θ. On can directly generalize this equation to obtain an expression of the N -fold channel as a real sum of Clifford unitary channels, as with m 0 + m 1 + m 2 + m 3 = N .As a result, the Nfold channel is given by a real combination of 4 N unitary Clifford channels that are composed of products of the gates 1, Ẑ, Ŝ and Ŝ † .This gives us a sufficient condition for the N -fold channel to be a convex sum of Clifford unitary channels, namely it suffices that the expectation values of all the coefficient E θ [λ I (θ)] be positive.Although this condition is sufficient, it is not necessary.In particular, in the case of the 2-fold channel, the expectation of coefficients associated to the multi-indices (2, 3) and (3, 2) is given by E θ − sin 2 θ , which is always negative.However, we proved that a convex decomposition exists for the uniform distribution.This is due to the fact that the decomposition of Eq. (E3) is not unique.In fact, the family of channels is not linearly independent.Consider two single qubit unitaries Û and V that are diagonal in the computational basis.As we are free to chose the global phase of these unitaries, we can always write them as Û = e iθ U /2 Π0 + e −iθ U /2 Π1 and V = e iθ V /2 Π0 + e −iθ V /2 Π1 .We saw in App.A that the product unitary Û ⊗ V can be represented by the diagonal of the associated Choi matrix, written as a 4-by-4 matrix M: This shows that for a tensor product of single-qubit unitaries, the matrices M in the basis ((00), (01), (10), (11)) are symmetric with respect to the anti-diagonal transposition.Therefore, the channels in P belong to a real vector space of dimension 9 (1 dimension for the diagonal, 2 × 3 dimensions for the complex exponentials of the first row, and 2 dimensions for the third term of the second row).As there are 16 channels in P, the family is not linearly independent.The condition that all the E θ [λ I (θ)] be positive is clearly too restrictive.One way to extend it to find back the condition we previously derived is to use the fact that to absorb the E θ − sin 2 θ factors into the coefficients associated to other channels.
Remark.To obtain the previous relation, we used the following channels: We showed that the N -fold channel associated to Zrotations can always be decomposed as a real linear combination of Clifford unitary channels.However, it remains an open problem to find necessary and/or sufficient conditions under which the N -fold channel can be decomposed a convex combination of Clifford unitaries, i.e. conditions under which the N -fold channel is a Clifford mixed-unitary channel.The knowledge of such conditions could allow to extend the scheme proposed in this work to ansätze with correlated rotation parameters.
Appendix F: Example of first and second order Clifford approximant circuits for a simple ansatz In this appendix we provide a sample of Clifford approximant circuits for the estimation of E θ [C(θ)] and E θ C(θ) 2 for the simple circuit depicted in Fig. 10.The generalisation to Clifford approximants for other quantities, such as the expectation of the squared gradient, can be derived from that example as it suffices to introduce the adequate Clifford gates to the fixed layers to obtain the right estimators (see Sec. II A and B).This circuit acts on three qubits and is composed of two layers of rotations that are alternated with fixed two-qubits Control-Z gates.To obtain a first order approximant for these circuits it suffices to randomly replace each rotation by either the identity gate (a wire) or the Pauli gate corresponding to the direction of the concerned rotation gate.Three examples of first order Clifford approximant are represented in Fig. 11.The second order approximant are derived by first mapping each rotation along X or Y to a rotation along Z, making use of the identities X = Ĥ † Ẑ Ĥ and Ŷ = ( Ŝ Ĥ) Ẑ( Ŝ Ĥ) † where Ĥ, Ŝ are respectively the Hadamard and phase gates.As a result we get the ansatz with layers of Z rotations alternated with fixed layers composed of Clifford gates represented on Fig. 12.This circuit is then doubled vertically to give Figure 11.Examples of first-order Clifford approximant circuits for the ansatz of Fig. 10.Assuming the probability distribution of the angles is even, we replace each rotation by a Clifford gate that is sampled according to Eq. (A4).
a circuit acting on six qubits.Finally, each pairs of rotations sharing the same angle is randomly replaced by two single-qubit gates according to the scheme of Fig.

Figure 2 .
Figure 2. Schematic representation of the mapping rules from Z-rotations with a random parameter to unitary ensembles composed of Clifford gates.Panels (a) and (b) respectively are for first and second-order averages.The mapping here is for probability distributions that are even with respect to θ: we denote r1 = E θ [cos(θ)] and r2 = E θ [cos(2θ)].The coefficients pi are the probabilities dictating how the corresponding Clifford circuits are sampled.

Figure 3 .
Figure 3.Estimated average of the squared gradient of the cost function with respect to the first variational parameter versus the number n of qubits.We emphasize that derivatives with respect to the other angles θ k give similar results (not shown).The results are for random circuits composed of a single layer of gates, with one rotation per qubit.Such rotations are randomly chosen among RX , RY , RZ .The rotation layer is followed by a layer of alternated CZ gates (note that this is the same type of architecture as that represented on Fig.1).The random rotation angles are independent and follow the uniform probability distribution on the interval [0, 2π].In order to get the estimation, we have randomly sampled 500 different circuit architectures.For each gates architecture, we have computed the average of the squared gradient assuming a uniform distribution of the rotation angles, using both a direct estimation and our method based on the mapping to Clifford approximant circuits.In particular, we have sampled 500 vectors of angle parameters for the direct estimation and 500 Clifford circuits for our method.Note that for the uniform distribution the average gradient vanishes, thus estimating the squared gradient is equivalent to estimate the gradient variance.

Figure 6 .
Figure 6.Estimated variance of the gradient of the cost function with respect to the first variational parameter versus the number n of qubits.Each variance is estimated using 10 5 Clifford circuits.The circuits and cost function used are the same as the for Fig.3.

Figure 7 .
Figure 7. Variance of the cost function gradient with respect to the variational parameters in decreasing order and for 20 random circuits.The circuits are acting on n = 40 qubits.Each circuit is composed of 10 layers.For each layer l, a number m l is randomly chosen in the interval [0, n], and the layer is then built by first applying m l random single qubit rotation gates to randomly chosen qubits, and then applying a serie of entangling gates of the same type and with the same arrangement as the ones considered in Fig.3.For each circuit and each parameter θi, the gradients are estimated using 10 4 Clifford approximant circuits.The Hamiltonian of the cost function is a sum of 10 random Pauli strings.Note that each curve corresponds to a different architecture.

Figure 8 .
Figure 8. Mean value of the cost function vs. variance of the cost function gradient, for 20 random circuits and 40 qubits.Each point represents a circuit, and the circuits considered are the same as the ones of Fig.7.The x-axis is the average of the variances of the gradients with respect to the Np circuits rotation parameters θi, i = 1, . . ., Np.

Appendix A: 1 -
fold and 2-fold channels of a random Z-rotation 1. 1-fold channel

Figure 10 .
Figure 10.Initial variational circuit with random rotation angles.

Figure 12 .
Figure 12.Equivalent form of the initial circuit with Z-rotations only.

Figure 13 .
Figure 13.Examples of second-order Clifford approximant circuits for the ansatz of Fig.10.