Connecting ansatz expressibility to gradient magnitudes and barren plateaus

Parameterized quantum circuits serve as ansätze for solving variational problems and provide a flexible paradigm for programming near-term quantum computers. Ideally, such ansätze should be highly expressive so that a close approximation of the desired solution can be accessed. On the other hand, the ansatz must also have sufficiently large gradients to allow for training. Here, we derive a fundamental relationship between these two essential properties: expressibility and trainability. This is done by extending the well-established barren plateau phenomenon, which holds for ansätze that form exact 2-designs, to arbitrary ansätze. Specifically, we calculate the variance in the cost gradient in terms of the expressibility of the ansatz, as measured by its distance from being a 2-design. Our resulting bounds indicate that highly expressive ansätze exhibit flatter cost landscapes and therefore will be harder to train. Furthermore, we provide numerics illustrating the effect of expressibility on gradient scalings, and we discuss the implications for designing strategies to avoid barren plateaus.


I. Introduction
While quantum hardware is rapidly reaching the stage where it can outperform classical supercomputers [1], we remain in the Noisy Intermediate-Scale Quantum (NISQ) era in which the available devices are relatively small and prone to errors [2]. Variational quantum algorithms (VQAs) have gathered attention as a computational strategy that is well suited to the constraints imposed by NISQ devices [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. In VQAs, a problem-specific cost function is efficiently evaluated on a quantum computer, while a classical optimizer trains a parameterized quantum circuit to minimize this cost. The benefit of this paradigm is that it adapts to the qubit and connectivity constraints of NISQ devices, while keeping the circuit depth short to mitigate quantum hardware noise.
Central to the success of VQAs is the construction of a parameterized quantum circuit, which serves as an ansatz with which to explore the space of solutions to the target problem. Some noteworthy ansätze include the quantum alternating operator ansatz [5,21], coupled cluster ansatz [22][23][24], Hamiltonian variational ansatz [25], and hardware efficient ansatz [26]. To successfully find an optimal solution, the ansatz should ideally be both expressive and trainable. Specifically, the ansatz must be sufficiently expressive such that it contains a circuit that well-approximates the optimal solution. Concurrently, the cost landscape must be sufficiently featured to be able to train the parameters to find this optimal solution.
Recently, it was shown that VQAs can exhibit barren plateaus, where under certain conditions the gradient of the cost function vanishes exponentially with the size of the system [27][28][29][30][31][32][33][34][35][36]. In particular, Ref. [27] demonstrated that if an ansatz is sufficiently random that it matches the uniform distribution of unitaries up to the second moment (i.e., forms a 2-design), then the variance in the cost gradient will vanish exponentially with the number of qubits. Several strategies have been proposed to address this issue [37][38][39][40][41][42][43][44][45][46], such as clever parameter initialization or ansatz construction, while more research is needed to test these strategies on various problems.
In broad terms, the expressibility of an ansatz is determined by how uniformly it explores the unitary space. Thus the distance between the distribution of unitaries generated by an ansatz and the maximally expressive uniform distribution of unitaries is a natural measure of its expressibility [47]. Using such a measure, Ref. [48] calculated the expressibility for several commonly used ansätze and, by using the cost gradients obtained in [28], suggested that in some cases it is possible for an ansatz to be both expressive and trainable. Additionally, Ref. [49] noted a numerical correlation between expressibility and trainability for analog systems. However, given that both expressibility and trainability are closely related to randomness, one might expect to be able to draw a more fundamental and general relationship between expressibility and trainability.
Here we demonstrate that this is indeed the case by analytically relating the trainability of an ansatz to its expressibility. This is done by extending the barren plateau phenomenon introduced in [27], which holds for ansätze that form exact 2-designs, to arbitrary ansätze. Specifically, we upper bound the variance in the cost gradient in terms of the distance the ansatz is from being a 2-design. Since the degree to which an ansatz is a 2-design is a measure of its expressibility, this allows us to relate the gradient of the cost landscape to the expressibility of the ansatz. We find that the more expressive the ansatz, the smaller the variance in the cost gradient and hence the flatter the landscape. We note that an ansatz does not strictly need to be highly expressive to be used successfully; rather, it just needs to contain a solution to the problem at hand. Thus our result highlights the importance of developing trainable problem-inspired ansätze.
Our main results are summarized in Fig. 1. Given an ansatz, we analyze the space of unitaries accessible when sampling the parameters (Fig. 1(A)) of a parametrized quantum circuit. Inexpressive ansätze, such as the one shown in Fig. 1(B), access a small region of the unitary group and can include the space of unitaries that solve certain problems but not the space of unitaries that solve others. Our results do not preclude inexpressive ansätze having trainability issues, such as barren plateaus. On the other hand, highly expressive ansätze, which are generically used for many problems as they can access a much larger space (Fig. 1(C)), are shown to lead to small gradients, and hence can have trainability issues.
Since our analytic bounds are upper bounds, they leave open the questions of how reducing the expressibility of an ansatz changes the cost landscape, and hence how reducing the expressibility can be used to avoid the barren plateau phenomenon. To address these questions we provide extensive numerics studying the effect that tuning the expressibility of an ansatz may have on the scaling of gradient magnitudes. Specifically, we consider the effects of decreasing the depth of the circuits, correlating circuit parameters, and restricting either the direction or angle of rotations. We find strongly correlating parameters [37] and/or initializing close to the solution (and then restricting the ansatz to explore the region close to the initialization [38]) to be the most effective approaches for avoiding exponentially vanishing cost gradients.

A. General framework
FIG. 1. (A) Variational quantum algorithms (VQAs) train the parameters $\theta$ in a parameterized quantum circuit to minimize a cost function as in Eq. (1). Each set of parameters corresponds to a unitary $U(\theta)$ being produced. The set of unitaries $\mathbb{U}$ accessible by $U(\theta)$ is a subset of the unitary group $\mathcal{U}(d)$, and the VQA can be successful if $\mathbb{U}$ overlaps with the space of solution unitaries $\mathbb{U}_s$ that (approximately) minimize the cost. The expressibility of an ansatz quantifies the degree to which it uniformly explores the unitary group $\mathcal{U}(d)$. Given problems A and B, we denote their solution spaces as $\mathbb{U}_s^A$ and $\mathbb{U}_s^B$ respectively. (B) A low-expressibility ansatz contains solutions to problem A but not to B, while a high-expressibility ansatz as in (C) contains solutions to both problems. Low-expressibility ansätze can lead to both small and large cost gradients. On the other hand, high-expressibility ansätze lead to predominantly flat cost landscapes, and thus are generally hard to train.

Variational Quantum Algorithms (VQAs) encode an optimization task in a cost function whose minimum corresponds to the solution of the problem. Here we consider cost functions of the form

$$C_{\rho,H}(\theta) = \mathrm{Tr}[H U(\theta) \rho U(\theta)^\dagger], \tag{1}$$

where $\rho$ is an $n$-qubit input state, $H$ is a Hermitian operator, and $U(\theta)$ is a parametrized quantum circuit depending on trainable parameters $\theta$. The value of the cost $C_{\rho,H}(\theta)$ (or of its gradient) is estimated on a quantum computer and then fed into a classical optimizer, which attempts to solve the optimization task $\arg\min_\theta C_{\rho,H}(\theta)$. The success of the VQA hinges on several factors. First, it is necessary to find an operator $H$ such that the resulting cost is faithful for the given problem. That is, we require the minimum of $C_{\rho,H}(\theta)$ to correspond to the solution of the optimization task. Evidently, for some applications, there may be multiple choices of $H$ corresponding to faithful costs, and therefore other factors will determine which to use. One such factor is how easily $H$ can be measured on a quantum computer. Another relevant feature, as discussed further in Section II D, is the locality of $H$, i.e., the number of qubits it acts non-trivially on. We say that the cost function is global if $H$ acts non-trivially on all qubits, while we use the term $k$-local for costs where $H$ acts non-trivially on at most $k$ qubits.
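As a minimal numerical sketch of the cost in Eq. (1), the following evaluates $\mathrm{Tr}[H U \rho U^\dagger]$ with dense matrices. The helper names `cost` and `U_of`, and the single-qubit example (input $|0\rangle$, measurement Pauli-$z$, one rotation about $y$), are illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

def cost(H, U, rho):
    """Evaluate C = Tr[H U rho U^dagger] for dense matrices (illustrative helper)."""
    return float(np.real(np.trace(H @ U @ rho @ U.conj().T)))

# Hypothetical single-qubit example: rho = |0><0|, H = Pauli-Z.
Z = np.array([[1, 0], [0, -1]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
rho0 = np.array([[1, 0], [0, 0]], dtype=complex)

def U_of(theta):
    """exp(-i theta Y) = cos(theta) 1 - i sin(theta) Y, valid because Y @ Y = 1."""
    return np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * Y

# For this choice the cost equals cos(2 theta), which is minimised at theta = pi/2.
thetas = np.linspace(0.0, np.pi, 5)
values = [cost(Z, U_of(t), rho0) for t in thetas]
```

In this toy case the cost is faithful for the task "prepare $|1\rangle$", since its minimum value of $-1$ is reached only when the output state is $|1\rangle$ up to a phase.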
A second aspect that determines the success of a VQA is the choice of ansatz for $U(\theta)$. While discrete parameterizations are possible, usually $\theta$ are continuous parameters, such as gate rotation angles, in a parametrized quantum circuit. Generally, $U(\theta)$ is expressed as

$$U(\theta) = \prod_{j=1}^{N} U_j(\theta_j) W_j. \tag{2}$$

Here $\{W_j\}_{j=1}^{N}$ is a chosen set of fixed unitaries and $U_j(\theta_j) = e^{-i\theta_j V_j}$ is a rotation of angle $\theta_j$ generated by a Hermitian operator $V_j$ such that $V_j^2 = \mathbb{1}$. The rotation angles $\{\theta_j\}$ are typically assumed to be independent.
Once an ansatz has been fixed for the parametrized quantum circuit, then, as sketched in Fig. 1(A), each possible vector of parameters $\theta$ corresponds to a unitary $U(\theta)$ that is produced. For concreteness, given a set of different parameters $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(y)}\}$ we obtain the corresponding ensemble of unitaries

$$\mathbb{U} = \{U(\theta^{(1)}), U(\theta^{(2)}), \ldots, U(\theta^{(y)})\}. \tag{3}$$
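The layered structure described above can be sketched numerically as follows. This is a minimal sketch assuming the product form $U(\theta) = \prod_j U_j(\theta_j) W_j$ with the $j = 1$ factor applied first (the ordering convention is an assumption), and with hypothetical single-qubit choices for the generators $V_j$ and fixed unitaries $W_j$.

```python
import numpy as np

def rotation(theta, V):
    """exp(-i theta V) for a Hermitian V with V @ V = identity."""
    return np.cos(theta) * np.eye(V.shape[0]) - 1j * np.sin(theta) * V

def ansatz(thetas, Vs, Ws):
    """Compose the factors U_j(theta_j) W_j, applying j = 1 first (assumed ordering)."""
    U = np.eye(Vs[0].shape[0], dtype=complex)
    for theta, V, W in zip(thetas, Vs, Ws):
        U = rotation(theta, V) @ W @ U
    return U

# Hypothetical example: two factors on one qubit, with Hadamards as the fixed W_j.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
Hd = (X + Z) / np.sqrt(2)

U = ansatz([0.4, 1.1], [X, Z], [Hd, Hd])
```

The closed form used in `rotation` relies on the condition $V_j^2 = \mathbb{1}$, which is why the generators are restricted to involutory operators such as Pauli strings.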

B. Expressibility
For a VQA to be successful, a solution (i.e., a unitary which is by some measure close to the unitary that minimizes the cost) needs to be contained within the ensemble of unitaries generated by the ansatz. Specifically, defining $\mathbb{U}_s$ as the set of solution unitaries, the VQA will be successful only if $\mathbb{U}_s \cap \mathbb{U} \neq \emptyset$. When this condition is satisfied, the ansatz is said to be complete for the given problem.
In the absence of prior knowledge about where the solution unitaries $\mathbb{U}_s$ lie, the likelihood that the ansatz is complete can be maximized by using an ansatz that explores the total space of unitaries as fully and as uniformly as possible. Such ansätze are known as expressive ansätze. For example, consider two problems (problem A and problem B), with solution spaces respectively denoted as $\mathbb{U}_s^A$ and $\mathbb{U}_s^B$. Figure 1(B) sketches $\mathbb{U}$ for an inexpressive ansatz which is complete with respect to problem A but incomplete with respect to B. Conversely, Fig. 1(C) shows $\mathbb{U}$ for an expressive ansatz which is complete with respect to both problems.
For many applications, information about the problem can be encoded in the ansatz. For instance, the quantum alternating operator ansatz [21] (or the Hamiltonian variational ansatz [25]) encodes information about an appropriate adiabatic transformation. Such problem-inspired ansätze may be complete but inexpressive (e.g., Fig. 1(B) could denote a problem-inspired ansatz for problem A). However, problem-agnostic ansätze, which can be used for a wide range of problems, need to be sufficiently expressive to guarantee their completeness.
The expressibility of an ansatz, i.e., the degree to which it uniformly explores the unitary group $\mathcal{U}(d)$, can be quantified by comparing the uniform distribution of unitaries obtained from the ensemble $\mathbb{U}$ to the maximally expressive uniform (Haar) distribution of unitaries from $\mathcal{U}(d)$. More concretely, the expressibility of a circuit can be defined in terms of the following superoperator [47,48]:

$$\mathcal{A}^{(t)}_{\mathbb{U}}(\cdot) := \int_{\mathcal{U}(d)} d\mu(V)\, V^{\otimes t} (\cdot) (V^\dagger)^{\otimes t} - \int_{\mathbb{U}} dU\, U^{\otimes t} (\cdot) (U^\dagger)^{\otimes t}, \tag{4}$$

where $d\mu(V)$ is the volume element of the Haar measure and $dU$ is the volume element corresponding to the uniform distribution over $\mathbb{U}$ in Eq. (3). If $\mathcal{A}^{(t)}_{\mathbb{U}}(X) = 0$ for all operators $X$, then averaging over elements of $\mathbb{U}$ agrees with averaging over elements of the Haar distribution over $\mathcal{U}(d)$ up to the $t$-th moment, and thus $\mathbb{U}$ forms a $t$-design [54][55][56][57][58]. For our purposes it suffices to consider the behavior of $\mathcal{A}^{(t)}_{\mathbb{U}}$ for $t = 2$, which we denote by $\mathcal{A}_{\mathbb{U}} := \mathcal{A}^{(2)}_{\mathbb{U}}$.

In the context of minimizing a generic cost $C_{\rho,H}(\theta)$ of the form specified by Eq. (1), we are interested in the expressibility of the circuit with respect to both the initial state $\rho$ and the measurement operator $H$. The following quantities respectively capture these notions:

$$\varepsilon^{\rho}_{\mathbb{U}} := \| \mathcal{A}_{\mathbb{U}}(\rho^{\otimes 2}) \|_2, \tag{5}$$

$$\varepsilon^{H}_{\mathbb{U}} := \| \mathcal{A}_{\mathbb{U}}(H^{\otimes 2}) \|_2. \tag{6}$$

Small values of $\varepsilon^{\rho}_{\mathbb{U}}$ and $\varepsilon^{H}_{\mathbb{U}}$ indicate that the ansatz is highly expressive. These measures generalize the notion of expressibility introduced in [47], where the expressibility was defined in terms of $\varepsilon^{\rho}_{\mathbb{U}}$ for $\rho = |0\rangle\langle 0|$.

While the $\rho$ and $H$ dependence of $\varepsilon^{\rho}_{\mathbb{U}}$ and $\varepsilon^{H}_{\mathbb{U}}$ make them natural measures of the expressibility in the context of minimizing a cost $C_{\rho,H}(\theta)$, cost-function-independent measures of expressibility may allow the expected performance of different ansätze to be more easily compared. With this in mind, one could alternatively quantify the expressibility directly in terms of the diamond norm of $\mathcal{A}_{\mathbb{U}}$,

$$\varepsilon_{\mathbb{U}} := \| \mathcal{A}_{\mathbb{U}} \|_\diamond, \tag{7}$$

which is an operationally meaningful distance measure to distinguish two quantum operations. We use the diamond norm here in line with the literature on $\varepsilon$-approximate unitary designs [59]; however, alternative norms can be used (for a discussion, see [57]).
For completeness we will formulate our results in terms of $\varepsilon_{\mathbb{U}}$, as well as the quantities $\varepsilon^{\rho}_{\mathbb{U}}$ and $\varepsilon^{H}_{\mathbb{U}}$.
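To make the state-dependent expressibility measure concrete, the sketch below estimates it by Monte Carlo for a pure input state. It assumes that $\varepsilon^{\rho}_{\mathbb{U}}$ is the Hilbert-Schmidt (Schatten 2-) norm of the difference between the Haar second moment of $\rho^{\otimes 2}$ and the ensemble second moment, and it uses the standard identity that the Haar average of $(|\phi\rangle\langle\phi|)^{\otimes 2}$ is $(\mathbb{1} + \mathrm{SWAP})/(d(d+1))$. The function names and the two example ensembles are illustrative.

```python
import numpy as np

def haar_unitary(d, rng):
    """Sample a Haar-random d x d unitary via QR of a complex Gaussian matrix."""
    g = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(g)
    phases = np.diag(r) / np.abs(np.diag(r))
    return q * phases  # rescale columns so the distribution is Haar

def eps_rho(unitaries, psi):
    """Monte Carlo estimate of the 2-norm distance between the Haar and
    ensemble second moments of rho = |psi><psi| (assumed definition)."""
    d = psi.size
    swap = np.eye(d * d).reshape(d, d, d, d).transpose(1, 0, 2, 3).reshape(d * d, d * d)
    haar_term = (np.eye(d * d) + swap) / (d * (d + 1))
    ens_term = np.zeros((d * d, d * d), dtype=complex)
    for U in unitaries:
        phi2 = np.kron(U @ psi, U @ psi)
        ens_term += np.outer(phi2, phi2.conj())
    ens_term /= len(unitaries)
    return float(np.linalg.norm(haar_term - ens_term))  # Frobenius norm

rng = np.random.default_rng(0)
psi0 = np.array([1.0, 0.0], dtype=complex)

# Inexpressive ensemble: z rotations only, which never move |0> (up to a phase).
z_only = [np.diag([np.exp(-1j * t), np.exp(1j * t)]) for t in rng.uniform(0, 2 * np.pi, 2000)]
# Maximally expressive ensemble: Haar-random single-qubit unitaries.
haar = [haar_unitary(2, rng) for _ in range(2000)]

eps_inexpressive = eps_rho(z_only, psi0)
eps_expressive = eps_rho(haar, psi0)
```

For the $z$-only ensemble the estimate equals $\sqrt{2/3}$ exactly, since the ensemble term is the fixed projector onto $|00\rangle$, while for the Haar samples it is small and limited only by sampling error.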

C. Gradient Magnitudes
For a variational quantum algorithm to run successfully it is not sufficient that the ansatz contains the solution; the cost landscape must also exhibit large enough cost gradients to enable this solution to be found.
The component of the gradient corresponding to the parameter $\theta_k$ is determined by the partial derivative $\partial_k C := \partial C_{\rho,H}(\theta)/\partial\theta_k$. For a generic ansatz of the form specified by Eq. (2), the average of $\partial_k C$ over all parameters $\theta$ vanishes:

$$\langle \partial_k C \rangle = 0. \tag{8}$$

That is, the cost gradients are not biased in any single direction but rather average out to zero. Intuitively, this lack of bias can be understood as following from the fact that the average of a rotation $\exp(-i\theta_k V_k)$ over $\theta_k$ is zero when $V_k^2 = \mathbb{1}$. We show this in Appendix C, where we prove that $\langle \partial_k C \rangle = 0$ by explicitly integrating over $\theta_k$.
However, an unbiased cost landscape can be either trainable or untrainable, depending on the extent to which the gradient fluctuates away from zero. Therefore, to assess the trainability of an ansatz $U(\theta)$, we now recall Chebyshev's inequality. This inequality bounds the probability that the partial derivative of the cost deviates from its average of zero,

$$\Pr(|\partial_k C| \geq \delta) \leq \frac{\mathrm{Var}[\partial_k C]}{\delta^2}, \tag{9}$$

in terms of the variance

$$\mathrm{Var}[\partial_k C] = \langle (\partial_k C)^2 \rangle - \langle \partial_k C \rangle^2, \tag{10}$$

where the expectation value is taken over the parameters $\theta$. Hence if the variance of the partial derivative is small for all $\theta_k$, then the probability that the partial derivative deviates appreciably from zero is small for all $\theta_k$. On such landscapes, (potentially untenably) precise measurements are required to detect the path of steepest descent to navigate to the minimum.
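The vanishing average can be made explicit with a short worked calculation. This is a sketch under the assumption (not stated explicitly above) that $\theta_k$ is sampled uniformly over a full period $[0, 2\pi)$, independently of the other parameters; the coefficients $a$, $b$, $c$ are introduced here for illustration and depend only on the remaining parameters.

```latex
% Using V_k^2 = 1, the rotation has the closed form
e^{-i\theta_k V_k} = \cos(\theta_k)\,\mathbb{1} - i\sin(\theta_k)\,V_k
\quad\Longrightarrow\quad
\frac{1}{2\pi}\int_0^{2\pi} e^{-i\theta_k V_k}\,d\theta_k = 0 .

% The cost is quadratic in the rotation, so as a function of \theta_k alone
C(\theta_k) = a + b\cos(2\theta_k) + c\sin(2\theta_k),
\qquad
\partial_k C = -2b\sin(2\theta_k) + 2c\cos(2\theta_k),

% and averaging over a full period gives
\frac{1}{2\pi}\int_0^{2\pi} \partial_k C \, d\theta_k
  = \frac{C(2\pi) - C(0)}{2\pi} = 0 .
```

The last step uses only the periodicity of the cost in $\theta_k$, so the conclusion does not depend on the values of $a$, $b$ and $c$.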
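The role of Chebyshev's inequality can be checked empirically on a toy landscape. The sketch below uses a hypothetical cost $C(\theta) = \prod_i \cos(2\theta_i)$ with angles drawn uniformly from $[0, 2\pi)$; this cost, the sample size, and the threshold `delta` are all illustrative choices, not taken from the text.

```python
import math
import random

def grad_sample(n, rng):
    """Partial derivative with respect to theta_1 of the toy cost
    C = prod_i cos(2 theta_i), at uniformly random angles."""
    thetas = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    g = -2.0 * math.sin(2.0 * thetas[0])
    for t in thetas[1:]:
        g *= math.cos(2.0 * t)
    return g

rng = random.Random(1)
samples = [grad_sample(3, rng) for _ in range(20000)]

mean = sum(samples) / len(samples)                        # close to zero (unbiased)
var = sum((g - mean) ** 2 for g in samples) / len(samples)

delta = 1.0
tail = sum(abs(g) >= delta for g in samples) / len(samples)  # empirical Pr(|grad| >= delta)
bound = var / delta ** 2                                      # Chebyshev upper bound
```

The empirical tail probability sits below the Chebyshev bound, as it must; the bound is typically loose, which is why a small variance is a sufficient but not a necessary signature of a flat landscape.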

D. Barren Plateaus
There is a growing awareness of the so-called barren plateau phenomenon for variational quantum algorithms [27][28][29][30][31][32][33][34][35][36]. For a given ansatz $U(\theta)$, a cost $C$ is said to exhibit a barren plateau if its gradients vanish exponentially with the number of qubits $n$. This is typically relaxed to a probabilistic definition, where the gradient vanishes exponentially with high probability. This would follow from Chebyshev's inequality, Eq. (9), if the variance in the partial derivative vanishes exponentially, i.e., if $\mathrm{Var}[\partial_k C] \in O(2^{-pn})$ for some $p > 0$. For costs that exhibit barren plateaus, exponentially precise measurements may be required to determine the minimization direction, and hence the cost is effectively untrainable for large problem sizes.
To elucidate the conditions under which a layered parameterized ansatz $U(\theta)$, of the form of Eq. (2), gives rise to barren plateaus, consider a bipartite cut of $U(\theta)$ and write

$$U(\theta) = U_L(\theta) U_R(\theta), \tag{11}$$

where $U_R(\theta)$ and $U_L(\theta)$ collect the factors of Eq. (2) on either side of the cut at the parameter $\theta_k$ being differentiated [Eq. (12)], with $U_R$ acting on the input state first. Note that since we suppose the parameters $\theta_j$ are uncorrelated, the circuits $U_L$ and $U_R$ are independent. These circuits are pertinent when quantifying gradients since taking the partial derivative of a circuit, as shown in Appendix D, effectively splits a circuit in two.
Ref. [27] then demonstrated that if the ensemble of unitaries generated by the ansatz $U(\theta)$ is sufficiently random (i.e., expressive) such that the ensembles $\mathbb{U}_L$ or $\mathbb{U}_R$ (associated with the circuits $U_L(\theta)$ and $U_R(\theta)$ respectively) form 2-designs, then the variance in the cost gradient vanishes exponentially with $n$. Specifically, let us denote the variance of the cost partial derivative when just $\mathbb{U}_R$, just $\mathbb{U}_L$, and both $\mathbb{U}_R$ and $\mathbb{U}_L$ form 2-designs as $\mathrm{Var}_R[\partial_k C]$, $\mathrm{Var}_L[\partial_k C]$, and $\mathrm{Var}_{R,L}[\partial_k C]$, respectively. From Ref. [27] it follows that

$$\mathrm{Var}_x[\partial_k C] = \frac{g_x(\rho, H, U)}{2^{2n} - 1} \tag{13}$$

for $x = R$, $x = L$ and $x = R,L$, where we have pulled out the $n$-dependent scaling factor explicitly. The prefactor $g_x(\rho, H, U)$, which we define explicitly in Appendix E, is in $O(2^n)$ for typical choices of $V_k$ and $H$. Thus if $\mathbb{U}_L$ or $\mathbb{U}_R$ forms a 2-design, the variance in the gradient vanishes exponentially in $n$. In other words, maximally expressive ansätze exhibit barren plateaus.
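The link between an exponentially small variance and exponentially precise measurements can be spelled out in one line. In this sketch the constant $c$ and the failure probability $\eta$ are symbols introduced here for illustration.

```latex
% Suppose Var[\partial_k C] \le c\,2^{-pn} for some constants c, p > 0.
% Chebyshev's inequality then gives, for any threshold \delta > 0,
\Pr\big(|\partial_k C| \ge \delta\big)
  \;\le\; \frac{\mathrm{Var}[\partial_k C]}{\delta^2}
  \;\le\; \frac{c\,2^{-pn}}{\delta^2}.

% Choosing \delta = \sqrt{c/\eta}\; 2^{-pn/2} makes the right-hand side equal to \eta, so
\Pr\Big(|\partial_k C| < \sqrt{c/\eta}\; 2^{-pn/2}\Big) \;\ge\; 1 - \eta .
```

In words, with probability at least $1 - \eta$ the gradient component is smaller than a quantity that shrinks as $2^{-pn/2}$, so resolving its sign requires a measurement precision that is exponentially fine in $n$.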

A. Analytic Bounds
In this section, we study the gradient of a generic cost $C_{\rho,H}(\theta)$, Eq. (1), with an ansatz $U(\theta)$, Eq. (2), but relax the assumption that $\mathbb{U}_L$ or $\mathbb{U}_R$ forms a 2-design. By doing so, we extend the results on barren plateaus from Ref. [27] to arbitrary ansätze. As will become clear, this generalization enables us to relate the variance of the cost function partial derivative to the expressibility of $U(\theta)$ in Eq. (4).
Let us start by noting that while maximally expressive ansätze exhibit barren plateaus, the converse is not necessarily true. In other words, highly inexpressive ansätze need not always experience large cost gradients, and in fact they may exhibit vanishing gradients. A trivial example of this phenomenon is provided by an ansatz composed of rotations that commute with the measurement operator, $[U(\theta), H] = 0$. Such an ansatz will leave the cost unchanged for any $\theta$, and so the variance of the cost gradient for such an ansatz is necessarily zero. A more subtle example is an ansatz composed of a tensor product of single-qubit rotations. Since this ansatz does not generate entanglement it is inexpressive; however, it has also been shown to exhibit a barren plateau for global cost functions [7,28]. It follows from these observations that it is not possible to meaningfully lower bound the gradients of an ansatz in terms of its expressibility.
Therefore, to relate cost gradients to expressibility we instead derive an upper bound. Specifically, our main result consists of a non-trivial upper bound for the variance of the cost function partial derivative for a general ansatz $U(\theta)$ in terms of the expressibility in Eq. (4). This bound is in terms of: (1) the variance of the cost gradient when either $\mathbb{U}_L$ or $\mathbb{U}_R$ form a 2-design, and (2) the expressibility of the ansatz as measured by the distance $\mathbb{U}_L$ and $\mathbb{U}_R$ are from being 2-designs. As shown in Appendix D, we prove the following.
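The tensor-product example can be worked through in closed form for one concrete choice. The following is a sketch under assumptions made here for illustration: the input state is $|0\rangle^{\otimes n}$, each qubit is rotated by $e^{-i\theta_i \sigma_y}$, the global measurement operator is $H = \sigma_z^{\otimes n}$, and the angles are independent and uniform on $[0, 2\pi)$.

```latex
% Each qubit contributes <0| e^{+i\theta_i\sigma_y} \sigma_z e^{-i\theta_i\sigma_y} |0> = \cos(2\theta_i), so
C(\theta) = \prod_{i=1}^{n} \cos(2\theta_i),
\qquad
\partial_1 C = -2\sin(2\theta_1)\prod_{i=2}^{n}\cos(2\theta_i).

% The angles are independent, with <\sin^2> = <\cos^2> = 1/2 and <\sin> = 0, hence
\langle \partial_1 C \rangle = 0,
\qquad
\mathrm{Var}[\partial_1 C]
  = 4\,\langle \sin^2(2\theta_1)\rangle \prod_{i=2}^{n} \langle \cos^2(2\theta_i)\rangle
  = 4\left(\tfrac{1}{2}\right)^{n} = 2^{\,2-n}.
```

The variance decays exponentially in $n$ even though the ansatz generates no entanglement, which illustrates why low expressibility alone cannot guarantee large gradients.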
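Theorem 1. Consider a generic cost function $C_{\rho,H}(\theta)$, Eq. (1), using a layered ansatz $U(\theta)$ of the general form in Eq. (2). The variance of the cost partial derivative, $\mathrm{Var}[\partial_k C]$, obeys the upper bounds given in Eqs. (14)-(16). Here we use the shorthand $\varepsilon_R^{\rho} := \varepsilon^{\rho}_{\mathbb{U}_R}$ and $\varepsilon_L^{H} := \varepsilon^{H}_{\mathbb{U}_L}$, and the function $f$ entering Eq. (16) is defined in Eq. (17).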
Theorem 1 establishes a formal relationship between the gradient of the cost landscape and the expressibility of the ansatz used. Namely, the higher the expressibility of the ansatz, that is, the smaller $\varepsilon_L^{H}$ or $\varepsilon_R^{\rho}$, the smaller the upper bound on the variance of the cost partial derivative. This, in combination with the fact that the cost gradient is unbiased, demonstrates that highly expressive ansätze will have flatter landscapes and consequently be harder to train.
In contrast to the bounds specified by Eq. (13), which hold for three distinct cases (i.e., when $\mathbb{U}_L$ is a 2-design, when $\mathbb{U}_R$ is a 2-design, and when both $\mathbb{U}_L$ and $\mathbb{U}_R$ are 2-designs), the bounds in Eqs. (14)-(16) all hold for any generic ansatz of the form in Eq. (2). Thus any single bound would suffice to bound the variance in the cost function partial derivative for an arbitrary ansatz.
We include all three bounds despite this fact since in any instance one bound may be tighter than the others and hence more informative. In particular, the relative tightness of the bounds depends on which parameter we are taking the derivative with respect to. This follows from the fact that Eq. (14) becomes an equality in the limit that $\mathbb{U}_R$ tends to a 2-design, whereas Eq. (15) becomes an equality in the limit that $\mathbb{U}_L$ is a 2-design and Eq. (16) becomes an equality in the limit that both $\mathbb{U}_L$ and $\mathbb{U}_R$ are 2-designs. If we are looking at the derivative with respect to the final layer, then $\mathbb{U}_R$ is typically closer to being a 2-design than $\mathbb{U}_L$, and so Eq. (14) will be tightest. Conversely, if we are most interested in the partial derivative with respect to a parameter in the first layer, then Eq. (15) will be tightest. On the other hand, for parameters in a layer close to the middle (i.e., at depth $D/2$), Eq. (16) will be tightest since, as shown in Appendix D, the derivation of this bound uses the most information about the ansatz.
In Appendix D, we extend Theorem 1 to cost functions of the form $C_{\rm gen} = \sum_i \mathrm{Tr}[H_i U(\theta) \rho_i U(\theta)^\dagger]$, which allow for multiple input states and measurements. Thus our results also apply to quantum machine learning approaches that utilize training data [50][51][52][53].
Generalizing the Barren Plateau phenomenon. Theorem 1 may be viewed as an extension of the barren plateau phenomenon introduced in Ref. [27] to ansätze that form approximate, rather than exact, 2-designs. By combining Eq. (13) and Eq. (16), we find that the variance in the partial derivative for an arbitrary ansatz is bounded as

$$\mathrm{Var}[\partial_k C] \leq \frac{g_{R,L}(\rho, H, U)}{2^{2n} - 1} + f(\varepsilon_L^{H}, \varepsilon_R^{\rho}). \tag{18}$$

Here the first term on the right is the variance of a maximally expressive ansatz (namely, one that forms a 2-design) and $f(\varepsilon_L^{H}, \varepsilon_R^{\rho})$ is the expressibility-dependent correction term defined in Eq. (17). Expressions similar to Eq. (18) are obtainable from Eq. (14) and Eq. (15).
For perfectly expressive ansätze, $f(\varepsilon_L^{H}, \varepsilon_R^{\rho})$ vanishes and Eq. (18) reduces to Eq. (13), regaining the result of Ref. [27]. In this case, the variance in the gradient vanishes exponentially with the size of the system $n$, i.e., the ansatz exhibits a barren plateau. Similarly, if the expressibility of an ansatz increases exponentially with the size of the problem, i.e., if $f(\varepsilon_L^{H}, \varepsilon_R^{\rho}) \in O(1/2^{kn})$ for $k > 0$, then $\mathrm{Var}[\partial_k C]$ again vanishes exponentially and the ansatz exhibits a barren plateau. However, more generally, when $f(\varepsilon_L^{H}, \varepsilon_R^{\rho})$ scales non-exponentially, the upper bound allows for the variance in the partial derivative to be non-vanishing. Thus, there is leeway for imperfectly expressive ansätze to avoid barren plateaus.
In Ref. [60] it was proven that the barren plateau phenomenon is necessarily associated with the concentration of cost function values about their mean. More concretely, it was shown that the probability that the cost function deviates from its mean is determined by the variation in the gradient of the cost. Thus our bounds also imply that the degree to which the cost concentrates about its mean increases with increasing expressibility. In Appendix F, we provide an alternative proof of this following on from the results of Ref. [33].
Diamond Norm Reformulation. For local costs the term $\|H\|_2^2$ scales exponentially with the size of the system, and therefore for large systems Eq. (14) becomes exponentially loose. This issue can be mitigated by reformulating Theorem 1 in terms of $\varepsilon_{\mathbb{U}}$, Eq. (7). We obtain the following theorem in Appendix D.

FIG. 2. Ansatz employed in numerical simulations.
The ansatz is composed of alternating random single qubit rotations and ladders of C-Phase operations. The colored boxes indicate the gates which are fixed to rotate by the same angle and in the same direction when we correlate the ansatz layers (yellow), correlate qubits (blue) and correlate both the layers and qubits (green).
Theorem 2. Consider a generic cost function $C_{\rho,H}(\theta)$, Eq. (1), using a layered ansatz $U(\theta)$ of the general form in Eq. (2). The variance of the cost partial derivative obeys the bounds stated in Eq. (19) and the equations that follow it, where we use the shorthand $\varepsilon_R = \varepsilon_{\mathbb{U}_R}$ and $\varepsilon_L = \varepsilon_{\mathbb{U}_L}$, with $f(x, y)$ defined in Eq. (17).
Again, Theorem 2 formally establishes that highly expressive ansätze experience flatter cost landscapes. Furthermore, a relation similar to Eq. (18) can be derived from Theorem 2. Hence, Theorem 2 also provides an extension of the barren plateau result of Ref. [27]. However, since $\|H\|_\infty^2 \in O(1)$ for all $H$, Eq. (19) does not experience the same looseness for local costs of large systems as Eq. (14). On the other hand, since $\|H\|_1$ may scale exponentially in $n$, Eq. (20) may become loose for large systems, and therefore we expect Eq. (15) to generally be more useful than Eq. (20).

B. Numerical Simulations
Since the analytic bounds in the previous section are upper bounds, we have no guarantee that inexpressive ansätze will exhibit larger cost gradients. The bounds thus leave open the question of whether and how reducing the expressibility of an ansatz changes the cost landscape. Moreover, they leave open the question of how one can avoid the barren plateau phenomenon that is observed for maximally expressive ansätze.
One can conceive of numerous ways in which the expressibility of an ansatz can be tuned, each of which could have a different impact. In this section, we consider four such ways: decreasing the depth of the circuits, correlating circuit parameters, and restricting either the direction or angle of rotations. We then numerically investigate the effect these have on the cost gradient scaling.
For completeness, in our numerics we consider both a 2-local cost, where the measurement operator is composed of Pauli-$z$ measurements on the first and second qubits, and a global cost, where the measurement operator consists of Pauli-$z$ measurements across all qubits [28]. In both cases, following [27], the system is prepared in the pure state $\rho = (|\psi_0\rangle\langle\psi_0|)^{\otimes n}$, where $|\psi_0\rangle = \exp(-i(\pi/8)\sigma_Y)|0\rangle$. We further consider a layered hardware efficient ansatz, consisting of $D$ alternating layers of random single-qubit gates and entangling gates, as shown in Fig. 2. Specifically, the entangling layer is composed of a ladder of controlled-phase operations, C-Phase, between adjacent qubits in a 1-dimensional array. The single-qubit layer consists of a series of random single-qubit rotations $R_{k_i^l}(\theta_i^l)$, where $R_{k_i^l}(\theta_i^l)$ is a rotation of the $i$th qubit by an angle $\theta_i^l$ about the $k_i^l = x$, $y$ or $z$ axis. In the maximally expressive version of the ansatz, the $x$, $y$ or $z$ rotation directions $\{k_i^l\}$ for each qubit on each layer are chosen independently and with equal probability, and the rotation angles $\{\theta_i^l\}$ are independently and randomly chosen in the range 0 to $2\pi$. Our numerics are implemented using TensorFlow Quantum [61].

Circuit depth. One of the simplest ways of reducing the expressibility of an ansatz is reducing the depth $D$ of the circuit. It was shown in [28] that global costs with a hardware efficient ansatz experience barren plateaus irrespective of the depth of the circuit. However, local costs only exhibit barren plateaus for deep circuits ($D \in \Omega(\mathrm{poly}(n))$) but are trainable for shallow circuits ($D \in O(\log(n))$).
We obtain similar results here. As shown in Fig. 3(A), for the global cost the variance in the partial derivative is seemingly independent of the depth of the circuit and vanishes exponentially with the size of the system $n$. Conversely, for local costs, as shown in Fig. 3(D), exponentially vanishing partial derivatives are observed for systems up to 12 qubits for depths $D \gtrsim 100$. However, shallow circuits with $D \lesssim 50$ exhibit an approximately constant scaling for $n \gtrsim 8$.

Correlating parameters. A more sophisticated means of reducing the expressibility of the ansatz is to correlate the rotation angles [37]. Here we consider three different means of correlating parameters, as sketched in Fig. 2, and plot the corresponding variance in the cost partial derivative in the central panel of Fig. 3. In the first, shown in yellow, we correlate the qubits (but allow the angles to vary between layers), i.e., $k_i^l = k_{i'}^l$ and $\theta_i^l = \theta_{i'}^l$ for any two qubits $i$ and $i'$. In the second (plotted in green) we correlate the different layers (but not the qubits), i.e., $k_i^l = k_i^{l'}$ and $\theta_i^l = \theta_i^{l'}$ for any two layers $l$ and $l'$. Finally, as shown in blue, we correlate both the qubits and layers. In this case all the qubits rotate in the same direction and by the same angle, i.e., $k_i^l = k_{i'}^{l'}$ and $\theta_i^l = \theta_{i'}^{l'}$ for any two qubits $i$ and $i'$ and layers $l$ and $l'$. In other words, all parameters are correlated. The data for only $y$ ($x$) rotations is indicated by the solid (dashed) lines respectively.
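The numerical setup described above (product input state, layered ansatz, local and global Pauli-$z$ costs) can be sketched as a small state-vector simulation. This sketch uses plain NumPy as a stand-in rather than TensorFlow Quantum, and it makes several assumptions that are not fixed by the text: C-Phase is taken to be a controlled-$Z$, rotations follow the convention $\exp(-i\theta\sigma_k/2)$, and the derivative with respect to the first angle of the first layer is obtained with the parameter-shift rule. All function names are hypothetical.

```python
import numpy as np

PAULIS = {
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def rot(axis, theta):
    """Rotation exp(-i theta sigma_axis / 2) (assumed angle convention)."""
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * PAULIS[axis]

def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q of a state tensor of shape (2,)*n."""
    return np.moveaxis(np.tensordot(gate, state, axes=([1], [q])), 0, q)

def apply_cz(state, q1, q2):
    """Controlled-Z between q1 and q2 (assumed form of C-Phase)."""
    state = state.copy()
    idx = [slice(None)] * state.ndim
    idx[q1], idx[q2] = 1, 1
    state[tuple(idx)] *= -1
    return state

def cost(n, axes, thetas, local=True):
    """<H> after the layered circuit; H = Z_1 Z_2 (local) or Z on all qubits (global)."""
    psi0 = rot("y", np.pi / 4) @ np.array([1, 0], dtype=complex)  # exp(-i (pi/8) Y)|0>
    state = psi0
    for _ in range(n - 1):
        state = np.tensordot(state, psi0, axes=0)
    for layer_axes, layer_thetas in zip(axes, thetas):
        for q in range(n):
            state = apply_1q(state, rot(layer_axes[q], layer_thetas[q]), q)
        for q in range(n - 1):
            state = apply_cz(state, q, q + 1)
    probs = np.abs(state) ** 2
    signs = np.ones_like(probs)
    for q in (range(2) if local else range(n)):
        shape = [1] * n
        shape[q] = 2
        signs = signs * np.array([1.0, -1.0]).reshape(shape)
    return float(np.sum(signs * probs))

def grad_variance(n, depth, samples, rng, local=True):
    """Variance of dC/d(theta_1^1) over random axes and angles (parameter-shift rule)."""
    grads = []
    for _ in range(samples):
        axes = rng.choice(["x", "y", "z"], size=(depth, n))
        thetas = rng.uniform(0, 2 * np.pi, size=(depth, n))
        plus, minus = thetas.copy(), thetas.copy()
        plus[0, 0] += np.pi / 2
        minus[0, 0] -= np.pi / 2
        grads.append(0.5 * (cost(n, axes, plus, local) - cost(n, axes, minus, local)))
    return float(np.var(grads))

rng = np.random.default_rng(0)
for n in (2, 4, 6):
    print(n, grad_variance(n, depth=20, samples=100, rng=rng, local=False))
```

The sample counts and system sizes here are far smaller than those used for the figures, so the printed variances only indicate the trend with $n$; they are not a reproduction of the reported data.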
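The three correlation patterns described above amount to tying entries of the angle array together. A minimal sketch, assuming the angles are stored as an array of shape (layers, qubits) and using hypothetical mode names, is given below; the same tying would be applied to the rotation axes.

```python
import numpy as np

def sample_angles(depth, n, mode, rng):
    """Draw a (depth, n) array of rotation angles with the requested correlations."""
    if mode == "none":      # fully independent angles
        return rng.uniform(0, 2 * np.pi, size=(depth, n))
    if mode == "qubits":    # same angle on every qubit, free across layers
        return np.repeat(rng.uniform(0, 2 * np.pi, size=(depth, 1)), n, axis=1)
    if mode == "layers":    # same angle on every layer, free across qubits
        return np.repeat(rng.uniform(0, 2 * np.pi, size=(1, n)), depth, axis=0)
    if mode == "both":      # a single shared angle for the whole circuit
        return np.full((depth, n), rng.uniform(0, 2 * np.pi))
    raise ValueError(f"unknown correlation mode: {mode}")

rng = np.random.default_rng(0)
angles = sample_angles(depth=4, n=3, mode="qubits", rng=rng)
```

The number of independent parameters drops from `depth * n` with no correlation to `depth`, `n`, or 1 in the three correlated cases, which is one way to see why the fully correlated ansatz is the least expressive.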
In contrast to varying circuit depth, here we obtain similar results irrespective of whether a local or global cost is used. Correlating both the qubits and the layers results in the least expressive ansatz and correspondingly the largest variation in cost gradients is observed. Indeed, in this case the variance in the cost gradient is approximately constant. In contrast, correlating just the qubits, or just the layers, increases the cost gradients and reduces the scaling of the cost gradient with system size but an exponential scaling is still observed.
Restricting rotation direction. One might also consider reducing the expressibility of the ansatz by reducing the single qubit rotation gates to a subset of directions. We explore this in the right panel of Fig. 3. In blue we plot the variance when only rotations in a single direction, namely in the x (dark blue) or y (light blue) direction, are implemented. We do not plot the case when only z rotations are implemented since in that case U commutes with H L = σ z 1 σ z 2 and H G = n i=1 σ z i , and so the cost landscape is entirely flat. For a local cost, reducing the expressibility of the ansatz by restricting to single direction rotations seemingly removes the exponential gradient scaling. However, for a global cost the scaling remains exponential.
Restricting rotation angles. A final way to reduce the expressiblity of an ansatz is by reducing the range the rotation angles θ are chosen from. That is, choosing the θ i l in the range [θ i l ,θ i l + 2πr] whereθ i l is a fixed initialization point. For r = 1 the ansatz explores the entire solution space but for r < 1 the ansatz is constrained to exploring a subset of the solution space where the rotation angles θ i l deviate fromθ i l by at most 2πr. However, with a little thought, it is clear that, in contrast to the previous three approaches we have discussed, restricting the rotation angles of the ansatz does not change the cost landscape but rather limits the region of the landscape explored by the ansatz. Thus, in general, reducing the rotation angles does not effect the cost gradients experienced. This intuition is confirmed by the numerical results displayed in the top panel of Fig. 4. Here we randomly initialize the parameters by randomly choosingθ i l in the range [0, 2π]. We find the the cost partial derivatives for different r values perfectly overlap in this case, i.e., for a random initialization, restricting the ansatz to a limited range of rotation angles does not change the partial derivatives observed.
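The restricted sampling described above, in which each angle is drawn from a window of width $2\pi r$ starting at a fixed initialization point, can be sketched as follows. The function name and the (layers, qubits) array layout are assumptions made for illustration.

```python
import numpy as np

def restricted_angles(center, r, rng):
    """Sample each angle uniformly from [center, center + 2*pi*r], elementwise."""
    center = np.asarray(center, dtype=float)
    return center + 2 * np.pi * r * rng.uniform(0.0, 1.0, size=center.shape)

rng = np.random.default_rng(0)

# Hypothetical usage: initialise at identity (all angles zero) and explore a narrow window.
near_identity = restricted_angles(np.zeros((4, 3)), r=0.1, rng=rng)

# A random initialization point with the full range r = 1 recovers unrestricted sampling.
center = rng.uniform(0, 2 * np.pi, size=(4, 3))
unrestricted = restricted_angles(center, r=1.0, rng=rng)
```

Because the landscape is periodic in each angle, the case $r = 1$ covers a full period regardless of the initialization point, which matches the statement that the entire space is explored in that limit.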
On the other hand, if the parameters are initialized close to the solution, varying r has a substantial effect on the observed partial derivatives for local costs, and a reduced effect for global costs. This is seen in (B) and (C) of Fig. 4 where we initialize to identity, i.e., pickθ i l = 0 for all i, which is close to the solution for this simple problem. In this case, for r close to 1 (as shown in red and yellow) the variance in the partial derivative again vanishes exponentially with n. However, for small angle ranges, r 0.1, as shown in blue, we find that the partial derivative of a local cost ceases to exhibit an exponential scaling. To some degree, a similar effect is displayed for global costs; however, the effect is reduced and is only visible in the data here for r ≈ 0.025.
This change in partial derivative scaling for small r, for initializations close to the solution, is plausibly explained by the fact that the global minimum of a cost exhibiting a barren plateau tends to sit within a steep and narrow gorge [28], as sketched in Fig. 1(C). By initializing close to the solution we are likely to be initializing within the narrow gorge. In this case, when r is close to 1 the ansatz still explores the entire cost landscape and therefore the variance in the partial derivative will be unchanged. However, for smaller r the ansatz is constrained to the region around the narrow gorge itself, and hence a larger variance in partial derivatives is observed.
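The narrow-gorge picture above can be illustrated with a deliberately simple toy model. The sketch below is not the layered ansatz used in the numerics; it assumes a product of single-qubit rotations with the global cost C = 1 - prod_j cos^2(theta_j / 2), whose minimum sits at theta = 0. The function names and default sample counts are hypothetical choices made for illustration.

```python
import math
import random

def sample_angles(theta_init, r, rng):
    """Draw each angle uniformly from [theta_init[j], theta_init[j] + 2*pi*r]."""
    return [t0 + 2.0 * math.pi * r * rng.random() for t0 in theta_init]

def toy_partial_derivative(thetas):
    """d/d(theta_1) of the toy global cost C = 1 - prod_j cos^2(theta_j / 2)."""
    rest = math.prod(math.cos(t / 2.0) ** 2 for t in thetas[1:])
    return 0.5 * math.sin(thetas[0]) * rest

def gradient_variance(n, r, theta_init=None, samples=20000, seed=0):
    """Monte Carlo variance of the toy partial derivative for angle range r."""
    rng = random.Random(seed)
    # Default: initialize at the identity (all angles zero), i.e. at the minimum.
    theta_init = [0.0] * n if theta_init is None else theta_init
    grads = [toy_partial_derivative(sample_angles(theta_init, r, rng))
             for _ in range(samples)]
    mean = sum(grads) / samples
    return sum((g - mean) ** 2 for g in grads) / samples

if __name__ == "__main__":
    for n in (2, 6, 10):
        print(n, gradient_variance(n, r=1.0), gradient_variance(n, r=0.05))
```

In this toy model the full range r = 1 gives a variance of (1/8)(3/8)^(n-1), which decays exponentially with n, whereas a small r centred on the minimum keeps every cosine factor close to 1 and so avoids the exponential suppression.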
Outlook for ansatz design. Figure 3 suggests that reducing the depth of a circuit and correlating parameters are the most effective strategies for amplifying the observed cost gradients. However, the optimal solution, of course, may not lie within a shallow or highly correlated ansatz. When deep and/or uncorrelated circuits are required, as is expected to be the case for many problems of interest, then a perturbative strategy may instead be effective. That is, one could start the variational algorithm using a shallow, highly correlated ansatz and as the cost is iteratively minimized gradually grow the ansatz [8,15,38] and decorrelate the parameters [37].
Restricting the angle range also appears to provide an effective strategy for increasing cost gradients, but for it to be practical it is necessary to initialize close to the solution. This, of course, requires either prior knowledge of an approximate solution to the problem at hand or an effective pre-training strategy to obtain such an approximate solution. The viability of either of these options warrants further investigation.
FIG. 4. [...] $[\tilde{\theta}^i_l, \tilde{\theta}^i_l + 2\pi r]$, such that for $r = 1$ (red) the ansatz explores the entire solution space but for $r \ll 1$ (blue) the ansatz is constrained to exploring close to the initialization point defined by $\{\tilde{\theta}^i_l\}$. In (A), the angles $\{\tilde{\theta}^i_l\}$ are a fixed (randomly chosen) initialization point away from the solution (here we consider a local cost but the data for a global cost is essentially unchanged). In (B) and (C), which correspond to global and local costs respectively, the angles $\tilde{\theta}^i_l = 0$ for all l and i, which is close to the global minimum of the cost. In all cases the derivative is taken with respect to $\theta^1_1$, the rotation angle of the first qubit in the first layer, and the variance is taken over an ensemble of 1000 unitaries.
FIG. 5. [...] The dashed line indicates the predicted variance in the partial derivative for a perfect 2-design from Ref. [62]. The derivative is taken with respect to $\theta^1_{D/2}$, the rotation angle of the first qubit in the middle layer (D/2), and the variance is taken over an ensemble of 1000 unitaries. We chose to show the state and Hamiltonian dependent bound here, Eq. (16), because it is the tightest bound for the gradient with respect to $\theta^1_{D/2}$.
Correlation and tightness of bounds. In Fig. 6 we study the correlation between the cost gradients and our upper bounds. To quantify this correlation we include the Spearman correlation coefficient [63], as well as its corresponding p-value, which approximately gives the probability of uncorrelated data generating a Spearman coefficient at least as large as the one found. For local costs we obtain a Spearman value of at least 0.9 with a p-value of less than 0.05 in all cases, indicating a strong correlation between our upper bound and the actual variance in the gradient. The correlation is weaker in the case of global costs, highlighting that in the case of a global cost expressibility is not the only phenomenon that may induce a barren plateau. This is to be expected given the results of Ref. [28], which show that even very shallow and/or non-entangling circuits (i.e., highly inexpressive circuits) may exhibit barren plateaus when using global costs. For completeness, the results presented here are extended in Appendix G, where we study directly the correlation between the cost gradient and the expressibility measures $\varepsilon^{\rho}_{R}$ and $\varepsilon^{H}_{L}$, with similar correlations observed. Figure 6 additionally highlights that, as expected, the bounds are tightest for higher expressibility ansätze but may be relatively loose for lower expressibilities. More specifically, in all cases considered here, the bounds are tight to within a couple of orders of magnitude, with the bounds tightest for ansätze that are high depth, uncorrelated and use the full range of rotation directions. This phenomenon is more clearly demonstrated in Fig. 5, where we plot both the variance in the partial derivative of the cost and the Hamiltonian and state dependent expressibility bound, Eq. (16), as a function of ansatz depth for the 8 qubit local cost. The bound captures the qualitative behaviour of the cost gradients, decreasing with increased circuit depth. While moderately loose at low depths, the bound becomes tight for deep circuits.
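For readers unfamiliar with the Spearman coefficient used above, the following is a minimal sketch of how it can be computed: it is the Pearson correlation of the ranks of the two data sets, with tied values sharing their average rank. The helper names are hypothetical, the p-value is not computed here, and any data passed in would be a stand-in for (bound, variance) pairs rather than results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter, the coefficient equals 1 for any strictly increasing relationship, which is why it suits comparing a bound against the quantity it bounds across several orders of magnitude.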

IV. Discussion
In this work, we extended the well-known barren plateau result, which was previously restricted to ansätze that form 2-designs [27], to arbitrary ansätze in our Theorems 1 and 2. In practice, this extension may prove to be quite useful, since many ansätze of interest are not exact 2-designs but only approximate ones [59, 64-66]. Our results can potentially provide useful bounds on the variance of the gradient in this realistic scenario of approximate 2-designs.
The key to our extension was to consider the expressibility of the ansatz. This can be precisely defined in terms of the distance of the ensemble of unitaries accessible by the ansatz from being a 2-design. Hence, our extension linked two key properties of ansätze: their expressibility and their gradient magnitudes. Our bounds demonstrate that increasing the expressibility of an ansatz can result in smaller cost gradients. We believe that this connection is very interesting, and there is certainly much more to be explored along these lines. For example, it would be interesting to connect our findings to recent results on the role of the growth of entanglement in generating barren plateaus. In particular, since highly expressive ansätze are necessarily highly entangling, our results would seem to imply those in Refs. [33,44].
To go beyond our bounds and look at the precise relation between expressibility and gradients, we performed extensive numerics. We considered several different strategies by which one can vary the expressibility. As highlighted in Figs. 5 and 6, we typically observed a strong correlation (especially for local cost functions) between the expressibility and the variance of the gradient. However, the bounds are not perfectly tight. This may arise from the repeated use of the triangle and Cauchy-Schwarz inequalities in the derivation (Appendix D). Thus a natural question to ask is whether our bounds can be further tightened. Another direction would be to explore the nature of barren plateaus for global costs, where the numerics suggest that the correlation between expressibility and cost gradients is weaker.
We remark that the numerical results presented here are necessarily problem specific, since they depend both on the choice of cost function and on the ansatz. Further work is required to ascertain the extent to which the trends observed here hold universally. In particular, it would be valuable to investigate whether any analytic results can be obtained to support them.
Nevertheless, there are several interesting trends shown in our numerics that suggest potential strategies for avoiding or mitigating barren plateaus. As discussed above, correlating parameters and restricting rotation angles (especially when initializing near the solution) are two strategies that significantly mitigated barren plateaus in our numerics. Further exploring these and other strategies will be an important direction for future research.

Appendices
We begin by reviewing some definitions and prior results relevant for the rest of the appendices. We then provide proofs for the main results and theorems.
[...] $\sqrt{\Omega^\dagger \Omega}$. More generally, the Schatten p-norm of an operator $\Omega$ can be defined as $\|\Omega\|_p = (\mathrm{Tr}[|\Omega|^p])^{1/p}$, which satisfies $\|\Omega\|_p \leq \|\Omega\|_q$ for $p \geq q$. The diamond norm of a Hermiticity-preserving linear map $S_A$ is defined as [...], where $\Omega_{AB} \in L(H_A \otimes H_B)$ and $I$ [...], where $W \in U(d)$.
Symbolic integration. We recall formulas that allow for symbolic integration with respect to the Haar measure on the unitary group [67]. For any $V \in U(d)$ the following expressions are valid for the first two moments: [...], where $v_{ij}$ are the matrix elements of $V$. Assuming $d = 2^n$, we use the notation $\boldsymbol{i} = (i_1, \ldots, i_n)$ to denote a bitstring of length n such that $i_1, i_2, \ldots, i_n \in \{0, 1\}$.
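As a sanity check on Haar moment formulas of this kind, one can sample Haar-random unitaries numerically. The sketch below uses the standard construction from the QR decomposition of a complex Gaussian matrix and estimates two simple moments of a single matrix element, for which the standard Haar values are E|v_00|^2 = 1/d and E|v_00|^4 = 2/(d(d+1)). The function names and sample counts are hypothetical.

```python
import numpy as np

def haar_unitary(d, rng):
    """Sample a d x d Haar-random unitary via QR of a complex Gaussian matrix."""
    z = (rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    # Fix the phase ambiguity of QR so the result is Haar distributed.
    phases = np.diag(r) / np.abs(np.diag(r))
    return q * phases  # multiplies column j of q by phases[j]

def entry_moments(d, samples=20000, seed=0):
    """Estimate E|v_00|^2 and E|v_00|^4 over the Haar measure on U(d)."""
    rng = np.random.default_rng(seed)
    abs2 = np.array([np.abs(haar_unitary(d, rng)[0, 0]) ** 2 for _ in range(samples)])
    return abs2.mean(), (abs2 ** 2).mean()

if __name__ == "__main__":
    d = 4
    m1, m2 = entry_moments(d)
    print(m1, 1.0 / d)
    print(m2, 2.0 / (d * (d + 1)))
```

The phase correction matters: without it, the output of a QR routine is unitary but not, in general, Haar distributed.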
Useful identities. We use the following identities, which can be derived using Eq. (A3) (see Ref. [28] for a review): [...], where $W$ is the subsystem swap operator, i.e., $W|i\rangle|j\rangle = |j\rangle|i\rangle$.

B. Definitions of Expressibility
In broad terms, a parameterized quantum circuit can be considered expressive if the circuit can be used to uniformly explore the unitary group $U(d)$. Thus, the expressibility of a circuit can be defined in terms of the following superoperator [...], where $d\mu(V)$ is the volume element of the Haar measure and $dU$ is the volume element corresponding to the uniform distribution over $\mathbb{U}$. If $A^{(t)}_{\mathbb{U}}(X) = 0$ for all operators $X$, then averaging over elements of $\mathbb{U}$ agrees with averaging over the Haar distribution up to the t-th moment. In this case $\mathbb{U}$ is said to form a t-design. For our purposes it suffices to consider the behavior of $A^{(t)}_{\mathbb{U}}$ for $t = 2$. Henceforth, we drop the t-superscript and denote $A^{(2)}_{\mathbb{U}}(\cdot)$ as $A_{\mathbb{U}}(\cdot)$. In the context of minimizing a generic cost C of the form specified by Eq. (1), we are interested in the quantities [...].
The quantities $\varepsilon^{\rho}_{\mathbb{U}}$ and $\varepsilon^{H}_{\mathbb{U}}$ may be more readily computed by relating them to a generalization of the frame potential. To demonstrate how, let us first recall that the frame potential [48,56] of an ensemble $\mathbb{U}$ may be defined as [...], where $dU$ and $dV$ are volume elements corresponding to the distribution over $\mathbb{U}$. We then note that the quantity $\|A_{\mathbb{U}}(|0\rangle\langle 0|)\|_2^2$ can be rewritten in terms of $F_{\mathbb{U}}$ as follows [...], where we use the left and right invariance of the Haar measure, i.e., Eq. (A2), as in Ref. [56], and where we defined [...].
In the context of the expressibility of a VQA we are interested in the more general quantity $\|A_{\mathbb{U}}(X^{\otimes 2})\|_2$, where X is a quantum state ρ or Hamiltonian H. Following the same approach as in Eq. (B5), we note that $\|A_{\mathbb{U}}(X^{\otimes 2})\|_2$ can be rewritten as [...], where we have defined the operator-dependent frame potential as [...] and $F(X)$ [...]. The latter can be evaluated using Eq. (A6) to give [...]. Thus our expressibility measures can be related to state and Hamiltonian dependent frame potentials via [...] (B12). We will use these expressions to evaluate the expressibility of different ansätze in Appendix G.

C. Proof for Eq. (8)
For a random layered parametrized ansatz of the form Eq. (2) and Eqs. (11)-(12), and the generic cost defined in Eq. (1), we now show that $\langle \partial_k C \rangle_{\mathbb{U}} = 0$ for all k and therefore the cost landscape is unbiased.
To do so, let us first note that the cost function can be expressed as [...], where we introduce the shorthand [...]. This rewriting emphasises the dependence of C on $U_k(\theta_k)$, the rotation with respect to which we take the partial derivative, by associating $U_L$ with the Hamiltonian H and [...]. Since $\int_0^{2\pi} \sin(2\theta_k)\, d\theta_k = 0$ and $\int_0^{2\pi} \cos^2(\theta_k)\, d\theta_k = \int_0^{2\pi} \sin^2(\theta_k)\, d\theta_k$, uniform averaging of $\partial_k C$ over $\theta_k$ leads to [...], which implies that $\langle \partial_k C \rangle_{\mathbb{U}} = 0$.
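The role of the two trigonometric integrals can be made explicit with a short worked sketch. It assumes, since the explicit form is not shown in the text above, that the rotation is $U_k(\theta_k) = e^{-i\theta_k V_k}$ with $V_k$ Hermitian and $V_k^2 = \mathbb{1}$. The symbols $\rho'$ and $H'$ are hypothetical shorthand for the state and the measurement operator with all other gates absorbed into them, and the other parameters are held fixed.

```latex
\begin{align}
U_k(\theta_k) &= \cos\theta_k\,\mathbb{1} - i\sin\theta_k\,V_k, \\
C(\theta_k) &= \mathrm{Tr}\big[H' U_k(\theta_k)\,\rho'\,U_k^\dagger(\theta_k)\big] \nonumber\\
 &= \cos^2\theta_k\,\mathrm{Tr}[H'\rho'] + \sin^2\theta_k\,\mathrm{Tr}[H' V_k \rho' V_k]
    + \tfrac{i}{2}\sin(2\theta_k)\,\mathrm{Tr}\big(H'[\rho', V_k]\big), \\
\partial_k C &= \sin(2\theta_k)\big(\mathrm{Tr}[H' V_k \rho' V_k] - \mathrm{Tr}[H'\rho']\big)
    + i\cos(2\theta_k)\,\mathrm{Tr}\big(H'[\rho', V_k]\big), \\
\frac{1}{2\pi}\int_0^{2\pi} \partial_k C \, d\theta_k &= 0 .
\end{align}
```

The first term averages to zero because $\int_0^{2\pi}\sin(2\theta_k)\,d\theta_k = 0$, and the second because $\cos(2\theta_k) = \cos^2\theta_k - \sin^2\theta_k$, whose two parts have equal integrals over a full period.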

D. Variance of the partial derivative derivation
For a random layered parametrized ansatz of the form Eqs. (2) and (11)-(12), and the generic cost defined in Eq. (1), since [...], where $U_L$ and $U_R$ are defined in Eq. (12), it follows that the partial derivative of the cost can be written as [...]. Since the average derivative of the cost vanishes, as discussed in Appendix C, its variance is given by [...]. Note that two different ensembles $\mathbb{U}_L$ and $\mathbb{U}_R$ can be generated using $U_L(\theta)$ and $U_R(\theta)$, respectively, as defined in Eq. (11). Let $dU_L$ and $dU_R$ denote volume elements corresponding to distributions over $\mathbb{U}_L$ and $\mathbb{U}_R$, respectively. Since $U_L$ and $U_R$ are independent, from the definition of $dU$ and from Eq. (11), we get that $dU = dU_L\, dU_R$.
Then by substituting Eq. (D2) into Eq. (D3) and using Eq. (A7), we get [...], where [...]. Next we substitute in $A_R(\rho^{\otimes 2})$ to give [...], where in the second line we use the explicit definition of $\mathrm{Var}_R\, \partial_k C$, the variance in the partial derivative of the cost when $\mathbb{U}_R$ forms a 2-design, i.e., [...]. Rearranging, we are left with [...], which on using the triangle inequality followed by the Cauchy-Schwarz inequality reduces to [...]. The term $\|X_{Lk}^{\otimes 2}\|_2$ can be bounded as follows. First we note that $X_{Lk}^\dagger = -X_{Lk}$, which implies that [...]. Let $A = V_k$ and $B = U_L^\dagger H U_L$. Since A and B are Hermitian, from the triangle inequality and the Cauchy-Schwarz inequality, we get [...] (D11). Therefore, we find that [...]. Hence the bound takes the form [...], which completes the proof.
Extension to generalized cost. This result can be further extended to cost functions of the following form [...], for which the derivative with respect to the parameter $\theta_k$ can be written as [...]. Therefore, from Eq. (D7) it follows that [...], where $X^m_{Lk}$ is defined in Eq. (D5) with $H = H_m$. After substituting $A_R(\rho_j \otimes \rho_k)$, we get [...], which implies that [...], where we used steps similar to those used in deriving Eqs. (D10)-(D13).

2. Bound in Eq. (15).
Substituting Eq. (D2) into Eq. (D3), and using Eq. (A7) and the cyclicity of the trace operation, we find that [...]. The rest of the derivation proceeds in the same manner as for the bound in Eq. (14).
Extension to generalized cost. Similar to Eq. (D21), the bound in Eq. (15) can be extended to cost functions of the form in Eq. (D14). In particular, we find that [...].

3. Bound in Eq. (16).
To derive Eq. (16) we start by substituting Eq. (D2) into Eq. (D3), and using Eq. (A7) and the cyclicity of the trace operation, to find that [...], where we have introduced the shorthand $\rho_R := U_R \rho U_R^\dagger$ and $H_L := U_L^\dagger H U_L$. Next we substitute in $A_L(H^{\otimes 2})$ and $A_R(\rho^{\otimes 2})$ to find that the variance is given by [...] (D25). Here we defined [...] for $x = L$ and $x = R$, and where $\omega_R = \rho$ and $\omega_L = H$. The integrals $I_1$ and $I_2$ are given by [...] (D27), with $\tilde{\rho} = U \rho U^\dagger$ and $\tilde{H} = U^\dagger H U$. The integrals $I_1$ and $I_2$ can be evaluated using Eq. (A8) as follows: [...] (D29). Using Cauchy-Schwarz this reduces to [...], where we have used $\|W\|_2 = d$. Finally, by expanding $\|Z_{xk}\|_2$, using the triangle inequality and the fact that $V^2$ [...] for $x = L$ and $x = R$, and where $\omega_R = \rho$ and $\omega_L = H$. Thus we are left with [...].
Extension to generalized cost. Similar to Eqs. (D21) and (D23), the bound in Eq. (16) can be extended to cost functions of the form in Eq. (D14). In particular, we find that [...].
Here we derive the bounds in Eqs. (19)-(21), in which the expressibility is quantified in terms of the diamond norm. This is a natural alternative way of formulating the bounds, since the diamond norm is an operationally meaningful measure of the distinguishability of two quantum operations that is often used to define ε-approximate t-designs.
To derive Eq. (19) we start with Eq. (D9) and invoke Hölder's inequality as follows: [...]. The term $\|X_{Lk}^{\otimes 2}\|_\infty$ can now be bounded as follows. Given that $X_{Lk}^\dagger = -X_{Lk}$, it follows from the unitary invariance and sub-multiplicativity of the infinity norm that [...]. We additionally note that $\|\mathcal{E}(X)\|_1 \leq \|X\|_1 \|\mathcal{E}\|_\diamond$ for any channel $\mathcal{E}$ and operator X, therefore [...]. Thus we are now left with [...]. The derivation of Eq. (20) is entirely analogous.
To derive Eq. (21) we start with Eq. (D29) and again use Hölder's inequality in terms of the infinity and one norm to find [...], where we have used $\|W\|_\infty = 1$. Finally, by expanding $\|Z_{xk}\|_\infty$, using the triangle inequality and the fact that [...] for $x = L$ and $x = R$. We additionally note that $\|\mathcal{E}(X)\|_1 \leq \|X\|_1 \|\mathcal{E}\|_\diamond$ for any channel $\mathcal{E}$ and operator X, therefore [...]. Thus we are left with [...].

E. Variance in partial derivative for exact 2-designs
In this Appendix, we provide the explicit expressions and the derivation of the variance in the partial derivative for a random layered parametrized ansatz of the form Eqs. (2) and (11)- (12), and the generic cost defined in Eq. (1). These quantities have been investigated in [27]; however, only the highest order terms in n were given. Here we provide higher order terms for completeness.
Explicit expressions. Let us denote the variance in the partial derivative of the cost when just $\mathbb{U}_R$, just $\mathbb{U}_L$, and both $\mathbb{U}_R$ and $\mathbb{U}_L$ form 2-designs as $\mathrm{Var}_R\, \partial_k C$, $\mathrm{Var}_L\, \partial_k C$, and $\mathrm{Var}_{R,L}\, \partial_k C$, respectively. These variances are given by [...].
Derivation. From Eq. (D2), we have [...]. Since the cost gradient is unbiased, as in Eq. (8), the variance in the partial derivative is given by [...]. Then $\mathrm{Var}_R\, \partial_k C$, $\mathrm{Var}_L\, \partial_k C$, and $\mathrm{Var}_{R,L}\, \partial_k C$ can be calculated by performing the integration in Eq. (E6) over $U_R$, $U_L$, and both $U_R$ and $U_L$, respectively.
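The kind of Haar integral involved here can be checked numerically. The sketch below rests on stated assumptions rather than the paper's explicit expressions: it takes a pure input state and a traceless Hermitian operator X built as i[V, H] from random Hermitian stand-ins V and H, and compares the sampled variance of the expectation value of X in a Haar-random state with the standard second-moment result Tr[X^2] / (d(d+1)). All function names are hypothetical.

```python
import numpy as np

def haar_state(d, rng):
    """Haar-random pure state: a normalized complex Gaussian vector."""
    v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    return v / np.linalg.norm(v)

def commutator_observable(d, rng):
    """Build X = i[V, H] for random Hermitian V and H (illustrative stand-ins)."""
    def random_hermitian():
        a = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
        return (a + a.conj().T) / 2.0
    v, h = random_hermitian(), random_hermitian()
    return 1j * (v @ h - h @ v)  # Hermitian and traceless by construction

def sampled_variance(x, samples, rng):
    """Variance of <psi|X|psi> over Haar-random states psi."""
    d = x.shape[0]
    vals = np.empty(samples)
    for s in range(samples):
        psi = haar_state(d, rng)
        vals[s] = np.real(np.vdot(psi, x @ psi))
    return vals.var()

def haar_variance(x):
    """Exact Haar value for traceless Hermitian X: Tr[X^2] / (d (d + 1))."""
    d = x.shape[0]
    return np.real(np.trace(x @ x)) / (d * (d + 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = commutator_observable(4, rng)
    print(sampled_variance(x, 20000, rng), haar_variance(x))
```

Since d = 2^n, the 1/(d(d+1)) prefactor is the origin of the exponential suppression with the number of qubits in this setting.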
Integrating over only $U_R$ gives [...], where the first equality follows from Eq. (A6) and the second equality follows from the fact that the trace of a commutator is always zero. From the cyclicity of the trace operation and arguments similar to Eqs. (E7) and (E8), we get [...]. In order to calculate $\mathrm{Var}_{R,L}\, \partial_k C$, we note that $\mathrm{Tr}([V_k, U_L^\dagger H U_L]^2)$ in Eq. (E9) can be written as [...]. The integral of the first term over $U_L$ in Eq. (E11) can be calculated using Eq. (A5) as follows: [...]. The integral of the second term in Eq. (E11) can be calculated using Eq. (A4) as follows: [...]. Finally, after combining everything we get [...].

In Ref. [33] it was shown that for ansätze where the reduced state on the measured qubits obeys a volume law, typical local cost function values concentrate exponentially fast in n to their mean. This result was complemented by a proof that for ansätze that form 2-designs, i.e., maximally expressive ansätze, local costs concentrate exponentially fast to a fixed value. Here we show that this proof may be generalised to non-perfectly expressive ansätze.
Specifically, we show that for a k-local cost $C_k$ we have that [...]. Here χ is an expressibility-dependent correction defined as [...], where, as previously, W is the subsystem permutation operator.
Proof. The start of the proof is identical to that in Ref. [33]: [...],
where $\sigma = |0\rangle\langle 0| - \mathbb{1}/d$. The first inequality follows from Hölder's inequality. For the second inequality, we used the relation between the trace norm and the Hilbert-Schmidt norm and invoked Jensen's inequality. We used k to denote qubits that are not measured for defining the cost function $C_k$.

G. Numerically studying the correlations between expressibility and cost partial derivatives
In this Appendix we present numerical results on the correlations between the cost gradient and expressibility. Specifically, we consider the layered parametrized ansatz detailed in Section III B of the main text and plot the variance in the PQC gradients as a function of its expressibility.
We can calculate the expressibility measures $\varepsilon^{\rho}_{R}$ and $\varepsilon^{H}_{L}$ via their reformulation in terms of the state and Hamiltonian dependent frame potentials [...]. [...] it follows that $\varepsilon^{\rho}_{R}$ ($\varepsilon^{H}_{L}$) may also be exponentially small. We therefore find the ratio of the true frame potential to the Haar frame potential more insightful to plot. That is, we consider the ratios [...]. The larger these ratios, the more inexpressive the ansatz, with the ratios tending to 1 for maximally expressive ansätze (exact 2-designs).
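To give a feel for such ratios, the sketch below estimates a frame potential by Monte Carlo for a single qubit. It assumes the state-based form F = E_{U,V} |<0|U^dagger V|0>|^4 (the t = 2 case), for which the Haar value is 2/(d(d+1)); the exact expressions used in this appendix are not reproduced here. The restricted ensemble of RY rotations and all function names are hypothetical illustrations, not the ansatz of the main text.

```python
import numpy as np

def haar_states(d, samples, rng):
    """Rows are Haar-random pure states U|0> (normalized complex Gaussians)."""
    v = rng.standard_normal((samples, d)) + 1j * rng.standard_normal((samples, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def ry_states(samples, r, rng):
    """Single-qubit states RY(theta)|0> with theta uniform in [0, 2*pi*r]."""
    theta = 2.0 * np.pi * r * rng.random(samples)
    return np.stack([np.cos(theta / 2.0), np.sin(theta / 2.0)], axis=1).astype(complex)

def frame_potential(states_u, states_v):
    """Estimate E |<0|U^dag V|0>|^4 from independent pairs of sampled states."""
    overlaps = np.sum(states_u.conj() * states_v, axis=1)
    return np.mean(np.abs(overlaps) ** 4)

def haar_frame_potential(d):
    """Haar value of the t = 2 state frame potential."""
    return 2.0 / (d * (d + 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50000
    f_haar = haar_frame_potential(2)
    for r in (1.0, 0.1):
        f = frame_potential(ry_states(n, r, rng), ry_states(n, r, rng))
        print(r, f / f_haar)
```

In this toy setting the ratio grows as the angle range shrinks, matching the statement that larger ratios correspond to less expressive ensembles, while a Haar-random ensemble gives a ratio close to 1.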
In Fig. 7 and Fig. 8 we plot the variance in the partial derivative as a function of [...]. In line with Section III B of the main text, we focus on three different ways of tuning the expressibility of an ansatz, namely decreasing the depth of the circuits, correlating circuit parameters, and restricting the direction of rotations.
To numerically quantify the degree of correlation between the variance in the partial derivative of the cost and the expressibility, we include in Fig. 7 and Fig. 8 the Spearman correlation coefficient [69] and its corresponding p-value. Overall we find a clear correlation between partial derivatives of the cost and expressibility, with the variance in the derivatives increasing with increasing frame potential ratio. Specifically, combining all the different ways of tuning the expressibility, the Spearman coefficient for the correlation between the variance in the partial derivative of the cost and the state-dependent frame potential ratio was found to be 0.78 with a p-value of 1.19 × 10^−7. Similarly, the coefficient for the correlation with the Hamiltonian-dependent frame potential ratio was 0.80 with a p-value of 1.18 × 10^−7.
It is noteworthy that the Hamiltonian-dependent frame potential captures the effect of locality on cost gradients as the circuit depth is tuned. As observed in Section III B, increasing the depth of the circuit reduces cost partial derivatives for a local cost but not a global cost. The state-dependent frame potential cannot capture this effect since it is independent of the choice of measurement operator H and therefore necessarily independent of the locality of H. Conversely, while the Hamiltonian-dependent frame potential for a local cost decreases with increasing depth, in line with the decreasing variance in partial derivatives, the Hamiltonian-dependent frame potential for the global cost is effectively constant (even as the depth of the circuit substantially increases), reflecting the effectively constant variance in partial derivatives.
Nonetheless, the correlation between the variance in the cost partial derivative and the expressibility is not perfect, as is clear, for example, from Fig. 8(F). This is entirely compatible with our analytical bounds, which are upper bounds and therefore do not enforce perfect correlation between the variance in the partial derivative and the expressibility. Thus, while Fig. 7 and Fig. 8 demonstrate a clear correlation between the variance in the partial derivative of the cost and expressibility, further work is required to understand the intricacies of this correlation.
FIG. 8. Correlations between cost partial derivative and the Hamiltonian-dependent frame potential. The setting here is entirely equivalent to that described in Fig. 7; however, here we plot the variance in the partial derivative as a function of the ratio of Hamiltonian-dependent frame potentials [...]