Avoiding barren plateaus using classical shadows

Variational quantum algorithms are promising algorithms for achieving quantum advantage on near-term devices. The quantum hardware is used to implement a variational wave function and measure observables, whereas the classical computer is used to store and update the variational parameters. The optimization landscape of expressive variational ans\"atze is however dominated by large regions in parameter space, known as barren plateaus, with vanishing gradients which prevents efficient optimization. In this work we propose a general algorithm to avoid barren plateaus in the initialization and throughout the optimization. To this end we define a notion of weak barren plateaus (WBP) based on the entropies of local reduced density matrices. The presence of WBPs can be efficiently quantified using recently introduced shadow tomography of the quantum state with a classical computer. We demonstrate that avoidance of WBPs suffices to ensure sizable gradients in the initialization. In addition, we demonstrate that decreasing the gradient step size, guided by the entropies allows to avoid WBPs during the optimization process. This paves the way for efficient barren plateau free optimization on near-term devices.


I. INTRODUCTION
In recent years the field of quantum computation has seen rapid growth fueled by the arrival of the first generation of quantum computers, dubbed noisy intermediatescale quantum devices (NISQ) [1]. The NISQ era is characterized by quantum computers with a small number of qubits and limited control. The number of coherent operations that can be performed is small and the implementation of famous algorithms with proven quantum speedups, such as Shor's algorithm [2], remains out of reach. To make use of the current generation of quantum computers, the so-called variational hybrid approach [3] was proposed. The idea is to use the quantum computer in a feedback loop with a classical computer, where it implements a variational wave function that is measured to compute the value of the so-called cost function. This information is then fed into a classical computer where it is processed and the variational wave function is subsequently updated aiming to find a minimum of the cost function, which provides an (approximate) solution to the computationally hard problem. The variational hybrid approach has seen a wide range of proof-of-concept applications on NISQ devices ranging from quantum chemistry [4,5] to quantum optimization [6,7] and quantum machine learning [8,9].
Despite the large number of suggested applications, the variational approach encountered also a number of obstacles, that have to be overcome for the future success of the method. In particular, the infamous emergence of barren plateaus (BPs) implies that expressive variational ansätze tend to be exponentially hard to optimize [10]. * Equal contribution; stefan.sack@ist.ac.at † Equal contribution; raimel.medina@ist.ac.at The main obstacle on the way to optimization lies in the fact that gradients of the cost function are on average zero and deviations vanish exponentially in system size, thus precluding any potential quantum advantage. Moreover, it has been shown that the classical optimization problem is generally NP-hard and plagued with many local minima [11]. The problem of BPs attracted significant attention, and numerous approaches were proposed in the literature. In particular, the early research focused on avoidance of BP at the initialization stage of variational algorithms [12][13][14][15][16]. In a different direction, the relation between occurrence of BPs and the structure of the cost function was studied [17,18]. Also notions of so-called entanglement-induced [19] and noise-induced [20] BPs were introduced. The relation between BPs and entanglement has lead to various proposals that suggest controlling entanglement to mitigate BPs [21][22][23][24]. However, measuring entanglement is hard, therefore making these approaches impractical on a real quantum device.
In this work we introduce the notion of weak barren plateaus (WBPs), in order to diagnose and avoid BPs in variational quantum optimization. WBPs emerge when the entanglement of a local subsystem exceeds a certain threshold identified by the entanglement of a fully scrambled state. In contrast to BPs, WBPs can be efficiently diagnosed using the few-body density matrices and we show that their absence is a sufficient condition for avoiding BPs. Based on the notion of WBPs, we propose an algorithm that can be readily implemented on available NISQ devices. The algorithm employs classical shadow estimation [25] during the optimization process in order to efficiently estimate the expectation value of the cost function, its gradients, and the second Rényi entropy of small subsystems. The tracking of the second Rényi entropy enabled by the classical shadows protocol allows arXiv:2201.08194v2 [quant-ph] 30 Jun 2022 for an efficient diagnosis of the WBP both at the initialization step and during the optimization process of variational parameters. If the algorithm encounters a WBP, as witnessed by a certain subregion having a sufficiently large Rényi entropy, the algorithm restarts the optimization process with a decreased value of the update step (controlled by the so-called learning rate). We support the proposed procedure by rigorous results and numerical simulations. The structure of the paper is as follows.
In Sec. II we introduce the theoretical framework and present our main results. In Sec. II A we introduce the framework of variational quantum eigensolvers (VQEs). Sec. II B introduces the phenomenon of BPs, which dramatically hinders the performance of VQEs. In Sec. II C we demonstrate WBPs to be a precursor to BPs. We explain why and how WBPs can be efficiently diagnosed in experiments and contrast this with much harder task of detecting BPs. Finally we propose a modification to the VQE algorithm, which allows prevention of the occurance of BPs by avoiding WBPs.
In Sec. III we present a bound for the expectation value of the second Rényi entropy in quantum circuits drawn from a unitary ensembles forming a 2-design. This bound allows us to use the second Rényi entropy, which is much easier to estimate, instead of the entanglement entropy. In Sec. III A we provide a formal definition of WBPs according to the value of the second Rényi entropy of the subsystem and prove that the occurrence of a BP implies the occurrence of a WBP. From this argument it follows that the absence of a WBP precludes the occurrence of a BP. In addition, we provide an upper bound (whose proof is found in Appendix A) for the measurement budget require in order to estimate a WBP using classical shadows. Finally, in Sec. III B we demonstrate numerically how the avoidance of WBPs precludes the presence of a BP using the popular BP-free small-angle initialization [15,26].
In Sec. IV, we explore how BPs and WBPs emerge at different stages in the VQE optimization and perform a systematic performance analysis. Next, in Sec. IV A we explore the relation of the learning rate and entropy growth for a single update of the VQE algorithm. We analytically and numerically illustrate how a large learning rate leads to an uncontrolled growth in subsystem entropies, essentially driving optimization to a WBP region. In Sec. IV B we explore the performance of the WBP-free VQE algorithm in different settings for the Heisenberg model on a chain. Finally, in Sec. IV C, we show that our approach allows for the efficient convergence to both, area-and volume-law entangled ground states and compare it to layerwise optimization [13], which is a popular heuristic for BP avoidance. This is illustrated using the Heisenberg model on a random 3-regular graph, additional results for the Sachdev-Ye-Kitaev (SYK) model can be found in the Appendix E which exhibits a nearly maximally entangled ground state.
Finally, in Sec. V we summarize our results, discuss their implications, and outline open questions.

II. AVOIDING BARREN PLATEAUS IN VARIATIONAL QUANTUM OPTIMIZATION
In this section we first introduce the framework of VQEs, i.e., the unitary ensemble, the cost functions, and the optimization algorithm, and discuss the BP problem. After this, we present our main result -a specific modification of the VQE that avoids the issue of BPs.

A. Variational quantum eigensolver
The aim of the VQE, initially introduced by Peruzzo et al., is to approximate the ground state |GS of a Hamiltonian H with a variational wave function |ψ(θ) . A quantum computer is used to prepare the variational function via the action of a set of unitary gates, |ψ(θ) = U (θ) |ψ 0 , where |ψ 0 is the initial state that is typically assumed to be a product state. The variational parameters are then iteratively updated to minimize the expectation value of the Hamiltonian, also-called cost function We consider a unitary circuit U (θ) of the form of the so-called "hardware-efficient" ansatz [4] where θ i l ∈ [−π, π) are pN variational angles, concisely denoted as θ. In this expression the product goes over spatial dimension i = 1, . . . , N that labels individual qubits and "time dimension", l = 1, . . . , p with p specifying a number of layers, see Fig. 1 (a). We choose the single-qubit gates to be rotations R i l (θ i l ) = exp − i 2 θ i l G l,i with random directions given by G l,i ∈ {σ x , σ y , σ y }. W l is an entangling layer that consists of two-qubit entangling gates represented by nearest-neighbor controlled-Z (CZ) gates with periodic boundary conditions, see Fig. 1 (a) for an illustration.
We focus our study on k-local Hamiltonians H, defined as sum of terms each containing at most k Pauli matrices. We take k to be finite and fixed, while the number of qubits N k. A particular example of a 2-local Hamiltonian from the many-body physics is provided by the Heisenberg (XXX) model subject to a magnetic field where V G refers to the vertex set of the graph G and, couplings are fixed J = h z = 1. In our simulations we consider two different graphs: a ring corresponding to a one-dimensional (1D) chain with periodic boundary condition, and a random 3-regular graph. The U (1) symmetry related to the conservation of the z component of the Random single qubit Clifford unitary Figure 1. (a) Illustration of the variational quantum circuit U (θ) |0 that is considered in the main text followed by the shadow tomography scheme [25]. spin under the action of H, as well as translational invariance present for chains with periodic boundary condition, can be explored to decrease the space of parameters by using a suitable gate set respecting this symmetry. However, for the sake of generality we focus on the hardwareefficient unitary ansatz defined in Eq. (1). Obtaining the energy expectation value E(θ) = ψ(θ)| H |ψ(θ) requires measuring a subset or all qubits in the circuit as we schematically show in Fig. 1 (a). For our example of a 2-local Hamiltonian on the 1D chain, the required measurements include the value of the σ z operator on all sites along with the σ a i σ a i+1 expectation values of all i = 1, . . . N (periodic boundary condition is assumed, so that bits 1 and N + 1 are identified) and a = x, y, z. Finding the optimal parameters θ requires minimization of the Hamiltonian expectation value E(θ ) = min θ E(θ) performed by a classical computer.
There is a plethora of sophisticated classical optimization algorithms that were applied to this minimization problem [28][29][30][31]. We use the plain gradient-descent (GD) algorithm due to its simplicity, which makes analytical considerations easier. A GD update step is given by where η is the learning rate, which controls the update magnitude. This update step is repeated until convergence of E(θ), which results from finding a (local) minimum of E(θ). The resulting VQE algorithm is shown schematically in Fig. 1 (b) by solid lines. Following the initialization of the variational angles θ, that may be chosen to be real random numbers, the quantum computer is used to prepare the variational state and provide quantum measurement results. The classical computer uses the measurements to estimate the value of the cost-function and its gradient, and performs an update of the variational parameters controlled by the learning rate η.

B. Barren plateaus and entanglement
Whilst the VQE described above is a promising framework for near-term quantum computing due to its modest hardware requirements, its performance may be ruined by the issue of barren plateaus [10,15,17]. Specifically, the BPs are defined as regions in the space of variational parameters where the variance of the cost function gradient (and consequently its typical value) vanishes exponentially in the number of qubits [10]: McClean et al. were among the first to theoretically investigate BPs. They showed that the appearance of a BP can be related to the circuit matching the Haar random distribution up to the second moment. More precisely, they showed that BPs are a consequence of the unitary ensemble E ∼ {U (θ)} θ forming a so-called 2design [10] (see Appendix B for details and the definition of a t-design). To understand the different circuit depth at which BPs are encountered, the authors in Ref. [17] introduced the concept of cost-function-dependent BPs. In particular, they argued that the emergence of BP occurs at different circuit depths, depending on the nature of the cost function.
In contrast, for a so-called global cost function, exemplified by the fidelity, Ref. [17] found that BPs already occur at very modest circuit depths p ∼ O(1). The emergence of BP for the fidelity is naturally related to "orthogonality catastrophe" in many-body physics: even a small global unitary rotation applied to the many-body wave function results in it becoming nearly orthogonal to itself. In terms of fidelity, this implies that it vanishes exponentially in the number of qubits. Moreover, most global state features -such as expectation values of general operators, fidelities with general states and global purities -cannot be efficiently accessed on NISQ devices, and are therefore not practical from an algorithmic point of view [25,[32][33][34]. Therefore, in what follows we do not consider the global cost functions and corresponding BPs.
Local cost functions, that are the focus of the present work are characterized by a later onset of BPs. Specifically, for a k-local cost function where k is fixed, the BPs will occur for circuit depth p ∼ O(poly(N )) that increases polynomially in system size [10,17]. In other words, for a large enough p the VQE algorithm will also suffer from a BP already at the very first step of the GD optimization, provided random choice of variational angles θ. We also note that gradient-free optimization strategies do not circumvent the BP problem since the optimization landscape is inherently flat [35].
The potential emergence of BPs at the initialization stage of the VQE and other algorithms spurred the investigation of different initializations strategies that avoid BPs. Until now, several BP-free initializations were considered in the literature. Ref. [12] suggests to initialize the circuit with blocks of identities, Ref. [13] suggests to optimize the ansatz layer by layer, and Ref. [14] suggests to use a matrix product state ansatz that is optimized by a separate algorithm [36] and map that to a quantum circuit. In this work we will focus on small single-qubit rotation as suggested in Ref. [15].
More recently, it was observed that the entanglement entropy defined as a trace of the reduced density matrix, S = − tr ρ A ln ρ A (where ρ A = tr B ρ is the reduced density matrix where A is the subset of qubits that are measured and B is the rest of the system) is another source for the occurrence of BPs [19]. The community has subsequently dubbed this kind of BP, entanglementinduced BP [19,21,23,24]. In this work, we will however show that entanglement-induced BPs and BPs for local cost functions, are in fact one and the same. Avoiding entanglement-induced BPs is equivalent to avoiding BPs for local cost functions, the details are presented in Sec. III.
Experimentally probing a BP is a hard task. The estimation of the exponentially small gradient in Eq. (4) requires a number of measurements that is exponential in the number of qubits, and therefore invalidates any potential quantum speedup. Moreover, small values of gradient encountered in BP have to be distinguished from the case when gradient vanishes due to convergence to a local minimum. Experimentally diagnosing BPs via entanglement is also impractical. For example, quantum circuits that implement 2-design and thus lead to BPs for local cost functions are characterized by typical volume-law entanglement that approaches nearly maxi-mal values. Checking volume-law entanglement scaling on any device is a formidable challenge.
In the process of variational quantum optimization, the majority of approaches to mitigate BPs apply to the initialization stage [12,37,38] and not during the optimization. In Sec. IV, we illustrate the importance of BP mitigation during the optimization. This motivates the need to devise a BP mitigation strategy for the initialization and during the optimization procedure that is efficient. This procedure is discussed in the algorithm proposed below.

C. Weak barren plateaus and improved algorithm
In order to devise an efficient algorithm for BP-free initialization and optimization of the VQE we introduce the notion of WBPs. Specifically, for a Hamiltonian that is k-local, we define the WBP as the point where the second Rényi entropy S 2 = − ln tr ρ 2 A of any subregion of k-qubits satisfies S 2 ≥ αS Page (k, N ), where the Page entropy in the limit k N corresponds to the (nearly) maximal possible entanglement of subregion A, where we explicitly used that the Hilbert space dimension of region A is 2 k and its complement B has Hilbert space dimension 2 N −k . The naive choice for the parameter α is α = 1. Given some a priori knowledge of the entanglement structure of the target state |GS , the choice can however be more informed to help avoid large entanglement local minima, see Sec. III. The notion of WBP is practical since it is defined by k-body density matrices, being much easier to access on a real NISQ device. The fact that the prevention of a WBP is sufficient for avoiding the BP may be understood by the intuition from quantum many-body dynamics and the process of thermalization or scrambling of quantum information. In the thermalization process the small subsystems are first to become strongly entangled, as is witnessed by the proximity of their density matrix to the infinite temperature density matrix. This intuition suggests that it is enough to keep in check the density matrices of small subsets of qubits. If their entanglement or other properties are far away from thermal, the system overall is still far away from the BP.
Practically, the WBP can be diagnosed using the shadow tomography scheme [25]. This scheme enables an efficient way of representing a classical snapshot of a quantum wave function on a classical computer. In essence, the shadow tomography replaces the measurements performed in the computational basis with a more general measurements, that turns out to be sufficient for reconstructing linear and non-linear function of the state, such as expectation values of few-body observables and second Rényi entropy of few-body reduced density matrices respectively.
Relying on the shadow tomography, we propose the following modification of the VQE shown by dashed lines in Figure 1 (b). In essence, we suggest to use the tomography to simultaneously measure the cost function value and the k-body second Rényi entropy. For the derivative we require an additional 2pN tomographies (two for each parameter) to compute the gradient exactly using the parameter shift rule [39,40], a detailed derivation of the computational cost for each operation is presented in Appendix A. Access to the second Rényi entropy allows prevention of the appearance of WBPs not only at the initialization step, but throughout the optimization cycle. The explicit algorithm works as follows.
Algorithm 1 WBP-free optimization with classical shadows 1: Choose α, default is α = 1 see Sec. III A for details 2: Choose θ such that S2 < αS Page (k, N ) 3: Choose learning rate η 4: repeat see Appendix A for details

5:
Obtain classical shadowsρ (t) (θ) 6: else 10: Start again with smaller η ← η 11: If a WBP is diagnosed at the initialization, one may have to take a different initial value of the variational angles or change the initialization ensemble. These aspects are discussed in detail in Sec. III. In addition, the WBP can occur in the optimization loop. This can be mitigated by keeping track of the second Rényi entropies in the optimization process. If the WBP condition is fulfilled, one must restart the algorithm with a smaller learning rate. In Sec. IV we discuss the optimization process in greater details. In particular, we show how the learning rate is related to the potential change in entanglement entropy, which implies that a smaller learning rate is generally better at avoiding WBPs.

A. Definition and relation to barren plateaus
As mentioned in the above, BPs for local cost functions are a consequence of the unitary ensemble E ∼ {U (θ)} θ forming a 2-design [10,17], which leads to an exponentially vanishing gradient variance, i.e., a BP. What is important to note is that the exponential decay is simply a witness of the emergence of a 2-design. Another, equivalent witness is the second Rényi entropy, where we have the following.
These results are known in the literature, and in the context of random quantum circuits, can be found in Refs. [41][42][43]. However, for completeness we also provide a proof in Appendix C.
The theorem above implies that a large amount of entanglement naturally follows from the similarity between the considered circuit and a random unitary (2-design). Such similarity also gives rise to the vanishing variance of local cost function gradients that define BPs. Therefore, so-called entanglement-induced BPs [19] and BPs for local cost functions are the same. In fact, entanglement provides an intuitive picture for the emergence of BPs and its circuit depth dependence. Every entangling layer in the circuit typically increases entanglement of the resulting wave function, until it saturates to its maximal value for any subregion of k-qubits at a circuit depth p ∼ O(poly(N )). If the second Rényi entropy for half of the subsystem k = N/2 has saturated, it has saturated for all smaller subsystem sizes and is thus a sufficient check for a BP. Computing the second Rényi is however typically exponentially hard in subsystem size on NISQ devices (for single-copy access this was recently proven in Ref. [33,34]). It is therefore only practical to check a small subregion where k is small and independent of system size.
The above considerations naturally lead us to introduce the notion of WBPs as a modification of the BP that is computationally efficient to diagnose on NISQ devices. More formally we have as follows.
Definition 2. (Weak barren plateaus) Let H be an Nqubit Hamiltonian, and A is a region containing k qubits. We define a weak barren plateau by the second Rényi entropy of the reduced density matrix This definition works for any k, however it is reasonable to use k that corresponds to the number of spins involved in interaction terms in the Hamiltonian H since it provides a natural length scale. Moreover, in such a case the reduced density matrix of subregion with k spins contains all necessary information needed to extract the expectation values of Hamiltonian terms localized inside this region.
While a WBP is a necessary condition for a BP, it is however not sufficient (which motivates the term weak ). From a practical perspective we are actually interested only in avoiding a BP. For this, WBPs provide a powerful tool, since the following holds. Corollary 2.1. If we find a particular subregion A such that ρ A does not satisfy the weak barren plateau condition, i.e. Definition 2, it is on average also not in a barren plateau where the variance is exponentially small.
Proof. This assertion immediately follows from negating Theorem 1.
The corollary above formalizes the intuition behind the dynamics of entanglement in a circuit: if the state restricted to the smaller subsystem has not scrambled, then neither has the state restricted to a larger subregion. In practice, using classical shadows we can efficiently check one subregion of size k with a total measurement budget where is a desired accuracy and δ is a failure probability (over the randomized measurement process). Parameters and δ do not depend on the number of qubits, whereas the factor tr ρ 2 A is upper bounded by one for weakly entangled states and can be as small as 2 −k when entanglement is large. Moreover, checking all size k subregions incurs an additional overhead of only k ln N . A derivation of this result is presented in Appendix A, see Eq. (A7). Provided that k is small and does not scale with system size, N , this can be efficiently implemented on NISQ devices.
If any of these subregions avoids the WBP condition, we are guaranteed to also avoid an actual BP. For simplicity, in the numerical results below we check for the WBP condition for a particular region containing the first k qubits, i.e., A = {1, · · · , k}.
This argument is also intuitive to see by considering a causal cone (blue region) that indicates the extent of the so-called scrambled region (i.e., extend of a subregion with entropy close to the maximal value) in the circuit, see Fig. 2 (a). Such a scrambled region grows with every consecutive entangling layer W l (see Eq. (1)). When this region extends beyond k qubits, the WBP is reached (left orange dashed line). Later, when the "scrambling lightcone" has extended to the full system, the BP is reached (right orange dashed line). Once the BP is reached all smaller regions are also fully entangled and will satisfy the WBP condition on average. Fig. 2 provides a numerical illustration for the Corollary 2.1 stated above. We use the hardware-efficient circuit, presented in Eq. (1), and compute the gradient variance and second Rényi entropy as a function of circuit depth p for different system sizes N . We fix |ψ 0 = |0 as the initial state, which is simply all qubits in the zero state. Panel (b) shows the exponential decay of the gradient variance that is usually used to diagnose a BP. Panel (c) shows the corresponding bipartite second Rényi entropy. We see that it indeed approaches the Page value (gray dashed line). The Page value is not fully reached since we are considering the second Rényi instead of the von Neumann entanglement entropy, this difference however becomes negligible once the subsystem size is decreased. This numerically illustrates that when the 2-design is reached both the gradient variance and bipartite second Rényi entropy have converged. In panel (d) we consider a smaller region of two qubits and see that the second Rényi for this region saturates to its maximal value at a significantly lower circuit depth. This illustrates the emergence of the WBP that precedes the onset of the BP after a few more entangling layers. Before the WBP is reached, gradients are well behaved and do not decrease exponentially with the system size.
Finally, we address the effects of the control parameter α, that enters in Definition 2 of the WBP. The naive choice is α = 1, which means that a WBP is reached if the subregion is maximally entangled with the rest of the system. However, in the case when some a priori knowledge about the entanglement properties of the target state |GS is available, it can be used to set a smaller value of α. If, for instance, the ground state is only weakly entangled, a choice of α 1 may be appropriate. In this way Algorithm 1 in Sec. II C can also help in avoiding convergence to highly entangled local minima. We discuss this in more detail in Sec. IV B.

B. Illustration of WBP-free initialization
In order to illustrate the notion of WBP in a more specific setting we apply it to the initialization process of the VQE. Specifically, we focus on the family of initializations that was proposed earlier in order to avoid the issue of BPs [15,26]. The one-parametric family of initializations restricts the single-qubit rotation angles from ansatz Eq. (1) as θ i l ∈ θ [−π, π), where θ ∈ [0, 1) is the control parameter. This strategy allows the onset of the BP to be delayed to arbitrary circuit depths by tuning θ accordingly.
Similarly, it allows the onset of WBPs to be delayed. Depending on the parameter θ one can afford a deeper circuit without encountering a WPB in the initialization when compared to the full parameter range ( θ = 1). It is straightforward to see that for θ = 0, the ansatz is WBP free for all circuit depths. Indeed, in the absence of the single-qubit rotations, the entangling gates in W l do not create any entanglement [since the CZ gates used in Eq. (1) are diagonal in the computational basis], leaving |0 invariant. Note that, for example, the identity block initialization, proposed by Grant et al. works in a similar way in that the unitary is constructed such that it also implements the identity and one is equally left with the zero state.
In Fig. 3 we numerically illustrate the influence of θ on the growth of entanglement and its relation to the gradient variance. Panel (a) illustrates the growth of the second Rényi entropy in the circuit for three different small-angle parameters θ and panel (b) shows the corresponding gradient variance. Outside of the WBP the gradient variance vanishes at most polynomially in system size N . This illustrates that the avoidance of a WBP is sufficient for avoiding a BP and thus allows for a simple strategy for constructing BP-free initializations. The encounter of BP in the variance of the gradient of the cost function is visible only for the case θ = 1, and it is preceded by the onset of a WBP. We use a system size of N = 16 for (a) and N = 8, · · · , 16 for (b), color intensity corresponds to system size, same as in Fig. 2. Data is averaged over 100 random instances, variance is for the local term σ z 1 σ z 2 .

IV. ENTANGLEMENT CONTROL DURING OPTIMIZATION
A. Bounding entanglement increase at a single optimization step In Sec. II we presented how the general VQE can be extended with minimal overhead to avoid WBPs in the optimization procedure. The learning rate, as presented in Algorithm 1, hereby plays a crucial role. A smaller learning rate, as observed in Fig. 1 (c)-(e) is more likely to avoid a WBP. To understand this phenomenological observation on more rigorous grounds, let us consider a sufficiently deep circuit (with a polynomial number of layers in system size), so that the optimization landscape is dominated by WBPs. Careful selection of the parameters allows for an initialization outside of a WBP. However, to remain in the WBP-free region, the optimization has to be performed in a controlled manner, such that the optimizer does not leave the region of low entanglement due to large learning rate and does not end in a WBP.
Since WBPs are defined in terms of the second Rényi entropy S 2 , we need to bound the change in S 2 between iteration steps t and t + 1. For practical purposes, we instead use the purity (tr ρ 2 A = e −S2 ). The change in purity is upper bounded by [44] tr ρ 2 Figure 4. We numerically illustrate the continuity bound Eq. (7) and its relation to the learning rate η for t = 0, i.e. at the beginning of the optimization schedule. This shows that one should be careful with the choice of the learning rate since a large learning rate leads to a big change in the trace distance and change in purity. We use a system size of N = 10 and a random circuit with circuit depth p = 100 and small qubit rotations ( θ = 0.05) to generate a BP-free initialization. Data is averaged over 500 random instances.
where T A (t) ≡ T (ρ A (t), ρ A (t + 1)) is the trace distance between the reduced density matrices at iteration steps t and t + 1, and we assume that region A has k qubits.
Assuming that the states at consecutive update steps of gradient descent are pertubatively close (see Appendix D for details), as measured by the trace distance, one can show that where F i,j (θ) = 4 Re[ ∂ i ψ|∂ j ψ − ∂ i ψ|ψ ψ|∂ j ψ ] is the quantum Fisher information matrix (QFIM) [45] and η is the learning rate. Inequalities (7)- (8) imply that the learning rate η can be used to limit the maximal possible change of the purity. 1 Provided that the change in purity is sufficiently small, the Taylor expansion can be used to argue that the corresponding change in the second Rényi entropy S 2 , related to the purity as e −S2 = tr ρ 2 A , also remains controlled. Therefore, the choice of an appropriately small learning rate can guarantee the avoidance of a WBP at t + 1, provided the absence of one at t.
To illustrate the bound numerically, we prepare an initialization outside of the WBP using a small angle parameter θ and compute the change in the purity tr ρ 2 A after one GD update step for different learning rates η. The results of this procedure for four different learning rates are shown in Fig. 4. We see that larger learning 1 A similar continuity bound which does not require the QFIM can be found in terms of the maximum operator norm of the gate generators. We acknowledge Johannes Jakob Meyer for this remark.
rates correspond to a bigger change in purity and are thus more prone to encounter a WBP. At the same time, all data points are below the theoretical bound. While up to the best of our knowledge the bound Eq. (7) is not proven to be tight, we observe that points corresponding to the extreme learning rates closely approach the theoretical line. Using Eq. (8), the bound can be efficiently approximated on NISQ hardware: the QFIM can be estimated efficiently on a quantum device using techniques suggested in Ref. [31] or Ref. [46] using classical shadows. For the computation of the gradient one can use the parameter shift rule [39,40] also with shadow tomography. The expression can thus be efficiently evaluated on a real device and used together with the continuity bound to estimate a suitable learning rate η. However, in practice this might not be needed and simply following Algorithm 1 could be more efficient and easier to implement.

B. Optimization performance with learning rate
Finally, we illustrate Algorithm 1 in practice. To this end we first prepare a WBP-free initial state using small qubit rotation angles and compare the performance of GD optimization with different learning rates. If we start with a large learning rate, η = 1, corresponding to red lines in Fig. 5 (a)-(c), we see that the energy expectation value in Fig. 5 (a) rapidly (within one or two update steps) converges to a value far away from the target ground state energy E GS . At the same time, panel (b) reveals that this can be attributed to an onset of a WBP, as the second Rényi entropy spikes up to the Page value. Finally, panel (c) shows that the gradient norm also is convergent, though at non-zero value. We attribute this to the fact that the system gets trapped in the WBP region.
As suggested by Algorithm 1, we thus decrease the learning rate to η = 0.1 and start again. This time a WBP is avoided, the algorithm however gets stuck in a local minimum with large entanglement entropy. In this instance a choice of parameter α that defines an onset of a WBP in Def. 2 being smaller than one may be beneficial. For instance, setting α = 0.5 could help avoiding the suboptimal local minima characterized by large entanglement, see gray dashed line in Fig. 5 (b). Note that the large gradient persistent after many iterations for the blue line in Fig. 5 (c) may also indicate that the learning rate is chosen too large for the width of the local minima.
Provided that our algorithm uses α = 0.5, the system would satisfy a WBP condition even for learning rate η = 0.1, forcing us to restart the algorithm with an even smaller learning rate. Setting η = 0.01, we see that the algorithm is now able to converge very close to the true ground state energy (violet line in Fig. 5 (a)-(c)). In particular, the norm of the gradient assumes the smallest value among all learning rates. We note, that the further decrease of the learning rate (i.e., to η = 0.001) degrades For η = 0.01 the algorithm avoids large entanglement region and gets a good approximation for the ground state. Finally, setting even smaller learning rate (green lines) degrades the performance. The normalized second Rényi entropy of the true ground state is S2/S Page (k, N ) ≈ 0.246. (c) Shows the corresponding gradient norm. A small gradient norm equally corresponds to the BP and the good local minima found with η = 0.01 and 0.001. We use a system size of N = 10, subsystem size k = 2, and a random circuit (see Eq. (1)) with circuit depth p = 100 and small qubit rotations ( θ = 0.05) to generate a BP-free initialization. Here we choose α = 0.5 indicated by the gray dashed line, see the last paragraph of Sec. III A for a discussion on the choice of α. Data is averaged over 100 random instances. the performance of GD. While WBPs are not encountered during the optimization process, the GD optimization converges slower within the considered number of iterations and to a larger energy expectation value. This highlights the fact that it is best to choose the highest possible learning rate, that still avoids a WBP. We speculate, that an optimization strategy that adapts the learning rate at each optimization step would give the best performance, though testing this assumption is beyond the scope of the present work.

C. Classical simulatability and performance comparison
Now that we have illustrated the procedure outlined in Algorithm 1 in detail, let us comment on the restrictions that our algorithm imposes, its relation to classical simulatability and finally compare our method with other common means for mitigating BPs.
To avoid WBPs and thus BPs we require that the second Rényi entropy of a small subregion is less than a fraction α of the Page value, where α ∈ (0, 1] and the default choice is α = 1. While this restriction does place a limitation on the entanglement generated by the circuit for a region of k qubits, it does not imply classical simulatabilty of the circuit. Indeed, it is the scaling of the entanglement entropy with system size that is important for classical simulatability of a quantum system. Only in the special case when the entanglement entropy of the quantum state scales poly-logarithmically with the number of qubits, we can simulate the states on a classical computer in polynomial time [47][48][49]. In contrast, the criteria for WBP, Def. 2 is generally consistent with volume-law entanglement as we illustrate below, thus allowing our algorithm to be applied to systems that cannot be efficiently simulated on a classical computer. Here we focus on two types of systems: namely systems where the ground state satisfies area law, which implies that the entanglement entropy of an arbitary bipartition of the state scales with the size of the boundary S(ρ A ) ∼ |∂A|, as well as volume law, which implies that it scales with the volume, S(ρ A ) ∼ |A| (see Ref. [50] for a review on these concepts). For area-law states in 1D the entanglement entropy is constant and therefore allows for an efficient classical representation using techniques such as matrix product states [51]. The 1D Heisenberg model, considered in the previous subsection, is an example for such a system.
The Heisenberg model, however, can be made hard to simulate classically by considering a random-graph geometry illustrated in Fig. 6 (a), instead of a 1D chain. This leads to nonlocal interactions and a volume-law entanglement scaling for a typical bipartite cut. Due to the non-local nature of the model we choose α = 1 since we have no prior knowledge on the entanglement properties of the ground state. We again use the small-angle initialization [15,26] to generate a BP-free initial state. We compare this with layerwise optimization [13], which is another common heuristic for avoiding BPs. There the circuit is initialized with a single layer, which is optimized, the circuit is then grown by one layer at a time and optimized while keeping the parameters in the previous layers constant. Fig. 6 (b)-(c) reveal that for the Heisenberg model on a graph layerwise optimization ends up in a WBP during the optimization for both learning rates that we considered. The small-angle initialization successfully avoids the WBP for both learning rates, however good convergence is only achieved with η = 0.01. This is similar to the situation encountered in the Heisenberg model in 1D, see Fig. 7, where a too large learning rate prevents convergence to the basin of attraction of the local minimum. Likewise to the case of 1D Heisenberg model, the fact that learning rate η = 0.1 does not lead to convergence to a minimum can be revealed through the norm of the gradient which stays large even after 500 iterations.
In addition to the Heisenberg model on the random graph, we also considered the SYK model [52] that features a volume-law entangled ground state [53]. In Appendix E we illustrate that our method is also successful in preventing the BP occurrence and results in finding the SYK ground state.

V. SUMMARY AND DISCUSSION
The main result of this work is the introduction of the concept of WBPs, which in essence provides an efficiently detectable version of BPs. In particular, we propose to use the classical shadows protocol to estimate the second Rényi entropy of small subregions that are independent of system size. If these subregions avoid nearly maximal entanglement -a condition sufficient for avoiding WBPs -the system also avoids conventional BPs. Building on this definition of the WBP, we proposed an algorithm that is capable of avoiding BPs on NISQ devices without requiring a computational overhead that scales exponentially in system size.
In order to illustrate the notion of WBPs and the proposed algorithm, we studied a particular BP-free initialization of the variational quantum eigensolver. Furthermore, we considered an optimization procedure that uses gradient descent. Phenomenologically, we observed that the encounter of a BP during the optimization crucially depends on the learning rate, which controls the parameter update magnitude between consecutive optimization steps. A smaller learning rate is less likely to lead to the encounter of a BP during the optimization. However, choosing the learning rate to be very small degrades the performance of GD. These results support the feasibility of the proposed algorithm for efficiently avoiding BPs on NISQ devices. While our results and numerical simulations are focused on VQEs, they readily extend to other variational hybrid algorithms, such as quantum machine learning [8,54,55], quantum optimization [6,56,57], or variational time evolution [58,59].
Although the issue of avoiding BPs at the circuit initialization is a subject of active research [12][13][14][15][16], the influence and role of BPs in the optimization process has received much less attention [60]. Our results indicate that entanglement, in addition to playing a crucial role for circumventing BPs at the launch of the VQE, is also important for achieving a good optimization per- . LW encounters a WBP for both learning rates that we consider (green star). In contrast, SA avoids the WBP for both learning rates. Good performance and further convergence in the local minimum is only achieved through a smaller learning rate of η = 0.01. We use a system size of N = 10 and a random circuit from Eq. (1) with circuit depth p = 100. Data is averaged over 100 random instances.
formance. In addition, our heuristic results in Sec. IV suggest that postselection based on the entanglement of small subregions may help to avoid low-quality local minima that are characterized by higher entanglement. Algorithm 1 allows for such postselection by appropriately tuning the value of α. Doing so, however, requires some prior knowledge about the entanglement structure of the target state. This may be inferred from the structure of the Hamiltonian (for instance, for a Hamiltonian that is diagonal in the computational basis, the eigenstates are product states with no entanglement), or by targeting small instances of the computational problem using exact diagonalization.
Beyond that, one could imagine an algorithm where the learning rate is not only adapted when a WBP is encountered, but dynamically adjusted at every step of the optimization process. This may allow for efficiently maneuvering complicated optimization landscapes by staying clear of highly entangled local minima. VQE, for instance, is known to have many local minima [11], but a systematic study of their entanglement structure, required for devising such dynamic entanglement post selection procedure, has yet to be done.
Another important question concerns the effect of noise, which has been suggested to be an additional source for the emergence of BPs [20]. Noise cannot be avoided on NISQ machines and has a profound impact on any near-term quantum algorithm, which is difficult to analyze analytically. Fortunately, none of the tools we propose are especially susceptible to noise corruption. In fact, both the classical shadow protocol and the estimation of observables and purities are stable with respect to the addition of a small but finite amount of noise, and there have even been some proposals for noise mitigation techniques [61,62].
Finally, we comment on the possibility of testing Algorithm 1 on a real NISQ device. While the shadows protocol can readily be implemented on near-term devices to diagnose WBPs, whether a variational circuit with enough entangling layers that lead to a BP can be realized on a NISQ device is not entirely clear at this stage. Nevertheless recent results of Ref. [63] observed convergence of the out-of-time correlators to zero, indicating that a 2-design might already have been reached. This implies that large entanglement, as present in a BP, could be realizable on available NISQ devices, and opens the door to experimental studies of the effect of entanglement on the optimization performance on current NISQ machines using the proposed shadows protocol.

ACKNOWLEDGMENTS
We thank Marco Cerezo, Zoe Holmes, and Nicholas Hunter-Jones for fruitful discussion and valuable feedback. We also acknowledge Adam Smith, Johannes Jakob Meyer, and Victor V. Albert for comments on the paper. The simulations were performed in the Julia programming language [64] using the Yao module [65].

APPENDIX Appendix A: Classical shadows and implementation details
Shadow tomography attempts to directly estimate interesting properties of an unknown state without performing full state tomography as an intermediate step.
Aaronson and Aaronson and Rothblum showcased that such a direct estimation protocol can be exponentially more efficient, both in terms of Hilbert space dimension (2 N in our case) and in the number of target properties (we use L to denote this cardinality). These techniques do, however, require copies of the underlying quantum state to be stored in parallel within a quantum memory and highly entangled gates to be performed on all copies simultaneously. This is too demanding for current and near-term quantum devices.
Huang et al. developed a more near-term friendly variant of this general idea known as prediction with classical shadows. Similar ideas have been independently proposed by Paini and Kalev and Morris and Dakić, respectively. As explained in detail below, the key idea is to sequentially generate state copies and perform randomly selected single-qubit Pauli measurements. Such measurements can be routinely implemented in current quantum hardware and enable the prediction of many (linear and polynomial) properties of the underlying quantum state. Importantly, the measurement budget (number of required measurements) still scales logarithmically in the number of target properties L, but it may scale exponentially in the support size k of these properties. This is not a problem for local features, like subsystem purities or terms in a quantum many-body Hamiltonian, but does prevent us from directly estimating global state features like fidelity estimation.
The general measurement budget that is required to simultaneously estimate L local observables using classical shadows, necessary for the energy expectation value estimation, is provided in Theorem 3. Typically the estimation of L observables would scale linearly in L (essentially every term is estimated individually). This is traded with a ln L dependence instead and an exponential dependence on the support k of the operators. The cost for estimating the subsystem purities and thus second Rényi entanglement entropies is provided in Eq. (A7) and is exponential in k (this dependence was recently proven to be unavoidable [34]). However since for the WBP check outlined in the main text k is small, this is generally an efficient operation. Lastly, the cost for estimating the gradients is given in Eq. (A9). The efficiency of using classical shadows to estimate the energy expectation value and gradients is system dependent (see Ref. [25] for the application of classical shadow tomography to the lattice Schwinger model). For the estimation of the purities, the shadow protocol, however, generally provides the most efficient technique currently available [70]. One possibility to circumvent these restrictions is to use a hybrid scheme where the energy and gradients are estimated with either classical shadows or the usual approach dependent on the structure of the Hamiltonian while the second Rényi entropies for the WBP check are always estimated using classical shadows.

Data acquisition via classical shadows
We use randomized single-qubit measurements to extract information about a variational N -qubit state represented by a density matrix To this end, we repeat the following procedure a total of T times. For 1 ≤ t ≤ T we carry out the following.
2. Select N single-qubit Pauli observables independently and uniformly at random.
3. Perform the associated N -qubit Pauli measurement (single shot) to obtain N classical bits (0 if we measure "spin down" and 1 if we measure "spin up").
4. Store N single-qubit "postmeasurement" states, |s , where an i th qubit measurement outcome, s i , can take six possible values denoted as |0 , |1 if qubit is measured in z basis, |+ and |− for x basis, and, finally, | + i and | − i for y basis. Here, |± = (|0 ± |1 ) / √ 2 denote Pauli-x matrix eigenstates and | ± i = (|0 ± i|1 ) / √ 2 are two Pauli-y eigenstates. In practice, this is achieved by applying random single-qubit Clifford gates that effectively implement a change of basis such that the usual z-basis measurement can be used, see Fig. 1 (a) for a visualization.

(Implicitly) Construct the N -qubit classical shadowρ
Repeating this procedure a total of T times provides us with T classical shadows ρ (1) (θ), . . . , ρ (T ) (θ). These are random matrices that are statistically independent (because they are constructed from independent quantum measurements). By construction, each classical shadow reproduces the true underlying state in expectation (over both the choice of Pauli observable and the observed spin direction): This approximation becomes exact in the limit T → ∞ of infinitely many measurement repetitions. But the main results in Refs. [25,68] highlight that convergence actually happens much more rapidly. This is, in particular, true for subsystem density matrices. The tensor product structure of classical shadows, Eq. (A1), plays nicely with taking partial traces. Let A ⊆ {1, . . . , N } be a collection of |A| = k qubits. Then, is a k qubit shadow that can be used to approximate the associated subsystem density matrix. More precisely, Eq. (A2) asserts (A4) which can (and should) form the basis of empirical averaging directly for the subsystem in question. Here is a mathematically rigorous result in this direction. In what follows, the range (or weight) of an observable is the number of qubits on which it acts nontrivially. For example coupling terms in the Heisenberg Hamiltonian (2) have range k = 2, while the external field terms have range k = 1.
We emphasize that it is not necessary to form global shadow approximations. If O l only acts nontrivially on subsystem A l ⊆ {1, . . . , N } (O l =Õ l ⊗ I ¬A l ), then tr (O lρ (θ)) = tr Ô lρA l . Theorem 3 is slightly stronger than a related result in Ref. [25] (it does not require median-of-means estimation). Conceptually similar results have been established in Refs. [71] and [72,73]. Notably, the authors of Ref. [74] pointed out to us that they provided a similar statement as in Theorem 3 in their work. We present a formal proof in Appendix A 5 below.

Estimating subsystem purities
Suppose we are interested of estimating a collection of multiple subsystem purities where A ⊆ {1, . . . , N } labels different subsystems of size |A| = k each. Then, we can use the corresponding subsystem shadows, Eq. (A3), to approximate each p A by empirical averaging: It is important that we restrict our averaging operation to distinct pairs of classical shadows (t = t ). This guarantees that the expectation values factorize, i.e.
where the last equality is due to Eq. (A3). Formula (A6) is an empirical average over all distinct shadow pairs contained in the data set. It converges to the true average , and the speed of convergence is governed by the variance. As data size T increases, this variance decays as see, e.g., Ref. [75,SM Eq. (12)]. In the large-T limit, this expression is dominated by the first term in parentheses, 4 × 2 k p 2 (θ)/T , and Chebyshev's inequality allows us to bound the probability of a large approximation error. For > 0, provided that the total number of measurements T is large enough to suppress the higher-order contribution in the variance bound (this is why we write ). In this regime, a measurement budget that scales as suppresses the probability of a sizable approximation error (≥ ) below δ. It is worthwhile to point out that this bound depends on the subsystem purity under consideration. Smaller purities are cheaper to estimate than large ones. It is also important to note that the accuracy parameter has to be small enough in order to accurately capture the purity in the WBP regime, which decays exponentially fast, but only with the subsystem size k.
The δ-dependence in Eq. (A7) can be further improved to ln(1/δ) by replacing simple empirical averaging in Eq. (A6) by median-of-means estimation [25]. Doing so would allow us to estimate all possible L = N k ≤ N k size-k subsystem purities with only a k ln N -overhead. Median-of-means estimation does, however, worsen the dependence on by a constant amount. Empirical studies conducted in Ref. [76] showcase that such a trade-off only becomes viable if one wishes to approximate polynomially many subsystem purities.

Estimating gradients
To perform the GD update step suggested in Algorithm 1 we require the knowledge of gradient ∇ θ E(θ), which consists of pN derivatives ∂ i,l E(θ). The derivative can naively be approximated using finite difference, though for variational single-qubit rotation gates, as used in the main text [see Eq. (1)], we can use the parametershift rule to compute the gradients exactly (up to finite sampling errors) [39,40]. The parameter-shift rule is given by where i labels the qubits and l cycles through all circuit layers, and e i,l is the unit vector. In order to approximate a single gradient, we need to estimate the difference of two energy expectation values E(θ + ) = ψ(θ + )|H|ψ(θ + ) with θ + = θ + (π/2)e i,l and E(θ − ) = ψ(θ − )|H|ψ(θ − ) with θ − = θ − (π/2)e i,l (we suppress i and l indices in θ ± for the sake of brevity). Typically, the Hamiltonian itself can be decomposed into a sum of L 'simple' terms: H = L l=1 h l , where often L can be proportional to the number of qubits, N . This allows expression of the gradient as a linear combination of 2L expectation values, each of which can be estimated by performing a collection of single-qubit Pauli measurements. If each term h l is supported on (at most) k-qubits, then Theorem 3 applies. Performing T ≈ 4 k ln(L/δ)/ 2 randomized Pauli measurements on state ρ(θ + ) and ρ(θ − ) each allows us to -approximate all 2L simple terms in Eq. (A8).
Unfortunately, approximation errors may accumulate when taking the sum over all 2L terms. Suppose that we obtain -accurate estimatorsÊ l (θ ± ) of contribution of the local Hamiltonian term to the energy E l (θ ± ) = ψ(θ ± )|h l |ψ(θ ± ) . A triangle inequality over all approx-imation errors then produces only This upper bound equals only if we rescale the accuracy of original approximation to /L. Inserting this rescaled accuracy into Theorem 3 produces an overall measurement cost of The number L of terms in the Hamiltonian typically scales (at least) linearly in the number of qubits N . This implies that the measurement budget, Eq. (A9), required to (conservatively) estimate gradients scales quadratically in the system size and thus is parametrically larger than the (conservative) cost of estimating purities of sizek subsystems, Eq. (A7). To obtain the full gradient ∇ θ E(θ) the procedure has to be repeated pN times since the parameter-shift rule has to implemented for every variational parameter. It should be noted though, that in principle this can be computed in parallel, provided large enough (quantum) computational resources. For example, different NISQ computers could be used to estimate different gradient components at the same time.

Example of error accumulation in an Ising model
The extra scaling with L 2 in Eq. (A9) is a consequence of error accumulation. If we use the same measurement data to jointly estimate many Hamiltonian terms, then all these estimators become highly correlated. And the effect of outlier corruption -which occurs naturally in empirical estimation -becomes amplified.
Here, we illustrate this subtlety by means of a simple example.
The task is to approximate this expectation value based on computational basis measurements. For each measurement, we either obtain outcome 0 · · · 0 (with probability 1 − p) or outcome 01 · · · 01 (with probability p). This dichotomy extends to our estimator and we are effectively faced with estimating the (rescaled) expectation value of a biased coin. The associated variance of such a coin toss can be easily computed and amounts to Unless λ = 0, 1 (where the variance vanishes), this variance it is proportional to L 2 = (N − 1) 2 and controls the rate of convergence. Asymptotically, a total number of independent coin tosses are necessary (and sufficient) to -approximate the true expectation value E Ê = tr (ρ(λ)H). This is a consequence of the central limit theorem and showcases that a measurement budget scaling with the number L of Hamiltonian terms is unavoidable in general. We emphasize that this is a contrived worst-case argument that showcases how correlated measurements can affect the approximation quality of a sum of many simple terms, while each term individually is cheap and easy to evaluate. A generalization to the Heisenberg Hamiltonian considered in the main text, see Eq. (2), is straightforward.

Proof of Theorem 3
Theorem 3 is a consequence of the following concentration inequality. Let O ∞ denote the operator and spectral norm of an observable. We also use · 1 to denote the trace norm. (t) be a classical shadow estimate thereof. Then, for ∈ (0, 1), This large deviation bound is a consequence of another well-known tail bound, see, e.g., Ref. [77,Theorem 7.30].
Theorem 5 (Bernstein inequality). Let X (1) , . . . , X (T ) be independent, centered (i.e., E [X t ] = 0) random variables that obey |X (t) | ≤ R almost surely. Then, for > 0 Proof of Theorem 4. Fix an observable O = O l with 1 ≤ l ≤ L and define X (t) = tr Oρ (t) − tr (Oρ). Then, by construction of classical shadows, each X (t) is an independent random variable that also obeys E X (t) = 0, courtesy of Eq. (A2). Next, let A ⊆ {1, . . . , N } with |A| = k be the subsystem on which the range-k observable acts nontrivially, i.e., where we also use ρ A 1 = tr(ρ A ) = 1 and the specific form of subsystem classical shadows Eq. (A3), that factorizes nicely into tensor products. Estimating the variance is more difficult by comparison. However, Ref. [25, Proposition S3] asserts In turn, σ 2 ≤ T 4 k and we conclude where the last line is a rough simplification of the exponent. Such a tail bound is valid for any O = O l and the advertised statement follows from taking a union bound (also known as Boole's inequality) over all possible deviations:

Appendix B: Unitary t-designs
Here, we briefly review the notion of unitary t-designs. The Haar measure is the unique left and right invariant measure on the unitary group U (d), where d here stands for the dimension of the full Hilbert space, d = 2 N . Unitary t-designs are ensembles of unitaries that approximate moments of the Haar measure. More precisely, let E be an ensemble of unitaries, i.e., a subset of U (d) equipped with a probability measure. For an operator O acting on the t-fold Hilbert space H ⊗t , the t-fold channel with respect to E is defined as Essentially, we are asking when the average of an operator O over the ensemble E equals an average over the full unitary group. A unitary t-design [78,79] is an ensemble E for which the t-fold channels are equal for all operators O, Being a t-design means we exactly capture the first t moments of the Haar measure with larger t better approximating the full unitary group. There are known constructions of t-designs for t = 2 and t = 3 [80][81][82][83][84].
For t = 1, it is known that any basis for the algebra of operators of H, including the Pauli group, is a 1-design.
In practice, one is more interested in when the ensemble of unitaries is close to forming a t-design. With this, given a tolerance t > 0 one refers to the ensemble E as being an approximate t-design if where · is the diamond norm -a worst-case distance measure that is very popular in quantum information theory, see, e.g., [85]. In the quantum-machine-learning literature the distance between the two t-fold channels is known as the expressibility of the ensemble E [15], the smaller the distance the more expressive the ensemble is.

Appendix C: Entanglement and unitary 2-designs
Random unitary operators have often been used to approximate late-time quantum dynamics. In the crudest approximation, it is assumed that the unitary matrix is directly drawn from the Haar measure. Although modeling quantum dynamics by random unitaries is an approximation, it has led to new insights into black hole physics [86][87][88] and produced computable models of information spreading and entanglement dynamics [89][90][91][92].
In what follows, we consider a weaker situation where the random unitary operator is drawn from an ensemble E forming a 2-design, and focus on the entanglement properties of N -qubits random pure states with U ∼ E. These results have been previously obtained, for example, Refs. [41][42][43] and references therein. Given a bipartition (A, ¬A) of the system, we begin by studying the distance of the reduced density matrix ρ A to the maximally entangled state ρ ∞ A = I A /d A , where d A is the dimension of the Hilbert space H A associated with region A. The full Hilbert space dimension is denoted by d = 2 N .
To do so we first use Jensen's inequality and afterwards employ the inequality (C2), The last term on the right-hand side is related to the purity: As we see, the only nontrivial dependence on U comes from the purity of the reduced density matrix. Let {|I = |i A , j ¬A } i,j be the computational basis for the Hilbert space H = H A ⊗ H ¬A (such that it respects the bipartition).
Let us now proceed with the calculation of the average purity. We first compute the reduced density matrix ρ A and write it as a sum over products of matrix elements of the unitary operator U : dU H U i,j U * i1,j1 = δ i,i1 δ j,j1 /d, dU H U i,j U l,m U * i1,j1 U * l1,m1 = 1 d 2 − 1 (δ i,i1 δ l,l1 δ j,j1 δ m,m1 + δ i,l1 δ l,i1 δ j,j1 δ m,m1 )− 1 d(d 2 − 1) (δ i,i1 δ l,l1 δ j,m1 δ m,j1 + δ i,l1 δ l,i1 δ j,j1 δ m,m1 ), we get that the following simple expression for the expected purity Finally, substituting Eq. (C7) into Eq. (C4) we obtain (C8) Note that the above result implies that when the complementary subsystem ¬A is (significantly) larger than A, the expected deviation of ρ A from the maximally mixed state is exponentially small.

Bounding the expected second Rényi entropy
Let us now explore the average value of the second Rényi entropy, which, as mentioned in the main text, can be easily estimated using the classical shadows protocol by Huang et al..
Computing the exact average value of the second Rényi is a complicated task. Hence, we instead provide a lower and an upper bound for it. On one hand, via Jensen's inequality, we have that which changes the focus of our attention to the expectation value of the purity of the reduced density matrix E E (tr ρ 2 A ). Using the result from the previous subsection Eq. (C7) and taking the logarithm, we get the following lower bound: Taking the large d limit and writing everything in terms of d A /d ¬A we find On the other hand, we have that for any state ρ A the following inequality holds: where S(ρ A ) is the von Neumann entropy of ρ A . Taking averages does not change this relation and we conclude E E (S 2 (ρ A )) ≤ E E (S(ρ A )). The expectation value of the von Neumann entropy is upper bounded by the Page entropy: Page conjectured that this analytical formula accurately captures the von Neumann entropy of a Haar random state. This conjecture was subsequently proven in Ref. [93]. Putting everything together, we obtain Considering now that the number of qubits inside region A is equal to k and assuming that d A /d ¬A = 1/2 N −2k 1 we arrive at the expression in Theorem 1, that is We see that whenever the unitary ensemble E forms a 2-design, the expected value of the second Rényi entropy is close to the Page entropy.  Figure 7. (a-b) The application of our algorithm to the problem of finding the ground state of the SYK model. For the initialization we consider the small-angle (SA) ( θ = 0.1) and identity block (IB) initialization [12] (using one block). We can see that only through the reset of the learning rate η, as suggested by Algorithm 1, WBPs are avoided during the optimization. The entanglement entropy of the target state is nearly maximal (indicated by the dotted line), we omit the WBP line for α = 1 for improved visibility. We measure energy in units of J and use a system size of N = 10, subsystem size k = 2 and a random circuit from Eq. (1) with circuit depth p = 100. Data is averaged over 100 random instances. use a small-angle initialization as well as the identityblock initialization [12] to illustrate our method.
The SYK model is a quantum-mechanical model of 2N spinless Majorana fermions χ i satisfying the following anticommutation relations {χ i , χ j } = δ ij . The SYK model was introduced by Kitaev [52] as a simplified variant of a model introduced by Sachdev and Ye [94]. The Hamiltonian of the model is where the couplings J i,j,k,l are taken randomly from a Gaussian distribution with zero mean and variance var[J i,j,k,l ] = 3! (N − 3)(N − 2)(N − 1) J 2 .
We can study Majorana fermions using spin-chain variables by a nonlocal change of basis known as the Jordan-Wigner transformation: such that {χ i , χ j } = δ i,j . With this representation, encoding 2N Majorana fermions requires N qubits. For our studies, we set J = 1 and consider a system of N = 10 qubits.
We study performance of VQE for SYK model using two different initializations. Fig. 7 (a)-(b) show that the WBP is avoided during optimization for if the learning rate is chosen appropriately. For a large learning rate (η = 1) both initializations encounter a WBP during the optimization (indicated by the gray and blue star). Once the learning rate is decreased (η = 0.1) the entanglement entropy slowly grows to the nearly maximal value associated with the ground state of the SYK model (dotted line) instead of uncontrollably reaching the Page value. For this model, it is important to use α = 1 (the default value) such that the entanglement entropy can grow during the optimization. Only if there is some a priori knowledge of the properties of the ground state, α can be chosen to be smaller.
The identity block initialization [12] here leads to the best optimization performance. We attribute this to the fact that the identity block initialization allows for a faster growth in entanglement since the parameter values are highly fine tuned. Our results suggest that sensitivity of the initialization ansatz to small perturbations may be beneficial for the cases when the ground state is nearly maximally entangled. These results highlight the advantage of using our algorithm. The tracking of the second Rényi entanglement entropy during the optimization reveals that the larger learning rates encounter a WBP while the smaller learning rates successfully avoid it.