Entanglement Devised Barren Plateau Mitigation

Hybrid quantum-classical variational algorithms are one of the most propitious implementations of quantum computing on near-term devices, offering classical machine learning support to quantum scale solution spaces. However, numerous studies have demonstrated that the rate at which this space grows in qubit number could preclude learning in deep quantum circuits, a phenomenon known as barren plateaus. In this work, we implicate random entanglement as the source of barren plateaus and characterize them in terms of many-body entanglement dynamics, detailing their formation as a function of system size, circuit depth, and circuit connectivity. Using this comprehension of entanglement, we propose and demonstrate a number of barren plateau ameliorating techniques, including: initial partitioning of cost function and non-cost function registers, meta-learning of low-entanglement circuit initializations, selective inter-register interaction, entanglement regularization, the addition of Langevin noise, and rotation into preferred cost function eigenbases. We find that entanglement limiting, both automatic and engineered, is a hallmark of high-accuracy training, and emphasize that as learning is an iterative organization process while barren plateaus are a consequence of randomization, they are not necessarily unavoidable or inescapable. Our work forms both a theoretical characterization and a practical toolbox; first defining barren plateaus in terms of random entanglement and then employing this expertise to strategically combat them.


INTRODUCTION
The rapid development of noisy quantum devices [1] has led to great interest in hybrid quantum-classical variational algorithms, through which classical machine learning techniques are employed to prepare, sample, and optimize states on noisy quantum hardware [2][3][4][5][6]. Not only do these algorithms show potential for a variety of near-term applications [7], they are inherently robust against certain coherent errors and are free to minimize decoherence effects through the exploration of unconventional gate sequences. Of particular interest are quantum neural networks (QNNs) [8], in which quantum input states are transformed into output states by a parametrized quantum circuit (PQC). The output states then undergo a series of measurements, collectively referred to as a cost function, and the measurement results are used to optimize the circuit.
Although QNNs offer a straightforward approach, their implementation can be quite challenging. Among the greatest of these difficulties are barren plateaus [9]: regions of the cost function's parameter space where it is nearly constant, varying too little for successful gradient-based optimization. While in shallow circuits these barren landscapes are cost function-dependent [10,11], the effect is cost function-independent for circuits that are sufficiently deep. Moreover, even gradient-free algorithms can be impacted [12,13]. While certain restricted subsets of PQCs are somewhat resilient to barren plateaus [14,15], the most general implementation, known as the "hardware efficient ansatz", becomes exponentially barren with increasing qubit number. Numerous techniques have been suggested for the amelioration of barren plateaus, including layerwise and symmetry-based training [16,17], correlated and identity-like circuit initialization [18,19], and quantum convolutional neural network protocols [20], but they have yet to form a complete toolbox that is suitable for large-scale, general purpose QNNs.
Likewise, our understanding of barren plateaus is extensive yet far from complete. For some years, it has been understood that barren plateaus are a consequence of concentration of measure [21,22], stemming from the effects of randomness on the exponential dimension of quantum state space. More recently, the relationship between entanglement and barrenness has been explored by quantum scrambling studies [23] and in terms of visible and hidden units [24]. However, we still lack a comprehensive understanding of how entanglement induces barren plateaus with respect to cost function register size, qubit connectivity, and circuit depth. As a result, barren plateau mitigation strategies that rely on these insights have yet to be developed.
In this work, we give a detailed account of how random entanglement leads to barren plateau formation. While barren plateaus can be both noise-independent and noise-induced [25], we here consider only the former variety. In particular, we derive the relationship between cost function barrenness and qubit entanglement, including the rate at which barrenness scales with circuit depth. As our findings quantify barrenness via the entanglement of specific qubit subsets, we develop partitioning methods that initially or continuously restrict such entanglement. This generates non-barren cost function landscapes and thus improves circuit learning. We find that initially partitioned circuits not only learn faster, but often produce less entangled solutions. As entangled states are more sensitive to decoherence, this factorizability can decrease the number of measurements required to accurately estimate the cost function, potentially reducing the problematic number of expectation values required for each circuit iteration [26][27][28].
In order to verify and exploit these findings, we design a classical meta-learning protocol that avoids barren plateaus while generating an arbitrary circuit with rich entanglement structure. In contrast to other QNN meta-learning proposals [29], which address specific problem classes, ours is suitable for general PQCs. Moreover, as our meta-learning technique does not pre-train circuit output, it is itself immune to barren plateaus. Furthermore, we model a real-time regularization process that penalizes forms of entanglement that are potentially problematic and show that this method ameliorates barren landscapes, decreasing both training time and error. We also make the novel identification of barren plateaus as a form of Langevin noise in the circuit parameter space and demonstrate the effectiveness of injecting additional Langevin noise into the training process, a technique that has been used to combat overfitting in deep classical neural networks [30]. Finally, we draw a parallel between entanglement dynamics and the improved performance of QNNs in certain measurement bases.

VARIATIONAL ALGORITHMS IN LAYERED 1D QUANTUM CIRCUITS

Before characterizing the relationship between entanglement and barren plateaus, we provide a brief overview of hybrid quantum-classical variational algorithms in 1D circuits. Examples of such circuits are shown in Figs. 1 (a) and (b). These circuits have a total number of n qubits partitioned into two registers: the cost function register R_C, whose qubits are measured with some observable M_C, and the non-cost function register R_N, with qubits that are not directly measured. These registers have n_C and n_N qubits, respectively, such that n = n_C + n_N. In a 1D system, the qubits interact only with their nearest neighbors via two-qubit unitaries, denoted u^k_ij for interactions between the ith and jth qubit in layer k. As this work considers pure states, each u^k_ij can be fully described with six rotation angles as [31]

u^k_ij = ∏_{a<b} R_ab(θ_ab),

where R_ab(θ) is a sinusoidal rotation matrix on axes a and b of the four-dimensional two-qubit space that can be expressed as

R_ab(θ) = exp(iθ K_ab).

Here, K_ab is a Hermitian matrix that is equal to ±i at elements ab and ba and 0 elsewhere. The universal nature of this parametrization distinguishes our work from studies that impose a restricted unitary structure [14,15] and ensures that, for sufficient depth, our unitaries are random enough to generate barren plateaus [9,32]. We note that the θ_i which correspond to u^k_{n_C, n_C+1} are especially significant, as they entangle registers R_C and R_N, and we denote them θ^E_i when relevant. These two-qubit interactions are then organized into full layer unitaries

U_k = ∏_i u^k_{2i-1+q, 2i+q},

where q is the remainder of k/2. As all interactions are pairwise, the u^k_ij in each single-layer unitary U_k commute.
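As an illustration, the six-angle parametrization above can be sketched numerically. This is a minimal sketch with our own helper names; the ordering of the six axis pairs in the product is an assumption made for illustration, since each exp(iθ K_ab) acts sinusoidally only on axes a and b:

```python
import numpy as np

def planar_rotation(a, b, theta, dim=4):
    """Sinusoidal rotation exp(i*theta*K_ab), where K_ab is Hermitian with
    +i at (a, b), -i at (b, a), and 0 elsewhere. The result is a real
    Givens rotation: cos(theta) on the two diagonal entries, -/+ sin(theta)
    off the diagonal."""
    R = np.eye(dim, dtype=complex)
    R[a, a] = R[b, b] = np.cos(theta)
    R[a, b] = -np.sin(theta)
    R[b, a] = np.sin(theta)
    return R

def two_qubit_unitary(thetas):
    """u^k_ij as a product of one rotation per axis pair of the
    4-dimensional two-qubit space (6 pairs -> 6 angles)."""
    pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
    u = np.eye(4, dtype=complex)
    for theta, (a, b) in zip(thetas, pairs):
        u = planar_rotation(a, b, theta) @ u
    return u

u = two_qubit_unitary(np.random.default_rng(0).uniform(0, 2 * np.pi, 6))
assert np.allclose(u.conj().T @ u, np.eye(4))                  # unitary
assert np.allclose(two_qubit_unitary(np.zeros(6)), np.eye(4))  # identity at theta = 0
```

The identity-at-zero property is what the later initialization techniques exploit: setting the six angles of a gate to zero removes it from the circuit without removing it from the parametrization.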
We describe the unitary of the full system as

U = ∏_{k=1}^{L} U_k,

with total number of gate layers L. Fig. 1 (a) illustrates a generic example of such a circuit for n = 5, n_C = 3 and L = 4.
In hybrid quantum-classical variational algorithms, circuit training is described by a cost function L, which is some function f of the expectation value

⟨M_C⟩ = ⟨ψ_in| U† M_C U |ψ_in⟩.

Fig. 1 (b) illustrates a specific learning example: a ground state compressor. The ground state compressor is a circuit that takes n = 9 qubit ground states |Ψ^g_i⟩ and their average z-axis magnetization

m^z_i = (1/n) Σ_j ⟨Ψ^g_i| σ^z_j |Ψ^g_i⟩

as training data, where σ^b_j is the Pauli operator along axis b acting on qubit j. The circuit then learns to compress |Ψ^g_i⟩ into n_C = 3 qubit equivalents |ψ^g_i⟩ in the x-basis by using their average x-axis magnetization as training labels. Here, we generate N_g different |Ψ^g_i⟩ from randomly parametrized long-range interaction Hamiltonians of the form

H_g = Σ_{i<j} (J^z_ij σ^z_i σ^z_j + J^x_ij σ^x_i σ^x_j) + Σ_i w_i σ^z_i + v Σ_i σ^x_i,

where J^z_ij, J^x_ij, w_i, and v are all random constants. In this case, M_C is a series of magnetization measurements m_i and we choose L as the L1 loss between the training output and labels,

L_g = (1/N_g) Σ_i |m^x_i − m^z_i|.

This circuit is an extension of that used in [33]. We remark that this task is inherently global, requiring magnetization information from both R_C qubits 4-6 and R_N qubits 1-3 and 7-9.
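To make the training-data pipeline concrete, the sketch below builds a small random long-range Hamiltonian of the form above, diagonalizes it, and computes the ground state's average z-magnetization label. The precise coupling distributions (standard normal) are our own assumption for illustration:

```python
import numpy as np

I2 = np.eye(2)
sz = np.diag([1.0, -1.0])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])

def embed(ops, n):
    """Tensor product placing the single-qubit operator ops[q] on qubit q."""
    out = np.array([[1.0]])
    for q in range(n):
        out = np.kron(out, ops.get(q, I2))
    return out

def random_longrange_h(n, rng):
    # Assumed form: all-to-all random ZZ and XX couplings plus random local
    # z fields w_i and a single transverse field strength v, per the text.
    H = np.zeros((2 ** n, 2 ** n))
    v = rng.normal()
    for i in range(n):
        for j in range(i + 1, n):
            H += rng.normal() * embed({i: sz, j: sz}, n)
            H += rng.normal() * embed({i: sx, j: sx}, n)
        H += rng.normal() * embed({i: sz}, n) + v * embed({i: sx}, n)
    return H

rng = np.random.default_rng(1)
n = 5
vals, vecs = np.linalg.eigh(random_longrange_h(n, rng))
psi_g = vecs[:, 0]  # ground state |Psi_g> (eigh returns ascending eigenvalues)
m_z = np.mean([(psi_g.conj() @ embed({i: sz}, n) @ psi_g).real for i in range(n)])
assert -1.0 <= m_z <= 1.0  # average magnetization is a valid training label
```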

THE EFFECT OF ENTANGLEMENT ON BARREN PLATEAUS
Barren plateaus are a manifestation of concentration of measure [34], meaning that they arise from the tendency of high-dimensional, random distributions to cluster about their mean. In a PQC, the measurement expectation value ⟨M_C⟩ of the quantum circuit is determined by parameters θ_i. For random circuit initialization, as the number of these parameters grows, the impact of the individual parameter uncertainties becomes small and, for the vast majority of parameter sets θ_i, ⟨M_C⟩ approaches its mean with very low variance such that ∂⟨M_C⟩/∂θ_i → 0. In the interest of building intuition, we can draw an analogy between the collective effects of parameters θ_i on ⟨M_C⟩ and the behavior of an average of N Gaussian random variables,

X = (1/N) Σ_{i=1}^{N} N_i,

where the N_i are Gaussian distributions with mean μ and variance σ². Assuming that all N_i are independent, the uncertainties of the individual N_i are washed out and X ∼ N(μ, σ²/N). That is, the probability that X deviates from μ vanishes exponentially in N.
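This averaging effect is straightforward to verify numerically; the short sketch below (our own illustration) confirms that the empirical variance of X tracks σ²/N:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 0.0, 1.0, 20_000
for N in [1, 10, 100, 1000]:
    # `trials` realizations of the average of N i.i.d. Gaussians
    X = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
    # empirical variance should concentrate as sigma^2 / N
    assert abs(X.var() - sigma ** 2 / N) < 0.2 * sigma ** 2 / N
```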
We emphasize that both concentration of measure and the barren plateaus that it produces are a product of randomness in large dimensional systems, not large dimensionality alone. For this reason, barren plateaus are typically discussed in the context of random PQCs and quantified in terms of unitary t-designs [35][36][37], or probability distributions that approximate the average of polynomial functions of degree ≤ t. Fig. 1 (c) illustrates the characteristic behavior of these features, as detailed in [9]. As the circuits are randomly parametrized, their statistical behavior is described by the Haar distribution. Assuming that U is at least as random as a quantum 1-design, the mean of the cost function gradient O_i ≡ ∂⟨M_C⟩/∂θ_i over the probability distribution of all Haar random unitary matrices U is μ_{O_i} = 0, and the training dynamics rely solely on the variance of this quantity. For relatively shallow circuit depth L, it is known that the unitary approaches a quantum 2-design [37]. As such, the variance of the gradient with respect to this unitary ensemble, σ²_{O_i} = var(O_i), decreases rapidly in L, ultimately reaching the steady-state 2-design value ∼ 2^{−n} [9]. As n becomes large, randomly initialized circuit parameters cease to update and training fails. In what follows, we use the omitted subscript O to refer in general to arbitrary parameters θ_i, using the subscripted version O_i only when the distinction is relevant. For instance, our numerical data is calculated with O_1.
To intuitively understand how random entanglement causes barren plateaus, we point out that for a randomly initialized parameter θ_i to contribute to the concentration of ⟨M_C⟩, and thus to the vanishing of O, it must have some form of influence over the qubits of R_C. For the qubits of R_N, this interaction occurs via U and results in entanglement between the two registers. According to this reasoning, barren plateau emergence should be proportional to the spread of random entanglement. Fig. 2 (a) shows the emergence of barren plateaus vs circuit depth L. As L increases, σ²_O decreases exponentially with √L until approaching its asymptotic limit σ²_B. While shallow circuits with smaller cost function registers R_C initially enjoy greater σ²_O, n determines σ²_B for deep circuits, and we will later conjecture that this asymptote corresponds to entanglement saturation between all qubits of the random circuit. As circuit depth is a form of discretized interaction time τ, this scaling is equivalent to the τ dependence of entanglement growth of two-level quantum systems in 1D [38].
To describe these entanglement dynamics quantitatively, we consider the density matrix of the output qubits,

ρ = U |ψ_in⟩⟨ψ_in| U†.

In a compromise between simplicity and generality, in this work we describe the spread of circuit entanglement with the bipartite entanglement entropy

S = −Tr[ρ_α log₂ ρ_α],

where ρ_α is the reduced density matrix of (n − 1)/2 connected qubits of register R_α, taken so as to contain as many cost function qubits as possible. An analogous quantity can be defined qubit-by-qubit, S_q = −Tr[ρ_q log₂ ρ_q], where R_q is the single-qubit subspace for each qubit q ∈ R_N.
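As a concrete illustration, S can be computed from a simulated statevector by partial trace. This is a minimal sketch with our own qubit-indexing conventions:

```python
import numpy as np

def entanglement_entropy(psi, n, keep):
    """Bipartite entanglement entropy (base 2) between the qubits in `keep`
    and the rest, computed from an n-qubit statevector."""
    psi = np.asarray(psi, dtype=complex).reshape([2] * n)
    rest = [q for q in range(n) if q not in keep]
    m = np.transpose(psi, keep + rest).reshape(2 ** len(keep), -1)
    rho = m @ m.conj().T               # reduced density matrix rho_alpha
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)
product = np.kron([1.0, 0.0], [0.0, 1.0])    # |0>|1>
assert abs(entanglement_entropy(bell, 2, [0]) - 1.0) < 1e-12  # one ebit
assert abs(entanglement_entropy(product, 2, [0])) < 1e-12     # unentangled
```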
We now derive the relationship between S and σ²_O. In particular, we consider R_E, the subspace of qubits that are entangled with (or causal to) the cost function. We begin by proving that σ²_O depends on the dimension d_E of R_E and then establish the link between d_E and S. Let us assume that the circuit input is a product state, here specifically |0⟩. Then the output state ρ = |ψ⟩⟨ψ| can be written as ρ = ρ_E ⊗ ρ_D, where ρ_E belongs to R_E (is entangled with M_C) and ρ_D does not. This factorization implies that only the entangled factor contributes to the measurement,

⟨M_C⟩ = Tr[M_C ρ_E] Tr[ρ_D] = Tr[M_C ρ_E].

Then, the derivative of our observable M_C with respect to a parameter θ_i in layer l becomes

O_i = ∂⟨M_C⟩/∂θ_i = i Tr[ U_R ρ U_R† [K, U_L† M_C U_L] ],

where U_R and U_L are the products of unitaries U_k for k < l and k ≥ l, respectively, and where K is the rotation generator for θ_i. Assuming a randomly initialized circuit, this reduces the problem to that of [9], where using the Haar measure it is shown that μ_{O_i} = 0 with respect to any θ_i, with variance

σ²_{O_i} ∼ 1/d_E,

where d_E = 2^{n_E} is the dimensionality of the entangled subspace ρ_E. We contrast this with the barrenness of a fully entangled circuit, ∼ 1/d = 2^{−(n_E+n_D)} = 2^{−n}, which can, for many applications, be numerous orders of magnitude smaller.
To implicate S in this barren plateau process, we note that for fully entangled qubits, S counts precisely the number of entangled qubits shared between ρ_α and its complementary partition ρ_β. If |ρ_E| < |ρ_α|, so that early entanglement spread must be described with a smaller bipartition of S, we can assume that ρ_D is fully contained in ρ_β, leaving this counting unchanged. The total number of R_C-entangled qubits n_E, and with it d_E, then follows directly from S. For simplicity, in the above proofs we assumed that a given qubit is either completely entangled or disentangled. A similar result for the more general case of partial entanglement follows straightforwardly by taking a general state |ψ⟩ = Σ_i c_i |ψ^E_i⟩|ψ^D_i⟩, following the above steps, and repartitioning each component of the sum. This mapping between the number and degree of cost function-entangled qubits and plateau barrenness highlights that circuit connectivity, and not simply overall circuit depth, is an accurate indicator of the barrenness of the training landscape. Fig. 3 (a) is a proof of principle illustration of this point. We remove the register-connecting u^k_{2,3} gates (unitaries between qubits q_2 and q_3) from each layer k for circuits where n_C = 2 and L = ⟨σ^z_1 σ^z_2⟩, permanently separating, or partitioning, the registers R_C and R_N. As circuit depth grows, entanglement with the qubits of R_N suppresses σ²_O much faster than its partitioned counterpart σ²_{O_P}, which never exceeds the variance of a circuit of total n = 2 and is therefore numerous orders of magnitude larger than the variance σ²_O of the fully entangled system. While insightful, permanent partitioning is clearly not a practical solution for barren plateaus, as it limits not only the barrenness, but also the expressibility of the circuit to that of only n_C qubits.
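The partitioning experiment can be reproduced in a toy statevector simulator. Here Haar-random 4x4 gates stand in for the six-angle parametrized unitaries (an illustrative substitution), and skipping every gate on the R_C-R_N bond pins the cross-register entanglement at exactly zero:

```python
import numpy as np

def apply_gate(psi, u, i, n):
    """Apply a 4x4 unitary to adjacent qubits (i, i+1) of an n-qubit state."""
    psi = psi.reshape([2] * n)
    psi = np.moveaxis(psi, [i, i + 1], [0, 1]).reshape(4, -1)
    psi = (u @ psi).reshape([2, 2] + [2] * (n - 2))
    return np.moveaxis(psi, [0, 1], [i, i + 1]).reshape(-1)

def haar_u4(rng):
    z = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def run_circuit(n, layers, rng, cut=None):
    """Brickwork circuit on |0...0>; if `cut` is set, all gates on bond
    (cut, cut+1) are skipped, permanently partitioning the registers."""
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    for k in range(layers):
        for i in range(k % 2, n - 1, 2):
            if i == cut:
                continue
            psi = apply_gate(psi, haar_u4(rng), i, n)
    return psi

def bipartite_entropy(psi, n_c):
    m = psi.reshape(2 ** n_c, -1)
    ev = np.linalg.eigvalsh(m @ m.conj().T)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

rng = np.random.default_rng(2)
n, n_c, depth = 6, 2, 20
S_full = bipartite_entropy(run_circuit(n, depth, rng), n_c)
S_part = bipartite_entropy(run_circuit(n, depth, rng, cut=n_c - 1), n_c)
assert S_part < 1e-9  # partitioned: R_C never entangles with R_N
assert S_full > 1.0   # unrestricted: entanglement spreads toward saturation
```

The partitioned register thus behaves like an isolated n_C-qubit circuit, which is precisely why its gradient variance stays large while the fully connected circuit's variance collapses.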

INITIALIZATION TECHNIQUES FOR BARREN PLATEAU MITIGATION
While permanent partitioning is tantamount to simply employing a circuit of smaller n, initial parameter restrictions can improve circuit trainability without reducing circuit expressibility. Intuitively, the advantages of this method stem from the role of entropy as a thermodynamic arrow that drives statistical processes forward. As a toy example, let us imagine a classical machine learning protocol where we would like to create a gaseous mixture with optimized concentrations of two gases. If we initially partition the gases, the learning algorithm can simply allow the gases to mix themselves by passing through a vent in the partition, sealing the vent when the ideal concentration is reached on one side. This process occurs independently, driven forward by entropic considerations. If, however, the gases are initially mixed and therefore have a maximum entropy configuration, the learning algorithm cannot succeed by simply unsealing a vent. The problem has been complicated and learning will fail unless more heroic measures are taken.
In this section, we explore methods for quantum equivalents of such entropy-limiting initialization schemes, and in the section on dynamic control of barren plateaus we detail some of these "more heroic" measures.

Initial Entanglement Partitioning
One way to avoid barren plateaus without suppressing expressibility is to initially partition the circuit, as in Fig. 3 (a), but then to allow R_C-R_N entanglement throughout the training process. This method is fundamentally distinct from [15,19], as we only initialize a subset of two-qubit gates u^k_ij to the identity and have devised a cost function and entanglement-based strategy to motivate this choice. Furthermore, our treatment applies to universal PQCs of potentially great depth, not restricted subspaces of U [15]. Fig. 3 (b) displays the n = 9 ground state compressor loss L = L_g of Eq. 9 (gray) and training of L = |⟨σ^z_1 σ^z_2 σ^z_3⟩| (red) for both initially partitioned (solid lines) and fully random (dashed) initializations with L = 200. Throughout this work, the AMSGrad gradient descent algorithm is used for circuit parameter updates [39]. The corresponding bipartite entanglement entropies S are shown in the inset.
At first, initially partitioned circuits suffer a bout of decreased accuracy, which corresponds to a period of low yet rapidly growing entanglement S (inset) that is either insufficient to express the target state of interest (gray) or simply lower than those of many of the degenerate solutions (red). Later, initially partitioned circuits can produce lower error and require fewer training epochs, which we hypothesize stems from the responsiveness of the gradient during the initial phase of low entanglement, enabling the system to train unfettered by barren plateaus and driving interactions forward through entropy growth.
Furthermore, strategic initializations may be sufficient to avoid barren plateaus throughout training. We emphasize that the relationship between plateau barrenness and S (or alternatively, in other works, n and L) is a product of circuit parameter randomness of at least a quantum 2-design and can thus only be assumed for random circuit initializations. That is, such randomness cannot generally be assumed throughout the training process, as training represents an inherently structured organization of circuit parameters.
When initially partitioned cost functions are learned with high accuracy (here, ground state compression, but also the partitioned training of L = ⟨σ^z_1 σ^z_2 σ^z_3⟩ in Fig. 5 (c)), S peaks towards the end of the rapid training period before dropping down to a lower steady-state value. This indicates that the initially partitioned circuit identifies an appropriate solution that is less entangled with the unmeasured qubits of R_N, a potentially desirable quality as widespread entanglement can lower the coherence time of qubits. Moreover, an extension of this technique could be used to partially factor the cost function registers themselves, resulting generally in fewer required readouts for a given cost function determination and ameliorating the so-called "measurement problem" [26][27][28].
We note that this ultimate drop in bipartite entanglement is reminiscent of the late-stage decrease in tripartite mutual information noted in [33]. The two phenomena are not in conflict, however: the former indicates the disentanglement of measured and unmeasured qubits after the necessary information from those qubits has been collected, while the latter signals that the global features of the input information are learned towards the end of the training process. In fact, both observations suggest that information locality is the salient feature of late-stage hybrid quantum-classical learning algorithms. Finally, we note that a reduction in S also occurs for non-partitioned circuits that can be learned with high accuracy (dashed gray in Fig. 3 (b) and dashed black in Fig. 5 (c)), whereas it is absent from low-accuracy circuits (red). Indeed, some degree of automatic R_C-R_N factorization appears to be a natural feature of high-accuracy QNN training.
Finally, we comment that the cost function L = ⟨σ^z_1 σ^z_2 σ^z_3⟩ trains rapidly even when initialized to a barren plateau, and neither its training time nor its accuracy is improved by mitigating these barren plateaus via initial partitioning. This suggests that certain classes of cost functions (in this case, observables that target one of their eigenstates, as will be discussed in the section on natural cost function bases) may be naturally resistant to barren plateaus. This could potentially be due to a rapidly accelerating ordering process, wherein even small σ²_O quickly navigate L to a non-barren region of its landscape.

Entanglement Meta-Learning as Circuit Pre-Training
A similar yet more sophisticated solution for non-barren initializations is classical pre-training of the circuit gates to control R_C-R_N entanglement. This process is a form of meta-learning [40], a branch of machine learning algorithms directed at optimizing the learning process of other algorithms.
We must be careful, however, in our choice of pre-training cost function. Simply minimizing S would itself be a form of randomly parametrized gradient descent algorithm on the output qubits and would thus, like previous meta-learning techniques, tend to generate the concentration of parameters that leads to barren plateaus [29]. We reiterate that this observation is not in conflict with our claim that σ²_O vanishes ∝ 2^{−S} for randomly initialized PQCs, as this relation is not universal, but rather applies to circuit unitaries that are Haar distributed, and can therefore only be assumed in random, not pre-trained, circuits.
As an alternative to S, we can combat barrenness by minimizing the collective entanglement S_C of registers R_C and R_N, which considers the 2n-qubit space of both input and output registers. To define S_C, we boost into the 2n-qubit pure state

|Φ⟩ = (1/√(2^n)) Σ_i |ψ_i⟩ ⊗ U|ψ_i⟩,

where the |ψ_i⟩ represent some set of basis vectors in the 2^n-dimensional Hilbert space. We can then define the density matrix operator of the full 2n-qubit collective system,

ρ_{2n} = |Φ⟩⟨Φ|.

Now the reduced density matrix and corresponding entropy of entanglement of R_C are defined over its 2n_C input and output qubits as

ρ_C = Tr_N[ρ_{2n}], S_C = −Tr[ρ_C log₂ ρ_C],

where Tr_N is a trace over the 2n_N input and output qubits of R_N. As S_C is minimized, σ²_O grows more similar to the variance of an n_C qubit system, reducing the barren plateau effect by a factor of ≈ 2^{n−n_C}. Critically, the average magnitude of R_C-R_N interaction on a given layer k is not reduced. To see this, consider a rough metric of inter-register mixing,

⟨|sin(θ^E_i)|⟩ = (1/3L) Σ_{i=1}^{3L} |sin(θ^E_i)|,

where θ^E_i are the 3L rotation angles of the u^k_{n_C,n_C+1} gates that entangle the registers R_C and R_N. |sin(θ^E_i)| describes these interactions because it is the average magnitude of the off-diagonal (or rotating) elements in the two-qubit rotation matrices u^k_{n_C,n_C+1}. The inset of Fig. 4 demonstrates that this quantity remains at its uniformly distributed value 2/π, even as the collective registers become increasingly factored. This indicates that while the total entanglement of collective R_C and R_N is reduced, the average inter-register interaction at any given layer k remains unaffected, providing a highly non-trivial circuit initialization. This method is distinct from [19] as it does not produce a network of identity-producing blocks, but rather a nearly arbitrary initialization with the sole yet crucial constraint of adjustable factorizability along a single connection. This distinction may be particularly important for deep circuits [41]. What is more, this method assumes no specification of problem structure [29], making it generally applicable.
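For small n, the collective entanglement S_C can be computed directly from the boosted state. In this sketch (our own qubit ordering: n input qubits followed by n output qubits), a register-factorized U = U_C ⊗ U_N gives S_C = 0, while a cross-register CNOT carries exactly one ebit of operator entanglement across the cut:

```python
import numpy as np

def haar_unitary(dim, rng):
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def collective_entropy(U, n, n_c):
    """S_C of R_C's 2*n_c input and output qubits in the boosted state
    |Phi> = 2^{-n/2} sum_i |psi_i> (x) U|psi_i>."""
    d = 2 ** n
    phi = U.T.reshape(-1) / np.sqrt(d)   # phi[i*d + j] = U[j, i] / sqrt(d)
    phi = phi.reshape([2] * (2 * n))     # axes 0..n-1: inputs; n..2n-1: outputs
    keep = list(range(n_c)) + list(range(n, n + n_c))
    rest = [q for q in range(2 * n) if q not in keep]
    m = np.transpose(phi, keep + rest).reshape(2 ** len(keep), -1)
    ev = np.linalg.eigvalsh(m @ m.conj().T)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

rng = np.random.default_rng(5)
n, n_c = 4, 2
U_fact = np.kron(haar_unitary(4, rng), haar_unitary(4, rng))  # U_C (x) U_N
assert collective_entropy(U_fact, n, n_c) < 1e-9              # factorized: S_C = 0

# a CNOT from qubit 1 (in R_C) to qubit 2 (in R_N) adds one ebit of
# operator entanglement across the register cut, so S_C = 1 regardless of U_C, U_N
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
cnot = (np.kron(np.eye(2), np.kron(P0, np.eye(4)))
        + np.kron(np.eye(2), np.kron(P1, np.kron(X, np.eye(2)))))
assert abs(collective_entropy(cnot @ U_fact, n, n_c) - 1.0) < 1e-9
```

The invariance of S_C under register-local unitaries is what makes it a useful pre-training objective: it penalizes only genuine cross-register structure, not the gates within each register.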
Although classical pre-training is untenable for circuits of large n, it can serve as meta-learning for the fundamental effect of entanglement on PQCs and their optimal initializations. As such, it may lead to some scalable generalization of the procedure. It could also be applied iteratively on subsets of large circuits, i.e., on the qubits which form the border between R_C and R_N. Most promisingly, recent advances in efficient subsystem entanglement measuring techniques, such as random measurements [42] and fidelity out-of-time-order correlators [43], may pave the way for on-hardware hybrid quantum-classical variational minimization of the collective entanglement S_C, or some analogous measure, enabling full-circuit, high-variance gradient initializations for arbitrary size PQCs.

DYNAMIC CONTROL OF BARREN PLATEAUS
As the difficulties of training the relatively simple cost function L = |⟨σ^z_1 σ^z_2 σ^z_3⟩| for deep circuits in Fig. 3 (b) allude, initialization techniques can be insufficient for complete mitigation of barren plateaus. To combat this, we now propose a variety of methods to directly manage long-term entanglement of the R_C and R_N output registers. Returning to the analogy of optimal mixing of a classical bipartite gas introduced above, this section details quantum analogues of the more "heroic" measures that we can take to dynamically control the gaseous mixture's entropy, such as regularization (penalization of the learning algorithm) or the introduction of additional dynamics into the system.

Hard Limit on R_C-R_N Entangling Gates
The simplest of these dynamic methods is imposing a hard limit on the number of R_C-R_N entangling layers L_E in an otherwise deep circuit. This is explored in Fig. 4 (b). Although the total depth is L = 200, relatively large gradients can still be achieved while still permitting a considerable number of R_C-R_N interactions. As L_E grows, σ²_O decays with a similar scaling in L_E as unrestricted circuits do in total gate number L, corroborating that barren plateaus indeed arise with the spread of cost function entanglement, not circuit depth itself. This method could be particularly fruitful when using a reinforcement learning algorithm [29,44], as the circuit could learn to process and extract the most relevant portions of R_N before ultimately transferring them to R_C in a limited number of entangling layers L_E.

Fig. 5 (caption): Whereas the cost functions of (a) and (c) can be learned rather rapidly and accurately, and do so with naturally lower levels of bipartite entanglement S that are more responsive to regularization, that of (b) trains slower and with less accuracy, with entanglement growth that is controlled very little by regularization. The black dashed line in (c) corresponds to S for the unregularized, unpartitioned L = ⟨σ^z_1 σ^z_2 σ^z_3⟩, which learns as effectively as its partitioned and regularized counterparts, highlighting the increased learnability of eigenstate learning.
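A toy version of this hard limit (again with Haar-random 4x4 gates standing in for the parametrized unitaries) shows R_C-R_N entanglement growing with the entangling-layer budget L_E:

```python
import numpy as np

def apply_gate(psi, u, i, n):
    """Apply a 4x4 unitary to adjacent qubits (i, i+1) of an n-qubit state."""
    psi = psi.reshape([2] * n)
    psi = np.moveaxis(psi, [i, i + 1], [0, 1]).reshape(4, -1)
    psi = (u @ psi).reshape([2, 2] + [2] * (n - 2))
    return np.moveaxis(psi, [0, 1], [i, i + 1]).reshape(-1)

def haar_u4(rng):
    z = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def entropy_after(n, n_c, layers, L_E, rng):
    """Brickwork circuit that applies the R_C-R_N entangler (bond n_c - 1)
    only in the first L_E layers; returns the final R_C-R_N entropy."""
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    for k in range(layers):
        for i in range(k % 2, n - 1, 2):
            if i == n_c - 1 and k >= L_E:
                continue  # entangling budget exhausted
            psi = apply_gate(psi, haar_u4(rng), i, n)
    m = psi.reshape(2 ** n_c, -1)
    ev = np.linalg.eigvalsh(m @ m.conj().T)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

rng = np.random.default_rng(6)
S0 = entropy_after(6, 2, 20, 0, rng)    # no entangling layers allowed
S10 = entropy_after(6, 2, 20, 10, rng)  # modest entangling budget
assert S0 < 1e-9 and S10 > S0
```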

Entanglement Regularization
A yet more dynamic method for limiting entanglement is regularization of the R_C-R_N gates. Regularization adds a penalizing term with adjustable scale parameter λ to the original cost function L in order to implicitly limit the amount of cross-register entanglement,

η = λ L Σ_i |sin(θ^E_i)|.

The penalty Σ_i |sin(θ^E_i)| is proportional to the inter-register mixing measure ⟨|sin(θ^E_i)|⟩ of the meta-learning section and serves to limit entanglement generating interactions. Moreover, as the values θ_i are already stored within the classical learning algorithm, this metric does not require additional queries to the quantum hardware. Scaling η by L results in an adaptive regularization process that resists entanglement in regions of poor solutions while relaxing to the original learning problem as L approaches zero. This adaptivity can be made even more fruitful by making λ a decreasing function of L, such that η disturbs the learning process even less near optimal solutions.
The regularized gradient is then

∂(L + η)/∂θ_i = (∂L/∂θ_i)(1 + λ Σ_j |sin(θ^E_j)|) + λ L sgn(sin(θ^E_i)) cos(θ^E_i),

where the final term appears only for the entangling angles θ^E_i. During portions of the training process that are still largely random, the average of ∂L/∂θ_i over all Haar random unitaries remains μ_{O_i} = 0, as we can assume by concentration of measure for deep circuits that Σ_i |sin(θ^E_i)| is approximately constant. The variance, however, does increase: the first term is amplified by the approximately constant factor (1 + λ Σ_j |sin(θ^E_j)|)², and the second contributes a term proportional to λ²⟨L²⟩. We highlight that although regularization only directly augments the variance of the parameters θ^E_i upon which it acts, we observe a similar increase on the unregularized angles, indicating that its mitigation of barren plateaus is a system-wide effect. Furthermore, we note that λ is adjustable and that the regularized variance grows quadratically in circuit depth, whereas σ²_O is constant for deep circuits with a given number of qubits n.
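A small sketch of this product-rule gradient (the helper and its argument names are our own illustrative choices):

```python
import numpy as np

def regularized_grad(dL_dtheta, loss, theta_E, entangler_index, lam):
    """Gradient of L + eta with eta = lam * L * sum_j |sin(theta_E_j)|,
    via the product rule. `entangler_index` selects theta^E_i when the
    differentiated angle is itself an entangler (None otherwise)."""
    mix = np.sum(np.abs(np.sin(theta_E)))
    g = dL_dtheta * (1.0 + lam * mix)            # amplified original gradient
    if entangler_index is not None:
        t = theta_E[entangler_index]
        g += lam * loss * np.sign(np.sin(t)) * np.cos(t)  # direct penalty term
    return g

# worked check: dL/dtheta = 0.1, L = 0.5, one entangler angle at pi/2, lam = 0.1
g = regularized_grad(0.1, 0.5, np.array([np.pi / 2]), 0, lam=0.1)
assert abs(g - 0.11) < 1e-9  # 0.1 * (1 + 0.1) + 0.1 * 0.5 * cos(pi/2) = 0.11
```

Because the angles live in the classical optimizer, this correction costs no additional circuit evaluations, matching the hardware-free property claimed above.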
Fig. 5 displays this learning process for an initially partitioned circuit trained with a λ which is piecewise-adaptive in L (solid lines), in comparison with an algorithm using only initial partitioning, that is, λ = 0 (dashed), for L = 200. Three different loss functions are used: (a, gray) the ground state compressor L = L_g (Eq. 9), (b, red) L = |⟨σ^z_1 σ^z_2 σ^z_3⟩|, and (c) L = ⟨σ^z_1 σ^z_2 σ^z_3⟩. Ground state compression can be achieved both faster and with greater factorization of the output solution, while solutions to L = |⟨σ^z_1 σ^z_2 σ^z_3⟩| have greatly improved accuracy. Although L = ⟨σ^z_1 σ^z_2 σ^z_3⟩ is rapidly learned both with and without regularization and/or initial partitioning, its regularized solutions still benefit from the increased factorizability, with the dashed black line in Fig. 5 (c, inset) showing its unpartitioned S.
As discussed above, we again comment on the seeming resilience to barren plateaus of L = ⟨σ^z_1 σ^z_2 σ^z_3⟩ and other cost function measurements that target their eigenstates. While still beginning in a barren landscape with overwhelming probability for random circuit initializations, these algorithms learn equally well with or without barren plateau mitigation.

Langevin Noise as Gradient Supplement
The results of Fig. 5 (b) raise an interesting point: the addition of regularization terms to the cost function can improve accuracy without significantly decreasing entanglement. We hypothesize that this is because η, while sometimes successfully limiting entanglement, is always providing additional perturbation in the form of noise. Langevin noise in particular has proven fruitful in classical machine learning and has been used to prevent overfitting in classical neural networks [30].
To motivate this hypothesis, we make the observation that the cost function gradient in barren plateaus can be conceptualized as a form of Langevin noise in circuit parameter space. Typically, Langevin noise is defined for functions g that vary with time τ. Then g(τ) satisfies the conditions

⟨g(τ)⟩_Lan ≡ ∫ g(τ) dτ = 0 and ⟨g(τ)g(τ′)⟩_Lan = 2D δ(τ − τ′),

where δ is the Dirac delta function and D is some finite, non-zero diffusion constant [45]. In the case of O, the moments are not integrals over time, but rather over the parameters θ_i as described by the Haar measure, such that D = σ²_O/2. Under this Langevin noise formulation, barren plateaus can be framed as an entanglement-induced diminution of D. To examine the utility of Langevin noise, we can add to the cost function an additional noise term G with derivatives g_i = ∂G/∂φ_i, where the φ_i are an arbitrarily chosen subset of circuit parameters of size N. For uniformly distributed φ_i ∈ [0, 2π), this yields the equivalent relations ⟨g_i⟩ = 0 and ⟨g_i g_j⟩ = 2D δ_ij. Fig. 6 (a) illustrates the effectiveness of such Langevin noise in barren landscapes, producing a high-accuracy solution for L = |⟨σ^z_1 σ^z_2 σ^z_3⟩| despite fully random initialization for n = 9 on a deep circuit (L = 200). We note that, like the increased variance of entanglement regularization, angles that are not directly perturbed by added Langevin noise still enjoy an increase in variance from the system-wide effect of the technique.
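These zero-mean, finite-diffusion conditions are easy to check numerically for an assumed noise term of the form G = Σ_i sin(φ_i) (our own illustrative choice), whose derivatives g_i = cos(φ_i) over uniform φ_i satisfy ⟨g_i⟩ = 0 and ⟨g_i²⟩ = 1/2:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
phi = rng.uniform(0.0, 2.0 * np.pi, N)  # uniformly distributed noise parameters
g = np.cos(phi)                         # g_i = dG/dphi_i for assumed G = sum_i sin(phi_i)

assert abs(g.mean()) < 0.01               # <g_i> ~ 0: zero-mean Langevin condition
assert abs((g ** 2).mean() - 0.5) < 0.01  # <g_i^2> = 1/2: finite diffusion constant
```

In a training loop, such a g_i would simply be added to the (vanishing) cost function gradient before each parameter update, supplying motion on the plateau without biasing the optimum.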

Natural Cost Function Bases
Finally, we discuss our repeated observation that successful learning in initially barren landscapes is greatly facilitated when the target output is an eigenstate of the cost function observables, a basis choice that we refer to as "natural". This observation of a natural basis has been made for other product state PQC objective functions, such as in the basis transformations of electronic structure reference states in quantum chemistry [2]. Not only do such configurations learn more rapidly, but their training rate and accuracy are also unaffected by otherwise successful barren plateau mitigation techniques.
We have suggested a potential link between this trainability of natural basis cost functions and the tendency of these circuits to limit their own entanglement, navigating out of the barren landscapes of random matrices and into a tractable configuration. In particular, the variational preparation of measurement eigenstates has several entropy-based advantages, such as a vanishing gradient variance as the circuit approaches an optimal solution. As discussed in Sec., this effect may also be due to a rapidly accelerating ordering process, wherein even small $\sigma^2_O$ lead to $\mathcal{L}$ efficiently escaping into non-barren regions of its landscape.
Regardless of the origin of its effect, rotating cost function measurements into natural bases can be a strong barren plateau mitigating strategy. Fig. 6(b) illustrates that by substituting $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\sigma^x_3\rangle$ for $\mathcal{L} = |\langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle|$, we can obtain a desired solution $|\psi\rangle$ such that $\langle\psi|\sigma^z_1\sigma^z_2\sigma^z_3|\psi\rangle = 0$ much more effectively. The replacement of a single $\sigma^z$ with the operator $\sigma^x$ reduces the problem to an eigenstate optimization and results in a learning process that trains quickly and automatically limits entanglement (entanglement behavior analogous to the black dashed line in Fig. 5(c)), in contrast to the difficulties of the original problem (red dashed line in Fig. 3(b)).
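This basis substitution can be checked directly with a few lines of linear algebra. The sketch below (our illustrative reconstruction, not the paper's code) verifies that the hypothetical candidate state $|00+\rangle$ is a $+1$ eigenstate of $\sigma^z_1\sigma^z_2\sigma^x_3$ while satisfying $\langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle = 0$:

```python
import numpy as np

# Single-qubit Paulis
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

def kron(*ops):
    """Tensor product of a sequence of vectors or matrices."""
    out = ops[0]
    for op in ops[1:]:
        out = np.kron(out, op)
    return out

ZZZ = kron(Z, Z, Z)  # original observable: the target only needs <ZZZ> = 0
ZZX = kron(Z, Z, X)  # "natural" substitute: the target is one of its eigenstates

# Candidate solution |psi> = |0>|0>|+>, a product state with <ZZZ> = 0
zero = np.array([1.0, 0.0])
plus = np.array([1.0, 1.0]) / np.sqrt(2)
psi = kron(zero, zero, plus)

exp_zzz = psi @ ZZZ @ psi  # expectation of the original cost observable
eig_check = ZZX @ psi      # should equal +1 * psi (eigenstate condition)
print(exp_zzz, np.allclose(eig_check, psi))
```

Because the target is an exact eigenstate of the substituted observable, the optimization becomes an eigenstate preparation problem rather than an expectation-value nulling problem.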

CONCLUSION
We have demonstrated the relationship between total qubit-cost function random entanglement and the barrenness of a learning landscape, both analytically and numerically, and oriented these findings within the context of many-body entanglement dynamics. Based on these results, we established various metrics for barren plateau prediction, both in terms of entanglement and, for a 1D system, circuit depth. We also proposed an input-output entanglement metric, whose minimization we suggest is key to circuit learnability. Using this knowledge, we went on to propose various mitigation schemes, including initial partitioning of cost function and non-cost function registers, meta-learning of low-entanglement, high-interaction PQC initializations, limiting inter-register interaction, entanglement regularization, the addition of Langevin noise, and utilizing natural cost function bases. We demonstrated the effectiveness of these techniques, elucidating the role that entanglement minimization plays in both the assisted and unassisted training of QNNs and emphasizing that, as existing barren plateau proofs assume sufficiently random parametrizations that do not apply under all circumstances, barren plateaus can potentially be avoided or escaped in generic PQCs.
While these findings imply that QNN learning must strike a non-trivial balance between randomness, expressibility, and barrenness, they lay the groundwork for numerous mitigation techniques that may facilitate large-scale quantum circuit learning. Furthermore, these methods furnish various secondary benefits, such as solution factorization, novel paradigms of quantum meta-learning, and increased understanding of circuit optimization, to name a few. Finally, they suggest that the growth of circuit entanglement could potentially be harnessed to drive the learning process.
Oftentimes, the presence of barren plateaus in PQCs is interpreted as an absolute impasse, as it is typically believed to preclude learning. However, this work emphasizes that not only can barren plateaus be ameliorated through entanglement considerations, they should be understood as manifestations of circuit randomness that have not been proven to apply to more organized configurations, such as those which may manifest during the learning process. To understand the relationship between cost function barrenness and total circuit learnability, the evolution of circuit parameter distributions throughout the learning process should be characterized. Such a statistical characterization will also shed further light on the viability of barren plateau mitigation methods.

FIG. 1. (a) Diagram of a linear circuit with $n_C = 3$, $n_N = 2$, and $L = 4$. Qubits $q_i$ and $q_j$ interact in layers $k$ through two-qubit unitaries $u^k_{ij}$, which comprise layer unitaries $U^k$ and ultimately form the total unitary $U$. The qubits of $R_C$ are then read out by the cost function operator $M_C$. (b) Ground state compressor for randomly generated 9-qubit long-range interaction Hamiltonian (Eq. 8) ground states. The circuit learns to represent the ground states $|\Psi_g\rangle$ as 3-qubit representations $|\psi_g\rangle$. (c) An illustration of barren plateaus with $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\rangle$ ($n_C = 2$) and $n = 3, 5, 7, 9$ (blue, orange, green, red). The variance $\sigma^2_O$ of the partial cost function derivative $O$ is known to decrease rapidly with increasing circuit layers $L$ until ultimately reaching barren plateau magnitude $\sigma^2_B \propto 2^{-n}$.
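The $\sigma^2_B \propto 2^{-n}$ plateau magnitude can be illustrated without simulating a circuit: for Haar-random pure states, the variance of a single-qubit Pauli expectation is exactly $1/(2^n + 1)$. The sketch below (our illustrative numerics, not the paper's; sample counts and helper names are assumptions) samples normalized complex Gaussian vectors, which are Haar-distributed:

```python
import numpy as np

rng = np.random.default_rng(7)

def haar_state(n):
    """Haar-random pure state on n qubits via normalized complex Gaussians."""
    v = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
    return v / np.linalg.norm(v)

def exp_z1(psi, n):
    """<Z_1>: sign of each amplitude set by the first (most significant) qubit."""
    signs = np.repeat([1.0, -1.0], 2 ** (n - 1))
    return float(np.sum(signs * np.abs(psi) ** 2))

samples = 20000
results = {}
for n in (3, 5, 7):
    results[n] = np.var([exp_z1(haar_state(n), n) for _ in range(samples)])
    # Haar average gives Var[<Z_1>] = 1/(2^n + 1), i.e. ~2^-n suppression
    print(n, results[n], 1 / (2**n + 1))
```

Each additional qubit roughly halves the variance of the measured expectation, which is the same exponential suppression that flattens gradient landscapes in deep random circuits.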

FIG. 2. (a) Variance $\sigma^2_O$ of cost function derivative $O$ for $\mathcal{L} = \langle\prod_{i=1}^{n_C}\sigma^z_i\rangle$ in units of its barren plateau value $\sigma^2_B$ vs number of gate layers $L$ for $n = 9$ and various $n_C$. (Inset) The change in entanglement entropy $S$ vs circuit depth $L$ for a 4-5 qubit bipartition, as defined in Eq. 11 and illustrated in the inset of (b). Whereas $S$ describes the entanglement growth analytically for $n_C = 4$, it is only an approximation for $n_C = 1, 2, 3$, which could be improved by using an initial $n_C$-$n_N$ bipartition. (b) Variance $\sigma^2_O$ of cost function derivative $O$ in units of its barren plateau value $\sigma^2_B$ vs normalized change in entropy $S$ for $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\rangle$ ($n_C = 2$) and $n = 3, 5, 7, 9$ (blue, orange, green, and red). Larger values of $n$ experience greater relative suppression of $\sigma^2_O$, as $\sigma^2_O/\sigma^2_B \propto 2^{S_B - S}$, which is the analytical solution for $n = 5$ (orange) and an approximation for other $n$. (Inset) Schematic of the bipartition entropy of entanglement $S$ for $n = 5$. The full density matrix $\rho$ is broken into two subsets $R_\alpha$ and $R_\beta$, where $R_\alpha$ always contains as much of $R_C$ as possible.
The remaining $(n+1)/2$ qubits are in $R_\beta$, such that $\rho_\alpha = \mathrm{Tr}_\beta[\rho]$, as illustrated in the inset of Fig. 2(b). For pure states, this entropy is symmetric and $S = -\mathrm{Tr}[\rho_\beta\log_2\rho_\beta]$ is equivalent. As this entanglement approximation assumes full entanglement within $R_\alpha$, it is most precise when $R_C = R_\alpha$. Fig. 2(a) displays $\sigma^2_O$ vs circuit depth for a variety of $n_C$ in an $n = 9$ system. While all $n_C$ scale roughly as $2^{-S}$ (inset), $n_C = 4$ is most accurately characterized because, in that case, $R_C = R_\alpha$. Thus, while a single such partitioning is adequate for describing entanglement spread in configurations with $|R_C| \sim |R_\alpha| = (n-1)/2$, various such partitions may be used to track short-term entanglement growth when $|R_C| \ll |R_\alpha|$ or long-term entanglement growth when $|R_C| \gg |R_\alpha|$, such that the entanglement entropy does not temporarily stagnate, as for $n = 9$ in Fig. 2(b) (red), or rapidly saturate, as for $n = 3$ in Fig. 2(b) (blue). The plot is scaled from the initial entanglement $S_0$ and normalized to the asymptotic difference $S_B - S_0$. In particular, $n = 3$ (blue) is initially saturated, as it is nearly fully entangled with the minimal number of gates $L = 2$, while, as $|R_C| < |R_\alpha|$, $n = 7, 9$ (green, red) have superlogarithmic scaling for $S \to S_B$. At the expense of computational simplicity, more general metrics could be adopted, such as an $n_N$-fold sum of bipartite mutual information $I_2$.
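The bipartite entropy used throughout is straightforward to compute from a state vector via its Schmidt decomposition. The sketch below (illustrative; the helper name and the convention that $R_\alpha$ holds the first qubits are our assumptions) reproduces $S = 0$ for a product state and $S = 1$ for a GHZ state:

```python
import numpy as np

def entanglement_entropy(psi, n, n_alpha):
    """Bipartite entropy S = -Tr[rho_alpha log2 rho_alpha] of an n-qubit pure
    state, with the first n_alpha qubits forming R_alpha."""
    # Reshape the state vector into a (dim_alpha, dim_beta) matrix; its
    # singular values s are the Schmidt coefficients, so rho_alpha has
    # eigenvalues s^2.
    m = psi.reshape(2**n_alpha, 2**(n - n_alpha))
    s = np.linalg.svd(m, compute_uv=False)
    p = s**2
    p = p[p > 1e-12]  # drop numerical zeros before taking the log
    return float(-np.sum(p * np.log2(p)))

# Product state |000>: no entanglement across any cut
prod = np.zeros(8); prod[0] = 1.0
# GHZ state (|000> + |111>)/sqrt(2): exactly 1 ebit across any bipartition
ghz = np.zeros(8); ghz[0] = ghz[7] = 1 / np.sqrt(2)

s_prod = entanglement_entropy(prod, 3, 1)
s_ghz = entanglement_entropy(ghz, 3, 1)
print(s_prod, s_ghz)
```

The same routine, applied across a sweep of cuts, is the kind of building block needed for the multi-partition tracking described above.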

FIG. 3. (a) Partitioned cost function derivative variance $\sigma^2_{O_P}$ for $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\rangle$ in units of its non-partitioned value $\sigma^2_O$ vs number of gate layers $L$ for $n = 3, 5, 7, 9$ (blue, orange, green, and red) and $n_C = 2$. As expected, $\sigma^2_{O_P}$ is approximately a factor of $2^{n_N}$ larger than $\sigma^2_O$ due to the absence of random $R_C$-$R_N$ entanglement. (b) Training loss $\mathcal{L} = \mathcal{L}_g$ of Eq. 9 for the ground state compressor in Fig. 1(b) (gray) and $\mathcal{L} = |\langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle|$ (red) vs training epochs for $L = 200$. Solid lines are initially partitioned circuits and dashed lines are fully random initializations, with increased learning performance in the former. The corresponding evolution of $S$ is shown in the inset.

FIG. 4. (a) A deep circuit ($L = 100$) pre-training procedure that minimizes the collective entanglement $S_C$ (Eq. 21), the entanglement entropy between both the input and output registers of $R_C$ and $R_N$, for $n = 3, 5$ (blue, orange). As random entanglement decreases with $S_C$, $\sigma^2_O$ increases. Crucially, this pre-training procedure is distinct from partitioned initialization, as it permits non-trivial interaction at the level of individual circuit layers (inset), with the magnitude of inter-register interaction remaining $2/\pi \approx 0.637$, consistent with that of random $\theta_i$. (b) Derivative variance $\sigma^2_O$ for $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\rangle$ ($n_C = 2$) and initially partitioned registers $R_C$ and $R_N$ in units of its non-partitioned value $\sigma^2_B$ vs number of register entangling gate layers $L_E$. Here, $L = 200$ and $n = 3, 5, 7, 9$ (blue, orange, green, and red).
Fig. 4(a) shows the growth of $\sigma^2_O$ for $n = 3, 5$ (blue, orange), $n_C = 2$, and $L = 100$. As the entanglement is minimized, $\sigma^2_O$ draws closer to its partitioned value $\sigma^2_{n\approx n_C}$, reducing the initialization problem of the barren plateau from $O(2^{-n})$ to approximately $O(2^{-n_C})$ as the initialization $S_C$ of the circuit is minimized through training. This comparison is approximate because the pre-training method, while quite general, implies some inherent ordering that may distinguish it from a bipartition of 2-designs. As $S_C \to 0$ and $R_C$ and $R_N$ become factorized, $\sigma^2_O$ approaches its partitioned value.
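The spirit of entanglement-minimizing pre-training can be conveyed with a deliberately tiny stand-in: a one-parameter two-qubit ansatz $\cos\theta|00\rangle + \sin\theta|11\rangle$ whose bipartite entropy is driven toward zero by gradient descent on $S$ itself. This toy model and all its constants are our own illustrative assumptions, not the paper's $L = 100$ procedure:

```python
import numpy as np

def entropy(theta):
    """Entanglement entropy of cos(t)|00> + sin(t)|11> (toy 2-qubit ansatz)."""
    p = np.cos(theta) ** 2  # Schmidt weight of the |00> branch
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Pre-training: gradient descent directly on S, driving the state toward a
# product (factorized) configuration before any cost-function training.
theta = 0.6  # initial angle: substantially entangled
for _ in range(200):
    eps = 1e-5
    g = (entropy(theta + eps) - entropy(theta - eps)) / (2 * eps)
    theta -= 0.05 * g

s_init, s_final = entropy(0.6), entropy(theta)
print(s_init, s_final)
```

In the paper's setting the minimized quantity is the collective entanglement $S_C$ of a deep circuit rather than a single Schmidt angle, but the mechanism is the same: the pre-training loss is the entropy itself, and its minimization factorizes the registers.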

FIG. 5. Loss function vs epochs for entanglement regularized learning of (a) the ground state compressor $\mathcal{L} = \mathcal{L}_g$ (Eq. 9), (b) $\mathcal{L} = |\langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle|$, and (c) $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle$ with $L = 200$. In all cases, non-zero regularization terms lead to equal (c) or faster and more accurate (a and b) learning with decreased $S$, mitigating the effects of barren plateaus. Moreover, the behavior of $S$ reflects the overall training difficulty (insets): whereas the cost functions of (a) and (c) can be learned rather rapidly and accurately, and do so with naturally lower levels of bipartite entanglement $S$ that are more responsive to regularization, that of (b) trains more slowly and with less accuracy, with entanglement growth that is controlled very little by regularization. The black dashed line in (c) corresponds to $S$ for the unregularized, unpartitioned $\mathcal{L} = \langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle$, which learns as effectively as its partitioned and regularized counterparts, highlighting the increased learnability of eigenstate learning.

FIG. 6. Preparing a state $|\psi\rangle$ under barren plateau conditions (random initialization of $L = 200$) for $n = 9$ and cost function $\mathcal{L} = |\langle\sigma^z_1\sigma^z_2\sigma^z_3\rangle|$ through both (a) the addition of Langevin noise on a subset of parameters and (b) substitution of a measurement basis for which the target state is an eigenstate. (a) An additional Langevin noise term $\lambda\sum_i^N|\phi_i|$ added to $\mathcal{L}$ increases the gradient variance $\sigma^2_{O_i} \to (1 + \lambda N\pi)^2\sigma^2_{O_i}$ with respect to the parameters $\phi_i$, thus helping to navigate barren plateau landscapes. This can be viewed as an increase in the diffusion constant $2D$. (b) By substituting the cost function $\langle\sigma^z_1\sigma^z_2\sigma^x_3\rangle$, for which our target state $|\psi\rangle$ is an eigenstate (or alternatively, choosing a "natural" cost function basis), we obtain a faster, more accurate learning process.

ACKNOWLEDGMENTS
S.F.Y. and T.L.P. would like to thank the AFOSR and the NSF for funding through the CUA-PFC grant. T.L.P. acknowledges that this material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1745303. X.G. is supported by the Postdoctoral Fellowship in Quantum Science of the Harvard-MPQ Center for Quantum Optics, the Templeton Religion Trust grant TRT 0159, and by the Army Research Office under Grant W911NF1910302 and MURI Grant W911NF-20-1-0082.
Here $|\psi_{\mathrm{out}}\rangle = U|\psi_{\mathrm{in}}\rangle$ is the output of the quantum circuit $U$ with input state $|\psi_{\mathrm{in}}\rangle$. Unless otherwise specified, we take $|\psi_{\mathrm{in}}\rangle = |0\rangle$. The classical learning algorithm then minimizes $\mathcal{L}$ by updating the parameters $\theta_i$ through use of the partial derivatives.