Reinforcement Learning for Many-Body Ground-State Preparation Inspired by Counterdiabatic Driving

The quantum alternating operator ansatz (QAOA) is a prominent example of variational quantum algorithms. We propose a generalized QAOA called CD-QAOA, which is inspired by the counterdiabatic driving procedure, designed for quantum many-body systems and optimized using a reinforcement learning (RL) approach. The resulting hybrid control algorithm proves versatile in preparing the ground state of quantum-chaotic many-body spin chains by minimizing the energy. We show that using terms occurring in the adiabatic gauge potential as generators of additional control unitaries, it is possible to achieve fast high-fidelity many-body control away from the adiabatic regime. While each unitary retains the conventional QAOA-intrinsic continuous control degree of freedom such as the time duration, we consider the order of the multiple available unitaries appearing in the control sequence as an additional discrete optimization problem. Endowing the policy gradient algorithm with an autoregressive deep learning architecture to capture causality, we train the RL agent to construct optimal sequences of unitaries. The algorithm has no access to the quantum state, and we find that the protocol learned on small systems may generalize to larger systems. By scanning a range of protocol durations, we present numerical evidence for a finite quantum speed limit in the nonintegrable mixed-field spin-1/2 Ising and Lipkin-Meshkov-Glick models, and for the suitability to prepare ground states of the spin-1 Heisenberg chain in the long-range and topologically ordered parameter regimes. This work paves the way to incorporate recent success from deep learning for the purpose of quantum many-body control.


I. INTRODUCTION
The ability to prepare a quantum many-body system in its ground state is an important milestone in the quest for understanding and identifying novel collective quantum phenomena. The degree to which ground states can be confidently prepared in present-day quantum simulators delineates the limits of our capabilities to investigate the properties of new materials or molecules, and to propose innovative technological applications based on quantum effects, such as high-temperature superconductors and superfluids, magnetic field sensors, topological quantum computers, or synthetic molecules.
Quantum simulators, such as ultracold and Rydberg atoms [1, 2], trapped ions [3-6], nitrogen vacancy centers [7][8][9], and superconducting qubits [6,10], all require the development of state preparation schemes via real-time dynamical processes. Despite their high level of controllability, finding short protocols to prepare strongly-correlated ground states under platform-specific constraints is a challenging problem in AMO-based quantum simulation platforms, due to the exponentially large Hilbert space dimensions of quantum many-body systems. Against this background, speed-efficient protocols also become progressively more important for near-term quantum computing devices [11], where simulation errors grow with the protocol duration due to imperfections in the implementation of the basic gate operations.
Developing versatile methods for ground state preparation will enable quantum simulators to investigate hitherto unexplored quantum phases of matter, and determine the behavior of order parameters, correlation lengths, and critical exponents. Theoretically, although an exact mathematical expression for the ground state might be known in some models, it still remains largely unclear how to prepare it via a unitary dynamical process. In generic models, the lack of closed-form analytical solutions motivates the use of numerical algorithms. Prominent examples for quantum state preparation include established quantum control algorithms, such as GRAPE [12] and CRAB [13], and variational quantum eigensolvers (VQE) [14], such as the quantum approximate optimization ansatz (QAOA) [15].
In this study, we present a novel hybrid reinforcement learning (RL)/optimal control algorithm based on an autoregressive deep learning architecture. We improve the current state of the art for digital quantum control techniques by enhancing the capabilities to find optimal protocols that prepare the ground state of quantum many-body systems. The emerging versatile algorithm combines discrete and continuous control parameters to achieve maximum flexibility in its applicability to a number of different models.

[arXiv:2010.03655v2 [quant-ph] 3 Oct 2021]
To cope with the complexity of preparing ordered states in quantum many-body systems, we introduce a novel ansatz inspired by variational gauge potentials and counter-diabatic (CD) driving [16][17][18][19]. This allows us to excite the system away from equilibrium in a controllable manner, to find short high-fidelity protocols away from the adiabatic regime. We demonstrate that combining features of CD driving with the digital simulation character of conventional QAOA yields superior performance over a wide range of protocol durations and physical models. Compared to standard counter-diabatic driving algorithms, CD-QAOA represents a more flexible ansatz which allows us to take into account (i) experimental constraints, such as drift terms that cannot be switched off, and (ii) control degrees of freedom not present in CD driving. Moreover, (iii) CD-QAOA is not tied to a drive protocol obeying specific boundary conditions (such as a vanishing protocol speed). Unlike continuous CD driving, CD-QAOA offers a simple and easy-to-apply variational ansatz without reference to the exact ground state of the system, paving the way for versatile digital quantum control.
In particular, our RL agent constructs unitary protocols that transfer the population into the ground state of three nonintegrable spin models (spin-1/2 and spin-1 mixed-field Ising chains, and the anisotropic spin-1 Heisenberg chain), which feature long-range and topological order, and the integrable Lipkin-Meshkov-Glick (LMG) model, which allows us to present simulations for a large number of particles. We show numerical evidence for the existence of a finite quantum speed limit in the nonintegrable mixed-field spin-1/2 Ising model: an almost perfect system-size scaling indicates that this behavior persists in the thermodynamic limit. Our RL agent has no access to unmeasurable quantum states, which grow exponentially with the number of degrees of freedom in the system: this allows the protocols we find to generalize across a number of system sizes [for the spin-1/2 mixed-field Ising model], opening up the door to apply ideas of transfer learning to quantum many-body control. Finally, we demonstrate that the CD-QAOA ansatz has direct practical implications in digital quantum control: it leads to much shorter circuit depths while simultaneously improving the fidelity of the prepared state, which can be utilized to reduce detrimental errors in modern quantum computers.

II. GENERALIZED CONTINUOUS-DISCRETE QUANTUM APPROXIMATE OPTIMIZATION ANSATZ
To prepare many-body quantum states, we seek a unitary process U which brings the system from a given initial state |ψ_i⟩ to the ground state |ψ_GS⟩ of the Hamiltonian H (which we call the target state |ψ∗⟩). Typically, Hamiltonians can be decomposed as a sum of two non-commuting parts, H = H_1 + H_2, e.g. the kinetic and interaction energy. We want to construct U from a sequence τ of q consecutive unitaries, whose generators τ_j are chosen from a set A, with τ_j ≠ τ_{j+1}. Each unitary U(α_j, τ_j) is parametrized by a continuous degree of freedom α_j (e.g. time or rotation angle), i.e. U(α_j, τ_j) = exp(−iα_j τ_j). We formulate state preparation as an optimization problem which consists of determining (i) the sequence τ, and (ii) the values of the variational parameters α_j, such that U|ψ_i⟩ ≈ |ψ_GS⟩. Our goal is to prepare the ground state of a Hamiltonian H without having access to the ground state itself. Therefore, we use the energy, or the energy density E/N, as a cost function; the latter has a well-behaved limit when increasing the number of particles N [20]. We denote the ground-state energy by E_GS = ⟨ψ_GS|H|ψ_GS⟩.
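As a minimal numerical sketch of this ansatz (using an illustrative two-spin Hamiltonian with unit couplings, not one of the models studied below), one can apply a sequence of unitaries U(α_j, τ_j) = exp(−iα_j τ_j) to an initial state and evaluate the energy cost function:

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices; a toy two-spin example, not the paper's 18-spin chain.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)

def kron(*ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Illustrative H = H1 + H2: H1 = Z|Z + Z (interaction + longitudinal field), H2 = X.
H1 = kron(Z, Z) + kron(Z, I2) + kron(I2, Z)
H2 = kron(X, I2) + kron(I2, X)
H = H1 + H2

def apply_sequence(psi, generators, alphas):
    """Apply U = U(alpha_q, tau_q) ... U(alpha_1, tau_1) with U(a, tau) = exp(-i a tau)."""
    for G, a in zip(generators, alphas):
        psi = expm(-1j * a * G) @ psi
    return psi

def energy(psi, H):
    """Cost function <psi|H|psi> (real for Hermitian H)."""
    return np.real(np.conj(psi) @ (H @ psi))

# z-polarized initial product state |up, up>; alphas chosen by hand here.
psi_i = np.zeros(4, dtype=complex); psi_i[0] = 1.0
psi_f = apply_sequence(psi_i, [H1, H2, H1], [0.3, 0.5, 0.2])
E_GS = np.linalg.eigvalsh(H)[0]
print(energy(psi_f, H) >= E_GS - 1e-9)  # True: variational energy is bounded below by E_GS
```

In the paper, both the order of the generators and the durations α_j are optimized; here they are fixed by hand purely to illustrate the structure of the cost evaluation.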
Note that conventional QAOA is recovered as a special case where one only considers two unitaries U_j = U(α_j, H_j) = exp(−iα_j H_j), j = 1, 2, and τ is one of the two alternating sequences. Whenever nested commutators of the H_j span the entire Lie algebra which generates transport on the complex projective space associated with the Hilbert space H of the system, applying QAOA is already enough to prepare any state, provided that the underlying circuit depth q is sufficiently large and the optimal α_j can be found [21]. While true in theory, this is often impractical, since (i) it requires access to in principle unbounded durations, (ii) it increases the number of optimization parameters α_j, and, with it, the probability to get stuck in a local minimum of the control landscape, and (iii) the condition that nested commutators of the H_j span the entire Lie algebra is generally not satisfied for the H_j of interest in quantum many-body physics due to, e.g., symmetry constraints.
Our generalized ansatz [Eq. (1)] allows us to utilize a larger set of unitaries A to construct the optimal sequence, and to reduce the circuit depth q. Inspired by counter-diabatic (CD) driving, we find that a particularly suitable choice in the context of quantum many-body state manipulation is given by the operators in the adiabatic gauge potential series [Sec. III]. Therefore, we call the resulting algorithm CD-QAOA. A different ansatz using more than two unitaries was considered in Ref. [22].
Compared to conventional QAOA, CD-QAOA introduces a discrete high-level optimization to find the optimal protocol sequence τ . The combined optimization landscape can be particularly difficult to navigate, due to the existence of so-called barren plateaus where exponentially many directions have vanishing gradients [23][24][25][26].
Additionally, the total number of all allowed protocol sequences, |A|(|A| − 1)^{q−1} [27], scales exponentially with the number of unitaries q, and presents a challenging discrete combinatorial optimization problem per se; indeed, state preparation, formulated as optimization, can feature a glassy landscape [28,29] [App. D]. However, overcoming these potential difficulties is associated with a potential gain: CD-QAOA retains the flexibility offered by continuous optimization, while increasing the number of independent discrete control degrees of freedom to |A|; this enables us to reach larger parts of the Hilbert space in shorter durations, and with a smaller circuit depth, as compared to conventional QAOA.
Thus, we formulate ground state preparation as a two-level optimization scheme [30]. (1) Low-level optimization: given a fixed sequence τ , we find the optimal values of α j using a continuous optimization solver, e.g. SLSQP [31] [App. B]. To cope with the associated rugged optimization landscape [App. D], we run multiple realizations of random initial conditions and post-select the values which yield minimum energy. This continuous optimization problem is also present in conventional QAOA. (2) High-level optimization: in addition to the low-level optimization, we also perform a discrete optimization for the sequence τ itself, to determine the optimal order in which unitaries from the set A should occur. To tackle this combinatorial problem, we formulate the high-level optimization as a reinforcement learning (RL) problem. We learn the optimal protocol using Proximal Policy Optimization, a variant of policy gradient. The policy is parameterized by a deep autoregressive network, which allows choosing the control unitaries U (α j , τ j ) sequentially. In practice, we sample a batch of sequences from the policy, evaluate the energy of each sequence in the low-level optimization, and apply policy gradient to update the parameters of the policy. This two-level optimization procedure is repeated in a number of training episodes until convergence [App. A].
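The two-level scheme can be sketched on a toy single-qubit problem (our construction, for illustration only): the PPO-trained autoregressive policy of the high-level optimization is replaced here by exhaustive search over the small sequence space, while SLSQP with random restarts handles the low-level continuous optimization under a fixed-total-duration constraint on the α_j:

```python
import itertools
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

rng = np.random.default_rng(0)

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Toy problem: prepare the ground state of H = Z + X starting from |up>.
H = Z + X
A = {"X": X, "Z": Z, "Y": Y}                  # action set (generators)
psi_i = np.array([1.0, 0.0], dtype=complex)
E_GS = np.linalg.eigvalsh(H)[0]               # = -sqrt(2)
T, q = 2.0, 3                                 # total duration, circuit depth

def evolve(alphas, seq):
    psi = psi_i
    for a, name in zip(alphas, seq):
        psi = expm(-1j * a * A[name]) @ psi
    return psi

def seq_energy(alphas, seq):
    psi = evolve(alphas, seq)
    return float(np.real(np.conj(psi) @ (H @ psi)))

def inner_opt(seq):
    """Low-level continuous optimization: minimize energy over alphas, sum(alphas) = T."""
    cons = ({"type": "eq", "fun": lambda a: np.sum(a) - T},)
    bnds = [(0.0, T)] * q
    best = np.inf
    for _ in range(10):  # random restarts against local minima
        a0 = rng.dirichlet(np.ones(q)) * T
        res = minimize(seq_energy, a0, args=(seq,), method="SLSQP",
                       bounds=bnds, constraints=cons)
        best = min(best, res.fun)
    return best

# High-level discrete optimization: here an exhaustive search over sequences with
# no two equal consecutive generators (the paper trains a PPO autoregressive policy).
sequences = [s for s in itertools.product(A, repeat=q)
             if all(s[i] != s[i + 1] for i in range(q - 1))]
E_best = min(inner_opt(s) for s in sequences)
print(round(E_best, 4), round(E_GS, 4))
```

For this toy target, a sequence such as (Z, Y, Z), with the final rotation carrying zero duration, already reaches the ground state; at larger depths and action sets, exhaustive search becomes infeasible, which is where the RL policy comes in.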

III. VARIATIONAL STATE PREPARATION INSPIRED BY COUNTER-DIABATIC DRIVING
A natural question arises as to how to choose the set A of unitaries for the generalized discrete-continuous QAOA ansatz. One possibility is to consider a set of universal elementary quantum gates, e.g., in the context of a quantum computer [32,33], and in this case α j are angles of rotation. We leave this exciting possibility for a future study, and focus here on many-body ground state preparation instead.
The complexity of many-body systems motivates the use of a physics-informed approach to defining the control unitaries in A. Suppose we initialize the system in the ground state of the parent Hamiltonian H(λ = 0); we target the ground state of H(λ = 1), seeking the functional form of a time-dependent protocol λ(t). If the instantaneous ground state of H(λ) remains gapped during the evolution, the adiabatic theorem guarantees the existence of a solution λ(t), t ∈ [0, T ], provided T is large compared to the smallest inverse gap along the adiabatic trajectory. However, when the gap is known to close (e.g. across a phase transition), or when the state population transfer has to be done fast, adiabatic state preparation fails.
Compared to the adiabatic paradigm, gauge potentials provide additional control directions in Hilbert space which enable paths that non-adiabatically lead to the target state in a short time. In many-body systems, it is not known in general how to determine the exact gauge potential required for CD driving. However, it is possible to define variational approximations [34,35] using an operator-valued series expansion [App. E] similar to a Schrieffer-Wolff transformation [36], or Shortcuts to Adiabaticity methods [33,37]. Nonetheless, recent numerical simulations suggest that the exact gauge potential in generic many-body systems is a non-local operator [34,38] which renders the series expansion asymptotic.
For these reasons, here we consider the constituent terms to every order of the variational gauge potential series, H_j, independently, and use them to generate the set of unitaries A = {e^{−iα_j H_j}} for CD-QAOA [39]. We emphasize that our CD-QAOA ansatz is not designed to approximate the gauge potential itself, as opposed to Ref. [40], yet it yields similar benefits w.r.t. preparing the target state. In Sec. V, we compare our approach directly with the variational gauge potential ansatz from Ref. [34].
Since CD-QAOA is a generalization of QAOA aimed to be useful in practice, we need to ensure the accessibility of the control terms H_j. Because they appear in the first few orders of the gauge potential series, the H_j are (sums of) local many-body operators [cf. App. E]. Thus, in principle, there is no physical obstruction to emulating them in the lab, although this depends on the details of the experimental platform (especially for the interaction terms). Additionally, in the context of many-body systems where energy is extensive, in order to guarantee that we do not tap into a source of infinite energy, we constrain the norm of the generators α_j H_j: we view α_j ≥ 0 as time durations, and fix Σ_{j=1}^q α_j = T, with T the total protocol duration. This keeps the α_j on the same order of magnitude as the coupling constants in the parent Hamiltonian whose ground state we want to prepare.

IV. MANY-BODY GROUND STATE PREPARATION
We consider four many-body systems of increasing complexity: the non-integrable spin-1/2 and spin-1 mixed-field Ising models and the spin-1 Heisenberg model, and the integrable Lipkin-Meshkov-Glick (LMG) model, where a large number of degrees of freedom is accessible in a classical simulation. The goal of the RL agent is to prepare their ordered ground states, starting from a product state.

A. Spin-1/2 Ising Model

To generate training data, we numerically simulate the mixed-field spin-1/2 Ising chain H = H_1 + H_2 [Eq. (3)], where H_1 contains the nearest-neighbor S^z_i S^z_{i+1} coupling of strength J together with the longitudinal field h_z, and H_2 the transverse field h_x; here [S^α_i, S^β_j] = iδ_{ij} ε_{αβγ} S^γ_i, and the S^α_i are the spin-1/2 operators. We use periodic boundary conditions and work in the zero-momentum sector of positive parity. In the following, J = 1 sets the energy unit, and h_z/J = 0.809 and h_x/J = 0.9045. We initialize the system in the z-polarized product state |ψ_i⟩ = |↑···↑⟩, and we want to prepare the ground state of H in a short time T, i.e., away from the adiabatic regime. We verified that similar results can be obtained starting from |↓···↓⟩.
To acquire an intuitive understanding of the advantages brought by the gauge potential ansatz, consider first the non-interacting system at J = 0, for which the control problem reduces to a single spin. Both the initial and target states lie in the xz-plane of the Bloch sphere, and hence the shortest unit-fidelity protocol generates a rotation about the y-axis. In conventional QAOA, one would construct a y-rotation out of the X and Z terms [cf. Table I] present in the Hamiltonian. For a single spin, this construction is always possible due to the Euler angle representation of SU(2), but for the interacting spin chain this is no longer the case. The role of the gauge potential Y is to 'unlock' precisely this geodesic in parameter space, and make it accessible as a dynamical process. This allows preparing the target state faster, compared to the original X, Z control setup. In the language of variational optimization, an accessible Y term includes the shortest-distance protocol into the variational manifold, and the RL agent easily finds the exact solution [App. F 1].

FIG. 1. Spin-1/2 Ising model: energy minimization and the corresponding many-body fidelity [inset] against protocol duration T, obtained using conventional QAOA (blue squares) and CD-QAOA (red diamonds) with circuit depths p = q/2 = 2 and q = 3, respectively. The dotted vertical line marks the quantum speed limit T_QSL. CD-QAOA outperforms conventional QAOA. The initial and target states are |ψ_i⟩ = |↑···↑⟩ and |ψ∗⟩ = |ψ_GS(H)⟩ for h_z/J = 0.809 and h_x/J = 0.9045. The alternating unitaries for conventional QAOA are generated by A_QAOA = {Z|Z + Z, X} [cf. Eq. (3)]; for CD-QAOA, we extend this set using adiabatic gauge potential terms to A_CD-QAOA = {Z|Z + Z, X; Y, X|Y, Y|Z}. The cardinality of the CD-QAOA sequence space is |A|(|A| − 1)^{q−1} = 80. The number of spins is N = 18, with a Hilbert space size of dim(H) = 7685.

For the interacting system, J > 0, applying conventional QAOA using the two gates U_j = e^{−iα_j H_j} with H_1 = Z|Z + Z and H_2 = X is straightforward, but it does not yield a high-fidelity protocol [Fig. 1 (blue squares)]. It was recently reported that much better energies can be obtained using a three-step QAOA, which consists of the three terms in the Hamiltonian (3), Z|Z, X, and Z, applied in a fixed order [41]; invoking again an Euler angle argument provides an explanation: the X and Z terms effectively generate the Y gauge potential term.
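The Euler-angle argument can be checked directly for a single spin: a y-rotation is exactly reproducible by sandwiching an x-rotation between two z-rotations, which is why the X and Z terms can mimic the Y gauge potential term, at the cost of extra pulses:

```python
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

theta = 0.7  # arbitrary rotation angle
Ry_direct = expm(-1j * theta / 2 * Y)            # one gate, if Y is available
Ry_euler = (expm(-1j * np.pi / 4 * Z)            # three gates from X and Z only:
            @ expm(-1j * theta / 2 * X)          # R_y(theta) = R_z(pi/2) R_x(theta) R_z(-pi/2)
            @ expm(+1j * np.pi / 4 * Z))
print(np.allclose(Ry_direct, Ry_euler))  # True
```

The identity follows from e^{−i(π/4)Z} X e^{+i(π/4)Z} = Y; for an interacting chain, no such finite decomposition of the many-body gauge potential terms is guaranteed.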
In stark contrast to conventional QAOA, adding just the zero-order term H_3 = Y from the gauge potential series [App. E 3], we find that CD-QAOA already gives a significantly improved protocol; this is achieved by the high-level discrete optimization which selects the order of the operators in the sequence. However, we can do better: since |ψ_i⟩ is a product state while |ψ∗⟩ is not, and because H_3 is a sum of single-particle terms, in order to create the target many-body correlations using a fast dynamical process, we also include the two-body first-order gauge potential terms H_4 = X|Y and H_5 = Y|Z: this results in a nonadiabatic evolution that prepares the interacting ground state to an excellent precision [Fig. 1 (red diamonds)].

In Ref. [42], it was shown that, in the integrable limit h_z = 0, one can prepare the ground state of the system at the critical point using a circuit of depth q = 2N with conventional QAOA. By contrast, albeit for the specific initial and target states chosen, we find that it takes CD-QAOA only a depth of q = 3 to reach the target ground state, independent of the system size N [43]. This result, though model-dependent, may come as a surprise at first sight, given that the mixed-field Ising chain is a quantum-chaotic system without a closed-form solution, which makes it susceptible to heating away from the adiabatic limit.
Our data also reveals a finite many-body QSL at T QSL ≈ 4.5. Importantly, this QSL appears insensitive to the system size to a very good approximation [ Fig. 2], and we expect it to persist in the thermodynamic limit. The absence of a finite QSL in conventional QAOA in the mixed-field Ising chain suggests that the observation of a QSL using CD-QAOA depends on the specific set of unitaries related to the variational gauge potential, showcasing the utility of our ansatz for many-body control. Remarkably, we find an almost perfect system-size collapse of the target state energy density curves as a function of the total protocol duration T . In Sec. VI, we explore this feature and demonstrate the ability of the RL agent to learn on small system sizes and subsequently generalize its knowledge to control bigger systems with exponentially larger Hilbert spaces.
CD-QAOA performs successfully on the nonintegrable spin-1/2 mixed-field Ising chain, for a circuit depth as short as q = 3. This shows an advantage of our ansatz, when compared to conventional QAOA. However, the small size of the sequence space, |A|(|A| − 1)^{q−1} = 80 at |A| = 5, poses a natural question regarding the necessity of using sophisticated search algorithms, such as RL, to find the control sequence. We now show that this is a peculiarity of the physical system, as we turn our attention to a larger sequence space.
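The sequence-space counting |A|(|A| − 1)^{q−1} (a free choice for the initial generator, after which each generator must differ from its predecessor) can be verified by brute force:

```python
import itertools

def num_sequences(n_actions, q):
    """Sequences of length q over n_actions symbols with no immediate repeats."""
    return n_actions * (n_actions - 1) ** (q - 1)

A = range(5)  # |A| = 5 unitaries, as for the spin-1/2 Ising chain
q = 3
brute = sum(1 for s in itertools.product(A, repeat=q)
            if all(s[i] != s[i + 1] for i in range(q - 1)))
print(num_sequences(5, 3), brute)   # 80 80
print(num_sequences(9, 18))         # 9 * 8**17, approximately 2e16
```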

B. Heisenberg Spin-1 Chain
The eight-dimensional spin-1 group SU(3) provides a significantly larger space of gauge potential terms to build the optimal protocol from. We consider a total of |A| = 9 unitaries: five are generated by the imaginary-valued terms in the gauge potential series [cf. Table I], plus the two real-valued QAOA operators H_1 and H_2, which build the Hamiltonian H = H_1 + H_2 whose ground state we target [Eq. (4)], and the two real-valued Hamiltonian terms X|X and Z. At q = 18, this amounts to |A|(|A| − 1)^{q−1} ≈ 10^16 possible sequences. The exponential scaling of the sequence space size with q renders applying exhaustive search algorithms infeasible, and justifies the use of sophisticated algorithms, such as RL.
The (anisotropic) spin-1 Heisenberg model is given by Eq. (4), with the spin-exchange coupling J = 1 set as the energy unit, and ∆ the anisotropy parameter; we use periodic boundary conditions and work in the ground-state sector of zero momentum and positive parity, defined by the projector P. In the thermodynamic limit, this model features a rich ground-state phase diagram including ferromagnetic (FM, ∆/J ≲ −1), XY (−1 ≲ ∆/J ≲ 0), topological/Haldane (0 ≲ ∆/J ≲ 1), and antiferromagnetic (AFM, ∆/J ≳ 1) order [44], with phase transitions belonging to different universality classes [45][46][47]. While the FM, XY, and AFM states are characterized by a local order parameter, the gapped Haldane state has topological order not captured by Landau-Ginzburg theory. We consider the AFM initial state |ψ_i⟩ = P|↑↓↑↓···⟩, and target the ground states of Eq. (4) deep in the FM, XY, and Haldane phases, where system-size effects are the smallest. Because CD-QAOA is not restricted to adiabatic evolution, the conventional paradigm of a closing spectral gap, when transferring the population between two states displaying different order, does not apply in our non-equilibrium setup, even in the thermodynamic limit. Figure 3 shows a comparison between conventional QAOA, with alternating sequence between the Hamiltonians H_1 and H_2, and CD-QAOA.

FIG. 3. Comparison between conventional QAOA and CD-QAOA (solid lines) for three different target states. We start from the AFM state |ψ_i⟩ = P|↑↓↑↓···⟩ and target three different parameter regimes, corresponding to the FM (∆/J = −2.0), XY (∆/J = −0.5), and Haldane (∆/J = 0.5) states, respectively. CD-QAOA outperforms conventional QAOA (p = q/2), more notably for the FM and XY targets, where it allows us to reach close to the target state using a short protocol duration. The empty symbols mark the duration at which we show the evolution of the system.
We find that CD-QAOA shows superior performance for all three ordered ground states: while for the Haldane state the gain over conventional QAOA is a faster protocol, for the FM and XY phases we clearly see how the gauge potential terms can prove essential for reaching the ground state within the available durations. Note that the FM target state is doubly degenerate; minimizing the energy, the system ends up in an arbitrary superposition within the ground-state manifold. Interestingly, we do not identify any distinction between preparing states with long-range order and states with topological order, presumably due to the small system sizes that we reach in our classical simulation.
The CD-QAOA protocol sequences found by the RL agent have peculiar structures [App. F 2]: some of them closely resemble the alternating sequence of conventional QAOA, with the notable difference of applying additional unitaries, either at the beginning or at the end of the sequence, to rotate the state to a suitable basis. While this is formally equivalent to starting from or targeting a rotated state, the rotations use two-body operators; hence, the resulting basis does not coincide with any of the distinguished S^x, S^y, and S^z directions. Variationally determining such effective bases demonstrates yet another advantage offered by the CD-QAOA ansatz. Another kind of sequence we encounter contains two different sets of alternating unitaries, similar to two independent QAOA ansatzes concatenated one after the other. Finally, for those values of T where CD-QAOA and QAOA have the same performance, we have also observed that CD-QAOA finds precisely the QAOA sequence. In this case, conventional QAOA already generates the shortest path, and the extra gauge potential terms up to second order do not give any advantage; a better performance might be expected when the three- and four-body higher-order terms from the gauge potential series are included.
Similar to other optimal control algorithms, RL agents typically find local minima of the optimization landscape; thus, there is no guarantee that the CD-QAOA protocols provide global optimal solutions; however, these sequences can serve as an inspiration to build future variational ansatzes tailored for many-body systems.

C. Lipkin-Meshkov-Glick Model
The non-integrable character of the previously discussed models precludes us from applying CD-QAOA with a large number of degrees of freedom, since reliably simulating their dynamics on a classical computer is prohibitively expensive. In order to study the behavior of CD-QAOA in a large enough system which also features a quantum phase transition, we now turn our attention to an exactly solvable many-body system.
The Lipkin-Meshkov-Glick (LMG) Hamiltonian [48] describes spin-1/2 particles on a fully-connected graph of N sites [Eq. (5)], where J is the uniform interaction strength and h the external magnetic field. In the thermodynamic limit, N → ∞, the system undergoes a quantum phase transition at h_c/J = 1 between a ferromagnetic (FM) ground state in the x-direction for h/J < 1, and a paramagnetic ground state for h/J > 1. The spectral gap ∆_LMG between the ground state and the excited states closes as ∆_LMG(h_c) ∼ N^{−1/3} at the critical point [49]. Realizing the LMG model is within the scope of present-day experiments with ultracold atoms [50, 51]; therefore, developing fast ground-state preparation techniques can prove useful in practice.

FIG. 4. Overlap between the initial state and the ground state of the LMG model as a function of h/J; the overlap is vanishingly small in the FM phase, which motivates the parameter choice for the target state. In the vicinity of the critical point, the overlap increases and approaches unity in the limit h/J → ∞. Note that, in the FM phase, the ground state is doubly degenerate, in which case the overlap is computed w.r.t. the ground-state manifold. In the paramagnetic phase, the ground state is unique. We used N = 501 spins.
Defining the total spin operators as S^α = Σ_{j=1}^N S^α_j, the Hamiltonian takes the form H = −(J/N)(S^x)² + h(S^z + N/2). Hence, the total spin is conserved, [H, S·S] = 0, and the ground-state symmetry sector contains a total of N + 1 states, i.e. dim(H) = N + 1, which allows us to simulate large system sizes.
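Because dim(H) = N + 1 in the maximal total-spin sector, the LMG model is cheap to diagonalize exactly. A sketch (our construction; N smaller than the N = 501 used in the text, parameter values illustrative) building H in the S^z eigenbasis and reproducing the overlap behavior across the transition:

```python
import numpy as np

def lmg_hamiltonian(N, h, J=1.0):
    """LMG Hamiltonian in the maximal-spin sector S = N/2 (dimension N + 1):
    H = -(J/N) (S^x)^2 + h (S^z + N/2)."""
    S = N / 2
    m = np.arange(-S, S + 1)                            # S^z eigenvalues
    Sz = np.diag(m)
    sp = np.sqrt(S * (S + 1) - m[:-1] * (m[:-1] + 1))   # <m+1|S^+|m>
    Sp = np.diag(sp, k=-1)                              # raising operator
    Sx = (Sp + Sp.T) / 2
    return -(J / N) * (Sx @ Sx) + h * (Sz + (N / 2) * np.eye(N + 1))

def overlap_with_ground_manifold(N, h, n_deg):
    """|<psi_i|GS>|^2 summed over the lowest n_deg states, psi_i = |down...down> = |m=-S>."""
    w, v = np.linalg.eigh(lmg_hamiltonian(N, h))
    return sum(abs(v[0, k]) ** 2 for k in range(n_deg))

N = 100
ov_fm = overlap_with_ground_manifold(N, h=0.5, n_deg=2)  # FM: doubly (quasi-)degenerate
ov_pm = overlap_with_ground_manifold(N, h=2.0, n_deg=1)  # paramagnet: unique ground state
print(ov_fm < 0.1 and ov_pm > 0.5)  # True: overlap is tiny in the FM phase, large in the PM phase
```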
Our goal is, starting from the z-polarized paramagnetic initial state |ψ_i⟩ = |↓↓···⟩, to target an arbitrary superposition in the doubly-degenerate FM ground-state manifold, at fixed values of the external field h/J, which controls the magnitude of the transversal fluctuations on top of the ferromagnetic order. Figure 4 shows that the overlap of the initial and target states is vanishingly small in the FM phase, and quickly approaches unity across the critical point into the paramagnetic phase. Therefore, we choose to prepare ground states in the FM phase, where the problem naturally appears more difficult. Figure 5 shows a comparison between CD-QAOA and QAOA on the LMG model at h/J = 0.5 for N = 501 spins [more h/J values are shown in App. F 3]. First, note the superior performance of CD-QAOA, as compared to conventional QAOA, in a range of short durations T in the nonadiabatic driving regime. We applied CD-QAOA with two different sets of generators, A = {H_1, H_2; Y} and A = {H_1, H_2; Y, XY, ZY} [cf. Table I], and found that, for the LMG model, the higher-order two-body terms XY, ZY do not offer any advantage deep in the FM phase. This observation can be understood as follows: to turn the z-polarized initial state into the x-ferromagnet, it is sufficient to perform a rotation about the y-axis, which coincides precisely with the single-body term in the gauge potential series expansion [cf. App. E 1 c]. Indeed, for all protocol durations smaller than the quantum speed limit, T < T_QSL, the RL agent finds that the optimal protocol consists of a single Y-rotation, while for T ≥ T_QSL the optimal protocol is degenerate, and typically involves the various terms from A. This finding allows us to extract the QSL as a function of the external field h, cf. Fig. 6.
Close to the critical point h_c, we observe a strong sensitivity of the best found protocols to system-size effects, and a single Y-rotation is no longer optimal below the QSL. Interestingly, at the critical point (and in the paramagnetic phase), the optimal protocol is given by QAOA: in this regime, despite the larger set of terms A we use in CD-QAOA, the RL agent correctly identifies the sequence of alternating H_1 and H_2 terms as optimal. This shows the versatility of CD-QAOA: the algorithm can always select a smaller effective subspace of actions when this is advantageous in the parameter regime of interest.

V. COMPARISON WITH COUNTER-DIABATIC DRIVING
To compare and contrast the CD-QAOA ansatz with CD and adiabatic driving [34], consider the driven spin-1 Ising model [52] of Eq. (6), where λ(t) = sin²(πt/2T), t ∈ [0, T], is a smooth protocol satisfying the boundary conditions for CD driving: λ(0) = 0, λ(T) = 1, λ̇(0) = 0 = λ̇(T). The initial state is the ground state at t = 0, i.e. |ψ_i⟩ = |↓···↓⟩, while the target state is the ground state of the Ising model at t = T for h_z/J = 0.809 and h_x/J = 0.9045. Unlike the setup in Sec. IV A, adiabatic state preparation following the protocol λ(t) suggests using the corresponding QAOA generators.

Figure 7 shows a comparison between different methods using the best found energy density (main figure), and the corresponding many-body fidelity (inset). Let us focus on CD-QAOA and QAOA first. As expected, CD-QAOA (red) performs better for short durations T, since it contains conventional QAOA as an ansatz, i.e. A_QAOA ⊂ A_CD-QAOA. We emphasize that such a performance is not guaranteed in practice, since it is conceivable that the RL agent gets stuck in a local minimum associated with a higher energy than the QAOA solution [App. D], e.g., if the deep autoregressive network architecture is not expressive enough, or if the learning rate schedules are not well-tuned to the problem. Unlike the spin-1/2 Ising model, here we cannot clearly identify a finite QSL, as the CD-QAOA energy keeps improving with increasing T.

To construct the counter-diabatic Hamiltonian H_CD ≈ H(λ) + λ̇ X({β_j}) for Eq. (6), we make a variational ansatz [34] for the gauge potential X, and solve for the optimal parameters β_j numerically [App. E]. We note the following differences between this approach and CD-QAOA: (i) the variational gauge potential depends on time t continuously, which requires further discretization when performing a gate-based implementation; (ii) the number of variational parameters in the standard variational gauge potential method is N_T|A|, with N_T the number of steps used to discretize the time interval [0, T]; instead, in CD-QAOA, we have q variational parameters.
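The stated boundary conditions of the CD driving protocol can be verified directly from λ(t) = sin²(πt/2T) and its analytic derivative λ̇(t) = (π/2T) sin(πt/T) (the value of T is arbitrary):

```python
import numpy as np

T = 2.0  # illustrative protocol duration
lam = lambda t: np.sin(np.pi * t / (2 * T)) ** 2           # lambda(t)
dlam = lambda t: (np.pi / (2 * T)) * np.sin(np.pi * t / T)  # d lambda / dt

# Boundary conditions required by CD driving:
print(np.isclose(lam(0), 0), np.isclose(lam(T), 1),
      np.isclose(dlam(0), 0), np.isclose(dlam(T), 0))  # True True True True
```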
(iii) The variational gauge potential method does not constrain the magnitude of the variational coefficients β_j, and hence the time-averaged norm of H_CD over the protocol can grow indefinitely; especially for short durations T, this typically gives a higher fidelity. By contrast, in CD-QAOA, the time-averaged norm of the unitary generators α_j H_j, summed along the sequence, is kept bounded via the constraint Σ_j α_j = T. Nonetheless, in practice, we find that these norms are of the same order of magnitude for all methods considered [App. F 4]. As anticipated, Fig. 7 shows that CD driving performs better than adiabatic driving, and the two agree in the limit of large T. Moreover, we see explicitly that the CD and QAOA solutions are far from the adiabatic regime. Not surprisingly, CD driving outperforms conventional QAOA for small T, as it can increase the values of the variational parameters (and with them the norm) indefinitely. However, CD-QAOA consistently outperforms CD driving over the entire T-range; the contrast is especially pronounced in the many-body fidelity [Fig. 7, inset]. CD-QAOA makes use of the variational power of QAOA, combining it with physics-motivated input from CD driving. Table II shows a comparison of the best obtained energies for N = 10 spin-1 particles (qutrits): the superior performance of CD-QAOA persists despite the exponentially growing Hilbert-space size. Reaching significantly larger system sizes is infeasible with present-day computational power; we note that this is a feature of the quantum system rather than a drawback of CD-QAOA, cf. the discussion of the LMG model in Sec. IV C.
We emphasize that CD-QAOA features some important advantages as compared to CD driving: (1) Due to the nested commutators in the definition of time-ordered exponentials, the QAOA dynamics can effectively implement total unitaries U({α_j}_{j=1}^q, τ) generated by effective nonlocal operators; therefore, CD-QAOA can, in principle, realize a nonlocal effective Hamiltonian as an approximation to the true CD Hamiltonian, thereby overcoming convergence issues related to operator-valued series expansions.
(2) CD-QAOA lifts the boundary constraint present in adiabatic and CD driving, where the initial and target states are eigenstates of H(0) and H(1), respectively; an interesting open question is whether a local effective Hamiltonian exists which captures the evolution of the system in this case. Examining the evolution of the entanglement entropy and other local observables induced by the optimal protocol suggests that this is indeed the case [App. F 4]. (3) One can add any control unitary to the set A, not just terms related to gauge potentials: CD-QAOA has high flexibility to accommodate experimental constraints. (4) Determining the variational gauge potential in CD driving requires using the exact ground state in order to minimize the action, which can be a significant drawback when the ground state is not known or cannot be computed.

VI. TRANSFER LEARNING AND GENERALIZATION OF THE RL ALGORITHM TO DIFFERENT SYSTEM SIZES
The scale collapse in the energy density of the spin-1/2 Ising model presents a testbed for the transfer-learning capabilities of RL. In transfer learning, the RL agent learns to control one physical system, and is then used to manipulate another. In our case, the two systems are given by the same Ising model at two different system sizes. Note that transfer learning would not have been possible, had we defined the learning problem using the full quantum states, because the latter are vectors in Hilbert space whose size grows exponentially with N.
To apply transfer learning, consider first a fixed protocol duration T . For every fixed system size N , we first train a different RL agent. Next, we build the set of protocols across all system sizes, found by these agents, and determine the number of unique protocols [cf. legend in Fig. 8]. Finally, we apply all unique protocols to all system sizes available, and store the energy densities they result in. This leaves us with a set of energy density values for every fixed T . The error bars in Fig. 8 show the best and the worst protocols over this set. Observe that, below the QSL, there are only a few points T where the best control protocol is the same across all system sizes. Transfer learning works well, as can be seen by the small error bars. In this regime, the RL agent generalizes its knowledge and learns universal features of the protocol, required to control the Ising model. In contrast, for T > T QSL , there are many more protocols giving approximately similar ground state energies. While the corresponding energies are similar in value, the agent does not generalize. Nevertheless, we checked that, in this regime, training on smaller system sizes still provides a useful pre-training procedure for learning on larger systems.
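The cross-size evaluation described above can be summarized in a short sketch; here `train_agent` and `energy_density` are hypothetical placeholders for the RL training loop and the quantum simulation, not functions from the actual code:

```python
# Sketch of the cross-size protocol evaluation used for transfer learning.
# `train_agent(N, T)` returns the protocol (action sequence) learned at size N;
# `energy_density(p, N, T)` evaluates a protocol p on a system of size N.
def evaluate_transfer(system_sizes, T, train_agent, energy_density):
    """For a fixed duration T: train one agent per size, pool the unique
    protocols found, then apply every unique protocol to every size."""
    protocols = {N: train_agent(N, T) for N in system_sizes}
    unique = {tuple(p) for p in protocols.values()}   # deduplicate sequences
    results = {}
    for N in system_sizes:
        energies = [energy_density(p, N, T) for p in unique]
        results[N] = (min(energies), max(energies))   # best/worst -> error bars
    return len(unique), results
```

The spread between the best and worst entries per system size is what the error bars in Fig. 8 report.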

VII. DISCUSSION/OUTLOOK
We analyzed many-body ground-state preparation using unitary evolution in the spin-1/2 Ising model, the spin-1 anisotropic Heisenberg and Ising models, and the fully connected LMG spin-1/2 model. We introduced the CD-QAOA ansatz: an RL agent optimizes the order of unitaries in the protocol sequence, generated from terms in the adiabatic gauge potential series, and obtains short high-fidelity protocols away from the adiabatic regime. The resulting algorithm combines the strengths of continuous and discrete optimization into a unified and versatile control framework. We find that our CD-QAOA ansatz consistently outperforms both conventional QAOA and variational CD driving across different models and protocol durations. An interesting open question is whether one can use CD-QAOA to find a nonlocal approximation to the variational gauge potential itself, beyond the scope of asymptotic series expansions. Another straightforward application of CD-QAOA would be imaginary-time evolution [53].
For the nonintegrable spin-1/2 Ising chain, we reveal the existence of a finite quantum speed limit. Moreover, we find a remarkable system-size collapse of the energy curves suggesting that the sequences found by the agent hold in the thermodynamic limit; this is corroborated by numerical experiments on transfer learning which demonstrate that one can train the agent on one system size while it generalizes to larger systems. In the Heisenberg spin-1 system, CD-QAOA allows preparing long-range and topologically ordered ground states, even when the initial state does not belong to the phase of the target state. The optimal protocols found by the RL agent contain nontrivial basis rotations, intertwined with alternating QAOA-like subsequences, suggesting new ansätze for more efficient variants of CD-QAOA. Numerical studies of nonequilibrium quantum many-body systems, in turn, suffer from limitations related to the exponentially large dimension of the underlying Hilbert space: future work can investigate dynamics beyond exact evolution.
Compared to conventional QAOA, using terms from the variational gauge potential series has higher expressivity, which results in much shorter, yet better performing, circuits. This method can be used, e.g., to reduce the cumulative error in quantum computing devices. However, gauge potential terms are not always easy to realize in experiments since they implement imaginary-valued terms which break time-reversal symmetry; that said, it is often possible to generate such terms using auxiliary real-valued operators via a generalization of the Euler angles, or by means of change-of-frame transformations [34]. Moreover, as we have demonstrated, CD-QAOA admits non-gauge potential terms as building blocks for control sequences, e.g., universal gate sets. Other experimental constraints, such as the presence of drift terms, which cannot be switched off, can also be conveniently incorporated by redefining the set of unitaries A.
Finally, let us remark that RL provides only one possible set of algorithms to explore the exponentially large space of protocol sequences; in practice, one can apply other discrete optimization techniques, e.g. genetic algorithms and search algorithms like Monte-Carlo Tree Search (MCTS).
Acknowledgments.-We wish to thank A. Polkovnikov, Dong An, and Yulong Dong for valuable discussions.

The QAOA ansatz with policy gradient has been applied to efficiently find optimal variational parameters for unseen combinatorial problem instances on a quantum computer [78]; Q-learning was used to formulate QAOA into an RL framework to solve difficult combinatorial problems [79], and in the context of digital quantum simulation [80].
In the following, we introduce the details of the Reinforcement Learning algorithm used for the high-level optimization in this work.

Reinforcement Learning Basics
Reinforcement learning (RL) comprises a class of machine learning algorithms in which an agent learns to solve a given task through trial-and-error interactions with its environment [68]. It is based on a Markov decision process (MDP), defined by the tuple (S, A, p, R), where S and A represent the state and action spaces, p : S × S × A → [0, 1] defines the transition dynamics, and R : S × A → R is the reward function that describes the environment. Let π(a_j|s_j) : A × S → [0, 1] denote a stochastic policy that defines the probability of choosing an action a_j ∈ A given the state s_j ∈ S. Rolling out the policy π(a_j|s_j) in the environment can also be viewed as sampling a trajectory τ ∼ P_π(·) from the MDP, where P_π(τ) = p_0(s_1)π(a_1|s_1)p(s_2|s_1, a_1) ··· π(a_q|s_q)p(s_{q+1}|s_q, a_q) is the probability for the trajectory τ to occur, q sets the episode or trajectory length, and p_0 is the distribution of the initial state; an example of a trajectory is τ = (s_1, a_1, …, a_q, s_{q+1}). The goal in RL is to find a policy that maximizes the expected return J(θ) = E_{τ∼P_π}[R(τ)].

[FIG. 9. Schematics of CD-QAOA with an autoregressive policy network, showing the low-level (continuous) and high-level (discrete) optimization. The ancestral sampling procedure used for training is displayed in Fig. 10. The details of the network structure and its training hyperparameters are shown in Table III.]

To maximize the expected return J(θ), we use policy gradient, an RL algorithm which is (i) on-policy, i.e., trajectories have to be sampled from the current policy π_θ: π = π_θ, and (ii) model-free, i.e., the agent does not need a model of the environment dynamics: p(s'|s, a) is assumed unknown for the purpose of finding the optimal policy. Highly expressive function approximators, such as deep neural networks, are used to parametrize the policy with variational parameters θ. Policy gradient gradually improves the expected return over a number of iterations (or training episodes), by increasing the probability of actions that lead to higher rewards and decreasing the probability of actions that lead to lower rewards, until it reaches a (nearly) optimal policy.
We mention in passing that we use interchangeably the terms return and cost function (the latter being the negative of the former): the goal of the RL agent is thus to maximize the expected return, or to minimize the cost function.

Policy Gradient Reinforcement Learning for Quantum Many-Body Systems
Actions: To apply the reinforcement-learning formalism to quantum control, we identify taking actions at each time step within a learning episode with selecting unitaries one at a time within the circuit depth q. Choosing the same unitary at two consecutive time steps is prohibited, because the two actions would merge, resulting in a lower effective circuit depth q − 1.

[FIG. 10. The action at each time step is generated sequentially, by computing its respective conditional categorical distribution and sampling from it. Notice that only a single column is processed at each time step; in order to sample a complete sequence of actions in an episode, one needs to make a forward pass through the network architecture q times.]

At the initial time step j = 1, the quantum wavefunction is given by the initial state |ψ_i⟩; for each intermediate protocol step j, the action a_j = H_j is chosen according to the policy π_θ. Note that the RL agent only selects the generator H_j out of the set of available actions A (or, alternatively, which unitary to apply). In other words, unlike Ref. [76], the RL part of CD-QAOA is not concerned with finding the corresponding optimal duration α_j; one can think of this low-level continuous optimization as being part of the environment [cf. App. B] [81]. At the end of the episode, the quantum state is evolved by applying the entire generated circuit U({α_j}_{j=1}^q, τ) to the initial quantum state |ψ_i⟩.
States: Since the initial state |ψ_i⟩ is fixed, and thus the quantum state at any time step j is uniquely determined by the previous actions taken, we represent the RL state by concatenating all the previous actions up to step j [82]. One reason for this is that, in many-body quantum systems, the number of components in the quantum state scales exponentially with the system size N, which quickly leads to a computational bottleneck for simulation on classical computers. A second advantage of this choice is that the first layer of the underlying deep neural-network architecture, which parametrizes the policy, does not depend on the system size N either, which allows the algorithm to handle a large number of degrees of freedom. Using the quantum state would not be viable on quantum computers either, because the full quantum state cannot be directly measured. Therefore, we can simplify the form of trajectories to consist of actions only, e.g. τ = (a_1, a_2, …, a_q).
Rewards: The reward R_j = R(s_j, a_j) is chosen as the negative energy density at the end of the episode, R = −E/N. We use the energy density, since it is an intensive quantity with a well-defined limit as the number of particles N increases. In all figures, we show the relative energy E/E_GS for clarity (the ground-state energy E_GS is typically negative in our models), but the RL agent is always trained with the negative energy density −E/N. Rewards can also be other observables or nonobservable quantities, such as the overlap squared between two quantum states (the fidelity), or the entanglement entropy.
Notice that the reward is sparse: only at the end of the episode is the negative energy density given as a reward; there is no instantaneous reward during the sequence [and thus we can use interchangeably the terms reward and total return]. This is motivated by the quantum nature of the control problem, where a projective measurement results in a wavefunction collapse.

Policy Parametrization using an Autoregressive Neural Network
An essential part of the policy gradient algorithm is the definition of the policy π_θ. It is common to parametrize the policy with a highly expressive function approximator, such as a neural network. In our setup, we use a deep autoregressive network, which has recently been used in physics applications to generate samples from free energies in statistical mechanics models [83], and as a variational approximator for quantum many-body states [84]. This architecture is selected to incorporate causality by factorizing the total probability into a product of conditional probabilities:

π_θ(a_1, a_2, ⋯, a_q) = π_θ(a_1) ∏_{j=2}^{q} π_θ(a_j | a_1, ⋯, a_{j−1}),   (A2)

where the marginal distribution π_θ(a_1) and the conditional distributions π_θ(a_j|a_1, ⋯, a_{j−1}) are discrete categorical distributions over A. This kind of parametrization explicitly captures how actions taken earlier in an episode affect the actions selected later in the same episode. Such a causal requirement would not be necessary, had we used the full quantum state, which would make the dynamics of the environment Markovian. Each of the conditional probabilities in Eq. (A2) can be modeled explicitly using the autoregressive neural-network architecture, which naturally allows the policy to depend on all the previous actions only. The structure of the policy network is shown in Fig. 9, the sampling of the autoregressive policy is depicted in Fig. 10, and the hyperparameters of the algorithm (including the number of parameters) are given in Table III. In each iteration of the policy gradient algorithm, a batch of trajectories is rolled out (i.e., sampled) from the current policy, where M is the batch/sample size. Then, the return R(τ_k) corresponding to trajectory τ_k is computed as the negative energy density of the final evolved state. To compute the energies, we use the low-level optimization to determine the best-estimated values of α_j, given a sequence τ, see App. B.
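The ancestral sampling of the factorized policy in Eq. (A2) can be sketched as follows; `logits_fn` is a hypothetical placeholder standing in for one forward pass of the autoregressive network, and the masking of the previous action reflects the rule that equal consecutive unitaries would merge:

```python
import math
import random

def sample_trajectory(logits_fn, n_actions, q, rng=random.Random(0)):
    """Ancestral sampling from pi(a_1,...,a_q) = pi(a_1) * prod_j pi(a_j|a_<j).
    One call to `logits_fn(history)` per step stands in for one forward pass
    of the autoregressive network; repeating the previous action is masked."""
    history, log_prob = [], 0.0
    for _ in range(q):
        logits = list(logits_fn(history))
        if history:                              # forbid immediate repetition
            logits[history[-1]] = -math.inf
        z = max(l for l in logits if l > -math.inf)
        weights = [math.exp(l - z) if l > -math.inf else 0.0 for l in logits]
        s = sum(weights)
        probs = [w / s for w in weights]         # conditional categorical dist.
        a = rng.choices(range(n_actions), weights=probs)[0]
        log_prob += math.log(probs[a])
        history.append(a)
    return history, log_prob
```

As in Fig. 10, sampling a full episode requires q sequential forward passes, one per conditional distribution.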
To minimize the chance of getting stuck in a suboptimal local minimum, each sequence is evaluated multiple times, starting from a different initial realization for the α j -optimizer, and the best result is selected [App. D].
For every iteration, we can define three quantities for a fixed batch of samples: (i) the mean reward (over the current batch), (ii) the max reward (over the current batch), and (iii) the history best (the best-encountered reward over all previous iterations). These quantities measure the performance of the learned policy, and are shown in Fig. 11. Figure 12 shows the scaling of these quantities for the spin-1 Ising chain, as a function of the episode length q. The performance of CD-QAOA improves with q, because the space of protocol sequences for a larger value of q always contains that for a smaller q as a subset.
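These per-iteration diagnostics are simple batch statistics; a minimal sketch:

```python
def batch_metrics(batch_rewards, history_best):
    """Per-iteration diagnostics as in Fig. 11: mean and max reward over the
    current batch, plus the best reward encountered in any iteration so far."""
    mean_r = sum(batch_rewards) / len(batch_rewards)
    max_r = max(batch_rewards)
    return mean_r, max_r, max(history_best, max_r)
```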
In order to improve the policy represented by the autoregressive network, the RL algorithm interacts with the quantum environment by querying the reward for samples from the current policy. Each trajectory is assigned a reward, once the simulation of the quantum dynamics is complete [note that, as of present date, the simulation may be more expensive if evaluated on a quantum computer]. Thus, it is advantageous to reduce the sample size needed to learn the policy, i.e., to improve the sample efficiency.
The vanilla policy gradient method is known for its poor data efficiency. We therefore adopt Proximal Policy Optimization (PPO) [85], a more robust and sample-efficient policy-gradient-type algorithm. To be more specific, we use the clipped objective function

G(θ) = E_{τ∼π_{θ_t}} [ min( ρ_θ(τ) A_{θ_t}(τ), clip(ρ_θ(τ), 1−ε, 1+ε) A_{θ_t}(τ) ) ].   (A3)
Here, τ = (a_1, a_2, ⋯, a_q) is the action sequence sampled from the previous policy π_{θ_t} [cf. Algorithm 1]. Typically, the policy from the last iteration is chosen as the old policy; ρ_θ(·) = π_θ(·)/π_{θ_t}(·) is the importance-sampling weight between the new policy π_θ and the old policy π_{θ_t}; A_{θ_t}(τ) = R(τ) − b is the advantage function, where b is called a baseline: the advantage measures the reward gain of choosing a specific action, w.r.t. the baseline. For example, a simple baseline is the average reward, b = E_{τ∼π_{θ_t}}[R(τ)], in which case the advantage measures how much better (or worse) an action is than the average; in the numerical experiments, we use an exponential moving average [cf. App. A 5 for the details].
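A minimal sketch of an exponential-moving-average baseline update (the mixing coefficient `m` follows the EMA coefficient of Algorithm 1; the exact update used in the experiments may differ in detail):

```python
def ema_baseline(prev_baseline, batch_mean_reward, m=0.9):
    """Lagged baseline b for the advantage A = R - b: an exponential moving
    average of past batch-mean rewards, which reduces gradient variance."""
    return m * prev_baseline + (1.0 - m) * batch_mean_reward
```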
Further, the clip function, clip(r, x, y) = max(min(r, y), x), clips the value of r to the interval [x, y]; it is used to restrict the likelihood ratio to the range [1−ε, 1+ε], which prevents the policy from deviating too much from the old policy after one gradient update. The clipped objective function is designed to improve the policy while keeping it within some vicinity of the last iterate, whence the name Proximal Policy Optimization.

[FIG. 12. Spin-1 Ising model: energy minimization against different circuit depths q using CD-QAOA. The mean negative energy density (blue) is computed for a sample generated using the final, learned policy; max (orange) is the maximum within the sample; the history best (green) is the best encountered policy during the entire training process (i.e., considering all iterations). The total duration is T = 20, and q ranges from 8 to 24. The other model parameters are the same as in Fig. 11.]
We update the network parameters θ by ascending along the gradient of the RL objective G(θ). To provide intuition for the PPO objective, consider the following limiting case. If we keep only the first term in the objective, G_1(θ) = E_{τ∼π_{θ_t}}[ρ_θ(τ) A_{θ_t}(τ)], we obtain the gradient ∇_θ G_1(θ) = E_{τ∼π_{θ_t}}[∇_θ ρ_θ(τ) A_{θ_t}(τ)]; since we are taking the gradient with respect to θ, it does not act on π_{θ_t} and A_{θ_t}(τ), which are treated as constants. Furthermore, whenever θ ≈ θ_t, this gradient is identical to the policy gradient, ∇_θ G_1(θ) ≈ E_{τ∼π_θ}[∇_θ log π_θ(τ) A_θ(τ)]. However, PPO performs multiple gradient updates on the sampled data, rendering policy learning more sample efficient [85].
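The clipped surrogate of Eq. (A3) can be evaluated per batch as in the following pure-Python sketch (in practice a deep-learning framework would compute this with automatic differentiation):

```python
import math

def ppo_clipped_objective(log_probs_new, log_probs_old, returns, baseline, eps=0.2):
    """Batch average of min(rho * A, clip(rho, 1-eps, 1+eps) * A), cf. Eq. (A3),
    with rho the importance weight and A = R(tau) - b the advantage."""
    total = 0.0
    for lp_new, lp_old, R in zip(log_probs_new, log_probs_old, returns):
        rho = math.exp(lp_new - lp_old)            # importance-sampling weight
        adv = R - baseline                         # advantage A_{theta_t}(tau)
        clipped_rho = max(min(rho, 1.0 + eps), 1.0 - eps)
        total += min(rho * adv, clipped_rho * adv) # pessimistic (clipped) bound
    return total / len(returns)
```

When the new and old policies coincide, ρ = 1 and the objective reduces to the mean advantage over the batch.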

a. Incentivizing Exploration using Entropy
Maintaining a balance between exploration and exploitation is another major challenge for the reinforcement learning algorithm. Too much exploration prevents the agent from adopting the best strategy it knows so far; on the contrary, too much exploitation limits the agent from attempting new actions and achieving a potentially higher reward. Therefore, it is more appropriate for the agent to explore substantially in the initial iterations of the training procedure, and to gradually switch over to exploitation towards the end of the training procedure.
In order to incentivize the agent to explore the action space at the beginning of training, we add an entropy 'bonus' [86,87] to the PPO objective in Eq. (A3). To do this, consider the maximal-entropy objective, in which the agent maximizes the sum of the total reward and the policy entropy S [cf. Eq. (A5)]:

G(θ) + β_S^{−1} E_{τ∼π_θ} [ Σ_{j=1}^{q} S[π_θ( · |a_1, ⋯, a_{j−1})] ],   (A4)

where S[π_θ( · |a_1, ⋯, a_{j−1})] ≡ S[π_θ( · )] for j = 1. The trade-off between exploration and exploitation is controlled by the coefficient β_S^{−1}, which plays a role analogous to temperature in statistical mechanics: for β_S^{−1} → 0 (or β_S → ∞), any exploration is limited to the intrinsic probabilistic nature of the policy; if training is successful, for deterministic environments the policy is expected to eventually converge to a delta distribution (over the action space) in the later training iterations, which may deteriorate exploration and learning. In the opposite limit, β_S^{−1} → ∞ (or β_S → 0), every action is selected with equal probability, and the values of the policy π become irrelevant. Therefore, in practice, we use a decay schedule for the inverse temperature β_S^{−1} to gradually reduce exploration [see App. A 5]. Since the marginal distribution π_θ(·) and the conditional distributions π_θ(·|a_1, ⋯, a_{j−1}) are discrete categorical distributions over A, we can compute the entropy in closed form. For trajectory τ_i = (a_1^i, ⋯, a_q^i), the j-th term in the entropy bonus simplifies to

S[π_θ( · |a_1^i, ⋯, a_{j−1}^i)] = − Σ_{a∈A} π_θ(a|a_1^i, ⋯, a_{j−1}^i) log π_θ(a|a_1^i, ⋯, a_{j−1}^i).   (A5)
We emphasize that the entropy considered here is the Shannon or information entropy associated with the policy as a probability distribution, and should be contrasted with the thermodynamic entropy, associated with the logarithm of the density of protocol configurations (a.k.a. density of states) in the optimization landscape. The Shannon entropy aids exploration in the space of policies, and thus the annealing of the corresponding Lagrange multiplier, β_S^{−1}, is not related to thermal annealing in the optimization/energy landscape in a straightforward manner. Moreover, notice that the policy optimization is part of the classical postprocessing of the quantum data, i.e., it does not compromise the nature of the quantum data which is fed to the algorithm in the form of rewards. Figure 13 shows a comparison of PPO with and without the entropy bonus, as controlled by the value of the temperature β_S^{−1}. Introducing the policy information entropy keeps the policy somewhat broader in the initial stages of training, which enhances exploration; towards the end of training, the information entropy is no longer needed: therefore, we gradually "anneal" β_S^{−1}, cf. App. A 5.
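The closed-form entropy of Eq. (A5) and the temperature-weighted bonus amount to a few lines; a minimal sketch:

```python
import math

def categorical_entropy(probs):
    """Shannon entropy of one conditional categorical distribution, Eq. (A5):
    S = -sum_a pi(a|...) log pi(a|...); terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_bonus(conditional_probs, beta_inv):
    """Per-trajectory bonus: sum of the q per-step entropies, weighted by the
    'temperature' beta_inv, which is annealed to zero over training."""
    return beta_inv * sum(categorical_entropy(p) for p in conditional_probs)
```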

Technical Details
We train the CD-QAOA algorithm for 500 epochs/iterations with a mini-batch size of M = 128. Throughout the training, we sample trajectories according to the marginal and conditional policy distributions given by the autoregressive network.
We use Adam to perform gradient-based optimization of the objective in Eq. (A4), with the default parameters β_1 = 0.9 and β_2 = 0.999, which set the exponential decay rates for the first- and second-moment estimates, respectively. The learning rate is initialized as α_{lr,{0}} = 0.01 and decays by a factor of 0.96 every 50 steps in a staircase fashion. To be more precise, the learning rate at the k-th iteration reads α_{lr,{k}} = 0.01 · 0.96^⌊k/50⌋. The subscript {k} denotes the iteration/episode number.
We also introduce an exponential decay schedule for the prefactor [a.k.a. temperature] β_S^{−1} of the entropy bonus from Eq. (A4). The temperature is initialized at β_{S,{0}}^{−1} = 0.1 and decays by a factor of 0.9 every 10 steps, again in a staircase fashion. At the k-th iteration, the temperature is β_{S,{k}}^{−1} = 0.1 · 0.9^⌊k/10⌋. Eventually, the temperature is annealed to zero.
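Both schedules are instances of the same staircase decay rule; a minimal sketch:

```python
def staircase_decay(initial, factor, every, k):
    """Staircase exponential decay: initial * factor ** floor(k / every),
    used both for the Adam learning rate and the entropy temperature."""
    return initial * factor ** (k // every)

# learning rate and entropy temperature at iteration k
lr = lambda k: staircase_decay(0.01, 0.96, 50, k)
beta_inv = lambda k: staircase_decay(0.1, 0.9, 10, k)
```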
We estimate the advantage function as A_{θ_old}(τ) = R(τ) − b, where b is a baseline used to reduce the variance of the estimate. Our baseline b is an exponential moving average (EMA) of the previous rewards. The EMA stabilizes training and also leverages past reward information to form a lagged baseline. In practice, we find that the RL algorithm achieves better rewards than when using the average of the current samples as the baseline. To be more specific, the exponentially averaged baseline R̄ is updated at every iteration using the EMA coefficient m [cf. Algorithm 1].

Algorithm 1 CD-QAOA with autoregressive-network-based policy
Input: batch size M, learning rate η_t, total number of iterations T_iter, exponential-moving-average coefficient m, entropy coefficient β_S^{−1}, PPO gradient steps K.
1: Generate and select the gauge potential set A using Algo. 2.
2: Initialize the autoregressive network, and initialize the moving average R̄ = 0.
3: for t = 1, …, T_iter do
4: Autoregressively sample a batch B of M discrete action sequences: τ_k = (a_1^k, a_2^k, ⋯, a_q^k) ∼ π_θ(a_1, a_2, ⋯, a_q), k = 1, 2, ⋯, M.
5: Apply the SLSQP solver to the lower-level continuous optimization (cf. App. B).
6: Use the negative energy density as the return, and compute the moving average R̄.
7: Compute the advantage estimates A_k = R_k − R̄.
11: Use the advantage estimates and importance weights to compute G_k, S_k, following Eq. (A3) and Eq. (A5).
13: Update the weights θ by gradient ascent.
14: Update the parameters: θ_{t+1} ← θ.

Appendix B: Low-level optimization: finding the optimal protocol time steps α_j

In order to determine the values of the time steps α_j, we proceed as follows. For any given sequence of actions (or protocol sequence) τ = (a_1, ⋯, a_q) of total duration T, we solve the low-level optimization problem

min_{{α_j}} E({α_j}; τ)/N,  subject to Σ_{j=1}^{q} α_j = T,  0 ≤ α_j ≤ T,   (B1)

where q is the sequence length (circuit depth), N is the system size, and E(·) is the energy of the final quantum state [cf. Eq. (2)] after evolving the initial quantum state |ψ_i⟩ according to the fixed protocol τ.
Note that the α_j-optimization is both bounded and constrained. It fits naturally into the framework of Sequential Least Squares Programming (SLSQP). SLSQP solves the nonlinear problem in Eq. (B1) iteratively, using the Han-Powell quasi-Newton method with a Broyden-Fletcher-Goldfarb-Shanno (BFGS) [89] update of the B-matrix (an approximation to the Hessian), and an L1-test function in the step-size determination.
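With SciPy's SLSQP implementation, the constrained problem of Eq. (B1) can be set up as in the following sketch; `energy_fn` is a hypothetical placeholder for the quantum evolution returning E/N for a fixed sequence τ:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_durations(energy_fn, q, T, rng=np.random.default_rng(0)):
    """Low-level optimization of Eq. (B1): minimize the energy density over the
    durations alpha_j, subject to sum_j alpha_j = T and 0 <= alpha_j <= T."""
    alpha0 = rng.uniform(0.0, T / q, size=q)      # average initial duration T/q
    res = minimize(
        energy_fn,
        alpha0,
        method="SLSQP",
        bounds=[(0.0, T)] * q,                    # 0 <= alpha_j <= T
        constraints=[{"type": "eq", "fun": lambda a: np.sum(a) - T}],
    )
    return res.x, res.fun
```

Repeating the call from several random initial conditions, as described in App. D, and keeping the best result mitigates convergence to suboptimal local minima.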
During each iteration of the policy update, a batch of trajectories {τ i } = {(a i 1 , · · · , a i q )} M i=1 is sampled. Each trajectory sequence τ i is assigned a reward, by solving the optimization problem in Eq. (B1). Since performing the low-level optimization in Eq. (B1) is independent of the high-level optimization discussed in App. A, we run the former concurrently to boost the efficiency of the algorithm. We distribute every sequence τ i = (a i 1 , · · · , a i q ) to a different worker process and aggregate the results back to the master process in the end. In practice, we use the batch size M = 128, and we distribute the simulation on 4 nodes with 32 cores each, so that each core solves only one optimization at a time.
Recently, it was demonstrated that it is possible to perform the continuous optimization on par with the discrete one, which eliminates the need to use a solver and results in a fully RL optimization approach [90].
Appendix C: Scaling with the number of particles N, the protocol duration T, and the circuit depth q

Next, we discuss the computational scaling of CD-QAOA. While there are a number of (hyper)parameters in the algorithm, here we focus on the system size N, the protocol duration T, and the circuit depth q, which are physically the most relevant ones. We also consider the continuous and discrete optimization steps separately (the continuous step is also an essential part of conventional QAOA).
When it comes to the continuous optimization performed by a solver [cf. App. B], the main computational cost comes from the quantum evolution itself. The basic operation inside the solver is a multiplication of the matrix exponential exp(−iα j H j ) by the state |ψ i . The Hamiltonian H j is stored as a sparse matrix, and the action of the matrix exponential onto the quantum state, exp(−iα j H j ) |ψ i , can be evaluated without computing the matrix exponential itself with the help of a sparse matrix-vector product; this operation scales exponentially with the system size N , i.e. O(exp(cN )) for some constant c. If we denote the sequence length (a.k.a. circuit depth) by q, then the total cost for evaluating a single value of the continuous angle α scales as O(q exp(cN )). We stress that this cost is also incurred by conventional QAOA.
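The basic solver operation can be sketched with SciPy's `expm_multiply`, which evaluates the action of the matrix exponential on a state via sparse matrix-vector products, without ever forming the dense exponential (the single-spin Hamiltonian below is just a toy stand-in):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import expm_multiply

def evolve(psi, H_sparse, alpha):
    """Apply exp(-i * alpha * H) |psi> via sparse matrix-vector products,
    the elementary operation inside the continuous SLSQP solver."""
    return expm_multiply(-1j * alpha * H_sparse, psi)

# toy example: a single spin-1/2 with H = sigma^x
H = csr_matrix(np.array([[0.0, 1.0], [1.0, 0.0]]))
psi0 = np.array([1.0, 0.0], dtype=complex)
psi_T = evolve(psi0, H, np.pi / 2)   # exp(-i (pi/2) sigma^x) = -i sigma^x
```

For a chain of N spins the sparse Hamiltonian has dimension growing exponentially with N, which is the O(exp(cN)) cost quoted above.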
For the discrete optimization performed using reinforcement learning [App. A 4], notice first that the machine learning model is agnostic to the physical quantum model: we do not use information about the quantum model to train the policy, cf. App. A 2. Because the policy input is, by construction, independent of the quantum state, the input layer of the neural-network architecture is shielded from the exponential growth of the physical Hilbert space with N. Hence, the deep neural network is independent of the Hilbert-space dimension. Further, we use an autoregressive network model which scales linearly with the sequence length q, and also linearly with the size of the available action set |A|. Thus, the total computational cost of the reinforcement-learning optimization scales as O(q|A|). The scaling of the neural network with the variational network parameters (weights and biases) is trivially given by matrix-vector multiplication, as is the case for typical deep ML networks, and is also independent of the physics of the controlled system. A comparison of the wall-clock time for the discrete and continuous optimization steps is provided in Table IV. We distinguish between the continuous solver optimization and the discrete RL optimization, and show the average times for one successful step of each in the two right-hand columns. The total cost can then be obtained by multiplying the time t_solver by the appropriate number of repetitions (e.g., continuous-solver initial conditions, policy sample batch size, PPO training episodes, etc.), and by multiplying the time t_RL by the number of PPO iterations, thereby taking into account any parallelization if used; for instance, the most expensive simulation we performed ran for about 109 hours on four nodes (Intel Xeon Skylake 6130, 32-core, 2.1 GHz) to produce the N = 10, T = 12, q = 20 data point shown in Table II.

[Figure caption: Each point corresponds to a local minimum, obtained using the SLSQP optimizer, starting from a uniformly drawn random initial condition. The system size is N = 16, and the rest of the parameters are the same as in Fig. 1.]
We emphasize that the time t_solver required for the continuous optimization is an essential part of conventional QAOA, and is the current limiting factor for reaching large system sizes, as is the case in nearly all simulations of quantum dynamics on classical computing devices. In sharp contrast, the cost for training the deep autoregressive network is N-independent, and t_RL per iteration is negligible; however, the choice of RL algorithm can strongly impact the number of iterations. CD-QAOA is thus suitably designed for potential applications on quantum simulators and quantum computers, which will enable access to large system sizes, bypassing the exponential bottleneck intrinsic to simulations of quantum dynamics on classical devices.

Appendix D: Many-Body Control Landscape
Let us briefly address how hard the many-body ground-state preparation problems introduced in the main text are. To this end, recall that CD-QAOA has a two-level optimization structure: (i) discrete optimization to construct the optimal sequence of unitaries [App. A], and (ii) continuous optimization to find the best angles, given the sequence, to minimize the cost function [App. B]. Here, we focus exclusively on the continuous optimization landscape, and postpone the discrete landscape to a future study.
The RL agent learns in batches of M = 128 sequences, which sample the current policy at each iteration step and provide the data set for the policy gradient algorithm. To evaluate each sequence in the batch, we use SLSQP to optimize for the durations α_j in a constrained and bounded fashion: Σ_j α_j = T and 0 ≤ α_j ≤ T [cf. App. B]. This provides us with the full unitary U({α_j}_{j=1}^q, τ); applying it to the initial state, we obtain the reward value for this sequence. This procedure repeats iteratively as the RL agent progressively discovers improved policies.
Once the RL agent has learned an optimal sequence, i.e., after the optimization procedure is complete, we focus on the best sequence from the sample, and examine how difficult it is to find the corresponding durations α_j using SLSQP. To this end, we draw q values at random from a uniform distribution over the interval [0, T/q], and use them as initial conditions α_j^(0) to initialize the SLSQP optimizer with; since q equals the circuit depth, the initial durations α_j^(0) are, on average, equal. We then repeat this procedure P times, and generate a sample M = {{α_j^n}_{j=1}^q}_{n=1}^P of the local minima in the optimization landscape for the α_j's. The larger P, the better our estimate for the true reward assigned to τ.
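This multi-start procedure can be sketched with SciPy's SLSQP solver. The following is a minimal, self-contained illustration; the reward function here is a hypothetical stand-in (in the paper, the reward is the negative energy density of the state evolved under the protocol, which requires simulating the quantum dynamics):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in for the reward: any smooth function of the
# durations suffices for this sketch.
def reward(alphas):
    return -np.sum(np.sin(3 * alphas) ** 2)

def objective(alphas):
    return -reward(alphas)  # SLSQP minimizes, so flip the sign of the reward

def optimize_durations(T, q, n_restarts=10, seed=0):
    """Multi-start SLSQP over the durations alpha_j, subject to
    sum_j alpha_j = T and 0 <= alpha_j <= T."""
    rng = np.random.default_rng(seed)
    constraints = [{"type": "eq", "fun": lambda a: np.sum(a) - T}]
    bounds = [(0.0, T)] * q
    minima = []
    for _ in range(n_restarts):
        # initial conditions drawn uniformly from [0, T/q]
        a0 = rng.uniform(0.0, T / q, size=q)
        res = minimize(objective, a0, method="SLSQP",
                       bounds=bounds, constraints=constraints)
        if res.success:
            minima.append((res.fun, res.x))
    # the best local minimum found estimates the true reward of the sequence
    return min(minima, key=lambda m: m[0])
```

Each restart lands in one local minimum of the rugged landscape; keeping the best of P restarts is what the sampling schedule below makes cheaper early in training.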
Notice that, in the beginning of training, the RL agent is still in the exploration stage and the reward estimation does not need to be too accurate; the estimation needs to become more accurate as the agent switches over to exploitation towards the end of training. To make the algorithm computationally more efficient, we introduce a linear schedule for the number of realizations of the α_j-optimizer, starting from 3 with an increment of 1 every 30 iteration steps, i.e., P^tot_{(k)} = 3 + ⌊k/30⌋, where the subscript k indicates the iteration number of the RL policy optimization. To further save time in the reward estimation, we also introduce some randomness by sampling P_{(k)} from a uniform distribution over {1, 2, ..., P^tot_{(k)}}. Even though they all correspond to the same sequence, every local minimum in M represents a potentially different protocol, since different durations α_j cause the initial quantum state to evolve into different final states. For every protocol in M, we can evaluate the negative log-fidelity, −log F_τ(T), and the entanglement entropy of the half chain, S^{N/2}_ent. Since the target state for the Ising model is an ordered ground state, it has area-law entanglement. Figure 14 shows a cut through the landscape in the fidelity-entanglement entropy plane for a few different durations T for the spin-1/2 Ising model; the better solutions are located in the lower left corner. The proliferation of local minima across the quantum speed limit has recently been studied in the context of RL [28] and QAOA [41]. This behavior indicates the importance of running many different SLSQP realizations, or else we may mis-evaluate the reward of a given sequence and the policy gradient will perform poorly. Figure 14 also provides a plausible explanation for the destruction of the scaling collapse for T ≳ T_QSL [Fig. 2].
Although the precision of the SLSQP optimizer is set to 10^{-6}, the energy curves for large durations no longer fall on top of each other, exhibiting a larger relative error. Hence, the occurrence of many local minima of roughly the same reward, which correspond to different protocols, effectively removes any universal features from the obtained solution; as a result, different system-size simulations end up in different local minima.
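The restart schedule described above is simple integer arithmetic; a minimal sketch (the function names are ours):

```python
import numpy as np

def p_total(k):
    """Linear schedule for the maximum number of SLSQP restarts at RL
    iteration k: start from 3 and add one restart every 30 iterations."""
    return 3 + k // 30

def sample_restarts(k, rng):
    """Randomized restart count: uniform over {1, ..., p_total(k)}."""
    return int(rng.integers(1, p_total(k) + 1))
```

Early in training the agent thus pays for at most a few restarts per reward evaluation, while late in training (large k) the reward estimate sharpens.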

Consider a generic Hamiltonian H(λ), which depends on a control parameter λ = λ(t) via a general smooth function. We define a state preparation problem where the system is prepared in the ground state of H(λ(0)) at time t = 0, and we want to transfer the population to the ground state of H(λ(T)) by time t = T. Unlike adiabatic protocols, counterdiabatic driving relaxes the condition of being in the instantaneous ground state of H(λ) during the evolution. The idea is to reach the target state in a shorter duration T (compared to the adiabatic time) at the expense of creating controlled excitations [w.r.t. the instantaneous H(λ)] during the evolution, which are removed before reaching the final time T. To achieve this, one can define a counterdiabatic Hamiltonian

H_CD(t) = H(λ) + λ̇ A_λ,

which, in general, differs from the original H(λ), whose instantaneous ground state the system follows during the driving. Here, A_λ is the adiabatic gauge potential, defined implicitly as the solution to the equation [18]

[ ∂_λ H + i [A_λ, H], H ] = 0.    (E3)

The boundary conditions H_CD(λ(0)) = H(λ(0)) and H_CD(λ(T)) = H(λ(T)) impose the additional constraint λ̇(0) = 0 = λ̇(T), which suppresses excitations at the beginning and at the end of the protocol. Using Eq. (E3), one can convince oneself that the gauge potential A_λ of a real-valued Hamiltonian H is always imaginary-valued [18].
For generic many-body systems, it has recently been argued that the gauge potential A_λ is a nonlocal operator [34]. Nevertheless, one can proceed by constructing a variational approximation X ≈ A_λ, which minimizes the action

S(X) = ⟨ G_λ(X)² ⟩,   G_λ(X) = ∂_λ H + i [X, H].    (E4)

For ground state preparation, ⟨·⟩ = ⟨ψ_GS(λ)| · |ψ_GS(λ)⟩ is the instantaneous ground state expectation value w.r.t. H(λ). More generally, one can use ⟨·⟩ = Tr(ρ_th (·)), where ρ_th ∝ exp(−βH) is a thermal density matrix at temperature β^{-1}: for β → ∞, we recover the ground state expectation value; for β → 0, all eigenstates are weighted equally.
We mention in passing that alternative schemes to approximate the adiabatic gauge potential have also been considered [37].

a. Real-valued Spin-1/2 Hamiltonians
Let H now be a real-valued spin-1/2 Hamiltonian with translation and reflection invariance. Such a system is given, e.g., by the mixed-field Ising model discussed in the main text. We now construct an ansatz for the variational gauge potential X which obeys these symmetries and is imaginary-valued.
We can organize the terms contained in X according to their multi-body interaction type, as follows. The only single-body imaginary-valued term we can write is Σ_j β_j S^y_j. Translation and reflection symmetries, whenever present in H, further impose that the coupling constant β_j = β be site-independent, i.e., spatially uniform. Hence, this is the zeroth-order term in our variational gauge potential construction, cf. Eq. (E5).
Next, we focus on the two-body terms. Because the exact A_λ is imaginary-valued for real-valued Hamiltonians, we may only consider interaction terms where S^y appears precisely once: S^x S^y and S^y S^z. For spin-1/2 systems, the two operators have to act on different sites, or else one can further simplify their product to single-body operators using the algebra of Pauli matrices. Once again, translation invariance dictates that the coupling constants are uniform in space, while reflection invariance requires us to take a symmetric combination. Imposing further that the interaction be short-range (we want to construct the most local variational ansatz), we arrive at

X = β^{(0)} Σ_j S^y_j + β^{(1)}_1 Σ_j (S^x_j S^y_{j+1} + S^y_j S^x_{j+1}) + β^{(1)}_2 Σ_j (S^y_j S^z_{j+1} + S^z_j S^y_{j+1}).    (E5)

The coefficients β^{(k)}_l are the variational parameters that we need to determine to find the approximate CD protocol. To find their optimal values, we minimize the action S(X) [18]. Note that, since we do not have a closed-form expression for the instantaneous ground state of H(λ), we perform the minimization numerically at every fixed time t along the protocol λ(t) [cf. App. E 2].
We can, in principle, add the next-order terms to the series; however, they will either be less local, or consist of three- and higher-body interactions, which are hard to implement in experiments.
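The claim that terms containing a single S^y factor are purely imaginary-valued (and Hermitian) can be checked directly for a small chain. A minimal sketch with explicit matrices (this is an illustration, not the paper's QuSpin implementation):

```python
import numpy as np

# Spin-1/2 operators, S = sigma / 2
sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2
I2 = np.eye(2)

def two_site_term(op_a, op_b, j, N):
    """op_a on site j, op_b on site j+1 (periodic chain), identity elsewhere."""
    ops = [I2] * N
    ops[j] = op_a
    ops[(j + 1) % N] = op_b
    out = ops[0]
    for o in ops[1:]:
        out = np.kron(out, o)
    return out

def gauge_term(op_a, op_b, N):
    """Translation- and reflection-symmetric bond sum:
    sum_j (A_j B_{j+1} + B_j A_{j+1})."""
    return sum(two_site_term(op_a, op_b, j, N) + two_site_term(op_b, op_a, j, N)
               for j in range(N))

N = 4
X1 = gauge_term(sx, sy, N)  # the S^x S^y + S^y S^x combination
X2 = gauge_term(sy, sz, N)  # the S^y S^z + S^z S^y combination
# both are purely imaginary in the real (computational) basis...
assert np.allclose(X1.real, 0) and np.allclose(X2.real, 0)
# ...and Hermitian, as required for generators of unitary dynamics
assert np.allclose(X1, X1.conj().T) and np.allclose(X2, X2.conj().T)
```

Since each term contains exactly one factor of S^y (the only imaginary-valued Pauli matrix), its matrix elements in the computational basis are purely imaginary, as the assertions confirm.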

b. Real-valued Spin-1 Hamiltonians
The situation is more interesting for spin-1 systems: the eight-dimensional Lie algebra su(3), which generates SU(3), contains three distinct imaginary-valued directions, which form a closed subalgebra su(2) ⊂ su(3), and hence there is more room to generate imaginary-valued combinations. To find all imaginary-valued terms consistent with a set of symmetries, we use QuSpin's functionality to implement an algorithm [App. E 3] that lists them for generic bases [54,55].
Translation- and reflection-symmetric spin-1 Hamiltonians, such as the spin-1 Ising and Heisenberg models, have a similar expansion to their spin-1/2 counterparts, but allow for more terms. Restricting the expansion to two-body terms, we arrive at the series in Eq. (E6), where the constants a and b are chosen so that all five terms are mutually orthogonal w.r.t. the scalar product induced by the trace (i.e., Hilbert-Schmidt) norm; this ensures the linear independence of the constituent terms. Note that the three imaginary-valued on-site terms correspond precisely to the imaginary-valued su(2) ⊂ su(3).
Adding magnetization conservation and spin-inversion symmetry further reduces the allowed terms in the series: one then has to resort to three- and four-body terms, cf. Eq. (E7). Because these terms are multi-body and less local, we refrain from using them in CD-QAOA in the present study, and merely list them for completeness.

[Fig. 15 caption: Coefficients β^{(k)}_l(λ(t)) in the variational gauge potential (Eq. E6) with translation and reflection symmetry, determined from the procedure in App. E 2. The total duration is T = 12 with time discretization step ∆t = 0.2, and the system size is N = 8. The protocol used is λ(t) = sin²(πt/(2T)). The other model parameters are the same as in Fig. 7.]
As explained in the main text, to apply CD-QAOA for many-body ground state preparation, we consider the constituent terms in X as independent generators {H_j}_{j=1}^{|A|}. This comes in contrast to the variational gauge potential method, where the ratios between the coefficients β^{(k)}_l are fixed by the minimization of the action.

c. Lipkin-Meshkov-Glick Model

As explained in the main text, the Lipkin-Meshkov-Glick (LMG) Hamiltonian, cf. Eq. (5), models homogeneously interacting spin-1/2 particles on an all-to-all connected graph in the presence of an external field. Here, we compute the lowest-order terms appearing in the series for the variational gauge potential X, going beyond Ref. [91].
The starting point is the LMG Hamiltonian, Eq. (5). We introduce two bosonic modes, s and t, with S^z = t†t − N/2 = n_t − N/2 and S^+ = t†s, and cast the LMG Hamiltonian in bosonic form. Recalling once again that real-valued Hamiltonians have imaginary-valued gauge potentials, and that gauge potentials do not have diagonal matrix elements, we make a corresponding off-diagonal, imaginary-valued ansatz for X. To compute the matrix elements of the gauge potential, we define the basis

|N, n_t⟩ = (t†)^{n_t} (s†)^{N−n_t} / √( n_t! (N−n_t)! ) |0⟩,   with n_t = 0, ..., N.

Numerical Minimization to obtain the Variational CD Protocol
Since the action S in Eq. (E4) is quadratic in the variational parameters β_j, it is possible to derive a generic linear system whose solutions are the optimal parameters of the variational gauge potential within CD driving [92].
Suppose that X = Σ_{j=1}^r β_j H_j is given by a linear combination of r gauge potential terms. Then, it is straightforward to see that

G_λ(X) = ∂_λ H + i [X, H] = ∂_λ H + Σ_{j=1}^r β_j · i [H_j, H].

Defining the operator-valued quantities B_0 = ∂_λ H and B_j = i[H_j, H], and setting β_0 = 1, we arrive at the following expression for the variational action:

S(X) = Σ_{j,k=0}^r β_j β_k ⟨ B_j B_k ⟩,

which is a quadratic form in the unknown coefficients β_j.
To find the minimum of S(X) w.r.t. β_j, we take the derivative and set it to zero, which yields the linear system of equations for the optimal β_j:

Σ_{k=1}^r ⟨ {B_j, B_k} ⟩ β_k = − ⟨ {B_j, B_0} ⟩,   j = 1, ..., r,

where {·,·} denotes the anticommutator. Solving this system, we obtain the minimum {β_j}_{j=1}^r of the variational action S.
The ground state expectation values in the above procedure, as well as the Hamiltonian H(λ(t)), depend implicitly on time t ∈ [0, T] via the protocol λ(t). Therefore, to find the time dependence of β_j(t), we discretize the time interval [0, T] into N_T time steps, and repeat the procedure at every time step. This yields β_j(t_i) at the time steps t_i. To recover the full functional dependence, we use a fine discretization mesh and apply a linear interpolation to the β_j(t_i). Alternatively, notice that the coefficients β_j = β_j(λ(t)) depend on time t only implicitly via the protocol λ; therefore, it is also possible to discretize the range of λ(t) instead.
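As a check of this linear-system construction, consider a single spin-1/2 with H(λ) = σ^z + λσ^x and the one-term ansatz X = β σ^y, using the β → 0 (normalized trace) average; for this single-spin model, the variational solution coincides with the closed-form gauge potential A_λ = σ^y / (2(1 + λ²)). A minimal sketch (the model and function names are ours, chosen for illustration):

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def avg(A):
    # infinite-temperature (beta -> 0) average: normalized trace
    return np.trace(A).real / A.shape[0]

def variational_agp(lam):
    """Solve <{B_1,B_1}> beta_1 = -<{B_1,B_0}> for the r = 1 ansatz."""
    H = sz + lam * sx
    B0 = sx                      # d/d(lambda) H
    B1 = 1j * (sy @ H - H @ sy)  # i [H_1, H] with H_1 = sigma^y
    M = 2 * avg(B1 @ B1)         # <{B_1, B_1}>
    b = -avg(B1 @ B0 + B0 @ B1)  # -<{B_1, B_0}>
    return b / M                 # optimal beta_1

# compare with the closed-form result beta = 1 / (2 (1 + lambda^2))
for lam in [0.0, 0.5, 2.0]:
    assert abs(variational_agp(lam) - 1 / (2 * (1 + lam**2))) < 1e-12
```

In a many-body setting, the same code structure applies with B_j built from the chain operators and ⟨·⟩ replaced by the instantaneous ground state expectation, solved at every step of the discretized protocol.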
For the spin-1 Ising model, the time dependence of β_j is shown in Fig. 15. This defines H_CD, which generates the CD evolution. In Sec. V and App. F 4, we compare variational CD driving to CD-QAOA and conventional QAOA.

Algorithm for Generating Gauge Potential Terms in the Presence of Lattice Symmetries
Finally, we present the algorithm we used to determine the terms appearing in the gauge potential expansions in Eqs. (E5), (E6), and (E7), which obey a fixed set of symmetries.
In general, one can represent any local operator of the kind J_{i_1,...,i_l} O^{γ_1}_{i_1} ··· O^{γ_l}_{i_l} as a triple (Y, I, J), where J = J_{i_1,...,i_l} is the coupling constant, I = (i_1, ..., i_l) is the set of sites the operators act on, and Y = (γ_1, ..., γ_l) defines the types of operators that act on the corresponding sites; the triple (Y, I, J) can then be used to construct the operator.
In the following, we refer to the separate terms appearing in the gauge potential series as 'Hamiltonians' H_j, i.e., X = Σ_j β_j H_j; each Hamiltonian is defined as a sum of local operators (Y, I, J) over the lattice graph Λ. As we argued above, real-valued Hamiltonians have purely imaginary-valued gauge potentials; thus, the coefficient J is chosen to be purely imaginary.
We build the series for the variational gauge potential X recursively: we first consider a set L_elem of elementary operators O, the building blocks for the expansion; e.g., for spin-1 chains, these can be the spin-1 operators L_elem = {S^+, S^-, S^z}. We then construct the terms in the expansion for X iteratively at a fixed order l, e.g., l = 1 comprises single-body terms, l = 2 two-body terms, etc. We also assume that we have access to a routine which checks if a trial list of operators obeys a given lattice symmetry; if not, the same routine returns the missing operators to be added to the original list, so that the symmetry is satisfied [e.g., such a routine is available in QuSpin [54,55]].
The pseudocode we developed is shown in Algorithm 2. To construct multi-body terms at a fixed order l, we define combinations of the elementary operators and store them in the list L_op; the way these combinations are built can be used to implement constraints, such as particle/magnetization conservation, etc. This is implemented via the product operator (Line 2 of Algorithm 2), which generates all possible combinations of selecting l elementary operators with replacement. The sets of lattice sites that the operators from L_op act on are stored in the list L_sites (Line 3 of Algorithm 2). Then, for each trial triple (Y, I, J), we make use of the routine to check the symmetry and record any operators which do not respect it. We append these so-called missing operators to the original list, and we keep checking the symmetry condition until we obtain all operators that satisfy the symmetry (Lines 10-15 of Algorithm 2). The finite number of combinations guarantees termination in a finite number of steps.

Algorithm 2: Generating gauge potential terms in the presence of lattice symmetries
 1: Initialize an empty list Lgauge
 2: Lop = product(Lelem, l)
 3: Build the list Lsites of site sets the l operators act on
 4: for Y in Lop do
 5:   for I in Lsites do
 6:     Initialize an empty list LH
 7:     Set J = i (i = √−1)
 8:     Append the trial triple (Y, I, J) to LH
 9:     repeat
10:       for each triple in LH do
11:         check the symmetry requirements of the triple
12:         for sym in Lsym do
13:           if exists missing operator (Y', I', J') then
14:             Append (Y', I', J') to LH
15:     until no operator is missing
16:     Build Hamiltonian H using the triplets in LH
17:     if H or equivalents not included in Lgauge then
18:       Append H to Lgauge
19: Return the list of gauge potential basis Lgauge

product: Cartesian product; equivalents: equivalent mod scalar; missing operator: the operator missed for the symmetry requirement.

In order to avoid repeating previously identified Hamiltonians, we discard equivalent Hamiltonians (Line 17 of Algorithm 2): two Hamiltonians are called equivalent when one is a scalar multiple of the other. Since here we consider imaginary-valued gauge potentials, the multiplicative constant should be real. To test whether the Hamiltonians H_1 and H_2 are equivalent in practice, it suffices to test whether H_1 is equal to ± (‖H_1‖/‖H_2‖) H_2, where ‖·‖ is the Hilbert-Schmidt norm.
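The core idea of Algorithm 2, generating candidate operator strings and closing them under a symmetry while discarding equivalents, can be illustrated in a few lines for spin-1/2 at order l = 2 (a toy sketch of the idea, not QuSpin's actual routine): imaginary-valued terms must contain an odd number of S^y factors, and reflection symmetry pairs each operator type-string with its reverse.

```python
from itertools import product

def imaginary_two_body_terms():
    """Enumerate two-body operator type-strings over {x, y, z} that are
    imaginary-valued (odd number of 'y' factors), close each under
    reflection (string reversal), and discard equivalents."""
    basis = []
    for y_string in product("xyz", repeat=2):
        if y_string.count("y") % 2 == 0:
            continue  # real-valued: cannot appear in the gauge potential
        term = frozenset({y_string, y_string[::-1]})  # close under reflection
        if term not in basis:                         # discard equivalents
            basis.append(term)
    return [sorted("".join(s) for s in t) for t in basis]

# yields [['xy', 'yx'], ['yz', 'zy']]: precisely the symmetric
# S^x S^y + S^y S^x and S^y S^z + S^z S^y combinations found above
```

The sketch only enumerates operator types; the full algorithm additionally loops over the lattice sites I and builds the actual operators from the triples (Y, I, J).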

Appendix F: CD-QAOA for Many-Body State Preparation
Here, we provide a supplementary discussion on the performance of CD-QAOA for many-body pure state preparation using the quantum spin chains introduced in the main text. We refer the reader to the main text for the definition of various model parameters; the shorthand spin operator notation used is defined in Table I.

Spin-1/2 Ising Chain
First, we show results for the single-spin problem (J = 0) in Fig. 16: we clearly see that CD-QAOA [red curve] has a smaller quantum speed limit T_QSL ≈ 4.0 than conventional QAOA [blue]; this is anticipated, since CD-QAOA has a larger control space at its disposal. Moreover, we find that, for T < T_QSL, CD-QAOA only makes use of a single Y rotation, setting to zero the durations α_j associated with any other unitaries from the set A. As mentioned in the main text, conventional QAOA tries to represent this Y rotation by means of Euler angles, i.e., composed of X and Z rotations; in general, this results in a higher duration cost to complete the population transfer (leading to a larger T_QSL). However, for short durations T, a Y rotation can be exactly obtained using a proper sequence of the X and Z terms. For these reasons, we find an exact agreement between the two curves for small values of T ≲ 3.
Let us now switch on the spin-spin interaction J > 0, and consider the spin-1/2 Ising chain from the main text. We find that additionally using only the single-particle gauge potential term Y [green line], typically accessible in experiments, one can already obtain a higher-fidelity protocol than QAOA to prepare the ground state. Interestingly, for short protocol durations T, the two-body gauge potential terms, present in A but not in A′, do not contribute to improving the energy of the final state, as can be seen from the agreement of the red and green lines for T ≲ 1.5. This suggests that single-particle processes dominate over many-body processes when it comes to lowering the energy of the z-polarized initial state, and implies that the target ground state is single-particle-like (i.e., close to a product state). The non-smooth behavior of the green curve at larger durations is attributed to the ruggedness of the control landscape, as different runs of the SLSQP optimizer may get stuck in one of the many suboptimal local minima [App. D].
One may wonder if it is possible to prepare the ground state by straightforward fidelity maximization. We define the many-body fidelity to transfer the population to the target state |ψ_*⟩ using the unitary process U({α_j}_{j=1}^q, τ), with Σ_{j=1}^q α_j = T, as

F_τ(T) = |⟨ψ_*| U({α_j}_{j=1}^q, τ) |ψ_i⟩|².    (F3)

The fidelity can be less relevant from the perspective of many-body physics because (i) the many-body fidelity is typically exponentially suppressed, and (ii) it requires a reference to the ground state itself (which we seek) in order to be computed. However, the fidelity of a quantum process is a widely used benchmark in quantum computing; it also provides a better measure (than the energy density) for the distance between two states in the Hilbert space H. Figure 17 [bottom] shows the many-body fidelity for N = 14 spins. Unlike the inset of Fig. 1 from the main text (where we show the fidelity associated with the protocol obtained using energy density minimization), here we use the fidelity as the reward function for QAOA. We observe that optimizing the fidelity behaves quantitatively similarly to optimizing the energy density. We would like to emphasize here once again the advantage of the gauge potential ansatz: the conventional QAOA simulation is done using q = 80 variational parameters α_j [yet no significant improvement is observed for q ≥ 4, cf. Fig. 1], while CD-QAOA requires only q = 3 variational parameters.
Although the fidelity F_τ(T) is anticipated to vanish for T < T_QSL in the thermodynamic limit, the negative log-fidelity density, −N^{-1} log F_τ(T), is expected to remain finite. Figure 18 [inset] shows the finite-size scaling of the fidelity curves. Similar to the energy density [Fig. 2], we obtain an almost perfect scaling collapse. We verified that maximizing the fidelity produces similar results to minimizing the negative log-fidelity density for the spin-1/2 chain: at first sight, this is nontrivial because F_τ(T) is exponentially suppressed with the system size N for T < T_QSL; however, this behavior is likely explained by the generalization capabilities of the RL agent from small to large system sizes [cf. Sec. VI].

Anisotropic Spin-1 Heisenberg Chain
Next, we discuss in detail the ground state preparation in the anisotropic Heisenberg spin-1 chain, whose model parameters are defined in the main text. An important detail worth mentioning is that the ferromagnetic ground state at ∆/J = −2.0 is two-fold degenerate (each state corresponding to one of the two z-polarizations). While this is a trivial observation, it requires certain care when analyzing the physics of the protocols the agent found. In particular, notice that energy minimization is insensitive to this degeneracy, and hence the final state can appear as an arbitrary superposition of the two ferromagnetic states and still have the correct ground-state energy. This leads to an ambiguity when computing the fidelity of being in the target state; relatedly, the cost function landscape likely develops a continuous one-dimensional structure for the (degenerate) global minima. Because we are interested in energy minimization, here we define the fidelity using the projector onto the ground state manifold, P = |ψ^(1)_*⟩⟨ψ^(1)_*| + |ψ^(2)_*⟩⟨ψ^(2)_*|, where |ψ^(1)_*⟩, |ψ^(2)_*⟩ are any two orthonormal states which span the doubly degenerate ground state manifold (e.g., the two FM ground states). Figure 19 shows a comparison between CD-QAOA and conventional QAOA for the FM, XY, and Haldane target states: the top row shows the result of energy density minimization [cf. Fig. 3], while the bottom row displays the many-body fidelity associated with the same protocols. For ∆/J = 0.5, CD-QAOA allows reaching the target topological Haldane state faster, as compared to conventional QAOA. Notice also that the gauge potential ansatz appears essential for reaching the target for both the XY (∆/J = −0.5) and FM (∆/J = −2.0) states; this becomes particularly obvious from the many-body fidelity curves. The latter also reveal an interesting detail: at ∆/J = 0.5, a regime emerges around T ≈ 5 where the QAOA fidelity is better than the CD-QAOA fidelity.
However, this peculiarity below the quantum speed limit can be explained by recalling that the RL agent is given the (negative) energy density as the reward signal, not the fidelity (note that CD-QAOA does outperform QAOA in energy).
In order to investigate in detail the protocols found by CD-QAOA, we fix a duration T and consider the time evolution of the state, |ψ(t)⟩ = U({α_j}_{j=1}^q, τ)|ψ_i⟩, for three physical quantities:
(i) the energy, which provides a measure of how far away in the cost function landscape the state is at any given time t ∈ [0, T];
(ii) the instantaneous fidelity, F_τ(t) = |⟨ψ_*|ψ(t)⟩|² (and its generalization to the doubly degenerate ground state manifold), which measures how far the current state is from the target state |ψ_*⟩ in the Hilbert space; typically, we choose the ground state as the target state, |ψ_*⟩ = |ψ_GS(H)⟩;
(iii) the entanglement entropy of the half chain, S^{N/2}_ent(t) = −Tr_A[ρ_A(t) log ρ_A(t)], where A denotes a contiguous spatial region with complement Ā comprising half the periodic chain, and ρ_A(t) is the reduced density matrix on A at time t. For many-body systems, it is common to look at the entanglement entropy per site, which for spin-1 systems lies within the interval 2N^{-1} S^{N/2}_ent ∈ [0, log 3].
Figure 20 shows the time evolution of the energy, fidelity, and entropy density for all three target states of interest. For ∆/J = 0.5, transferring the population from the AFM initial state to the Haldane state can be achieved equally well using either QAOA or CD-QAOA. Table V(d) shows the optimal protocol found by the RL agent: notice the three vanishing durations α_2 = α_17 = α_18 = 0; factoring them out, we recover precisely the conventional QAOA sequence (albeit with q odd). Thus, we see that CD-QAOA may converge to conventional QAOA whenever the latter provides a high-reward sequence. This result exemplifies our claim that CD-QAOA generalizes QAOA successfully. Of course, it is not clear whether this is the true global minimum of the cost function landscape (the RL agent does make use of the additional gauge potential terms for T < 7). Nevertheless, all physical quantities are expected to be prepared with similar accuracy under both protocols: to see this, notice that the entanglement entropy density depends only on the quantum state (unlike expectation values of observables), and that its value at t = T is close to the value for the target state (dashed horizontal line). Importantly, the entanglement remains area-law (as seen from the values being much smaller than the maximum entropy per site, log 3), suggesting the existence of a local effective Hamiltonian which generates the population transfer process dynamically. The best sequence for targeting the XY state at ∆/J = −0.5 is shown in Table V(c).
Although its structure is more complicated, factoring out the vanishing α_j, we can discern two clear patterns: (i) the sequence starts and ends with two different single-particle basis rotations, and (ii) there is an alternating subsequence based on the subset {X|X + Y|Y, Y} ⊂ A_CD-QAOA. Interestingly, the only gauge potential term used by the RL agent is the experimentally accessible single-particle Y rotation, and it is sufficient to reach the target with a very high many-body fidelity. For comparison, conventional QAOA appears insufficient to prepare the target state for the circuit depth of q = 18 (p = 9). The advantage of CD-QAOA is also visible in the entanglement entropy density curve: QAOA can easily lead to volume-law entanglement, while CD-QAOA manages to generate as little entanglement as needed for the target state.
The discrepancy between conventional QAOA and CD-QAOA is most visible in the FM state preparation at ∆/J = −2.0. In this case, a naïve application of QAOA with the set A_QAOA = {X|X + Y|Y, Z|Z} is a priori doomed to fail: starting from the initial AFM state, which is orthogonal to the target FM manifold, the resulting QAOA unitaries leave the target FM manifold invariant; in other words, transitions between the initial and the target states are forbidden by selection rules within the QAOA dynamics. Therefore, the many-body fidelity remains zero at all times during the QAOA evolution. The energy and entanglement entropy curves certify that the state does undergo nontrivial dynamics: similar to the XY state, QAOA creates volume-law entanglement and cannot reach the FM ground state manifold in energy, while CD-QAOA is well-behaved and sufficient to prepare the target. The CD-QAOA protocol sequence is shown in Table V(b): while we do not discern an obvious pattern, we emphasize that this time the RL agent makes use of both single-particle and two-body gauge potential terms.
Next, we show the system-size scaling of the energy curves for the three target states in Fig. 21(b-d). Similar to the spin-1/2 Ising chain, we find very little system-size dependence for the Haldane (b) and XY (c) states. However, we cannot extrapolate the results to the thermodynamic limit due to the relatively small system sizes we were able to investigate. System-size effects are more pronounced for the ferromagnetic state (d), which is the one furthest away in Hilbert space from the initial perfect antiferromagnet.
Last, we mention in passing that we do not show results on preparing the AFM ground state at ∆/J = 2.0, since this problem is somewhat trivial: starting from a perfect AFM in the z-direction, the AFM ground state of the spin-1 Heisenberg model can easily be reached even using adiabatic evolution, because it lies within the AFM phase.

[Fig. 20 caption: Time evolution for the FM (∆/J = −2.0), XY (∆/J = −0.5), and Haldane (∆/J = 0.5) target states, respectively. Three quantities are shown: many-body fidelity (first row), energy ratio (second row), and entanglement entropy density of the half chain (third row). The horizontal dashed line in the entanglement entropy curves shows the value in the target state, while the shaded area for the FM state denotes that in the span of the doubly degenerate ground state manifold. The protocols correspond to the duration T = 7 in Fig. 3. The related CD-QAOA protocol sequences are given in Table V.]
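The half-chain entanglement entropy tracked throughout this appendix can be computed from a pure state via its Schmidt decomposition. A minimal sketch (with local dimension d = 3 for spin-1), not the paper's QuSpin implementation:

```python
import numpy as np

def half_chain_entropy(psi, N, d=3):
    """Von Neumann entropy S = -Tr[rho_A log rho_A] of the left half of an
    N-site chain with local dimension d, from the Schmidt values of psi."""
    m = psi.reshape(d ** (N // 2), d ** (N - N // 2))
    s = np.linalg.svd(m, compute_uv=False)  # Schmidt coefficients
    p = s ** 2                              # eigenvalues of rho_A
    p = p[p > 1e-12]                        # drop numerical zeros
    return -np.sum(p * np.log(p))

# sanity checks on a 2-site spin-1 chain:
up, dn = np.eye(3)[0], np.eye(3)[2]
prod = np.kron(up, dn)                                   # product state
bell = (np.kron(up, dn) + np.kron(dn, up)) / np.sqrt(2)  # maximally paired
assert abs(half_chain_entropy(prod, 2)) < 1e-12          # S = 0
assert abs(half_chain_entropy(bell, 2) - np.log(2)) < 1e-12  # S = log 2
```

Dividing by N/2 gives the entropy density compared against the log 3 bound quoted above.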

Lipkin-Meshkov-Glick model
In the main text, we also introduced the ferromagnetic Lipkin-Meshkov-Glick (LMG) model, described by the total spin Hamiltonian of Eq. (5). Figure 22 shows the comparison between CD-QAOA and QAOA for two more values of the field: h/J = 0.1 (deep in the ferromagnetic regime) and h/J = 0.9 (close to the critical point at h/J = 1.0). While the behavior for h/J = 0.1 is qualitatively similar to h/J = 0.5 (discussed in the main text), we do see that, close to the critical point, the two-body gauge potential terms X̂Ŷ and ẐŶ may offer some degree of improvement below the quantum speed limit, as compared to using only the single-body Ŷ term. We mention in passing that we observed a stronger system-size dependence of the optimal protocol found by the RL agent in the immediate vicinity of the critical point h_c/J = 1.

Spin-1 Ising Chain
Finally, let us turn to the spin-1 Ising chain; see the main text for a discussion of the model parameters. Using this model, we compare four state preparation techniques: CD-QAOA, conventional QAOA, CD driving using a variational gauge potential, and adiabatic evolution.
In order to compare these four methods, we first investigate their energy budget, i.e., the amount of energy required by the corresponding protocols. This is necessary, since variational CD driving does not put any constraints on the magnitude of the expansion parameters β_j(λ) [cf. App. E], and we know that larger energies (i.e., generators of unitaries H_j with large norms) in general allow for a faster population transfer. To measure quantitatively the energy budget of a protocol, we use the time-averaged energy density along the protocol trajectory,

N̄ = (1/T) ∫_0^T dt ‖H(t)‖ / N,

where H(t) is a unified notation for the continuous protocols in the case of adiabatic or CD driving, and for the piecewise-constant (in time) sequences in CD-QAOA and conventional QAOA; ‖H‖ denotes the Hilbert-Schmidt norm of the operator H. Since we are interested in many-body systems, it is natural to look at the energy density ‖H(t)‖/N. Figure 23 [bottom] shows that N̄ is on a similar scale for all four methods within the range of durations of interest, which allows for a meaningful comparison between them. As expected, CD driving approaches adiabatic driving at large T, since the gauge potential term comes with a prefactor λ̇ which vanishes for T → ∞; in the opposite limit T → 0, the energy budget of CD driving blows up, as a result of β_j(λ) being unconstrained.
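The time-averaged energy budget can be estimated numerically by quadrature over the protocol. A small sketch (the function name and discretization choices are ours):

```python
import numpy as np

def energy_budget(H_of_t, T, N, n_steps=200):
    """Time-averaged Hilbert-Schmidt norm density,
    (1/T) * integral_0^T ||H(t)|| / N dt, via the trapezoidal rule;
    H_of_t maps a time t to the Hamiltonian matrix at that time."""
    ts = np.linspace(0.0, T, n_steps)
    # the Frobenius matrix norm equals the Hilbert-Schmidt norm
    norms = np.array([np.linalg.norm(H_of_t(t)) / N for t in ts])
    dt = ts[1] - ts[0]
    integral = dt * (norms.sum() - 0.5 * (norms[0] + norms[-1]))
    return integral / T
```

For the piecewise-constant QAOA-type protocols, the same function applies with H_of_t returning the generator active at time t.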
In Fig. 23 [top], we see that the many-body fidelity, associated with the protocols obtained using energy-density minimization, increases the contrast between the performance of the different methods [cf. Fig. 23].

[Fig. 23 caption: Spin-1 Heisenberg model: comparison of CD-QAOA (red), conventional QAOA (blue), variational gauge potential (green), and adiabatic evolution (magenta). Two associated quantities are shown: the many-body fidelity F_τ (top) and the normalized time-averaged energy density N̄ over the protocol (bottom). The empty symbols mark the duration for which the evolution of physical quantities is shown in Fig. 24. The parameters are the same as in Fig. 7.]
Unlike the energy or the entanglement entropy, the fidelity is insensitive to any specific observable; this implies that CD-QAOA outperforms the other three methods on all observables, not just the energy. This is anticipated, because CD-QAOA combines the variational power of QAOA with physical insights from CD driving. Despite its better performance, CD-QAOA also has a smaller energy budget than either CD or adiabatic driving.
To demonstrate the nonequilibrium character of the optimal protocols found by the RL agent in this setup, we fix T = 12 and look at the time evolution of the energy, the fidelity, and the entanglement entropy within the learned protocol, cf. Fig. 24. While the protocol sequence [Table V(a)] appears impenetrable, we remark that (i) the RL agent makes use of both single-particle and two-body gauge potential terms, and (ii) some step durations α_j are found to vanish identically, suggesting that the value of q may be reduced. As anticipated, the dynamics generated by CD and adiabatic driving is smooth, in contrast to the circuit-like piecewise-continuous curves of QAOA and CD-QAOA. The highly nonmonotonic behavior of the energy curve shows that the CD-QAOA dynamics can be far from equilibrium: this likely stems from the RL objective [cf. App. A], the total expected return, since the agent only cares about maximizing the reward at t = T and is insensitive to any intermediate values. This allows the agent to drive the system through various states which are very far away from the target (e.g., w.r.t. the fidelity). [Curiously, these bad-energy states are all distinct, since they have different entanglement entropies; the system does not visit the same quantum state twice during the evolution.] The nonsmooth and nonmonotonic behavior of the CD-QAOA solution raises the question of how robust the protocol is to small external perturbations, a topic for future studies.

[Fig. 24 caption: Spin-1 Heisenberg model: time evolution generated by the four different methods: CD-QAOA (red), conventional QAOA (blue), CD driving using the variational gauge potential (green), and adiabatic evolution (magenta). Three quantities are shown: the many-body fidelity (left), the energy (middle), and the entanglement entropy of the half chain (right). The protocols correspond to the empty symbols at T = 12 in Fig. 23. The horizontal dashed line in the entanglement-entropy curve shows the value in the target state. The CD-QAOA protocol sequence is given in Table V(a). The model parameters are the same as in Fig. 7.]
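To illustrate how a piecewise-constant protocol generates curves such as those in Fig. 24, the following schematic NumPy routine (our own naming and conventions, not the code used in this work) applies a sequence of unitaries exp(-i α_j H_j) and records the energy, many-body fidelity, and half-chain entanglement entropy after each step:

```python
import numpy as np

def step_unitary(H, alpha):
    """exp(-i*alpha*H) via eigendecomposition of the Hermitian generator H."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * alpha * w)) @ V.conj().T

def half_chain_entropy(psi, dA):
    """Von Neumann entropy of the left block of Hilbert-space dimension dA."""
    M = psi.reshape(dA, -1)
    p = np.linalg.svd(M, compute_uv=False) ** 2
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log(p)))

def track_protocol(psi0, seq, H_target, psi_target, dA):
    """Evolve through the piecewise sequence seq = [(H_j, alpha_j), ...],
    recording (energy, fidelity to target, half-chain entropy) per step."""
    psi = psi0.copy()
    records = []
    for H, a in seq:
        psi = step_unitary(H, a) @ psi
        records.append((
            float((psi.conj() @ H_target @ psi).real),  # energy
            float(abs(psi_target.conj() @ psi) ** 2),   # many-body fidelity
            half_chain_entropy(psi, dA),
        ))
    return records
```

Because only the final state enters the reward, intermediate records like these can be highly nonmonotonic, exactly the behavior discussed above.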