Reinforcement Learning assisted Quantum Optimization

We propose a reinforcement learning (RL) scheme for feedback quantum control within the quantum approximate optimization algorithm (QAOA). QAOA requires a variational minimization for states constructed by applying a sequence of unitary operators, depending on parameters living in a high-dimensional space. We reformulate such a minimum search as a learning task, where an RL agent chooses the control parameters for the unitaries, given partial information on the system. We show that our RL scheme finds a policy converging to the optimal adiabatic solution of QAOA found by Mbeng et al., arXiv:1906.08948, for the translationally invariant quantum Ising chain. In the presence of disorder, we show that our RL scheme allows the training to be performed on small samples and transferred successfully to larger systems.

In QA/AQC one constructs an interpolating Hamiltonian Ĥ(s) = s Ĥ_z + (1 − s) Ĥ_x, where, e.g., for spin-1/2 systems Ĥ_z is the problem Hamiltonian whose ground state (GS) we are searching for [19], while Ĥ_x = −h Σ_j σ̂^x_j is a transverse-field term. An adiabatic dynamics is then attempted by slowly increasing s(t) from s(0) = 0 to s(τ) = 1 over a large annealing time τ, starting from some easy-to-prepare initial state |+⟩, the GS of Ĥ_x. The difficulty is usually associated with the growing annealing time τ necessary when the system crosses a transition point, especially of first order [20]. QAOA, instead, uses a variational Ansatz of the form

|ψ_P(γ, β)⟩ = e^{−iβ_P Ĥ_x} e^{−iγ_P Ĥ_z} · · · e^{−iβ_1 Ĥ_x} e^{−iγ_1 Ĥ_z} |+⟩ ,    (1)

where γ = (γ_1, . . ., γ_P) and β = (β_1, . . ., β_P) are 2P real parameters. The variational state |ψ_P(γ, β)⟩ is constructed as a sequence of quantum gates, corresponding to 2P unitary evolution operators applied to the initial state, from right to left for increasing t = 1, . . ., P, each parameterized by a control parameter γ_t or β_t. The standard QAOA approach consists of a classical minimum search in such a 2P-dimensional energy landscape, which is in general not a trivial task [21]. Indeed, there are in general very many local minima in the QAOA landscape, and local optimizations with random starting points produce irregular parameter sets (γ*, β*), hard to implement and sensitive to noise. To obtain stable and regular solutions (γ*, β*) that can be easily generalized to different values of P and implemented experimentally, it is necessary to employ iterative procedures during the minimum search [14,15,17]. Interestingly, as discovered in Ref.
[15] for quantum Ising chains, smooth regular optimal schedules for γ_t and β_t can be found, which are adiabatic in a digitized-QA/AQC [22] context. One might indeed reformulate the QAOA minimization as an optimal control process [23] in which one acts sequentially on the system in order to maximize a final reward. This reformulation seems particularly suited for Reinforcement Learning (RL) [24-27]. As schematically represented in Fig. 1(a), at each discrete time step t an "agent" is given some information, typically through measuring some observables O_{t−1} on the state S_{t−1} = |ψ_{t−1}⟩ of the system on which it acts (the "environment"). The agent then performs an action a_t, choosing the appropriate (γ_t, β_t) and applying the corresponding unitaries to the state, obtaining a new state S_t = |ψ_t⟩ and receiving a "reward" r_t, which measures the quality of the variational state constructed.
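To make the state construction concrete, here is a minimal numpy sketch (ours, not the authors' code; all names are illustrative) that builds the QAOA variational state exactly for a small uniform transverse-field Ising chain:

```python
import numpy as np
from functools import reduce

# Single-site Pauli matrices
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def embed(ops, N):
    """Tensor-product operator acting with ops[j] on site j, identity elsewhere."""
    return reduce(np.kron, [ops.get(j, I2) for j in range(N)])

def tfim_hamiltonians(N, J=1.0, h=1.0):
    """Hz = -J sum_j sz_j sz_{j+1} (periodic), Hx = -h sum_j sx_j."""
    Hz = sum(-J * embed({j: sz, (j + 1) % N: sz}, N) for j in range(N))
    Hx = sum(-h * embed({j: sx}, N) for j in range(N))
    return Hz, Hx

def qaoa_state(gammas, betas, Hz, Hx):
    """Apply exp(-i beta_t Hx) exp(-i gamma_t Hz), t = 1..P, to |+...+>."""
    dim = Hz.shape[0]
    psi = np.full(dim, 1.0 / np.sqrt(dim), dtype=complex)  # |+>^N in the z basis
    ez = np.diag(Hz).real                  # Hz is diagonal in the z basis
    wx, Vx = np.linalg.eigh(Hx)            # diagonalize Hx once
    for g, b in zip(gammas, betas):
        psi = np.exp(-1j * g * ez) * psi                         # z-rotation layer
        psi = Vx @ (np.exp(-1j * b * wx) * (Vx.conj().T @ psi))  # x-rotation layer
    return psi
```

Since each layer is unitary, ⟨ψ_P|ψ_P⟩ = 1 at every depth, which makes a convenient sanity check on the implementation.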
Several questions come to mind, which have not been addressed in the recent literature on RL applied to quantum problems [28-33]: i) Is such an RL-assisted QAOA able to "learn" optimal schedules? ii) Are the schedules found smooth in t? iii) How to deal with the fact that getting information from |ψ_t⟩ involves quantum measurements which destroy the state? iv) Are the strategies learned easily transferable to larger systems?
In this Letter we show, on the paradigmatic example of the transverse-field Ising chain, that optimal strategies, well known in that case, see Ref. [15], can be effectively learned with a simple Proximal Policy Optimization (PPO) algorithm [34] employing very small neural networks (NNs). We show that RL automatically learns smooth control parameters, hence realizing an optimally controlled digitized-QA algorithm [15,35]. By working with disordered quantum Ising chains we show that strategies "learned" on small samples can be successfully transferred to larger systems, hence alleviating the "measurement problem": one can learn a strategy on a small problem which can be simulated on a computer, and implement it on a larger experimental setup [36].
RL-assisted QAOA - To test our scheme, we apply it to the transverse-field Ising model (TFIM) in one dimension, where detailed QAOA results are already known [15]. Specifically, we define the target Hamiltonian

Ĥ_z = − Σ_j J_j σ̂^z_j σ̂^z_{j+1} .    (2)

We start by considering the uniform TFIM, where J_j = J.
The model has a paramagnetic (h > J) and a ferromagnetic (h < J) phase, separated by a 2nd-order transition at h = J. The performance of QAOA on the uniform TFIM chain has been studied in detail in Refs. [15, 37]. Given a set of QAOA parameters (γ, β), we gauge the quality of the resulting state from the residual energy density

ε^res_P = (E_P(γ, β) − E_min) / (E_max − E_min) ,    (3)

where E_P(γ, β) = ⟨ψ_P(γ, β)| Ĥ_z |ψ_P(γ, β)⟩ is the variational energy, and E_max and E_min are the highest and lowest eigenvalues of the target Hamiltonian. Specifically, the results presented below concern targeting the classical state for h = 0, although the approach can be easily extended to the case h > 0. At h = 0 the residual energy is bounded by the inequality [15]

ε^res_P ≥ 1/(2P + 2)    (for P < N/2),    (4)

which becomes an equality if and only if (γ, β) are optimal QAOA parameters.
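As a small numerical aid (our sketch, with illustrative function names), the residual-energy figure of merit and the h = 0 lower bound of Ref. [15] can be coded directly:

```python
def residual_energy_density(E_P, E_min, E_max):
    """Eq. (3): rescale the variational energy into [0, 1]."""
    return (E_P - E_min) / (E_max - E_min)

def qaoa_bound(P):
    """Eq. (4): the h = 0 lower bound 1/(2P + 2), saturated only by
    optimal QAOA parameters (for P below half the chain length)."""
    return 1.0 / (2 * P + 2)
```

For example, a variational energy sitting a quarter of the way up the spectrum gives ε^res_P = 0.25, while the bound at P = 4 is 1/10.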
The key ingredients of the RL-assisted algorithm, as schematized in Fig. 1, are as follows.
State) The state S_t at time step t = 1, . . ., P is encoded by the wave-function |ψ_t⟩, defined iteratively as |ψ_t⟩ = e^{−iβ_t Ĥ_x} e^{−iγ_t Ĥ_z} |ψ_{t−1}⟩, starting from |ψ_0⟩ = |+⟩. The agent has partial information through a number of observables O_{t−1} measured on |ψ_{t−1}⟩. Our choice is

O_t = ( ⟨ψ_t| σ̂^z_j σ̂^z_{j+1} |ψ_t⟩ , ⟨ψ_t| σ̂^x_j |ψ_t⟩ ) ,    (5)

where a single value of j is enough when translational invariance is respected.
Action) The action a_t at time t corresponds to choosing (γ_t, β_t). The conditional probability of a_t given the observables O_{t−1}, called "policy" in RL, is denoted by Π_θ(a_t|O_{t−1}), where θ are the parameters of a Neural Network (NN) encoding. Our policy is stochastic, to help exploration: Π_θ(a|O) is chosen as a Gaussian distribution, whose mean and standard deviation are computed by the NN. From this, a_t = (γ_t, β_t) is extracted.
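A minimal sketch of such a stochastic Gaussian policy (our illustration, not the actual Spinning Up [38] implementation; all names are hypothetical, and here the standard deviation is a separate learned vector for brevity, whereas in the paper both mean and standard deviation come from the NN):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def init_mlp(sizes):
    """Random weights for a small fully connected network, sized as in
    the paper: two hidden layers of 32 and 16 neurons."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass: ReLU on hidden layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def sample_action(params, log_std, obs):
    """Pi_theta(a|O): Gaussian with NN-computed mean.
    Returns a sampled action a_t = (gamma_t, beta_t)."""
    mean = mlp(params, obs)
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)
```

The stochasticity is what drives exploration early in training; as the learned standard deviation shrinks, the sampled (γ_t, β_t) concentrate around the mean schedule.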
Reward) A reward r_t is calculated at time t. In our present implementation, r_{t=1,...,P−1} = 0 and only r_P > 0. The final reward r_P = R(E_P) is associated with minimizing the final expectation value E_P = ⟨ψ_P| Ĥ_z |ψ_P⟩. Here R(E_P) is monotonically increasing when E_P decreases. Specifically, we take R(E_P) = −E_P, but different non-linear choices have been tested.
Training) The training process consists of a number N_epo of "epochs", as sketched in Fig. 1(b). During each epoch the RL agent explores, with a fixed policy, the state-action trajectories for a certain number N_epi of "episodes", each episode involving P steps t = 1, . . ., P. At the end of each epoch the policy is updated to favor trajectories with higher reward. The particular RL algorithm we used is the Proximal Policy Optimization (PPO) algorithm [34], from the OpenAI Spinning Up library [38]. PPO is an actor-critic algorithm where two independent NNs are used to parameterize the policy Π_θ(a_t|O_{t−1}) and the state-value function [24] V^Π_θ(O_t). In our current implementation, V^Π_θ(O_t) gives the expected reward that the system, in a state with observables O_t, obtains as it evolves with the policy Π. V^Π_θ(O_t) is used to calculate the updates after each epoch [38]. In our numerical simulations, we used NNs with two fully connected hidden layers of 32 and 16 neurons, and rectified-linear (ReLU) activation functions.
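The epoch/episode structure can be sketched as follows (a stand-in environment replaces the quantum simulator; everything here, including the class and method names, is illustrative, and the PPO policy update itself is omitted):

```python
import numpy as np

class StubEnv:
    """Stand-in for the quantum environment: the real step applies
    exp(-i beta_t Hx) exp(-i gamma_t Hz) and measures O_t; here a dummy
    scalar 'energy' is tracked only to exercise the loop structure."""
    def reset(self):
        self.energy = 1.0
        return np.array([self.energy, 0.0])          # mock observables O_0

    def step(self, action):
        self.energy *= 0.9                           # pretend progress
        return np.array([self.energy, 0.0]), 0.0     # O_t, r_t = 0 for t < P

def run_epoch(policy, env, P, n_episodes):
    """One epoch: n_episodes rollouts of P steps with a frozen policy;
    only the last step carries the reward r_P = R(E_P) = -E_P."""
    batch = []
    for _ in range(n_episodes):
        obs = env.reset()
        trajectory = []
        for _t in range(P):
            action = policy(obs)                     # a_t = (gamma_t, beta_t)
            obs, _r = env.step(action)
            trajectory.append((obs.copy(), action))
        batch.append((trajectory, -env.energy))      # final reward
    return batch                                     # fed to the PPO update
```

In the paper's setup one such epoch corresponds to N_epi = 100 episodes, and the collected batch of trajectories and final rewards is what the PPO update consumes.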
Results - In the RL training, the system is initially prepared in the state |ψ_0⟩ = |+⟩, while the NNs for the policy and the state-value function are both initialized with random parameters. The agent is then trained for N_epo = 1024 epochs, each comprising N_epi = 100 episodes of P steps each. After training, we test the RL algorithm over repeated runs.
Fig. 2(a) shows the results obtained by the RL-trained policy. For P ≤ 6, the trained RL agent finds optimal QAOA parameters, saturating the bound for ε^res_P in Eq. (4). In particular, for small system sizes N, when P > N/2, the agent finds the exact target ground state, and ε^res_P = 0. For longer episodes (P > 6), the residual energy deviates from the lower bound due to two factors: i) the longer the episode, the more difficult it is to learn the policy, as a larger number of training epochs is necessary to reach convergence; ii) since we are using a stochastic policy, the error due to the finite width of the action distributions accumulates during an episode, leading to larger relative errors for longer trajectories. To cure this, we adopted the following strategy: we supplement the RL-trained policy with a final local optimization (LO) of the parameters (γ, β), employing the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [39]. This last step is computationally cheap, since the RL training already brings the agent close to a local minimum, provided N_epo is large enough. The residual energy data obtained in this way, denoted by RL+LO in Fig. 2(a), fall on top of the optimal curve ε^res_P = 1/(2P + 2). To visualize the action choices, we translate γ_t and β_t into the corresponding interpolation parameter s_t which a Trotter-digitized QA/AQC would show, which for h = 0 is given by [15]

s_t = γ_t / (γ_t + β_t) .

Fig. 2(b) shows the interpolation parameter s_t during an episode t = 1, . . ., P, for a chain of N = 128 spins and P = 8. Different curves are obtained by repeating a test run of the same stochastic policy, trained for N_epo = 1024 epochs. The parameters obtained through the RL policy are smooth, and different tests result in similar s-shaped profiles for s_t. When a final local minimization is added, the curves for s_t coalesce and coincide with the smooth optimal schedule obtained in Ref. [15] through an independent iterative local optimization strategy. When the training is at an early stage, i.e., the number of epochs is small, see inset of Fig. 2(b), the profiles s_t are more irregular and do not all fall into the same smooth minimum upon performing the LO (see the three dashed red lines).
Next, we turn to the random TFIM case. Here, for each chain length N we fix a given disorder instance {J_j}_{j=1,...,N} with J_j ∈ [0, 1], both for the training and the test of the RL policy. Since translational invariance is now lost, one would naively imagine that the relevant observables O_t in Eq. (5) would involve a list of 2N measurements. However, our experience has taught us that we can efficiently proceed with a reduced list comprising only the two Hamiltonian terms, O_t = (⟨ψ_t| Ĥ_z |ψ_t⟩, ⟨ψ_t| Ĥ_x |ψ_t⟩), hence chain-averaged quantities. All the parameters involved in training the NNs are fixed as in the uniform TFIM case. Fig. 3(a) shows the residual energy ε^res_P vs P obtained from the bare RL (full symbols) and from RL followed by a local optimization (RL+LO, empty symbols). The local optimization significantly improves the quality for large P ≥ 10. A detailed study of the behaviour of ε^res_P for large P, and a comparison with the results obtained [40] by a linear-QA/AQC schedule, with s(t) = t/τ, is left to a future study.
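The RL+LO polishing step and the translation to the digitized-QA schedule can be sketched like this (our sketch using SciPy's BFGS on a toy energy landscape, not the actual QAOA energy; the function names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def polish(energy_fn, gammas, betas):
    """Final local optimization (BFGS) starting from the RL-proposed
    parameters, which are assumed to be already near a minimum."""
    P = len(gammas)
    x0 = np.concatenate([gammas, betas])
    res = minimize(lambda x: energy_fn(x[:P], x[P:]), x0, method="BFGS")
    return res.x[:P], res.x[P:]

def schedule(gammas, betas):
    """Digitized-QA interpolation parameter s_t = gamma_t / (gamma_t + beta_t)."""
    return gammas / (gammas + betas)

# Toy quadratic landscape (NOT the QAOA energy), with minimum at (0.3, 0.7)
toy_energy = lambda g, b: np.sum((g - 0.3) ** 2) + np.sum((b - 0.7) ** 2)
```

Because the RL training already places the starting point in the basin of the smooth optimal minimum, this final BFGS step is cheap and converges to the same schedule across test runs.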
Fig. 3(b) shows the optimal parameter s_t = γ_t/(γ_t + β_t) found by the RL+LO method, compared to the s_t constructed with the iterative optimization strategy described in Ref. [15]: the agreement between the two is remarkable, showing that the RL-assisted QAOA effectively "learns" smooth action trajectories.
The most remarkable fact, however, is shown by the series of grey lines in Fig. 3(b). These are obtained by training the RL agent on a much smaller instance with N = 8 sites, and transferring the RL policy to the larger (and different) disorder instance with N = 128, followed by local optimization of the learned parameters. These results show a high degree of transferability of the RL policies, which we have verified to hold even in the absence of the final LO. This suggests the following way out of the "measurement problem" involved in the construction of the state observables O_t. Indeed, in an experimental implementation of RL-assisted QAOA, the RL agent could observe a small system, efficiently simulated on classical hardware, and then use the learned actions to evolve the larger experimental system. This drastically reduces the number of measurements to be performed and allows RL-assisted QAOA to be tested on physical quantum platforms.
Conclusions - In this Letter we have shown that the optimal QAOA strategies well known for the TFIM [15] can be effectively learned with a simple PPO algorithm [34] employing rather small NNs. The observables measured on a state, referring to the two competing terms in the Hamiltonian and providing information to the "agent", seem to be effective in the learning process. We have shown that RL learns smooth control parameters, hence realizing an RL-assisted feedback quantum control for the schedule s(t) of a digitized QA/AQC algorithm [15], in the absence of any spectral information. By working with disordered quantum Ising chains we have shown that strategies "learned" on small samples can be successfully transferred to larger systems, hence alleviating the "measurement problem": one can learn a strategy on a small problem simulated on a computer, and implement it on a larger experimental setup.
A discussion of previous RL work on quantum systems is appropriate here. RL as a tool for quantum control and quantum error correction has been investigated in Refs. [28,29]. Regarding applications to QAOA, Refs. [30,31,33] have all formulated RL strategies to learn optimal variational parameters (γ, β). While sharing similar RL tools, their approach is markedly different from ours: they identify the RL "state" with the whole set of QAOA parameters. The agent has no access to the internal quantum state, and no information on the evolution process can be exploited in the optimization. In this way, the issue of measuring the intermediate quantum state is bypassed. This choice, however, reduces RL to a heuristic optimization which forfeits one of the most relevant features of the RL framework: the possibility to drive the process with a step-by-step evolution. An alternative proposal, closer to ours in methods but tackling different physical questions, has recently appeared in Ref. [32].
Concerning future developments, we mention possible improvements on the "measurement problem". One possibility is to introduce ancillary qubits to provide intermediate information to the RL agent without destroying the state of the system, in a way similar to Ref. [29]. A possible alternative is to perform weak measurements [41]. A second issue is the sensitivity to noise: preliminary results show that noise in the initial state preparation does not harm the ability to learn the correct strategies. Finally, the application to other models is worth pursuing: preliminary results on the fully connected p-spin Ising ferromagnet are encouraging.

Figure 1: Scheme of: (a) a single step of Reinforcement Learning for QAOA; (b) the "episodes" loop in each k-th training "epoch", with the "policy" and "state-value" neural networks Π_{θ_k} and V^Π_{θ_k}.

Figure 2: (a) Residual energy density ε^res_P, Eq. (3), vs P. Full symbols: results from RL only; empty symbols: a local optimization (LO) supplements the RL actions (RL+LO); data are averaged over 50 test runs. The black dashed line is the lower bound of Eq. (4). (b) The schedule s_t = γ_t/(γ_t + β_t). Full blue lines denote s_t learned after N_epo = 1024 epochs on a chain of N = 128 sites; dashed red lines, the RL+LO results; black empty squares, the iterative LO smooth solution [15]. The RL actions are in the basin of the same optimal minimum. Inset: same data for N_epo = 128 training epochs, where not all the LO-optimized action sets fall onto the iterative LO solution.

Figure 3: (a) Residual energy density ε^res_P vs P for the random TFIM, from the bare RL (full symbols) and from RL+LO (empty symbols). (b) The optimal parameter s_t = γ_t/(γ_t + β_t) found by RL+LO, compared with the iterative optimization strategy of Ref. [15]; grey lines: policy trained on a smaller instance with N = 8 sites and transferred to the N = 128 instance, followed by LO.