Reinforcement Learning for Digital Quantum Simulation

Digital quantum simulation is a promising application for quantum computers. Their free programmability provides the potential to simulate the unitary evolution of any many-body Hamiltonian with bounded spectrum by discretizing the time-evolution operator through a sequence of elementary quantum gates, typically achieved using Trotterization. A fundamental challenge in this context originates from experimental imperfections of the involved quantum gates, which critically limit the number of gates attainable within a reasonable accuracy and therefore the achievable system sizes and simulation times. In this work, we introduce a reinforcement learning algorithm that systematically builds optimized quantum circuits for digital quantum simulation under a strong constraint on the number of allowed quantum gates. With this we consistently obtain quantum circuits that reproduce physical observables with as few as three entangling gates for long times and large system sizes. As concrete examples we apply our formalism to a long-range Ising chain and the lattice Schwinger model. Our method makes larger-scale digital quantum simulation possible within the scope of current experimental technology.

Introduction. Digital quantum simulation (DQS) has emerged as one of the most promising applications of quantum computers. Unlike analog simulators, which directly mimic the Hamiltonian of interest, digital simulators reproduce a target time-evolution operator with a sequence of elementary quantum gates. In principle, the unitary time evolution of any spin-type Hamiltonian can be encoded in a quantum computer with arbitrary precision [1]. The experimental implementation of DQS has seen remarkable progress in recent years, leading to the simulation of theoretical condensed matter models [2][3][4][5][6][7], lattice gauge theories [8], and quantum chemistry problems [9][10][11]. A common and natural approach to factorize time-evolution operators into elementary quantum gates is to utilize Suzuki-Trotter formulas [12,13]. While the theoretical Trotter error can be well controlled [14][15][16], high-accuracy Trotterization requires a large number of quantum gates. This leads to a critical problem because each of these individual gates suffers from experimental imperfections, in particular those which entangle qubits. A key challenge of DQS is therefore to identify factorizations of time-evolution operators utilizing a minimal number of quantum gates in order to exploit currently available hardware resources optimally.
In this work, we introduce a method based on reinforcement learning (RL) to systematically build DQSs constrained to a fixed low number of entangling gates. As a key step in our RL algorithm towards feasible large-scale DQS we propose to optimise the quantum circuits not with respect to the conventionally used global many-body wave function, but rather based on a local reward with the goal to reproduce expectation values of local observables and correlation functions. Remarkably, we find that the dynamics of strongly correlated systems can be digitally realised using just a handful of gates making large system sizes and long-time simulations feasible on current day devices.
Specifically, for the lattice Schwinger model, we build quantum circuits using only three entangling gates that correctly reproduce the dynamics of local observables and correlation functions for up to 16 qubits and for large times, reducing the number of entangling gates by one order of magnitude in comparison to a recent pioneering DQS experiment for 4 qubits [8]. With our RL algorithm we are able to systematically build DQSs with a drastically reduced number of quantum gates for large quantum many-body systems pushing the design of quantum circuits beyond what has been achieved previously utilizing RL methods [17][18][19][20] or in the field of quantum control [8,21]. Our work provides a route towards larger-scale DQS in previously inaccessible regimes with currently available hardware resources.
Digital Quantum Simulation. Let H = Σ_l H_l be such that U_l(t) = exp(−iH_l t) can be realized on the chosen quantum computing platform. The targeted dynamics can then be approximately factorized using the Suzuki-Trotter formula,

exp(−iHτ) ≈ [Π_l exp(−iH_l τ/n)]^n,

splitting the time τ of the simulation into n smaller steps of duration τ/n. This Trotterization comes with an error that is rigorously upper bounded as O(Nτ²/n) [14], with N the number of qubits, whereas the error on local observables can be even much smaller [15]. The central problem is that higher Trotterization accuracy requires larger n. This, however, increases the number of required quantum gates and therefore amplifies the imperfections due to faulty gate operations. In this work we aim to generate optimized quantum circuits for the factorization of time-evolution operators with a minimal number of quantum gates. We focus on trapped-ion quantum computing platforms with a universal set of quantum gates consisting of the single-qubit rotations U_z^j(θ) = exp(−iθσ_j^z) and U_x^j(θ) = exp(−iθσ_j^x) and the entangling gate

U_xx(θ) = exp(−iθ Σ_{j<j'} σ_j^x σ_{j'}^x / |j − j'|^α),   (1)

where σ_j^x, σ_j^y, and σ_j^z are the Pauli matrices at site j. The exponent α can be tuned within the range 0 ≤ α ≤ 3, but the optimal experimental performance is typically reached either for α = 0 or α ≈ 1. In the following we will focus for concreteness on either α = 3 or α = 1, while emphasizing that our approach can be straightforwardly applied also to other values of α or to other quantum computing architectures, such as superconducting qubits, with different sets of universal quantum gates.
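The trade-off between Trotter accuracy and gate count can be illustrated numerically. The following minimal sketch (a hypothetical two-qubit Hamiltonian chosen for illustration, not one from this work) compares the exact propagator with its first-order Trotter product and shows the error shrinking with the number of steps n:

```python
# Sketch (illustrative example, not from the paper): first-order Trotterization
# of H = H1 + H2 on two qubits, showing the error decreasing with the step
# number n, consistent with an O(tau^2/n) bound.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

H1 = np.kron(sx, sx)                      # entangling term
H2 = np.kron(sz, I2) + np.kron(I2, sz)    # single-qubit terms

def trotter_error(tau, n):
    """Operator-norm distance between exp(-iH tau) and the n-step Trotter product."""
    U_exact = expm(-1j * (H1 + H2) * tau)
    U_step = expm(-1j * H1 * tau / n) @ expm(-1j * H2 * tau / n)
    U_trotter = np.linalg.matrix_power(U_step, n)
    return np.linalg.norm(U_exact - U_trotter, ord=2)

# Error decreases roughly as 1/n for a first-order product formula.
errors = [trotter_error(tau=1.0, n=n) for n in (1, 2, 4, 8)]
```

Halving the error thus requires roughly doubling the number of entangling gates, which is exactly the cost the optimized circuits discussed in the text aim to avoid.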
The central goal of our work is to find circuits with a small number of quantum gates for the task of reproducing the dynamics of a given Hamiltonian. We translate this task into a variational optimization problem as follows. Let |ψ_0⟩ denote the initial state and let us fix the resources in terms of the quantum gates in Eq. (1). Then we construct a sequence of gates,

|ψ_DQS⟩ = Π_{t=1}^{n} [Π_{j=1}^{N} U_x^j(θ_t^{x,j}) U_z^j(θ_t^{z,j})] U_xx(θ_t^{xx}) |ψ_0⟩,   (2)

as depicted schematically in Fig. 1(a). The main goal now is to choose the underlying variational parameters θ = (θ_t^{xx}, θ_t^{z,1}, θ_t^{x,1}, . . . , θ_t^{z,N}, θ_t^{x,N}) such that the state |ψ_DQS⟩ is as close as possible to the desired time evolution of the target Hamiltonian H at a specified time τ: |ψ_target⟩ = exp(−iHτ)|ψ_0⟩. From now on the number of entangling gates will be fixed to n = 3. As we will show, remarkably, these small quantum circuits are sufficient to reproduce the dynamics of local observables, such as for the lattice Schwinger model, see Fig. 1(b), even for large systems and long times.
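A dense-matrix sketch of such a parameterized gate sequence is shown below. The exact gate ordering within each block of Eq. (2) is our assumption here, as is the power-law form of the entangling coupling; the sketch only illustrates the structure of the variational circuit:

```python
# Sketch of the variational circuit structure: n blocks, each a long-range XX
# entangling gate followed by single-qubit z and x rotations. The gate ordering
# and coupling form are assumptions for illustration.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(N, j, op):
    """Embed a single-site operator `op` at site j (0-based) into N qubits."""
    mats = [np.eye(2, dtype=complex)] * N
    mats[j] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def circuit_state(psi0, theta, N, alpha=3.0):
    """Apply n = len(theta) blocks; theta[t] = (theta_xx, theta_z[0..N-1], theta_x[0..N-1])."""
    Hxx = sum(embed(N, i, sx) @ embed(N, j, sx) / abs(i - j) ** alpha
              for i in range(N) for j in range(i + 1, N))
    psi = psi0
    for th_xx, th_z, th_x in theta:
        psi = expm(-1j * th_xx * Hxx) @ psi            # entangling gate
        for j in range(N):                             # single-qubit rotations
            psi = expm(-1j * th_z[j] * embed(N, j, sz)) @ psi
            psi = expm(-1j * th_x[j] * embed(N, j, sx)) @ psi
    return psi
```

With all angles set to zero the circuit acts as the identity, and any choice of angles preserves the norm of the state, as expected for a product of unitaries.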
Method. We use reinforcement learning (RL) in order to solve this difficult optimization problem. RL is a subfield of machine learning in which a software agent learns by interacting with an environment and adapting its behavior accordingly. The agent generates sequences of actions in the environment and learns to perform a given task by maximizing a cumulative reward function. RL has seen a recent surge of applications in the field of quantum control for few-body problems [17,18,20,[22][23][24][25][26] as it suits well optimiziation problems consisting of successive actions on a state with high dimensionality. Here, we are interested in the dynamics of quantum many-body problems which is a far more challenging problem.
In this work, we use a modified version of a deep Q-network algorithm [27], a variant of the original Watkins off-policy Q-learning algorithm using artificial neural networks as function approximators [28,29]. While we now summarize the central aspects of the algorithm, further details can be found in Refs. [29,30].

[Figure caption fragment: we show the DQS results using the fidelity (green) and the local reward (4) (red), and the exact time evolution (blue).]
The optimization problem is defined as an episodic RL problem: each episode is divided into a finite number of steps t = 0, . . . , n, corresponding to the steps of the DQS. At t = 0, the quantum wave function is in a given initial state |ψ_0⟩. Then, at each step t the agent chooses an action a_t. After each action a_t, the agent receives a scalar reward r_t. At the end of the episode, t = n, the reward characterizes how close the final state |ψ_DQS⟩ is to the target state |ψ_target⟩ = exp(−iHτ)|ψ_0⟩. For intermediate steps, the reward is set to 0, as we do not constrain the specific evolution of the quantum wave function between the initial and target state. A representative learning process as a function of episodes is shown in Fig. 2(a) for the Ising model in Eq. (6).
In Q-learning, the goal is to numerically compute the so-called action-value function Q(s, a) (represented by a neural network in our algorithm), which is defined as the expected total return Σ_{t=1}^{n} r_t, given that the environment is in the state s, the agent takes the action a, and acts optimally afterwards. At each step, the agent chooses an action using current knowledge of the Q function and uses the obtained feedback to update Q following the Bellman optimality equation [29]. Importantly, the actions a take continuous values in our case, which is not standard for Q-learning. We have modified our algorithm accordingly so that the a_t = argmax_a Q(s_t, a) operation is done by maximizing the output of the neural network with respect to part of its input [30].

Reward. A central quantity in the RL optimization problem is the cost function measuring the reward and therefore quantifying how close |ψ_DQS⟩ is to the target state |ψ_target⟩. The fidelity |⟨ψ_DQS|ψ_target⟩|² is commonly used to compare the two states globally. With a limited number of entangling gates, we find, however, that it is challenging to obtain high fidelities for large system sizes or times. As a consequence, we now introduce an alternative reward in our RL algorithm, which takes into account that in quantum simulation we are not so much interested in the global many-body wave function but rather in reproducing local observables and correlation functions. Let ρ = |ψ_target⟩⟨ψ_target| and σ = |ψ_DQS⟩⟨ψ_DQS| denote the density matrices corresponding to the target and the DQS state. We then define a local reward measuring the closeness of the reduced density matrices ρ_jk and σ_jk of the subsystem made of sites j and k for ρ and σ, respectively,

R_local = 1 − [2/(N(N−1))] Σ_{j<k} D(ρ_jk || σ_jk).   (4)

Here D(ρ||σ) = Tr[ρ(log ρ − log σ)] is the relative entropy and N denotes the number of qubits. A reward of R_local = 1 means that all ρ_jk = σ_jk and therefore all expectation values and correlation functions are reproduced exactly. It is a crucial observation that a high local reward R_local = 1 − ε can be directly translated into a high accuracy for local observables and correlation functions. For a two-body operator O_jk acting on sites j and k, one finds

|⟨O_jk⟩_ρ − ⟨O_jk⟩_σ| ≤ ||O_jk||_∞ √(2 D(ρ_jk||σ_jk)),   (5)

where ||·||_∞ denotes the operator norm. This can be derived using Hölder's inequality for Schatten norms and Pinsker's inequality [31,32].

As a first proof of concept, we apply our method to the long-range Ising model

H_LRI = J Σ_{i<j} σ_i^x σ_j^x / |i − j|^α + m_z Σ_j σ_j^z + m_x Σ_j σ_j^x.   (6)

For this system we can directly compare the performance of our approach to a conventional Trotterization procedure, as there exists a straightforward and natural decomposition of the Hamiltonian into the universal set of quantum gates in Eq. (1) upon choosing θ_n^{xx} = Jτ/n, θ_n^{z,j} = m_z τ/n, and θ_n^{x,j} = m_x τ/n.
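The chain of inequalities behind the bound can be checked numerically. The sketch below (purely illustrative: it uses random full-rank density matrices rather than reduced states from a simulation) verifies |⟨O⟩_ρ − ⟨O⟩_σ| ≤ ||O||_∞ √(2D(ρ||σ)) on a two-qubit Hilbert space:

```python
# Numerical check (illustrative, not from the paper) of the bound
# |<O>_rho - <O>_sigma| <= ||O||_inf * sqrt(2 D(rho||sigma)),
# which follows from Hoelder's inequality for Schatten norms and
# Pinsker's inequality.
import numpy as np

rng = np.random.default_rng(0)

def random_density_matrix(d):
    """Full-rank random density matrix (so the relative entropy is finite)."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def relative_entropy(rho, sigma):
    """D(rho||sigma) = Tr[rho (log rho - log sigma)] via eigendecompositions."""
    def logm_herm(M):
        w, V = np.linalg.eigh(M)
        return V @ np.diag(np.log(w)) @ V.conj().T
    return np.trace(rho @ (logm_herm(rho) - logm_herm(sigma))).real

d = 4  # two-qubit reduced Hilbert space
rho, sigma = random_density_matrix(d), random_density_matrix(d)
O = rng.normal(size=(d, d)); O = O + O.T          # a Hermitian "observable"
lhs = abs(np.trace(O @ (rho - sigma)).real)
rhs = np.linalg.norm(O, ord=2) * np.sqrt(2 * relative_entropy(rho, sigma))
```

Here `np.linalg.norm(O, ord=2)` is the spectral norm, i.e. the operator norm ||O||_∞ for a Hermitian matrix.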
For concreteness, we will consider J = 1, m_x = m_z = 2, and α = 3, starting from the fully polarized state |ψ_0⟩ = |↑ . . . ↑⟩. Let us emphasize, however, that we obtain similar results also for other choices of system parameters. As mentioned before, we fix the number of entangling gates in our circuits to three (n = 3), both for the Trotterized dynamics as well as for the circuit from RL.
The learning of the agent is witnessed by the evolution of the reward as a function of episodes, i.e., during the learning process, shown in Fig. 2(a) (for 16 qubits at τ = 1 using the local reward). Starting from the Trotterized circuit, the agent progressively improves the circuit until convergence. The mean value of the maximum rewards throughout each independent run is shown in Fig. 2(b) upon varying the system size N (at fixed τ = 1.0). As opposed to the Trotter fidelity, which decays exponentially, the DQS rewards remain at large values. The obtained fidelity decays with the system size N, but only linearly at the considered N, and the local reward is remarkably unaffected. If we instead fix the system size and increase τ, the Trotterization also fails eventually. In Fig. 2(c) we show this (for 10 qubits) together with the DQS results for both types of rewards.
To give a more physical perspective to our results, we also compare the values of physical observables resulting from DQS, Trotterization, and the actual dynamics (using exact diagonalization). Figures 2(d), (e), and (f) show the magnetization, the energy, and the Loschmidt echo |⟨ψ_0|ψ⟩|². For the 10-qubit system, after τ = 1 it is clear that the system enters a regime where the Trotterization with n = 3 fails. At the same time, there is a drop in performance of our algorithm, but the reward converges to a finite value as τ increases. Importantly, when translated into physical observables the resulting quantum circuits are much more successful than Trotterization. The magnetization, energy, and Loschmidt echo are well reproduced by the DQS, especially when the local reward is used. This indicates that our algorithm can systematically find a circuit bringing the initial state to an arbitrary target state (e.g., the time-evolved state at arbitrarily large time) using only three entangling gates.
Having demonstrated that our RL-based method with local reward exhibits a remarkable performance for the Ising model, we now aim to go one step further by studying a system where no natural decomposition into a Trotter sequence exists for the considered universal set of quantum gates. For that purpose we focus in the following on the lattice Schwinger model,

H_S = w Σ_{j=1}^{N−1} (σ_j^+ σ_{j+1}^− + h.c.) + (m/2) Σ_{j=1}^{N} (−1)^j σ_j^z + J Σ_{j=1}^{N−1} L_j², with L_j = Σ_{l=1}^{j} [σ_l^z + (−1)^l]/2,   (7)

which is represented here in the Kogut-Susskind Hamiltonian formulation [33,34] with the gauge degrees of freedom eliminated, as it has recently been realized experimentally using DQS based on Trotterization [8].
Concerning the non-equilibrium protocol we closely follow the experiment [8]. We start from the Néel state (corresponding to the bare vacuum) and then apply exp(−iH_S τ) with w = J = 1 and m = 0.5. Further, we use α = 1 for the entangling gates of Eq. (1) in the DQS, as this represents one of the optimal working points in systems of trapped ions. Even more so than for the LRI model, optimizing with the fidelity alone results in suboptimal sets of parameters, as can be seen in Fig. 3. Both short-range and long-range couplings are present in the lattice Schwinger model, and thus reproducing the dynamics with only three all-to-all entangling gates is particularly challenging. Nevertheless, we show that better sets of parameters do exist and are obtained when using the local reward. Interestingly, as for the LRI model, the performance of the algorithm with the local reward does not plummet as the system size increases, and physical observables are significantly better reproduced with the local reward than with the fidelity, as shown in Fig. 3(b) and (c) and in Fig. 1(b), where the particle number density ν = (1/2N) Σ_{j=1}^{N} [(−1)^j ⟨σ_j^z⟩ + 1] is shown, which has also been measured in the recent experiment [8].
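A short sketch of how ν can be evaluated from a state vector is given below. The site ordering and sign conventions here are our assumptions (most-significant bit = site 1, bit value 0 = |↑⟩), chosen only for illustration:

```python
# Sketch (conventions assumed for illustration): particle-number density
# nu = (1/2N) sum_j [(-1)^j <sigma_j^z> + 1] from an N-qubit state vector.
import numpy as np

def sz_expectation(psi, j, N):
    """<sigma_j^z> for site j = 1..N; bit 0 of site j means |up> (sz = +1)."""
    probs = np.abs(psi) ** 2
    signs = np.array([1.0 if (idx >> (N - j)) & 1 == 0 else -1.0
                      for idx in range(2 ** N)])
    return float(np.dot(signs, probs))

def particle_density(psi, N):
    return sum((-1) ** j * sz_expectation(psi, j, N) + 1
               for j in range(1, N + 1)) / (2 * N)

# Usage: one of the two Neel orderings for N = 2 under this convention.
neel = np.zeros(4); neel[2] = 1.0      # the |down, up> product state
nu = particle_density(neel, 2)         # evaluates to 1.0 here
```

Under this sign convention the two Néel orderings give ν = 0 and ν = 1, respectively; which one corresponds to the bare vacuum depends on the staggering convention adopted in Eq. (7).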
While ν, as a few-body operator, is directly covered by the local reward, the Loschmidt echo is a global quantity; nevertheless, it is reproduced remarkably well.
To explore the performance of our RL approach further, we compare in Fig. 3(c) the obtained dynamics of a two-body quantum correlation function against the exact solution. There we show results for the connected correlator C^{zz}(N/2, N/2 + 1) in the middle of the chain, where C^{zz}(j, j+1) = ⟨σ_j^z σ_{j+1}^z⟩ − ⟨σ_j^z⟩⟨σ_{j+1}^z⟩. While the two-body operator seems not as well reproduced as the single-body operator for long times, this is different for τ ≲ 2.0 when using the local reward. This is remarkable, as the overall signal strength of C^{zz}(N/2, N/2 + 1) is much smaller than what one would expect on the basis of the bound in Eq. (5).
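For completeness, the connected correlator can be evaluated directly from a state vector. The sketch below uses an assumed site convention (most-significant bit = site 1); `connected_zz` is an illustrative helper, not code from this work:

```python
# Sketch: connected correlator C_zz(j, j+1) = <sz_j sz_{j+1}> - <sz_j><sz_{j+1}>
# evaluated from an N-qubit state vector (site convention assumed).
import numpy as np

def sz_signs(j, N):
    """Diagonal of sigma_j^z in the computational basis (site j = 1..N)."""
    return np.array([1.0 if (idx >> (N - j)) & 1 == 0 else -1.0
                     for idx in range(2 ** N)])

def connected_zz(psi, j, N):
    """C_zz(j, j+1) from the measurement probabilities |psi|^2."""
    p = np.abs(psi) ** 2
    a, b = sz_signs(j, N), sz_signs(j + 1, N)
    return float(np.dot(a * b, p) - np.dot(a, p) * np.dot(b, p))
```

As a sanity check, any product state gives a vanishing connected correlator, while a maximally entangled Bell state gives C_zz = 1.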
Outlook. Remarkably, for the considered problems three entangling gates have typically turned out to be sufficient for an accurate DQS of local observables. In the future it might be important to increase the number of gates for higher precision, where convergence of our algorithm turns out to become progressively challenging. This might be remedied, for instance, by utilizing more advanced neural network structures, e.g., recurrent neural networks or long short-term memory networks, or by reducing the number of independent variational parameters in the optimization problem using physical insights, in particular by exploiting symmetries.
The current scheme requires an exact theoretical reference for the target state, which we obtain using exact diagonalization. The overarching goal of DQS, however, is to address scenarios which are beyond such a theoretical description and therefore without such an exact reference available. A key open question for the future is whether it is possible to obtain optimized time-evolution operator factorizations using reinforcement learning without an exact reference. However, for current typical DQS scenarios such a regime of quantum supremacy is not yet reached, so that our algorithm represents a central contribution to push DQS significantly beyond what has been achieved up to now in terms of system size and simulation time.
The algorithm is thus off-policy, as the behavior policy differs from the target policy. After a sufficient number of episodes, the algorithm converges and the policy obtained using Q(s, a) (i.e., a_t = argmax_a Q(s_t, a)) results in near-optimal sequences of gates.
In the vanilla Q-learning algorithm, s and a take only a finite number of discrete values and Q(s, a) is a matrix. To cope with complex environments (with continuous-valued states or high-dimensional states, or both), function approximators are used for Q.
In deep Q-learning, the Q function is approximated with a neural network. For instance, in the original papers of DeepMind [27,35], convolutional networks are used to process the state of the environment, made of pixels on a screen. In our algorithm, we use three-layer dense neural networks, because the dimensionality of our inputs is not as high.
Typically, the neural network takes a state s as an input and outputs the values Q(s, a) for all the possible discrete values of a. In our problem, however, the actions themselves are continuous. For this case there exist actor-critic RL algorithms that solve the continuous-action problem, such as deterministic policy gradient [36]. We tried using deterministic policy gradient, but it turned out that a simpler modification of the deep Q-network worked better. The neural networks used to represent Q(s, a) take both actions a and states s as inputs, and output a single scalar for the value of Q(s, a). The only non-trivial difference from the usual deep Q-network is how argmax_a Q(s, a) is calculated: we use gradient ascent (with Nesterov accelerated gradient) to maximize Q(s, a) with respect to the action inputs while keeping the state inputs fixed.
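This gradient-ascent argmax can be sketched as follows. The toy network below is hypothetical (random weights, one hidden layer), chosen only to illustrate maximizing a network's output over part of its input with Nesterov momentum while tracking the best action seen:

```python
# Minimal sketch (hypothetical toy network, not the architecture of this work)
# of argmax_a Q(s, a): Nesterov-accelerated gradient ascent on the action
# inputs of a dense network, keeping the state input fixed.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(8, 3))   # input = (s, a): s scalar, a in R^2
b1 = rng.normal(scale=0.1, size=8)
w2 = rng.normal(scale=0.5, size=8)

def q_and_grad(s, a):
    """Q(s, a) for a one-hidden-layer tanh network, and its gradient w.r.t. a."""
    x = np.concatenate(([s], a))
    h = np.tanh(W1 @ x + b1)
    q = float(w2 @ h)
    grad_x = W1.T @ (w2 * (1.0 - h ** 2))  # backprop to the inputs
    return q, grad_x[1:]                   # keep only the action components

def argmax_action(s, steps=300, lr=0.1, mu=0.9):
    """Gradient ascent over a with Nesterov momentum; returns the best a seen."""
    a, v = np.zeros(2), np.zeros(2)
    best_a, best_q = a.copy(), q_and_grad(s, a)[0]
    for _ in range(steps):
        _, g = q_and_grad(s, a + mu * v)   # look-ahead gradient (Nesterov)
        v = mu * v + lr * g
        a = a + v
        q, _ = q_and_grad(s, a)
        if q > best_q:
            best_q, best_a = q, a.copy()
    return best_a, best_q

s = 0.3
a_star, q_star = argmax_action(s)
```

Returning the best action encountered during the ascent (rather than the last iterate) makes the procedure robust against momentum overshoot on this non-concave landscape.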
In addition, RL is known to be unstable or even divergent when a nonlinear function approximator such as a neural network is used [27]. The instability comes from the correlations present in the sequence of data used to train the neural network, from the fact that small updates to Q can significantly change the policy, and from the correlations between Q and the target values. Indeed, the Q function is used both to choose the behavior a_t and as the target in Eq. (S3). We use common techniques developed to solve these problems: experience replay, and iterative updates of the target Q network (a separate network that is only periodically updated in order to reduce its correlations with the target). In the following, we present the details of the algorithm with pseudocode. The hyperparameters are given as we used them after tuning (through grid search).

[Pseudocode fragments, partially recovered:
for t = 0 to n − 1 do
    Select action a_t ← choose_action(s_t)
    Obtain physical parameters by rescaling the components of a_t
. . .
Return best a.]
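The experience-replay technique named above can be sketched as follows. This is a simplified, hypothetical implementation (the `ReplayBuffer` class and its capacity are illustrative, not the hyperparameters used in this work):

```python
# Sketch (hypothetical, simplified) of experience replay: a fixed-capacity
# store of transitions sampled in random minibatches, which breaks the
# temporal correlations in the training data stream.
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive training examples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage sketch: fill the buffer past capacity, then draw a training minibatch.
buf = ReplayBuffer(capacity=100)
for step in range(250):
    buf.push((step, 0.0, 0.0, step + 1, False))
batch = buf.sample(32)
```

The target Q network mentioned in the text would then be a second copy of the network whose weights are synchronized with the online network only every fixed number of updates.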