Expressibility and trainability of parameterized analog quantum systems for machine learning applications

Parameterized quantum evolution is the main ingredient in variational quantum algorithms for near-term quantum devices. In digital quantum computing, it has been shown that random parameterized quantum circuits are able to express complex distributions intractable by a classical computer, leading to the demonstration of quantum supremacy. However, their chaotic nature makes parameter optimization challenging in variational approaches. Evidence of similar classically intractable expressibility has recently been demonstrated in analog quantum computing with driven many-body systems, but a thorough investigation of the trainability of such analog systems is yet to be performed. In this work, we investigate how the interplay between external driving and disorder in the system dictates the trainability and expressibility of interacting quantum systems. We show that if the system thermalizes, the training fails at the expense of a large expressibility, while the opposite happens when the system enters the many-body localized (MBL) phase. From this observation, we devise a protocol using quenched MBL dynamics which allows accurate trainability while keeping the overall dynamics in the quantum supremacy regime. Our work shows the fundamental connection between quantum many-body physics and its application in machine learning. We conclude with an example application in generative modeling, employing a well-studied analog many-body model of a driven Ising spin chain. Our approach can be implemented with a variety of available quantum platforms, including cold ions, atoms and superconducting circuits.


I. INTRODUCTION
The recent achievement of quantum supremacy [1], the ability of quantum systems to compute tasks that are intractable by a classical computer, stands as an important milestone for noisy intermediate-scale quantum (NISQ) devices [2]. A common approach to operating NISQ devices is to implement variational quantum algorithms (VQAs), where a classical feedback loop is used to passively correct the noise in the quantum device [3][4][5]. VQAs have been implemented to tackle a wide range of problems, from quantum chemistry [6][7][8][9][10][11], machine learning [12,13] and quadratic binary optimization [14][15][16] to high energy physics [17].
One of the key questions for NISQ devices is whether they can provide provable quantum advantage for real-world problems. A hint to answering this question lies in the ability of NISQ devices to efficiently explore Hilbert space. For example, in quantum chemistry, NISQ devices can produce highly entangled variational ansatzes, such as unitary coupled clusters, that cannot be efficiently represented on a classical computer [18]. In machine learning, quantum circuits have been proven to have more 'expressive power' than any classical neural network [19][20][21], meaning that those circuits can produce complex distributions that classical models cannot efficiently reproduce. Similar to classical variational algorithms, VQAs rely on 'good' ansatzes that can efficiently capture the answer to a given problem. When such an ansatz is not known or implementable, it is desirable to exploit the high expressibility of some NISQ devices to generate an unbiased guess. The latter approach is known as 'hardware efficient' [8]. A common feature of this approach is to exploit chaotic dynamics, which allows the system to quickly explore the entire Hilbert space. However, this chaoticity also makes it difficult, if not impossible, to classically optimize the system, since it is highly sensitive to any small change in the parameters. In the digital case, hardware-efficient ansatzes suffer from the barren plateau problem [22,23], where the landscape of the cost function becomes exponentially flat as the number of qubits increases. Hence, finding the right VQA for a given problem is an emerging art of balancing the expressibility, implementability and trainability of NISQ devices.
Analog quantum simulators stand out from their digital counterparts when it comes to implementability [24][25][26]. Here, a quantum device is built to mimic a specific Hamiltonian, which requires significantly less control than universal quantum circuits. State-of-the-art quantum simulators have already been able to produce dynamics intractable by existing classical algorithms [27]. Quantum supremacy in analog simulators has also been proven for the 2D Ising lattice [28,29], cluster states [30] and, more recently, periodically-driven quantum many-body systems [31]. Hybrid analog-digital approaches for VQAs have been explored in Refs. [7-9, 13, 14, 32-34].
In this work, we analyze the expressibility and trainability of analog quantum devices, focusing on parameterized driven quantum many-body systems. We show that these properties are intimately related to the phases of the system. We focus on four generic phases, depending on whether the dynamics is thermalized or many-body localized (MBL) [35,36] and whether a continuous drive is applied. As an example, we consider the standard Ising chain, globally driven by an external magnetic field. We find that, evolving under the dynamics resulting from a series of quenches between randomized disorder configurations, the system in all four phases is capable of reaching the quantum supremacy regime, illustrating its high expressibility beyond a classical computer. We then devise a simple sequential training protocol to train the system for generative modeling tasks in machine learning. We show that the chaoticity in the thermalized phase prevents training, as in the digital case. However, the integrability of the MBL dynamics within each quench drastically increases the trainability of the system. The final learning accuracy depends solely on the phase of the system.

II. DRIVEN ANALOG QUANTUM SYSTEMS AND THEIR STATISTICS
In this section, we study the many-body dynamics of generic parameterized quantum systems and the different statistics associated with their phases. We then analyze a specific example of a driven quantum Ising chain, which will be used for the analysis of expressibility and trainability in the following sections.

General framework
We consider fully general quenched quantum many-body systems $|\psi(\Theta_M)\rangle = \hat{\mathcal{U}}(\Theta_M)|\psi_0\rangle$, where $|\psi_0\rangle$ is an initial product state, $\Theta_M$ is a vector containing all variational parameters during the evolution and $M$ is the number of times the system is quenched. The unitary time evolution is

$$\hat{\mathcal{U}}(\Theta_M) = \hat{U}(\theta_M)\cdots\hat{U}(\theta_2)\,\hat{U}(\theta_1),$$

where $\Theta_M = \{\theta_m\}_{m=1}^{M}$ and each quench/layer is obtained from a time-dependent Hamiltonian $\hat{H}(\theta_m, t)$, i.e.

$$\hat{U}(\theta_m) = \mathcal{T}\exp\left(-i\int_0^{T}\hat{H}(\theta_m, t)\,dt\right),$$

with $m \in \{1, 2, ..., M\}$, $\mathcal{T}$ being the time-ordering operator and $T$ being the evolution time during each layer. The Hamiltonian is further decomposed as

$$\hat{H}(\theta_m, t) = \hat{H}_0(\theta_m) + f(t)\,\hat{V},$$

where $\hat{H}_0(\theta_m)$ is a static Hamiltonian and $\hat{V}$ is the driving Hamiltonian, with $[\hat{H}_0(\theta_m), \hat{V}] \neq 0$. The modulation $f(t)$ is an oscillating function with period $T$. We require that the time-averaged Hamiltonian $\hat{H}_{\text{ave}}(\theta_m) = \frac{1}{T}\int_0^{T}\hat{H}(\theta_m, t)\,dt$ is many-body [37].
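Numerically, the time-ordered evolution of one quench can be approximated by slicing the interval $[0, T]$ into small steps. The following is a minimal sketch of this idea (our own illustration, not part of the original work), assuming Ĥ_0 and V̂ are supplied as dense Hermitian matrices; the helper name `quench_unitary` is hypothetical:

```python
import numpy as np

def quench_unitary(H0, V, f, T, steps=200):
    """Approximate the time-ordered evolution U(theta_m) of one quench of
    duration T by first-order time slicing of H(t) = H0 + f(t) V.

    H0, V : Hermitian numpy arrays (static and driving Hamiltonians).
    f     : callable, the periodic modulation f(t).
    """
    dt = T / steps
    U = np.eye(H0.shape[0], dtype=complex)
    for k in range(steps):
        t_mid = (k + 0.5) * dt              # midpoint rule for f(t)
        H_t = H0 + f(t_mid) * V
        # exp(-i H_t dt) via eigendecomposition of the Hermitian H_t
        w, P = np.linalg.eigh(H_t)
        U = (P * np.exp(-1j * w * dt)) @ P.conj().T @ U
    return U

# Toy example: a single driven spin with f(t) = -(F/2) cos(w t)
F, w = 2.5, 8.0
H0 = np.array([[1.0, 0.5], [0.5, -1.0]])
V = np.array([[0.0, 1.0], [1.0, 0.0]])
U = quench_unitary(H0, V, lambda t: -F / 2 * np.cos(w * t), T=2 * np.pi / w)
print(np.allclose(U @ U.conj().T, np.eye(2)))  # True: each slice is exactly unitary
```

Each slice is exactly unitary, so the product preserves the norm of the state even when the slicing error in the time ordering is not negligible.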
We can now define the four regimes or 'phases' of Û(θ_m) in the above sense, according to whether the dynamics is thermalized or MBL and whether f(t) is zero or non-zero. To allow non-trivial dynamics within each layer, we require 2π/T to be smaller than a typical energy gap of Ĥ_ave(θ_m). We assume that all Û(θ_m)'s in Û(Θ_M) belong to the same phase for simplicity.
Let us explore the various statistics associated with the four phases, starting with the f(t) = 0 case, in which Ĥ_eff(θ_m) = Ĥ_ave(θ_m). For thermalized dynamics, the statistics of Ĥ_ave(θ_m) follows the Gaussian orthogonal ensemble (GOE) [47]. This is the ensemble of matrices whose entries are independent normal random variables subjected to the orthogonality constraint. This randomness is a signature of quantum chaos, which is a crucial ingredient for thermalization [39]. A large disorder can prevent the system from thermalizing, leading to MBL dynamics. In this case, the eigenenergies of Ĥ_ave(θ_m) follow Poisson (POI) statistics, indicating that they are uncorrelated.
In the driven case, i.e. f(t) ≠ 0, the statistics are defined at the level of the unitary operator Û(θ_m), as it is generally not possible to have access to Ĥ_eff. For driven thermalized dynamics, the statistics of Û(θ_m) follows the circular orthogonal ensemble (COE) [48]. This is the ensemble of matrices whose entries are independent complex normal random variables subjected to the orthogonality and unitarity constraints. Unlike the GOE, the COE is intimately related to the infinite-temperature ensemble and is not possible to obtain without a drive [48]. As before, a large disorder can prevent thermalization even with f(t) ≠ 0, leading to POI statistics of the quasi-energies (to be defined later) [49,50]. A summary of all the statistics is given in Table I.

TABLE I. A summary of statistics, expressibility, and trainability in the four regimes, defined by whether Û(θ_m) is thermalized or MBL and whether f(t) = 0 or f(t) ≠ 0. The symbol '-' indicates that the statistics is not defined.

                                        | Thermal, f=0 | MBL, f=0 | Thermal, f≠0 | MBL, f≠0
Statistics of Ĥ_ave(θ_m)                | GOE          | POI      | -            | -
Statistics of Û(θ_m)                    | -            | -        | COE          | POI
High expressibility (quantum supremacy) | yes          | yes      | yes          | yes
Trainability for generative modeling    | no           | yes      | no           | yes (best)

Driven disordered quantum Ising chains
To illustrate the four generic phases, we work with a specific example of a driven quantum Ising chain,

$$\hat{H}(\theta_m, t) = J\sum_{i=1}^{L-1}\hat{Z}_i\hat{Z}_{i+1} + \sum_{i=1}^{L}\theta_{i,m}\hat{Z}_i + \left[h + f(t)\right]\sum_{i=1}^{L}\hat{X}_i,$$

where f(t) = −(F/2)cos(ωt), ω = 2π/T, θ_m = {θ_{i,m}}_{i=1}^{L}, L is the number of spins, {X̂_i, Ẑ_i} are Pauli operators acting on site i, J is the interaction strength, h is a static magnetic field and F is the driving amplitude. The parameters {θ_{i,m}} are 'varied' by randomly drawing them from a uniform distribution in the range [0, W], where W is the disorder strength. This allows us to vary the parameters without changing the phase of the system. The dimension of the Hilbert space is N = 2^L. The initial state |ψ_0⟩ is prepared as a product state in which each spin points along the +z direction. This simple model has been implemented in various quantum platforms, including Rydberg atoms [51], trapped ions [52] and superconducting circuits [16].
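For concreteness, the static and driving parts of such a disordered Ising chain can be assembled from Kronecker products of Pauli matrices. The sketch below is our own illustration; since the Hamiltonian equation is garbled in the source, the exact placement of the disorder, transverse-field and drive terms is an assumption:

```python
import numpy as np
from functools import reduce

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def site_op(op, i, L):
    """Embed a single-site Pauli operator at site i of an L-spin chain."""
    ops = [I2] * L
    ops[i] = op
    return reduce(np.kron, ops)

def ising_h0(theta, J, h, L):
    """Static part: J sum Z_i Z_{i+1} + sum theta_i Z_i + h sum X_i
    (one plausible reading of the disordered driven Ising chain)."""
    H = sum(J * site_op(Z, i, L) @ site_op(Z, i + 1, L) for i in range(L - 1))
    H += sum(theta[i] * site_op(Z, i, L) for i in range(L))
    H += sum(h * site_op(X, i, L) for i in range(L))
    return H

L, J, h, W = 4, 1.0, 2.5, 1.0
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, W, size=L)            # disorder drawn from [0, W]
H0 = ising_h0(theta, J, h, L)
V = sum(site_op(X, i, L) for i in range(L))    # driving Hamiltonian
print(H0.shape, np.allclose(H0, H0.conj().T))  # (16, 16) True
```

Note that the ZZ interaction does not commute with the transverse drive V, consistent with the requirement [Ĥ_0(θ_m), V̂] ≠ 0 of the general framework.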
The standard way to analyze the statistics of the system is to define the level statistics Pr(r_α) as the normalized distribution of

$$r_\alpha = \frac{\min(\Delta_\alpha, \Delta_{\alpha+1})}{\max(\Delta_\alpha, \Delta_{\alpha+1})},$$

where Δ_α = E_{α+1} − E_α is the level spacing, with E_{α+1} > E_α and α = 1, 2, ..., 2^L − 1. In the f(t) = 0 case, {E_α} are the eigenenergies of Ĥ_ave(θ_m). In the f(t) ≠ 0 case, {E_α} are quasi-energies, defined such that {exp(−iE_α T)} are the eigenvalues of Û(θ_m). Not only is Ĥ_eff ≠ Ĥ_ave in the driven case, but the quasi-energies are also defined only in the limited range E_α T ∈ [0, 2π). This energy folding has a profound impact on the resulting statistics.
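The ratio statistic is straightforward to compute from a sorted spectrum. A small sketch (our own, not from the paper), which also checks the well-known Poisson mean ⟨r⟩ = 2 ln 2 − 1 ≈ 0.386:

```python
import numpy as np

def level_ratios(energies):
    """Ratio statistic r_alpha = min(D_a, D_{a+1}) / max(D_a, D_{a+1}),
    computed from the sorted (quasi-)energies."""
    E = np.sort(np.asarray(energies))
    d = np.diff(E)                                      # level spacings Delta_alpha
    return np.minimum(d[:-1], d[1:]) / np.maximum(d[:-1], d[1:])

# Uncorrelated (Poisson) levels: spacings are i.i.d. exponential,
# giving <r> = 2 ln 2 - 1 ~ 0.386; GOE levels would give <r> ~ 0.53.
rng = np.random.default_rng(1)
poisson_levels = np.cumsum(rng.exponential(size=200_000))
print(round(level_ratios(poisson_levels).mean(), 2))  # ~ 0.39
```

The same function can be applied to the eigenenergies of Ĥ_ave(θ_m) or, in the driven case, to the quasi-energies extracted from the eigenvalue phases of Û(θ_m).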
In Fig. 1, we show the level statistics for F = 0 and F = 2.5J, with W = 1J and W = 20J. For a small disorder W = 1J, the level statistics of Ĥ_ave(θ_m) and Û(θ_m) agree with the predictions from the GOE and the COE, respectively. For a large disorder W = 20J, the level statistics of both Ĥ_ave(θ_m) and Û(θ_m) follow the POI distribution, as expected.

III. EXPRESSIBILITY OF DRIVEN QUANTUM MANY-BODY SYSTEMS
In this section, we show that, given a large number of quenches M, the overall dynamics described by Û(Θ_M) in all four phases is capable of reaching the quantum supremacy regime, implying a high expressibility of our system beyond a classical computer.

Expressibility and quantum supremacy
Expressibility is the term used in machine learning to describe the range of functions that a model can compute [53]. In the context of quantum computing, expressibility relates to how much a quantum system can explore the Hilbert space [54]. For example, product-state ansatzes have a lower expressibility than tensor-network ansatzes, due to their inability to capture entangled states [55].
The concepts of quantum supremacy and expressibility are interconnected. In random quantum circuit proposals for quantum supremacy, a universal set of quantum gates is designed such that the system is chaotic and quickly explores the entire Hilbert space over time [56]. Consequently, it is impossible for a classical computer to efficiently reproduce the output distribution, unless the polynomial hierarchy collapses. Hence, random quantum circuits with L ≳ 100 qubits have a higher expressibility than any possible model implementable on a classical computer.
Let us consider the task of approximating p(z; Θ_M) up to additive error, i.e.

$$\sum_z \left|p(z;\Theta_M) - q(z)\right| \leq \nu,$$

where ν is a positive constant, {z} are output bitstrings measured in the computational basis, p(z; Θ_M) is the exact output probability, and q(z) is the approximated value obtained from a classical/quantum device. In principle, a quantum device can satisfy this condition by directly implementing Û(Θ_M) in the hardware and measuring the output multiple times to construct q(z). To show that a classical computer cannot do the same efficiently unless the polynomial hierarchy collapses, one needs to show that (i) it is #P-hard to approximate p(z; Θ_M) up to multiplicative error [57], i.e.

$$\left|p(z;\Theta_M) - q(z)\right| \leq \delta\, p(z;\Theta_M),$$

and (ii) the output distribution anti-concentrates,

$$\Pr\left[N\,p(z;\Theta_M) > 1\right] \geq \gamma,$$

where δ, γ are some constants. We refer interested readers to Refs. [28,59,60] for the derivation of how these two conditions lead to the proof of quantum supremacy.

Achieving quantum supremacy with quenched quantum many-body systems
The #P-hardness of approximating p(z; Θ_M) up to multiplicative error has been shown (for the worst-case instance) when it results from a unitary evolution that follows the circular unitary ensemble (CUE) statistics [60,61]. The CUE is the ensemble of matrices whose entries are independent complex normal random variables subject to the unitarity constraint [62]. Such statistics can be probed from both the previously defined level statistics Pr(r_α) and the distribution Pr(c), with c = |⟨z|E_α⟩|², of the eigenstates |E_α⟩ of Û(Θ_M). Fig. 2(a) and (b) show the statistics of the eigenstates and the quasi-energies of Û(Θ_M) in the four regimes at M = 400, respectively. It can be seen that in all cases the results match the CUE statistics, indicating the #P-hardness of approximating the resulting p(z; Θ_M) up to multiplicative error. Our finding agrees with Ref. [63], which shows that random quenches in atomic Hubbard and spin models with long-range interactions lead to the n-design property. The n-design ensemble produces the CUE when n → ∞, which happens in the long-time limit [64].
In Fig. 2(c), we plot the Kullback-Leibler (KL) divergence of the output distribution Pr(p) from the Porter-Thomas distribution, Pr_PT(p) = N e^{−Np}. The latter implies that the system explores the entire Hilbert space. (Here, we drop the argument Θ_M for brevity.) The Porter-Thomas distribution satisfies the anti-concentration condition [61]. From Fig. 2(c), it can be seen that the system in all four phases reaches the Porter-Thomas distribution over time, with different timescales. The thermalized case with F = 2.5J reaches it first, at M ∼ 10. The thermalized case with F = 0 and the MBL case with F = 2.5J have a similar convergence rate and saturate at M ∼ 100. The MBL case with F = 0 has the slowest rate and saturates at M ∼ 250. This is expected, as MBL dynamics localizes the system, while the drive F 'heats up' the system, leading to delocalization.
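The anti-concentration of the Porter-Thomas distribution can be illustrated numerically (our sketch, not from the paper): for Haar-random pure states, the measured probabilities follow Pr_PT(p) = N e^{−Np}, for which Pr(Np > 1) = 1/e ≈ 0.37:

```python
import numpy as np

N = 2 ** 9                 # Hilbert-space dimension for L = 9 spins
rng = np.random.default_rng(2)
probs = []
for _ in range(50):
    # A Haar-random pure state is a normalized complex Gaussian vector
    g = rng.normal(size=N) + 1j * rng.normal(size=N)
    psi = g / np.linalg.norm(g)
    probs.append(np.abs(psi) ** 2)
p = np.concatenate(probs)

# Porter-Thomas Pr(p) = N exp(-N p) implies Pr(N p > 1) = 1/e
print(round(np.mean(N * p > 1.0), 2))  # ~ 0.37
```

A constant fraction of bitstrings therefore carries probability above the uniform value 1/N, which is exactly the anti-concentration condition used in the supremacy argument.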
Fig. 2(a)-(c) provide evidence that |ψ(Θ_M)⟩ cannot be efficiently approximated by a classical computer. This suggests that, for a large number of qubits, our system in all phases has a higher expressibility than any classical model.

IV. TRAINABILITY OF DRIVEN ANALOG QUANTUM MANY-BODY SYSTEMS
In the context of machine learning, having a model with large expressibility is necessary but not sufficient, as the model also needs to be trainable. We here address the interplay between expressibility and trainability for the four generic phases of driven analog many-body systems discussed so far. Interestingly, we show that the external drive and the temporal correlations between different quenches in the MBL phase are the key ingredients to combine these two crucial characteristics.

Generative modeling in classical machine learning
As a testbed to analyze the trainability of our model, we solve a generative modeling problem in machine learning [65]. The latter is an unsupervised task, meaning that the training data are unlabelled. The goal is to find the unknown probability distribution, Q(z), underlying the training data. Here, the data is a set of binary vectors {z}_data = {z_1, z_2, ...}. For example, it can represent the opinions of a group of customers on a set of L different products, as depicted in Fig. 3(a). The opinion of customer i is represented by a binary vector z_i = [z_{i1}, z_{i2}, ..., z_{iL}], where z_{ij} = 1 if he/she likes product j and −1 otherwise. Once Q(z) is known, the company can generate new data from this distribution and recommend products with a +1 score to new customers.
In this section we use an artificial dataset as a working example. To ensure the generality of the data, we assume that Q(z) is the Boltzmann distribution of classical Ising spins with all-to-all connectivity, i.e.,

$$Q(z) = \frac{1}{Z}\exp\left(-E(z)/k_B T_0\right),$$

where Z = Σ_z exp(−E(z)/k_B T_0) is the partition function, k_B is the Boltzmann constant, T_0 plays the role of a temperature, and

$$E(z) = \sum_i a_i z_i + \sum_{i<j} b_{ij} z_i z_j,$$

with a_i, b_ij being random numbers between ±J/2. This model is known as the Boltzmann machine, one of the standard types of artificial neural networks used in machine learning, and has been shown to capture a wide range of real-world data [66]. Its quantum version has been studied in [67,68].
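For small L, such a target distribution Q(z) can be generated by exact enumeration of all 2^L spin configurations. A minimal sketch (the helper name `boltzmann_target` and the reconstructed form of E(z) are ours):

```python
import numpy as np
from itertools import product

def boltzmann_target(L, J=1.0, kT=1.0, seed=0):
    """Exact Boltzmann distribution Q(z) of an all-to-all classical Ising
    model E(z) = sum_i a_i z_i + sum_{i<j} b_ij z_i z_j, with a_i, b_ij
    drawn uniformly from [-J/2, J/2].  Small L only: enumerates 2^L states.
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(-J / 2, J / 2, size=L)
    b = rng.uniform(-J / 2, J / 2, size=(L, L))
    states = np.array(list(product([-1, 1], repeat=L)))
    # upper-triangular mask keeps only the i < j couplings
    E = states @ a + np.einsum('ki,ij,kj->k', states, np.triu(b, 1), states)
    Q = np.exp(-E / kT)
    return states, Q / Q.sum()

states, Q = boltzmann_target(L=9)
print(states.shape[0], round(Q.sum(), 6))  # 512 1.0
```

Training samples {z}_data can then be drawn from Q with `rng.choice` over the rows of `states`, mimicking a dataset of customer ratings.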

Sequential training scheme using an analog quantum model
Classically, the distribution of {z}_data can be obtained by first guessing a model P_model(z; Θ), such as the Poisson or the Boltzmann distribution, with some variational parameters Θ. The 'training' is done by minimizing the cost function, which is the KL divergence of P_model(z; Θ) from Q(z), using either gradient descent or gradient-free optimization algorithms. Here, Q is the normalized histogram of {z}_data.
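The cost function can be written compactly. A minimal sketch, assuming both the target Q and the model probabilities are given as arrays over all 2^L bitstrings (`kl_cost` is a hypothetical helper name):

```python
import numpy as np

def kl_cost(Q, p, eps=1e-12):
    """Cost C = KL(Q || p) = sum_z Q(z) log(Q(z) / p(z)).

    Q : target distribution (normalized histogram of the data).
    p : model output distribution; eps guards against zero probabilities.
    """
    Q, p = np.asarray(Q, float), np.asarray(p, float)
    mask = Q > 0                     # terms with Q(z) = 0 contribute nothing
    return float(np.sum(Q[mask] * np.log(Q[mask] / (p[mask] + eps))))

Q = np.array([0.5, 0.25, 0.25, 0.0])
print(round(kl_cost(Q, np.full(4, 0.25)), 3))  # 0.347
```

C vanishes when the model matches the target exactly and grows as the model assigns little probability to bitstrings that are frequent in the data.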
In our case, we show how the distribution of {z}_data can be recovered as the output probability p(z; Θ_M) of the driven quantum Ising chain. This approach is also known as a Born machine [21]. Our goal here is to guide or train the quantum system to a specific point in the Hilbert space such that p(z; Θ_M) = Q(z). Our training protocol, depicted in Fig. 3(b), proceeds as follows:

1. Prepare the initial state |ψ_0⟩ and set m = 0.

2. Evolve the system by one layer, |ψ(Θ_{m+1})⟩ = Û(θ_{m+1})|ψ(Θ_m)⟩ with Θ_{m+1} = {θ_{m+1}} ∪ Θ_m, and then measure p(z; Θ_{m+1}) to compute the cost function C.

3. Repeat step (2) D times with different disorder realizations θ_{m+1}. In the thermalized case, the system will randomly explore the entire Hilbert space in this step. However, in the MBL case, the system will only explore the Hilbert space locally near |ψ(Θ_m)⟩, allowing systematic optimization, see Fig. 3(c).

4. Choose the disorder realization in step (3) that minimizes C, then update m → m + 1. This 'moves' the state in the most promising direction in the Hilbert space.
We note three characteristics of our training protocol. First, it is sequential, since not all parameters in Θ are updated at the same time, making them easier to optimize classically. Second, although the parameters are randomly drawn during the training, our optimization is done systematically in the Hilbert space. This is an important difference from the usual optimization approaches, which operate in the parameter space [22,67]. Third, a large fraction of results is 'thrown away' in step (3). Although in principle these data could be utilized to improve the training efficiency, our goal is to keep the training protocol as simple as possible, so that the focus is on the distinct learning behaviors displayed by each phase.
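The four steps of the protocol can be sketched as a greedy loop. The following is a simplified classical emulation under our own assumptions (not a reproduction of the paper's numerics): a hypothetical `make_layer` callback stands in for one hardware quench, here replaced by a Haar-random unitary for the sake of a runnable toy example:

```python
import numpy as np

def sequential_train(psi0, Q, make_layer, M=50, D=20, seed=0):
    """Greedy sequential training in Hilbert space: at each layer, try D
    random disorder realizations and keep the one whose output
    distribution minimizes the KL cost C = sum_z Q(z) log(Q(z)/p(z))."""
    rng = np.random.default_rng(seed)
    psi, history = psi0.copy(), []
    mask = Q > 0
    for _ in range(M):
        best = None
        for _ in range(D):                 # step 3: D trial quenches
            trial = make_layer(rng) @ psi
            p = np.abs(trial) ** 2
            C = float(np.sum(Q[mask] * np.log(Q[mask] / (p[mask] + 1e-12))))
            if best is None or C < best[0]:
                best = (C, trial)
        C, psi = best                      # step 4: keep the best trial
        history.append(C)
    return psi, history

def random_unitary(rng, N=4):
    """Stand-in for one quench: a Haar-random unitary via QR."""
    G = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
    U, _ = np.linalg.qr(G)
    return U

Q = np.array([0.7, 0.1, 0.1, 0.1])         # toy target distribution
psi0 = np.zeros(4, dtype=complex)
psi0[0] = 1.0                              # step 1: initial product state
psi, hist = sequential_train(psi0, Q, random_unitary, M=5, D=10)
print(len(hist))  # 5
```

With truly random (thermalized) layers, as here, the kept cost shows no systematic improvement from layer to layer; in the MBL case, `make_layer` would instead produce small local moves around the current state, which is what enables the systematic descent discussed below.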

Training results
The training results are shown in Fig. 4(a). As expected, the system in the thermalized phase cannot be trained. The cost function in the thermalized case with F = 2.5J saturates at C ∼ 2 already at the first layer. In the F = 0 case, the cost function starts at around C ∼ 3.5 and then falls to saturate at C ∼ 2, the same value as in the driven case, when M ∼ 50. For the MBL case with F = 0, during M ≲ 10², the cost function steadily decays to 0.7. Then, during 10² ≲ M ≲ 10³, the cost function continues to decay at a slower rate. Interestingly, for M ≳ 10³, the cost function increases and saturates at 0.7 when M ∼ 10⁴. In contrast, for the MBL case with F = 2.5J, the cost function goes down steadily for M ≲ 10. Then, the cost function further decays monotonically at a slower rate, saturating at C ∼ 0.1 at M ∼ 10⁴. These results show that the learning behavior changes qualitatively depending on the phase and the timescale of the system. The best learning accuracy is obtained in the MBL phase with F = 2.5J.
In Fig. 4(b), we plot the final learning results as a function of W for F = 0 and F = 2.5J. For comparison, in Fig. 4(c), we also plot the averaged level-spacing ratio ⟨r_α⟩ as a function of W for both cases. In the F = 2.5J case, the final learning accuracy shows a transition between the trainable and the untrainable regimes, which corresponds roughly to the phase transition between the CUE and the POI statistics. In the F = 0 case, the system moves towards the trainable regime as W approaches 30J. However, we stop our calculation here, as the training takes too long to converge when W > 30J [69]. Nevertheless, our present results are sufficient to conclude that the drive leads to a better learning accuracy on this timescale. We conjecture that, once the system in the undriven case fully reaches the trainable regime, the cost function should decrease monotonically with M until saturation.

Temporal correlations enabled by MBL
To understand the different learning accuracies in the different phases, we calculate the KL divergence between p(z; Θ_M) and p(z; Θ_{M+δm}) to measure the temporal correlations, or 'memory', between outputs at different layers. In Fig. 2(d), we plot this KL divergence as a function of δm, averaged over various M's. In the thermalized phase, we find that there are no temporal correlations between layers. This is expected, as each layer has chaotic dynamics which is highly sensitive to any small changes introduced to the system. In contrast, in the MBL phase, the system displays a short-term memory that decays with δm. The MBL dynamics with f(t) = 0 has the longest memory. This memory is exploited during the training to improve the trainability of the system.

V. CONCLUSIONS
In this work, we have thoroughly analyzed the expressibility and trainability of parameterized analog quantum many-body systems. We showed that both thermalized and MBL dynamics, with and without the modulation f(t), are capable of reaching the quantum supremacy regime, indicating a high expressibility beyond any classical model. In the context of generative modeling, we showed that chaoticity prevents systematic optimization of the system, but that the latter can be qualitatively improved by MBL dynamics. In the future, it would be interesting to analyze the scalability and generalizability of our models, as well as more complex training protocols for efficient optimization.

FIG. 2. Statistics of parameterized analog quantum many-body evolution: (a) and (b) show the eigenstate distribution Pr(Nc) and the level statistics Pr(r_α) for the four phases of Û(θ_m), respectively, with M = 400. The shaded areas are the predictions from the CUE statistics. (c) The KLD of the output distribution from the Porter-Thomas distribution as a function of M. (d) The KLD of p(z; Θ_{m+δm}) from p(z; Θ_m) as a function of δm. The KLD is averaged over M ∈ [378, 400) for a given δm. The thermalized and the MBL phases are obtained with W = 1J and W = 20J, respectively. (L = 9, ω = 8J, h = 2.5J, 500 disorder realizations.)

FIG. 3. Machine learning with a driven analog quantum processor: (a) A table demonstrating a real-world application of generative modeling tasks in machine learning.Each customer is asked to rate whether he/she likes (+1) or dislikes (−1) a given product.(b) A sketch of optimization loops used in the training protocol.(c) A diagram showing the movement of the system in the Hilbert space during the training in the MBL phase.

FIG. 4. Training analog quantum systems in the Hilbert space: (a) The lowest cost function at each training step M for F = 0 and F = 2.5J. The thermalized and the MBL phases are obtained with W = 1J and W = 20J, respectively. The shaded areas represent standard deviations. (b) The cost function at M = 10⁴ as a function of W. The results are averaged over 10 datasets, i.e., 10 realizations of {a_i, b_ij} in Eq. (11). Each dataset consists of 3000 samples. (c) The averaged level-spacing ratio ⟨r_α⟩ at M = 10⁴ as a function of W. (L = 9, ω = 8J, h = 2.5J, k_B T_0 = J and D = 200.)