Interpretable Quantum Advantage in Neural Sequence Learning

Quantum neural networks have been widely studied in recent years, given their potential practical utility and recent results regarding their ability to efficiently express certain classical data. However, analytic results to date rely on assumptions and arguments from complexity theory. Due to this, there is little intuition as to the source of the expressive power of quantum neural networks or for which classes of classical data any advantage can be reasonably expected to hold. Here, we study the relative expressive power between a broad class of neural network sequence models and a class of recurrent models based on Gaussian operations with non-Gaussian measurements. We explicitly show that quantum contextuality is the source of an unconditional memory separation in the expressivity of the two model classes. Additionally, as we are able to pinpoint quantum contextuality as the source of this separation, we use this intuition to study the relative performance of our introduced model on a standard translation data set exhibiting linguistic contextuality. In doing so, we demonstrate that our introduced quantum models are able to outperform state of the art classical models even in practice.


I. INTRODUCTION
The field of quantum information processing has reached a watershed in recent years, with the first demonstrations of quantum processors performing tasks on the verge of classical intractability [1-6]. Spurred on by these recent experimental developments, there has been a push for finding algorithms that can be performed using either near-term quantum devices or early error-corrected ones. Among the leading candidates for such algorithms are quantum machine learning (QML) algorithms, where training can be offloaded to a classical computer working in conjunction with a quantum computer, potentially minimizing the coherence requirements of the quantum device [7-14]. These algorithms are also motivated by the ability of quantum systems to naturally represent complex probability distributions that are believed to be difficult to represent classically [15-18], with many proposed architectures for such quantum models [9, 19-21].
However, any proof of advantage in the expressivity of these models over classical models relies on results from computational complexity theory, themselves conditional on complexity theoretic assumptions [15-18, 22]. As the proofs of separation are abstract, it is unclear on which realistic classical data sets one should expect a separation to hold in practice. Also, due to the universality of many of these models, they are very likely to be untrainable due to phenomena such as barren plateaus [23-26] and bad local minima [27-29] present in their loss landscapes. Because of these concerns, it has become increasingly clear that quantum models should be carefully constructed to fit the task at hand. Above all else, the interpretability of any expressivity separation achieved by a QML model has become increasingly important. Interpretability reveals which features of quantum mechanics yield more expressive models compared to classical models and, armed with this knowledge, allows one to find classes of problems where a practical quantum advantage on real data is potentially achievable.
Wishing to construct a model with an interpretable quantum advantage, we here focus on sequence-to-sequence learning tasks [30], and consider a quantization of linear recurrent neural networks (LRNNs) [31]. Classical LRNNs are recurrent neural networks with only linear activation functions. Such models can equivalently be considered a classical dynamical system governed by quadratic Hamiltonian evolution in the canonical variables $(q, p)$. By lifting these canonical variables to operators $(\hat{q}, \hat{p})$ that satisfy the canonical commutation relations (in units where $\hbar = \frac{1}{2}$):

$$[\hat{q}_j, \hat{p}_k] = \frac{i}{2}\,\delta_{jk}, \tag{1}$$

we arrive at a continuous variable (CV) quantum model where time evolution on an eigenstate of the canonical operators is performed under a quadratic Hamiltonian.
To measure properties of the state of the system, the most natural choice is to perform homodyne measurement; that is, measure linear combinations of the canonical operators $\hat{q}_j$ and $\hat{p}_k$. This yields a quantum generative model where all operations are Gaussian. However, as all operations, initial states, and measurements are Gaussian, there are efficient Wigner function based simulations of sampling from such a system [32]. In other words, such models on $n$ modes are equivalent to deep belief networks [33] (a class of commonly used classical models) with $2n$ latent variables.
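To make the classical simulability of the all-Gaussian model concrete, the following is a minimal sketch (not the authors' implementation) of tracking a Gaussian state by its mean vector and covariance matrix, which together fix its Wigner function. The function names, the $(q_1, \dots, q_n, p_1, \dots, p_n)$ ordering, and the single-mode squeezing example are illustrative assumptions; the vacuum variance $\hbar/2 = 1/4$ follows from the units $\hbar = \frac{1}{2}$ used in the text. The sketch samples a single homodyne outcome and does not implement the post-measurement state update.

```python
import numpy as np

HBAR = 0.5  # units used in the text: [q, p] = i * HBAR

def symplectic_form(n):
    """Symplectic form for the assumed ordering (q_1..q_n, p_1..p_n)."""
    zero, eye = np.zeros((n, n)), np.eye(n)
    return np.block([[zero, eye], [-eye, zero]])

def vacuum(n):
    """Mean vector and covariance matrix of the n-mode vacuum state."""
    return np.zeros(2 * n), (HBAR / 2) * np.eye(2 * n)

def evolve(mean, cov, S):
    """Apply a Gaussian unitary described by a symplectic matrix S."""
    return S @ mean, S @ cov @ S.T

def sample_homodyne(mean, cov, w, rng):
    """Sample one outcome of measuring the quadrature w . (q, p)."""
    return rng.normal(w @ mean, np.sqrt(w @ cov @ w))

# Example: squeeze the position quadrature of a single mode, then measure q.
r = 0.8
S = np.diag([np.exp(-r), np.exp(r)])
mean, cov = evolve(*vacuum(1), S)
outcome = sample_homodyne(mean, cov, np.array([1.0, 0.0]), np.random.default_rng(0))
```

Only the $2n$ means and the $2n \times 2n$ covariance matrix are stored, which is why homodyne statistics of such a system pose no difficulty for a classical model.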
Instead, we extend this model slightly further by allowing for measurements of the canonical operators modulo $2\pi$, beginning in an eigenstate of periodic functions of the canonical operators [34, 35]. We call this introduced class of models contextual recurrent neural networks (CRNNs). Our main result is that CRNNs are more memory efficient at expressing certain distributions than essentially all trainable classical sequence models, even though CRNNs are not universal for CV quantum computation. Concretely, we show unconditionally that there exists a class of CRNNs with $O(n)$ qumodes that can express certain distributions that no "reasonable" (which we later describe) classical model is able to represent without an $\Omega(n^2)$-dimensional latent space. Though this is only a quadratic separation in memory, the time complexity of inference for classical models is typically superlinear in the model size [31, 36-38], yielding a superquadratic time separation. As we show a memory (rather than a time) separation, our results also potentially point to a practical generalization advantage for CRNNs, as smaller models tend to generalize better than larger models due to formalized versions of Occam's razor [39].

FIG. 1. (a) An online neural sequence model. The model autoregressively takes input tokens $x_i$ and outputs decoded tokens $y_i$ with the map $F_i$. The model also has an unobserved internal memory with state $\lambda_{i-1} \in \mathcal{L}$ that $F_i$ can depend on. When the model is quantized to a CRNN, the $n$-dimensional space of $\lambda_i$ is promoted to the Hilbert space of $n$ qumode states $|\lambda_i\rangle$. (b) An implementation of a phase estimation circuit for CV Pauli operators, which forms the recurrent cell of the CRNN we use to prove our separations. Here, $|a\rangle$ is a fixed ancilla state. Formally, if $|a\rangle$ is a GKP state, this circuit allows for infinite precision measurements. In practice, $|a\rangle$ can be a tensor product of a constant number of qubit $|+\rangle$ states for finite precision phase estimation.
Moreover, we are able to show directly that this quantum advantage is due to quantum contextuality [35, 40-43] present in our quantum model. Previously, quantum contextuality was known to be the resource for the expressive power of a certain class of quantized Bayesian networks [44]. Our results show that this resource can be used to separate quantum models even from neural networks, which are exponentially more efficient than generic Bayesian networks. Intuitively, quantum contextuality is the statement that quantum measurement results depend on which measurements were previously performed, even if the measurements in question commute. In other words, quantum contextuality is the statement that the measurement of quantum observables cannot be thought of as the revealing of preexisting classical values for the observables. Here, we give a proof of the intuition that reasonable classical models cannot get around the need to "memorize" the measurement context of given observables, which is what yields the quadratic memory separation between the quantum and classical models.
Qualitatively, quantum contextuality is similar to the linguistic contextuality present in sentences. Namely, the meaning of a given word in a sentence depends heavily on other words in the sentence, and without this context has no fixed, single meaning. Inspired by this, we also test our constructed model against state of the art classical models on a real-world translation task. In particular, we evaluate the performance of an LRNN [45], an RNN with gated recurrent units (GRU RNN) [37], a Transformer [38], a Gaussian model, and our introduced contextual model on a standard Spanish-to-English data set [46]. We show that our introduced contextual model achieves better translation performance compared to all other models at each model size we consider. This separation holds even when the online models are constrained to have a similar (and where possible, the same) number of trainable parameters in each recurrent cell.
Our methods provide a novel strategy for designing QML models for near-term devices: through the quantization of simple classical machine learning models with some minimal quantum extension. Though such models are most likely unable to outperform state of the art classical machine learning models on all tasks, the intuition gleaned from the simplicity of the quantum models gives guidance as to which problems the quantum models may outperform classical models on. Furthermore, the simplicity of the quantum models may circumvent the recent deluge of untrainability results of general quantum models [23-29]. Finally, as such models are restricted in their allowed operations, they are more amenable to implementation on near-term quantum devices than completely generic quantum models.

A. Classical Sequence Learning
Sequence-to-sequence or sequence learning [30] is the approximation of some given conditional distribution $p(y \mid x)$ with a model distribution $q(y \mid x)$. This framework encompasses sentence translation tasks [30], speech recognition [47], image captioning [48], and many more practical problems.
Sequence modeling today is typically performed using neural network based generative models, or neural sequence models. Generally, these models are parameterized functions that take as input the sequence $x$ and output a sample from the conditional distribution $q(y \mid x)$. The parameters of these functions are trained to minimize an appropriate loss function, such as the (forward) empirical cross entropy:

$$\hat{H}(p \,\|\, q) = -\frac{1}{|T|} \sum_{(x_i, y_i) \in T} \ln q(y_i \mid x_i), \tag{2}$$

where $T = \{(x_i, y_i)\}$ are samples from $p(x, y)$. The backward empirical cross entropy is similarly defined, with $p \leftrightarrow q$. Note that a model with support on an incorrect translation (i.e. $q \neq 0$, $p = 0$) yields an infinite backward cross entropy, and a model failing to have support on a correct translation (i.e. $p \neq 0$, $q = 0$) yields an infinite forward cross entropy.
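As a small illustration of how a missing translation drives the forward cross entropy to infinity, the sketch below assumes the standard form $-\frac{1}{|T|}\sum_{(x_i, y_i) \in T} \ln q(y_i \mid x_i)$ for the empirical cross entropy. The function name, the callable `q`, and the two-word toy vocabulary are hypothetical and chosen only for this example.

```python
import math

def forward_cross_entropy(samples, q):
    """Forward empirical cross entropy of a model q on pairs (x, y) drawn from p."""
    total = 0.0
    for x, y in samples:
        prob = q(x, y)  # model probability q(y | x)
        if prob == 0.0:
            return math.inf  # the model has no support on an observed pair
        total -= math.log(prob)
    return total / len(samples)

# Toy model over a hypothetical two-word vocabulary.
table = {("hola", "hello"): 0.9, ("hola", "bye"): 0.1, ("adios", "bye"): 1.0}
q = lambda x, y: table.get((x, y), 0.0)

good = forward_cross_entropy([("hola", "hello"), ("adios", "bye")], q)
bad = forward_cross_entropy([("adios", "hello")], q)  # q("hello" | "adios") = 0
```

The backward quantity is obtained by exchanging the roles of the two distributions, so it diverges instead when the model places weight on a pair that the true distribution never produces.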
To maintain a resource scaling independent of the input sequence length, neural sequence models usually fall into one of two classes: online sequence models (also known as autoregressive sequence models) [31,36,37], or encoder-decoder models (which include state of the art sequence learning architectures, such as Transformers) [30,38].We focus on online models here, and discuss encoder-decoder models in more detail in Appendix A.
In online models, input tokens $x_i$ are translated in sequence to output tokens $y_i$ via functions $F_i$. An unobserved internal memory (or latent space) $\mathcal{L}$ shared between time steps allows the model to represent long-range correlations in the data. A diagram of the general form of online models is given in Fig. 1(a). Generally, there are no restrictions on the forms of $F_i$, though most neural sequence models are composed of simple smooth (or almost everywhere smooth) functions out of training considerations [31, 36-38]. Here, we generalize from the typical smoothness constraints and consider locally Lipschitz maps.
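The control flow of an online model can be sketched as follows. This is a minimal illustration rather than any architecture used in the paper: the helper names, the matrices `A`, `B`, `C`, and the dimensions are assumptions, and the cell shown is a linear recurrent cell with no nonlinearity, with the same cell reused at every step.

```python
import numpy as np

def run_online_model(cell, xs, lam0):
    """Feed tokens x_i one at a time, threading the latent state through the cell."""
    lam, ys = lam0, []
    for x in xs:
        y, lam = cell(x, lam)  # F_i: (x_i, lambda_{i-1}) -> (y_i, lambda_i)
        ys.append(y)
    return ys, lam

def make_linear_cell(A, B, C):
    """A linear recurrent cell: the latent update and readout are both linear."""
    def cell(x, lam):
        lam_next = A @ lam + B @ x
        return C @ lam_next, lam_next
    return cell

rng = np.random.default_rng(0)
n, d = 4, 3  # latent dimension and token embedding dimension (illustrative)
cell = make_linear_cell(rng.normal(size=(n, n)),
                        rng.normal(size=(n, d)),
                        rng.normal(size=(d, n)))
xs = [rng.normal(size=d) for _ in range(5)]
ys, lam_final = run_online_model(cell, xs, np.zeros(n))
```

The memory used by the model is the size of `lam`, independent of the sequence length, which is the resource the separations in this work are stated in terms of.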
Assuming the codomain of $F_i$ is $\mathbb{R}^m$, all maps that are almost everywhere differentiable with locally bounded Jacobian norm are locally Lipschitz [49]. Realistically, then, locally Lipschitz models can be thought of as all models trainable using gradient based methods. Equivalently, they can be thought of as models not infinitely sensitive to infinitesimal changes in their inputs. This includes all models with standard nonlinearities, including those with ReLU, hyperbolic tangent, and sigmoid activation functions. Note that this condition is much weaker than a globally Lipschitz constraint. We give a formal definition of local Lipschitzness in Appendix A.
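A quick numerical illustration of the distinction, under the assumption that a finite-difference slope on a fine grid is an adequate proxy: for tanh and ReLU the largest slope near the origin stays bounded as the neighborhood shrinks, whereas for $\sqrt{|x|}$ (which is continuous but not locally Lipschitz at 0) it grows without bound. The helper name and grid size are illustrative choices.

```python
import numpy as np

def local_slope_bound(f, x0, radius, num=2001):
    """Largest finite-difference slope of f on a grid in [x0 - radius, x0 + radius].

    This is a lower bound on the best Lipschitz constant of f on that interval.
    """
    xs = np.linspace(x0 - radius, x0 + radius, num)
    return np.max(np.abs(np.diff(f(xs))) / np.diff(xs))

relu = lambda x: np.maximum(x, 0.0)
root = lambda x: np.sqrt(np.abs(x))  # continuous, but not locally Lipschitz at 0

radii = [1e-1, 1e-2, 1e-3]
tanh_slopes = [local_slope_bound(np.tanh, 0.0, r) for r in radii]
relu_slopes = [local_slope_bound(relu, 0.0, r) for r in radii]
root_slopes = [local_slope_bound(root, 0.0, r) for r in radii]
```

A bounded slope in every sufficiently small neighborhood is exactly what the locally Lipschitz condition asks for; no uniform bound over the whole domain is required.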
Though neural networks are often described as functions of real-valued inputs, in practice they are implemented at finite precision. We emphasize that where we analytically consider such networks here (such as in Sec. III) we consider the formal description of neural networks, which assumes infinite precision. Our numerical experiments in Sec. IV, however, give evidence that our analytic results also hold in the finite precision regime. We discuss this in more detail in Appendix C.

TABLE I. An example of CV quantum contextuality using a Mermin-Peres magic square [43], with CV Pauli operators $\hat{X}_i(a)$, $\hat{Z}_i(a)$ generated by $-2ia\hat{p}_i$, $2ia\hat{q}_i$, respectively. For any real $\alpha \neq 0$, all operators in each row and column commute. Additionally, the product of each row and column is the identity operator, except for the final column, which gives $-1$. Thus, definite classical values cannot be assigned to each operator without yielding a contradiction.

B. Contextual Recurrent Neural Networks
We now consider a quantization of a simple online model. Generally, online models can be interpreted as a classical dynamical process, where queries $x_i$ are made to a physical system described by the latent state $\lambda_{i-1}$, yielding a result $y_i$ and transforming the latent state $\lambda_{i-1} \to \lambda_i$ (see Fig. 1(a)). For linear recurrent neural networks (LRNNs), this can be interpreted as the physical process of querying properties of an underlying system described by $\lambda_i$ undergoing Hamiltonian evolution under a quadratic Hamiltonian; this can be seen straightforwardly from Hamilton's equations and the linearity of the model. When quantizing the canonical position and momentum variables to operators satisfying the canonical commutation relations, such a model can then be interpreted as performing sequential measurements on a system undergoing evolution via Gaussian operations. When these measurements are restricted to homodyne measurements and all inputs are Gaussian states, this process can be simulated classically with memory linear in the number of modes of the Gaussian system [32]. We minimally extend this, and allow for non-Gaussian measurements. In particular, we are here interested in measuring via phase estimation the CV analogues of the Pauli operators [50] (in units where $\hbar = \frac{1}{2}$):

$$\hat{X}_i(a) = e^{-2ia\hat{p}_i}, \qquad \hat{Z}_i(a) = e^{2ia\hat{q}_i}. \tag{3}$$

We also promote the initial state of the network to a GKP state [34], which is an eigenstate of CV Pauli operators. We call a recurrent online model beginning in a GKP state, with a cell that takes as input $x_i$ a description of a CV Pauli operator and returns its measurement result $y_i$, a contextual recurrent neural network (CRNN). This measurement can formally be performed at infinite precision using Gaussian operations and homodyne measurement with fixed ancilla GKP states [34, 51]. A circuit description of this is given in Fig. 1(b), where $|a\rangle$ is a uniform superposition over squeezed states $|s\rangle$ with $\hat{q}|s\rangle = q|s\rangle$, where $q \equiv 0 \pmod{2\pi}$. When performed sequentially on an initial GKP state, these measurements are what we consider when we compare in Sec. III CRNNs against the infinite precision classical neural networks described in Sec. II A. In this scenario, the model is not universal for CV quantum computation, even when additional Gaussian operations within the latent space are added [52]. Counterintuitively, when the initial state is the vacuum state or a finitely squeezed GKP state, the model is universal [52, 53]; this suggests a potential superpolynomial advantage in the expressive power and the time complexity of inference when implemented at finite precision. We discuss this in more detail in Appendix C.
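As a worked check of when two CV Pauli operators commute, the short derivation below gives the phase picked up on exchanging them. It is a standard Baker-Campbell-Hausdorff computation, written under the assumption that $\hat{X}_i(a) = e^{-2ia\hat{p}_i}$ and $\hat{Z}_i(b) = e^{2ib\hat{q}_i}$ with $[\hat{q}_j, \hat{p}_k] = \frac{i}{2}\delta_{jk}$.

```latex
% For A = 2ib \hat{q}_i and B = -2ia \hat{p}_i, the commutator [A, B] is a scalar:
[A, B] = (2ib)(-2ia)\,[\hat{q}_i, \hat{p}_i] = 4ab \cdot \tfrac{i}{2} = 2iab .
% Since [A, B] commutes with both A and B, e^{A} e^{B} = e^{[A, B]} e^{B} e^{A}, so
\hat{Z}_i(b)\,\hat{X}_i(a) = e^{2iab}\,\hat{X}_i(a)\,\hat{Z}_i(b) .
% Hence \hat{X}_i(a) and \hat{Z}_i(b) commute iff ab \equiv 0 \pmod{\pi},
% anticommute iff ab \equiv \pi/2 \pmod{\pi},
% and operators acting on different modes always commute.
```

The appearance of the phase $e^{2iab}$ is why commutation conditions for these operators are naturally stated modulo $\pi$.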
Just as in the classical case, one can consider a finite precision approximation of these measurements. In this scenario, phase estimation using ancilla qubits can be performed for each measurement [54]. We discuss proposals for the experimental implementation of such a measurement in Appendix C. In general, parameterized Gaussian operations can be included within each recurrent cell to yield a trainable CRNN. This is a special case of the CV neural networks considered in [14], which also considered the training of such networks. For our expressivity separations, however, we consider the fixed CRNN instance given in Fig. 1(b).
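To give a feel for what finite precision means here, the following sketch computes the outcome distribution of textbook quantum phase estimation with $t$ ancilla qubits, assuming the measured state is an eigenstate with eigenvalue $e^{2\pi i \varphi}$. This is the generic textbook formula rather than the specific circuit of Fig. 1(b), and the function name and example values are illustrative assumptions.

```python
import numpy as np

def qpe_outcome_distribution(phi, t):
    """Outcome probabilities of textbook phase estimation with t ancilla qubits.

    The measured state is assumed to be an eigenstate with eigenvalue
    exp(2*pi*i*phi); outcome k corresponds to the estimate k / 2**t.
    """
    N = 2 ** t
    js = np.arange(N)
    ks = np.arange(N)
    # Amplitude of outcome k: (1/N) * sum_j exp(2*pi*i * j * (phi - k/N)).
    amps = np.exp(2j * np.pi * np.outer(phi - ks / N, js)).sum(axis=1) / N
    return np.abs(amps) ** 2

exact = qpe_outcome_distribution(phi=3 / 8, t=3)  # phi is representable with 3 bits
approx = qpe_outcome_distribution(phi=0.3, t=5)   # phi is not representable with 5 bits
best_estimate = np.argmax(approx) / 2 ** 5
```

When the phase is exactly representable with $t$ bits the outcome is deterministic; otherwise the distribution concentrates on the nearest $t$-bit values, so each additional ancilla qubit refines the resolution of the measured phase by a factor of two.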
For our purposes, these measurements are important as CV Pauli operators exhibit quantum contextuality [35], in complete analogy with the contextuality present in qubit Pauli operators [43]. Quantum contextuality is the statement that no definite classical values can be assigned to quantum operators, even when the measured operators in any given measurement scenario commute. For an example of this phenomenon, see Table I; there is no consistent assignment of classical values to each operator in the table for any real $\alpha \neq 0$.
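The magic-square constraints can be checked numerically. The sketch below represents a two-mode operator $e^{i\varphi}\hat{X}(a)\hat{Z}(b)$ by the triple $(a, b, \varphi)$ and multiplies such operators using the reordering rule $\hat{Z}_i(b)\hat{X}_i(a) = e^{2iab}\hat{X}_i(a)\hat{Z}_i(b)$, assuming $\hat{X}_i(a) = e^{-2ia\hat{p}_i}$, $\hat{Z}_i(b) = e^{2ib\hat{q}_i}$, and $[\hat{q}, \hat{p}] = \frac{i}{2}$. The class `Weyl`, the value of `alpha`, and the particular nine entries are one possible choice constructed for illustration; they are not claimed to match the entries of Table I.

```python
import numpy as np

class Weyl:
    """The operator exp(i*phase) * X(a) * Z(b) on n modes."""

    def __init__(self, a, b, phase=0.0):
        self.a, self.b, self.phase = np.asarray(a, float), np.asarray(b, float), phase

    def __matmul__(self, other):
        # Reordering uses Z(b) X(a') = exp(2i a'.b) X(a') Z(b).
        phase = self.phase + other.phase + 2 * self.b @ other.a
        return Weyl(self.a + other.a, self.b + other.b, phase)

    def inv(self):
        return Weyl(-self.a, -self.b, -self.phase + 2 * self.a @ self.b)

    def scalar(self):
        """Return c if the operator equals c * identity, else None."""
        if np.allclose(self.a, 0) and np.allclose(self.b, 0):
            return np.exp(1j * self.phase)
        return None

def commute(u, v):
    """True if the two operators commute."""
    return np.isclose(np.exp(2j * (u.b @ v.a - v.b @ u.a)), 1)

alpha = 0.7                 # any nonzero real value works in this construction
beta = np.pi / (2 * alpha)  # chosen so that X_i(alpha) and Z_i(beta) anticommute
X = lambda a1, a2: Weyl([a1, a2], [0, 0])
Z = lambda b1, b2: Weyl([0, 0], [b1, b2])

row1 = [X(alpha, 0), X(0, alpha), X(-alpha, -alpha)]
row2 = [Z(0, beta), Z(beta, 0), Z(-beta, -beta)]
# Third row: chosen so that the first two columns and the third row multiply to 1.
row3 = [(row1[0] @ row2[0]).inv(), (row1[1] @ row2[1]).inv()]
row3.append((row3[0] @ row3[1]).inv())
square = [row1, row2, row3]

rows = [r[0] @ r[1] @ r[2] for r in square]
cols = [square[0][c] @ square[1][c] @ square[2][c] for c in range(3)]
```

In this construction every row and the first two columns multiply to the identity while the final column multiplies to $-1$, and all operators sharing a row or column commute. Multiplying the three row constraints and the three column constraints therefore gives contradictory requirements on any fixed assignment of values to the nine operators.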

C. Stabilizer Measurement Translation
We now focus on a classical sequence learning task that is naturally performed by the introduced CRNN. In particular, we consider the $(k, n)$ stabilizer measurement translation task, parameterized by $k$ and $n$. Leaving the formal definition for Appendix B, we give an informal definition here. We use the terminology of Fig. 1 for clarity.
Definition 1 ($(k, n)$ stabilizer measurement translation task, informal). Given a length-$k$ sequence of classical descriptions $x_i$ of CV Pauli operators on $n$ modes, output a sequence of measurement outcomes $y_i$ that is consistent with measuring these operators sequentially on a fixed GKP state $|\lambda_0\rangle$.
As described in Sec. II B, such measurement sequences can display nontrivial correlations due to quantum contextuality. Note that this task is distinct from the measurement of position and momentum operators. Here, we require the measurement of linear combinations of position and momentum operators modulo $2\pi$, as we are measuring the phases of operators generated by position and momentum. This can be done using the CRNN cell described in Sec. II B. We consider in Appendix B a slight generalization of this task, though here we consider Definition 1 with its fixed GKP initial state for simplicity.

FIG. 2. (a) The setting in which the dimension of the latent space $\mathcal{L}$ is less than $\frac{n(n-3)}{2}$, where $n$ is the number of modes in the stabilizer measurement translation task. We show that when this is the case, in the neighborhood of some input, only a subspace of inputs (gray) of the same dimension as $\mathcal{L}$ is mapped injectively. (b) A sketch of the space of inputs, with fibers locally induced by the model. The base manifold is mapped injectively to $\mathcal{L}$. All points on a fiber (e.g. $|\psi_1\rangle$, $|\psi_2\rangle$) map to the same point as their base point (e.g. $|\psi_0\rangle$) in $\mathcal{L}$. When the dimension of the fiber is large enough, we show that these states have contextual stabilizers. We then show that this implies that the states have a single-shot distinguishing measurement sequence.

III. BOUNDS ON STABILIZER MEASUREMENT TRANSLATION
We now give statements and proof sketches of our main results, which are lower bounds on the performance of classical models in performing the stabilizer measurement translation task described in Sec. II C. This will give an expressivity separation between the classical and quantum sequence models.
For discrete models, quantum contextuality was the key resource for showing a separation in expressivity between classical and quantum models [44]. Using different proof techniques, we here show that quantum contextuality is also the resource giving the separation between continuous classical and quantum models with infinite dimensional Hilbert spaces. To do this, we specialize to two classes of models: online neural sequence models, and encoder-decoder models (which include state of the art models such as seq2seq models [30] and Transformers [38]). Here, we focus on the memory separation between CRNNs and classical online neural sequence models, and discuss a similar separation against encoder-decoder models in Appendix B. We also formulate there a general statement on the classical efficiency of simulating CV Pauli measurements on an initial GKP state, similar in spirit to the fact that the Gottesman-Knill algorithm [55, 56] is optimal for qubit stabilizer simulation [57].
Our main result can be informally stated as the following Theorem (with the full statement and proof left to Appendix B). Note, as discussed in Sec. II C, that a CRNN can perform the stabilizer measurement translation task with $n$ qumodes of memory.
Theorem 1 (Online stabilizer measurement translation memory lower bound, informal). Consider a locally Lipschitz online model with latent space $\mathcal{L}$. If $\dim(\mathcal{L}) < \frac{n(n-3)}{2}$, this model cannot achieve a finite backward cross entropy on the $(n + 2, n)$ stabilizer measurement translation task.
Proof sketch. The strategy of our proof is to show that, when the dimension of $\mathcal{L}$ is less than $\frac{n(n-3)}{2}$, the model must map an embedded submanifold $K$ of the space of the first $n$ inputs to the same point in $\mathcal{L}$; in other words, the model loses the ability to distinguish between inputs in $K$. The nontrivial aspect of this proof is to demonstrate that such a $K$ exists, where distinct points in $K$ yield different translations. As the model is unable to distinguish between points in $K$, this then demonstrates that the model will get a translation incorrect, corresponding to an infinite backward cross entropy on the stabilizer measurement translation task. This is equivalent to demonstrating that the quantum mechanical processes being described by points in $K$ yield quantum states that are single-shot distinguishable.
To demonstrate that such a $K$ exists, we use the local Lipschitzness of the model and the constant rank theorem [58]. This then implies that the map $F$ (given by the $n$-fold composition of the $F_i$ as shown in Fig. 1(a)) locally induces a fiber bundle on the input space (as shown in Fig. 2), where $F$ can be considered a projection onto the base manifold of this induced fiber bundle. We will slightly abuse notation in the remainder of this proof sketch, and conflate the first $n$ inputs (and their associated outputs) with the quantum state that arises from this measurement sequence.
We consider a fiber of this fiber bundle, with the goal of proving that there exist points in this fiber that are single-shot distinguishable. We show in Appendix B that the dimension of this fiber is large enough such that there exist three states in this fiber with stabilizers which exhibit quantum contextuality. We claim (and prove in Appendix B) that due to the presence of quantum contextuality in these stabilizers, these states have a distinguishing measurement sequence of length two. When performing this distinguishing measurement sequence, then, the model must give the incorrect measurement results for one of the three states, giving the lower bound on classical simulation. It is easy to see from Eq. (2) that this yields an infinite backward cross entropy when these sequences are in the data set $T$. As a simple example of this phenomenon, assume that three states $|\psi_1\rangle$, $|\psi_2\rangle$, $|\psi_3\rangle$ with classical representations in the same fiber are stabilized respectively by the rows of Table I for some real $\alpha \neq 0$. As [...] upon measuring this operator, the measurement result is constrained to be 1 for the simulation to be accurate. The post-measurement state of $|\psi_1\rangle$ is then stabilized by $\hat{X}_1(\alpha)\hat{X}_2(\alpha)$ and $\hat{Z}_1$ [...] gives an incorrect translation for one of these states. This measurement sequence is single-shot, as only a single copy of the state being measured is used.
Our results show that there is a general $n$ versus $\Omega(n^2)$ bound in the memory requirements of contextual and classical models performing the stabilizer measurement translation task. In practice, this can yield an even greater separation in time complexity for given implementations of these models, as the time complexity of inference using classical models is typically superlinear in the model size. We discuss this in more detail in Appendix E 3.
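For concreteness, the memory gap implied by the informal bound $\dim(\mathcal{L}) < \frac{n(n-3)}{2}$ can be tabulated directly. The helper below is purely illustrative: its name is an assumption, and it evaluates only the threshold of the informal statement, not any constant from the full theorem in Appendix B.

```python
def classical_latent_dim_threshold(n):
    """Smallest latent dimension not excluded by the bound dim(L) < n(n-3)/2.

    n(n-3) is always even, so the integer division below is exact.
    """
    return max(0, n * (n - 3) // 2)

# Compare with the n qumodes used by the CRNN on the same task.
for n in (4, 16, 64, 256):
    print(n, classical_latent_dim_threshold(n))
```

The threshold exceeds the $n$ qumodes of the CRNN only once $n > 5$, after which the gap grows quadratically.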

IV. NUMERICAL EXPERIMENTS
We now showcase the practical benefit of finding an interpretable advantage in the expressivity of our quantum model. Namely, it is able to give us intuition as to which data sets (beyond the constructed data set used in our proof) a CRNN may outperform classical machine learning models on. As previously discussed, the contextuality present in quantum operators behaves qualitatively similarly to the linguistic contextuality present in language. That is, words can have one of many meanings, and their exact definition only becomes apparent when considering their context in a sequence. This is important for translation tasks, where different meanings of a single word in one language have different translations in other languages.
To explore this intuition, we consider the application of a CRNN on a standard Spanish-to-English translation data set [46], with trainable Gaussian interactions within each recurrent cell. We also consider the performance of GRUs [37] in a seq2seq learning framework [30], and Gaussian models (with Gaussian measurements). Details of our numerical simulations for all of the models we consider are given in Appendix E, along with details of the architectures. We also discuss in Appendices D and E a $\Theta(n^2)$ memory classical simulation of CRNNs with $n$ latent modes on a restricted space of Gaussian operations, which is what we use in our numerical simulations. The Gaussian and contextual models were constrained to have exactly the same number of trainable parameters, and each recurrent cell of the GRU had a parameter count within 2.5% of those of the quantum models at the largest model size considered. In Fig. 3, we plot the final training performance of all of our models. It is easy to see that the contextual model outperforms all models under consideration in forward empirical cross entropy at a wide range of model dimensions $n$. Random samples of translation results after training are shown in Table II.
We also compared the performance of CRNNs against classical linear RNNs (LRNNs) [45] and Transformers [38]. We found that CRNNs substantially outperformed LRNNs. Though the memory requirements of Transformers grow with the length of the input (making direct comparisons against CRNNs difficult), we found that the performance of the largest CRNNs we considered was roughly matched by Transformers with quadratically more memory. We give details of these results in Appendix F.

V. OUTLOOK
Our results pinpoint quantum contextuality as a resource that can be used to enhance traditional machine learning models. We achieved this by constructing a sequence learning task parameterized by $n$ that a contextual quantum model (a CRNN) of size $n$ is able to model, yet provably no classical neural networks of size $o(n^2)$ can model due to their noncontextuality. To our knowledge, this is the first unconditional proof of an expressivity separation between a quantum neural network and classical neural networks on classical data. By explicitly demonstrating that quantum contextuality is the source of this advantage, we are also able to provide intuition as to which classes of problems CRNNs are able to outperform traditional machine learning models in solving. Our numerics confirm the intuition that CRNNs perform extremely well on problems exhibiting linguistic contextuality, such as the Spanish-to-English translation task we consider here.
The simple structure of CRNNs also allows (finite precision approximations of) them to be more amenable to potential experimental implementations when compared with completely general quantum architectures. In particular, all operations in the contextual model are Gaussian, up to the requirement for interactions with fixed ancilla states to perform the required non-Gaussian measurements. The restricted nature of the model may also circumvent the poor training landscapes of generic quantum neural networks [23-29], though we leave further investigation of this to future work.
We believe that the specifics of the CRNN architecture can be relaxed somewhat. Due to recent results linking non-Gaussian operations to quantum contextuality [59, 60], we suspect any non-Gaussian measurement would make a suitable replacement for the stabilizer measurements we consider here for technical reasons. We also suspect that the technical requirement that the measurements be made with infinite precision is an artifact of the nature of our proof, which compares the quantum architecture with infinite precision classical models. We believe that in practice, performing phase estimation using ancilla qubits instead of GKP (or other non-Gaussian CV) states is all that is necessary for a practical separation. In fact, such a finite precision implementation may counterintuitively yield a larger quantum advantage, as our architecture implemented with a finitely squeezed initial Gaussian state is universal for CV quantum computation [52, 53]. We discuss these two points in more detail in Appendix C.
CRNNs demonstrate that even the quantization of a very simple class of classical architectures (here, the class of LRNNs) is able to outperform a wide range of classical models on certain tasks, even if the classical models are much more powerful than LRNNs. We leave for future work the quantization of more powerful classical architectures, which may achieve a practical quantum advantage on a wider variety of tasks than we consider here.

[...] representation $\lambda$ of $x$. When the encoder map is trivial (i.e. when $\mathcal{L}$ is congruent to the input space and $E$ is the identity), then no compression occurs, and the model is equivalent to a general representation of $p(y \mid x)$ given by $D$.
Generally, there are no restrictions on the forms of $F_i$, $E$, or $D$, though most neural sequence models are composed of simple smooth (or almost everywhere smooth) functions out of training considerations [31, 36-38]. Here, we generalize from the typical smoothness constraints and consider locally Lipschitz maps. A function $F: \mathcal{K} \to \mathcal{L}$ for metric spaces $\mathcal{K}$, $\mathcal{L}$ is locally Lipschitz if, for all $x \in \mathcal{K}$:

$$d_{\mathcal{L}}(F(x), F(y)) \leq C_x\, d_{\mathcal{K}}(x, y)$$

for all $y$ in some neighborhood of $x$, where $C_x$ is constant in some neighborhood of $x$. Here, $d_S$ is the distance function on the metric space $S$.
All practical neural sequence models are locally Lipschitz. Indeed, assuming $\mathcal{L} = \mathbb{R}^m$, all maps that are almost everywhere differentiable with locally bounded Jacobian norm are locally Lipschitz [49]. Realistically, then, locally Lipschitz models can be thought of as all models trainable using gradient based methods; equivalently, they can be thought of as models trainable via methods not arbitrarily sensitive to local noise.

[...] the main text will, here, be referred to as the $(2, n)$ stabilizer measurement translation task. We make this change as the formal definition of this task as presented here is undefined when $k + n < n$. Second, we now set the first $n$ measurements to be infinite precision Gaussian measurements, i.e. measurements of linear combinations of position and momentum operators. These measurements are a limit of the periodic measurements we consider in the main text, with infinitely large periods.
To be more explicit, we consider an input language given by $(n + k)$-long sequences of linear combinations of position and momentum operators on $n$ modes. Specifically, input sentences are composed of words which are of the form of rows of: [...] when beginning in some given fixed state $|\psi_0\rangle$ on $n$ modes that is either a GKP state [34] or an infinitely squeezed Gaussian state, which maintains the nonuniversality of the model [52]. The final $k$ rows describe the sequential measurement of each operator $\hat{s}_i = \exp(\ldots)$ via e.g. phase estimation, as shown in Fig. 1(b). Note that the measurement of $\hat{s}_i$ is not equivalent to the measurement of its generator. For instance, given states such that: [...] where the measurement outcomes $m_i$ are consistent with those of quantum mechanics. To prove our separations, we will consider input sentences that exhibit quantum contextuality.

FIG. 5. An example of two graphs, with edges leaving vertex $i$ given by $e_i$ and $e_i'$, respectively. As the two graphs differ, they must differ by an edge; indeed, they differ in the weights of both $\langle i, j \rangle$ and $\langle i, k \rangle$.
with distinct (modulo π) adjacency matrices. There exist operators that stabilize these three states that exhibit quantum contextuality. Furthermore, there exists a distinguishing measurement given by one of the stabilizers of |0⟩_q that maps |ψ_1⟩ and |ψ_2⟩ to orthogonal post-measurement states when the measurement result is 1; in other words, there exists a distinguishing measurement sequence of length two that distinguishes these three states.
Proof. As |ψ_1⟩ and |ψ_2⟩ are CV graph states with distinct adjacency matrices (modulo π), they must differ (modulo π) in the edges e_i touching some vertex i. That is, |ψ_1⟩ is stabilized by some: and |ψ_2⟩ by some: where e_i and e′_i differ (modulo π) in some element indexed by j ≠ i (as there are no loops in either graph), and where 2θ, 2θ′ are phases. Here, Z(•) is defined as the tensor product: An example diagram of the entries of e_i, e′_i is given in Fig. 5. By the symmetry of CV graph state adjacency matrices, we thus have that the former is also stabilized by and the latter by where e_j and e′_j differ (modulo π) in some element indexed by i, and where 2φ, 2φ′ are phases. Note in particular that we have the commutation relations

$[s_i, s'_i] = 0, \quad [s_j, s'_j] = 0, \quad [s_i, s_j] = 0, \quad [s'_i, s'_j] = 0,$ (B12)

where As ζ ≠ 0 (mod π), there exists some α ∈ R*_+ such that To save on notation, we redefine all stabilizers to be given by their α power, i.e. s^α → s (and similarly redefine the phases θ, θ′, φ, φ′ by their scaling by α). As |ψ_1⟩, |ψ_2⟩ are CV graph states, these rescaled operators are still stabilizers. We also define: which are stabilizers of |0⟩_q. Thus, we have constructed nine observables with constraints satisfying those of a Mermin-Peres magic square (see Table III for an example), a well-known proof of quantum contextuality [43].

The Mermin-Peres magic square of stabilizers of states mapping to the same latent space under a locally Lipschitz map (with θ = θ′ = φ = φ′ = 0 for simplicity). Stabilizers of |ψ_1⟩, |ψ_2⟩, and |0⟩_q make up the three rows. All observables in each row and column commute. Furthermore, the product of observables in each row and column is the identity, except for the third column, which gives minus the identity. See Table I for a special case of this magic square.
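The magic square constraints can be checked numerically in the familiar two-qubit setting. The sketch below is only a qubit analogue built from Pauli operators, not the continuous-variable stabilizers constructed in the proof, and the particular arrangement of the nine observables is an assumption chosen so that the third column carries the minus sign.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# A standard two-qubit Mermin-Peres square (illustrative arrangement).
square = [
    [np.kron(X, I2), np.kron(I2, X), np.kron(X, X)],
    [np.kron(I2, Z), np.kron(Z, I2), np.kron(Z, Z)],
    [np.kron(X, Z), np.kron(Z, X), np.kron(Y, Y)],
]

def commute(a, b):
    return np.allclose(a @ b, b @ a)

def product_sign(ops):
    """Return s such that the product of the three operators equals s * identity."""
    prod = ops[0] @ ops[1] @ ops[2]
    sign = np.trace(prod).real / 4.0
    assert np.allclose(prod, sign * np.eye(4))
    return int(round(sign))

rows = square
cols = [[square[r][c] for r in range(3)] for c in range(3)]

# Every pair of observables sharing a row or a column commutes.
lines_commute = all(
    commute(a, b)
    for line in rows + cols
    for a, b in itertools.combinations(line, 2)
)
row_signs = [product_sign(line) for line in rows]
col_signs = [product_sign(line) for line in cols]
```

No assignment of fixed values ±1 to the nine observables can reproduce these signs, since the product of all row products would have to equal the product of all column products; this is the contextuality used in the proof.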
Consider now the post-measurement states |ψ 1 , |ψ 2 of |ψ 1 , |ψ 2 , respectively, when s i s j is measured to be 1.By Table III, |ψ 1 is stabilized by s i s j and s i s j ; furthermore, |ψ 2 is stabilized by s i s j and s i s j , and therefore is also stabilized by we have that |ψ 1 and |ψ 2 are orthogonal.If it is congruent to π 2 (mod π), then either θ − θ = 0 (mod π) or φ − φ = 0 (mod π) (or both).Assume the latter WLOG (the former case is the same with φ → θ and j → i), and instead consider the post-measurement states |ψ 1 , |ψ 2 of |ψ 1 , |ψ 2 , respectively, when s j is measured to be 1.By Table III, |ψ 1 is stabilized by s j and s j ; furthermore, |ψ 2 is stabilized by s j and s j , and therefore is also stabilized by s j s j = e −2i(φ −φ ) s j .
We now consider the locally Lipschitz online learner, with structure given by Fig. 4(a).We assume that the learner is deterministic, and discuss the extension to randomized models at the end of this Appendix.
Online neural sequence models at time step i map an input token x_i and a latent vector λ_{i−1} to an output token m_i and a new latent vector λ_i. After n steps, then, we can consider the online model as a locally Lipschitz map: where L is a locally Lipschitz latent manifold and r is a random vector such that, for any r, F_r is deterministic. This consideration of F_r as a deterministic function of a random r is typically the implementation of stochastic learners, such as generative adversarial networks (GANs) [63] and flow-based models [64]. This also includes implementations of stochastic simulation algorithms such as Wigner function simulation [32]. As for each r, there exists an input sequence such that any classical model with dim(L) < n(n−3)/2 deterministically outputs a measurement sequence inconsistent with quantum mechanics, our results still hold for these classes of random models. Due to this, in the following, we take the r dependence to be implicit.
With the preliminaries in place, we now prove our expressivity separation.
Theorem 2 (Online stabilizer measurement translation lower bound). Consider an online model with locally Lipschitz latent manifold L and locally Lipschitz map F as described in Eq. (B22). If dim(L) < n(n−3)/2, this model cannot achieve a finite backward empirical cross entropy on the (2, n) stabilizer measurement translation task.
Proof. Consider $K \subset (\mathbb{R}^{2n})^{n}$, with elements of the form: here, ‖B‖_F is the Frobenius norm of B, and B is an n × n hollow (zero diagonal elements) symmetric matrix with entries bounded to be in [−1/4, 1/4]. H is the fixed n × n symmetric hollow matrix of ones, i.e.

It is obvious from this construction that K is an n(n−1)/2-dimensional embedding of a compact subspace of hollow symmetric matrices (with bounded entries and norm) B. Note that the states described by the measurement scenarios of points in K are exactly CV graph states without loops (with bounded weight edges, as K is compact) and, depending on the measurement results, perhaps overall phases on the stabilizers. To see this, note that the symmetric constraint on B ensures that the symplectic product of any two rows of Q is zero; furthermore, the final n columns of Q are linearly independent for all B ≠ 0, and the first n columns for B = 0 due to the shift by H. Thus, all points in K are full row rank, and the rows of Q completely determine the CV stabilizer state, up to phases given by the measurement results of these operators. Furthermore, as B and H are hollow, the CV graph state Q describes has no loops. Also note that, up to independent rescalings of the rows of Q, different Q correspond to different graph states.
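The role of the symmetry constraint can be seen in a small numerical check. The sketch below does not use the paper's parametrization of Q; it assumes the textbook CV graph state form, in which the rows of Q = [−A | I] encode the operators p̂_i − Σ_j A_ij q̂_j in (q | p) ordering for a hollow adjacency matrix A. All function names are hypothetical.

```python
import numpy as np

def random_hollow_symmetric(n, bound=0.25, seed=0):
    """Hollow symmetric matrix with entries in [-bound, bound]."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.uniform(-bound, bound, size=(n, n)), k=1)
    return upper + upper.T

def symplectic_gram(Q):
    """Matrix of symplectic products between all pairs of rows of Q."""
    n = Q.shape[1] // 2
    omega = np.block([
        [np.zeros((n, n)), np.eye(n)],
        [-np.eye(n), np.zeros((n, n))],
    ])
    return Q @ omega @ Q.T

n = 5
A = random_hollow_symmetric(n)
Q = np.hstack([-A, np.eye(n)])   # assumed textbook form, rows are (q | p) coefficients
gram = symplectic_gram(Q)        # vanishes because A is symmetric
rank = np.linalg.matrix_rank(Q)  # full row rank from the identity block

# Breaking the symmetry of A makes two of the operators fail to commute.
A_bad = A.copy()
A_bad[0, 1] += 0.1
gram_bad = symplectic_gram(np.hstack([-A_bad, np.eye(n)]))
```

In this assumed form the symplectic products reduce to the antisymmetric part of A, so they vanish exactly when A is symmetric.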
We assume WLOG that the Jacobian of F attains its maximum rank at B = 0 (that is, the squeezed state |0⟩_q); this can always be done by implicitly transforming the basis of inputs to the model (i.e. by appropriately relabeling points in $(\mathbb{R}^{2n})^{n+2}$), and then considering K as previously defined in this new basis.
We will proceed as follows. First, we will show that when dim(L) is sufficiently small, F must map three distinct CV graph states described by different Q to the same point in latent space. Then, we will use Lemma 1 to show that the stabilizers of these states exhibit quantum contextuality (independent of the associated n measurement results), and give rise to a distinguishing measurement sequence. Thus, by considering measurement sequences of length n + 2 that include these three Q and the length two distinguishing measurement sequence, one of the final two measurement outcomes must be incorrect. This implies that there is an infinite backward empirical cross entropy on any finite set containing these three measurement sequences.
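To see why a single incorrect outcome forces an infinite loss, the sketch below evaluates a cross entropy on a toy two-outcome measurement. It assumes that the backward cross entropy averages, over outcomes drawn from the model, the negative log probability under the true distribution; this convention and all names in the code are assumptions made for illustration.

```python
import math

def backward_cross_entropy(model_probs, true_probs):
    """Return -sum_y q(y) log p(y), with q the model and p the target (assumed convention)."""
    total = 0.0
    for outcome, q in model_probs.items():
        if q == 0.0:
            continue  # outcomes the model never emits do not contribute
        p = true_probs.get(outcome, 0.0)
        if p == 0.0:
            return math.inf  # the model emits an outcome that never occurs
        total -= q * math.log(p)
    return total

# Toy target: the distinguishing measurement deterministically yields +1.
true_probs = {"+1": 1.0, "-1": 0.0}
consistent = backward_cross_entropy({"+1": 1.0}, true_probs)
inconsistent = backward_cross_entropy({"-1": 1.0}, true_probs)
```

Under this convention, a model that outputs an impossible outcome with any nonzero probability receives an infinite loss, regardless of how well it performs elsewhere.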
Let us begin by showing that F|_K (i.e. the locally Lipschitz map that is F restricted to K) must map three nontrivially distinct Q (i.e. three distinct CV graph states) to the same point in latent space when dim(L) is sufficiently small. By the constant rank theorem and the local Lipschitzness of F|_K, F|_K is not injective for In particular, in a sufficiently small neighborhood of B = 0 (where the Jacobian of F|_K attains its maximal rank), there exist local coordinates x of K and L such that 2 [58]. WLOG, we identify x = 0 with B = 0, which is the state infinitely squeezed in all q_i. We will call C the fiber with local coordinates x = 0, . . ., 0, x_{l+1}, . . ., which is of dimension at least By construction, all points in C, including B = 0, map to the same point l ∈ L under F|_K. We now assume that dim(L) < n(n−3)/2. Now fix B = 0 and B′ ≠ B in C. As described previously, Q (and thus B) completely determines a CV graph state after n measurements, up to independent rescalings of the rows of Q (and the measurement results). As the dimension of the space of points that differ (modulo π) from B′ + H/2 by just a scaling factor in each row is at most n, because ∆ ≥ n + 1 we must have that there exists another B″ ≠ B, B′ describing a distinct CV graph state.
Therefore, by Lemma 1, we have that there exists a distinguishing measurement sequence for these three states. Note that as Lemma 1 does not depend on the phases of the CV stabilizers, the existence of this distinguishing measurement holds true regardless of what the measurement results are (i.e. independently from what the model outputs for the first n tokens in the decoded sequence). As after n tokens all three sequences map to the same point in latent space in the model, and as they share a distinguishing measurement sequence, the model must obtain an infinite backward empirical cross entropy on these three input sequences when followed by the distinguishing measurement sequence.
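The dimension counting behind the proof is simple arithmetic: K has dimension n(n−1)/2, the rank of F restricted to K is at most dim(L), and so fibers near a point of maximal rank have dimension at least n(n−1)/2 − dim(L). The sketch below, with hypothetical function names, checks that this is at least n + 1 whenever dim(L) < n(n−3)/2.

```python
def fiber_dimension_lower_bound(n, latent_dim):
    """Lower bound dim(K) - dim(L) on the fiber dimension, with dim(K) = n(n-1)/2."""
    return n * (n - 1) // 2 - latent_dim

def bound_holds(n):
    """Check that every integer dim(L) below n(n-3)/2 leaves a fiber of dimension >= n + 1."""
    threshold = n * (n - 3) // 2  # n(n-3) is always even, so this is exact
    return all(
        fiber_dimension_lower_bound(n, latent_dim) >= n + 1
        for latent_dim in range(threshold)
    )

all_hold = all(bound_holds(n) for n in range(4, 50))
tight_case = fiber_dimension_lower_bound(6, 8)  # n = 6, dim(L) = 8 < 9
```

The bound is tight at dim(L) = n(n−3)/2 − 1, where the fiber dimension is exactly n + 1, one more than the at most n dimensions accounted for by row rescalings.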

A CV Gottesman-Knill Lower Bound
We now show that our results can be reformulated as a memory lower bound on the classical simulation of stabilizer measurement scenarios.In practice, using finite resources (i.e. at finite precision), any classical ontological model simulating p (y | x) can only be evaluated at a finite number of x.We now show that any locally Lipschitz interpolation of such a model to real x cannot accurately simulate Gaussian operations on an initial GKP state.This includes, for instance, any polynomial interpolation (which always exists).
Corollary 1 (CV Gottesman-Knill lower bound). Consider a classical ontological model p(y | x) with a latent space of dimension less than n(n−3)/2, simulating (k, n) stabilizer measurement translation with k ≥ 2. Assume that this ontological model is defined at a finite number of x. There exists a locally Lipschitz interpolation of this model to all x. Furthermore, no locally Lipschitz interpolation of this model can faithfully perform stabilizer measurement translation at all x.
Proof. As there exists a polynomial interpolation of p, and as all polynomials of finite degree are locally Lipschitz, there exists a locally Lipschitz interpolation of this model to all x. Furthermore, no locally Lipschitz interpolation of this model can faithfully perform stabilizer measurement translation by Theorem 2, as the composition of locally Lipschitz functions is locally Lipschitz.
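The existence claim in the proof can be illustrated in one dimension. The sketch below fits the unique polynomial of degree m − 1 through m sample points and bounds its derivative on the sampled interval, which gives a Lipschitz constant there. The sample values are made up for the example, and a generic polynomial interpolant need not stay within [0, 1] between the samples.

```python
import numpy as np

def polynomial_interpolant(xs, ys):
    """Polynomial of degree len(xs) - 1 passing through the points (xs, ys)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    coeffs = np.polyfit(xs, ys, deg=len(xs) - 1)
    return np.poly1d(coeffs)

# Hypothetical model values p(y | x) known only at finitely many inputs x.
xs = np.array([0.0, 0.5, 1.0, 1.5])
ys = np.array([0.1, 0.9, 0.4, 0.6])

interp = polynomial_interpolant(xs, ys)

# A Lipschitz constant on [0, 1.5], estimated from the derivative on a fine grid.
grid = np.linspace(xs.min(), xs.max(), 2001)
local_constant = np.max(np.abs(np.polyder(interp)(grid)))
```

Because the derivative of a polynomial is bounded on every compact set, the same estimate works on any bounded neighborhood, which is all that local Lipschitzness requires.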

Expressivity Separation for Encoder-Decoder Models
Though online sequence models are perhaps conceptually the simplest as they directly map input tokens to output tokens, in practice encoder-decoder models outperform them [30,38].We now show that no encoder-decoder model with a locally Lipschitz encoder (and an additional technical assumption) can perform stabilizer measurement translation to finite backward empirical cross entropy.The proof will be similar to that of Theorem 2; however, as the model can see the entire input sequence at once, we do not directly have the freedom to choose the distinguishing measurement sequence as in Theorem 2. Instead, we will require an input sequence of length quadratic in n (and our additional technical assumption) to force the distinguishing measurement sequence.Note that, as the memory of the contextual learner is independent of the sequence length, this new sequence length has no impact on the memory separation.
We consider an encoder-decoder model with structure given by Fig. 4(b).The encoder of such a model can be considered a locally Lipschitz map to some locally Lipschitz latent manifold L. As in Appendix B 1, r is a random vector such that, for any r, E r is deterministic.We once again make the r dependence implicit in the following.
For technical reasons, we slightly change the definition of the (k, n) stabilizer measurement translation task, where now the final k measurement descriptions instead describe the measurements of the operators: where we define x to be zero when x = 0. We will call this the modified (k, n) stabilizer measurement translation task.This can obviously still be performed perfectly with a CRNN of model size n, by either changing the parameters of the phase estimation circuit of Fig. 1(b) to be given by , or more formally by introducing a quantum circuit computing on which these gates control.
We now discuss our additional technical assumption. Defining the subspace R of inputs as in the proof of Theorem 3, we assume that the Jacobian of the encoder restricted to R attains its maximal rank at some point of the form (Q, 0) ∈ R. A sufficient condition for this is requiring that some point of the form (Q, 0) is not a critical point of E|_R; this condition is satisfied by generic E when dim(L) < n(n−3)/2, and also by models with encoders constrained to be submersions. In fact, when the latter holds, then it is easy to see from the proof of Theorem 3 that the separation still holds on the unmodified (k, n) stabilizer measurement translation task, as all properties we use that hold locally then hold globally. Any one of these conditions is sufficient, and needed for our proof technique to be able to analyze any neighborhood of the non-Gaussian measurements we consider.
Theorem 3 (Encoder-decoder stabilizer measurement translation lower bound). Consider an encoder-decoder model with locally Lipschitz latent manifold L. Let E be the associated locally Lipschitz encoder function, as defined in Eq. (B29), and assume that the Jacobian of the map E|_R (where the subspace R is defined below) attains its maximal rank at some point of the form (Q, 0) ∈ R. If dim(L) < n(n−3)/2, this model cannot achieve a finite backward empirical cross entropy on the modified (n² − n, n) stabilizer measurement translation task.
with elements (Q, P) of the following form:

1. Q ∈ K is given by rows of matrices of the form: here, ‖B‖_F is the Frobenius norm of B, and B is a hollow symmetric matrix with entries bounded to be in [−1/4, 1/4]. H is the fixed symmetric hollow matrix of ones, i.e.

2. $P \in (\mathbb{R}^{2n})^{n^2 - n}$ is an arbitrary (n² − n) × 2n matrix.
It is obvious from this construction that K is an n(n−1)/2-dimensional embedding of a compact subspace of hollow symmetric matrices (with bounded entries and norm) B. We assume WLOG that the Jacobian of E|_K (that is, the locally Lipschitz restriction of E to points of the form (Q, 0) ∈ R) attains its maximum rank at B = 0 (that is, the squeezed state |0⟩_q); this can always be done by implicitly transforming the basis of the inputs to the model (i.e. by appropriately relabeling points in $(\mathbb{R}^{2n})^{n^2}$), and then considering K as previously defined in this new basis.
We now give some intuition behind points in the smooth manifold (with boundary) R. At fixed P = 0, the states described by the measurement scenarios of points in R are exactly CV graph states without loops (with bounded weight edges, as K is compact) and, depending on the measurement results, perhaps overall phases on the stabilizers.
To see this, note that the symmetric constraint on B ensures that the symplectic product of any two rows of Q is zero; furthermore, the final n columns of Q are linearly independent for all B ≠ 0, and the first n columns for B = 0 due to the shift by H. Thus, all points in K are full row rank, and the rows of Q completely determine the CV stabilizer state, up to phases given by the measurement results of these operators (which are not yet determined at the time of encoding). Furthermore, as B and H are hollow, the CV graph state Q describes has no loops. Also note that, up to trivial rescalings of the rows of Q, different Q correspond to different graph states. At general P, the state after the first n measurements is still a CV graph state completely determined by Q (up to phases from the first n measurement results); different P correspond to different (non-Gaussian) measurement scenarios given an initial CV graph state determined by Q (and the first n measurement results).
We will proceed as follows. First, we will show that as dim(L) < n(n−3)/2, E must map three distinct CV graph states described by different Q to the same point in latent space when P = 0. Then, we will use Lemma 1 to show that the stabilizers of these states exhibit quantum contextuality (independent of the associated n measurement results), and give rise to a distinguishing measurement sequence. Finally, we will show that one can locally find P ≠ 0 mapping to the same point in latent space that contains this distinguishing measurement sequence, forcing an incorrect measurement outcome on one of these states. This gives rise to an infinite backward empirical cross entropy on this task.
Let us begin by showing that E|_R (i.e. the locally Lipschitz map that is E restricted to R) must map three nontrivially distinct Q (i.e. three distinct CV graph states) to the same point in latent space. We will consider the restriction E|_K, which is the (locally Lipschitz) restriction of E to points of the form (Q, 0) ∈ R. We will show that this map is not injective for small enough dim(L). As E|_R lifts to E|_K, this will show that three distinct Q map to the same point in latent space under E|_R. This is similar to the construction for F|_K in the proof of Theorem 2; we repeat it here for completeness.
the source of the separations in our proofs, this gives evidence that one could use a more experimentally feasible non-Gaussian ancilla state than a GKP state-such as a photon subtracted state [68]-and achieve similar results.
We can also write down its wavefunction in position space as: where q and c_q are c-number column vectors, and Z = V + iU is a complex symmetric matrix. We can interpret Z as the adjacency matrix for an undirected graph with complex-valued edge weights. Therefore, any Gaussian pure state can be interpreted as a Gaussian graph state with complex-valued weights on the graph edges. As Gaussian states are uniquely identified by the linear combinations of position and momentum operators that nullify them [71], we can consider what the nullifiers are for the CV graph state defined by Z and c = (c_q, c_p): In the second to last line, we have used the fact that −Z c_q + c_p is a c-number vector, which commutes with R_{Z,c}. In addition, we also use the fact that Z = V + iU, and $(-i\hat{q} + \hat{p})|0\rangle = -i\sqrt{2}\,\hat{a}|0\rangle = 0$. Therefore, the Gaussian graph state with complex adjacency matrix Z and mode shift center c has complex nullifiers:

We now restrict to the case V = 0 and c_p = 0, which is the class of states we consider in our numerical experiments. Then, the nullifiers are of the form: We also restrict to symplectic transformations of the quadratures of the form: where W is some assumed invertible matrix. We restrict to operations of this form to more efficiently allow for the classical simulation of the contextual RNN using Gaussian operations of an identical form, as discussed in [70] and Appendix D 2. After performing the quantum operation described by such a symplectic matrix, the nullifier for the whole system is updated as $(W^{-1}\hat{p} - iUW\hat{q} + iUc_q)|\psi\rangle = 0$, which is equivalent to $(\hat{p} - iWUW\hat{q} + iWUc_q)|\psi\rangle = 0$.
Homodyne detection of the position quadrature on m qumodes is then, in general, a multivariate Gaussian random variable which is centered at Π_Y W^{−1} c_q with variance Π_Y U^{−1} Π_Y, where Π_Y is the projection operator onto the subspace of the m qumodes being measured. To make the training of our Gaussian models more stable (and to maintain simulability with GKP input states, as is done in Appendix D 2), we assume in our numerical simulations that there is an implicit large scaling for U, c_q such that this variance is small. After the measurement, the hidden state at the ith step is updated to a generalized graph state with adjacency matrix Π_H U Π_H and mode shift where Π_H is the projection operator onto the subspace of the latent n qumodes.
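The bookkeeping for the nullifier update can be sketched in a few lines. Reading the updated nullifier p̂ − iWUWq̂ + iWUc_q as having the same form as the original with U replaced by WUW, the shift must become W⁻¹c_q, which matches the homodyne center quoted above. The code below is a minimal sketch under the additional assumption that W is symmetric, so that WUW remains symmetric; the function names and test matrices are hypothetical.

```python
import numpy as np

def update_nullifier(U, c_q, W):
    """Map (U, c_q) -> (W U W, W^{-1} c_q), following the quoted nullifier update."""
    U_new = W @ U @ W
    c_new = np.linalg.solve(W, c_q)  # W^{-1} c_q without forming the inverse
    return U_new, c_new

def homodyne_mean(c_q, measured_modes):
    """Center of the position measurement on the measured modes (projection onto Y)."""
    return c_q[measured_modes]

rng = np.random.default_rng(0)
n = 4
U = np.diag([1.0, 2.0, 3.0, 4.0])        # symmetric positive definite weights
M = rng.normal(size=(n, n))
W = np.eye(n) + 0.1 * (M + M.T) / 2.0    # assumed symmetric, close to the identity
c_q = rng.normal(size=n)

U_new, c_new = update_nullifier(U, c_q, W)
mean_y = homodyne_mean(c_new, [0, 1])    # measure the first two modes
```

The consistency condition is that the new weights applied to the new shift reproduce the last term of the updated nullifier, that is, (WUW)(W⁻¹c_q) = WUc_q.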

Gaussian Operations on GKP States
Our methods for the simulation of (restricted) Gaussian operations on GKP states are similar to the methods used by [70]. We restrict to (unnormalized) states of the form: where each |ψ is a Gaussian state with nullifiers given by: We assume → 0+, such that all |ψ are approximately orthogonal. Time evolution is simulated as in Appendix D 1, simultaneously for all |ψ. As all |ψ are approximately orthogonal, measurement is simulated via choosing uniformly at random, and performing the corresponding measurement. In principle, using many measurements (over multiple instances of the state), one can read out L and c_q; these are the measurement results we use in our numerical experiments, as described in Appendix E 2. For any given measurement on a subset of the modes, the post-measurement state on the remainder of the modes is just the uniform superposition over all consistent with the resulting measurement outcome y.
To make this latter observation more concrete, assume after evolution under S of the form of Eq. (D9), the nullifiers of |ψ are (see Appendix E): We wish to find all consistent with the position measurement result y (in the limit → 0+); that is, all such that: where Π_Y is the projector onto the m mode space being measured. Let H label the space of the other n modes, and the projector onto this space Π_H. Writing (with the assumptions that W_{YY} and W_{HH} are full rank): and assuming L (assumed full rank) is of the form: we find from Eq. (D13) that all consistent with the measurement result satisfy: Assuming the entries of L_{YY} are sufficiently small, up to any given machine precision the H are in one-to-one correspondence with the consistent with the measurement result. Furthermore, for all such : Using the appropriate displacement operator to remove the final term of Eq. (D19), then, yields the effective transformation:

Appendix E: Details of the Numerical Simulations

We now discuss the details of the numerical simulations performed in Sec. IV. For all models, we studied the performance in modeling a standard Spanish-to-English data set [46]. For each model and each n, the training set was taken to be a random sample of 80% of the data set, and the test set 20%. Each model was trained for 80 epochs, with a batch size of 64. To map the words in this data set to vectors (taken to also be of dimension n, the model dimension), we used the Keras [72] implementation of word2vec ("TextVectorization") adapted to the data, with a maximum vocabulary size of 5000. The 5000 most frequent words are mapped to distinct integers, and other words are mapped to a unique token "[Unk]." At the beginning and end of each sentence, we add unique "[Begin]" and "[End]" tokens. For the recurrent translation models, such as the GRU RNN or the CRNN, we do not need to set the length of the sentences. For the Transformer model, we set the sentence length to be 20 words. If the sentence is shorter than 20 words, the additional token "[Pad]" is added to the sentence. For the rare case when the sentence contains more than 20 words, additional words are truncated. All networks were trained using Adam [73] (with a learning rate of 10^{−3}, β_1 = 0.9, β_2 = 0.999, and ε = 10^{−7}), and were trained on the forward empirical cross entropy (as the backward empirical cross entropy is difficult to train on).
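The preprocessing described above can be sketched in plain Python. This is a stand-in for illustration only and is not the Keras "TextVectorization" layer; in particular, whether the special tokens count toward the vocabulary limit, whether the 20-word length includes "[Begin]" and "[End]", and where "[Pad]" is placed are all assumptions made here.

```python
from collections import Counter

BEGIN, END, PAD, UNK = "[Begin]", "[End]", "[Pad]", "[Unk]"

def build_vocab(sentences, max_words=5000):
    """Map the max_words most frequent words to distinct integers."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Assumption: special tokens get the first ids and do not count toward max_words.
    vocab = {tok: i for i, tok in enumerate([PAD, UNK, BEGIN, END])}
    for word, _ in counts.most_common(max_words):
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, length=None):
    """Encode a sentence; if length is given, truncate or pad to that many words."""
    words = sentence.lower().split()
    if length is not None:
        words = words[:length]  # extra words are truncated
    tokens = [BEGIN] + words + [END]
    if length is not None:
        # Assumption: padding is appended after the [End] token.
        tokens += [PAD] * (length + 2 - len(tokens))
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

corpus = ["el gato", "el gato", "el perro"]
vocab = build_vocab(corpus, max_words=2)          # keeps "el" and "gato" only
padded = encode("el perro", vocab, length=4)      # "perro" falls back to [Unk]
truncated = encode("a b c d e f", vocab, length=4)
unpadded = encode("el gato", vocab)               # recurrent models: no fixed length
```

Leaving `length` unset mirrors the recurrent models, which do not need a fixed sentence length, while a fixed `length` mirrors the Transformer setting.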

Classical Sequence Models
We studied three standard classical sequence models in our numerical experiments: an implementation of an orthogonal recurrent neural network [45], an implementation of a network using gated recurrent units (GRU) [37], and an implementation of a Transformer [38].The first two networks were trained in a seq2seq configuration [30]; the models autoregressively map the input sequence to a latent space, which then is autoregressively decoded.For the orthogonal recurrent neural network, we used the implementation of [45], with network capacity equal to the model dimension.An illustration of these architectures is given in Fig. 6.
The Transformer models we considered follow the standard construction of [38].To fairly compare against the shallow RNN cells we consider, we used a single Transformer encoder and decoder layer for each model.Our implementation used a trained positional embedding with uniform initialization, and the encoders and decoders used ReLU activations in the feedforward network layers, Glorot weight initialization, zero bias initialization, no dropout, and a single head.Each encoder and decoder layer was followed by the layer normalization implementation of Keras [72] with its default parameters.The final layer normalization of the decoder of each Transformer we considered was followed by a dense layer with a softmax activation function.
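For orientation, a single-head encoder layer with the ingredients listed above (Glorot uniform weights, zero biases, ReLU feedforward, no dropout, layer normalization) can be sketched in NumPy. This is not the implementation used in the experiments: the placement of the normalization after each sublayer, the feedforward width, and the value eps = 1e-3 are assumptions, and the trained positional embedding is omitted.

```python
import numpy as np

def glorot_uniform(rng, fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-3):
    # Unit gain and zero offset; eps is an assumed default.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_params(rng, d_model, d_ff):
    shapes = {
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
        "W1": (d_model, d_ff), "W2": (d_ff, d_model),
    }
    # Biases are initialized to zero, so they are omitted in this sketch.
    return {name: glorot_uniform(rng, *shape) for name, shape in shapes.items()}

def encoder_layer(x, p):
    """Single-head self-attention followed by a ReLU feedforward block."""
    d_model = x.shape[-1]
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    weights = softmax(q @ k.T / np.sqrt(d_model))  # (seq, seq) attention weights
    h = layer_norm(x + (weights @ v) @ p["Wo"])    # residual, then normalization
    ff = np.maximum(h @ p["W1"], 0.0) @ p["W2"]    # ReLU feedforward
    return layer_norm(h + ff), weights

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))   # 20 tokens, model dimension n = 8
params = init_params(rng, d_model=8, d_ff=16)
out, weights = encoder_layer(x, params)
```

A decoder layer would add causal masking and cross-attention over the encoder output, followed by the dense softmax layer mentioned above.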

Quantum Sequence Models
Based on the discussion in Appendix D, we simulated both the Gaussian RNN and the contextual RNN described in Sec. IV. The training and architecture of these models were identical to those of the orthogonal neural network described in Appendix E 1, other than the structure of each unit cell of the recurrent network. Once again, Fig. 6 describes the overall architecture of the models, and Fig. 7 describes the recurrent cell of these models. For the Gaussian model, the simulated homodyne position measurements are what we used for our cell outputs; for the CRNN, we simulated the lattice (and position measurement) readout procedure described at the end of Appendix D 2.
The pseudocode for the cells of both is given in Algorithm 1; for the Gaussian model, one takes J_0, K = 0. By considering the Gaussian RNN cell as a limit of the contextual RNN cell, and by not training K or J_0 in the contextual RNN, we were able to maintain identical parameter counts for the two models. For the contextual RNN, J_0 is fixed to the identity, and K is the result of an untrained dense layer h applied to the cell input (with uniform Glorot initialization and linear activation function, and biased such that K has mean the identity). Specializing to the notation of Algorithm 1: r is similar. f and g are similar, except with no bias.

Time Complexity
We now discuss the time complexity of implementing a CRNN, both as a quantum model implemented on a quantum computer, and as a quantum-inspired classical algorithm.On a quantum computer, each cell of the CRNN we consider in our proofs of an expressivity separation can be implemented in depth O (n), assuming access to the fixed ancilla state |a used for non-Gaussian measurement.This further decreases to O (1) time if one assumes access to quantum fan-out [74].More general Gaussian operations can also be implemented in depth O (n) utilizing a swap network [75].
Examining Algorithm 1, it is easy to see that our algorithm simulates inference on a CRNN with model di-

FIG. 2. (a) A schematic of the classical model when the dimension of the latent space L (green) is less than n(n−3)

FIG. 4. (a) An online neural sequence model. The model autoregressively takes input tokens x_i, and outputs decoded tokens y_i, with map F_i. The model also has an unobserved internal memory with state λ_i ∈ L after decoding token i that F_{i+1} can depend on. (b) A general encoder-decoder model. E encodes the input x into some latent representation λ ∈ L. A decoder D then outputs the decoded sequence y.

FIG. 6. An overview of the recurrent models we study. Each red box represents the recurrent cell, and the variational parameters in the recurrent cells are shared within the encoder and decoder. λ_0 is a random fixed hidden memory vector. In the decoding phase, the output of each recurrent cell is also treated as the input for the next recurrent cell.

FIG. 7. One recurrent cell of the quantum recurrent architectures. Note that the only trained part of the dense network at the input of each recurrent cell are displacements in phase space acting on |φ_e⟩, in order to keep the number of trainable parameters in line with the GRU RNN at the same n. |φ_e⟩ is a Gaussian state for the Gaussian model, and a GKP state for the CRNN.
