Learning to predict arbitrary quantum processes

We present an efficient machine learning (ML) algorithm for predicting any unknown quantum process $\mathcal{E}$ over $n$ qubits. For a wide range of distributions $\mathcal{D}$ on arbitrary $n$-qubit states, we show that this ML algorithm can learn to predict any local property of the output from the unknown process~$\mathcal{E}$, with a small average error over input states drawn from $\mathcal{D}$. The ML algorithm is computationally efficient even when the unknown process is a quantum circuit with exponentially many gates. Our algorithm combines efficient procedures for learning properties of an unknown state and for learning a low-degree approximation to an unknown observable. The analysis hinges on proving new norm inequalities, including a quantum analogue of the classical Bohnenblust-Hille inequality, which we derive by giving an improved algorithm for optimizing local Hamiltonians. Numerical experiments on predicting quantum dynamics with evolution time up to $10^6$ and system size up to $50$ qubits corroborate our proof. Overall, our results highlight the potential for ML models to predict the output of complex quantum dynamics much faster than the time needed to run the process itself.

As an example, for predicting outcomes of quantum experiments [8,37,38], we consider $\rho$ to be parameterized by a classical input $x$, $\mathcal{E}$ to be an unknown process happening in the lab, and $O$ to be an observable measured at the end of the experiment. Another example arises when we want to use a quantum ML algorithm to learn a model of a complex quantum evolution with the hope that the learned model can be faster than the evolution itself [7,11,12]. Because an $n$-qubit CPTP map $\mathcal{E}$ consists of exponentially many parameters, prior works, including those based on covering number bounds [4,7,8,37], classical shadow tomography [33,39], and quantum process tomography [30-32], require an exponential number of data samples to guarantee a small constant error for predicting outcomes of an arbitrary evolution $\mathcal{E}$ under a general input state $\rho$. To improve upon this, recent works [4,7,8,37,40] have considered quantum processes $\mathcal{E}$ that can be generated in polynomial time, and have shown that a polynomial number of data samples suffices to learn $\mathrm{tr}(O\mathcal{E}(\rho))$ in this restricted class. However, these results still require exponential computation time.
In this work, we present a computationally efficient ML algorithm that can learn a model of an arbitrary unknown $n$-qubit process $\mathcal{E}$, such that, given $\rho$ sampled from a wide range of distributions over arbitrary $n$-qubit states and any $O$ in a large, physically relevant class of observables, the ML algorithm can accurately predict $f(\rho, O) = \mathrm{tr}(O\mathcal{E}(\rho))$. See Fig. 1 for an illustration. The ML model can predict outcomes for highly entangled states $\rho$ after learning from a training set that only contains data for random product input states and randomized Pauli measurements on the corresponding output states. The training and prediction of the proposed ML model are both efficient even if the unknown process $\mathcal{E}$ is a Hamiltonian evolution over an exponentially long time, a quantum circuit with exponentially many gates, or a quantum process arising from contact with an infinitely large environment for an arbitrarily long time. Furthermore, given few-body reduced density matrices of the input state $\rho$, the ML algorithm uses only classical computation to predict output properties $\mathrm{tr}(O\mathcal{E}(\rho))$. The proposed ML model is a combination of efficient ML algorithms for two learning problems: (1) predicting $\mathrm{tr}(O\rho)$ given a known observable $O$ and an unknown state $\rho$, and (2) predicting $\mathrm{tr}(O\rho)$ given an unknown observable $O$ and a known state $\rho$. We give sample-efficient and computationally efficient learning algorithms for both problems. We then show how to combine the two learning algorithms to address the problem of learning to predict $\mathrm{tr}(O\mathcal{E}(\rho))$ for an arbitrary unknown $n$-qubit quantum process $\mathcal{E}$. Together, the sample and computational efficiency of the two learning algorithms implies the efficiency of the combined ML algorithm.
In order to establish rigorous guarantees for the proposed ML algorithms, we consider a different task: optimizing a $k$-local Hamiltonian $H = \sum_{P\in\{I,X,Y,Z\}^{\otimes n}} \alpha_P P$. We present an improved approximate optimization algorithm that finds either a maximizing or a minimizing state $|\psi\rangle$ with a rigorous lower or upper bound guarantee on the energy $\langle\psi|H|\psi\rangle$ in terms of the Pauli coefficients $\alpha_P$ of $H$. The rigorous bounds improve upon existing results on optimizing $k$-local Hamiltonians [41-44]. We then use the improved optimization algorithm to give a constructive proof of several useful norm inequalities relating the spectral norm $\|O\|$ of an observable $O$ and the $\ell_p$ norm of the Pauli coefficients $\alpha_P$ associated with the observable $O$. The proof resolves a recent conjecture in Ref. [45] about the existence of quantum Bohnenblust-Hille inequalities. These norm inequalities are then used to establish the efficiency of the proposed ML algorithms.

II. LEARNING QUANTUM STATES, OBSERVABLES, AND PROCESSES
Before proceeding to state our main results in greater detail, we informally describe the learning tasks discussed in this paper: what do we mean by learning a quantum state, observable, and process?

A. Learning an unknown state
It is possible, in principle, to provide a complete classical description of an $n$-qubit quantum state $\rho$. However, this would require an exponential number of experiments, which is not practical at all. Therefore, we set a more modest goal: to learn enough about $\rho$ to predict many of its physically relevant properties. We specify a family of target observables $\{O_i\}$ and a small target accuracy $\epsilon$. The learning procedure is judged to be successful if we can predict the expectation value $\mathrm{tr}(O_i\rho)$ of every observable in the family with error at most $\epsilon$.
Suppose that $\rho$ is an arbitrary and unknown $n$-qubit quantum state, and that we have access to $N$ identical copies of $\rho$. We acquire information about $\rho$ by measuring these copies. In principle, we could consider performing collective measurements across many copies at once. Or we might perform single-copy measurements sequentially and adaptively; that is, the choice of measurement performed on copy $j$ could depend on the outcomes obtained in measurements on copies $1, 2, \ldots, j-1$. The target observables we consider are bounded-degree observables. A bounded-degree $n$-qubit observable $O$ is a sum of local observables (each with support on a constant number of qubits independent of $n$) such that only a constant number (independent of $n$) of terms in the sum act on each qubit. Most thermodynamic quantities that arise in quantum many-body physics can be written as bounded-degree observables, including local observables, few-body correlation functions, geometrically local Hamiltonians, and the average magnetization.
In the learning protocols discussed in this paper, the measurements are neither collective nor adaptive. Instead, we fix an ensemble of possible single-copy measurements, and for each copy of $\rho$, we independently sample from this ensemble and perform the selected measurement on that copy. Thus, there are two sources of randomness in the protocol: the randomly chosen measurement on each copy and the intrinsic randomness of the quantum measurement outcomes. If we are unlucky, the chosen measurements and/or the measurement outcomes might not be sufficiently informative to allow accurate predictions. We settle for a protocol that achieves the desired prediction task with high success probability.
For the protocol to be practical, it is highly advantageous for the sampled measurements to be easy to perform in the laboratory and easy to describe in classical language. The measurements we consider, random Pauli measurements, meet both of these criteria. For each copy of $\rho$ and each of the $n$ qubits, we choose uniformly at random to measure one of the three single-qubit Pauli observables $X$, $Y$, or $Z$. This learning method, called classical shadow tomography, was analyzed in Ref. [46], where an upper bound on the sample complexity (the number $N$ of copies of $\rho$ needed to achieve the task) was expressed in terms of a quantity called the shadow norm of the target observables.
In this work, using a new norm inequality derived here, we improve on the result in Ref. [46] by obtaining a tighter upper bound on the shadow norm for bounded-degree observables. The upshot is that, for a fixed target accuracy $\epsilon$, we can predict all bounded-degree observables with spectral norm less than $B$ by performing random Pauli measurements on $\mathcal{O}(B^2/\epsilon^2)$ copies of $\rho$. This result improves upon the previously known bound of $\mathcal{O}(n\log(n)B^2/\epsilon^2)$. Furthermore, we derive a matching lower bound on the number of copies required for this task, which applies even if collective measurements across many copies are allowed.

B. Learning an unknown observable
Now suppose that $O$ is an arbitrary and unknown $n$-qubit observable. We also consider a distribution $\mathcal{D}$ on $n$-qubit quantum states. This distribution, too, need not be known, and it may include highly entangled states. Our goal is to find a function $h(\rho)$ that predicts the expectation value $\mathrm{tr}(O\rho)$ of observable $O$ on state $\rho$ with a small mean squared error, $\mathbb{E}_{\rho\sim\mathcal{D}}\,|h(\rho)-\mathrm{tr}(O\rho)|^2 \le \epsilon$. To define this learning task, it is convenient to assume that we can access training data of the form $\{\rho_\ell, \mathrm{tr}(O\rho_\ell)\}_{\ell=1}^N$, where each $\rho_\ell$ is sampled from distribution $\mathcal{D}$. In practice, though, we cannot directly access the exact value of the expectation value $\mathrm{tr}(O\rho_\ell)$; instead, we might measure $O$ multiple times in state $\rho_\ell$ to obtain an accurate estimate of the expectation value. Furthermore, we do not necessarily need to sample states from $\mathcal{D}$ to achieve the task. We might prefer to learn about $O$ by accessing its expectation value in states drawn from a different ensemble.
A crucial idea of this work is that we can learn $O$ efficiently if the distribution $\mathcal{D}$ has suitably nice features. Specifically, we consider distributions that are invariant under single-qubit Clifford gates applied to any one of the $n$ qubits. We say that such distributions are locally flat, meaning that the probability weight assigned to an $n$-qubit state is unmodified (i.e., the distribution appears flat) when we locally rotate any one of the qubits. Locally flat distributions include random product states, ground and thermal states of random local Hamiltonians, and any state generated by a circuit whose last layer consists of random single-qubit gates. Furthermore, states drawn from any distribution that is at most polynomially far from a locally flat distribution (measured in terms of the maximum likelihood ratio) can also be predicted efficiently and accurately.
An arbitrary observable $O$ can be expanded in terms of the Pauli operator basis, $O = \sum_{P\in\{I,X,Y,Z\}^{\otimes n}}\alpha_P P$, and truncated to the terms of weight at most $k$, yielding $O^{(k)} = \sum_{P:|P|\le k}\alpha_P P$. Usually, in machine learning, after learning from a training set sampled from a distribution $\mathcal{D}$, we can only predict new instances sampled from the same distribution $\mathcal{D}$. We find, though, that, for the purpose of learning an unknown observable, there is a particular locally flat distribution such that learning to predict under this distribution suffices for predicting under any other locally flat distribution, as well as under any other distribution that is at most polynomially far away from a locally flat distribution. Namely, we sample from this $n$-qubit state distribution by preparing each of the $n$ qubits in one of the six Pauli eigenstates $\{|0\rangle, |1\rangle, |+\rangle, |-\rangle, |y+\rangle, |y-\rangle\}$, chosen uniformly at random. Pleasingly, preparing such samples is not only sufficient for our task, but also easy to do with existing quantum devices.
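As a concrete illustration, the following short Python sketch (our own code with our own naming conventions, not taken from the paper's released implementation) samples product states from this distribution.

```python
import numpy as np

# Each qubit is prepared in one of the six Pauli eigenstates, uniformly at random.
PAULI_EIGENSTATES = {
    "z+": np.array([1, 0], dtype=complex),
    "z-": np.array([0, 1], dtype=complex),
    "x+": np.array([1, 1], dtype=complex) / np.sqrt(2),
    "x-": np.array([1, -1], dtype=complex) / np.sqrt(2),
    "y+": np.array([1, 1j], dtype=complex) / np.sqrt(2),
    "y-": np.array([1, -1j], dtype=complex) / np.sqrt(2),
}

def sample_product_state(n, rng):
    """Return labels and single-qubit vectors of a random stabilizer product state."""
    labels = rng.choice(list(PAULI_EIGENSTATES), size=n)
    return labels, [PAULI_EIGENSTATES[l] for l in labels]

rng = np.random.default_rng(0)
labels, qubits = sample_product_state(4, rng)
print(labels)  # e.g. ['x-' 'y+' 'z+' 'x+']
```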
After training is completed, to predict $\mathrm{tr}(O\rho)$ for a new state $\rho$ drawn from distribution $\mathcal{D}$, we need to know some information about $\rho$. State $\rho$, like operator $O$, can be expanded in terms of Pauli operators, and when we replace $O$ by its weight-$k$ truncation, only the truncated part of $\rho$ contributes to its expectation value. Thus, if the $k$-body reduced density matrices (RDMs) for states drawn from $\mathcal{D}$ are known classically, then the predictions can be computed classically. If the states drawn from $\mathcal{D}$ are presented as unknown quantum states, then we can learn these $k$-body RDMs efficiently (for small $k$) using classical shadow tomography and then proceed with the classical computation to obtain a predicted value of $\mathrm{tr}(O\rho)$.

C. Learning an unknown process
Now suppose that $\mathcal{E}$ is an arbitrary and unknown quantum process mapping $n$ qubits to $n$ qubits. Let $\{O_i\}$ be a family of target observables and $\mathcal{D}$ be a distribution on quantum states. We assume the ability to repeatedly access $\mathcal{E}$ for a total of $N$ times. Each time, we can apply $\mathcal{E}$ to an input state of our choice, and perform the measurement of our choice on the resulting output. In principle, we could allow input states that are entangled across the $N$ channel uses, and allow collective measurements across the $N$ channel outputs. But here we confine our attention to the case where the $N$ inputs are unentangled, and the channel outputs are measured individually. Our goal is to find a function $h(\rho, O)$ that predicts, with a small mean squared error, the expectation value of $O_i$ in the output state $\mathcal{E}(\rho)$ for every observable $O_i$ in the family $\{O_i\}$: $\mathbb{E}_{\rho\sim\mathcal{D}}\,|h(\rho, O_i)-\mathrm{tr}(O_i\,\mathcal{E}(\rho))|^2 \le \epsilon$. Our main result is that this task can be achieved efficiently if each $O_i$ is a bounded-degree observable and $\mathcal{D}$ is locally flat. That is, $N$, the number of times we access $\mathcal{E}$, and the computational complexity of training and prediction scale reasonably with the system size $n$ and the target accuracy $\epsilon$. For example, output properties can be predicted efficiently and accurately for any generic product input state. From Eq. (6), it is also easy to see that a small average error can be achieved for any distribution $\mathcal{D}$ that is at most polynomially far away from a locally flat distribution, with distance measured by the maximum likelihood ratio.
To prove this result, we observe that the task of learning an unknown quantum process can be reduced to learning unknown states and learning unknown observables. If $\rho_\ell$ is sampled from distribution $\mathcal{D}$ then, since $\mathcal{E}$ is unknown, $\mathcal{E}(\rho_\ell)$ should be regarded as an unknown quantum state. Suppose that we learn this state; that is, after preparing and measuring $\mathcal{E}(\rho_\ell)$ sufficiently many times we can accurately predict the expectation value $\mathrm{tr}(O_i\,\mathcal{E}(\rho_\ell)) = \mathrm{tr}(\mathcal{E}^\dagger(O_i)\,\rho_\ell)$, where $\mathcal{E}^\dagger$ is the (Heisenberg-picture) map dual to $\mathcal{E}$. Since $\mathcal{E}^\dagger$ is unknown, $\mathcal{E}^\dagger(O_i)$ should be regarded as an unknown observable. Suppose that we learn this observable; that is, using the dataset $\{\rho_\ell, \mathrm{tr}(\mathcal{E}^\dagger(O_i)\rho_\ell)\}_{\ell=1}^N$ as training data, we can predict $\mathrm{tr}(\mathcal{E}^\dagger(O_i)\rho)$ for $\rho$ drawn from $\mathcal{D}$ with a small mean squared error. This achieves the task of learning process $\mathcal{E}$ for state distribution $\mathcal{D}$ and target observable $O_i$.
Having already shown that arbitrary quantum states can be learned efficiently for the purpose of predicting expectation values of bounded-degree observables, and that arbitrary observables can be learned efficiently for any input state distribution that is not superpolynomially far from a locally flat distribution, we obtain our main result. Since distribution $\mathcal{D}$ is not too far from locally flat, it suffices to learn the low-degree truncated approximation to the unknown operator $\mathcal{E}^\dagger(O_i)$, incurring only a small mean squared error. To predict $\mathrm{tr}(\mathcal{E}^\dagger(O_i)\rho)$, then, it suffices to know only the few-body RDMs of the input state $\rho$. For any input state $\rho$, these few-body reduced density matrices can be learned efficiently using classical shadow tomography.
As noted above in the discussion of learning observables, the states $\rho_\ell$ in the training data need not be sampled from $\mathcal{D}$. To learn a low-degree approximation to $\mathcal{E}^\dagger(O_i)$, it suffices to sample from the uniform distribution over product states. Even if we sample only product states during training, we can make accurate predictions for highly entangled input states. We also emphasize again that the unknown process $\mathcal{E}$ is arbitrary. Even if $\mathcal{E}$ has quantum computational complexity exponential in $n$, we can learn to predict $\mathrm{tr}(O\mathcal{E}(\rho))$ accurately and efficiently for bounded-degree observables $O$ and for any distribution on the input state $\rho$ that is at most polynomially far from some locally flat distribution.

III. ALGORITHM FOR LEARNING AN UNKNOWN QUANTUM PROCESS
Consider an unknown $n$-qubit quantum process $\mathcal{E}$ (a CPTP map). Suppose that we have obtained a classical dataset by performing $N$ randomized experiments on $\mathcal{E}$. Each experiment prepares a random product state $\bigotimes_{i=1}^n |s^{(\mathrm{in})}_{\ell,i}\rangle$, passes it through $\mathcal{E}$, and performs a randomized Pauli measurement [46,47] on the output, collapsing it to a product state $\bigotimes_{i=1}^n |s^{(\mathrm{out})}_{\ell,i}\rangle$, where $|s^{(\mathrm{in})}_{\ell,i}\rangle, |s^{(\mathrm{out})}_{\ell,i}\rangle \in \mathrm{stab}_1$, the set of single-qubit stabilizer states. Each product state is represented classically with $\mathcal{O}(n)$ bits. Hence, the classical dataset $S_N(\mathcal{E})$ is of size $\mathcal{O}(nN)$ bits. The classical dataset can be seen as one way to generalize the notion of classical shadows of quantum states [46] to quantum processes. Our goal is to design an ML algorithm that can learn an approximate model of $\mathcal{E}$ from the classical dataset $S_N(\mathcal{E})$, such that, for a wide range of states $\rho$ and observables $O$, the ML model can predict a real value $h(\rho, O)$ that is approximately equal to $\mathrm{tr}(O\mathcal{E}(\rho))$.
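To make the data collection concrete, the following Python sketch (our own illustrative code; for simplicity the unknown process is taken to be a unitary matrix acting on a state vector) simulates one randomized experiment. Each (basis, outcome) pair encodes one single-qubit stabilizer state $|s^{(\mathrm{out})}_{\ell,i}\rangle$.

```python
import numpy as np

H_GATE = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # X-basis rotation
HS_DG = np.array([[1, -1j], [1, 1j]], dtype=complex) / np.sqrt(2)  # Y-basis rotation
BASIS_ROT = {"X": H_GATE, "Y": HS_DG, "Z": np.eye(2, dtype=complex)}

def apply_1q(gate, psi, qubit, n):
    """Apply a single-qubit operator to `qubit` of an n-qubit state vector."""
    psi = np.asarray(psi).reshape([2] * n)
    psi = np.moveaxis(np.tensordot(gate, psi, axes=([1], [qubit])), 0, qubit)
    return psi.reshape(-1)

def randomized_pauli_measurement(psi, n, rng):
    """Measure each qubit in a random Pauli basis; return per-qubit (basis, outcome)."""
    record = []
    for q in range(n):
        basis = rng.choice(["X", "Y", "Z"])
        psi = apply_1q(BASIS_ROT[basis], psi, q, n)   # rotate Pauli basis to Z basis
        p0 = np.sum(np.abs(np.take(psi.reshape([2] * n), 0, axis=q)) ** 2)
        outcome = 0 if rng.random() < p0 else 1
        proj = np.zeros((2, 2), dtype=complex)
        proj[outcome, outcome] = 1.0
        psi = apply_1q(proj, psi, q, n)               # collapse onto the outcome
        psi = psi / np.linalg.norm(psi)
        record.append((basis, outcome))               # encodes |s_out> for this qubit
    return record

# One dataset entry, with psi_in from the product-state sampler above and U the
# (hypothetical) unknown unitary process:
#   record = randomized_pauli_measurement(U @ psi_in, n, rng)
```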

A. ML algorithm
We are now ready to state the proposed ML algorithm. At a high level, the ML algorithm learns a low-degree approximation to the unknown $n$-qubit CPTP map $\mathcal{E}$. Despite the simplicity of the ML algorithm, several ideas go into its design and into the proof of the rigorous performance guarantee. These ideas are presented in Sec. IV below.
Let $O$ be an observable with $\|O\|\le 1$ that is written as a sum of few-body observables, where each qubit is acted on by $\mathcal{O}(1)$ of the few-body observables. We denote the Pauli representation of $O$ as $O = \sum_{Q\in\{I,X,Y,Z\}^{\otimes n}} a_Q Q$. By the definition of $O$, there are $\mathcal{O}(n)$ nonzero Pauli coefficients $a_Q$. We consider a hyperparameter $\tilde\epsilon>0$; roughly speaking, $\tilde\epsilon$ will scale inverse polynomially in the dataset size $N$ from Eq. (12) below. For every Pauli observable $P\in\{I,X,Y,Z\}^{\otimes n}$ with $|P|\le k = \Theta(\log(1/\epsilon))$, the algorithm computes an empirical estimate for the corresponding Pauli coefficient via
$$\hat\alpha_P(O) = \frac{1}{N}\sum_{\ell=1}^{N}\Bigg[\prod_{i:P_i\neq I} 3\,\langle s^{(\mathrm{in})}_{\ell,i}|P_i|s^{(\mathrm{in})}_{\ell,i}\rangle\Bigg]\Bigg[\sum_{Q} a_Q \prod_{i:Q_i\neq I} 3\,\langle s^{(\mathrm{out})}_{\ell,i}|Q_i|s^{(\mathrm{out})}_{\ell,i}\rangle\Bigg],$$
sets to zero every estimate whose magnitude falls below the threshold $\tilde\epsilon$, and outputs the prediction $h(\rho, O) = \sum_{P:|P|\le k}\hat\alpha_P(O)\,\mathrm{tr}(P\rho)$. With a proper implementation, the computational time is $\mathcal{O}(kn^kN)$. Note that, to make predictions, the ML algorithm only needs the $k$-body reduced density matrices ($k$-RDMs) of $\rho$. The $k$-RDMs of $\rho$ can be efficiently obtained by performing randomized Pauli measurements on $\rho$ and using the classical shadow formalism [46,47]. Except for this step, which may require quantum computation, all other steps of the ML algorithm only require classical computation. Hence, if the $k$-RDMs of $\rho$ can be computed classically then we have a classical ML algorithm that can predict an arbitrary quantum process $\mathcal{E}$ after learning from data.
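The following Python sketch mirrors this estimate-threshold-predict pipeline. It is our own reconstruction: the factor-of-3 structure follows the estimator above, but the exact normalization and thresholding rule should be checked against Eqs. (8)-(10) of the paper. Stabilizer states are encoded by labels such as 'x+', and Pauli observables by dictionaries mapping qubit indices to 'X', 'Y', or 'Z'.

```python
def stab_expectation(label, pauli):
    """<s|P|s> for a stabilizer state labeled 'x+', 'y-', 'z+', ... and P in 'XYZ'."""
    return (1.0 if label[1] == "+" else -1.0) if label[0].upper() == pauli else 0.0

def shadow_factor(labels, pauli_dict):
    """Product of 3<s_i|P_i|s_i> over the nontrivial factors of a Pauli observable."""
    w = 1.0
    for q, p in pauli_dict.items():
        w *= 3.0 * stab_expectation(labels[q], p)
    return w

def estimate_alpha(data, P, observable):
    """Empirical estimate alpha_hat_P(O); `observable` is a list of (a_Q, Q) pairs,
    and `data` is a list of (input labels, output labels) pairs from S_N(E)."""
    return sum(
        shadow_factor(s_in, P) * sum(a * shadow_factor(s_out, Q) for a, Q in observable)
        for s_in, s_out in data
    ) / len(data)

def predict(alpha_hat, pauli_expectations, eps_tilde):
    """h(rho, O): keep only estimates above the threshold; requires tr(P rho)
    for all |P| <= k, keyed the same way as alpha_hat (e.g., by tuples)."""
    return sum(a * pauli_expectations[P] for P, a in alpha_hat.items() if abs(a) > eps_tilde)
```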

B. Rigorous guarantee
To measure the prediction error of the ML model, we consider the average-case prediction performance under an arbitrary $n$-qubit state distribution $\mathcal{D}$ invariant under single-qubit Clifford gates, meaning that the probability density $f_{\mathcal{D}}(\rho)$ of sampling a state $\rho$ is equal to the density $f_{\mathcal{D}}(U\rho U^\dagger)$ of sampling $U\rho U^\dagger$ for any single-qubit Clifford gate $U$. We call such a distribution locally flat.
Theorem 1 (Learning an unknown quantum process). Suppose that $\epsilon, \epsilon' = \Theta(1)$ and that there is a training set $S_N(\mathcal{E})$ of size $N = \mathcal{O}(\log n)$ as specified in Eq. (7). With high probability, the ML model can learn a function $h(\rho, O)$ from $S_N(\mathcal{E})$ such that, for any distribution $\mathcal{D}$ over $n$-qubit states invariant under single-qubit Clifford gates and for any bounded-degree observable $O$ with $\|O\|\le 1$, the mean squared prediction error $\mathbb{E}_{\rho\sim\mathcal{D}}\,|h(\rho,O)-\mathrm{tr}(O\mathcal{E}(\rho))|^2$ is at most $\epsilon\,\|O'\|^2$ plus an additive term controlled by $\epsilon'$, where $O'$ is the low-degree truncation [of degree $k = \log_{1.5}(1/\epsilon)$] of the observable $O$ after the Heisenberg evolution under $\mathcal{E}$. The training and prediction times of $h(\rho, O)$ are both polynomial in $n$. When $\epsilon$ is small and $\epsilon' = 0$, the data size $N$ and the computational time scale as $2^{\mathcal{O}(\log(1/\epsilon)\log(n))}$.
The detailed theorem statement and the proof of the theorem are given in Appendix E. An interesting aspect of the above theorem is that the states sampled from distribution $\mathcal{D}$ can be highly entangled, even though the training data $S_N(\mathcal{E})$ only contain information about random product states. From the theorem, we can see that if $\|O'\| = \mathcal{O}(1)$ then we only need $\mathcal{O}(\log(n))$ samples to obtain a constant prediction error. Otherwise, $\mathcal{O}(\log(n))$ samples are still enough to guarantee a constant prediction error relative to $\|O'\|^2$. The precise scaling is as follows. For the data size $N$ given in Eq. (12), the computational time to learn and predict $h(\rho, O)$ is bounded above by $\mathcal{O}(kn^kN)$, and the prediction error is bounded as in Eq. (13). As we take $\epsilon'$ to zero, we can remove the dependence on the low-degree truncation $O'$. In this setting, $N$ and the computation time both become $2^{\mathcal{O}(\log(1/\epsilon)\log(n))}$, which is polynomial in $n$ if $\epsilon = \Theta(1)$ and quasipolynomial in $n$ if $\epsilon = 1/\mathrm{poly}(n)$. For a distribution $\mathcal{D}$ that is not locally flat, we can consider the locally flat distribution $\mathcal{D}^*$ that is closest to $\mathcal{D}$ under the distance defined by the maximum likelihood ratio $\gamma := \sup_\rho\,[p_{\mathcal{D}}(\rho)/p_{\mathcal{D}^*}(\rho)]$, where $p_{\mathcal{D}}(\rho)$ is the probability density of $\rho$ under $\mathcal{D}$. The average prediction error under $\mathcal{D}$ is then at most $\gamma$ times the average prediction error under $\mathcal{D}^*$. Hence, if the distance $\gamma$ is at most $\mathrm{poly}(n)$, then the prediction error under $\mathcal{D}$ is small using quasipolynomial sample complexity and computational time.
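The transfer to distributions close to locally flat is a one-line bound: for the nonnegative loss $L(\rho) = |h(\rho,O)-\mathrm{tr}(O\mathcal{E}(\rho))|^2$ and the ratio $\gamma$ defined above,

```latex
\mathbb{E}_{\rho\sim\mathcal{D}}\, L(\rho)
  = \int L(\rho)\, p_{\mathcal{D}}(\rho)\, d\rho
  \le \gamma \int L(\rho)\, p_{\mathcal{D}^*}(\rho)\, d\rho
  = \gamma\, \mathbb{E}_{\rho\sim\mathcal{D}^*}\, L(\rho).
```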

IV. PROOF IDEAS
The proof of the rigorous performance guarantee for the proposed ML algorithm consists of five parts. The first two parts, presented in Appendices A and B, are a detour to establish a few fundamental and useful norm inequalities for Hamiltonians and observables. The latter three parts, given in Appendices C, D, and E, apply the newly established norm inequalities to three learning tasks. In the following, we present the basic ideas in each part.

A. Improved approximation algorithms for optimizing local Hamiltonians
We begin with a different task, namely, optimizing local Hamiltonians. We are given an $n$-qubit $k$-local Hamiltonian $H = \sum_{P\in\{I,X,Y,Z\}^{\otimes n}:|P|\le k}\alpha_P P$, where $|P|$ is the weight of the Pauli operator $P$, the number of qubits upon which $P$ acts nontrivially. Our goal is to find a state $|\psi\rangle$ that maximizes or minimizes $\langle\psi|H|\psi\rangle$. This task is related to solving ground states [48,49] when we consider minimizing $\langle\psi|H|\psi\rangle$ and to quantum optimization [43,44,50-54] when we consider maximizing $\langle\psi|H|\psi\rangle$. We give a general randomized approximation algorithm in Appendix A for producing a random product state $|\psi\rangle$ that either approximately minimizes or approximately maximizes a $k$-local Hamiltonian $H$, with a rigorous upper or lower bound based on the Pauli coefficients $\alpha_P$ of $H$. The proposed optimization algorithm applies to various classes of Hamiltonians and is inspired by the proofs of Littlewood's 4/3 inequality [55] and the Bohnenblust-Hille inequality [56]. For classes that have been studied previously [41-44], the proposed algorithm yields an improved bound. Our improvement crucially stems from our construction of the random state $|\psi\rangle$. In Refs. [41-43] the authors utilize a random-restriction approach, where a random subset of qubits is fixed to random values and the remaining qubits are optimized. We instead utilize a polarization approach, where we replicate each qubit many times, randomly fix all except the last replica, optimize the last replica, and combine using a random-signed averaging. A detailed comparison is given in Appendices A 1 c and A 2.
Two corollaries of the main optimization theorem, for general and for bounded-degree $k$-local Hamiltonians, are stated in Appendix A; each guarantees an improvement over the Haar-random average energy proportional to an $\ell_r$ norm of the Pauli coefficients $\alpha_P$, for some constant $C$. We note that in the above results we cannot control whether our algorithm outputs an approximate maximizer or an approximate minimizer. This caveat stems from the use of polarization, where the random-signed averaging only guarantees improvement in one of the two directions. Modifying our approach to address this issue is an interesting direction for future work.

B. Norm inequalities from approximate optimization algorithms
The bridge that connects the optimization of $k$-local Hamiltonians and efficient learning of quantum states and processes is a set of norm inequalities. A norm that characterizes the efficiency of learning is the Pauli-$p$ norm, defined as the $\ell_p$ norm of the Pauli coefficients of a Hamiltonian $H = \sum_P\alpha_P P$, namely $\|H\|_{\mathrm{Pauli},p} = \big(\sum_P|\alpha_P|^p\big)^{1/p}$. The rigorous guarantees from the previous section, namely, on finding a state $|\psi\rangle$ whose energy is higher or lower than that of a Haar-random state by a margin that depends on the Pauli coefficients $\alpha_P$, give an algorithmic proof that the spectral norm $\|H\|$ and the Pauli coefficients $\alpha_P$ are related. The proof of this relation is given in Appendix B. In particular, for general and bounded-degree $k$-local Hamiltonians, we can use the rigorous guarantee from the approximation algorithms to obtain the following norm inequalities. Corollary 3 proves the conjecture given in Ref. [45].

Corollary 3 (Norm inequality for general $k$-local Hamiltonians). Given an $n$-qubit $k$-local Hamiltonian $H$, we have $\|H\|_{\mathrm{Pauli},\,2k/(k+1)} \le C(k)\,\|H\|$, where $C(k)$ is a constant depending only on the locality $k$.

Corollary 4 (Norm inequality for bounded-degree local Hamiltonians). Given an $n$-qubit $k$-local Hamiltonian $H$ with bounded degree $d$, we have $\|H\|_{\mathrm{Pauli},1} \le C(k,d)\,\|H\|$, where $C(k,d)$ is a constant depending only on $k$ and $d$.

C. Sample-optimal algorithm for predicting bounded-degree observables
As the first application of the above norm inequalities to learning, we consider the basic problem of predicting many properties of an unknown $n$-qubit state $\rho$. Given $M$ observables $O_1,\ldots,O_M$, after performing measurements on multiple copies of $\rho$, we would like to predict $\mathrm{tr}(O_i\rho)$ to error $\epsilon$ for all $i\in\{1,\ldots,M\}$. This is the task known as shadow tomography [46,57,58]. One approach for obtaining practically efficient algorithms for shadow tomography is via the classical shadow formalism [46].
We consider a physically relevant class of observables, where each observable $O_i = \sum_j O_{ij}$ is a sum of few-body observables $O_{ij}$ and each qubit is acted on by $\mathcal{O}(1)$ of the few-body observables. Despite significant recent progress in shadow tomography [8,33,57,59-70], the sample complexity (number of copies of $\rho$) for predicting this class of observables has not been established. The central challenge is the appearance of the Pauli-1 norm $\|O_i\|_{\mathrm{Pauli},1}$ when characterizing the sample complexity. In particular, one can bound the shadow norm $\|O_i\|_{\mathrm{shadow}}$ [46], which controls the sample complexity, by the Pauli-1 norm $\|O_i\|_{\mathrm{Pauli},1}$ up to a constant factor. Using the new norm inequality established in this work, we give a sample-optimal algorithm for predicting bounded-degree observables.
The sample-optimal algorithm is equivalent to performing classical shadow tomography based on randomized Pauli measurements [46,47], and is essentially the ML algorithm given in Sec. III A with a fixed input state. Consider an unknown $n$-qubit state $\rho$. After performing $N$ randomized Pauli measurements on $N$ copies of $\rho$, we have a classical dataset $S_N(\rho) = \{|s^{(\mathrm{out})}_{\ell,i}\rangle : i\in[n]\}_{\ell=1}^N$, where each $|s^{(\mathrm{out})}_{\ell,i}\rangle\in\mathrm{stab}_1$ is a single-qubit stabilizer state. Given an observable $O = \sum_Q a_Q Q$, the algorithm predicts $\hat{o} = \frac{1}{N}\sum_{\ell=1}^N\sum_Q a_Q\prod_{i:Q_i\neq I} 3\,\langle s^{(\mathrm{out})}_{\ell,i}|Q_i|s^{(\mathrm{out})}_{\ell,i}\rangle$. The following theorem shows that the above algorithm achieves the optimal sample complexity among all algorithms, including those that can perform collective measurements on many copies of $\rho$.
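The estimator's key property is single-copy unbiasedness: $\mathbb{E}\big[3\langle s^{(\mathrm{out})}|P|s^{(\mathrm{out})}\rangle\big] = \mathrm{tr}(P\rho)$ for any nonidentity single-qubit Pauli $P$. The following small numeric check (our own code, not from the paper) verifies this for one qubit.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = {"X": X, "Y": Y, "Z": Z}

def estimate_tr_P_rho(rho, pauli, shots, rng):
    """Average of 3<s|P|s> over randomized single-qubit Pauli measurements of rho."""
    total = 0.0
    for _ in range(shots):
        basis = rng.choice(["X", "Y", "Z"])                    # random measurement basis
        vals, vecs = np.linalg.eigh(PAULIS[basis])
        probs = np.real([v.conj() @ rho @ v for v in vecs.T])  # Born probabilities
        s = vecs[:, rng.choice(2, p=probs / probs.sum())]      # post-measurement state
        total += 3.0 * np.real(s.conj() @ PAULIS[pauli] @ s)
    return total / shots

rng = np.random.default_rng(1)
rho = np.array([[0.7, 0.2 - 0.1j], [0.2 + 0.1j, 0.3]])         # a valid single-qubit state
print(estimate_tr_P_rho(rho, "Z", 100_000, rng))               # converges to tr(Z rho) = 0.4
```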
Theorem 3 (Sample complexity lower bound). Consider the following task. There is an unknown $n$-qubit state $\rho$, and we are given $M$ observables $O_1,\ldots,O_M$ with $\max_i\|O_i\|\le B_\infty$. Each observable $O_i$ is a sum of few-body observables, where every qubit is acted on by $\mathcal{O}(1)$ of the few-body observables. We would like to estimate $\mathrm{tr}(O_i\rho)$ to error $\epsilon$ for all $i\in[M]$ with high probability by performing arbitrary collective measurements on $N$ copies of $\rho$. For any algorithm to succeed in this task, the number of copies $N$ must be at least the lower bound stated in Eq. (27).
The detailed proofs of the sample complexities stated in the above theorems are given in Appendix C.

D. Efficient algorithms for learning an unknown observable from log(n) samples
As a second learning application of the norm inequalities, we consider the task of learning an unknown $n$-qubit observable $O^{(\mathrm{unk})} = \sum_{P\in\{I,X,Y,Z\}^{\otimes n}}\alpha_P P$. We can think of this unknown observable as $\mathcal{E}^\dagger(O)$, i.e., the observable $O$ after Heisenberg evolution under the unknown process $\mathcal{E}$. Suppose that we are given a training dataset $\{\rho_\ell, \mathrm{tr}(O^{(\mathrm{unk})}\rho_\ell)\}_{\ell=1}^N$, where each $\rho_\ell$ is sampled from an arbitrary distribution $\mathcal{D}$ over $n$-qubit states that is invariant under single-qubit Clifford gates. Given an integer $k>0$, we define the weight-$k$ truncation of $O^{(\mathrm{unk})}$ to be the Hermitian operator $O^{(\mathrm{unk},k)} = \sum_{P:|P|\le k}\alpha_P P$, where $|P|$ is the number of qubits upon which $P$ acts nontrivially. For a small $k$, we can think of $O^{(\mathrm{unk},k)}$ as a low-weight approximation of the unknown observable $O^{(\mathrm{unk})}$. By definition, $O^{(\mathrm{unk},k)}$ is a $k$-local Hamiltonian; hence, the norm inequality in Corollary 3 shows that $\|O^{(\mathrm{unk},k)}\|_{\mathrm{Pauli},r} \le C(k)\,\|O^{(\mathrm{unk},k)}\|$, where $r = 2k/(k+1)\in[1,2)$. An $\ell_r$-norm bound ($r<2$) on the Pauli coefficients implies that we can remove most of the small Pauli coefficients without incurring too much change under the $\ell_2$ norm. As an example, consider an $M$-dimensional vector $x$ with $\|x\|_r\le 1$. Given $\epsilon>0$, let $x'$ be the $M$-dimensional vector with $x'_i = x_i$ whenever $|x_i|>\epsilon$ and $x'_i = 0$ otherwise. Then $\|x-x'\|_2^2 = \sum_{i:|x_i|\le\epsilon}|x_i|^2 \le \epsilon^{2-r}\sum_i|x_i|^r \le \epsilon^{2-r}$, while $x'$ has at most $\epsilon^{-r}$ nonzero entries. In Appendix D 1, we show that the average error (both the mean squared error and the mean absolute error) is characterized by the $\ell_2$ norm. Hence, Eq. (29) implies that we can set most of the Pauli coefficients in $O^{(\mathrm{unk},k)}$ to zero without incurring too much error on average. Using the above reasoning, learning the low-weight truncation $O^{(\mathrm{unk},k)}$ amounts to learning the large Pauli coefficients of $O^{(\mathrm{unk},k)}$ and setting all small Pauli coefficients to zero. This ensures that the learning can be done very efficiently. This approach is presented in Appendix D 2, with the main result stated in Lemma 18. It is inspired by the learning algorithm of Ref. [71], which achieves a logarithmic sample complexity for learning classical low-degree functions.
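The truncation argument is easy to probe numerically; the following snippet (ours) verifies both claims for a random $\ell_r$-normalized vector.

```python
import numpy as np

rng = np.random.default_rng(2)
r, eps, M = 4 / 3, 0.05, 10_000
x = rng.standard_normal(M) ** 3                      # heavy-tailed entries
x /= np.sum(np.abs(x) ** r) ** (1 / r)               # normalize so ||x||_r = 1
x_trunc = np.where(np.abs(x) > eps, x, 0.0)          # zero out small coefficients
print(np.sum((x - x_trunc) ** 2) <= eps ** (2 - r))  # True: squared l2 change <= eps^(2-r)
print(np.count_nonzero(x_trunc) <= eps ** (-r))      # True: at most eps^(-r) survivors
```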
The last step in the proof is to argue that the low-weight truncation $O^{(\mathrm{unk},k)}$ is a good surrogate for the unknown observable $O^{(\mathrm{unk})}$ when the goal is to predict $\mathrm{tr}(O^{(\mathrm{unk})}\rho)$.
The key insight here is that, for distributions $\mathcal{D}$ that are invariant under single-qubit Clifford gates, the contribution of any Pauli term $P$ in $O^{(\mathrm{unk})}$ to $\mathbb{E}_{\rho\sim\mathcal{D}}[\mathrm{tr}(O^{(\mathrm{unk})}\rho)^2]$ decays exponentially in the weight $|P|$. This allows us to prove that the mean squared error incurred by replacing $O^{(\mathrm{unk})}$ with its weight-$k$ truncation $O^{(\mathrm{unk},k)}$ also decays exponentially in $k$. Putting these ingredients together, we arrive at the following theorem. As stated in the theorem, the learning algorithm is computationally efficient.
Theorem 4 (Learning an unknown observable). Suppose that we are given a training dataset $\{\rho_\ell, \mathrm{tr}(O^{(\mathrm{unk})}\rho_\ell)\}_{\ell=1}^N$, where each $\rho_\ell$ is sampled from a locally flat distribution $\mathcal{D}$. With probability at least $1-\delta$, we can learn a function $h(\rho)$ achieving the mean squared error bound stated in Appendix D. The training and prediction times of $h(\rho)$ are $\mathcal{O}(Nn^k)$.
The factor of $\|O^{(\mathrm{unk})}\|^2$ in the prediction error sets the natural scale of the squared error. From the theorem, we can see that we only need $\mathcal{O}(\log(n))$ samples to obtain a constant prediction error relative to $\|O^{(\mathrm{unk})}\|^2 + \|O^{(\mathrm{unk},k)}\|_{\mathrm{Pauli},r}^{\,r}\,\|O^{(\mathrm{unk})}\|^{2-r}$. The proof of the theorem and the detailed description of the ML algorithm are given in Appendix D.

E. Learning an unknown quantum process
The ML algorithm for learning an unknown $n$-qubit quantum process $\mathcal{E}$ is essentially the combination of the two learning applications described above, with a few modifications. At a high level, we consider the following. There is an $n$-qubit state $\rho$ sampled from an unknown distribution $\mathcal{D}$, as well as an observable $O$ that can be written as a sum of few-body observables, where each qubit is acted on by a constant number of the few-body observables. In the first stage, we use the sample-optimal algorithm for predicting the bounded-degree observable $O$ on the unknown output states $\mathcal{E}(\rho_\ell)$, thus transforming the classical dataset $S_N(\mathcal{E})$ in Eq. (7) into a dataset $\{\rho_\ell, \hat{o}_\ell \approx \mathrm{tr}(O\mathcal{E}(\rho_\ell))\}_{\ell=1}^N$ [Eq. (33)] that maps quantum states to real numbers. In the second stage, we apply the efficient algorithm for learning an unknown observable $O^{(\mathrm{unk})} = \mathcal{E}^\dagger(O)$, regarding Eq. (33) as the training data for this task, thus predicting $\mathrm{tr}(\mathcal{E}^\dagger(O)\rho) = \mathrm{tr}(O\mathcal{E}(\rho))$ for states $\rho$ drawn from distribution $\mathcal{D}$. Because both stages of the algorithm run in time polynomial in $n$, the overall runtime for this procedure is polynomial in $n$.
In our actual proofs, there are a few deviations from the above high-level design, stemming from the fact that the input states $\rho_\ell$ are tensor products of random single-qubit stabilizer states. This specific setting allows a few simplifications to be made. With the simplifications, we can remove an additive error term in the prediction error. Furthermore, a surprising fact is that learning from random product states is sufficient to predict highly entangled states sampled from any distribution $\mathcal{D}$ invariant under single-qubit Clifford unitaries. This surprising fact is a result of the characterization of the prediction error given in Lemma 14, based on a modified purity on subsystems of an input quantum state $\rho\sim\mathcal{D}$.
By combining the five parts, we can establish Theorem 1, the precise sample complexity scaling in Eq. (12), and the prediction error bound in Eq. (13). The full proof is given in Appendix E.

V. NUMERICAL EXPERIMENTS
We have conducted numerical experiments to assess the performance of ML models in learning the dynamics of several physical systems. The results corroborate our theoretical claims that long-time evolution over a many-body system can be learned efficiently. While our theorem only guarantees good performance for randomly sampled input states, we also find that the ML models work very well for structured input states that could be of practical interest. The source code is available from a public GitHub repository [72]. We note that all prior tomographic protocols that can learn an arbitrary quantum process require a sample complexity that scales exponentially in $n$. The strong numerical performance demonstrated here raises the hope that the synthesis of existing tomographic techniques and the low-weight truncation proposed in this work will enable a more powerful ML method for predicting arbitrary quantum processes.
We focus on training ML models to predict output-state properties after the time dynamics of one-dimensional (1D) $n$-spin XY and Ising chains with homogeneous or disordered $Z$ fields. Let $H$ be the many-body Hamiltonian. The quantum process $\mathcal{E}$ is given by $\mathcal{E}(\rho) = e^{-iHt}\rho e^{iHt}$ for a significantly long evolution time $t = 10^6$. We consider the ML models described by Eq. (10). While we utilize the very simple sparsity-enforcing strategy of setting small values to zero to prove Theorem 1, the standard sparsity-enforcing approach is through $\ell_1$ regularization [73]. A detailed description of applying $\ell_1$ regularization to enforce sparsity in $\hat\alpha_P(O)$ is given in Appendix F. We find the best hyperparameters using fourfold cross-validation to minimize the root-mean-square error (RMSE) and report the predictions on a test set.
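A minimal sketch of the $\ell_1$-regularized fit, using scikit-learn's Lasso rather than the paper's released implementation [72]: here the features $\phi_\ell[P] = \prod_{i:P_i\neq I} 3\langle s^{(\mathrm{in})}_{\ell,i}|P_i|s^{(\mathrm{in})}_{\ell,i}\rangle$ and the regression targets $\hat{o}_\ell$ follow the notation of Sec. III A.

```python
from sklearn.linear_model import Lasso

def fit_pauli_coefficients(features, targets, lam):
    """features[l, j]: input-state feature phi_l[P_j]; targets[l]: shadow estimate o_hat_l."""
    model = Lasso(alpha=lam, fit_intercept=False)
    model.fit(features, targets)
    return model.coef_          # l_1 penalty drives most coefficients exactly to zero
```

The regularization strength `lam` plays a role analogous to the threshold $\tilde\epsilon$ and would be selected by fourfold cross-validation, as described above.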
Figure 2 considers the performance for predicting the expectation value of the Pauli-$Z$ operator $Z_i$ on the output state for randomly sampled product input states not in the training data. Figure 2(a) illustrates the many-body Hamiltonian $H$. Figure 2(b) shows the dependence of the error on the training set size $N$. We can clearly see that, as the training set size $N$ increases, the prediction error notably decreases. This observation confirms our theoretical claim that long-time quantum dynamics can be learned efficiently. In Fig. 2(c), we consider how the evolution time $t$ affects prediction performance. From the figure, we can see that, even when we exponentially increase $t$, the prediction performance remains similar. This matches our theorem, which states that no matter what the quantum process $\mathcal{E}$ is, even if $\mathcal{E}$ is an exponentially long-time dynamics, the ML model can still predict accurately and efficiently. In Fig. 2(d), we consider the dependence on the system size $n$. As $n$ increases linearly, the Hilbert space dimension $2^n$ grows exponentially. Despite the exponential growth, even for 50-spin systems, the ML model still predicts well. This matches the logarithmic scaling in $n$ given in Theorem 1.
In Fig. 3, we consider predicting properties of the final state after long-time dynamics for a highly structured input product state, which has a single domain wall in the middle.

[Figure caption residue: 1D $n$-spin chain; XY model with homogeneous ($h_i = 0.5$) or disordered $Z$ field. The ML model is trained on 10 000 random product states and performs accurately over a significantly large range of evolution times $t$.]
We focus on predicting the expected value of $Z_i(t) = e^{iHt}Z_i e^{-iHt}$ on every spin in the 1D 50-spin XY chain with a homogeneous $Z$ field $h_i = 0.5$, and consider evolution times $t$ from 0 to $10^6$. We train the ML model using $N = 10\,000$ random input product states. We can see that the ML model predicts very well for this highly structured product state. The collapse of the domain wall is accurately predicted by the ML model despite its only seeing outcomes from random unstructured product states. This numerical experiment suggests that the performance of the ML model goes beyond Theorem 1, which only guarantees accurate prediction on average. Theorem 1 states that the ML model can predict well on highly entangled input states after learning only from random product-state inputs. We test this claim in Fig. 4 by considering an entangled input state $|\psi_e\rangle$. The left $n/2$ spins of state $|\psi_e\rangle$ exhibit Greenberger-Horne-Zeilinger (GHZ)-like entanglement, which requires a linear-depth 1D quantum circuit to prepare. The right $n/2$ spins of $|\psi_e\rangle$ form a product state with spins rotating clockwise from left to right. Combining the left and right halves, state $|\psi_e\rangle$ cannot be generated by a short-depth 1D quantum circuit. We can see that, for this entangled input state, the ML model trained on random product states still predicts very well across a broad range of evolution times $t$.

VI. OUTLOOK
The theorem established in this work shows that learning to predict a complex quantum process can be achieved with computationally efficient ML algorithms. Once we have obtained training data by accessing the unknown process $\mathcal{E}$ sufficiently many times, the proposed ML algorithm is entirely classical except for the step of obtaining the RDMs of the input state $\rho$, which may require quantum computation. This algorithm is reminiscent of recent proposals for quantum ML based on kernel methods [2,3,29], in particular the projected quantum kernel [29]. This result highlights the potential for using hybrid quantum-classical ML algorithms to learn to model exotic quantum dynamics occurring in nature.
The results presented in this work also have implications for several previously studied problems. Prior works [7,11,12] have proposed to train quantum ML models on a given quantum process with the hope that the learned model can be faster than the process itself. Our proof that one can always train an ML model that runs in quasipolynomial time, even for exponential-time quantum dynamics, provides rigorous support for such a hope. When the few-body RDMs of the input state $\rho$ are hard to compute classically, the proposed ML algorithm can be seen as a variant of the projected quantum kernel method [29]. When the few-body RDMs of the input state $\rho$ are easy to compute classically, the proposed ML model can run efficiently on a classical computer. Hence, this result provides a rigorous foundation for empirical works using classical ML to learn and simulate quantum dynamics [27,74-76]. When $\mathcal{E}$ is a parameterized quantum circuit $U_\theta$, such as a quantum neural network [3,4,6,9,29,37], the existence of a classical ML model that can efficiently predict the output of $U_\theta$ implies that the function $\mathrm{tr}(OU_\theta\rho U_\theta^\dagger)$ is easy to represent and learn on a classical computer. This finding shows that quantum circuits do not have strong representational power for various distributions over quantum state inputs $\rho$ with easy-to-compute RDMs.
Several open problems remain to be answered. While we focus only on locally flat distributions $\mathcal{D}$, we believe that efficient ML algorithms also exist for other general classes of distributions. An important open problem is hence the following: can we obtain computationally efficient learning algorithms for any "smooth" distribution over quantum state space? If not, how general can the class of distributions be? Similar questions can be asked about the class of observables that we predict. For what general classes of observables $O$ can one predict efficiently, in terms of both sample size and computation time? This problem is closely related to the problem of when shadow tomography [57,58,77] can be made computationally efficient. Other important questions include the following. If we restrict the quantum process $\mathcal{E}$ to be generated in polynomial time, can we obtain improved efficiency? What efficiency guarantees apply to fermionic or bosonic systems? A better understanding of these problems would illuminate the ultimate power of classical and quantum ML algorithms for learning about physical dynamics.
ACKNOWLEDGMENTS

This work was supported in part by the U.S. Department of Energy, Office of Science, National Quantum Information Science Research Centers, Quantum Systems Accelerator, and the National Science Foundation (PHY-1733907). The Institute for Quantum Information and Matter is an NSF Physics Frontiers Center.

APPENDIX A: OPTIMIZING A k-LOCAL HAMILTONIAN WITH RANDOM PRODUCT STATES
While our goal is to design a good ML algorithm with low sample complexity, this appendix is a detour to a different task, the optimization of a $k$-local Hamiltonian. We present an improved approximation algorithm for optimizing any $k$-local Hamiltonian. The central result in this detour will become useful for showing the low sample complexity of several ML algorithms.

Task 1 (Optimizing a quantum Hamiltonian). Given $n, k\ge 1$ and an $n$-qubit $k$-local Hamiltonian $H = \sum_{P\in\{I,X,Y,Z\}^{\otimes n}:|P|\le k}\alpha_P P$, where $|P|$ is the number of nonidentity components in $P$, find a state $|\psi\rangle$ that maximizes or minimizes $\langle\psi|H|\psi\rangle$.

The task given above is related to solving ground states [48,49] when we consider minimizing $\langle\psi|H|\psi\rangle$ and to quantum optimization [43,44,50-54] when we consider maximizing $\langle\psi|H|\psi\rangle$. The maximization and minimization are often the same problem, since maximizing $\langle\psi|H|\psi\rangle$ is the same as minimizing $\langle\psi|(-H)|\psi\rangle$. Without further constraints, even for $k=2$, finding the optimal state $|\psi^*\rangle$ maximizing $\langle\psi|H|\psi\rangle$ is known to be QMA-hard [78]; hence, no polynomial-time algorithm is expected to exist, even on a quantum computer. Most existing works consider deterministic or randomized constructions of $|\psi\rangle$ with rigorous upper and lower bound guarantees on $\langle\psi|H|\psi\rangle$ for minimization and maximization. Some of these lower bounds [52-54] are based on the optimal value $\mathrm{OPT} = \sup_{|\psi\rangle}\langle\psi|H|\psi\rangle$, while others [43,44,51] are based on the Pauli coefficients $\alpha_P$.

a. Definition of expansion
In this section, we present a random product state construction for the optimization problem, where the rigorous upper or lower bound is based on the Pauli coefficients $\alpha_P$ and the expansion property defined below. The expansion property is defined for any Hamiltonian $H$.

b. Main theorem
With the expansion property defined, we can state the rigorous guarantee on the performance of the proposed randomized approximation algorithm for optimizing an $n$-qubit $k$-local Hamiltonian $H$. We compare with the average energy $\mathbb{E}_{|\phi\rangle:\mathrm{Haar}}[\langle\phi|H|\phi\rangle] = \alpha_I$ over Haar-random states. The randomized approximation algorithm uses an optimization over a single-variable polynomial that guarantees improvement in at least one direction (minimization or maximization).
The constant $C(c_e, d_e, k)$ is given explicitly in Appendix A; the subscript $k$ in the asymptotic notation indicates scaling when $k$ is a constant. Some observations can be made. First, the improvement over Haar-random states in Theorem 5 becomes larger when the expansion coefficient $c_e$ is smaller. Second, $(\sum_{P\neq I}|\alpha_P|^r)^{1/r}$ is the $\ell_r$ norm of the nonidentity Pauli coefficients, so, by monotonicity of $\ell_r$ norms, $(\sum_{P\neq I}|\alpha_P|^r)^{1/r}$ becomes smaller as $r$ becomes larger (corresponding to a larger $d_e$). Hence, the improvement is greater for a smaller expansion dimension $d_e$. In particular, it is helpful to contrast Eqs. (A3) and (A4) with the following basic estimate corresponding to $r=2$, which holds regardless of $c_e, d_e, k$:
$$\max\Big(\sup_{|\psi\rangle}\langle\psi|H|\psi\rangle - \alpha_I,\; \alpha_I - \inf_{|\psi\rangle}\langle\psi|H|\psi\rangle\Big) = \|H - \alpha_I I\| \ge \Big(\sum_{P\neq I}|\alpha_P|^2\Big)^{1/2}.$$
This holds for any Hamiltonian $H = \sum_P\alpha_P P$, where $\|\cdot\|$ denotes the spectral norm and $\alpha_I = \mathbb{E}_{|\phi\rangle:\mathrm{Haar}}[\langle\phi|H|\phi\rangle]$. This basic estimate shows that we can always find a state that improves by at least the $\ell_2$ norm of the $\alpha_P$, although the optimization can be computationally hard.
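For completeness, the $r = 2$ estimate follows by comparing the spectral norm with the normalized Frobenius norm (a standard argument; Pauli orthogonality gives $\mathrm{tr}(H^2)/2^n = \sum_P\alpha_P^2$):

```latex
\max\big(\lambda_{\max}(H) - \alpha_I,\; \alpha_I - \lambda_{\min}(H)\big)
  = \|H - \alpha_I I\|
  \ge \sqrt{\tfrac{1}{2^n}\,\mathrm{tr}\!\big[(H - \alpha_I I)^2\big]}
  = \Big(\sum_{P \neq I} \alpha_P^2\Big)^{1/2},
```

where the first equality uses the fact that $\alpha_I = \mathrm{tr}(H)/2^n$ is the mean eigenvalue and therefore lies between $\lambda_{\min}(H)$ and $\lambda_{\max}(H)$.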

c. An alternative version of the main theorem
By following the proof of Theorem 5 and replacing the use of Corollary 9 by Lemma 5, we can establish an alternative theorem statement (Theorem 6) that does not utilize the expansion property and guarantees an improvement over the Haar-random average energy for some constant $D$.
We can compare the above theorem with a closely related result in Ref. [43]. The following is a restatement of the approximation guarantee from Theorem 2 and Lemma 3 of Ref. [43], which is a corollary of a powerful result in Boolean function analysis [41,42] relating the maximum influence to the ability to sample a point of the Boolean hypercube on which the function value has a large magnitude. We can define the influence of qubit $i$ under Pauli matrix $p\in\{X,Y,Z\}$ as $I(i,p) = \sum_{P:P_i=p}\alpha_P^2$. Theorem 7 (Approximation guarantee from Ref. [43] for optimizing a $k$-local Hamiltonian). Given an $n$-qubit $k$-local Hamiltonian $H = \sum_{P:|P|\le k}\alpha_P P$ with $k = \mathcal{O}(1)$, there is a polynomial-time randomized algorithm that produces a random state whose expected improvement over the Haar-random average energy is bounded below in terms of the influences $I(i,p)$, for some constant $D$.
The guarantee from Ref. [43] is asymptotically optimal when the influence $I(i,p)$ is of similar magnitude for different qubits $i$ and Pauli matrices $p$. However, the approximation guarantee can be far from optimal when there is large variation in the influence $I(i,p)$ over different qubits $i$ and Pauli matrices $p$. As an example, consider a 1D $n$-qubit nearest-neighbor chain where $|\alpha_P| = 1$ for only a constant number of Pauli observables $P$ and $|\alpha_P| = 1/\sqrt{n}$ for the rest. In this example, the improvement over the Haar-random state guaranteed by our algorithm grows polynomially with $n$, while the guarantee of Ref. [43], which is suppressed by the maximum influence $\max_{j,q}I(j,q)$, remains $\Theta(1)$. Hence, when there is large variation in the influence, our guarantee improves over that of Ref. [43]. For our machine-learning applications, the removal of the dependence on the maximum influence is central. By removing the ratio $I(i,p)/\max_{j,q}I(j,q)$, we can obtain the $\ell_r$-norm dependence for an $r<2$, as given in Theorem 5.
We will later see that having the r -norm bound (for r < 2) allows a substantial reduction in the sample complexity in training machine-learning models for predicting properties.
We do want to mention that the improvement comes at the cost of a slightly worse dependence on $k = \mathcal{O}(1)$. In Theorem 7, from Ref. [43] based on Boolean function analysis [41,42], the constant is $D = 2^{-\mathcal{O}(k)}$. However, our result in Theorem 6 has $D = 2^{-\mathcal{O}(k\log k)}$. This difference stems from the construction of the random state $|\psi\rangle$. In Refs. [41-43] the authors utilize a random-restriction approach, where a random subset of variables is fixed to random values and the remaining variables are optimized. We instead utilize a polarization approach, where we replicate each variable many times, randomly fix all except the last replica, optimize the last replica, and combine using a random-signed averaging.

Corollaries of the main theorem
Here, we consider how the main theorem applies to certain classes of k-local Hamiltonians and discuss the relations of the corollaries to related works.

a. Optimizing arbitrary k-local Hamiltonians
The first corollary considers a general $k$-local Hamiltonian $H = \sum_{P:|P|\le k}\alpha_P P$. We can combine Fact 1 and the main theorem to obtain the following corollary.

Corollary 5 (Optimizing an arbitrary $k$-local Hamiltonian). Given an $n$-qubit $k$-local Hamiltonian $H = \sum_{P:|P|\le k}\alpha_P P$, the randomized algorithm produces a random state whose expected improvement over the Haar-random average energy is at least $C\,\big(\sum_{P\neq I}|\alpha_P|^{2k/(k+1)}\big)^{(k+1)/2k}$ for a constant $C$ depending only on $k$.

For $k=2$, we have $2k/(k+1) = 4/3$ and the above result resembles Littlewood's 4/3 inequality. Recall that Littlewood's 4/3 inequality states that, given $\{\beta_{i,j}\in\mathbb{C}\}_{i,j}$,
$$\Big(\sum_{i,j}|\beta_{i,j}|^{4/3}\Big)^{3/4} \le \sqrt{2}\,\sup_{|x^{(1)}_i|\le 1,\,|x^{(2)}_j|\le 1}\Big|\sum_{i,j}\beta_{i,j}\,x^{(1)}_i x^{(2)}_j\Big|.$$
For $k>2$, the above result resembles the Bohnenblust-Hille inequality, which states that, given $\{\beta_{i_1,\ldots,i_k}\in\mathbb{C}\}$,
$$\Big(\sum_{i_1,\ldots,i_k}|\beta_{i_1,\ldots,i_k}|^{2k/(k+1)}\Big)^{(k+1)/2k} \le D_k\,\sup_{|x^{(s)}_i|\le 1}\Big|\sum_{i_1,\ldots,i_k}\beta_{i_1,\ldots,i_k}\,x^{(1)}_{i_1}\cdots x^{(k)}_{i_k}\Big|$$
for some constant $D_k$ that depends only on $k$. For optimizing a general $k$-local Hamiltonian, the design of the randomized approximation algorithm is inspired by the original proof [56] of the Bohnenblust-Hille inequality from 1931, which was used to study the absolute convergence of Dirichlet series.
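Littlewood's 4/3 inequality is simple to probe numerically. The snippet below (our own check, for real coefficients, where the constant $\sqrt{2}$ is known to be sharp) compares both sides by exact enumeration over sign vectors.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n = 6
B = rng.standard_normal((n, n))
lhs = np.sum(np.abs(B) ** (4 / 3)) ** (3 / 4)
signs = [np.array(s) for s in product([-1.0, 1.0], repeat=n)]
sup = max(abs(x @ B @ y) for x in signs for y in signs)  # bilinear forms peak at vertices
print(f"{lhs:.3f} <= {np.sqrt(2) * sup:.3f}: {lhs <= np.sqrt(2) * sup}")
```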

b. Optimizing bounded-degree k-local Hamiltonians
Here, we consider a Hamiltonian given by a sum of $k$-qubit observables, where each qubit is acted on by at most $d$ of the $k$-qubit observables. This is often referred to as a $k$-local Hamiltonian with bounded degree $d$. We can combine Fact 2 and the main theorem to obtain the following corollary.

Corollary 6 (Optimizing a bounded-degree $k$-local Hamiltonian). Given an $n$-qubit $k$-local Hamiltonian $H$ with bounded degree $d$, the randomized algorithm produces a random state whose expected improvement over the Haar-random average energy is proportional to $\sum_{P\neq I}|\alpha_P|$, up to a factor depending only on $k$ and $d$, for some constant $C$.
The task of optimizing bounded-degree k-local Hamiltonians has been considered in previous work [44].
Theorem 8 (Approximation guarantee from Ref. [44]). Given an $n$-qubit 2-local Hamiltonian $H = \sum_{P:|P|\le 2}\alpha_P P$ with bounded degree $d$ and $|\alpha_P|\le 1$ for all $P$, there is a polynomial-time randomized algorithm that produces a quantum circuit generating a random maximizing state $|\psi\rangle$, as well as a random minimizing state, whose guaranteed improvement over the Haar-random average energy scales with $\sum_{P\neq I}|\alpha_P|^2$, for some constant $C$. The result from Ref. [44] considers a single-step gradient descent using a shallow quantum circuit on an initial random product state. Because $\sum_{P\neq I}|\alpha_P|^2 \le \sum_{P\neq I}\mathbb{1}[\alpha_P\neq 0]$ and $\sum_{P\neq I}|\alpha_P| \ge \sum_{P\neq I}|\alpha_P|^2$ whenever $|\alpha_P|\le 1$ for all $P$, our result in Corollary 6 improves either the maximization problem or the minimization problem over Theorem 8. For example, if we consider $\alpha_P = \Theta(1/d)$, which sets the total interaction strength on each qubit to be $\Theta(1)$, then comparing the improvement over the Haar-random state achieved by our algorithm with that of the algorithm in Ref. [44] shows that our algorithm gives a larger improvement in the scaling with the number $n$ of qubits.

Description of the randomized approximation algorithm
There are a few steps in the proposed randomized algorithm. The first step is to choose the best slice of the $k$-local Hamiltonian, splitting $H = \sum_{P:|P|\le k}\alpha_P P$ as $H = \sum_{\kappa=0}^{k}H_\kappa$ with $H_\kappa = \sum_{P:|P|=\kappa}\alpha_P P$. We choose $\kappa^*\in\{1,\ldots,k\}$ to be the $\kappa$ that maximizes $\sum_{P:|P|=\kappa}|\alpha_P|^r$, where $r = 2d_e/(d_e+1)$. This step can be performed in time $\mathcal{O}(\mathrm{nnz}(H)\,k)$, where $\mathrm{nnz}(H)$ denotes the number of nonzero Pauli coefficients of $H$.
In the second step, the algorithm samples $(\kappa^*-1)n$ Haar-random single-qubit pure states, $|\psi^{(s,j)}\rangle\in\mathbb{C}^2$ for all $s\in\{1,\ldots,\kappa^*-1\}$ and $j\in\{1,\ldots,n\}$ (A23). This step can be performed in time $\mathcal{O}(nk)$. The third step is a local optimization on each qubit based on the sampled states $|\psi^{(s,j)}\rangle$. For each qubit $i$ and Pauli matrix $p\in\{X,Y,Z\}$, we define an $(n-1)$-qubit homogeneous $(\kappa^*-1)$-local Hermitian operator $H_{\kappa^*,i,p}$ [Eq. (A24)] collecting the terms of $H_{\kappa^*}$ that act on qubit $i$ via $p$. For each qubit $i$ and $p\in\{X,Y,Z\}$, the algorithm computes the real value $\beta_{i,p}$ [Eq. (A25)], an expectation value involving $H_{\kappa^*,i,p}$ evaluated on the sampled random product states. Then, for each qubit $j$, we consider a single-qubit local optimization [Eq. (A26)] whose optimizer is the pure state with Bloch vector $n_p = \beta_{j,p}/\sqrt{\sum_q\beta_{j,q}^2}$ for $p\in\{X,Y,Z\}$. After the optimization, the algorithm samples random signs $\sigma_s\in\{\pm 1\}$ for all $s\in\{1,\ldots,\kappa^*\}$ to define a one-dimensional parameterized family of $n$-qubit product states. We denote this family by $\rho(t)$ when $|\psi^{(\cdot,\cdot)}\rangle$ and $\sigma$ are clear from the context. This concludes the third step, which can be performed in time $\mathcal{O}(\mathrm{nnz}(H)\,2^k)$.
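The single-qubit optimization in the third step has a closed form: maximizing $\sum_p\beta_{j,p}\langle\psi|p|\psi\rangle$ over pure states yields the $+1$ eigenstate of $\hat{n}\cdot\vec{\sigma}$ with $\hat{n}\propto(\beta_{j,X},\beta_{j,Y},\beta_{j,Z})$. A small Python sketch (our own notation):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def locally_optimal_qubit(beta):
    """beta = (beta_X, beta_Y, beta_Z); returns the pure state maximizing sum_p beta_p <p>."""
    n_hat = beta / np.linalg.norm(beta)
    vals, vecs = np.linalg.eigh(n_hat[0] * X + n_hat[1] * Y + n_hat[2] * Z)
    return vecs[:, np.argmax(vals)]        # top eigenstate achieves <n_hat . sigma> = +1

psi = locally_optimal_qubit(np.array([0.3, -0.4, 1.2]))
```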
The fourth step performs a polynomial optimization over the one-dimensional family, $\max_t\,\mathrm{tr}(H\rho(t))$, which is an optimization of a single-variable polynomial in $t$. To summarize the analysis: the second step is a random sampling that generates a single-qubit pure state $|\psi^{(s,j)}\rangle$ for each qubit $j$ and each copy $s\in\{1,\ldots,\kappa^*-1\}$; the third step is the most important part of the proof, and we devote Appendices A 4 a, A 4 b, and A 4 c to establishing the first inequality in Corollary 9 below. For the final step of the algorithm, using $\mathbb{E}_{|\psi\rangle}\,|\psi\rangle\langle\psi| = \rho(t^*; |\psi^{(s,j)}\rangle, \sigma_s)$ and convexity, the random output state attains the claimed improvement in expectation, and the theorem follows.

a. Polarization
We justify the definition of $\beta_{i,p}$ using polarization. Given an $n$-qubit homogeneous $k$-local observable $O = \sum_{P:|P|=k}\alpha_P P$, consider the following $nk$-qubit observable. First, we index the set $[nk]$ using ordered tuples $(s,i)$, where $s\in[k]$ and $i\in[n]$. For every Pauli operator $P$ on $n$ qubits with $|P|=k$, suppose that it acts nontrivially on qubits $i_1<\cdots<i_k$ via Pauli matrices $P_{i_1},\ldots,P_{i_k}$. Then, for any permutation $\pi\in S_k$, consider the $nk$-qubit observable $\mathrm{pol}_\pi(P)$ that acts on the $(\pi(s), i_s)$th qubit via $P_{i_s}$ for each $s\in[k]$ and acts trivially on all other qubits. The polarization is defined as the symmetrized average $\mathrm{pol}(P) = \frac{1}{k!}\sum_{\pi\in S_k}\mathrm{pol}_\pi(P)$, extended linearly to $\mathrm{pol}(O) = \sum_{P:|P|=k}\alpha_P\,\mathrm{pol}(P)$ (A33). We prove the following operator analogue of the classical polarization identity.
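As a concrete example of the construction (assuming the symmetrized normalization used above), take $k = 2$ and $P = Z_1 Z_2$ acting on qubits $i_1 = 1$ and $i_2 = 2$; the two permutations in $S_2$ give

```latex
\mathrm{pol}(Z_1 Z_2)
  = \tfrac{1}{2}\Big( Z_{(1,1)} \otimes Z_{(2,2)} + Z_{(2,1)} \otimes Z_{(1,2)} \Big),
```

where $Z_{(s,i)}$ denotes $Z$ acting on replica qubit $(s,i)$, with the identity on all other replica qubits. Setting both replica layers of each qubit to the same single-qubit state recovers the original expectation value, which is the defining property of polarization.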

Lemma 1 (Polarization identity). For any $nk$-qubit product state $\bigotimes_{s\in[k],\,i\in[n]}\rho_{s,i}$, any $n$-qubit homogeneous $k$-local observable $O$, and any $t\neq 0$, we have the identity
$$\mathrm{tr}\Big(\mathrm{pol}(O)\bigotimes_{s,i}\rho_{s,i}\Big) = \frac{1}{t^k\,k!}\;\mathbb{E}_\sigma\Bigg[\sigma_1\cdots\sigma_k\,\mathrm{tr}\Bigg(O\bigotimes_{i=1}^n\Big(\tfrac{I}{2}+t\sum_{s=1}^k\sigma_s\big(\rho_{s,i}-\tfrac{I}{2}\big)\Big)\Bigg)\Bigg],$$
where the expectation is with respect to the uniform measure on $\{\pm 1\}^k$.
Proof. Let $O = \sum_{P:|P|=k}\alpha_P P$. By the multinomial theorem, we can expand the right-hand side into a sum over Pauli operators $P$ and tuples $(s_1,\ldots,s_n)$ of products of factors of the form $\mathrm{tr}\big(P_i(\rho_{s_i,i}-I/2)\big)$. For a given Pauli operator $P$, the only nonzero terms in the inner summation are given by $(s_1,\ldots,s_n)$ satisfying the condition that if $s_i>0$ then $P$ acts nontrivially on the $i$th qubit, because otherwise $\mathrm{tr}(\rho_{s_i,i}-I/2) = 0$ and the corresponding summand vanishes. Furthermore, for $(s_1,\ldots,s_n)$ satisfying this property, if the elements of $\{1,\ldots,k\}$ do not each appear exactly once among the $s_i$, then $\sigma_1\cdots\sigma_k\prod_i\sigma_{s_i}$ is a nonconstant monomial in the signs, and the expectation of this term with respect to $\sigma$ vanishes. Altogether, we conclude that, for $P$ that acts via $P_1,\ldots,P_k$ on qubits $1\le i_1<\cdots<i_k\le n$ and via the identity elsewhere, the corresponding expectation over $\sigma$ in Eq. (A35) is given by $t^k\sum_{\pi\in S_k}\prod_{j=1}^k\mathrm{tr}\big(P_j\,\rho_{\pi(j),i_j}\big)$, from which the lemma follows.
Using the polarization identity, we can obtain the following corollary, which shows that $\beta_{i,p}$ is defined to be proportional to the expectation of the polarization $\mathrm{pol}(H_{\kappa^*,i,p})$ of the homogeneous $(\kappa^*-1)$-local observable $H_{\kappa^*,i,p}$ on the tensor product of $n(\kappa^*-1)$ single-qubit Haar-random states. We will later study the expectation value of the polarized observable on random product states.
Corollary 7. From the definitions given in Appendix A 3, $\beta_{i,p}$ is proportional to $\mathbb{E}\,\mathrm{tr}\big(\mathrm{pol}(H_{\kappa^*,i,p})\bigotimes_{s,j}|\psi^{(s,j)}\rangle\langle\psi^{(s,j)}|\big)$.

Proof. The claim follows from the polarization identity in Lemma 1 and the definition of $\beta_{i,p}$ in Eq. (A25).

b. Khintchine inequality for polarized observables
We recall the following basic result in high-dimensional probability.
Lemma 3 (Khintchine inequality for homogeneous 1-local observables). For a homogeneous 1-local observable, the average magnitude of its expectation value on a tensor product of single-qubit Haar-random states is at least a constant times the $\ell_2$ norm of its Pauli coefficients. Using the orthogonality of Pauli matrices, the $\ell_2$ norm of the Pauli coefficients is unchanged under any rotated Pauli basis. We utilize the rotated Pauli basis to establish the claimed results.
A single-qubit Haar-random pure state $|\psi_i\rangle$ can be sampled as follows. First, we sample a random single-qubit unitary $U_i$. Then, we consider $|\psi_i\rangle$ to be sampled uniformly from a set of eight pure states determined by $U_i$. Using this sampling formulation, the rotated Pauli basis representation for $O$, the standard Khintchine inequality given in Lemma 2, and Eq. (A42), we obtain the claimed result. We prove the left half of the Khintchine inequality for polarized observables. The right half can be shown using a similar proof, but we will only use the left half stated below.
Lemma 4 (Khintchine inequality for polarized observables). Let $n, k>0$. Consider an $nk$-qubit observable $O = \mathrm{pol}(O')$, which is the polarization of an $n$-qubit homogeneous $k$-local observable $O' = \sum_{P:|P|=k}\alpha_P P$, and let $|\psi\rangle = \bigotimes_{s,i}|\psi^{(s,i)}\rangle$, where each $|\psi^{(s,i)}\rangle$ is a single-qubit Haar-random pure state. We have $\mathbb{E}\,\big|\langle\psi|O|\psi\rangle\big| \ge c_k\big(\sum_P\alpha_P^2\big)^{1/2}$ for a constant $c_k>0$ depending only on $k$.

Proof. For $\ell\in[3n]$, define $P^{(\ell)}$ to be an $n$-qubit observable equal to the Pauli matrix $\sigma_{1+(\ell\bmod 3)}\in\{X,Y,Z\}$ acting on the $\lceil\ell/3\rceil$th qubit. From the definition of polarization, we can represent $O$ as a linear combination of polarized products of the $P^{(\ell)}$. For arbitrary coefficients $\alpha_{\ell_1,\ldots,\ell_k}\in\mathbb{R}$, we prove a lower bound of the same form by induction on $k$ [Eq. (A49)]. It is not hard to see that the left-hand side of Eq. (A49) specializes to the quantity in the lemma. Hence, the lemma follows from Eq. (A49).
We now prove the base case and the inductive step. The base case $k=1$ follows from the Khintchine inequality for homogeneous 1-local observables given in Lemma 3. Assume by the induction hypothesis that the claim holds for $k-1$. Denoting by $|\psi^{(k)}\rangle$ the tensor product of $n$ Haar-random single-qubit states forming the $k$th replica layer, we can apply the Khintchine inequality for homogeneous 1-local observables (Lemma 3) to this layer, and then apply Minkowski's integral inequality to the resulting upper bound. The last inequality treats the expectation over $|\psi^{(k)}\rangle$ as a scalar indexed by $\ell_1,\ldots,\ell_{k-1}$ and uses the induction hypothesis. We have thus established the inductive step, and the claim in Eq. (A49) follows.
The Khintchine inequality for polarized observables allows us to show that the average magnitude of $\mathrm{pol}(H_{\kappa^*,i,p})$ on the tensor product of single-qubit Haar-random states is at least as large as the Frobenius norm of $H_{\kappa^*,i,p}$, up to a constant depending on $\kappa^*$. Using the definitions from the design of the approximate optimization algorithm, we can obtain the following corollary.
Corollary 8. From the definitions given in Appendix A 3, the average magnitude of $\mathrm{pol}(H_{\kappa^*,i,p})$ on the tensor product of single-qubit Haar-random states is at least the Frobenius norm of $H_{\kappa^*,i,p}$, up to a constant depending only on $\kappa^*$.

Proof. The claim follows immediately from Lemma 4 and Eq. (A33).

c. Characterization of the locally optimized random state
Recall that $\rho(1; |\psi^{(\cdot,\cdot)}\rangle, \sigma)$ is created by sampling random product states and performing local single-qubit optimizations. The locally optimized random state satisfies the following inequality.
Lemma 5 (Characterization of $\rho(t)$ for $t=1$). From the definitions given in Appendix A 3, the energy improvement of $\rho(1)$ over the Haar-random average is lower-bounded in terms of the Frobenius norms of the sliced operators $H_{\kappa^*,i,p}$.

Proof. From the polarization identity given in Lemma 1 and the definition of $H_{\kappa^*,i,p}$ in Eq. (A24), we obtain Eqs. (A54) and (A55); this can be seen by considering the case when $H_{\kappa^*}$ is a single Pauli observable $P\in\{I,X,Y,Z\}^{\otimes n}$ with $|P|=\kappa^*$, and then extending linearly to any homogeneous $\kappa^*$-local Hamiltonian $H_{\kappa^*}$. From Corollary 7, we can rewrite the resulting right-hand side in terms of $\beta_{i,p}$. From the local optimization of $|\psi^{(\kappa^*,i)}\rangle$ given in Eq. (A26), we obtain a lower bound for every $i\in[n]$; applying Corollary 7 and then Corollary 8 turns this into a bound in terms of the Frobenius norms of the operators $H_{\kappa^*,i,p}$. The definition of $H_{\kappa^*,i,p}$ and the above inequalities establish the claim.
Given the expansion property, we use the following implication, which considers an arbitrary ordering π of the n qubits. The inequality allows us to control the growth of the number of Pauli observables that act on qubits preceding the ith qubit under the ordering π. The precise statement is given below.
Lemma 6 (A characterization of expansion).—Suppose that an n-qubit Hamiltonian H = Σ_P α_P P has expansion coefficient c_e and expansion dimension d_e. Consider any permutation π ∈ S_n over the n qubits. For any i ∈ [n], the bound in Eq. (A62) holds. Proof.—Consider a permutation π ∈ S_n over the n qubits and an i ∈ [n]. In the first case, the second inequality follows from the definition of the expansion coefficient c_e. For the second case, we consider all subsets appearing in Eq. (A64); the second inequality again follows from the definition of c_e.
Using the above implication of the expansion property, we can obtain the following inequality relating two norms. Roughly, the limit on the growth of the number of Pauli observables lets us turn a sum of ℓ2 norms into an ℓr norm, where r depends on the expansion dimension d_e.
Lemma 7 (Norm inequality using the expansion property).—Proof.—By going through the n qubits according to the permutation π, we have the identity in Eq. (A69). We can then use Lemma 6 to bound the ith factor by (c_e i^{d_e−1})^{1/(d_e+1)}, as in Eq. (A70). Using r − 1 = (d_e − 1)/(d_e + 1) ≥ 0, we can simplify the resulting sum, and the choice of π ensures Eq. (A67), which gives rise to the claim. From Lemma 5, we also obtain the corresponding identity directly from the orthonormality of the Pauli observables {I, X, Y, Z}^{⊗n}.

Proposition 1 (Frobenius norm).—Given any n-qubit Hermitian operator H = Σ_P α_P P, we have Σ_P α_P² ≤ ‖H‖²_∞. Proof.—Let λ_1, …, λ_{2^n} be the eigenvalues of H. From the fact that tr(PQ) = 2^n δ_{P=Q}, we have Σ_P α_P² = tr(H²)/2^n = (1/2^n) Σ_ℓ λ_ℓ² ≤ max_ℓ λ_ℓ² = ‖H‖²_∞.

We now utilize Theorem 5 to obtain the following useful norm inequality.

Theorem 9 (Norm inequality from Theorem 5).—Consider an n-qubit k-local Hamiltonian H with expansion coefficient c_e and expansion dimension d_e. Let r = 2d_e/(d_e + 1) ∈ [1, 2). We have the stated bound relating the ℓ_r norm of the Pauli coefficients to the spectral norm of H, where the constant follows from Theorem 5 and can be used to establish the claim.
Using Facts 1 and 2, which characterize the expansion property for general k-local Hamiltonians and for bounded-degree k-local Hamiltonians (i.e., each qubit is acted on by at most d of the k-qubit observables), we can establish the following corollaries.

Corollary 11 (Norm inequality for a k-local Hamiltonian).—Given an n-qubit k-local Hamiltonian H, we have the stated bound, with the same constant as in Corollary 5.

Corollary 12 (Norm inequality for a bounded-degree Hamiltonian).—Given an n-qubit k-local Hamiltonian H with bounded degree d, we have the stated bound, with the corresponding constant depending on k and d.
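As a quick numerical sanity check in the spirit of Corollary 12 (our own illustration; the model and constants below are not the paper's), one can compare the Pauli-1 norm of a bounded-degree chain Hamiltonian with its spectral norm and observe a ratio bounded by a system-size-independent constant.

```python
import numpy as np
from functools import reduce

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0 + 0j])
I2 = np.eye(2, dtype=complex)

def kron_all(ops):
    return reduce(np.kron, ops)

n = 6
terms = []  # (coefficient, {qubit: single-qubit Pauli}) for a bounded-degree chain
for i in range(n - 1):
    terms.append((1.0, {i: X, i + 1: X}))   # nearest-neighbor XX couplings
for i in range(n):
    terms.append((0.5, {i: Z}))             # on-site Z field

H = sum(c * kron_all([ops.get(q, I2) for q in range(n)]) for c, ops in terms)
pauli_1_norm = sum(abs(c) for c, _ in terms)         # sum_P |alpha_P|
spectral_norm = np.abs(np.linalg.eigvalsh(H)).max()  # ||H||_inf
print(pauli_1_norm, spectral_norm, pauli_1_norm / spectral_norm)
```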

APPENDIX C: SAMPLE-OPTIMAL ALGORITHMS FOR PREDICTING BOUNDED-DEGREE OBSERVABLES
In this appendix, we consider one of the most basic learning problems in quantum information theory: predicting properties of an unknown n-qubit state ρ. This problem has been studied extensively in the literature on shadow tomography [57,58] and classical shadows [46].

Review of classical shadow formalism
We recall the following definition and theorem from classical shadow tomography [46] based on randomized Pauli measurements. Each randomized Pauli measurement is performed on a single copy of ρ and measures each qubit of ρ in a random Pauli basis (X, Y, or Z).
Definition 4 (Shadow norm from randomized Pauli measurements).—Consider an n-qubit observable O. Let U be the distribution over tensor products of n single-qubit random Clifford unitaries, and let ‖O‖_shadow denote the associated shadow norm defined in Ref. [46].

Theorem 10 (Classical shadow tomography using randomized Pauli measurements [46]).—Given an unknown n-qubit state ρ and observables O_1, …, O_M, after N = O(log(M) max_i ‖O_i‖²_shadow / ε²) randomized Pauli measurements on copies of ρ, we can estimate tr(O_i ρ) to error ε for all i ∈ [M] with high probability.
We can see that the sample complexity for predicting many properties of an unknown quantum state ρ depends on the shadow norm ‖·‖_shadow. The larger ‖·‖_shadow is, the more experiments are needed to estimate the properties of ρ accurately. From the original classical shadow paper [46], we obtain the following shadow-norm bounds for Pauli observables and for few-body observables.
Lemma 9 (Shadow norm for Pauli observables [46]).—For any P ∈ {I, X, Y, Z}^{⊗n}, we have ‖P‖²_shadow ≤ 3^{|P|}.

Lemma 10 (Shadow norm for few-body observables [46]).—For any observable O that acts nontrivially on at most k qubits, we have ‖O‖²_shadow ≤ 4^k ‖O‖²_∞.

Combining the above lemmas and Theorem 10, we see that Pauli observables and few-body observables can both be predicted efficiently from a very small number of randomized Pauli measurements.
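To make the protocol concrete, here is a minimal statevector simulation (our own sketch, not the paper's code) of randomized Pauli measurements together with the standard classical-shadow estimator for Pauli expectation values, which rescales each matched qubit by a factor of 3 and discards mismatched snapshots.

```python
import numpy as np
from functools import reduce

# Basis-change unitaries: applying these and then measuring in the computational
# basis realizes an X-, Y-, or Z-basis measurement.
H_GATE = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
SDG = np.diag([1.0 + 0j, -1.0j])
ROT = {"X": H_GATE, "Y": H_GATE @ SDG, "Z": np.eye(2, dtype=complex)}

def randomized_pauli_snapshot(psi, n, rng):
    """One randomized Pauli measurement: random basis per qubit + outcome bits."""
    bases = rng.choice(list("XYZ"), size=n)
    u = reduce(np.kron, [ROT[b] for b in bases])
    probs = np.abs(u @ psi) ** 2
    idx = rng.choice(len(probs), p=probs / probs.sum())
    bits = [(idx >> (n - 1 - q)) & 1 for q in range(n)]  # qubit 0 is most significant
    return bases, bits

def estimate_pauli(snapshots, pauli):
    """Classical-shadow estimator for tr(P rho): factor 3 per matched qubit."""
    total = 0.0
    for bases, bits in snapshots:
        val = 1.0
        for q, p in enumerate(pauli):
            if p == "I":
                continue
            val = val * 3 * (1 - 2 * bits[q]) if bases[q] == p else 0.0
        total += val
    return total / len(snapshots)

rng = np.random.default_rng(1)
n = 4
psi = np.zeros(2 ** n, dtype=complex); psi[0] = 1.0      # |0000>, so <Z_i> = +1
snaps = [randomized_pauli_snapshot(psi, n, rng) for _ in range(3000)]
print(estimate_pauli(snaps, "ZIII"), estimate_pauli(snaps, "ZZII"))  # both ~1
```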

Upper bound for predicting bounded-degree observables
Consider an n-qubit observable O given as a sum of k-qubit observables, O = Σ_j O_j, where each qubit is acted on by at most d of the k-qubit observables O_j. We focus on k = O(1) and d = O(1), and refer to such an observable as a bounded-degree observable. Bounded-degree observables arise frequently in quantum many-body physics and quantum information; for example, the Hamiltonian of a quantum spin system can often be described by a geometrically local Hamiltonian, which is an instance of a bounded-degree observable. For these observables, the shadow norm is bounded in terms of the Pauli-1 norm of the observable [Eq. (C5)]. If we combine the norm inequality between the ℓ1 norm and the ℓ2 norm with the standard relation between the Frobenius norm and the spectral norm (Proposition 1), we obtain an upper bound on the shadow norm that, via Theorem 10, yields a number of measurements growing linearly with the number n of qubits. Because of this linear dependence on n, the scaling is not ideal; furthermore, we will later show that it is actually far from optimal.
To improve the sample complexity, we use the improved approximate optimization algorithm presented in Appendix A, and the corresponding norm inequality presented in Appendix B. Using the norm inequality relating the Pauli-1 norm and the spectral norm (Corollary 12), we can obtain the following shadow-norm bound.
Lemma 11 (Shadow norm for bounded-degree observables).—Given k, d = O(1) and an n-qubit observable O that is a sum of k-qubit observables, where each qubit is acted on by at most d of these k-qubit observables, we have ‖O‖²_shadow ≤ C ‖O‖²_∞ for some constant C > 0.
Combining the above lemma with Theorem 10 allows us to establish the following theorem. Compared to Eq. (C7), the following theorem uses n times fewer measurements.
Theorem 11 (Classical shadow tomography for bounded-degree observables).—Consider an unknown n-qubit state ρ and bounded-degree observables O_1, …, O_M, each given as a sum of few-body observables O_ij such that every qubit is acted on by a constant number of the O_ij. After N randomized Pauli measurements on copies of ρ, with N = O(log(min(M, n)) max_i ‖O_i‖²_∞ / ε²), we can estimate tr(O_i ρ) to error ε for all i ∈ [M] with high probability.
Proof.—The upper bound involving log(M) follows immediately from Theorem 10 and Lemma 11. We can also establish an upper bound involving log(n). To see this, consider the task of predicting all k-qubit Pauli observables P ∈ {I, X, Y, Z}^{⊗n} with |P| ≤ k; there are at most O(n^k) such Pauli observables. To predict all of them to error ε′ under the unknown state ρ, we can combine Theorem 10 and Lemma 9 to see that O(3^k k log(n)/ε′²) randomized Pauli measurements suffice. Now, any observable O_i = Σ_P α_P P that is a sum of few-body observables with bounded degree satisfies Σ_P |α_P| ≤ C ‖O_i‖_∞ for a constant C. Hence, by setting ε′ = ε/C, we can predict each O_i to error ε. This establishes the upper bound involving log(n). The claim follows by choosing the corresponding prediction algorithm (the standard classical shadow protocol when M < n, and the above algorithm when M ≥ n).

Optimality of Theorem 11
Here we prove the following lower bound on the sample complexity of shadow tomography for bounded-degree observables, demonstrating that Theorem 11 is optimal. The optimality holds even when we allow collective measurements on many copies of ρ. This is in stark contrast to other sets of observables, such as the collection of high-weight Pauli observables, for which single-copy measurement procedures (e.g., classical shadow tomography) require exponentially more copies than collective measurements.
Theorem 12 (Lower bound for predicting bounded-degree observables).—Consider the following task. There is an unknown n-qubit state ρ, and we are given bounded-degree observables O_1, …, O_M, each a sum of few-body observables O_ij such that every qubit is acted on by a constant number of the O_ij. We would like to estimate tr(O_i ρ) to error ε for all i ∈ [M] with high probability by performing arbitrary collective measurements on N copies of ρ. For any algorithm to succeed in this task, the number of copies must be at least N = Ω(log(min(M, n)) B²_∞ / ε²), where B_∞ denotes the scale max_i ‖O_i‖_∞ of the observables.
To prove Theorem 12, we first prove a lower bound for the following distinguishing task, from which the lower bound for shadow tomography follows readily. Given i ∈ [n], let P_i denote the n-qubit Pauli operator that acts as Z on the ith qubit and trivially elsewhere, and define the mixed state ρ_i accordingly. We show a lower bound for distinguishing whether ρ is maximally mixed or of the form ρ_i for some i.

Lemma 12 (Lower bound for a distinguishing task).—Let 0 ≤ ε ≤ 1. Let A be an algorithm that, given access to N copies of a mixed state ρ that is either the maximally mixed state or ρ_i for some i ∈ [min(M, n)], correctly determines whether or not ρ is maximally mixed with probability at least 3/4. Then N = Ω(log(min(M, n)) B²_∞ / ε²).

Proof of Theorem 12.—Let A be an algorithm that solves the task in Theorem 12 to error ε/3. We can use it to obtain an algorithm for the task in Lemma 12: applying A to the min(M, n) observables built from P_1, …, P_{min(M,n)}, we can produce ε/3-accurate estimates of tr(ρ P_j) for all j ∈ [min(M, n)]. Note that if ρ is maximally mixed, then tr(ρ P_j) = 0 for all j, whereas if ρ = ρ_i, then tr(ρ P_i) is of order ε. In particular, by checking whether there is a j for which the estimate of tr(ρ P_j) exceeds 2ε/3, we can determine whether ρ is maximally mixed or equal to some ρ_i. The lower bound in Lemma 12 thus implies the lower bound in Theorem 12.
For convenience, define n′ ≜ min(M, n). Note that, for any i ∈ [n′], (ρ_i)^{⊗N} is diagonal, so we can assume without loss of generality that A simply makes N independent measurements in the computational basis. Proving Lemma 12 thus amounts to proving a lower bound for a classical distribution-testing task.
Note that the distribution π_i over outcomes of a single computational-basis measurement of ρ_i places mass (1 + ε(−1)^{x_i})/2^n on each string x ∈ {0, 1}^n, whereas the distribution π over outcomes of a single computational-basis measurement of the maximally mixed state is uniform over all strings x ∈ {0, 1}^n. The following basic result in binary hypothesis testing reduces the proof of Lemma 12 to upper bounding the total variation distance in Eq. (C17).

Lemma 13 (Le Cam's two-point method [81]).—Let p_0, p_1 be distributions over a domain Ω with d_TV(p_0, p_1) < 1/3. Then there is no algorithm A mapping elements of Ω to {0, 1} that correctly identifies whether a sample came from p_0 or p_1 with probability at least 3/4.

Proof of Lemma 12.—To bound the expression in Eq. (C17), it suffices to bound the chi-squared divergence χ²(E_i[(π_i)^{⊗N}] ‖ π^{⊗N}), because, for any distributions p and q, we have d_TV(p, q) ≤ ½ √(χ²(p ‖ q)). For convenience, define the likelihood-ratio perturbation L_i(x) = 1 + ε(−1)^{x_i} and observe that, for any i, j, E_π[L_i L_j] = 1 + ε² δ_{ij}. Also, given strings x_1, …, x_N ∈ {0, 1}^n and S ⊆ [N], define the corresponding product of perturbations. We then have the standard calculation (see, e.g., [82, Lemma 22.1]) χ²(E_i[(π_i)^{⊗N}] ‖ π^{⊗N}) = E_{i,j}[(1 + ε² δ_{ij})^N] − 1 = ((1 + ε²)^N − 1)/n′. We conclude that, for N = c log(n′)/ε² with a sufficiently small constant c > 0, this quantity is less than 1/3. Applying Lemma 13 to p_0 = π^{⊗N} and p_1 = E_i[(π_i)^{⊗N}] yields the claimed lower bound.
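For concreteness, the closed form that this calculation arrives at, χ² = ((1 + ε²)^N − 1)/n′, can be evaluated directly; the following snippet (our own illustration) shows how the sample-size threshold N ≈ log(n′)/ε² emerges.

```python
def chi_squared_bound(n_prime: int, eps: float, N: int) -> float:
    """chi^2( E_i[pi_i^N] || pi^N ) = ((1 + eps^2)^N - 1) / n', from the
    likelihood-ratio second moment E_pi[L_i L_j] = 1 + eps^2 * 1[i == j]."""
    return ((1.0 + eps ** 2) ** N - 1.0) / n_prime

# With n' = 50 and eps = 0.05, the divergence crosses 1/3 only once N is on the
# order of log(n') / eps^2 ~ 1.6e3, matching the lower bound's scaling.
for N in (100, 500, 1000, 2000, 4000):
    print(N, chi_squared_bound(n_prime=50, eps=0.05, N=N))
```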

APPENDIX D: LEARNING TO PREDICT AN UNKNOWN OBSERVABLE
We begin with a definition of invariance for distributions over quantum states.
Definition 5 (Invariance under a unitary).—A probability distribution D over quantum states is invariant under a unitary U if the probability density remains unchanged under the action of U, i.e., the density assigned to UρU† equals the density assigned to ρ for any state ρ.
In this appendix, we utilize the norm inequalities of Appendix B to give a learning algorithm achieving the following guarantee. The learning algorithm can learn any unknown n-qubit observable O^(unk), even when the scale of O^(unk) is unknown. The mean squared error E_{ρ∼D} |h(ρ) − tr(O^(unk) ρ)|² scales quadratically with the scale of the unknown observable O^(unk). The sample complexity N has a quasipolynomial dependence on the inverse error, measured relative to the scale of O^(unk), and depends only logarithmically on the system size n and the failure probability δ.

Low-degree approximation under the mean squared error
In order to characterize the mean squared error, we need the following definition of a modified purity for quantum states.
Definition 6 (Nonidentity purity).—Given a k-qubit state ρ, the nonidentity purity γ*(ρ) is the contribution to the purity tr(ρ²) from the non-identity Pauli components of ρ. Nonidentity purity is bounded by purity: γ*(ρ) ≤ tr(ρ²) ≤ 1.

Lemma 14 (Mean squared error).—Given two n-qubit observables O = Σ_P α_P P and Õ = Σ_P α̃_P P, and a distribution D over quantum states that is invariant under single-qubit H and S gates, we have the stated identity for the mean squared error. Proof.—Let U_1, …, U_n be independent random single-qubit Clifford unitaries. Because D is invariant under single-qubit Hadamard and phase gates, D is invariant under any tensor product of single-qubit Clifford unitaries. This implies that the distribution of the random state ρ is the same as that of the random state (U_1 ⊗ ⋯ ⊗ U_n) ρ (U_1 ⊗ ⋯ ⊗ U_n)†. Using this fact, we expand the mean squared error. Using the unitary 2-design property of a random Clifford unitary and the identity SWAP = ½ Σ_{P∈{I,X,Y,Z}} P ⊗ P, we can evaluate the resulting averages and express the target quantity in terms of the Pauli coefficients. The claim then follows from Definition 6 of the nonidentity purity γ*.
The following lemma shows that the mean absolute error can be upper bounded by the root-mean-squared error. Hence, both the mean absolute error and the mean squared error are characterized by the ℓ2 distance between the Pauli coefficients (together with the average nonidentity purity). Because of this relation, we focus on the mean squared error throughout the text.

Lemma 15 (Mean absolute error).—Given two n-qubit observables and a distribution D over quantum states that is invariant under single-qubit H and S gates, the mean absolute error is bounded by the root-mean-squared error. Proof.—Jensen's inequality gives E|X| ≤ (E X²)^{1/2}; combining this with Lemma 14 yields the stated result.
From Lemma 14, for any observable O we can construct a low-degree approximation by removing all high-weight Pauli terms. The approximation error decays exponentially with the weight of the removed Pauli terms.
Corollary 13 (Low-degree approximation).—Suppose that we have an n-qubit observable O = Σ_{P∈{I,X,Y,Z}^{⊗n}} α_P P and a distribution D over quantum states that is invariant under single-qubit H and S gates. For k > 0, consider O^(k) = Σ_{P : |P|<k} α_P P. We have the stated approximation-error bound. Proof.—Using Lemma 14 and the fact that γ*(ρ) ≤ tr(ρ²) ≤ 1 for any state ρ, the error is controlled by the sum of α_P² over the high-weight Pauli terms, weighted by a factor decaying exponentially in the weight. The norm inequality given in Proposition 1 establishes the claim.
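Operationally, the low-degree approximation is a simple truncation of the Pauli-coefficient table. A minimal sketch (ours, with a toy coefficient dictionary):

```python
def low_degree_truncation(alpha: dict, k: int) -> dict:
    """Keep only Pauli terms of weight |P| < k, i.e., Corollary 13's O^(k)."""
    return {p: a for p, a in alpha.items() if sum(c != "I" for c in p) < k}

alpha = {"ZIII": 0.7, "XXII": 0.2, "XYZI": 0.05, "XYZZ": 0.01}
print(low_degree_truncation(alpha, k=3))  # keeps only the weight-1 and weight-2 terms
```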

Tools for extracting and filtering Pauli coefficients
In order to learn the low-degree approximation of an arbitrary observable O, we need to be able to extract the relevant coefficients α_P. Furthermore, we impose criteria for filtering out uninfluential Pauli observables P, preventing them from increasing the noise and driving up the prediction error. The extraction guarantee in Eq. (D17) follows from the definition of the nonidentity purity γ*. For each Pauli observable P ∈ {I, X, Y, Z}^{⊗n}, the quantity we can extract using the lemma is x_P = α_P β_P, where β_P is the weight factor defined in the next subsection.

b. Filtering the small-weight factor
The first filter sets the estimate α̂_P to zero when the average nonidentity purity E_{ρ∼D} γ*(ρ_{dom(P)}) is close to zero. We define the weight factor of a Pauli observable P to be β_P = (2/3)^{|P|} E_{ρ∼D} γ*(ρ_{dom(P)}). The weight factor β_P depends on the distribution D, which may be unknown; hence, we can only obtain an estimate β̂_P of β_P from the training data. Recall from Lemma 16 that we can likewise only obtain an estimate x̂_P of x_P = α_P β_P. The mean squared error (Lemma 14) shows that the contribution of the error in α̂_P is weighted by β_P. The presence of β_P in the mean squared error is very useful, since it counteracts the fact that we cannot estimate α̂_P accurately when β_P is close to zero. The following lemma shows that estimates of β_P and x_P suffice to perform the filtering and achieve a small mean squared error. Proof.—Consider first the case β ≤ 2ε̃; the stated bound then follows directly. For the second case, β > 2ε̃, we have β̂ > ε̃. Applying the triangle inequality splits the error into two terms: the first can be bounded as stated, and the second is bounded by the same expression. Using the monotonicity of the resulting expression in β̂ for β̂ > ε̃, the claim is established.
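In code, the first filter is a single guard. The following sketch (ours; the factor 2ε̃ matches the cutoff quoted in the proof above) forms α̂_P = x̂_P/β̂_P only when the weight factor is reliable:

```python
def filtered_alpha(x_hat: float, beta_hat: float, eps_tilde: float) -> float:
    """First filtering layer: invert the weight factor only when it is reliable.

    x_hat    : estimate of x_P = alpha_P * beta_P
    beta_hat : estimate of the weight factor beta_P
    """
    if beta_hat <= 2 * eps_tilde:
        return 0.0              # weight factor too small: alpha_P is unrecoverable
    return x_hat / beta_hat     # otherwise alpha_hat = x_hat / beta_hat
```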

c. Filtering uninfluential Pauli observables
Consider a set S ⊆ {I, X, Y, Z}^{⊗n} that contains the Pauli observables of interest; for example, we later take S to be the set of all few-body Pauli observables. Using the norm inequalities given in Appendix B, we can filter out additional coefficients α_P to achieve an improved mean squared error. Below is the filtering lemma, which combines the filtering of Pauli observables with a small weight factor (Lemma 17) and the filtering of those with a small contribution (as characterized by |x_P|/β_P^{1/2}).

Lemma 18 (Filtering lemma).—Suppose that ε̃, η > 0 and that we have a set S ⊆ {I, X, Y, Z}^{⊗n}. Consider α_P ∈ [−η, η], β_P ∈ [0, 1], and x_P = α_P β_P ∈ [−η, η] for all P ∈ S. Assume that there exist A > 0 and 1 ≤ r < 2 such that the ℓ_r norm of the coefficients (α_P)_{P∈S} is at most A. We then have the two statements of the lemma. Proof.—The set S_u contains all the unfiltered Pauli observables; we define S_f = S \ S_u, which contains all the filtered Pauli observables. We separate the contributions of S_u and S_f in the mean squared error Σ_{P∈S} β_P |α̂_P − α_P|², as in Eq. (D28). A key quantity in the analysis is the ratio β̂_P/β_P; the last inequality in Eq. (D31) uses the facts that β_P > ε̃ and β̂_P/β_P > 2. We are now ready to analyze the contributions of S_u and S_f. For the unfiltered Pauli observables (those in S_u), we can use Lemma 17 to bound each term; together, this gives the stated upper bound. For the filtered Pauli observables (those in S_f), there are two types to consider.
(1) For P ∈ S_f with β̂_P ≤ 2ε̃, the term β_P |α̂_P − α_P|² is controlled by the weight-factor filter (Lemma 17). (2) For the remaining P ∈ S_f, which are filtered because their estimated contribution |x̂_P|/β̂_P^{1/2} is small, the term is likewise small. Combining the contributions of S_u and S_f yields the first statement of the lemma. We now focus on the second statement. For a Pauli observable P satisfying the first or the third case of Eq. (D27), we can use Lemma 17 to obtain β_P |α̂_P − α_P|² ≤ 3η²ε̃ < 9η²ε̃. For the second case of Eq. (D27), we can use Eq. (D30) to obtain the analogous bound. Hence, the bound holds for all P ∈ S.
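Combining both layers, a sketch (ours) of the full filtering step follows; the exact cutoff used in the second guard is an assumption on our part, standing in for the thresholds of Eq. (D27), whose display did not survive extraction.

```python
def filter_coefficients(x_hat, beta_hat, eps_tilde, eta):
    """Two-layer filtering over a set S of candidate Pauli observables.

    x_hat, beta_hat : dicts mapping Pauli labels P in S to estimates of
                      x_P = alpha_P * beta_P and of beta_P, respectively.
    Returns the dict of surviving (unfiltered) coefficient estimates alpha_hat_P.
    """
    surviving = {}
    for p, x in x_hat.items():
        b = beta_hat[p]
        if b <= 2 * eps_tilde:
            continue                                    # layer 1: tiny weight factor
        if abs(x) / b ** 0.5 <= eta * eps_tilde ** 0.5:
            continue                                    # layer 2: negligible contribution
        surviving[p] = x / b                            # keep alpha_hat = x_hat / beta_hat
    return surviving
```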

Learning algorithm
In this section, we present a learning algorithm satisfying the guarantee given in Theorem 13. Consider the full training data {(ρ_ℓ, y_ℓ = tr(O^(unk) ρ_ℓ))}_{ℓ=1}^{N} of size N. The learning algorithm splits the full data into a smaller training set of size N_tr and a validation set of size N_val, with N = N_tr + N_val. The training set is used to extract Pauli coefficients and perform filtering with a hyperparameter η. An unbiased estimator for a quantity a is a random variable whose expectation value equals a. For example, an unbiased estimator for tr(Pρ_ℓ)² can be obtained by performing two quantum measurements of the observable P on two individual copies of ρ_ℓ and multiplying the results, or by utilizing the classical shadow formalism [46] and randomized measurements [47].
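Schematically, the train/validation pipeline looks as follows (our sketch of the algorithm's skeleton; extract, filt, and predict are callable stubs standing in for the estimators and filters described above):

```python
import numpy as np

def learn_observable(states, labels, etas, eps_tilde, extract, filt, predict):
    """Schematic train/validation pipeline for learning an unknown observable.

    states, labels : data (rho_l, y_l = tr(O_unk rho_l))
    etas           : hyperparameter grid, e.g., {2^0, ..., 2^R} times a scale estimate
    extract        : callable returning (x_hat, beta_hat) dicts from the training split
    filt           : callable implementing the two filtering layers
    predict        : callable evaluating h(rho; eta) from filtered coefficients
    """
    n_tr = len(labels) // 2                       # split into training and validation
    x_hat, beta_hat = extract(states[:n_tr], labels[:n_tr])
    best_eta, best_err = None, np.inf
    for eta in etas:                              # grid search over eta
        coeffs = filt(x_hat, beta_hat, eps_tilde, eta)
        errs = [(predict(coeffs, s) - y) ** 2
                for s, y in zip(states[n_tr:], labels[n_tr:])]
        if np.mean(errs) < best_err:              # keep eta* minimizing validation MSE
            best_eta, best_err = eta, float(np.mean(errs))
    return best_eta, best_err
```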

Rigorous performance guarantee
In this section, we prove that the learning algorithm presented in the last section satisfies Theorem 13. We give separate proofs for the sample-complexity bounds stated on the left and on the right of Eq. (D39).
The proof of the sample complexity stated on the left of Eq. (D39) consists of three parts: (1) a characterization of the prediction error; (2) the existence of a good hyperparameter η′ that achieves a small prediction error; and (3) a proof that the best hyperparameter η* found by a grid search over the validation set must also yield a small prediction error.
The proof for the sample complexity stated on the right of Eq. (D39) is simpler and is given at the end.

a. Characterization of the prediction error
We begin with a lemma about the sample maximum.

Lemma 19 (Sample maximum).—Let 1 > ε, δ > 0. Consider an arbitrary real-valued random variable X. Let X_1, …, X_N be N independent samples of X with N = ⌈log(1/δ)/ε⌉, and consider the sample maximum max_i X_i. Then the sample maximum is at least the (1 − ε)-quantile of X with probability at least 1 − δ.
Proof.—Recall that the cumulative distribution function is defined as F(θ) = Pr[X ≤ θ]. We define the approximate maximum as inf{θ : F(θ) ≥ 1 − ε}, as in Eq. (D56). Using the right continuity of F, the probability that all N samples fall below this threshold is at most (1 − ε)^N ≤ e^{−εN} ≤ δ. Thus, with probability at least 1 − δ, the sample maximum is at least the approximate maximum, and the lemma follows from the monotonicity of F(θ).

Applying this lemma to the training data, with probability at least 1 − δ/3 we obtain a suitable estimate of the scale of the unknown observable. Using Lemma 14 on the mean squared error and Corollary 13 on the low-degree approximation, we then have, with probability at least 1 − δ/3, a bound on the prediction error. Let us define the auxiliary variables in Eq. (D66) for all P ∈ S. Then, with probability at least 1 − δ/3 over the sampling of the training set, we have the characterization of the prediction error in Eq. (D67) for all η > 0. We utilize this form to show the existence of a good hyperparameter η′.
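Lemma 19 is easy to check numerically; a small sketch (ours) using a standard normal random variable:

```python
import numpy as np

def sample_maximum(draw, eps: float, delta: float, rng) -> float:
    """With N = ceil(log(1/delta)/eps) samples, the sample maximum is at least
    the (1 - eps)-quantile of X with probability >= 1 - delta (Lemma 19)."""
    n_samples = int(np.ceil(np.log(1.0 / delta) / eps))
    return max(draw(rng) for _ in range(n_samples))

rng = np.random.default_rng(0)
est = sample_maximum(lambda r: r.standard_normal(), eps=0.05, delta=0.01, rng=rng)
print(est)  # typically at or above the 95th percentile of N(0, 1), about 1.64
```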

b. Existence of a good hyperparameter η′
By choosing the training set size as in Eq. (D68), we can guarantee Eq. (D67) with probability at least 1 − δ/3. Furthermore, utilizing Hoeffding's inequality and the union bound, we can also guarantee Eq. (D69) with probability at least 1 − δ/3. The norm inequality given in Corollary 11 provides the required ℓ_r bound, with the constant given there. We now condition on the event that Eqs. (D67) and (D69) both hold, which happens with probability at least 1 − (2/3)δ. We are now ready to define the good hyperparameter η′: let η′ be the hyperparameter in the grid of Eq. (D42) defined by the condition stated there. We consider two cases separately. In the first case, the required bound on η′ is immediate; in the second case, where η′ lies strictly below the top of the grid, we obtain the stated bound on η′ directly. The filtering lemma (Lemma 18) then bounds the filtered mean squared error. In both cases, using the definitions r = 2k/(k + 1) and ε̃ = (ε/12)^{k+1} (C(k)/3)^{2k}, we obtain Eq. (D79). Combining with Eq. (D67), we conclude that η′ achieves a small prediction error with probability at least 1 − (2/3)δ.

c. The prediction performance of hyperparameter η*
From the definition of h(ρ; η), for any quantum state ρ, we have the bound in Eq. (D80). Using Hoeffding's inequality and the union bound, given a validation set of size as in Eq. (D81), with probability at least 1 − δ/3 we obtain accurate estimates of the validation error for every η in the grid of Eq. (D42). Using the definitions of η* and η′, it follows that, with probability at least 1 − δ/3 over the sampling of the validation set, η* performs essentially as well as η′. Combining with Eq. (D79) and employing the union bound, we obtain the claim in Eq. (D3) with probability at least 1 − δ. Finally, noting that |S| = O(n^k) and k = log_{1.5}(1/ε), and recalling the definition of ε̃ in Eq. (D41) on the right-hand side of Eq. (D68), we see that the training set size stated in Eq. (D86) suffices. Furthermore, noting that R = log₂(1/ε̃) = O(k log(1/ε) + k log₂ k) in Eq. (D81), the validation set size stated in Eq. (D87) suffices. Recall that the full data size is N = N_tr + N_val; the quantity in Eq. (D87) is dominated by that in Eq. (D86), yielding one argument of the minimum in the sample complexity claimed in Theorem 13.

d. Establishing sample complexity on the right of Eq. (D39)
By choosing the full dataset size as stated, Hoeffding's inequality and the union bound guarantee accurate estimates of all required quantities with probability at least 1 − δ. Using Lemma 17 on filtering the small-weight factor, we can bound the contribution of each Pauli coefficient. Using Lemma 14 on the mean squared error and Corollary 13 on the low-degree approximation, we then bound the overall prediction error. From the definition of ε̃ in Eq. (D53), this error is at most the target accuracy. The resulting sample complexity is the expression on the right of Eq. (D39), which completes the sample-complexity claim of Theorem 13.

APPENDIX E: LEARNING QUANTUM EVOLUTIONS FROM RANDOMIZED EXPERIMENTS
We recall the following definitions pertaining to classical shadows of quantum states and quantum evolutions, based on randomized Pauli measurements and random input states. In the following, we define the classical shadow of a quantum state based on randomized Pauli measurements; classical shadows could also be defined based on other randomized measurements [46].

Definition 9 (Classical shadow of a quantum state).—Let n, N > 0. Consider an n-qubit state ρ. A size-N classical shadow S_N(ρ) of the quantum state ρ is a random set given by the measurement outcomes {|ψ_ℓ⟩}_{ℓ=1}^{N}, where |ψ_ℓ⟩ = ⊗_{i=1}^{n} |s_{ℓ,i}⟩ is the outcome of the ℓth randomized Pauli measurement on a single copy of ρ.
We can generalize classical shadows from quantum states to quantum processes by considering random product input states and randomized Pauli measurements. A similar generalization has been studied in Ref. [33].
Definition 10 (Classical shadow of a quantum process).—Consider an n-qubit CPTP map E. A size-N classical shadow S_N(E) of the quantum evolution E is a random set of input-output pairs, where |ψ_ℓ^(in)⟩ = ⊗_{i=1}^{n} |s_{ℓ,i}^(in)⟩ is a random input state with each |s_{ℓ,i}^(in)⟩ ∈ stab_1 sampled uniformly, and |ψ_ℓ^(out)⟩ = ⊗_{i=1}^{n} |s_{ℓ,i}^(out)⟩ is the outcome of performing a randomized Pauli measurement on E(|ψ_ℓ^(in)⟩⟨ψ_ℓ^(in)|).
After obtaining the outcomes of N randomized experiments, we can design a learning algorithm that learns a model of the unknown CPTP map E such that, given an input state ρ and an observable O, the algorithm can predict tr(O E(ρ)). The rigorous guarantee is given in the following theorem.
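A single randomized experiment of Definition 10 can be sketched as follows (our illustration; channel and measure are stubs standing in for the unknown process E and the randomized Pauli measurement routine):

```python
import numpy as np
from functools import reduce

# The six single-qubit stabilizer states stab_1.
STAB1 = [np.array(v, dtype=complex) for v in (
    [1, 0], [0, 1],                                            # |0>, |1>
    [2 ** -0.5, 2 ** -0.5], [2 ** -0.5, -2 ** -0.5],           # |+>, |->
    [2 ** -0.5, 1j * 2 ** -0.5], [2 ** -0.5, -1j * 2 ** -0.5])]  # |y+>, |y->

def process_shadow_sample(channel, measure, n, rng):
    """One randomized experiment for S_N(E): random stabilizer product input,
    then a randomized Pauli measurement on the output state."""
    in_labels = rng.integers(0, 6, size=n)                # uniform over stab_1 per qubit
    psi_in = reduce(np.kron, [STAB1[s] for s in in_labels])
    rho_out = channel(np.outer(psi_in, psi_in.conj()))    # apply the CPTP map E
    bases, bits = measure(rho_out, n, rng)                # randomized Pauli measurement
    return in_labels, bases, bits
```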

Learning algorithm
Recall that a size-N classical shadow S_N(E) of the CPTP map E is a set of input-output pairs of the form given in Definition 10.

Rigorous performance guarantee
In this section, we prove that the learning algorithm presented in the last section satisfies Theorem 14.The proof uses the tools presented in Appendix D 2 and is similar to the proof of Theorem 13.

a. Definitions
For a given observable that is a sum of κ-qubit observables, where κ = O(1) and each qubit is acted on by a constant number of them, we define the relevant quantities in analogy with Appendix D. We also define the standard n-qubit input state distribution D_0 to be the uniform distribution over tensor products of n single-qubit stabilizer states. A useful property of D_0 is that, for any state ρ in the support of D_0, the nonidentity purity of a subsystem A of size L takes a fixed value depending only on L. The remainder of the argument parallels the proof of Theorem 13, which completes the proof of Theorem 14.

APPENDIX F: NUMERICAL DETAILS
In the numerical experiments, we consider two classes of Hamiltonians: an XY model [Eq. (F1)] and an Ising model [Eq. (F2)], both defined on an n-spin open chain with a Z field.
Here h_i = 0.5 for the homogeneous Z field, and h_i is sampled uniformly at random from [−5, 5] for the disordered Z field. We solve for the time-evolved properties by using the Jordan-Wigner transform to map the spin chains to a free-fermion model, and the technique described in Ref. [83] to solve the free-fermion model. We consider the training set to be a collection of N random product states |ψ_ℓ⟩, ℓ = 1, …, N, and their associated measured properties y_ℓ, corresponding to measuring an observable O after evolving under U(t) = exp(−itH). The measured properties are averaged over 500 measurements; hence, y_ℓ is a noisy estimate of the true expectation value tr(O U(t)|ψ_ℓ⟩⟨ψ_ℓ|U(t)†). We consider essentially the same ML algorithm as described in Sec. III A, but utilize a more sophisticated approach to enforce sparsity in α̂_P.
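The 500-shot averaging can be mimicked as follows for a ±1-valued Pauli observable (our sketch; binomial sampling around the exact expectation, not the paper's free-fermion code):

```python
import numpy as np

def noisy_label(exact_value: float, shots: int, rng) -> float:
    """Simulate y_l: average of `shots` single-shot +/-1 measurements of a
    Pauli observable whose exact expectation value is `exact_value`."""
    p_plus = (1.0 + exact_value) / 2.0
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1.0 - p_plus])
    return float(outcomes.mean())

rng = np.random.default_rng(0)
print(noisy_label(0.3, shots=500, rng=rng))  # ~0.3 with std ~ 1/sqrt(500)
```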

FIG. 1. Learning to predict an arbitrary unknown quantum process E. Consider an unknown quantum process E with arbitrarily high complexity, and a classical dataset obtained by evolving random product states under E and performing randomized Pauli measurements on the output states. We give an algorithm that can learn a low-complexity model for predicting the local properties of the output states given the local properties of the input states.
Recall that a randomized Pauli measurement measures each qubit of a state in a random Pauli basis (X, Y, or Z) and produces a measurement outcome |ψ^(out)⟩ = ⊗_{i=1}^{n} |s_i^(out)⟩, where |s_i^(out)⟩ ∈ stab_1 ≜ {|0⟩, |1⟩, |+⟩, |−⟩, |y+⟩, |y−⟩}. We denote the classical dataset of size N accordingly. The computation of x̂_P(O) and α̂_P(O) can be done classically. The basic idea behind α̂_P(O) is to set the coefficient 3^{|P|} x̂_P(O) to zero when the influence of the Pauli observable P is negligible. Given an n-qubit state ρ, the algorithm outputs h(ρ, O) = Σ_{P : |P|≤k} α̂_P(O) tr(Pρ).

FIG. 2. Prediction performance of ML models for learning E(ρ) = e^{−itH} ρ e^{itH} for large time t. (a) Hamiltonians. We consider an XY or Ising model with a homogeneous or disordered Z field on an n-spin open chain. (b) Error scaling with training set size (N). We show the root-mean-square error (RMSE) for predicting the Pauli-Z operator Z_i on the output state E(ρ) for random product states ρ. (c),(d) Error scaling with evolution time (t) and system size (n). Panel (d) shows the RMSE for the XY model with a homogeneous Z field. The prediction error remains similar as we exponentially increase t and the Hilbert space dimension 2^n.
Given an n-qubit Hamiltonian H = Σ_P α_P P, we say that H has expansion coefficient c_e and expansion dimension d_e if, for any ϒ ⊆ {1, …, n} with |ϒ| = d_e, Σ_{P∈{I,X,Y,Z}^{⊗n}} 1[α_P ≠ 0 and (ϒ ⊆ dom(P) or dom(P) ⊆ ϒ)] ≤ c_e, (A2) where dom(P) is the set of qubits on which P acts nontrivially. The expansion property captures the connectivity of the Hamiltonian. We give examples—the general k-local Hamiltonian, the bounded-degree k-local Hamiltonian, and the geometrically local Hamiltonian—to provide more intuition about the expansion property.

Fact 1 (Expansion property for a general k-local Hamiltonian).—Any Hamiltonian given by a sum of k-qubit observables has expansion coefficient 4^k and expansion dimension k. Proof.—Let H = Σ_P α_P P. All Pauli observables P with nonzero α_P act on at most k qubits. For any ϒ with |ϒ| = k, all the contributing Pauli observables with nonzero α_P must have a domain contained in ϒ; there are at most 4^k such Pauli observables. Hence, the claim follows.

Fact 2 (Expansion property for a bounded-degree k-local Hamiltonian).—Any Hamiltonian given by a sum of k-qubit observables, H = Σ_j h_j, where each qubit is acted on by at most d of the k-qubit observables h_j, has expansion coefficient c_e = 4^k d and expansion dimension d_e = 1. Proof.—For every ϒ with |ϒ| = 1, we have ϒ = {i} for some qubit i. For each qubit i (corresponding to ϒ = {i}), at most d of the k-qubit observables act on i, and each of them can be expanded into at most 4^k Pauli terms. Hence, we can set c_e = 4^k d and d_e = 1.

Fact 3 (Expansion property for a geometrically local Hamiltonian).—Any Hamiltonian given by a sum of geometrically local observables has expansion coefficient c_e = O(1) and expansion dimension 1. Proof.—For a geometrically local Hamiltonian H = Σ_P α_P P, each qubit i is acted on by at most a constant number c_i = O(1) of the Pauli observables P with nonzero α_P. Hence, for any qubit i (taking ϒ = {i}), Σ_{P∈{I,X,Y,Z}^{⊗n}} 1[α_P ≠ 0 and (ϒ ⊆ dom(P) or dom(P) ⊆ ϒ)] = c_i. Thus, we can set d_e = 1 and c_e = max_i c_i = O(1).
Given an n-qubit Hamiltonian H = Σ_{P : |P|≤k} α_P P, there is a randomized algorithm that runs in time O(n^k) and produces a random product state |ψ⟩ with the stated guarantee. The third step follows from k Σ_{P : |P|=κ*} |α_P|^r ≥ Σ_{κ=1}^{k} Σ_{P : |P|=κ} |α_P|^r = Σ_{P≠I} |α_P|^r. For the fourth step, the analysis of polynomial optimization given in Appendix A 4 d (Corollary 10) can be combined with the above inequality to obtain the claim for all s ∈ [k]. Then define pol(P) := (1/k!) Σ_{π∈S_k} pol_π(P). (A32) We extend pol(·) linearly and define pol(O) ≜ Σ_P α_P pol(P); we refer to pol(O) as the polarization of O. The squared Frobenius norms of O and pol(O) are related by tr(O²) = k! tr(pol(O)²).
FIG. 4. Visualization of the ML model's prediction for a highly entangled initial state ρ = |ψ⟩⟨ψ|. We consider the expected value of Z_i(t) = e^{itH} Z_i e^{−itH}, where H corresponds to the 1D 50-spin XY chain with a homogeneous Z field. The initial state |ψ⟩ has GHZ-like entanglement over the left half of the chain and is a product state with spins rotating clockwise over the right half. To prepare |ψ⟩ with 1D circuits, a depth of at least Ω(n) is required. Even though the ML model is trained only on random product states (a total of N = 10 000), it still predicts accurately for the highly entangled state over a wide range of evolution times t.
Theorem 5 (Random product states for optimizing a k-local Hamiltonian).—Consider an n-qubit k-local Hamiltonian H = Σ_{P : |P|≤k} α_P P with expansion coefficient c_e and expansion dimension d_e. Let r = 2d_e/(d_e + 1) ∈ [1, 2) and nnz(H) ≜ |{P : α_P ≠ 0}|. There is a randomized algorithm that runs in time O(nk + nnz(H) 2^k) and produces a random product state |ψ⟩ = |ψ_1⟩ ⊗ ⋯ ⊗ |ψ_n⟩ achieving the stated energy guarantee.

Theorem 6 (Random product states for optimizing a k-local Hamiltonian; alternative).—Consider an n-qubit k-local Hamiltonian H = Σ_{P : |P|≤k} α_P P with k = O(1). Let nnz(H) ≜ |{P : α_P ≠ 0}|. There is a randomized algorithm that runs in time O(nk + nnz(H) 2^k) and produces a random state |ψ⟩ = |ψ_1⟩ ⊗ ⋯ ⊗ |ψ_n⟩ with the corresponding guarantee. Furthermore, for an n-qubit k-local Hamiltonian H = Σ_{P : |P|≤k} α_P P with bounded degree d, |α_P| ≤ 1 for all P, and k = O(1), there is a randomized algorithm that runs in time O(nd) and produces either a random maximizing state |ψ⟩ = |ψ_1⟩ ⊗ ⋯ ⊗ |ψ_n⟩ satisfying the stated lower bound on E_{|ψ⟩}[⟨ψ|H|ψ⟩], or a random minimizing state satisfying the analogous upper bound, improving over Ref. [44]. As another example, consider a 1D n-qubit nearest-neighbor chain (hence d = 2), where |α_P| = 1 for only a constant number of Pauli observables P and |α_P| = 1/√n for the rest. The improvements over the Haar-random state achieved by our algorithm and by the algorithm of Ref. [44] are given in Eq. (A19); our algorithm gives a larger improvement in the scaling with the degree d.
The function f(t) = tr(H ρ(t)) is a polynomial of degree at most k. We can compute f(t) efficiently, in time O(nnz(H) k), because ρ(t) is a product state. The optimization can thus be performed efficiently by sweeping through all possible values of t on a sufficiently fine grid. Let t* denote the optimal t.

Definition 2 (Homogeneous k-local).—A Hermitian operator H is homogeneous k-local if H = Σ_{P : |P|=k} α_P P.
Consider an n-qubit Hamiltonian H = Σ_P α_P P with expansion coefficient c_e and expansion dimension d_e, and let r = 2d_e/(d_e + 1). For any κ* ≥ 1, we have the stated bound on Σ_{i∈[n]} Σ_{P : |P|=κ*, P_i ≠ I} α_P². The permutation π can be obtained by sorting the n qubits so that Σ_{P : |P|=κ*, P_{π(i)} ≠ I} α_P² is nonincreasing in i. This ensures that, for all i ∈ [n], i · Σ_{P : |P|=κ*, P_{π(i)} ≠ I} α_P² ≤ Σ_{j∈[n]} Σ_{P : |P|=κ*, P_{π(j)} ≠ I} α_P².

The extracted quantity involves the factor 3^{|P|} E_{ρ∼D} γ*(ρ_{dom(P)}); consequently, the error in the estimate α̂_P could be arbitrarily large if (2/3)^{|P|} E_{ρ∼D} γ*(ρ_{dom(P)}) is close to zero. Hence, we present a filter in Appendix D 2 b to handle this issue. In addition to this filter, the norm inequalities given in Appendix B show that most of the α_P are close to zero; hence, when α_P is small, we can simply set it to zero to avoid noise build-up. This gives rise to the second filtering layer given in Appendix D 2 c.