Tackling Sampling Noise in Physical Systems for Machine Learning Applications: Fundamental Limits and Eigentasks



I. INTRODUCTION
Learning with quantum systems is a promising application of near-term quantum processors, with several recent demonstrations in both quantum machine learning (QML) [1][2][3][4][5][6] and quantum sensing [7][8][9]. A broad class of such data-driven applications proceed by embedding data into the evolution of a quantum system, where the embedding, dynamics, and extracted outputs via measurement are all governed by a set of general parameters θ [10][11][12]. Depending on the learning scheme, different components of θ in this general framework may be trained for optimal performance of a given task. In all cases the fundamental role of the quantum system is that of a high-dimensional feature generator: given inputs u, a set of frequencies for the occurrence of different measurement outcomes acts as a parameterized feature vector implementing a function f(u) that minimizes a chosen loss function (see Fig. 1). The relationship between the physical structure of the model and the function classes that can be expressed with high accuracy is a fundamental question of basic importance to the success of quantum models. Recent results have begun to shed light on this important question and provide guidance on the choice of parameterized quantum models [13][14][15][16][17][18][19][20][21]. Yet when it comes to experimental implementations, the presence of noise is found to substantially curtail theoretical expectations for performance [1][2][3].
Given an input u to a general dynamical system, we define its Expressive Capacity (EC) as a measure of the accuracy with which K linearly independent functions {f(u)} of the input can be constructed from K measured features. This is a suitable generalization of the Information Processing Capacity introduced in Ref. [22] to noisy systems, and a direct quantification of the information content of these measured features. A central challenge in determining the EC for quantum systems is the fundamentally stochastic nature of measurement outcomes. Even when technical noise due to system parameter fluctuations is minimized, as in an error-corrected quantum computer, there remains a fundamental level of noise, the quantum sampling noise (QSN), which cannot be eliminated in learning with quantum systems. Quantum sampling noise therefore sets a fundamental limit to the EC of any physical system. Although QSN is well understood theoretically, a formulation of its impact on learning is a challenging task, as it is strongly determined by the quantum state of the system relative to the measurement basis, and is highly correlated when quantum coupling is present. Consequently, the impact of QSN is often ignored [10-12, 23, 24] (with a few exceptions [13, 25-27]), even though it can place strong constraints on practical optimization [25] and performance [26].

* These two authors contributed equally.
In this article, we develop a mathematical framework to quantify the EC that exactly accounts for the structure of QSN, providing a tight bound for a quantum system with K measurement outcomes under S samples, and we illustrate how this framework can guide experimental design for QML applications. While the strength of the EC lies in its generality, we provide numerical examples and experimental results confirming that higher EC is typically indicative of improved performance on specific tasks. As such, the EC provides a metric to guide ansatz design for improved learning performance in a task-agnostic and parameter-independent manner. Specifically, our work identifies enhancement in measurable quantum correlations as a general principle to increase the EC of quantum systems under finite sampling.

FIG. 1. (a) Representation of the learning framework considered in this work: inputs u are transformed to a set of outputs via a parameterized feature generator, here implemented using a finitely-sampled quantum system as shown in (b). Inputs are encoded in the state of a quantum system via a general quantum channel U, and information is then extracted through a positive operator-valued measure. This framework describes a wide range of practical quantum systems, from quantum circuits used in supervised or generative quantum machine learning, to quantum annealers exhibiting continuous evolution, and beyond, all defined by a general quantum channel with parameters θ. Extracted information takes the form of K stochastic features X̄ obtained under finite shots S. The geometric structure of the distributions of these measured features is fundamentally determined by quantum sampling noise, which depends on the quantum state ρ(u; θ), and hence on the nature of the mapping from input u to this quantum state. We show four obtained distributions differing only in the values of inputs u to highlight this dependence. As shown in (a), learned estimates for desired functions are then constructed via a linear combination W of X̄, with a resolution limited by S. The capacity C[f] then quantifies the error in the approximation of a target function f via this scheme.
Our work goes beyond simply defining the EC as a figure of merit for parameterized quantum systems, however. In particular, we offer a reliable methodology to identify the native function set that is most accurately resolvable by a given parameterized encoding under finite sampling and QSN. Equivalently, we show that this defines a construction of measured features spanning the accessible information which is optimally robust to noise in readout, thereby furnishing a critical tool employable in any QML framework to improve learning in experimental settings. This strategy for defining the noise-constrained EC naturally focuses on accessible noisy output features under a specified measurement scheme, as opposed to unmeasured degrees of freedom. This makes the EC an efficiently-computable quantity in practice, as we demonstrate using both numerical simulations and experiments on IBM Quantum's superconducting multi-qubit processors [28].

II. THEORETICAL ANALYSIS

A. Quantum Sampling Noise and Learning
The most general approach to learning from data using a generic quantum system is depicted schematically in Fig. 1. A table with symbols and abbreviations used in the text can be found in Appendix A. Any quantum learning scheme begins with embedding the data u through a quantum channel parameterized by θ acting on a known initial state,

ρ(u; θ) = U(u; θ)[ρ̂₀].  (1)

This channel includes all quantum operations applied to the input data; to obtain the computational output or perform further classical processing, one must extract information from the quantum system via a set of measurements, described most generally as a positive operator-valued measure (POVM). Specifically, we define a set of K POVM elements {M̂ₖ}, each associated with a distinct measurement outcome indexed by k, and constrained only by the normalization condition Σ_{k=0}^{K−1} M̂ₖ = Î (and hence not necessarily commuting). A single measurement or "shot" then yields a discrete index k⁽ˢ⁾(u) specifying the observed outcome: for input u, if outcome k is observed in shot s, then k⁽ˢ⁾(u) ← k. Measured features are then constructed by ensemble-averaging over S repeated shots:

X̄ₖ(u) = (1/S) Σ_{s=1}^{S} δ[k⁽ˢ⁾(u), k].  (2)

Hence X̄ₖ(u) is the empirical frequency of occurrence of outcome k in S repetitions of the experiment with the same input u. These measured features are formally random variables that are unbiased estimators of the expected value of the corresponding element M̂ₖ as computed from ρ(u). Explicitly,

xₖ(u) ≡ E[X̄ₖ(u)] = Tr[M̂ₖ ρ(u; θ)],  (3)

so that xₖ is the probability of occurrence of the kth outcome as specified by the quantum state. These probabilities encompass the accessible information in ρ(u; θ): any observable under this measurement can be written as a linear combination of POVM elements. In QML theory, it is standard to consider the limit S → ∞, and to thus use expected features {xₖ(u)} for learning. In any actual implementation, however, measured features {X̄ₖ(u)} must be constructed under finite S, in which case their fundamentally quantum-stochastic nature can no
longer be ignored. More precisely, X̄ are samples from a multinomial distribution with S trials and K categories, which can be decomposed into their expected value, the quantum-mechanical event probabilities x, and a zero-mean, input-dependent noise term ζ(u):

X̄(u) = x(u) + (1/√S) ζ(u).  (4)

Here ζ encodes the multinomial statistics of QSN; it has nonzero cumulants of all orders, of which the covariances take the particular S-independent form

Σ_jk(u) = Cov[ζ_j(u), ζ_k(u)] = x_j(u) δ_jk − x_j(u) x_k(u),  (5)

or more concisely Σ = diag(x) − x xᵀ. We note that the above expressions are exact; the factor of 1/√S is merely extracted for convenience of the analysis to follow, and in particular is not meant to suggest an expansion for large S; cumulants of ζ beyond second order inherit a complicated S-dependence [29].
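As a concrete check of this decomposition, the multinomial model can be simulated directly. The sketch below (pure NumPy; the probability vector x is an arbitrary illustrative choice, not from any quantum system) draws measured features for many repetitions, extracts the noise term ζ of Eq. (4), and compares its covariance against the closed form Σ = diag(x) − x xᵀ:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical outcome probabilities x_k for a K = 4 outcome POVM at one input u
x = np.array([0.4, 0.3, 0.2, 0.1])
S, trials = 100, 200_000

# Each row is one realization of the measured features X_k: empirical
# frequencies of outcomes over S shots (a multinomial sample divided by S)
X = rng.multinomial(S, x, size=trials) / S

# Extract the noise term zeta of Eq. (4): X = x + zeta / sqrt(S)
zeta = np.sqrt(S) * (X - x)

# zeta is zero-mean, and its covariance takes the S-independent multinomial
# form Sigma = diag(x) - x x^T of Eq. (5)
Sigma_empirical = np.cov(zeta, rowvar=False)
Sigma_theory = np.diag(x) - np.outer(x, x)
```

Note that the frequencies in each realization sum to one exactly, which is the origin of the perfectly correlated (rank-deficient) structure of Σ.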
Before developing our capacity analysis, we note that when viewed in isolation, Eq. (4) defines an extremely general map between inputs u and outputs X̄(u) assembled from S measurements. The readout features it describes could therefore have been extracted from any dynamical system with stochastic outputs, including systems that are entirely classical. This is no restriction: Eq. (4) also applies to a very broad class of quantum systems, since ultimately measurement outcomes from quantum systems are also recorded by an observer as classical stochastic variables. Where, then, is the quantum nature of measured features apparent? It is encoded in the fact that for quantum systems all statistical properties of the stochastic readout features X̄(u), namely the first-order cumulants x(u), the second-order cumulants Σ(u), and all higher-order cumulants, are determined explicitly by the quantum state ρ(u), which itself may define a distribution that is hard to generate classically.
The framework we develop here allows characterization of the function-learning capacity of any noisy dynamical system satisfying Eq. (4), and provides a practical, experimentally applicable methodology to optimize learning by avoiding overfitting to noise in ML tasks. When applied to classical systems, it can be viewed as a means of statistical inference based on data containing classical noise [30]. However, our primary interest is the study of quantum systems, where the fundamental model for the noise process depends nontrivially on the quantum state, a dependence we account for exactly. This enables us to extract the limits on function-learning capacity set by QSN and its dependence on the encoding. In this paper, using both theoretical studies and experiments on real quantum devices, we analyze how this capacity depends on quantum properties such as the degree of measured correlations, and how our framework can be applied for optimal learning in practical QML tasks such as classification.

B. Expressive Capacity
Returning to the situation depicted in Fig. 1, QML and quantum sensing can generically be cast as encoding data in a parameterized quantum system, and then using measurement outcomes to approximate a desired function f(u), here assumed to be square-integrable with respect to the distribution p(u) of the input data, which can be continuous or discrete: E_u[f²] < ∞. Recalling that features X̄(u) are estimators of POVM expectation values, linear combinations of which can be used to construct all accessible observables, f(u) is approximated for finite S as f_W(u) = Wᵀ X̄(u). To quantify the fidelity of this approximation, we introduce the function capacity [13,22,24], which is simply the normalized mean-squared accuracy of the estimate f_W:

C[f] = 1 − min_W E[(f(u) − Wᵀ X̄(u))²] / E_u[f(u)²],  (6)

where the expectation in the numerator runs over both the input distribution p(u) and the QSN. Minimizing the error in the approximation of f(u) by f_W(u) over the input domain to determine the capacity thus requires finding the optimal weights W, which can always be expressed analytically via a pseudoinverse operation (see Appendix C). This function capacity is constructed such that 0 ≤ C[f] ≤ 1. The choice of a linear estimator and a mean-squared-error loss function may appear restrictive at first glance, but the generality of our formalism averts such limitations. The use of a linear estimator applied directly to readout features appears to preclude classical nonlinear post-processing of measurements; however, this is simply to ensure that the calculated function capacity is a measure of the parameterized quantum system itself, and not of a classical nonlinear layer. Furthermore, the mean-squared loss effectively describes the first term in a Taylor expansion of a wide range of arbitrary nonlinear post-processing and non-quadratic loss functions (see Appendix C 5).
To extend the notion of capacity to a task-independent metric representing how much classical information about an input can be extracted from a system in the presence of noise, we sum the function capacity over a basis of functions {f_ℓ}_{ℓ∈ℕ} that is complete and orthonormal with respect to the input distribution, i.e. equipped with the inner product ⟨f, g⟩ = E_u[f(u)g(u)]. The resulting Expressive Capacity C_T = Σ_{ℓ∈ℕ} C[f_ℓ] effectively quantifies how many linearly independent functions can be expressed from a linear combination of {X̄ₖ(u)}. Our main result, proven in detail in Appendix C 4, is that given any S ∈ ℕ⁺, the EC for a physical system whose measured features are stochastic variables of the form of Eq. (4) is given by

C_T = Tr[(G + V/S)⁻¹ G] = Σ_{k=0}^{K−1} 1/(1 + β_k²/S).  (7)

The first equality, arrived at through straightforward algebraic manipulation, is written in terms of the expected feature Gram and covariance matrices G ≡ E_u[x xᵀ] and V ≡ E_u[Σ], respectively. We later demonstrate that these expected quantities can be accurately estimated in experiment, and consequently under finite S (see Appendix D 1 and Eq. (D1)). The second equality remarkably provides a closed-form expression for C_T at any S that is independent of the generally infinite set {f_ℓ}_{ℓ∈ℕ} (and thus not subject to the numerical challenges associated with its evaluation over such a set [22]). Instead, the EC is entirely captured by the function capacity of K distinct functions, and for a given physical system is fully characterized by the spectrum of eigenvalues {β_k²}_{k∈[K]} satisfying the generalized eigenvalue problem

V r⁽ᵏ⁾ = β_k² G r⁽ᵏ⁾.  (8)

In the above, all quantities depend on θ, and thus on the specific physical system and input embedding, via the Gram (G) and covariance (V) matrices. Associated with each β_k² is an eigenvector r⁽ᵏ⁾ living in the space of measured features and thus defining a set of K orthogonal functions via the linear transformation

y⁽ᵏ⁾(u) = (r⁽ᵏ⁾)ᵀ x(u).  (9)

We refer to {y⁽ᵏ⁾} as eigentasks, as they form the minimal set of orthonormal functions (E_u[y⁽ʲ⁾ y⁽ᵏ⁾] = δ_jk) which completely accounts for the EC of a physical system and
thus the accessible information content present in its measured features. Specifically, the capacity to approximate a given y⁽ᵏ⁾ with S shots is C[y⁽ᵏ⁾] = 1/(1 + β_k²/S): the EC in Eq. (7) is simply a sum of eigentask capacities. This further highlights that a given parameterized system can only approximate a target function to the degree that it can be written as a linear combination of {y⁽ᵏ⁾}. The eigentasks thus serve as a powerful basis for learning, as shall be explored in Sec. III.
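The equivalence of the trace form and the eigenvalue sum described above can be verified numerically on a toy model. In the sketch below (illustrative only: the smooth probability functions standing in for x(u) are arbitrary choices, not derived from any quantum system), we build G and V from a synthetic feature table, obtain NSR eigenvalues from the generalized eigenproblem, and sum the eigentask capacities:

```python
import numpy as np

# Toy stand-in for expected features of a parameterized model: K = 4 outcome
# probabilities x(u) varying smoothly over N inputs u (arbitrary test functions)
u = np.linspace(-1.0, 1.0, 400)
logits = np.stack([np.sin(2 * u), np.cos(3 * u), u ** 2, 0.3 * u])
x = np.exp(logits) / np.exp(logits).sum(axis=0)   # shape (K, N); columns sum to 1

K, N = x.shape
G = (x @ x.T) / N                                 # Gram matrix G = E_u[x x^T]
V = np.mean([np.diag(c) - np.outer(c, c) for c in x.T], axis=0)  # V = E_u[Sigma(u)]

# NSR eigenvalues beta_k^2: generalized problem V r = beta^2 G r, i.e. eig(G^-1 V)
beta2 = np.sort(np.linalg.eigvals(np.linalg.solve(G, V)).real)

def expressive_capacity(S):
    """C_T(S) as the sum of eigentask capacities 1 / (1 + beta_k^2 / S)."""
    return float(np.sum(1.0 / (1.0 + beta2 / S)))
```

Because the features sum to one deterministically, V annihilates the all-ones vector, so the lowest NSR eigenvalue vanishes: the constant function is always learned noiselessly, and C_T approaches Rank{G} as S grows.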

Defining measured eigentasks ȳ⁽ᵏ⁾(u) = (r⁽ᵏ⁾)ᵀ X̄(u), we find (see Appendix C 3) that {r⁽ᵏ⁾} specify a unique linear transformation that simultaneously orthogonalizes not only the signal, but also the associated noise:

(r⁽ʲ⁾)ᵀ G r⁽ᵏ⁾ = δ_jk,  (r⁽ʲ⁾)ᵀ V r⁽ᵏ⁾ = β_k² δ_jk.  (10)

The term β_k²/S is thus the mean squared error, or noise power, associated with the approximation of eigentask y⁽ᵏ⁾; equivalently, ȳ⁽ᵏ⁾ has a signal-to-noise ratio of S/β_k². This leads to a natural interpretation of {β_k²} as noise-to-signal (NSR) eigenvalues. The eigentasks, ordered by increasing noise strength 0 ≤ β_0² ≤ β_1² ≤ ⋯ ≤ β_{K−1}², are the orthogonal set of functions maximally robust to noise.
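One standard way to obtain the eigenvectors r⁽ᵏ⁾ in practice is to reduce the generalized problem to an ordinary symmetric one using a Cholesky factor of G. The sketch below (with toy G and V built from arbitrary illustrative functions, as before) confirms the simultaneous orthogonalization of signal and noise:

```python
import numpy as np

# Toy Gram and covariance matrices from an arbitrary 4-outcome probability
# table x(u) whose columns sum to 1 (illustrative stand-in for a real device)
u = np.linspace(-1.0, 1.0, 300)
logits = np.stack([u, u ** 2, np.sin(3 * u), np.cos(u)])
x = np.exp(logits) / np.exp(logits).sum(axis=0)
K, N = x.shape
G = (x @ x.T) / N
V = np.mean([np.diag(c) - np.outer(c, c) for c in x.T], axis=0)

# Solve V r = beta^2 G r by Cholesky-reducing to a symmetric eigenproblem
L = np.linalg.cholesky(G)
A = np.linalg.solve(L, np.linalg.solve(L, V).T).T   # A = L^-1 V L^-T
beta2, Q = np.linalg.eigh((A + A.T) / 2.0)          # ascending NSR eigenvalues
R = np.linalg.solve(L.T, Q)                         # columns are eigenvectors r^(k)
```

By construction Rᵀ G R = I (orthonormal signal) and Rᵀ V R = diag(β²) (uncorrelated noise), which is exactly the simultaneous orthogonalization stated above.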
Having developed our framework for EC in the most general context, in the remainder of this paper we will use it to analyze quantum systems in particular. The same quantitative metrics (EC, eigentasks, and NSR eigenvalues) now carry the significance of being determined by an arbitrary parameterized quantum state ρ(u; θ), whose data-dependence can, ideally, be hard to model classically. Our formulation of EC hence encompasses general quantum states (to the best of our knowledge, the first of its kind), going beyond characterizations of noise-constrained capacity that have been attempted for linear classical systems [31] and Gaussian quantum systems [26]. The eigentasks then reveal the set of orthogonal functions best approximated by the quantum system, and hence are sensitive to properties such as the degree of quantum correlations. Finally, the fidelity of approximation of these native functions, determined by the NSR eigenvalues, is constrained fundamentally by QSN.
From Eq. (7) we have lim_{S→∞} C_T = Rank{G}, where Rank{G} = K, the number of measured features, provided no special symmetries exist (see Appendix C 6). This important result reveals that in the absence of noise all dynamical systems, independent of parameterization, have a capacity which is simply the number of independent accessible degrees of freedom [22,31]. The generic exponential scaling of measured degrees of freedom with quantum system size (e.g. K = 2^L for L-qubit systems subject to a computational basis measurement) is often cited as a motivation for performing ML with quantum systems [13,24,32]. However, as will be demonstrated shortly, the EC of quantum systems can be significantly reduced from this limit at finite S, in a way that strongly depends on the encoding. By evaluating the ability of quantum systems to accurately express functions in the presence of QSN, the capacity analysis above provides an important metric to assess the utility of quantum platforms for learning in practice.

C. Expressive Capacity of Quantum 2-designs
We first consider the EC of quantum 2-designs: systems with fixed θ that map inputs to a unitary ensemble {p(u)du, Û(u; θ)} whose first and second moments agree with those of a uniform (Haar) distribution of unitaries. Quantum 2-designs are important to recent QML studies [11,21] due to their role in defining "expressibility" [16,17]: a metric quantifying how close a parameterized quantum system is to such a 2-design. The capacity eigenproblem Eq. (8) for any quantum 2-design over K dimensions can be solved analytically (see Appendix G), yielding a flat spectrum of NSR eigenvalues β_k² = K(1 − δ_k0). This results in an EC which at finite S can be significantly lower than K. For qubit-based systems with K = 2^L, all k ≠ 0 eigentasks have a noise strength 2^L/S, requiring S to grow exponentially with qubit number L in order to extract useful features.
A quantum 2-design is thought of as having maximal "expressibility"; however, we see that the fraction of its noiseless EC that survives vanishes exponentially with system size for a fixed finite S. It is exactly such systems that have been shown to lead to barren plateaus which preclude learning [21]. To emphasize the distinction with "expressibility", we note that EC reflects how much classical information can be extracted from the entire "quantum computational stack" in practice: from an abstract algorithm, to the quantum hardware on which it is implemented, to the classical electronics used for control and readout. EC requires only noisy computational outputs {X̄ₖ(u)} and is thus efficiently computable in experiment, unlike more abstract metrics [7,16,17], yielding a directly relevant metric for learning with quantum hardware.
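Plugging the flat 2-design spectrum into the capacity sum gives a simple closed form. The sketch below evaluates it; the expression C_T = 1 + (K − 1)/(1 + K/S) follows directly from β_0² = 0 and β_k² = K for all k ≥ 1:

```python
import numpy as np

def ec_2design(L, S):
    """EC of an L-qubit (K = 2**L dimensional) quantum 2-design under S shots.

    Uses the flat NSR spectrum quoted above: beta_0^2 = 0, beta_k^2 = K (k >= 1),
    so that C_T = 1 + (K - 1) / (1 + K / S).
    """
    K = 2 ** L
    return 1.0 + (K - 1) / (1.0 + K / S)

# Fraction of the noiseless bound K realized at finite sampling
S = 2 ** 14
fractions = [ec_2design(L, S) / 2 ** L for L in (2, 6, 10, 14, 18)]
```

For K ≫ S the capacity saturates near 1 + S, so the realized fraction C_T/K decays exponentially in qubit number L, illustrating the exponential sampling cost stated above.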

III. EXPERIMENTAL RESULTS
To demonstrate the practical utility of our framework, we now show how the spectrum {β_k²}, the EC, and the eigentasks can all be computed for real quantum devices in the presence of parameter fluctuations and device noise. We reiterate at the outset that our approach for quantifying the EC of a quantum system is very general, and can be applied to a variety of quantum system models. For practical reasons, we perform experiments on L-qubit IBM Quantum (IBMQ) processors, whose dynamics is described by a parameterized quantum circuit containing single- and two-qubit gates. However, as an example of the broad applicability of our approach, in Appendix E we compute the EC for L-qubit quantum annealers via numerical simulations; these systems are governed by the markedly different model of continuous-time Hamiltonian dynamics.
On IBMQ devices, resource limitations restrict our computation of EC to 1D inputs u that are uniformly distributed, p(u) = Unif[−1, 1]; see Fig. 2(a). Specifically, we are limited to N = 300 distinct inputs; a 1D distribution then ensures features {X̄ₖ(u)} are sufficiently densely sampled to approach the continuum limit, and are also easy to visualize. We emphasize that this analysis can be straightforwardly extended to multi-dimensional and arbitrarily-distributed inputs given suitable hardware resources, without modifying the form of the Gram and covariance matrices.
Only now are we required to specify the model of the quantum system, and we choose an ansatz tailored to be natively implementable on IBMQ processors (see Appendix B). We fix ρ̂₀ = (|0⟩⟨0|)^⊗L; note, however, that any other initial state may be implemented via an additional unitary and absorbed into the "encoding", i.e. the quantum channel U(u; θ) of Eq. (1). In this way, the dependence of EC on initial states could be explored in future studies.
The circuit we choose consists of τ ∈ ℕ repetitions of the same input-dependent circuit block depicted in Fig. 2(a). The block itself is of the form R_x(θˣ/2) W(J) R_z(θᶻ + θᴵu) R_x(θˣ/2), where R_{x/z} are Pauli rotations applied qubit-wise, e.g. R_z = ⊗_l R_z(θᶻ_l + θᴵ_l u). A two-qubit coupling gate acts between physically connected qubits in the device and can be written as

W(J) = Π_{⟨l,l′⟩} exp(−i (J/2) σ̂_z^l σ̂_z^{l′}).

Within the structure of this ansatz, we will choose all single-qubit rotation parameters randomly, generally representing a circuit trained for a particular unspecified task. Each instance of random parameters, along with associated dissipative processes, specifies the quantum channel U(u; θ), which we refer to as an "encoding". We will study the performance of an overall ansatz by looking at the behavior averaged across encodings as hyperparameters such as J are varied. In this work we also choose τ = 3, which limits circuit depth and the associated prevalence of gate errors, while still generating a complex state with correlations generally distributed throughout all qubits.
Finally, we consider feature extraction via a computational basis measurement, as is standard in quantum information processing: the POVM elements are the K = 2^L projectors M̂ₖ = |bₖ⟩⟨bₖ|, where bₖ is the L-bit binary representation of the integer k. However, as with state preparation, measurements in any other basis can be (and in practice, are) realized using an additional unitary prior to computational basis readout, whose effect can similarly be analyzed as part of the general encoding U(u; θ).
Note that for this ansatz, the choice J = 0 (mod π) yields either W = Î or σ̂_z ⊗ σ̂_z, both of which ensure ρ(u) is a product state, so that measured features are simply products of uncorrelated individual-qubit observables, equivalent to a noisy classical system. Starting from this product system (PS), tuning the coupling to J ≠ 0 (mod π) provides a controllable parameter to realize a quantum correlated system (CS), for which the 2^L-dimensional multinomial distribution x(u) cannot be represented as a tensor product of L marginal binomial distributions on each qubit. In general, such non-product systems intuitively result in u-dependent quantum states which exhibit entanglement and can potentially be more difficult to describe classically. This control enables us to address a natural question regarding the EC of quantum systems under finite S: what is the dependence of the EC and the realizable eigentasks on J, and hence on quantum correlations?
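The PS/CS distinction in outcome statistics can be made concrete on two qubits. The minimal sketch below (example states chosen purely for illustration) checks whether a 4-outcome computational-basis distribution factorizes into the product of its single-qubit marginals:

```python
import numpy as np

# Two-qubit examples (illustrative): the product state |+>|+> versus the
# Bell state (|00> + |11>)/sqrt(2), measured in the computational basis
plus = np.array([1.0, 1.0]) / np.sqrt(2)
psi_product = np.kron(plus, plus)
psi_bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

def outcome_probs(psi):
    """Born-rule probabilities x_k = |<b_k|psi>|^2."""
    return np.abs(psi) ** 2

def factorizes(x, tol=1e-12):
    """Does the 4-outcome distribution equal the product of its marginals?"""
    p = x.reshape(2, 2)
    p0, p1 = p.sum(axis=1), p.sum(axis=0)   # marginals of qubit 1 and qubit 2
    return bool(np.allclose(p, np.outer(p0, p1), atol=tol))
```

The Bell state has uniform single-qubit marginals whose product predicts probability 1/4 for every bitstring, while the true distribution assigns 1/2 each to 00 and 11: precisely the kind of non-product multinomial referred to above.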

A. Expressive Capacity of Quantum Circuits
To perform the capacity analysis, one must extract measured features from the quantum system as the input u is varied, as exemplified in Fig. 2(a) for the IBMQ ibmq_perth device. For comparison, we also show ideal-device simulations (unitary evolution, no device noise), from which slight deviations are observed. The agreement with experimental results is improved when the effects of gate errors, readout errors, and qubit relaxation are included, hereafter referred to as "device noise" simulations, highlighting both the non-negligible role of device nonidealities and the fact that our analysis incorporates them.
The measured features under finite S are used to estimate the Gram and covariance matrices (see detailed techniques in Appendix D), and therefore to solve the eigenproblem Eq. (8) for NSR eigenvalues {β_k²}. Typical NSR spectra computed for a random encoding (i.e. a set of rotation parameters) on the device are shown in Fig. 2(b), for J = 0 (PS) and J = π/2 (CS), together with corresponding spectra from device noise simulations, with which they agree well. We note that at lower k, the device NSR eigenvalues are larger than those from ideal simulations, and at larger k they deviate from the direct exponential increase (with order) seen in ideal simulations. Both these effects are captured by device noise simulations as well, and can therefore be attributed to device errors and dissipation. The NSR spectra can therefore serve as an effective diagnostic tool for quantum processors and encoding schemes.
The NSR spectra can be used to directly compute the EC of the corresponding quantum device at finite S, via Eq. (7). Practically, at a given S only NSR eigenvalues β_k² ≲ S contribute substantially to the EC. An NSR spectrum with a flatter slope therefore has more NSR eigenvalues below S, which gives rise to a higher capacity.
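This counting heuristic is easy to state in code. The sketch below (with a purely illustrative exponentially growing NSR spectrum, not taken from any device) compares the exact capacity sum of Eq. (7) with the number of eigenvalues below S:

```python
import numpy as np

def capacity(beta2, S):
    """Exact EC: C_T = sum_k 1 / (1 + beta_k^2 / S)."""
    return float(np.sum(1.0 / (1.0 + np.asarray(beta2) / S)))

def n_below(beta2, S):
    """Counting approximation: number of NSR eigenvalues with beta_k^2 < S."""
    return int(np.sum(np.asarray(beta2) < S))

# Illustrative NSR spectrum growing exponentially with order k (K = 64)
beta2 = 2.0 ** np.linspace(0.0, 12.0, 64)
S = 2 ** 8
```

Each eigenvalue well below S contributes nearly 1 and each well above contributes nearly 0, so the count of eigenvalues below S closely tracks the exact sum, as the test confirms.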

FIG. 2. (a) A representation of the EC analysis, featuring the IBMQ Perth device and a schematic of the quantum circuit considered in this section. On the right, the specific feature plotted is X̄₁(u) (b₁ = 000001) with S = 2^14 shots. (b) Left panel: device noise-to-signal spectrum β_k² for a specific encoding as a correlated system (CS), J = π/2 (blue crosses), and a product system (PS), J = 0 (brown diamonds). Ideal (solid) and device noise (dashed) simulations are also shown. Note the agreement between device and simulation, along with the distortion from more direct exponential growth of β_k² with k in the ideal case, due to device errors. Right panel: C_T vs. S calculated from the left panel. At a given S, C_T can be approximated by performing the indicated sum over all β_k² < S. (c) Expressive capacity C_T (top panel) and expected total correlation T̄ (lower panel) for the chosen encoding under S = 2^14 from the IBM device, and device noise simulations (dashed peach). Average metrics over 8 random encodings for device noise (solid peach) and ideal (solid gray) simulations are also shown. The S → ∞ expressive capacity of these encodings always attains max{C_T} = 64, indicated in dashed red.
As seen in Fig. 2(b), the CS generally exhibits an NSR spectrum with a flatter slope than the PS, yielding a larger capacity for function approximation across all sampled S.
To more precisely quantify the role of quantum correlations in EC, we introduce the expected total correlation (ETC) of the measured state over the input domain of u [33,34],

T̄ = E_u[ Σ_{l=1}^{L} S(ρ_l(u)) − S(ρ_M(u)) ],

where S is the von Neumann entropy, ρ_M(u) is the measured state, and ρ_l = Tr_{[L]\{l}}{ρ_M} is the corresponding single-qubit reduced density matrix. Therefore, non-zero ETC signals the generation of quantum states over the input domain u that on average have nontrivial correlations amongst their constituents, including for example pure many-body states that are entangled. We now compute the EC and ETC using S = 2^14 in Fig. 2(c) as a function of J, for the same random encoding considered above on the device. We note that the experimental results show excellent agreement in both cases with the corresponding device noise simulation; we also show the average EC at S = 2^14 and ETC across 8 random encodings in both ideal and device noise simulations. The influence of individual encodings, i.e. random rotation parameters, is seen to manifest as small deviations from the overall EC trend governed by global hyperparameters, such as J here (or L in Appendix Fig. 9). This justifies our choice of random circuits to evaluate the overall capacity of an ansatz for learning.
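For small systems the ETC summand is straightforward to evaluate. The sketch below (two-qubit examples, pure states chosen for illustration) computes T = Σ_l S(ρ_l) − S(ρ) from von Neumann entropies, giving T = 2 for a Bell state and T = 0 for a product state:

```python
import numpy as np

def vn_entropy(rho):
    """Von Neumann entropy S(rho) = -Tr[rho log2 rho]."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]                        # drop numerically zero eigenvalues
    return float(-np.sum(w * np.log2(w)))

def reduced(rho, keep):
    """Single-qubit reduced density matrix of a 2-qubit state (keep = 0 or 1)."""
    r = rho.reshape(2, 2, 2, 2)             # indices (i, j; k, l), qubit order (1, 2)
    return np.einsum('ijkj->ik', r) if keep == 0 else np.einsum('ijik->jk', r)

def total_correlation(rho):
    """T = sum_l S(rho_l) - S(rho) for a 2-qubit state rho."""
    return vn_entropy(reduced(rho, 0)) + vn_entropy(reduced(rho, 1)) - vn_entropy(rho)

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
product = np.kron(np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2))
rho_bell, rho_prod = np.outer(bell, bell), np.outer(product, product)
```

In the text the entropies are evaluated for the measured (post-measurement) state rather than a generic pure state; the pure-state examples here simply illustrate the extremes of the summand.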
We note that product states by definition have T̄ = 0 [35]; this is seen in ideal simulations for J = 0 (mod π). However, the actual device retains a small amount of correlation at this operating point, which is reproduced by device noise simulations. This can be attributed to gate or measurement errors as well as cross-talk, the latter being especially relevant for the transmon-based IBMQ platform with a parasitic always-on ZZ coupling [36]. With increasing J, T̄ increases and peaks around J ≈ π/2 (mod π); interestingly, C_T also peaks for the same coupling range. From the analogous plot of EC, we clearly see that at finite S, increased ETC appears directly correlated with higher EC. We have observed very similar behaviour using completely different quantum system models (see Appendix Fig. 6 [37,38]). This indicates the utility of enhancing quantum correlations as a means of improving the general expressive capability of quantum systems.
We caution that this connection between measurement correlations and EC is an observed trend, rather than a law derived from first principles. One can come up with contrived situations where increasing correlation has no effect on EC: for example, appending a layer of CNOT gates directly prior to measurement will generally increase the ideal ETC of any ansatz. For measured features, however, this amounts to a simple shuffling of labels x_k(u) ↔ x_{k′}(u), thus yielding the same NSR spectrum and EC. The mapping from input to quantum state to features ultimately governs EC: only increases in correlation that also increase the complexity of the measured features' u-dependence (as achieved via the intermediate W gates here) are beneficial from the perspective of information processing.
As a final important point, note that at finite S, even with increased quantum correlations, the maximum EC is still substantially lower than the upper bound of K = 64. This remains true even for ideal simulations, and over several random encodings, so the underperformance cannot be attributed to device noise or to a poor choice of ansatz. It is worth emphasizing that the impact of device noise is captured in the small EC gap between the ideal and device noise simulation curves, with the remainder of the reduction from K = 64 attributable to QSN alone. These results clearly indicate that sampling noise at finite S, rather than other types of noise sources and errors, is the fundamental limitation for QML applications on this particular IBM device.

B. A Robust Approach to Learning
While we have demonstrated the EC as an efficiently-computable metric of the general expressive capability of a noisy quantum system, some important practical questions arise. First, does the general EC metric have implications for practical performance on specific QML tasks? Secondly, given the limiting, and unavoidable, nature of correlated sampling noise, does the EC provide any insights on optimal learning using a particular noisy quantum system and the associated encoding?
Our formulation addresses both these important questions naturally, as we now discuss. Recall that beyond being a simple figure of merit, the EC is precisely the sum of capacities to approximate a particular set of orthogonal functions native to the given noisy quantum system: the eigentasks. Furthermore, these eigentasks ȳ⁽ᵏ⁾(u) can be directly estimated from a noisy quantum system via the generalized eigenvectors {r⁽ᵏ⁾}, and are ordered by their associated NSR {β_k²}. In Fig. 3(a) we show a selection of estimated eigentasks from the device for the CS (J = π/2) and PS (J = 0) encodings of Fig. 2(b). For both systems, the increase in noise with eigentask order is apparent when comparing two sampling values, S = 2^10 and S = 2^14. Furthermore, at any order k, eigentasks for the PS are visibly noisier than for the CS; this is consistent with NSR eigenvalues for the PS being larger than those for the CS (Fig. 2(b)). The higher expressive capacity of the CS can be interpreted as the ability to accurately resolve more eigentasks at fixed S.
The resolvable eigentasks of a finitely-sampled quantum system are intimately related to its performance on specific QML applications. To demonstrate this result, we consider a concrete application: a binary classification task that is not linearly separable. The domain u ∈ [−1, 1] over which the EC was evaluated is separated into two classes, as depicted in Fig. 3(b). A set of N_train = 150 total samples, with equal numbers from each class, is input to the IBMQ device, and the eigentasks {ȳ^(k)(u^(n))} for k < K_L are estimated using S = 2^14 shots. A linear estimator applied to this set of eigentasks is then trained using logistic regression to learn the class label associated with each input. Finally, the trained IBMQ device is used to predict class labels of N_test = 150 distinct input samples for testing. Note that we use the random circuits of the previous section to draw more direct comparisons between EC and task performance. By training only external weights instead of internal parameters θ we are employing the framework of Reservoir Computing [22, 23], which allows one to avoid the computational overhead and difficulty associated with training quantum systems while still achieving comparable performance [4, 13, 26, 32].
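As a rough numerical illustration of this training pipeline (a sketch, not the authors' code): the block below fits only an external linear layer by logistic regression on a fixed set of stand-in "eigentask" features. The feature map, noise level, and learning rate are hypothetical placeholders for the device-estimated eigentasks ȳ^(k)(u^(n)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the estimated eigentasks ybar_k(u^(n)): rows are inputs,
# columns are the first K_L eigentask features (hypothetical surrogate).
N_train, K_L = 150, 8
u = rng.uniform(-1, 1, N_train)
labels = (u > 0).astype(float)                         # two classes on [-1, 1]
features = np.sin(np.outer(u, np.arange(1, K_L + 1)))  # odd in u, so separable
features += rng.normal(0, 0.05, features.shape)        # mimic sampling noise

# Train only the external linear weights (reservoir-computing style),
# via plain gradient descent on the logistic loss.
w, b = np.zeros(K_L), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))      # sigmoid output
    w -= 0.5 * features.T @ (p - labels) / N_train
    b -= 0.5 * np.mean(p - labels)

acc = np.mean(((features @ w + b) > 0) == (labels > 0.5))
print(f"training accuracy: {acc:.2f}")
```

The internal feature map is never updated; only the readout weights w, b are trained, mirroring the Reservoir Computing setting described above.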
This task can equivalently be cast as one of learning, with minimum error, the likelihood function that discriminates the two input distributions, shown in Fig. 3(c). The set of up to K_L eigentasks ȳ^(k)(u), where K_L ≤ K, serves as the native orthonormal basis of readout features used to approximate any target function using the quantum system. Importantly, the basis is ordered, with eigentasks at higher k contributing more noise, as dictated by the NSR eigenvalues β²_k. In particular, at any level of sampling S, there exists an eigentask order K_c(S) beyond which the NSR β²_k/S first exceeds unity: K_c(S) = max_k {β²_k < S}. Heuristically, including eigentasks k > K_c(S) should contribute more 'noise' than 'signal' to the function approximation task. In Fig. 3(c), we plot the learned estimates of the likelihood function using K_L = K_c(S) eigentasks for both the CS and PS. First, we note that K_c is lower for the PS than the CS; the former has fewer resolvable eigentasks at a given S. This limitation on resolvable features limits function approximation capacity: the learned estimate of the likelihood function using K_c eigentasks is visibly worse for the PS than for the CS.
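The cutoff rule K_c(S) = max_k {β²_k < S} can be sketched directly in code; the NSR spectrum used below is a hypothetical stand-in, not device data.

```python
import numpy as np

def cutoff_order(nsr_eigenvalues, S):
    """Number of resolvable eigentasks at S shots.

    nsr_eigenvalues: NSR spectrum beta_k^2 (any order; sorted internally).
    Returns K_c(S) = max_k {beta_k^2 < S}, i.e. the count of eigentasks
    whose noise-to-signal ratio beta_k^2 / S stays below unity.
    """
    beta2 = np.sort(np.asarray(nsr_eigenvalues))
    return int(np.searchsorted(beta2, S))

# Hypothetical NSR spectrum growing with eigentask order
beta2 = np.array([2.0, 10.0, 1e2, 1e3, 1e5, 1e7])
print(cutoff_order(beta2, S=2**14))  # -> 4
```

At S = 2^14 only the first four eigentasks have β²_k < S; the last two would contribute more noise than signal.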
In this way, higher EC allows noisy quantum systems to better approximate more functions, which translates to improved learning performance; this result is explored systematically in Fig. 4(b). Of course, it is natural to ask whether using K_c(S) ≤ K eigentasks is optimal: exactly this question is investigated in Fig. 4(a), where we plot the training and test accuracy of both device encodings as a function of the number of measured eigentasks K_L. The performance on the specific training and test set shown in Fig. 3(b) is indicated with markers, and solid lines indicate the average performance over 10 distinct divisions of the data into training and test sets. This permutation of the learning task is a standard technique to optimize hyperparameters in ML, and is done here to eliminate the sensitivity of these results to the choice of training set. First note that in all cases, using all eigentasks (K_L = K), or equivalently all measured features {X̄}, leads to far lower test accuracy than is found in training. The observed deviation is a distinct signature of overfitting: the optimized estimator learns noise in the training set (comprised of noisy eigentask estimates).

FIG. 4. (a) Training and test accuracy versus the number of measured eigentasks K_L. The optimal test set performance is found near the noise-to-signal cutoff K_c(S = 2^14) (dash-dotted lines) informed by the quantum system's noise-to-signal spectra. (b) Test set classification accuracy as a function of J for our optimal learning method. In all cases, the average performance over the 10 task permutations is reported, using K_c(S = 2^14). Markers indicate device results for the chosen encoding, and the corresponding simulation is shown in solid peach. Dashed peach shows the average of these results over the 8 device-noise simulation encodings, and dashed grey the ideal simulation performance in the S → ∞ limit, where all K = 64 features are used. The horizontal line denotes the performance of a software neural network with K_L = 64 nodes (and 1153 trained parameters) for comparison.
Improvements in model training performance with added features are only meaningful insofar as they also lead to better performance on new data: in both encodings, test set classification accuracy peaks near K_c(S). This is particularly clear for the averaged results, but even for individual datasets the test accuracy at K_c(S) is within ≈ 2% of its maximum, confirming our heuristic reasoning that eigentasks beyond this order, with an NSR exceeding unity, hinder learning. The eigentask-learning approach naturally allows one to decompose the outputs from quantum measurements into a compressed basis with known noise properties, and then select the subset that exactly captures the resolvable information at a given S. This robust approach to learning, enabled by the capacity analysis, maximizes the ability of a noisy quantum system to approximate functions without overfitting to noise, in this case fundamental QSN.
Finally, Fig. 4(b) shows the classification accuracy for this device encoding as J is varied, where, following the above approach, the optimal K_c(S) set of eigentasks is used for each encoding. We also show the performance of a similar-scale (K_L = 64 node) software neural network and ideal simulations in the S → ∞ limit (K_c(∞) = 64) for comparison. Note that only these infinite-shot results approach the classical neural network, with QSN imposing a significant performance penalty even for J ≈ π/2 (mod π). We highlight the striking similarity with Fig. 2(c): encodings with larger quantum correlations, and thus higher expressive capacity, perform generically better on learning tasks in the presence of noise, because they generate a larger set of eigentasks that can be resolved at a given sampling S. The Expressive Capacity is a priori unaware of the specific problem considered here; this example thus emphasizes its power as a general metric predictive of performance on arbitrary tasks.

IV. DISCUSSION
We have developed a straightforward approach to quantify the expressive capacity of any quantum system in the presence of fundamental sampling noise. Our analysis is built upon an underlying framework that determines the native function set that can be most robustly realized by a finitely-sampled quantum system: its eigentasks. We use this framework to introduce a methodology for optimal learning using noisy quantum systems, which centers on identifying the minimal number of eigentasks required for a given learning task. The resulting learning methodology is resource-efficient and robust to overfitting. We demonstrate that eigentasks can be efficiently estimated from experiments on real devices using a limited number of training points and finite shots. We also demonstrate, across two distinct qubit-based ansätze, that the presence of measured quantum correlations enhances expressive capacity. Our work has direct application to the design of circuits for learning with qubit-based systems. In particular, we propose the optimization of expressive capacity as a meaningful goal for the design of quantum circuits with finite measurement resources.
Appendix A: Table of notations

β²_k — Eigen-NSR associated with eigentask k
y^(k) — Eigentask k
r^(k) — Linear combination of expected features {x_k} forming y^(k)
ȳ^(k) — Finite-S estimate of eigentask k
K_c — Cutoff index where β²_k approaches S: max_k {β²_k < S}

TABLE I. Table of notations used in the main text.
Appendix B: Feature maps using quantum systems

In the main text, we introduce the idea of encoding inputs into the state of a quantum system via a parameterized quantum channel; one then measures this state to approximate desired functions of the input. Figure 5 gives a simple example of this mapping from classical inputs u (here in a 2D compact domain) to a quantum state generated by a u-dependent encoding, and finally to the measured features in a 2-qubit system undergoing commuting local measurements in the computational basis. The measurement outcomes are therefore bitstrings, of which there are K = 2^L = 4, namely b_k ∈ {00, 01, 10, 11}. A given shot yields one of these possible bitstrings.
On the right we plot samples of features X̄_k constructed with different numbers of shots S. As expressed in Eq. (4), the noise, and thus the variance of the distribution of samples, scales with S. As S → ∞ this distribution collapses to a single deterministic point, the corresponding quantum probability x(u). It is also evident from this plot that the shape and orientation of these clusters depend on the underlying quantum state ρ(u; θ) and the associated probabilities x(u) via Eq. (5). In the remainder of this section, we consider more complex quantum models that generate mappings useful for learning. To describe these models, we first restrict to 1-D inputs u as analyzed in the main text; generalizations to multidimensional inputs u are straightforward. We then write Eq. (B1) in this restricted form. In the main text, we have considered a model for the dynamics of an L-qubit quantum system that is natively implementable on modern quantum computing platforms: an ansatz of quantum circuits with single- and two-qubit gates. We refer to this encoding as the circuit ansatz (or C-ansatz for short), for which the operator Û(u; θ) takes the form of Eq. (B3). For completeness, we recall that R_{x/z} are Pauli rotations applied qubit-wise, e.g. R_z = Π_l R_z(θ^z_l + θ^I_l u), while the coupling gate acts between physically connected qubits in the device and can be written as Ŵ(J) = Π_{⟨l,l'⟩} exp{−i (J/2) σ̂^z_l σ̂^z_{l'}}. We emphasize again that τ ∈ N⁺ is an integer, representing the number of repeated blocks in the C-ansatz encoding. We note that the actual operations implemented on IBMQ processors also include dynamics due to noise, gate, and measurement errors, and thus must be represented as a general quantum channel, as in Eq. (B1). As discussed in the main text, the EC of a quantum system can be computed in the presence of these more general dynamics, and is sensitive to the limitations introduced by them.
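A minimal numerical sketch of such a 2-qubit feature map (a hypothetical product-state encoding for illustration, not the C-ansatz itself): we compute the exact bitstring probabilities x(u) over the K = 4 outcomes, and their finite-S empirical estimates X̄(u), whose scatter shrinks as S grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def probabilities(u):
    """Measurement probabilities x(u) for a toy 2-qubit product encoding.

    Hypothetical encoding: qubit l is rotated by an angle set by u_l and
    measured in the computational basis -> K = 4 bitstring outcomes.
    """
    p0 = np.cos(u[0] / 2) ** 2, np.sin(u[0] / 2) ** 2   # qubit 0: P(0), P(1)
    p1 = np.cos(u[1] / 2) ** 2, np.sin(u[1] / 2) ** 2   # qubit 1: P(0), P(1)
    return np.outer(p0, p1).ravel()                      # order: 00, 01, 10, 11

def sampled_features(u, S):
    """Finite-S estimate Xbar_k(u): empirical bitstring frequencies."""
    counts = rng.multinomial(S, probabilities(u))
    return counts / S

u = np.array([0.7, -0.3])
x = probabilities(u)
for S in (10**2, 10**4, 10**6):
    print(S, np.abs(sampled_features(u, S) - x).max())   # shrinks ~ 1/sqrt(S)
```

The printed deviation between X̄(u) and x(u) illustrates the collapse of the sampling distribution onto the deterministic point x(u) as S → ∞.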
An alternative ansatz, which we analyze in this SI, is one where the operator Û(u; θ) describes continuous Hamiltonian dynamics. This ansatz is relevant to computation with general quantum devices, such as quantum annealers and, more generally, quantum simulators. We refer to this case as the Hamiltonian ansatz (or H-ansatz for short), Eq. (B4). Here t is a continuous parameter defining the evolution time, and Ĥ0 contains local fields drawn from N(0, 1), the standard normal distribution with zero mean and unit variance. We consider nearest-neighbor interactions J_{l,l'}, which can be constant, J_{l,l'} ≡ J, or drawn from the uniform distribution, J_{l,l'} ~ Unif[0, J_max]. As an aside, we note that the C-ansatz quantum channel described by Eq. (B3) can be considered a Trotterization-inspired implementation of the H-ansatz in Eq. (B4). In particular, if we set θ^{x/z/I} = h^{x/z/I} Δ·τ, where t = Δ·τ, and consider the limit Δ → 0 while keeping t fixed, Eq. (B3) corresponds to a Trotterized implementation of Eq. (B4). This correspondence is chosen for practical reasons, but is not necessary for our analysis.
Taking the ensemble averages as readout features, a universal function approximation property, a basic requirement of most neural network models (formally stated in Appendix J), can be made concrete by defining a metric that quantifies how well a given quantum system (or any dynamical system) approximates general functions. Suppose an arbitrary probability distribution p(u) for a random (scalar) variable u defined on D ⊆ R. This naturally defines a function space L²_p(D) containing all functions f : D → R with ∫ f²(u) p(u) du < ∞, equipped with the inner product ⟨f₁, f₂⟩_p = ∫ f₁(u) f₂(u) p(u) du. A standard way to quantify the ability of a physical system to fit nonlinear functions is the information processing capacity [22], in which the targets f_ℓ(u) are orthogonal functions, ⟨f_ℓ, f_ℓ'⟩_p = ∫ f_ℓ(u) f_ℓ'(u) p(u) du = 0 for ℓ ≠ ℓ'. The total expressive capacity captures what class of functions linear combinations of the physical system's readout features can produce. Dambre et al. [22] show that the total capacity is upper bounded by the number of features, C_T ≤ K.

While Dambre et al.'s result is quite general, it neglects the limitations due to noise in the readout features, which are unavoidable when using quantum systems with finite computational and measurement resources. It is generally accepted that capacity is reduced in the presence of additive noise, but there are no general results quantifying that reduction. Our goal here is to arrive at an exact result for the capacity reduction under well-defined conditions.
In this section, we focus on the impact of fundamental quantum readout noise, i.e. quantum sampling noise (QSN), on this upper bound under finite sampling S. Given u and S, the quantum readout features X̄_k(u) = (1/S) Σ_{s=1}^{S} δ(k^(s)(u), k) are stochastic variables. The expectation vector and covariance matrix of X̄(u) can be expressed in terms of ρ(u). To determine the optimal capacity to approximate an arbitrary normalized function f(u) = Σ_{j=0}^{∞} (Y)_j u^j using the noisy readout features X̄(u) extracted from the quantum system, we need to find an optimal W. Expanding the numerator of the right-hand side for a given, finite number of shots S, we approximate the integral over the input domain by a finite sum in the limit of a large number of inputs N. Next, note that if n ≠ n', then X̄_{k1}(u^(n)) and X̄_{k2}(u^(n')) are independent random variables (though not necessarily identically distributed). The sums over N on the right-hand side are therefore sums of bounded independent random variables. In the limit N ≫ 1, the deviation between stochastic realizations of these sums and their expectation values is exponentially suppressed, as determined by the Hoeffding inequality. With high probability, the sums over N may then be replaced by their expectation values. The first approximation above comes from the Hoeffding inequality, where the dropped terms are proportional to 1/√N. In going from the second to the third line, we have used Eq. (C3). The final expression is obtained by rewriting sums over u as integrals, incurring an error proportional to 1/√N once more. Thus the original integral in Eq. (C4) is approximately equal to Eq. (C6) to O(1/√N). In the limit of a large number of input samples, N → ∞, all approximations can be replaced by exact equalities.
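The multinomial character of QSN invoked here can be checked numerically; the outcome probabilities below are arbitrary placeholders for some x(u) at a fixed input u.

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary placeholder outcome probabilities x(u) at one fixed input u
x = np.array([0.4, 0.3, 0.2, 0.1])
S, trials = 500, 20000

# Empirical covariance of the finite-S features Xbar over many repetitions
samples = rng.multinomial(S, x, size=trials) / S
emp_cov = np.cov(samples, rowvar=False)

# QSN prediction: multinomial covariance (diag(x) - x x^T) / S
pred_cov = (np.diag(x) - np.outer(x, x)) / S
print(np.abs(emp_cov - pred_cov).max())   # small residual: MC error only
```

The empirical covariance of X̄ matches (diag(x) − x x^T)/S, the correlated single-input covariance that underlies the noise matrix V introduced below.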
The goal of the remainder of this section is to deduce a more compact, generalized Rayleigh quotient form of the functional capacity. The dependence of the readout features x_k(u) on the input u can always be written as a Taylor expansion, where we define the transfer matrix T(θ) ≡ T ∈ R^{K×∞}, which depends on the density matrix ρ(u), and in particular on the parameters θ characterizing the quantum system. The first term in Eq. (C6) does not depend explicitly on the function f(u) being constructed, and introduces quantities determined entirely by the response of the quantum system to inputs over the entire domain of u. In particular, we introduce the Gram matrix G ∈ R^{K×K}, where in the second line we have also introduced the generalized Hilbert matrix Λ ∈ R^{∞×∞}. Secondly, we introduce the noise matrix V ∈ R^{K×K}. Here we have also introduced the second-order-moment matrix D ∈ R^{K×K} such that (D)_{k1k2} = δ_{k1k2} ∫ x_{k1}(u) p(u) du. The noise matrix then simply defines the covariance of the readout features, and is therefore given by V = D − G. The second term in Eq. (C6) depends on f(u) and can be simplified using the Λ matrix as well. With these definitions, Eq. (C4) can be compactly written in matrix form as a Tikhonov regularization problem. The least-squares form ensures that the optimal value (argmin) w of W has the closed form
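These definitions translate directly into code; the toy feature set and uniform p(u) below are assumptions for illustration, not the paper's model.

```python
import numpy as np

# Toy model: K = 3 expected features x_k(u) on u in [-1, 1], uniform p(u)
us = np.linspace(-1, 1, 2001)
p = np.full_like(us, 1 / 2)          # uniform density on [-1, 1]
du = us[1] - us[0]

def x_of_u(u):
    raw = np.array([1 + np.cos(u), 1 + np.sin(u), 1 + u**2])
    return raw / raw.sum()           # probabilities: sum_k x_k(u) = 1

X = np.stack([x_of_u(u) for u in us], axis=1)        # shape (K, N)

# Gram matrix (G)_{k1 k2} = int x_{k1}(u) x_{k2}(u) p(u) du
G = (X * p) @ X.T * du
# Second-order-moment matrix D: diagonal in the probability representation
D = np.diag((X * p).sum(axis=1) * du)
# Noise matrix: u-averaged covariance of the single-shot features
V = D - G
print(np.linalg.eigvalsh(V).min())   # V is positive semidefinite
```

Since each per-input covariance diag(x(u)) − x(u)x(u)^T is positive semidefinite, so is its p(u)-weighted average V = D − G.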

w = (G + V/S)^{−1} TΛY.  (C13)
Substituting w into the expression for C, we obtain the optimal capacity with which a function f can be constructed, which takes the form of a generalized Rayleigh quotient

Eigentasks
Eq. (C14) defines the optimal capacity for approximating an arbitrary function f(u) = Σ_{j=0}^{∞} (Y)_j u^j. We can therefore naturally ask which functions f maximize this optimal capacity. To this end, we first note that the denominator of Eq. (C14) is simply a normalization factor that can be absorbed into the definition of the function f(u) being approximated, without loss of generality; more precisely, we normalize f accordingly. Then, we can rewrite the optimal capacity from Eq. (C14) in the form of Eq. (C16). Here we have defined the matrix Q ∈ R^{∞×∞} in Eq. (C17), by introducing the matrix square root G^{1/2} ∈ R^{K×K} and the noise-to-signal matrix R. The decomposition in Eq. (C17) may be verified by direct substitution into Eq. (C16). The ability to calculate matrix powers, and in particular the inverse of G, requires constraints on its rank, which we show are satisfied in Appendix C 6.
We now consider the measure-independent part of the eigenvectors of Q, indexed Y^(k), satisfying the standard eigenvalue problem Eq. (C20), where k = 0, ..., K − 1. From Eq. (C16), it is clear that these eigenvectors have a particular meaning. Consider the function y^(k)(u) defined by the eigenvector Y^(k), which we will refer to from now on as an eigentask. Suppose we wish to construct the function y^(k)(u) using outputs obtained from the physical system defined by Q in the S → ∞ limit (namely, with deterministic outputs). At first glance, before solving the eigenproblem Eq. (C20), we do not know any relationship between y^(k) and x(u); the remainder of this subsection proves that y^(k) must be a specific linear combination of the features x(u). The physical system's capacity for this construction is then simply given by the corresponding eigenvalue C_k, as may be seen by substituting Eq. (C20) into Eq. (C16). Formally, y^(k)(u) is a critical point (stationary point) of the generalized Rayleigh quotient in Eq. (C14). Consequently, the function that is constructed with largest capacity corresponds to the nontrivial eigenvector with largest eigenvalue.
To obtain these eigentasks, we must solve the eigenproblem defined by Eq. (C20). Here, the representation of Q in Eq. (C17) becomes useful, as the eigensystem of Q is closely related to that of the noise-to-signal matrix R. In particular, we first define the eigenproblem of R, R G^{1/2} r^(k) = β²_k G^{1/2} r^(k), with NSR eigenvalues β²_k and corresponding eigenvectors r^(k), which satisfy the orthogonality relation r^(k')T G r^(k) = δ_{k,k'}. Here r^(k) can equivalently be defined as the solution to the generalized eigenproblem V r^(k) = β²_k G r^(k); this is because R G^{1/2} r^(k) = G^{−1/2} V r^(k) = β²_k G^{1/2} r^(k). The prefactor G^{1/2} is introduced for later convenience. Eq. (C22) then allows us to define the related eigenproblem. Next, we note that Q is related to the matrix in brackets above via a generalized similarity transformation defined by B, Eq. (C17). Comparing with Eq. (C20), we can now simply read off both the eigenvalues and eigenvectors of Q, where we have used the definition of B from Eq. (C18). The functions defined by the eigenvectors Y^(k) are automatically orthonormalized.
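This chain of eigenproblems can be sketched compactly; the positive-definite matrices below are random stand-ins for G and V, not derived from any physical encoding.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random positive-definite stand-ins for the Gram matrix G and noise matrix V
K = 4
A = rng.normal(size=(K, K))
G = A @ A.T + K * np.eye(K)
B = rng.normal(size=(K, K))
V = B @ B.T

# Noise-to-signal matrix R = G^{-1/2} V G^{-1/2}
ev, U = np.linalg.eigh(G)
Gih = U @ np.diag(ev**-0.5) @ U.T          # G^{-1/2}
beta2, W = np.linalg.eigh(Gih @ V @ Gih)   # NSR eigenvalues beta_k^2

# Generalized eigenvectors r^(k) = G^{-1/2} w^(k): they solve
# V r = beta^2 G r and are G-orthonormal, r^(k')T G r^(k) = delta_{k,k'}
r = Gih @ W
print(np.abs(V @ r - G @ r * beta2).max())
print(np.abs(r.T @ G @ r - np.eye(K)).max())
```

Both printed residuals are at machine precision, confirming that diagonalizing R and solving the generalized eigenproblem (V, G) yield the same spectrum and the same G-orthonormal eigenvectors.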

Noisy eigentasks from readout features
We can now also discuss the interpretation of {β²_k} for a physical system, in this case a quantum circuit, for which {r^(k)} are known. Consider a single run of the quantum system under finite shots S, which yields a single instance of the readout features X̄(u). A noisy version of the kth eigentask, ȳ^(k)(u), can then be constructed as the corresponding linear combination, which is equivalent to choosing the output weights W = r^(k). The corresponding set of noisy functions is also orthogonal. This statement can be decomposed into two parts: defining the linear transformation ξ^(k)(u) of the noise ξ(u), the combinations {r^(k) ∈ R^K}_{k∈[K]} not only produce orthogonal signal eigentasks {y^(k)(u)}, but also induce a set of orthogonal noise functions {ξ^(k)(u)}.
If the quantum circuit can be run multiple times for a given S, multiple instances of X̄(u) can be obtained, from each of which an estimate of the kth eigentask ȳ^(k)(u) can be constructed; the expectation value of these estimates then simply yields y^(k)(u). If we have access to only a single instance of X̄(u), however, and thus only one estimate ȳ^(k)(u) (as y^(k)(u) and ȳ^(k)(u) are depicted in Fig. 8), it is useful to know the expected error in this estimate. This error can be extracted from Eq. (C12): requiring Y^(k) = T^T r^(k), the mean squared error in using ȳ^(k)(u) to estimate y^(k)(u) over the domain of u decreases to zero as S → ∞, as expected, since the noise in X̄ decreases with S. However, β²_k defines the S-independent contribution to the error. In particular, this indicates that at a given S, functions with lower NSR eigenvalues β²_k may be better approximated using this physical system than others. We present in Fig. 8 the measured features X̄, the eigentasks y, and their finite-S estimates ȳ in a 6-qubit Hamiltonian-based system. The associated eigen-NSR spectrum, expressive capacity, and total correlations are also depicted for both the CS (J ≠ 0) and the PS (J = 0).
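The u-averaged mean squared error of the eigentask estimate, E_u E|ȳ^(k) − y^(k)|² = β²_k/S, can be verified by direct Monte-Carlo sampling on a toy three-outcome system; all model details below (feature map, grid, shot counts) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy K = 3 outcome system over u in [-1, 1], uniform p(u) (all assumed)
us = np.linspace(-1, 1, 401)

def x_of_u(u):
    raw = np.array([1.5 + np.sin(u), 1.2 + np.cos(2 * u), 1.0 + u**2])
    return raw / raw.sum()               # outcome probabilities x_k(u)

X = np.stack([x_of_u(u) for u in us], axis=1)   # shape (K, N)
G = X @ X.T / len(us)                            # Gram matrix E_u[x x^T]
D = np.diag(X.mean(axis=1))                      # second-moment matrix
V = D - G                                        # single-shot covariance

# Solve V r = beta^2 G r via R = G^{-1/2} V G^{-1/2}
ev, U = np.linalg.eigh(G)
Gih = U @ np.diag(ev**-0.5) @ U.T
beta2, W = np.linalg.eigh(Gih @ V @ Gih)         # NSR eigenvalues
r = Gih @ W                                      # G-orthonormal r^(k)

# Monte-Carlo check of E_u E|ybar_k - y_k|^2 = beta_k^2 / S for k = 1
S, reps, k = 200, 2000, 1
y_exact = r[:, k] @ X                            # eigentask y^(k)(u)
mse = 0.0
for n in range(len(us)):
    ybar = (rng.multinomial(S, X[:, n], size=reps) / S) @ r[:, k]
    mse += np.mean((ybar - y_exact[n]) ** 2)
mse /= len(us)
print(mse, beta2[k] / S)                         # the two agree closely
```

Note that the k = 0 eigentask of a probability-valued feature set is the constant function (V annihilates the all-ones combination), so its NSR vanishes identically; we therefore test k = 1.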

Expressive capacity
Given an arbitrary set of complete orthonormal basis functions, the total capacity is independent of the basis choice.

5. Estimation in the case of nonlinear functions after a linear output layer

Usually, instead of taking the linear transformation W · X̄, the training process can involve complicated nonlinear activation functions or a classical kernel, which may also be fed into a non-quadratic nonlinear loss function afterwards. These two processes can be unified as σ_NL(X̄(u)) for a smooth function σ_NL. In this subsection, we show how to translate our result, obtained for the quadratic loss of Eq. (C4), into a more general loss function. We first transform all noisy measured features {X̄_k} into the naturally orthogonal bases of signal {y^(k)} and noise {ξ^(k)}.
We can then deal with the nonlinearity by taking the cumulant expansion up to the quadratic term, where the first-order terms vanish, again by the Hoeffding inequality. We then make a further approximation of Eq. (C38) by replacing ξ^(k1) ξ^(k2) with its u-average E_u[ξ^(k1) ξ^(k2)] = δ_{k1k2} β²_{k1}/S. In fact, any of the second-order terms can be further simplified by the chain rule. The approximation in Eq. (C39) is rough, but it gives sufficient reason for the following manipulation: for the optimized loss L, the dependence on eigentasks y^(k) with β²_k/S > 1 is strongly suppressed in the large-N limit, hence we can pre-exclude the eigentasks with β²_k/S > 1. Let us use one typical example, the widely used logistic regression for classification, to illustrate this argument. As introduced in Appendix J, the target function is the conditional probability distribution f(u) := Pr[u ∈ C₁ | u] in such a classification model (see Eq. (J4)), and there is one additional layer of sigmoid and cross-entropy acting on the linear map, where σ is the sigmoid function, σ(z) = 1/(1 + exp(−z)), and H(p, q) = −p ln q − (1 − p) ln(1 − q) is the cross-entropy. In particular, any linear combination of {X̄_k} can be translated into a linear combination in the eigentask basis; the corresponding vector Ω = Γ^T W must also exist uniquely. For any σ_NL = g(W · x̄), one can then read off that the prefactor β²_k/S induces a natural regularization on Ω_k in the loss function, in addition to the S → ∞ term lim_{S→∞} L = E_u[H(f, σ(Ω · y))]. We leave the detailed discussion of this important application to Appendix I and Appendix J.
Recall that before analytically finding the eigenvectors of Q, we must first show that the matrix G is invertible. This follows because linear independence of all K readout features {x_k(u)}_{k∈[K]} is equivalent to full-rankness of the corresponding Gram matrix, Rank(G) = K. Thanks to the linearity of readout, we can show this linear independence by contradiction. Suppose on the contrary that there exist coefficients {c_k}_{k∈[K]} such that the combination vanishes identically. This would mean that the quantum observable Σ_{k=0}^{K−1} c_k Ê_k has zero expectation in any state Û(u)ρ₀ under arbitrary input u, which is impossible. This shows the linear independence. Furthermore, it ensures that G has no non-trivial null space: for any {c_k}_{k∈[K]}, the quadratic form c^T G c equals the squared norm of the corresponding function of u, which vanishes only for the trivial combination.

We have shown that the problem of obtaining the eigentasks for a generic quantum system, and deducing its expressive capacity under finite measurement resources, can be reduced simply to solving the eigenproblem of its noise-to-signal matrix R, Eq. (C22). Note that constructing R = G^{−1/2} V G^{−1/2} requires computing the inverse of G. However, G can have small (although always nonzero) eigenvalues, especially for larger systems, rendering it ill-conditioned and making the computation of R numerically unstable. Fortunately, certain simplifications can be made to derive an equivalent eigenproblem that is much easier to solve. Practically, the probability representation is native to measurement schemes in contemporary quantum processors, and therefore minimizes the required post-processing of readout features obtained from a real device. More importantly, the strength of the probability representation lies in the fact that it renders the second-order-moment matrix D diagonal. Finally, considering the inverse of the matrix on the left-hand side, we obtain the simplified eigenproblem for the matrix D^{−1}G, which shares eigenvectors with R, and whose eigenvalues are a simple transformation of the NSR eigenvalues β²_k. Importantly, constructing D^{−1}G no longer requires calculating any powers of G, and when further choosing readout features in the probability representation, it relies only on the inversion of a simple diagonal matrix D.
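The equivalence asserted here can be sketched explicitly. Starting from V r = β² G r with V = D − G, one gets D r = (1 + β²) G r, i.e. D^{−1}G r = r/(1 + β²), so the eigenvalues α_k of D^{−1}G satisfy α_k = 1/(1 + β²_k). The matrices below are synthetic stand-ins chosen only to have the right structure (G positive definite, D diagonal).

```python
import numpy as np

rng = np.random.default_rng(5)

K = 5
A = rng.normal(size=(K, K))
G = A @ A.T / K + np.eye(K)                      # synthetic Gram matrix (PD)
D = np.diag(np.diag(G) + rng.uniform(1, 2, K))   # synthetic diagonal D
V = D - G                                        # noise matrix, V = D - G

# Route 1: NSR eigenvalues beta_k^2 from R = G^{-1/2} V G^{-1/2}
ev, U = np.linalg.eigh(G)
Gih = U @ np.diag(ev**-0.5) @ U.T
beta2 = np.linalg.eigh(Gih @ V @ Gih)[0]

# Route 2: eigenvalues alpha_k of D^{-1} G -- no powers of G required
alpha = np.linalg.eigvals(np.linalg.inv(D) @ G).real

# V r = beta^2 G r  <=>  D^{-1} G r = r / (1 + beta^2)
print(np.sort(1 / (1 + beta2)))
print(np.sort(alpha))
```

The two printed spectra coincide, and route 2 only ever inverts the diagonal matrix D, which is the numerical advantage described above.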
The matrix D^{−1}G has significance in spectral graph theory, when interpreting the Gram matrix G as the adjacency matrix of a weighted graph. This connection is elaborated upon in Appendix C 8.
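As a small numerical preview of that connection (with an assumed toy adjacency matrix): the normalized Laplacian of a graph acquires one zero eigenvalue per connected component, so the extreme, fully disconnected case gives K zeros.

```python
import numpy as np

# Block-diagonal (disconnected) adjacency: two components, no edges between
G = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
D = np.diag(G.sum(axis=1))                   # degree matrix
Dih = np.diag(np.diag(D) ** -0.5)            # D^{-1/2}
L = np.eye(4) - Dih @ G @ Dih                # normalized Laplacian

# One zero eigenvalue per connected component (here 2 of 4)
print(np.sort(np.linalg.eigvalsh(L)))
```

With two disconnected blocks, the sorted spectrum contains exactly two (numerically) zero eigenvalues.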

Connections to spectral graph theory
Let us take a small digression into the graph-theoretic meaning of G and D^{−1}G. Consider a weighted graph with adjacency matrix G. In spectral graph theory, the matrix D^{−1}G is exactly the random-walk matrix associated with the graph G, and the second-order matrix D is precisely its degree matrix, since (D)_{kk} = Σ_{k'=0}^{K−1} (G)_{kk'}. The eigentask combination coefficients r^(k) are then precisely the right eigenvectors of the random-walk matrix. Another concept associated with a graph is the normalized Laplacian matrix of G, while the matrix D^{−1/2} G D^{−1/2} is often referred to as the normalized adjacency matrix in the literature. The eigenproblem of the normalized adjacency matrix can also be solved easily. From the perspective of spectral graph theory, roughly speaking, reservoirs with a stronger ability to resist noise are those with more "bottlenecks" in the connectivity of the graph G. The extreme case is α_k = 1 (or 1 − α_k = 0) for all k. According to a basic result in spectral graph theory, the normalized Laplacian matrix has K zero eigenvalues iff the graph G is fully disconnected. This gives the condition under which the noisy information capacity attains its upper bound K: there must exist a partition of the graph into K disconnected components.

Appendix D: Spectral analysis based on finite statistics

While Eq. (C46) is a numerically simpler eigenproblem to solve than Eq. (C22), it still requires the approximation of G (recall that D can be obtained from G) from readout features X̄(u) under finite sampling, due to the finiteness of shots S, the number of input points N, and also the number of realizations of readout features for a given S. To be more precise, in experiment one only has access to measured features X̄ sampled at finite S (indeed, this distinction is the underlying premise of this article). However, in Eq. (8) G and V are defined with respect to the ideal x.
The objective of Appendix D 1 is to show that the eigen-analysis {β²_k, r^(k)} can be accurately substituted with estimates {β̂²_k, r̂^(k)} obtained by solving the generalized eigenvalue problem V̂ r̂^(k) = β̂²_k Ĝ r̂^(k). In what follows, we show how an approximation G_N of G can be constructed from finitely-sampled readout features, as relevant for practical quantum devices. Secondly, we also describe an approach to obtain the eigentasks y^(k)(u) and corresponding NSR eigenvalues β²_k that avoids explicit construction of the Gram matrix, and is thus even more numerically robust. For practical computations, readout features X̄(u) from the quantum system at finite S can be computed for a discrete set of u^(n) ∈ [−1, 1] for n = 1, ..., N. Labelling the corresponding readout features X̄(u^(n)), we can define the regression matrix constructed from these readout features, F_N ≡ (X̄(u^(1)), X̄(u^(2)), ..., X̄(u^(N))); the estimated eigenvalues β̂²_{N,k} beyond this point are larger than S − 1, so that they are not correctable.
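The distinction between the ideal G and its finite-S estimate can be made concrete: for multinomial sampling, E[X̄ X̄^T] = x x^T + Σ(u)/S, so the naive empirical Gram matrix converges to G + V/S rather than G, which is the kind of 1/S correction discussed here. The toy model below is an illustrative stand-in, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy K = 3 system over N inputs (illustrative stand-in, not device data)
us = np.linspace(-1, 1, 301)

def x_of_u(u):
    raw = np.array([1.4 + np.sin(u), 1.1 + np.cos(2 * u), 1.0 + u**2])
    return raw / raw.sum()

X = np.stack([x_of_u(u) for u in us], axis=1)
G = X @ X.T / len(us)               # ideal Gram matrix
V = np.diag(X.mean(axis=1)) - G     # ideal noise matrix

# Naive empirical Gram from finitely sampled features Xbar(u^(n)), S shots,
# averaged over many repetitions to expose its systematic bias
S, reps = 100, 2000
Ghat = np.zeros((3, 3))
for n in range(len(us)):
    shots = rng.multinomial(S, X[:, n], size=reps) / S
    Ghat += shots.T @ shots / reps
Ghat /= len(us)

# Ghat converges to G + V/S, not to G: a 1/S bias that must be corrected
print(np.abs(Ghat - (G + V / S)).max(), np.abs(Ghat - G).max())
```

The first printed residual (against G + V/S) is limited only by Monte-Carlo error, while the second (against G) exposes the O(1/S) bias.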
From Eq. (D12), we see that to lowest order in 1/S, β²_k ≈ β̂²_k. However, this expression also supplies corrections at higher orders in 1/S, which are non-negligible even for β²_k < S, as seen in the example of Fig. 6. In contrast, the estimated eigenvectors r̂^(k) equal the desired eigenvectors r^(k) to every order in 1/S, without any corrections. Of course, in practice we do not have access to the matrices G and D, as these are only defined precisely in the limit N → ∞. However, for large enough N, we can approximate these matrices to lowest order by their finite-N values. Then, the eigenproblem in Eq. (D11) can be expressed in the final form, where the eigenvalues β̂²_{N,k}, α̂_{N,k} and eigenvectors r̂^(k)_N in the large-N limit must satisfy Eq. (D14). The invertibility of the empirically-computed matrix D_N required for Eq. (D14) is checked numerically, based on which we establish a better numerical method in Appendix D 2.
Eq. (D14) represents the eigenproblem whose eigenvalues β̂²_{N,k} and eigenvectors r̂^(k)_N we actually calculate. For large enough N and under finite S, we can use these as valid approximations to the eigenvalues and eigenvectors of Eq. (D11). This finally enables us to directly estimate the N, S → ∞ quantities β²_k and r^(k) using Eqs. (D12), (D13). It is clear that the approximation of β²_k to lowest order will be an underestimate, as the contribution at order 1/S is positive. In Fig. 7, we plot the estimated eigenvectors r̂^(k)_N computed under finite statistics (N = 300, S = 1000, values relevant for IBM quantum processors) in the H-encoding, together with the N, S → ∞ eigenvectors r^(k), and the estimated eigenvalues. Fig. 8(c) shows the β²_k spectrum, which has a much flatter slope for larger k (note the plot is semilog). Finally, Fig. 8(d) shows the EC of both systems as a function of S. The EC rises rapidly at small S for both systems, but the rise of the J = 0 system is steeper. After a certain threshold in S, however, the CS grows more rapidly, approaching the upper bound 2^6 = 64 with S ~ 10^8; in contrast, the PS has a significantly lower C_T.
Our results elucidate the same kind of improvement, as can be observed when we consider how the EC C_T changes with J and compare it to the total correlation ETC T, as shown in Fig. 8(f). For J → 0 we have a PS with T = 0, whereas in the J → ∞ limit we also have T = 0, because ρ₀ = |0⟩⟨0|^⊗L is an eigenstate of the encoding (ρ(u) = ρ₀). This implies there must be a peak at some intermediate J, which for both the EC and the ETC occurs when the coupling is proportional to the transverse field, J ~ h^x. At finite S, increased ETC is directly related to a higher EC.
Another interesting aspect is the clear trend of EC being maximized around J ∼ h^x_rms for various h^x_rms, possibly hinting at the role of increased correlation around the MBL phase transition in random spin systems [38]. This trend is consistent with results in quantum metrology: in general, the noise-to-signal ratio obtained from averaging L uncorrelated probes scales as 1/√L. In the presence of entanglement and other non-classical correlations this scaling can improve, by up to a quadratic factor, to 1/L [37]. For even larger J, we find that ρ̂(u) → ρ̂0 = |0⟩⟨0|^⊗L, which clearly reduces T, but also C_T, as the quantum state becomes u-independent.
for L = 7 qubits. Note, however, that even if one falls well below C_T = 2^L due to this finite-sampling constraint, increasing the dimension of the quantum system is always an effective way to increase the EC, particularly when compared to the logarithmic growth with S seen in Fig. 2 of the Main Text. The scaling of expressive capacity with system size is hard to quantify in general, and is likely best approached through a numerical or experimental study of a specific quantum model, as done in Fig. 9. However, we can analytically solve for the EC of one class of quantum models: 2-design parametric quantum circuits {p(u)du, Û(θ; u)}. We emphasize that we are referring here to systems with specific parameters θ that form 2-designs with respect to the input distribution p(u); the ensemble average is taken over inputs u. The quantum computing literature [21] often considers general ansätze that form 2-designs with respect to the parameters θ instead, which is not what we consider here.
More precisely, an ensemble {p(u)du, Û(θ; u)} is a 2-design if the two quantum channels defined below, acting on any 2L-qubit state ρ̂, are equal, where μ_H is the uniform (Haar) measure. One can verify that all information in the Gram matrix is explicitly contained in the elements of C(ρ̂0 ⊗ ρ̂0); this implies that we can compute the Gram matrix by instead integrating over the Haar measure [39], Eq. (G3). The corresponding second-order matrix D is then given accordingly.

The highly nonlinear readout feature x_k(u) admits a Taylor expansion x_k(u) = Σ_{j=0}^∞ (T)_{kj} u^j. Such functions span a certain function space, and a fundamental question is the limit of the approximation ability of the proposed architecture. Here we first show that this architecture, under infinite sampling, is capable of approximating any continuous function on the domain [−1, 1] to arbitrary precision. Furthermore, the linearity of quantum moment readout and the complexity of quantum evolution help explain why such a quantum system can approximate highly nonlinear functions under finite and bounded computational resources. Exploring the capacity for function approximation under finite measurement resources, as is done in the main text and Appendix C, highlights the fundamental limitations placed by quantum noise on computation within the reservoir computing approach.
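The Haar-integration step can be checked numerically. Assuming the standard first and second moments of outcome probabilities for Haar-random pure states, E[x_k] = 1/2^L and E[x_k x_{k'}] = (1 + δ_{kk'})/(2^L(2^L + 1)), a minimal sketch replaces the 2-design input ensemble by direct Haar samples and compares the empirical Gram matrix to the analytic value:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 2
K = 2 ** L        # Hilbert-space dimension = number of measurement outcomes
N = 200_000       # Haar samples standing in for the input ensemble

# Haar-random pure states: normalized complex Gaussian vectors
psi = rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)
p = np.abs(psi) ** 2                       # outcome probabilities x_k = |<k|psi>|^2

G_emp = p.T @ p / N                        # empirical Gram matrix E[x_k x_k']
G_haar = (1 + np.eye(K)) / (K * (K + 1))   # analytic Haar second moment

print(np.abs(G_emp - G_haar).max())        # small sampling deviation
```

By construction, this Gram matrix is the same for every 2-design encoding, independent of the details of Û(θ; u); only the (here pure) initial state enters.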

Universal Function Approximation
A very general question is what types of functions such a single-step quantum evolution can approximate. One conclusion that can be drawn is the universal function approximation property: given any continuous function on the domain [−1, 1], the architecture can approximate it to arbitrary precision under infinite sampling.

1D classification as function approximation for noiseless measured features
In this section, we will show how the universal function approximation property of the architecture described in Appendix J 1 enables it to perform -among others -paradigmatic machine learning tasks such as classification.
Consider two classes C_0 and C_1 of samples, generated from distributions p_0(u) and p_1(u) respectively. The two classes occur with equal probability 50%; we let each class contain 5000 samples, so that N = 10000 samples in total. Both distributions are artificially defined by summing several Gaussian distributions with different amplitudes and widths; both are restricted to the domain [−1, 1] and normalized. Due to the overlap of the two distributions, there is a theoretical maximum accuracy for classifying whether a given u belongs to C_0 or C_1.
During training, we feed each sample u^(n) (belonging to class C_{c^(n)}) into a 5-qubit quantum system. The quantum system is read out into K_eff = Σ_{m=0}^{m_max} (L choose m) different features {x_k(u^(n))}_{k∈[K_eff]}. The features of the N samples then form the regressor matrix. Following the standard supervised learning procedure, we train on the pairs (x(u^(n)), c^(n)) by logistic regression, minimizing the cross-entropy loss.
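A minimal classical sketch of this pipeline (with monomial features u^k standing in for the measured quantum features, and illustrative Gaussian-mixture parameters, since the appendix does not specify the mixtures) is:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(means, sigmas, weights, n):
    """Draw n points from a Gaussian mixture, clipped to [-1, 1]."""
    comp = rng.choice(len(means), size=n, p=weights)
    u = rng.normal(np.array(means)[comp], np.array(sigmas)[comp])
    return np.clip(u, -1.0, 1.0)

n = 5000  # samples per class, N = 10000 total
u0 = sample([-0.5, 0.3], [0.2, 0.15], [0.7, 0.3], n)   # class C0 (illustrative)
u1 = sample([0.4, -0.1], [0.2, 0.15], [0.8, 0.2], n)   # class C1 (illustrative)
u = np.concatenate([u0, u1])
c = np.concatenate([np.zeros(n), np.ones(n)])

# Regressor matrix: monomials u^k stand in for the quantum features x_k(u)
Keff = 6
X = np.vander(u, Keff, increasing=True)

# Logistic regression: minimize the cross-entropy loss by gradient descent
w = np.zeros(Keff)
for _ in range(3000):
    pr = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.3 * X.T @ (pr - c) / len(c)

acc = ((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == c.astype(bool)).mean()
print(f"training accuracy: {acc:.3f}")
```

Because the class densities overlap, the achievable accuracy saturates below 100% regardless of the feature set, as discussed above.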
For classical polynomial readout, the amplitude of the relative error is obtained by the rule of uncertainty propagation, Eq. (??). If there is no correlation in the quantum system, then the readout features for quantum moment readout and classical polynomial readout coincide, being products of single-spin expectations ⟨σ̂^z_l⟩. However, even though the expectations in the infinite-sampling limit S → ∞ are the same, the measurement noise under finite sampling differs. For classical polynomial readout, the scaling still follows the simple additivity relation of uncertainty propagation in Eq. (??). But the noise of x_{l1}(u)···x_{lm}(u) in quantum moment readout becomes very strong, because x_{l1}(u)···x_{lm}(u) is now close to zero.

FIG. 1. (a) Representation of the learning framework considered in this work: inputs u are transformed to a set of outputs via a parameterized feature generator, here implemented using a finitely-sampled quantum system as shown in (b). Inputs are encoded in the state of a quantum system via a general quantum channel U, and information is then extracted through a positive operator-valued measure. This framework describes a wide range of practical quantum systems, from quantum circuits used in supervised or generative quantum machine learning, to quantum annealers exhibiting continuous evolution, and beyond, all defined by a general quantum channel with parameters θ. Extracted information takes the form of K stochastic features X̄ obtained under finite shots S. The geometric structure of the distributions of these measured features is fundamentally determined by quantum sampling noise, which depends on the quantum state ρ̂(u; θ), and hence on the nature of the mapping from input u to this quantum state. We show four obtained distributions differing only in the values of the inputs u to highlight this dependence. As shown in (a), learned estimates of desired functions are then constructed via a linear combination w of X̄, with a resolution limited by S.
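A toy Monte Carlo makes this noise contrast concrete for m independent spins with ⟨σ̂^z_l⟩ = 1/2: estimating the m-body correlator as the sample mean of the product of shot outcomes ("moment readout") has relative error growing as 2^m/√S, whereas multiplying individually estimated single-spin means ("polynomial readout") only accumulates error additively. The spin model and labels here are an illustrative stand-in, not a simulation of the annealing system.

```python
import numpy as np

rng = np.random.default_rng(3)
S, reps = 1000, 500
mean_z = 0.5                       # single-spin expectation <sigma_z>
p_up = (1 + mean_z) / 2

for m in (2, 4, 6):
    true = mean_z ** m             # exact m-body correlator of the product state
    spins = 2 * (rng.random((reps, S, m)) < p_up) - 1   # +/-1 shot outcomes

    moment = spins.prod(axis=2).mean(axis=1)   # "moment readout": mean of products
    poly = spins.mean(axis=1).prod(axis=1)     # "polynomial readout": product of means

    nsr_moment = moment.std() / true
    nsr_poly = poly.std() / true
    print(m, round(nsr_moment, 3), round(nsr_poly, 3), round(2 ** m / np.sqrt(S), 3))
```

The direct moment estimate has absolute noise ≈ 1/√S independent of m, so its relative error blows up as the signal 2^{−m} shrinks; the product-of-means error instead grows only linearly in m under uncertainty propagation.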
Capacity C[f ] then quantifies the error in the approximation of a target function f via this scheme.
Fig. 2(b) shows that the CS

FIG. 3. (a) Device eigentasks for the correlated system (CS, left) and product system (PS, right), constructed from noisy features at S = 2^10 and S = 2^14. (b) Classification demonstration on IBMQ Perth. The binary distributions to be classified over the input domain are shown. (c) The classification task can be cast as learning the likelihood function separating the two distributions; this target function is shown in the upper panel. Lower panels show the learned estimate of this target based on the N_train = 150 points shown in (b), using only K_c(S) eigentasks for S = 2^14; this cutoff is indicated by the dashed red lines. For the correlated system K_c(S) = 40, while for the product system K_c(S) = 29.

FIG. 4. (a) Training (light) and testing (dark) accuracy for the device encodings of Fig. 3(a), as a function of the number of eigentasks used to approximate the target function. Markers indicate performance on the dataset shown in Fig. 3(b), and solid lines are the average over 10 random selections of training and test sets. The shaded region denotes the maximum and minimum test accuracy observed. The optimal test-set performance is found near the noise-to-signal cutoff K_c(S = 2^14) (dash-dotted lines) informed by the quantum system's noise-to-signal spectra. (b) Testing-set classification accuracy as a function of J for our optimal learning method. In all cases, the average performance over the 10 task permutations is reported, using K_c(S = 2^14). Markers indicate device results for the chosen encoding, and the corresponding simulation is shown in solid peach. Dashed peach shows the average of these results over the 8 device-noise simulation encodings, and dashed grey the ideal simulation performance in the S → ∞ limit, where all K = 64 features are used. The horizontal line denotes the performance of a software neural network with K_L = 64 nodes (and 1153 ≫ K_c trained parameters) for comparison.

FIG. 5. Schematic of a simple L = 2 qubit circuit, comprised of a CNOT gate sandwiched by input-dependent local x-rotation gates {R_i(u)}. The different 2D inputs shown on the left are mapped to the finite-S feature space on the right via this circuit. Specifically, a 2D slice (X̄_00 and X̄_11) of the 4D feature space is shown. Each point represents an individual sample or experiment, i.e. an output constructed with S < ∞ shots via Eq. (2). Distinct values of S = 10^2, 10^3, 10^4 are shown in different colors (blue, red, green). For each input u and shot number S, the simulation is repeated 100 times.
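A minimal simulation of such a circuit and its finite-S features (with assumed rotation angles π u_l, which Fig. 5 does not specify) illustrates how the empirical features X̄_k concentrate around the exact features x_k as S grows:

```python
import numpy as np

rng = np.random.default_rng(4)

def rx(theta):
    """Single-qubit x-rotation gate."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def features(u, S):
    # Local x-rotations (assumed angles pi * u_l) sandwiching a CNOT
    layer = np.kron(rx(np.pi * u[0]), rx(np.pi * u[1]))
    psi = layer @ CNOT @ layer @ np.array([1, 0, 0, 0], dtype=complex)
    x = np.abs(psi) ** 2                 # exact features x_k(u)
    Xbar = rng.multinomial(S, x) / S     # empirical features from S shots
    return x, Xbar

u = (0.3, -0.7)
for S in (10 ** 2, 10 ** 3, 10 ** 4):
    x, Xbar = features(u, S)
    print(S, np.abs(Xbar - x).max())     # deviation shrinks as ~1/sqrt(S)
```

Repeating the sampling many times at fixed u reproduces the shot-noise clouds of Fig. 5, with spread shrinking as 1/√S.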
transverse x-field strengths h^x_l = h̄^x + ε^x_l and longitudinal z-drive strengths h^{z,I}_l = h̄^{z,I} + ε^{z,I}_l are all randomly chosen and held fixed for a given realization of the quantum system; here U[a, b] denotes a uniform distribution with non-zero density within [a, b].

6. Proof that the Gram matrix G is full rank

Consider an arbitrary linear combination Σ_{k=0}^{K−1} c_k x_k(u) of the features. The quadratic form Σ_{k1,k2=0}^{K−1} c_{k1} c_{k2} (G)_{k1,k2} vanishes if and only if Σ_{k=0}^{K−1} c_k x_k(u) is the zero function. This is why linear independence of the features {x_k}_{k∈[K]} is equivalent to the symmetric matrix G having no zero eigenvalues, namely Rank(G) = K.
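The identity underlying this equivalence can be written out explicitly (with p(u) the input measure defining the Gram matrix, G_{k1 k2} = ∫ p(u) x_{k1}(u) x_{k2}(u) du, per the symbols table):

```latex
\mathbf{c}^{\mathsf{T}} G\, \mathbf{c}
 = \sum_{k_1,k_2=0}^{K-1} c_{k_1} c_{k_2}\!\int p(u)\, x_{k_1}(u)\, x_{k_2}(u)\, \mathrm{d}u
 = \int p(u)\left(\sum_{k=0}^{K-1} c_k x_k(u)\right)^{\!2} \mathrm{d}u \;\ge\; 0,
```

with equality if and only if Σ_k c_k x_k(u) vanishes p-almost everywhere; positive definiteness of G is therefore exactly linear independence of the features.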

7. Simplifying the noise-to-signal matrix and its eigenproblem

1. Approximating eigentasks and NSR eigenvalues under finite S and N

FIG. 6. Eigen-analysis of the L = 5 H-ansatz system, taking S = 10^2 shots on each of N = 10^4 samples, with true eigen-noise-to-signal ratios β²_k (black), S-finite sampled β̂²_{N,k} (blue) and corrected values S·β̂²_{N,k}/((S−1) − β̂²_{N,k}) (purple). β̂²_k, the large-N limit of β̂²_{N,k}, is also plotted in red for comparison. The data correction is necessary since all sampled β̂²_{N,k} lie below S = 10^2, and the corrected data perform much better even when β²_k is comparable to S. The estimated line (purple) is cut off at k = 25, since all sampled β̂²_{N,k} beyond that point are larger than S − 1 and hence not correctable.

FIG. 8. Eigen-analysis of a 6-qubit H-ansatz system (with N = 5000 and S = 1000) forming a 1D ring. The Hamiltonian parameters are chosen randomly with zero mean and variance (h^x_rms, h^z_rms, h^I_rms) = (20, 5, 5), and t = 5 (see Appendix B for details). The coupling strength is uniformly J ≠ 0 (correlated system) or J = 0 (product system). (a) All 2^L = 64 noisy features X̄_k(u) and (b) noisy eigentasks ȳ^(k)(u) = r^(k) · X̄(u) for selected k from the features in (a), as well as their expected values y^(k)(u) = lim_{S→∞} ȳ^(k)(u) = r^(k) · x(u) (black). (c) Noise-to-signal ratio spectrum β²_k and (d) C_T vs shots S for both correlated-system and product-system encodings. (e) C_T at S = 10^5 and (f) ETC T(ρ̂_M) for a representative random 6-qubit H-ansatz, as a function of coupling strength J. The peaks of capacity and correlation coincide, around J ∼ h^x_rms.

FIG. 9. Expressive capacity of the (a) H-ansatz and (b) C-ansatz at finite S, as a function of qubit number L. Various colours indicate different S values, with the S → ∞ bound in dashed black. Individual noisy simulations are indicated by small transparent dots, with their average as a thick line; the expressive capacity of the C-ansatz device for encodings 1 and 2 is indicated with '×' and '+' respectively.

FIG. 12. Function approximation using y = Σ_{k=0}^{K−1} w_k x_k(u) (solid red lines) to approximate a sine function and a steep tanh function (dashed purple lines) in a 5-qubit quantum annealing system, where K_eff = Σ_{m=0}^{m_max} (L choose m).

FIG. 16. Noise-to-signal ratio of the correlated system vs the product system in a 10-qubit quantum annealing system with shot number S = 1000, for input u = 1/2. The hyperparameters are chosen to be (h̄^x, h^x_rms; h̄^z, h^z_{1,rms}) = (8, 2; 3, 2) in units of 1/t. The purple and red colors correspond to the coupling being switched on and off, respectively; the coupling hyperparameter in the CS is J_max = 2/t. For each m, the N = 30 dots are the relative errors x̂^(r)_k(u)/x_k(u) − 1 over 30 repetitions r = 1, 2, ..., 30. The standard deviation of these relative errors (namely the NSR) is also plotted. The correlated-system NSR (purple stars) is well fitted by O(1/√S) (purple dashed line), while the product-system NSR (red stars) scales exponentially as O(2^m/√S).

Table of Symbols and Abbreviations
x_k — Expected feature, Tr{M̂_k ρ̂}
X̄_k — Empirical observed feature, (1/S) Σ_s δ(k^(s), k)
ζ_k — Noise component of X̄_k
G — Gram matrix of expected features {x_k}
V — Expected covariance matrix of the random variable X̄