Generalization in Quantum Machine Learning: a Quantum Information Perspective

Quantum classification and hypothesis testing are two tightly related subjects, the main difference being that the former is data-driven: how to assign to quantum states $\rho(x)$ the corresponding class $c$ (or hypothesis) is learnt from examples during training, where $x$ can be either tunable experimental parameters or classical data "embedded" into quantum states. Does the model generalize? This is the main question in any data-driven strategy, namely the ability to predict the correct class even of previously unseen states. Here we establish a link between quantum machine learning classification and quantum hypothesis testing (state and channel discrimination) and then show that the accuracy and generalization capability of quantum classifiers depend on the (R\'enyi) mutual informations $I(C{:}Q)$ and $I_2(X{:}Q)$ between the quantum state space $Q$ and the classical parameter space $X$ or class space $C$. Based on the above characterization, we then show how different properties of $Q$ affect classification accuracy and generalization, such as the dimension of the Hilbert space, the amount of noise, and the amount of neglected information from $X$ via, e.g., pooling layers. Moreover, we introduce a quantum version of the Information Bottleneck principle that allows us to explore the various tradeoffs between accuracy and generalization. Finally, in order to check our theoretical predictions, we study the classification of the quantum phases of an Ising spin chain, and we propose the Variational Quantum Information Bottleneck (VQIB) method to optimize quantum embeddings of classical data to favor generalization.


I. INTRODUCTION
Quantum information and machine learning are two very active areas of research which have become increasingly interconnected [1][2][3][4]. In this panorama, many works have considered and designed learning models that are built by using quantum states and algorithms [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. Some of these proposals have focused on learning classical data by exploiting the capability of quantum machines to easily perform computations that are in principle unfeasible using classical computers [5,6]. Other works have instead focused on "quantum data", i.e., information embedded in quantum states or quantum channels.
For the latter case, a fundamental model with particular relevance is that of quantum channel discrimination. This is known to have non-trivial implications for quantum sensing [20], in tasks such as the detection of targets [21][22][23] or the readout of memories [24,25]. Recently, it has been applied to study the model of channel position finding [26], associated with absorption spectroscopy [27], and more sophisticated problems of barcode decoding and pattern recognition [12]. In the latter, the use of quantum light sources was shown to drastically reduce the error affecting the supervised classification of images, even when the output measurements are not optimized.
The main difference between (supervised) quantum machine learning (QML) and quantum hypothesis testing (QHT), such as state and channel discrimination [28], is the role of prior information. In QHT all possible sets of states and their prior probabilities are known. This is not the case in any machine learning approach, where instead prior information is in the form of samples, that is, a collection of correctly classified states. These samples are not enough to cover all possible cases and the most important question in data-driven strategies is to check for generalization: after having trained the model using a few known examples, can the model accurately classify unseen data? In previous QML literature, the generalization capabilities were numerically verified by computing the classification error over a testing set. Some theoretical bounds were studied in [10,11,29,30], but only for particular classifiers or regression models, while intuitive geometrical characterizations were given in [7,8], yet without a formal proof.
* leonardo.banchi@unifi.it
Here we study generalization in QML classification tasks using tools from quantum information theory. In Sec. II we first establish a fruitful link between QML and QHT. This allows us to bound the QML classification error by exploiting some QHT results, and to study the role of unknown prior in QHT. In Sec. III we introduce the main technical result of this paper: quantities linked to either the training or testing errors can be bounded by the quantum mutual information between some suitable quantum states and classical variables. Based on the study of these quantities, in Sec. IV we show different implications of our theoretical bounds: we introduce a quantum version of the bias-variance tradeoff, which defines fundamental limitations on the testing error for finite amounts of data; we then show how to use results developed in the quantum communication/cryptography literature to study how to optimally embed classical information onto quantum states; finally, we show how different properties of the quantum states affect the classification accuracy and generalization, such as the dimension of the Hilbert space, the amount of noise, and the amount of neglected information via, e.g., pooling layers. Our results are based on the study of the linear loss function, yet we show how similar conclusions can be obtained in a loss-independent framework, by defining a quantum version of the information bottleneck principle [31]. In Sec. V we consider different applications of our theoretical results. We first study the Quantum Phase Recognition problem of an exactly solvable quantum Ising chain and then define the Variational Quantum Information Bottleneck method to train quantum embeddings of classical data for good generalization. Conclusions are drawn in Sec. VI. The mathematical derivations of our results, as well as the extension to multi-ary classification, are presented in the appendices.

II. QUANTUM HYPOTHESIS TESTING VS SUPERVISED CLASSIFICATION
We study the classification of either quantum states, as in Fig. 1a, or classical data, as in Fig. 1b, using the framework of QHT. Let us first consider the simpler case where a quantum device can only be in $N_C$ possible states $\{\rho_c\}_{c=1,\dots,N_C}$, for some integer $N_C$. The possible values of $c$ are called hypotheses in QHT, or classes in this paper. An experimentalist (Alice) performs a measurement on the device with a positive operator-valued measurement (POVM) $\{\Pi_c\}$, whose outcome is the predicted value of $c$. Via Naimark's dilation theorem, such a POVM can be effectively implemented as shown in Fig. 1c, namely by using an ancillary system whose Hilbert space dimension is equal to $N_C$, by first applying a unitary circuit that couples the state and the ancilla, and then performing a projective measurement $|c\rangle\langle c|$ on the ancillary system. In the most general setting, the states $\rho_c$ are not orthogonal and Alice cannot perfectly discriminate between them with a single measurement. When the device can be reinitialized in the same state, Alice can use $N$ copies $\rho_c^{\otimes N}$ and the probability of wrong discrimination can decrease exponentially with $N$ [32]. We remark that the common approach of using $N$ measurement "shots" is just a particular case, possibly non-optimal, of the above general framework, with independent measurements on each copy.
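The exponential decay with the number of copies can be illustrated in the simplest setting, which is an assumption of this sketch and not a case treated explicitly in the text: two equiprobable pure states with single-copy overlap $s = |\langle\psi_0|\psi_1\rangle|$, for which the Helstrom minimum error probability with $N$ copies is $\frac{1}{2}\big(1 - \sqrt{1 - s^{2N}}\big)$. The function name and the overlap value are illustrative.

```python
import math

def helstrom_error_pure(overlap: float, n_copies: int) -> float:
    """Minimum error probability for two equiprobable pure states,
    given N copies; `overlap` is |<psi0|psi1>| for a single copy."""
    return 0.5 * (1.0 - math.sqrt(1.0 - overlap ** (2 * n_copies)))

copies = (1, 5, 20, 50)
errors = [helstrom_error_pure(0.9, n) for n in copies]
for n, err in zip(copies, errors):
    print(f"N = {n:2d}  error = {err:.3e}")
```

Even for strongly overlapping states (here $s = 0.9$), the error drops by several orders of magnitude between $N = 1$ and $N = 50$.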
Unlike QHT, in QML classification tasks there are different states that belong to the same class. The number of classes $N_C$ is finite, but the available states $\rho(x)$ are possibly infinite. In this paper the inputs $x$ model tunable classical parameters. For instance, consider a device that, depending on parameters $x$, outputs either entangled ($c = 1$) or separable ($c = 0$) states [33], or a many-body system that may be in different phases $c$ depending on some external magnetic fields $x$ [9]. We may also be interested in classifying classical data $x$ (e.g. images) using a quantum algorithm, to look for algorithmic quantum advantage [11,34] (faster classification), or classifying quantum channels using quantum probes to look for quantum advantage in accuracy [12] (fewer measurements). When dealing with classical inputs $x$, the quantum embedding circuit can be written as in Fig. 1b with $x$-dependent and $x$-independent gates $U_i(x)$ and $V_j$, which may be optimized during training.
FIG. 1. (a) Example binary classification of quantum states that depend on some external parameters $x$. (b) Classification of classical data using a quantum embedding circuit with $L$ layers. The classical data are sampled from an unknown distribution $P(c, x)$, where $x$ describes, e.g., images of animals and $c$ specifies the kind of animal, e.g., a cat. The classical input $x$ is embedded into a quantum state $\rho(x)$ via layers of $x$-dependent and $x$-independent gates. A POVM $\{\Pi_c\}$ is performed at the end of the circuit. The predicted class of $x$ corresponds to the measurement outcome $c$. (c) Any POVM can be expressed as a unitary circuit followed by a projective measurement $|c\rangle\langle c|$ on a suitably large ancillary system. (d) In quantum channel discrimination, the images $x$ live in the physical world; a quantum probe senses the outside world and $\rho(x)$ is the scattered state of light collected by the detector, which depends on the outside objects. The detector then classifies the image with a POVM, as in (c).
Finally, a mathematically related, yet physically different problem consists in classifying physical objects using quantum sensors [12,21-27], as in the example shown in Fig. 1d. There, $\rho(x) = \mathcal{E}_x[\rho_{\rm in}]$ describes the state received by a quantum detector, where $\rho_{\rm in}$ is the input probe state of light, possibly entangled, and $\mathcal{E}_x$ describes how the photons are scattered depending on the objects $x$ living in the physical world.
In all the examples described above, we are interested in learning the unknown functional relation $c = f(x)$ between a classical input $x$ and output class $c$, yet through measurements on a quantum device. The motivations can be quite diverse and range from quantum device characterization depending on external parameters to the use of quantum algorithms to classify classical data. Following common practices in theoretical machine learning, we assume that all possible pairs of data $(c, x)$ follow some unknown probability distribution $P(c, x)$, so data pairs are independent samples from $P$ (see Fig. 1b). Formally, our ignorance can be modelled using mixed states
$$\rho_c = \sum_x P(x|c)\,\rho(x), \tag{1}$$
where $P(x|c)$ is the unknown conditional probability. For $N$ copies, such states read
$$\rho_c^{(N)} = \sum_x P(x|c)\,\rho(x)^{\otimes N}.$$
With a slight abuse of notation, to simplify the mathematical expressions we may hide the dependence on $N$ inside $\rho(x)$, namely as $\rho(x) = \tilde\rho(x)^{\otimes N}$ for some $\tilde\rho(x)$. The main difference between QHT and the classification problem studied in this paper is that we take measurements on the instances $\rho(x)$ rather than on the discrete states $\rho_c$. Measurements are still described via a POVM $\{\Pi_c\}$, possibly acting on $N$ copies, constructed such that its outcome $c$ is the predicted class of $x$. Such a quantum classifier is probabilistic: given an input $x$ the predicted class $c$ is found with probability $p_Q(c|x) = \mathrm{Tr}[\Pi_c \rho(x)]$. Other classifiers can be built using different techniques of quantum decision theory [19,28], for instance by repeating the measurement many times and taking the most likely class, or by defining an observable $M = \sum_c m_c \Pi_c$, for certain real numbers $m_c$, and then assigning a certain class depending on the expectation value $\mathrm{Tr}[M\rho(x)]$. For instance, for binary classification problems with $c \in \{0, 1\}$ we may set $m_0 = -1$, $m_1 = 1$ and then assign the class depending on the sign of $\mathrm{Tr}[M\rho(x)]$ [7].
We remark that exact expectation values on real hardware can only be obtained in the limit of infinitely many shots, namely for N → ∞ copies.
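The two decision rules just described can be sketched numerically as follows. The single-qubit embedding `embed` and the computational-basis POVM are hypothetical choices made only for illustration, and the expectation values are computed exactly (i.e. in the infinite-shot limit).

```python
import numpy as np

def embed(x: float) -> np.ndarray:
    """Hypothetical embedding rho(x) = |psi(x)><psi(x)|,
    with |psi(x)> = cos(x/2)|0> + sin(x/2)|1>."""
    psi = np.array([np.cos(x / 2), np.sin(x / 2)])
    return np.outer(psi, psi.conj())

# Two-outcome projective POVM {Pi_0, Pi_1} in the computational basis.
povm = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]

def class_probabilities(rho, povm):
    """p_Q(c|x) = Tr[Pi_c rho(x)] for every class c."""
    return np.array([np.trace(P @ rho).real for P in povm])

def sign_classifier(rho, povm, weights=(-1.0, 1.0)):
    """Class from the sign of Tr[M rho], with M = sum_c m_c Pi_c."""
    M = sum(m * P for m, P in zip(weights, povm))
    return int(np.trace(M @ rho).real > 0)

probs = class_probabilities(embed(np.pi / 3), povm)
label_a = sign_classifier(embed(np.pi / 3), povm)
label_b = sign_classifier(embed(2 * np.pi / 3), povm)
```

The first rule returns a distribution over classes, so a single shot may misclassify; the second rule is deterministic once the expectation value is known.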
Since in QML the probability distribution is unknown, a (sub)optimal classifier must be built from a finite amount of training data. Is the trained classifier able to predict the correct class of previously unseen data? To answer this question, in the next sections we use tools from quantum information theory to formally study the two main errors, namely the approximation and generalization errors, which rigorously formalize the empirical testing error (see Fig. 2). We call $\rho(x)$ parametric quantum states (PQS) that depend on some tunable classical parameters $x$ and refer to the mapping $x \to \rho(x)$ as quantum embedding. We will study how different properties of the embedding affect accuracy or generalization, as schematically shown in Fig. 3, and then introduce fundamental limitations on the errors that we may expect for a given data distribution and finite training samples.
FIG. 2. Summary of the error sources for given parametric quantum states $\rho(x)$ and a finite number of training samples. The classification error $R(\Pi^*)$ is the average loss with the unknown optimal measurement $\Pi^*$. The average testing error $R(\Pi^{\mathcal T})$ replaces $\Pi^*$ with the POVM $\Pi^{\mathcal T}$ estimated from the training set $\mathcal T$ via empirical risk minimization. The testing error $R_{\mathcal T'}(\Pi^{\mathcal T})$ is a finite sample approximation of $R(\Pi^{\mathcal T})$ over the testing set $\mathcal T'$. The difference between the average testing error and the Bayes risk $R_{\rm Bayes}$ is split into the approximation error $A(\rho)$ and the generalization error $G_{\mathcal T}(\rho)$. The training error typically behaves as $R(\Pi^*)$.
FIG. 3. Summary of some of the main conclusions of this paper. The approximation and generalization errors are mathematically related, respectively, to the training and testing error over some datasets, and they cannot be simultaneously minimized (bias-variance tradeoff). We use quantum information quantities to bound these errors and show how they are affected by the dimensionality of the quantum Hilbert space, noise, and "information pooling".

Training and testing with linear loss
We first formalize the various sources of error that may prevent generalization. Readers already familiar with this topic may skip this section and refer to Fig. 2 for the notation.
In supervised learning the available data are split between a training and a testing set. Both these sets are composed of pairs $(c_k, x_k)$, namely inputs $x_k$ and their true class $c_k$, but are used differently. We consider a training set $\mathcal T = \{(c_k, x_k)\}_{k=1,\dots,T}$ with $T$ pairs, and similarly a testing set $\mathcal T'$ with $T'$ pairs. In the training part a model is optimized in order to minimize a suitable distance between the true class $c$ and the predicted class for all possible pairs $(c, x) \in \mathcal T$ in the training set. For given PQS $\rho(x)$ and POVM $\{\Pi_c\}$ the quantum model predicts a class $\tilde c$ with probability $\mathrm{Tr}[\Pi_{\tilde c}\rho(x)]$, as in Fig. 1. If $c$ is the true class of $x$, the linear loss is defined as the probability of misclassification, namely the probability that the predicted class $\tilde c$ is different from the true class $c$,
$$\ell(c, x) = \sum_{\tilde c \neq c} \mathrm{Tr}[\Pi_{\tilde c}\,\rho(x)] = 1 - \mathrm{Tr}[\Pi_c\,\rho(x)],$$
where the second equality follows from $\sum_c \Pi_c = \mathbb{1}$. The linear loss allows us to link QML to QHT [19,28]. Training is done via empirical risk minimization, where the empirical risk $R_{\mathcal T}(\Pi, \rho) = \frac{1}{T}\sum_{k=1}^{T} \ell(c_k, x_k)$ is the average loss over all pairs $(c_k, x_k)$ in the training set and the minimization is over the parameters of the model, namely the POVM and, in some applications, also the embedding. In general such minimization does not have an analytic solution, except for a few notable cases.
For binary classification problems, where $c \in \{0, 1\}$ can only take two distinct values, the optimal $\mathcal T$-dependent POVM, $\Pi^{\mathcal T} = \mathrm{argmin}_\Pi [R_{\mathcal T}(\Pi, \rho)]$, is the Helstrom measurement [35], which is extensively used in QHT. The Helstrom measurement operator $\Pi^{\mathcal T}_0$ ($\Pi^{\mathcal T}_1$) is the projection onto the eigenspace of positive (negative) eigenvalues of
$$\frac{T_0}{T}\,\rho^{\mathcal T}_0 - \frac{T_1}{T}\,\rho^{\mathcal T}_1, \qquad \rho^{\mathcal T}_c = \frac{1}{T_c}\sum_{(c_k, x_k) \in \mathcal T:\; c_k = c} \rho(x_k),$$
where $T_c$ is the number of inputs in the training set with class $c$ and $\rho^{\mathcal T}_c$ is a mixture of all the states $\rho(x)$ with inputs in $\mathcal T$ and fixed class $c$. Although not necessary, to simplify the equations we will always assume that the training set contains an equal number of inputs per class, so $T_c/T = 1/2$. Using the optimal Helstrom measurement the minimum empirical risk can be written analytically in terms of the trace distance between the two average states $\rho^{\mathcal T}_0$ and $\rho^{\mathcal T}_1$,
$$R_{\mathcal T}(\Pi^{\mathcal T}, \rho) = \frac{1}{2} - \frac{1}{4}\,\big\|\rho^{\mathcal T}_0 - \rho^{\mathcal T}_1\big\|_1. \tag{4}$$
The above quantity is what defines the training error for a given PQS $\rho(x)$, namely the average loss over the training set. From the above equation, zero training error is possible only when $\|\rho^{\mathcal T}_0 - \rho^{\mathcal T}_1\|_1 = 2$, which happens when $\rho^{\mathcal T}_0$ and $\rho^{\mathcal T}_1$ have orthogonal support. We will show in Appendix A that similar conclusions also hold when the number of classes is greater than two.
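As a minimal numerical sketch of the Helstrom construction for equal class sizes, the snippet below builds the two projectors from the eigen-decomposition of $\rho^{\mathcal T}_0 - \rho^{\mathcal T}_1$ and compares the empirical risk evaluated with that POVM against the trace-distance expression $\frac12 - \frac14\|\rho^{\mathcal T}_0 - \rho^{\mathcal T}_1\|_1$. The single-qubit states and the training angles are illustrative assumptions.

```python
import numpy as np

def trace_norm(A):
    """Trace norm of a Hermitian matrix: sum of absolute eigenvalues."""
    return np.abs(np.linalg.eigvalsh(A)).sum()

def helstrom_povm(rho0, rho1):
    """Projectors onto the positive eigenspace of rho0 - rho1 and its
    complement (equal priors assumed)."""
    vals, vecs = np.linalg.eigh(rho0 - rho1)
    pos = vecs[:, vals > 0]
    Pi0 = pos @ pos.conj().T
    return Pi0, np.eye(rho0.shape[0]) - Pi0

def pure(theta):
    """Hypothetical embedding: cos(theta/2)|0> + sin(theta/2)|1>."""
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return np.outer(psi, psi)

# Toy training set with three inputs per class (assumed values).
rho0_T = np.mean([pure(t) for t in (0.0, 0.2, 0.4)], axis=0)
rho1_T = np.mean([pure(t) for t in (1.2, 1.5, 1.8)], axis=0)

Pi0, Pi1 = helstrom_povm(rho0_T, rho1_T)
# Empirical risk with T_c / T = 1/2: 1 - (Tr[Pi0 rho0] + Tr[Pi1 rho1]) / 2
risk_from_povm = 1 - 0.5 * (np.trace(Pi0 @ rho0_T) + np.trace(Pi1 @ rho1_T)).real
risk_from_norm = 0.5 - 0.25 * trace_norm(rho0_T - rho1_T)
```

Both expressions agree because $\rho^{\mathcal T}_0 - \rho^{\mathcal T}_1$ is traceless, so the sum of its positive eigenvalues equals half of its trace norm.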
Does the model generalize? Empirically we need to check how the model performs with inputs not present in the training set. This is normally done by studying the testing error $R_{\mathcal T'}(\Pi^{\mathcal T})$, which is similar to Eq. (3), but where the samples are taken from the testing set $\mathcal T'$ and the POVM $\Pi^{\mathcal T}$ is the one minimizing the empirical risk. In order to define the generalization error more formally, we need first to define the true average classification error
$$R(\Pi) = \sum_{c,x} P(c, x)\,\big(1 - \mathrm{Tr}[\Pi_c\,\rho(x)]\big) = 1 - \sum_c P(c)\,\mathrm{Tr}[\Pi_c\,\rho_c], \tag{5}$$
where in the last expression we use the chain rule $P(c, x) = P(x|c)P(c) = P(c|x)P(x)$, $P(x) = \sum_c P(c, x)$, $P(c) = \sum_x P(c, x)$ and the definition of Eq. (1). The training error (3) is an empirical approximation of the classification error (5) where the formal average over all possible pairs $(c, x)$ is substituted with a finite average over the training set. The optimal classification POVM, $\Pi^* = \mathrm{argmin}_\Pi R(\Pi)$, is in general different from the $\Pi^{\mathcal T}$ that we get from empirical risk minimization. Overfitting happens when this difference is significant, namely when $\Pi^*$ and $\Pi^{\mathcal T}$ disagree on the class of an input not present in the training set. The generalization error, also called the estimation error, is defined as $R(\Pi^{\mathcal T}) - R(\Pi^*)$, namely as a difference between two classification errors, where in one case we use the true classifier and in the other we use the classifier built from the training set $\mathcal T$. Notice that the testing error $R_{\mathcal T'}(\Pi^{\mathcal T})$ is an empirical approximation of $R(\Pi^{\mathcal T})$.
In order to have a low testing error, we need to have both a low generalization error and a low classification error $R(\Pi^*)$. The lowest possible classification error is the (normally unknown) Bayes risk $R_{\rm Bayes}$, attained by the Bayes classifier (see the next section for a formal definition). The difference between the average testing error $R(\Pi^{\mathcal T})$ and $R_{\rm Bayes}$ can be written as
$$R(\Pi^{\mathcal T}) - R_{\rm Bayes} = G_{\mathcal T}(\rho) + A(\rho), \tag{6}$$
with
$$G_{\mathcal T}(\rho) = R(\Pi^{\mathcal T}) - R(\Pi^*), \tag{7}$$
$$A(\rho) = R(\Pi^*) - R_{\rm Bayes}. \tag{8}$$
In Eq. (6) the difference between the average testing error and the Bayes risk has been split into two positive terms (see also Fig. 2): $G_{\mathcal T}(\rho)$ is the previously defined generalization error while $A(\rho)$ is called the approximation error. A standard result of statistical learning theory, dubbed the bias-variance tradeoff [36], shows that it is impossible to minimize both $A$ and $G$. Simple classifiers may escape from overfitting but have a bias in the resulting predictions, while too complex classifiers lead to overfitting and a higher variance in the predictions. We remark that these complexity analyses cannot explain the success of deep learning, where models with millions of parameters generalize well in spite of their complexity. There are some explanations of why deep learning works in particular models [37,38], but this is still a subject of intensive research. Moreover, quantum models that can be trained in near-term quantum hardware are quite far from the regime where deep learning operates, so in this paper we focus on models of "moderate complexity".
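The split of the excess risk into a generalization and an approximation term can be checked on a toy problem where, unlike in practice, the distribution $P(c, x)$ is known. Everything below is an illustrative assumption: the input grid, the conditional probabilities, the single-qubit embedding, and the training-set size. Here $G = R(\Pi^{\mathcal T}) - R(\Pi^*)$ and $A = R(\Pi^*) - R_{\rm Bayes}$, with $P(c) = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def pure(theta):
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return np.outer(psi, psi)

def helstrom_povm(rho0, rho1):
    vals, vecs = np.linalg.eigh(rho0 - rho1)
    pos = vecs[:, vals > 0]
    Pi0 = pos @ pos.T
    return Pi0, np.eye(2) - Pi0

# Assumed toy distribution on 8 inputs, with P(c) = 1/2.
xs = np.linspace(0.0, np.pi, 8)
p_x_given_0 = np.array([4, 4, 3, 2, 1, 1, 0.5, 0.5])
p_x_given_0 /= p_x_given_0.sum()
p_x_given_1 = p_x_given_0[::-1]

states = np.array([pure(x) for x in xs])
# Class-averaged states rho_c = sum_x P(x|c) rho(x).
rho = [np.tensordot(p, states, axes=1) for p in (p_x_given_0, p_x_given_1)]

def risk(povm):
    """True error R(Pi) = 1 - sum_c P(c) Tr[Pi_c rho_c]."""
    return 1 - 0.5 * sum(np.trace(P @ r) for P, r in zip(povm, rho))

def train(n_per_class):
    """Empirical risk minimisation: Helstrom POVM of sampled class averages."""
    avg = [states[rng.choice(len(xs), size=n_per_class, p=p)].mean(axis=0)
           for p in (p_x_given_0, p_x_given_1)]
    return helstrom_povm(*avg)

bayes_risk = 0.5 * np.minimum(p_x_given_0, p_x_given_1).sum()
R_star = risk(helstrom_povm(*rho))
R_train = risk(train(n_per_class=5))
G, A = R_train - R_star, R_star - bayes_risk
```

Increasing `n_per_class` typically shrinks `G`, while `A` is fixed by the embedding and the data distribution alone.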

III. QUANTUM INFORMATION BOUNDS FOR SUPERVISED LEARNING
In this section we use tools from quantum information theory to bound the approximation and generalization errors; these bounds are the main theoretical results of this paper. In Sec. IV we study how different properties of the PQS $\rho(x)$ affect these errors, while in Sec. V we study more practical applications.

A. Generalization error
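Employing tools from statistical learning theory [36] and quantum information, in Appendix A we prove one of our main results:
Theorem 1. For a given embedding $x \to \rho(x)$ and for any $\delta > 0$, with probability at least $1 - \delta$, the generalization error $G_{\mathcal T}(\rho)$ is bounded from above, as in inequality (9), by a quantity that decreases with $T$ and increases with $B$, where $T$ is the size of the training set,
$$B = \Big(\mathrm{Tr}\sqrt{\sum_x P(x)\,\rho(x)^2}\Big)^2 = 2^{I_2(X:Q)} \tag{10}$$
depends on the embedding, and $P(x)$ is the (unknown) prior probability for an image $x$.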
We refer to $B$ as the generalization bound, which constrains how large the generalization error can be for a fixed number $T$ of training pairs. The inequality (9) applies to binary classification problems, but its general form, derived in Appendix A, is equivalent to (9) up to a constant that depends on the number of (equiprobable) classes (see Theorem 2). The inequality (9), with the explicit form of $B$ in (10), represents one of the central results of this paper, as it links the generalization error to properties of the embedding that are measured by information theoretic quantities. Indeed, the quantity found in the second equality of Eq. (10) is the 2-Rényi mutual information between subsystems $X$ and $Q$ of the classical-quantum state
$$\rho_{CXQ} = \sum_{c,x} P(c, x)\,|c\rangle\langle c| \otimes |x\rangle\langle x| \otimes \rho(x). \tag{11}$$
For general $\alpha$ and subsystems $A$ and $B$, the $\alpha$-Rényi mutual information [39] is defined as
$$I_\alpha(A{:}B) = \min_{\sigma_B} D_\alpha(\rho_{AB}\,\|\,\rho_A \otimes \sigma_B), \qquad D_\alpha(\rho\,\|\,\sigma) = \frac{1}{\alpha - 1}\log_2 \mathrm{Tr}\big[\rho^\alpha \sigma^{1-\alpha}\big].$$
For $\alpha \to 1$ one recovers the quantum mutual information
$$I(A{:}B) = S(\rho_A) + S(\rho_B) - S(\rho_{AB}),$$
where $S$ denotes the von Neumann entropy. In Eq. (11) we have introduced three Hilbert spaces: the quantum space $Q$ where the PQSs $\rho(x)$ live, the class space $C$ spanned by $\{|c\rangle\}_{c=1,\dots,N_C}$ and the input space $X$ spanned by $\{|x\rangle\}$ for all possible values of $x$, namely where each input $x$ is mapped onto a different orthogonal state $|x\rangle$. For instance, if the inputs $x$ are made of classical images with $n$ pixels, each with a 16-bit color, then $|x\rangle$ lives in a space of $16n$ qubits. For continuous inputs, e.g. when $\rho(x)$ is an equilibrium state of a many-body system and $x$ some external parameters, one must consider a suitably regularized infinite-dimensional Hilbert space. Here, for simplicity, we assume that $X$ is discrete and can be represented using $N_X$ classical bits, so $\rho_{CXQ}$ lives in an $N_C\, 2^{N_X + N_Q}$-dimensional Hilbert space.
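The link between the generalization bound and the 2-Rényi mutual information can be made explicit with a short worked computation. The sketch below assumes that $I_\alpha$ is the Petz-type Rényi mutual information minimised over the marginal on $Q$, for which the minimisation has a known closed form (a quantum version of Sibson's identity); for classical-quantum states the closed form only involves the weighted sum $\sum_x P(x)\rho(x)^\alpha$.

```latex
% Assumption: I_alpha(X:Q) = min_{sigma_Q} D_alpha(rho_XQ || rho_X (x) sigma_Q),
% with the Petz divergence D_alpha, and rho_XQ = sum_x P(x) |x><x| (x) rho(x).
\begin{align}
I_\alpha(X{:}Q) &= \frac{\alpha}{\alpha-1}\,
  \log_2 \mathrm{Tr}\Big[\Big(\sum_x P(x)\,\rho(x)^\alpha\Big)^{1/\alpha}\Big], \\
I_2(X{:}Q) &= 2\log_2 \mathrm{Tr}\sqrt{\sum_x P(x)\,\rho(x)^2}
  \quad\Longrightarrow\quad
  2^{I_2(X:Q)} = \Big(\mathrm{Tr}\sqrt{\sum_x P(x)\,\rho(x)^2}\Big)^{2}.
\end{align}
```

The first line uses that $\rho_X^{1-\alpha}\rho_{XQ}^{\alpha}$ reduces to $\sum_x P(x)\,|x\rangle\langle x| \otimes \rho(x)^\alpha$ for classical-quantum states; the second line is the case $\alpha = 2$.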
Inequalities as in (9) are common in statistical learning theory and show that, with high probability, a model generalizes well whenever $T \to \infty$. The importance of (10) is in quantifying when the size $T$ of the training set is "large". According to our analysis, a training set is large whenever $T \gg 2^{I_2(X:Q)}$, namely when $\log_2(T)$ is much larger than the number of bits required to describe the information shared between the input distribution and the quantum embedding, as measured by $I_2(X{:}Q)$.
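To get a feeling for the scale of $B$, the following sketch evaluates $B = \big(\mathrm{Tr}\sqrt{\sum_x P(x)\rho(x)^2}\big)^2$ and $I_2(X{:}Q) = \log_2 B$ for a discrete input distribution. The single-qubit embedding, the uniform prior, and the function names are assumptions made for illustration only.

```python
import numpy as np

def generalization_bound(probs, states):
    """B = (Tr sqrt(sum_x P(x) rho(x)^2))^2 for a discrete input distribution."""
    avg_sq = sum(p * (rho @ rho) for p, rho in zip(probs, states))
    eigvals = np.clip(np.linalg.eigvalsh(avg_sq), 0.0, None)
    return np.sqrt(eigvals).sum() ** 2

def pure(theta):
    """Hypothetical embedding: cos(theta/2)|0> + sin(theta/2)|1>."""
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return np.outer(psi, psi)

xs = np.linspace(0.0, np.pi, 16)
probs = np.full(16, 1 / 16)          # uniform prior P(x), assumed
B = generalization_bound(probs, [pure(x) for x in xs])
I2 = np.log2(B)                      # 2-Renyi mutual information in bits
B_single = generalization_bound([1.0], [pure(0.3)])
```

For a single-qubit embedding $B$ cannot exceed 2, so the condition $T \gg B$ is already met by modest training sets; larger embeddings can make $B$ grow up to the Hilbert-space dimension.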

B. Approximation error
Fixing the embedding $x \to \rho(x)$ is like fixing the model class in classical machine learning, e.g. a neural network with a given architecture and a certain number of nodes. The difference between the minimum classification error with a given architecture and the theoretical minimum over all possible architectures, namely the Bayes risk, is the approximation error (8). For a known $P(c, x)$ and a given $x$, the Bayes classifier picks the class that maximizes $P(c|x)$. The corresponding Bayes risk for binary classification problems with $P(c) = 1/2$ is then
$$R_{\rm Bayes} = \sum_x \min_c P(c, x) = \frac{1 - \Delta}{2}, \qquad \Delta = \frac{1}{2}\sum_x \big|P(x|0) - P(x|1)\big|.$$
Using the definition of the approximation error (8) and the classification error $R(\Pi^*) = \frac{1}{2} - \frac{1}{4}\|\rho_0 - \rho_1\|_1$, which is analogous to (4) but with the states (1), we find that the approximation error for quantum binary classification problems can be written as
$$A(\rho) = \frac{\Delta}{2} - \frac{1}{4}\,\|\rho_0 - \rho_1\|_1.$$
It is simple to show that $0 \le A \le \Delta/2$. Indeed, the upper bound is trivial, and can be achieved when $\rho_0 = \rho_1$. As for the lower bound, by defining $\rho^{XQ}_c = \sum_x P(x|c)\,|x\rangle\langle x| \otimes \rho(x)$, we first note by explicit calculation that $\|\rho^{XQ}_0 - \rho^{XQ}_1\|_1 = 2\Delta$. Then by the contractivity of the trace distance over quantum channels we find
$$\|\rho_0 - \rho_1\|_1 = \big\|\mathrm{Tr}_X\big[\rho^{XQ}_0 - \rho^{XQ}_1\big]\big\|_1 \le \big\|\rho^{XQ}_0 - \rho^{XQ}_1\big\|_1 = 2\Delta,$$
and hence $A \ge 0$. The approximation error $A$ can be interpreted as a generalization of the probability of error in QHT, where the difference is due to the measurements over the instances $\rho(x)$ rather than over the discrete hypothesis states (1). The two errors coincide up to a multiplicative factor when $X \equiv C$.
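The quantities entering the approximation error can be evaluated directly for a small example. The sketch assumes equiprobable classes and writes $\Delta = \frac12\sum_x |P(x|0) - P(x|1)|$ for the total variation distance between the two conditional distributions, so that $R_{\rm Bayes} = (1-\Delta)/2$ and $A = \Delta/2 - \frac14\|\rho_0 - \rho_1\|_1$; the conditional probabilities and the embedding are illustrative choices.

```python
import numpy as np

def trace_norm(A):
    return np.abs(np.linalg.eigvalsh(A)).sum()

def pure(theta):
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return np.outer(psi, psi)

xs = np.linspace(0.0, np.pi, 6)
p0 = np.array([0.4, 0.3, 0.15, 0.1, 0.04, 0.01])   # P(x|c=0), assumed
p1 = p0[::-1]                                      # P(x|c=1), assumed

states = np.array([pure(x) for x in xs])
rho0 = np.tensordot(p0, states, axes=1)            # rho_0 = sum_x P(x|0) rho(x)
rho1 = np.tensordot(p1, states, axes=1)

delta = 0.5 * np.abs(p0 - p1).sum()
bayes_risk = 0.5 * (1 - delta)
best_quantum_risk = 0.5 - 0.25 * trace_norm(rho0 - rho1)
A = best_quantum_risk - bayes_risk
```

The gap `A` is strictly positive here because a single qubit cannot keep six different inputs perfectly distinguishable.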
In the previous section, we have shown how to bound the generalization error using the mutual information between subsystems $X$ and $Q$ in (11). We can use entropies to bound the average classification error $R(\Pi^*)$, and hence $A$. Indeed, using the quantum Chernoff bound [40] and explicit calculations we get $R(\Pi^*) \le 2^{-I_{1/2}(C:Q)} - \frac{1}{2}$, which shows that a low classification error is possible when $I_{1/2}(C{:}Q)$ is large. A more general result, valid for any number of classes, can be found using conditional entropies [41],
$$R(\Pi^*) \le 1 - 2^{-H_{\min}(C|Q)} \le 1 - \frac{2^{I(C:Q)}}{N_C}, \tag{15}$$
which is valid for states of the form given in Eq. (11), where $H_{\min}(C|Q) \le H(C|Q)$ is the min-conditional entropy, which is smaller than von Neumann's conditional entropy [42]. The second inequality comes from $H(C|Q) = H(C) - I(C{:}Q)$ and $H(C) = \log_2 N_C$ for a classification problem with $N_C$ classes. Since $Q$ and $C$ are classically correlated, $I(C{:}Q) \le H(C)$ and a small risk is possible when the mutual information between $Q$ and $C$ is large.
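The Chernoff-type bound for two equiprobable classes can be checked numerically: the Helstrom risk $\frac12 - \frac14\|\rho_0 - \rho_1\|_1$ never exceeds $\frac12\mathrm{Tr}[\sqrt{\rho_0}\sqrt{\rho_1}]$. The sketch also evaluates $I_{1/2}(C{:}Q)$ through the closed form $-\log_2 \mathrm{Tr}\big[\big(\tfrac12\sqrt{\rho_0} + \tfrac12\sqrt{\rho_1}\big)^2\big]$, which is an assumption tied to the Petz-type definition of the Rényi mutual information; the two density matrices are arbitrary illustrative choices.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def trace_norm(A):
    return np.abs(np.linalg.eigvalsh(A)).sum()

# Two assumed class-averaged qubit states (positive, unit trace).
rho0 = np.array([[0.8, 0.1], [0.1, 0.2]])
rho1 = np.array([[0.3, -0.2], [-0.2, 0.7]])

s0, s1 = psd_sqrt(rho0), psd_sqrt(rho1)
helstrom_risk = 0.5 - 0.25 * trace_norm(rho0 - rho1)
chernoff = 0.5 * np.trace(s0 @ s1).real            # s = 1/2 Chernoff bound

mix = 0.5 * (s0 + s1)
I_half = -np.log2(np.trace(mix @ mix).real)        # assumed closed form
bound_from_I_half = 2.0 ** (-I_half) - 0.5
```

Under the stated assumption the two forms of the upper bound coincide exactly, since $\mathrm{Tr}[(\sqrt{\rho_0} + \sqrt{\rho_1})^2] = 2 + 2\,\mathrm{Tr}[\sqrt{\rho_0}\sqrt{\rho_1}]$.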
To conclude, small $G$ is possible for small $I_2(X{:}Q)$ while small $A$ is possible for large $I(C{:}Q)$. In the following section, we build upon these theoretical bounds to study how different properties of the PQS $\rho(x)$ affect the approximation and generalization errors.

IV. BIAS VARIANCE TRADEOFF FOR QUANTUM MACHINE LEARNING
The bias-variance tradeoff is a central result in machine learning, stating that it is impossible to minimize both the approximation and the generalization errors. Models with lots of parameters and structure are expected to have low approximation error, potentially at the cost of poor generalization (overfitting). On the other hand, a low-dimensional model with few parameters would be easier to learn, but it might not reliably classify the data.
We study the bias-variance tradeoff using quantum information. Remember that the difference between the average testing error and the Bayes risk can be written as a sum of two positive terms (6), the generalization error G and the approximation error A. The latter is the difference between the average classification error R and the Bayes risk, while the former can be bounded by the generalization bound B from (10), and can be made arbitrarily small by considering arbitrarily large training sets with T → ∞. For finite and fixed T , we show that it is impossible to minimize both A and G, and that different properties of PQS ρ(x) affect the approximation and generalization errors, as schematically shown in Fig. 3.
Thanks to our framework, many characterizations of PQS $\rho(x)$ will be formally derived in the next section using tools from quantum information. We will use, in particular, the contractivity of the trace distance under quantum channels $\mathcal{E}_{Q \to Q'}$ [43], mapping from the space $Q$ to the space $Q'$, i.e., $\|\mathcal{E}_{Q \to Q'}(\rho - \sigma)\|_1 \le \|\rho - \sigma\|_1$, the data processing inequality [39] $I_2(X{:}Q) \ge I_2(X{:}Q')$, and finally the bounds satisfied by $B$, studied in Appendices A and B,
$$1 \le B \le \min\big\{D,\, 2^{H_2(X)}\big\}, \tag{16}$$
where $D = 2^{N_Q}$ is the dimension of the embedding Hilbert space, i.e. $N_Q$ is the number of qubits in the PQS, and $H_2(X) = 2\log_2 \sum_x \sqrt{P(x)}$ is the Rényi entropy of the classical input distribution.

A. Properties of quantum embeddings
Here we focus on the mapping x → ρ(x) and discuss some properties and desirable features.
It is impossible to minimize both $B$ and $R$: The optimal embedding is the one that discards all the irrelevant information from the input space $X$ that is not necessary to predict the class $C$. Indeed, according to Eqs. (10) and (15), $I_2(X{:}Q)$ must be small while $I(C{:}Q)$ must be large. We now show that it is impossible to minimize both $B$ and $R$ by studying the two extreme cases where the information about $X$ is either fully discarded or fully maintained.
Constant embeddings provide zero generalization error, but the largest approximation error: indeed, the generalization error (7) is defined as the distance between the risk obtained by minimizing the empirical loss over the training data and the true average loss. For constant embeddings this difference is zero: if we restrict ourselves to classifiers that always produce a constant answer, then it is trivial to learn this classifier from data, but the average classification error will be as high as 50%. Mathematically, from the definition (10) and the bounds (16), it is clear that the minimum $B$ is achieved with the trivial constant embedding, $\rho(x) = \rho$ for all $x$, but such a trivial embedding provides both the largest $R = 1/2$ and the largest training error $R_{\mathcal T} = 1/2$ from Eq. (4). Moreover, using mutual informations, the constant embedding is the only embedding for which the space $Q$ is uncorrelated from both $C$ and $X$ in (11) and so $I(X{:}Q) = I(C{:}Q) = 0$.
Basis encoding guarantees zero approximation error, but the largest generalization error. Basis encoding [2,44] is defined as $\rho(x) = |x\rangle\langle x|$, namely different inputs are mapped onto orthogonal vectors on a suitably large Hilbert space. No information is lost, or hidden, in the quantum embedding, so using Eqs. (14) and (16), we get the lowest possible approximation error $A = 0$, meaning the average loss can achieve the Bayes risk. However, for the same reason, we also get the largest $B = 2^{H_2(X)}$, since $X = Q$. Therefore, basis encoding allows us to reach the theoretical minimum classification error, but it requires a large embedding space and many ($T \gg B$) training pairs to avoid overfitting.
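The two extreme embeddings can be compared numerically on a small input alphabet. The sketch assumes equiprobable classes, uses the expressions $B = \big(\mathrm{Tr}\sqrt{\sum_x P(x)\rho(x)^2}\big)^2$ and $A = \Delta/2 - \frac14\|\rho_0 - \rho_1\|_1$ with $\Delta$ the total variation distance between $P(x|0)$ and $P(x|1)$, and picks arbitrary conditional probabilities; all names are illustrative.

```python
import numpy as np

def bound_B(px, states):
    avg_sq = sum(p * (rho @ rho) for p, rho in zip(px, states))
    lam = np.clip(np.linalg.eigvalsh(avg_sq), 0.0, None)
    return np.sqrt(lam).sum() ** 2

def approximation_error(p0, p1, states):
    rho0 = sum(p * rho for p, rho in zip(p0, states))
    rho1 = sum(p * rho for p, rho in zip(p1, states))
    delta = 0.5 * np.abs(p0 - p1).sum()
    return 0.5 * delta - 0.25 * np.abs(np.linalg.eigvalsh(rho0 - rho1)).sum()

p0 = np.array([0.5, 0.3, 0.15, 0.05])     # P(x|c=0), assumed
p1 = p0[::-1]                             # P(x|c=1), assumed
px = 0.5 * (p0 + p1)                      # P(x) for P(c) = 1/2
n = len(px)

constant = [np.eye(2) / 2] * n                        # rho(x) = rho for all x
basis = [np.diag(np.eye(n)[k]) for k in range(n)]     # rho(x) = |x><x|

B_const, A_const = bound_B(px, constant), approximation_error(p0, p1, constant)
B_basis, A_basis = bound_B(px, basis), approximation_error(p0, p1, basis)
```

The constant embedding reaches the smallest bound ($B = 1$) at the price of the largest approximation error, while basis encoding has $A = 0$ and $B = \big(\sum_x \sqrt{P(x)}\big)^2$.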
High-dimensional embeddings may have lower approximation error: in Appendix A (Theorem 3) we show that if we define an embedding by taking $N$ copies of a simpler one, i.e. if we consider states $\rho(x)^{\otimes N}$, then the approximation error decreases with $N$. Intuitively this happens because asymptotically it is possible to correctly discriminate all the states $\rho(x)$ via QHT [12,32], effectively achieving a basis encoding for $N \to \infty$. If $\rho(x)$ is an $N_Q$-qubit state, then optimized embeddings using $N \times N_Q$ qubits can only have a lower approximation error than $\rho(x)^{\otimes N}$, since the latter is a particular case. However, high-dimensional embeddings may suffer from poor generalization, as $B$ may be larger. A numerical check of this prediction is shown in Fig. 4 for a binary discrimination problem with Gaussian priors. We see that, as the number of qubits increases, the classification error quickly decreases but the generalization bound increases. This is consistent with our theoretical prediction. According to our Theorem 3, for many copies, $A$ decreases exponentially with $N$. The asymptotic behaviour of $I(X{:}Q)$ is still an open question, but we note that for local measurements $I(X{:}Q)$ can be bounded by the mutual information between an input random variable and $N$ output observations, which may display two different regimes, $O(\log N)$ or $O(N)$ [45]. Therefore we conjecture that $B$ may slowly increase with $N$ (e.g. polynomially) for particular datasets and embeddings, as we observe numerically for small $N$.
Low-entropy datasets and low-dimensional embeddings can in principle generalize well: this is a trivial consequence of Eq. (16), when the entropy of the dataset is measured by $H_2(X)$. The statement about the dimension can be made a bit more precise by focusing not just on the dimension of the Hilbert space, but on how much the information is distributed within the Hilbert space. For instance, let us assume a pure state embedding $\rho(x) = U(x)|0\rangle\langle 0|U(x)^\dagger$ with a unitary embedding circuit $U(x)$. If the embedding is such that the input information is "fully scrambled" in a $d$-dimensional subspace, with $d \ll 2^{N_Q}$, then we may write $\sum_x P(x)\rho(x)^2 \approx \mathbb{1}_d/d$, where $\mathbb{1}_d$ is the projector onto that subspace. Substituting this approximation in (10) we get $B \approx d$. Therefore, an embedding is capable of generalization not just when built using few qubits, but rather when it "scrambles" information in a small subspace of the full $N_Q$-qubit Hilbert space.
Geometric characterization: There is an intuitive geometrical characterization of "good" embeddings (see e.g. Refs. [7,8]). A good embedding is possible when the fidelity between two embedded states is small if the inputs are from different classes and high if the inputs are from the same class, as schematically shown in Fig. 5. This intuitive picture can be explicitly proved using our results. Indeed, using the Fuchs-van de Graaf inequality and the strong concavity of the fidelity, we obtain inequality (18), which involves the fidelities between embedded states of different classes, together with a bound on $B$, valid for pure state embeddings, in terms of the purity and the rank $r_c$ of $\rho_c$. The general case is discussed in Appendix B. The latter inequality shows that low generalization error is possible when the average embedding states $\rho_c$ have low rank and/or high purity. Since $\rho_c$ is an ensemble of embeddings for inputs from the same class, the above requirement is satisfied when $\rho(x)$ effectively maps all the inputs from the same class to the same state. More precisely, for pure state embeddings $\rho(x) = U(x)|0\rangle\langle 0|U(x)^\dagger$, the purity can be written as
$$\mathrm{Tr}[\rho_c^2] = \sum_{x,y} P(x|c)\,P(y|c)\,F(\rho(x), \rho(y))^2,$$
where $F(\rho(x), \rho(y)) = |\langle 0|U(y)^\dagger U(x)|0\rangle|$ is the fidelity. Therefore, good generalization is possible whenever $F(\rho(x), \rho(y))$ is large for all possible pairs $(x, y)$ of inputs with the same class, namely when $\rho(x)$ and $\rho(y)$ are always geometrically close in the embedding Hilbert space. Combining this with (18) we see that a desirable feature to get a good embedding is that the fidelity between two embedded states is small if the inputs are from different classes and high if the inputs are from the same class, as in Fig. 5.

Noisy operations:
We focus on what happens when $\rho(x) = (1-\epsilon)\,U(x)|0\rangle\langle 0|U(x)^\dagger + \epsilon\,\mathbb{1}/2^{N_Q}$, namely when the embedding discussed in the previous example is degraded by depolarising noise with strength $\epsilon$. Again assuming that the average state fully scrambles information in a $d$-dimensional subspace, $B$ can be evaluated in closed form as a function of $\epsilon$ and $d$. From the resulting expression, we see that the generalization error does not increase with noise. It actually decreases when $1 \ll d \ll 2^{N_Q}$, as for large $\epsilon$ the embedding approaches the constant embedding, which has the lowest generalization error but the highest classification and approximation errors.
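As a rough illustration of the noisy case, the sketch below evaluates the quantity in (10), $B = (\mathrm{Tr}\sqrt{\sum_x P(x)\rho(x)^2})^2$, under the assumption that $\sum_x P(x)\,U(x)|0\rangle\langle 0|U(x)^\dagger$ is exactly $\mathbb{1}_d/d$. The function name and the closed form are our own working under that assumption, and need not match the form of the expression used in the text.

```python
import math


def scrambled_noisy_bound(d: int, n_qubits: int, eps: float) -> float:
    """B for a depolarised, fully scrambled pure-state embedding (assumed model).

    Assumes sum_x P(x)|psi(x)><psi(x)| equals the projector on a d-dimensional
    subspace divided by d, inside a Hilbert space of dimension D = 2**n_qubits.
    """
    dim = 2 ** n_qubits
    if not 1 <= d <= dim:
        raise ValueError("d must lie between 1 and 2**n_qubits")
    # rho(x)^2 = a |psi><psi| + (eps/D)^2 * identity, with the coefficient a below.
    a = (1 - eps) ** 2 + 2 * eps * (1 - eps) / dim
    inside = a / d + (eps / dim) ** 2   # eigenvalue on the scrambling subspace
    outside = (eps / dim) ** 2          # eigenvalue on its orthogonal complement
    trace_sqrt = d * math.sqrt(inside) + (dim - d) * math.sqrt(outside)
    return trace_sqrt ** 2


if __name__ == "__main__":
    for eps in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(eps, scrambled_noisy_bound(d=4, n_qubits=4, eps=eps))
```

In this model the noiseless value is $B = d$ and the fully depolarised value is $B = 1$, the value of a constant embedding.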
Kernels close to identity are not good for generalization: quantum embeddings can be used to define quantum kernels [5-7,11]. These "kernels" are nothing but the fidelity between two states. Sometimes working with kernels rather than quantum states can be beneficial. In Appendix C we show that for pure state embeddings $\rho(x) = |\psi(x)\rangle\langle\psi(x)|$ we can express the generalization bound quantity as
$$B = \left(\mathrm{Tr}\sqrt{K}\right)^2, \tag{21}$$
where $K$ is the normalized kernel operator, whose matrix elements are $K_{x,y} = \sqrt{P(x)P(y)}\,\langle\psi(x)|\psi(y)\rangle$. Accordingly, the generalization bound can be computed from the eigenvalues $\eta_k$ of the normalized kernel matrix as $B = \left(\sum_k \sqrt{\eta_k}\right)^2$. Note that the study of normalized kernel eigenvalues was also recently proposed as a possible explanation of the generalization capabilities of deep learning [38]. From the above expression, one readily finds that worse generalization performance, according to our bounds (9) and (16), is obtained when $\langle\psi(x)|\psi(y)\rangle \approx \delta_{x,y}$, namely when different states have almost orthogonal support. Indeed, when $\langle\psi(x)|\psi(y)\rangle = \delta_{x,y}$, Eq. (21) results in the upper bound of (16).
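The kernel form of the bound can be checked by hand on the smallest non-trivial case. The sketch below assumes only two possible inputs with probabilities $p$ and $1-p$ and a single overlap $\langle\psi(x_1)|\psi(x_2)\rangle$, so that the $2\times 2$ normalized kernel has unit trace and its eigenvalues follow from its determinant. The function name and the two-input restriction are illustrative choices, not part of the original analysis.

```python
import math


def kernel_bound_two_inputs(p: float, overlap: complex) -> float:
    """B = (sum_k sqrt(eta_k))^2 for a two-input normalized kernel (toy case).

    K = [[p, sqrt(p q) s], [sqrt(p q) conj(s), q]] with q = 1 - p and s = overlap.
    """
    if not 0.0 <= p <= 1.0 or abs(overlap) > 1.0 + 1e-12:
        raise ValueError("need 0 <= p <= 1 and |overlap| <= 1")
    q = 1.0 - p
    det = p * q * (1.0 - abs(overlap) ** 2)
    # Tr K = 1, so the eigenvalues are (1 +/- sqrt(1 - 4 det)) / 2.
    disc = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    eta_plus = (1.0 + disc) / 2.0
    eta_minus = max(0.0, (1.0 - disc) / 2.0)
    return (math.sqrt(eta_plus) + math.sqrt(eta_minus)) ** 2


if __name__ == "__main__":
    for s in (0.0, 0.5, 0.9, 1.0):
        print(s, kernel_bound_two_inputs(0.5, s))
```

Orthogonal states (overlap 0) give the largest value, while identical states (overlap of modulus 1) give $B = 1$, in line with the statement that kernels close to the identity are unfavourable for generalization.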
Pooling may help: We have seen that a large embedding Hilbert space may favour the classification accuracy, yet hinder generalization. According to (16) the generalization bound may approach its largest value when $N_Q$, i.e., the number of qubits in the embedding, is as large as $H_2(X)$. What about the minimum number of qubits? For binary classification problems a good embedding can be obtained even with $N_Q = 1$. Indeed, the simplest embedding that achieves the minimum Bayes risk is $\rho_{\rm Bayes}(x) = |\bar c\rangle\langle\bar c|$, where $\bar c = \mathrm{argmax}_c\, P(x|c)$ is, for a given $x$, the class with largest conditional probability. Clearly it is impossible to construct this embedding, as the probabilities $P(x|c)$ are unknown, but the above example shows that a good embedding is possible with a single qubit. Although the state before the measurement must be as low-dimensional as possible, we may start from a large dimensional embedding and then iteratively throw away information, either via measuring some qubits and then applying a different unitary on the remaining ones depending on the measurement result or, equivalently, by applying a conditional gate and then discarding some qubits via a partial trace.
Since the generalization error depends only on the dimension of the final Hilbert space, one can use pooling to iteratively reduce the number of qubits, using different layers, eventually leaving a single qubit for measurements. Promising forms of pooling have been proposed as a basis for Quantum Convolutional Neural Networks (QCNN) [9,46], where the pooling layers are constructed using a reverse Multiscale-Entanglement-Renormalization-Ansatz (MERA) circuit, whose depth depends logarithmically on the total number of qubits. QCNNs have some desirable features, such as the ability to distinguish states corresponding to complex phases of matter [9], and the lack of barren plateaus in their parameter landscape [47], which aids training. Our analysis shows that QCNNs, or other embeddings built by iteratively pooling information, also have good generalization capabilities.

B. Quantum Information Bottleneck
In the previous section we showed that it is impossible to minimize both the approximation and the generalization errors, when these are defined starting from the linear loss (2). We now show how the generalization/approximation tradeoff can also be understood from information theoretic principles that are independent of the choice of loss function. In classical settings, a method designed for this purpose is the information bottleneck (IB) principle [31,48], whose aim is to find the "best" compressed representation $Z$ of the input $X$ that nonetheless has all the relevant information required to predict the class $C$. The amount of compression can be quantified using the classical mutual information $I(X{:}Z)$, while $I(C{:}Z)$ quantifies the residual information between $C$ and $Z$. In order to have accurate classification $I(C{:}Z)$ must be large, while to have good compression $I(X{:}Z)$ must be small. The information bottleneck principle finds a compromise between accuracy and compression by minimizing the Lagrangian $\mathcal{L}_{IB} = I(X{:}Z) - \beta I(C{:}Z)$ for a certain value of $\beta$. The parameter $\beta$ allows us to explore different regimes and to favour either accuracy or compression. When $\beta = 0$ the minimization of $\mathcal{L}_{IB}$ achieves the best compression, without caring about correct classification, while for $\beta \to \infty$ it achieves optimal classification without compression. Quantum generalizations of the information bottleneck principle were considered for quantum communication problems in [49,50]. Here we apply the IB principle to the different problem of finding the optimal embedding.
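To make the classical Lagrangian concrete, the following sketch evaluates $\mathcal{L}_{IB} = I(X{:}Z) - \beta I(C{:}Z)$ for small discrete distributions. It assumes that $Z$ is generated from $X$ alone through a stochastic encoder $P(z|x)$, so that $C$, $X$, $Z$ form a Markov chain, and it measures information in bits. The function names and the toy distribution are illustrative and do not come from the original work.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution given as joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0:
                total += p_ab * math.log2(p_ab / (p_a[a] * p_b[b]))
    return total


def ib_lagrangian(p_cx, encoder, beta):
    """L_IB = I(X:Z) - beta * I(C:Z), with p_cx[c][x] and encoder[x][z] = P(z|x)."""
    n_c, n_x, n_z = len(p_cx), len(encoder), len(encoder[0])
    p_x = [sum(p_cx[c][x] for c in range(n_c)) for x in range(n_x)]
    p_xz = [[p_x[x] * encoder[x][z] for z in range(n_z)] for x in range(n_x)]
    p_cz = [[sum(p_cx[c][x] * encoder[x][z] for x in range(n_x))
             for z in range(n_z)] for c in range(n_c)]
    return mutual_information(p_xz) - beta * mutual_information(p_cz)


if __name__ == "__main__":
    p_cx = [[0.4, 0.1], [0.1, 0.4]]           # toy joint distribution P(c, x)
    constant = [[1.0, 0.0], [1.0, 0.0]]       # Z ignores X: full compression
    identity = [[1.0, 0.0], [0.0, 1.0]]       # Z copies X: no compression
    for beta in (0.0, 1.0, 10.0):
        print(beta, ib_lagrangian(p_cx, constant, beta),
              ib_lagrangian(p_cx, identity, beta))
```

For small $\beta$ the constant encoder has the lower Lagrangian, while for large $\beta$ the identity encoder wins, which is the compression/accuracy tradeoff described above.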
We focus on the state (11) where $X$ and $C$ are, respectively, the classical spaces of inputs and classes, and $Q$ is the quantum embedding Hilbert space. We then define the quantum IB Lagrangian as $\mathcal{L}_{IB} = I(X{:}Q) - \beta I(C{:}Q)$, where $I(A{:}B) = I_{\alpha\to 1}(A{:}B)$ in (12) is the quantum mutual information. Both $I(X{:}Q)$ and $I(C{:}Q)$ can be expressed using Holevo's accessible information [51]. In (9), (10) and (15), we have shown that good generalization is possible whenever $I_2(X{:}Q)$ is small, while low classification error is possible when $I(C{:}Q)$ is large. These conclusions were found for the linear loss (5). We may assume that $I_\alpha(X{:}Q)$ defines a family of generalization bounds for different loss functions, so the minimization of the generalization error is consistent with the minimization of $I(X{:}Q)$, according to some metric, while the maximization of $I(C{:}Q)$ is consistent with the accurate prediction of $C$ from $Q$. For a particular value of $\beta$, the optimal embedding is then obtained as $\min_{\rho(x)} \mathcal{L}_{IB}$. From the definition, we find the explicit form of the IB Lagrangian as
$$\mathcal{L}_{IB} = (1-\beta)\,S[\bar\rho] + \beta\sum_c P(c)\,S[\rho_c] - \sum_x P(x)\,S[\rho(x)] + \sum_x \tilde\lambda_x \mathrm{Tr}[\rho(x)] + \eta, \tag{24}$$
where $S[\rho]$ is the von Neumann entropy of $\rho$, $\tilde\lambda_x$ are Lagrange multipliers to force correct normalization, and $\eta$ contains all the terms that are independent of the embedding. The optimal embedding corresponds to a minimum of $\mathcal{L}_{IB}$, which satisfies $\partial\mathcal{L}_{IB}/\partial\rho(z) = 0$. By explicit computation we find that the above condition defines a recursive equation for the optimal embedding
$$\rho(z) = \frac{1}{\lambda_z}\exp\Big[(1-\beta)\log\bar\rho + \beta\sum_c P(c|z)\log\rho_c\Big], \tag{25}$$
where $\bar\rho = \sum_c P(c)\rho_c$ and $\lambda_z$ is directly related to $\tilde\lambda_z$ and is needed to enforce normalization. Alternatively, by restricting to pure state embeddings $\rho(x) = |\psi(x)\rangle\langle\psi(x)|$, we get
$$\lambda_z|\psi(z)\rangle = e^{(1-\beta)\log\bar\rho + \beta\sum_c P(c|z)\log\rho_c}\,|\psi(z)\rangle. \tag{26}$$
From Eqs. (25) and (26) we see that, for $\beta = 0$, we get a constant embedding, while for large $\beta$ the optimal embedding for a given $x$ is iteratively obtained from one of the eigenvectors of $\sum_c P(c|x)\log\rho_c$ with the largest eigenvalue, or a mixture of them. A numerical solution of the IB equations is shown in Fig. 6, where "pure" and "mixed" refer to either (26) or (25), which were solved for two Gaussian distributions and a single-qubit embedding, using a fixed number of iterations (1000). The Bayes risk is plotted as a reference, giving the smallest possible value of $R$ that can be obtained with any embedding. For a fixed $1 \le \beta \le 3$, we first compute the optimal embedding via either (26) or (25), and then compute the classification error $R$ from Eq. (5) and the generalization bound $B$ from Eq. (10). Recall that, for a given $P(c,x)$, the classification error is equivalent to the approximation error $A$, up to the constant Bayes risk (8). Fig. 6(a) shows the approximation-generalization tradeoff for the different regimes that we have explored by varying $\beta$. For low values of $\beta$, we get an almost constant embedding with a large classification error (up to 50%) and low $B$, while for large values of $\beta \ge 2$, we find that $R$ approaches the theoretical lower bound (Bayes risk), but at the expense of a larger generalization error, as $B$ gets close to the theoretical upper bound (16). We point out though that for this particular example, with a single-qubit embedding and two Gaussian priors, the generalization error is always low due to the bound (16).
The properties of the optimal embedding are shown in Figs. 6(b) and (c). In particular, in panel (b) we observe that data belonging to different classes are clustered, but not well separated from each other. On the other hand, for larger values of β, points belonging to different classes are typically very far apart in the Bloch sphere, though there are still some points in the wrong cluster. This prediction is consistent with the analysis of the fidelity between two different embeddings discussed in the previous section and sketched in Fig. 5: a good embedding is such that F (ρ(x), ρ(y)) is large whenever x and y belong to the same class and small otherwise.

V. APPLICATIONS
In this section we study two different applications of our theoretical results. The first one deals with "quantum data", where the parametric quantum states ρ(x) are fixed by the problem. The second one focuses on the classification of classical data, where the quantum embedding x → ρ(x) can be optimized. In this latter case, we propose the Variational Quantum Information Bottleneck (VQIB) method for optimizing embeddings in order to favour generalization.

A. Quantum Phase Recognition
In Quantum Phase Recognition [9] the task is to recognise the phases of matter of a quantum many-body system by taking measurements on the quantum device itself, without having access to a classical description of its state. Here we focus on a paradigmatic exactly solvable model of quantum statistical mechanics, namely the one-dimensional transverse-field Ising model (27) [52], where $\sigma^{x,y,z}_j$ are the Pauli matrices acting on site $j$ and we consider periodic boundary conditions, $\sigma^\alpha_{L+1} \equiv \sigma^\alpha_1$. For this model, the classical input is the magnetic field $h \equiv x$. In the thermodynamic limit $L \to \infty$, the model displays a quantum phase transition at the critical value $h = 1$, separating an ordered phase for $|h| < 1$ with two-fold degenerate ground states from a disordered phase for $|h| > 1$ with a unique ground state. The model can be exactly solved via fermionization [52]. To simplify our analysis for finite $L$, here we ignore the subtleties of the different fermion parity sectors by considering a small symmetry-breaking term that forces the ground state to have even parity. In that case, for even $L$ the ground state $|\Phi_{\rm GS}(h)\rangle$ can be expressed as in Ref. [53] in terms of the states $|00\rangle_k$ and $|11\rangle_k$, which are respectively the vacuum and the state occupied by two fermions with opposite momenta $k$, $-k$. From that expression, it is trivial to compute the fidelity $f(h,h')$ between two ground states. In the thermodynamic limit the fidelity induced distance $1 - f(h, h+\epsilon)$ for small $\epsilon$ diverges at the critical point [53]. Therefore, we may expect that the fidelity between two states from different phases becomes very small. This is indeed shown in Fig. 7(a). Geometrically this means that the states belonging to different phases are clustered in distant areas of the Hilbert space, as in Fig. 5. However, $f(h,h')$ decreases exponentially in $L$ for $h \neq h'$, so for large $L$ the matrix $f(h,h')$ is almost diagonal, thus signalling bad generalization performance according to our Eq. (21). A scaling analysis of $B$ as a function of $L$ is beyond the scope of this work.
In what follows we test our theoretical predictions for a fixed chain length $L = 100$. In this case, we consider a uniform distribution $P(h)$ over $[0, 2]$ and compute $B$ from (21), where $x$ there is the magnetic field $h$. More specifically, we have discretized the interval such that (21) can be computed from the numerical eigenvalues, and we have observed that the result converges to $B \approx 5.9$ for 100 discretization points.
We then train a fidelity classifier [19] to recognise the phases of the quantum Ising model (27). In general the fidelity classifier associates to an unknown state $|\psi\rangle$ the class of the state from the training set with highest fidelity with $|\psi\rangle$. Such fidelity can be estimated via the SWAP test using $S$ shots, namely $S$ copies of $|\psi\rangle$. Since the SWAP test measurement operator is idempotent, the result of the SWAP test is a Bernoulli random variable with mean $F$, the fidelity, and variance $F(1-F)/S$. The fidelity measurement provides a non-optimal classification POVM, so this classifier is expected to perform slightly worse than the optimal strategies discussed theoretically in the previous sections.
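The shot-noise model described above can be sketched as follows. We take the statement in the text at face value and assume that each shot returns 1 with probability $F$, so that the average over $S$ shots has mean $F$ and variance $F(1-F)/S$; the classifier then picks the label of the training state with the largest estimated fidelity. The function names, the labels and the example fidelities are illustrative assumptions.

```python
import random


def estimate_fidelity(fidelity: float, shots: int, rng: random.Random) -> float:
    """Average of `shots` Bernoulli outcomes with success probability `fidelity`."""
    hits = sum(1 for _ in range(shots) if rng.random() < fidelity)
    return hits / shots


def fidelity_classifier(estimated_fidelities, labels):
    """Return the label of the training state with the highest estimated fidelity."""
    best = max(range(len(labels)), key=lambda k: estimated_fidelities[k])
    return labels[best]


if __name__ == "__main__":
    rng = random.Random(1234)
    # Hypothetical true fidelities between a test state and two training states.
    true_fidelities = [0.05, 0.80]
    labels = ["ordered", "disordered"]
    for shots in (1, 10, 100):
        estimates = [estimate_fidelity(f, shots, rng) for f in true_fidelities]
        print(shots, estimates, fidelity_classifier(estimates, labels))
```

When the two true fidelities are far apart, as for states deep inside different phases, few shots are enough to pick the right label; close fidelities need many more shots.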
For numerical simulations we consider a training set with $T$ random elements with $h > 1$ and $T$ random elements with $h < 1$, and verify the Quantum Phase Recognition problem by generating new testing states $|\Phi_{\rm GS}(h)\rangle$ for $h$ uniformly distributed in $[0, 2]$. In Fig. 7(b) we numerically observe that even with $T = 1$ the testing error is almost zero, except near the critical point. By increasing the number of shots, the fidelity is estimated more precisely, and given that states belonging to different phases have very low fidelity, as shown in Fig. 7(a), the testing error decreases. When $T \approx B$ the training error is normally very low, except near the critical point. For $T = 10B$ we always find zero training error, irrespective of the number of shots. Therefore, this analysis confirms the predictions of our Theorem 1.

B. Variational Quantum Information Bottleneck
We now focus on using a quantum algorithm to classify classical data. In this case, the states $\rho(x)$ are not fixed by the problem, as in the previous section, and can be optimized together with the measurement POVM. The embedding $x \to \rho(x)$ can be optimized by training a quantum circuit as in Fig. 1. More specifically, we consider one of the simplest yet most general classification circuits with a single-qubit classifier, dubbed "data reuploading" [8]: here we use a slightly modified version where the embedding (30) is obtained as a composition of $L$ layers of $x$-dependent single-qubit rotations around the $y$ and $z$ axes, where $R_\alpha(\theta) = e^{i\theta\sigma^\alpha}$, $\sigma^\alpha$ are the Pauli matrices and the weight tensor $w^\alpha_k$ can be optimized during training. Based on the Quantum Information Bottleneck principle proposed in Sec. IV B we study the variational minimization of the QIB Lagrangian (24) with respect to the parametric states (30). For single-qubit states, the entropies in Eq. (24) can be expressed without loss of generality in terms of the purity as $S(\rho) = -\sum_\pm \lambda_\pm\log\lambda_\pm$, where $\lambda_\pm = \big(1 \pm \sqrt{2\mathcal{P}(\rho) - 1}\big)/2$ are the eigenvalues of $\rho$, which only depend on the purity $\mathcal{P}(\rho) = \mathrm{Tr}[\rho^2]$. Since the state (30) is pure, $S(\rho(x)) = 0$ in Eq. (24). Moreover, in order to train the embedding, we approximate the averages over the distribution $P(c,x)$ with empirical averages over the elements of the training set $\mathcal{T}$, so from Eq. (24) we get the empirical Lagrangian (33), where constant terms have been neglected. By explicit computation, the purities of the average states can be written in terms of the squared overlaps $|\langle\psi(x)|\psi(y)\rangle|^2$ between embedded training states, where $\mathcal{T}$ refers to the double sum over the elements $(c_x, x)$, $(c_y, y)$ from the training set, while in $\mathcal{T}_c$ the sum is restricted over elements with class $c_x = c_y = c$. The ordering $x < y$ refers to the index of the inputs in the training set, and is used just to avoid double counting.
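The two single-qubit ingredients above, the entropy as a function of the purity and the purity of an empirical average of pure states, can be sketched as follows. We assume equal weights $1/T$ for the $T$ training states, represent each state as a pair of complex amplitudes, and measure entropy in bits; the function names and these conventions are ours and are not taken from the original numerical code.

```python
import math


def qubit_entropy_from_purity(purity: float) -> float:
    """Von Neumann entropy (bits) of a single-qubit state with the given purity."""
    purity = min(1.0, max(0.5, purity))      # guard against rounding errors
    radius = math.sqrt(2.0 * purity - 1.0)   # eigenvalues are (1 +/- radius) / 2
    entropy = 0.0
    for lam in ((1.0 + radius) / 2.0, (1.0 - radius) / 2.0):
        if lam > 0.0:
            entropy -= lam * math.log2(lam)
    return entropy


def ensemble_purity(states) -> float:
    """Purity of the equal-weight mixture of normalized pure states (a, b)."""
    n = len(states)
    cross = 0.0
    for i in range(n):
        for j in range(i + 1, n):            # ordered pairs avoid double counting
            (a0, a1), (b0, b1) = states[i], states[j]
            overlap = a0.conjugate() * b0 + a1.conjugate() * b1
            cross += abs(overlap) ** 2
    return 1.0 / n + 2.0 * cross / n ** 2


if __name__ == "__main__":
    plus = (1 / math.sqrt(2) + 0j, 1 / math.sqrt(2) + 0j)
    zero = (1 + 0j, 0j)
    purity = ensemble_purity([zero, plus])
    print(purity, qubit_entropy_from_purity(purity))
```

Evaluating these two functions on the whole training set and on each class separately gives the entropies of the average states that enter the Lagrangian when $S(\rho(x)) = 0$.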
As an example for numerical simulations, we consider a binary classification problem with the 2-moons dataset shown in Fig. 8(a), where each point is described by two real coordinates $x \equiv (x_1, x_2)$. Moon points are organized in the two different patterns shown with different colors in Fig. 8(a), which represent the two classes. Data have been generated using a noise parameter 0.3, which makes the classification less deterministic. We generate a training set of 100 samples per class and optimize Eq. (33) using the Nelder-Mead algorithm with starting point $w^\alpha_k = 0$ (constant embedding). In Figs. 8(b) and (c), we show the fidelity between two trained embeddings $F(\rho(x), \rho(y))$, where training was performed using either $\beta = 30$ or $\beta = 1.5$. After training, we use the fidelity classifier [19] to study both the training and testing errors. Unlike the previous section, here we study an exact evaluation of the fidelity, which would require an infinite amount of measurement shots. The training error we get with the optimized embedding is always zero. This is consistent with our theoretical analysis (see Theorem 3 in Appendix A), as for $N \to \infty$ copies we may formally get zero approximation error.
As shown in Fig. 8(b), for large $\beta$ the trained embedding is able to separate most data points belonging to different classes into almost orthogonal quantum states. More precisely, the fidelity is almost zero for most inputs belonging to different classes, while being mostly very high for states belonging to the same class, thus signalling good generalization. Indeed, by generating a testing set with 100 elements per class (also shown in Fig. 8(a)), we observe a testing error of 4.5%. With a much larger testing set of $10^4$ points we get a testing error of 2.6%.
Nonetheless, even better generalization can be obtained for $\beta = 1.5$, although the optimized embedding is almost constant, as shown in Fig. 8(c), with largest infidelity $10^{-7}$. The testing errors over the testing sets of 100 or $10^4$ elements per class described above are respectively 3.5% and 1.9%, both smaller than those obtained with larger $\beta$. The price to pay is that, due to the small infidelities, many more measurements are needed to estimate the fidelity with the high precision required for correct discrimination.
The wrongly classified samples in the smaller testing set are shown in Fig. 8(a) with a cross. We observe that for the small β = 1.5 only the elements near the boundaries may be wrongly classified, while for the larger β = 30, in spite of neater class separation shown in Fig. 8(b), there are wrongly classified samples in the "bulk" of the moons. Something similar was also observed in the numerical simulation shown in Fig. 6(c).
Our analysis shows that the Variational Quantum Information Bottleneck method can be successfully used to train quantum embeddings with different generalization properties.

VI. CONCLUSIONS
We have introduced measures of complexity to quantify the generalization and approximation capabilities of QML classification problems, either with general parametric quantum states $\rho(x)$ or quantum embeddings $x \to \rho(x)$ of classical data $x$, when optimal measurements are performed on the system. One of the main results of this paper is the bound on the generalization error via the Rényi mutual information $I_2(X{:}Q)$ between the embedding space $Q$ and the classical input space $X$. Thanks to our bound, overfitting does not occur when the number of training pairs $T$ is much bigger than $2^{I_2(X:Q)}$. Moreover, we have shown how to bound the approximation error via the mutual information between the embedding space and the class space, and shown that the classification error can approach its lowest possible value (Bayes risk), in the limit of many measurement shots or large Hilbert spaces. Our bounds were obtained for the linear loss function, routinely employed in QHT, but different losses can be linked to the linear loss via bounds. We have also introduced an information bottleneck principle for quantum embeddings, which is independent of the choice of loss function and allows us to explore different tradeoffs between accuracy and generalization.
Based on our theoretical results and bounds, we have studied different applications for both the classification of quantum and classical data. In particular we have studied the classification of the quantum phases of an Ising spin chain and proposed the Variational Quantum Information Bottleneck (VQIB) to train quantum embeddings with good generalization properties.
Our analysis can be applied to models of moderate complexity, such as those that can be trained with near-term quantum hardware. It is currently an open question to understand whether quantum classifiers of very-high complexity can mimic the generalization capabilities of classical deep learning.

Here we show a brief overview of the tools from statistical learning theory [36] that we use throughout the manuscript. As in the previous sections, we assume that there exists an abstract probability distribution $P(c,x)$ that models the inputs and their corresponding classes. This distribution is obviously unknown, but by construction, the samples in the training set $\mathcal{T}$ are drawn independently from $P(c,x)$. Suppose now that we have built a classifier $h \in H$, where $H$ is the set of classifiers that we are considering. We may define the error due to misclassification via the loss function $\ell_h(c,x)$, which is zero if and only if $c$ is the correct class of $x$. Training is done by minimizing the empirical risk, namely the average loss over the training set
$$R_T(h) = \frac{1}{T}\sum_{k=1}^{T}\ell_h(c_k, x_k), \tag{A1}$$
while the "true" risk of a classifier $h$ is given by
$$R(h) = \mathbb{E}_{(c,x)\sim P}\,\ell_h(c,x). \tag{A2}$$
Supervised learning is practically done via empirical risk minimization, namely the optimal data-driven classifier is obtained from $h_T = \mathrm{argmin}_{h\in H} R_T(h)$. The generalization error defines how $h_T$ performs with unseen data, i.e., data not present in the training set.
Formally the generalization error is then defined as $R(h_T) - \inf_{h\in H} R(h)$. Setting $h^* = \mathrm{argmin}_{h\in H} R(h)$ as the true optimal classifier, we may bound the generalization error $G$ as
$$G = R(h_T) - R(h^*) \le R(h_T) - R_T(h_T) + R_T(h^*) - R(h^*) \le 2\sup_{h\in H}\,|R(h) - R_T(h)|,$$
where in the first inequality we used the fact that $h_T$ is optimal for $R_T$, therefore $R_T(h_T) \le R_T(h^*)$. The upper bound is known as the uniform deviation bound. It represents the maximum deviation between the true and empirical risks, Eqs. (A1)-(A2), maximized over the possible classifiers. The goal of statistical learning theory is to study how much larger the risk $R(h_T)$ is than the Bayes risk, namely $R_{\rm Bayes} = \inf_h R(h)$ where the infimum is over all possible hypotheses, not restricted to $H$. Then by summing and subtracting $R(h^*)$ we get $R(h_T) - R_{\rm Bayes} = G + A$, where $A = R(h^*) - R_{\rm Bayes}$ is the approximation error, which depends on the hypothesis space $H$. One of the central results of statistical learning theory is the following [36]: if $\ell$ has support in $[0, 1]$ then with probability at least $1 - \delta$ the uniform deviation is bounded as in (A7), where $C_T(H)$ is the Rademacher complexity of $H$, which is defined as
$$C_T(H) = \mathbb{E}_{\mathcal{T}\sim P^T}\,\mathbb{E}_\sigma\left[\sup_{h\in H}\frac{1}{T}\sum_{k=1}^{T}\sigma_k\,\ell_h(c_k, x_k)\right],$$
where $\sigma_k$ is a random variable which can take two possible values, $\pm 1$, with the same probability $1/2$, and the notation $\mathcal{T}\sim P^T$ means that the $T$ elements in the training set $\mathcal{T}$ are sampled independently from the distribution $P$. From (A7) we see that if the Rademacher complexity of $H$ decreases with $T$, then, for sufficiently large $T$, the model is able to generalize and correctly predict the class of a new input, not present in the training set $\mathcal{T}$.
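For a small, finite hypothesis class the inner average over the signs $\sigma_k$ can be computed exactly by enumeration. The sketch below does this for a fixed training set, i.e. it evaluates the empirical Rademacher complexity without the outer average over training sets. The table-based representation of the losses and the function name are illustrative assumptions.

```python
from itertools import product
from math import isclose


def empirical_rademacher(loss_table) -> float:
    """Exact average over sign patterns of sup_h (1/T) sum_k sigma_k * loss_h(k).

    loss_table[h][k] is the loss of hypothesis h on the k-th training pair.
    The cost grows as 2**T, so this is only meant for very small training sets.
    """
    n_samples = len(loss_table[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n_samples):
        total += max(
            sum(s * loss for s, loss in zip(signs, losses)) / n_samples
            for losses in loss_table
        )
    return total / 2 ** n_samples


if __name__ == "__main__":
    single = [[0, 1, 0]]                # one hypothesis: nothing to fit
    two = [[0, 0], [1, 1]]              # "always right" and "always wrong"
    print(empirical_rademacher(single), empirical_rademacher(two))
    assert isclose(empirical_rademacher(single), 0.0, abs_tol=1e-12)
```

A class with a single hypothesis has zero complexity, since the random signs average out, while richer classes can correlate with the signs and give a positive value.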

Quantum Rademacher Complexity
Let us calculate the Rademacher complexity of the quantum loss function introduced in (2), for which it is clear from the definition that $0 \le \ell(c_k, x_k) \le 1$, as requested. For a fixed embedding, defining $\mathcal{P}$ as the set of all possible POVMs, the Rademacher complexity of this quantum classifier (2) is given by Eq. (A9), where in the second line we used the second equality in Eq. (2), the fact that the constant term (from substituting in Eq. (2)) commutes with the sup and is averaged out by $\mathbb{E}_\sigma$, and finally the fact that the minus sign can be removed by noting that $\sigma$ and $-\sigma$ have the same distribution. With the definition in Eq. (A10), by linearity we may rewrite Eq. (A9) as Eq. (A11). In the following sections we show how to bound $C_T(\mathcal{P})$ using quantities that can be easily computed for a given embedding $x \to \rho(x)$. The main technical result that allows such simple expressions is the following.
Lemma 1. Let $A_i$ be a set of operators and $i$ a random variable with probability distribution $p_i$. Then
$$\mathbb{E}_i\,\|A_i\|_1 \le \mathrm{Tr}\sqrt{\mathbb{E}_i\big[A_i A_i^\dagger\big]}, \tag{A12}$$
where $\mathbb{E}_i[f_i] = \sum_i p_i f_i$.
Proof: We define the positive operators $X_i := A_i A_i^\dagger$. Thanks to the definition of trace norm $\|A\|_1 = \mathrm{Tr}\sqrt{AA^\dagger}$ and the linearity of the trace, it is sufficient to prove that
$$\sqrt{\sum_i p_i X_i} - \sum_i p_i\sqrt{X_i} \ge 0, \tag{A13}$$
where the operator inequality $Y \ge 0$ means that $Y$ is a positive operator. The above inequality is proven as follows.
Since the function $f(x) = x^2$ is operator convex [54], we may write $\big(\sum_i p_i\sqrt{X_i}\big)^2 \le \sum_i p_i X_i$. Moreover, since $g(x) = \sqrt{x}$ is operator monotone [54] we may take the square root of both sides of the above equation and get (A13). Note that a convex combination of positive matrices is also positive, so the convex combination $\sum_i p_i\sqrt{X_i}$ appearing in (A13) is a positive operator, and thus is equal to the square root of its square. This completes the proof.
We are now ready to write the main result of this section, namely a bound that allows us to express the Rademacher complexity of the quantum classifier via the quantity $B$ that was introduced in (10) and Theorem 1. We focus on binary classification problems, where there are two possible classes, which we call 0 and 1, so $c \in \{0, 1\}$ and a POVM consists of two positive operators, $\Pi_0$ and $\Pi_1 = \mathbb{1} - \Pi_0$. Then, we extend the result to a general multiary classification problem with $N_C$ classes.
Theorem 2. For binary classification problems with fixed embedding $x \to \rho(x)$ and POVM $\mathcal{P}_2 = \{\Pi, \mathbb{1} - \Pi\}$, the Rademacher complexity satisfies the bound (A15), where $B$ was defined in Eq. (10). For multiary classification problems with $N_C$ classes we get the bound (A16), which is slightly larger than (A15) when $N_C = 2$.
Proof: We first focus on binary classification problems. Since constant terms are averaged out we can write Eq. (A11), for $\Pi_0 = \Pi$ and $\Pi_1 = \mathbb{1} - \Pi$, in a form where again the constant term is averaged out. The maximization over $\Pi$ can be done by adapting the Helstrom theorem (see Theorem 13.2 in [55]), which yields the chain of inequalities (A19), where $|A| = \sqrt{AA^\dagger}$ and $\|A\|_1 = \mathrm{Tr}\,|A|$. In the first inequality, we are again able to average out the constant term, despite the fact it is within an absolute difference, because setting $\sigma \to -\sigma$ changes its sign whilst not affecting the other term in the absolute difference. The second inequality comes from the fact that $|\mathrm{Tr}[AB]| \le \mathrm{Tr}[A|B|]$ for any operator $B$ and positive operator $A$. The third inequality comes from the linearity of the trace and the fact that the elements of the POVM sum to the identity. The remaining averages are then evaluated as in (A21), where we used the fact that $(\delta_{c_k,0} - \delta_{c_k,1})^2 = 1$ and that $(c_k, x_k)$ are independent and identically distributed. Inserting the above equation into (A19) and using $P(x) = \sum_c P(c,x)$ we get (A15), which completes the first part of the theorem.
For the multiary classification problem with $N_C$ equiprobable classes, using Hölder's inequality in (A11), and noting that $\|\Pi_c\|_\infty \le 1$, we obtain a chain of bounds leading to (A16), where in the second line we have substituted the sum over $c$ with an average where $c$ is sampled from the uniform distribution with $p_c = 1/N_C$, in the third line we use Eq. (A12), in the fourth line we perform the averages as in Eq. (A21), and in the last line we simply employ the marginal distribution, as in Eq. (A15).

Bound on the approximation error
In this section we focus on the approximation error (14) and prove the following important result.

Proof: From the definition of the approximation error
where the Bayes classifier is described by $\delta_{c,b(x)}$, with $b(x) = \mathrm{argmax}_c\, P(c|x)$ the Bayes-optimal class. Note that the second summation is over all $c$ and $c'$ such that $c \neq c'$. The approximation error is calculated using the optimal measurement for a given encoding $\rho(x)$; however, we can upper bound it by replacing this optimal measurement with a suboptimal measurement. We may find a suboptimal strategy as follows: we know that there always exists some POVM $\Pi_x$ that obeys the bound (A31) [12,32] for any probability distribution $q(x)$. From $\Pi_x$ we can then construct the POVM $\Pi_c = \sum_x \delta_{c,b(x)}\Pi_x$, namely we first try to learn the value of $x$, and then perform the standard Bayesian classification to get the class. Now there are many possibilities to find bounds depending on the choice of $q(x)$ in (A31). Here we choose to use $q(x) = p(x|c)$, namely we train the measurements to recognise all inputs within a certain class, and then check whether the Bayes classifier predicts a different result. We then obtain the upper bound (A33), where, in the last line, the summation is over all $x$ and $y$ such that $x \neq y$, and where $F^c_{xy} = P(x|c)P(y|c)F(\rho(x), \rho(y))$.
The upper bound (A33) is typically too large to be practical. However, it can be used to show an important result.
Thanks to the above theorem, we see that taking copies of a simple embedding guarantees that A → 0 for N → ∞, as we observe in the numerical simulations shown in Fig. 4.

Appendix B: Further inequalities
In this section we discuss other inequalities and connections with other entropic quantities. We first recall the inequality (B1), which will be extensively used in this section and is valid for any set of positive operators $X_i$ (see Ref. [56] for a proof). We first discuss the risk (2) and empirical risk (3) for multiary classification problems with $N_C$ classes. In this case there is no known analytic form of the optimal POVM, but suboptimal choices can be constructed using pretty good measurements [12,57]: calling $T_c$ the number of samples in the training set with class $c$, we may write $R_T(\Pi) = 1 - \sum_c \frac{T_c}{T}\mathrm{Tr}[\Pi_c\rho^T_c]$ and the error for an optimal measurement can be bounded as in [12,57], where $F(\rho,\sigma) = \|\sqrt{\rho}\sqrt{\sigma}\|_1$ is the quantum fidelity. Using the strong concavity of the fidelity we obtain a bound which is a multiclass generalization of (18). Therefore, even for classification problems with $N_C > 2$ low risk is possible when the fidelity between the average states with inputs belonging to different classes is low. As for the generalization error, we see that the complexity $B$ defined in Eq. (10) does not depend on the classes. Nonetheless, using (B1) we also get the inequalities (B5). For pure state embeddings, $\sigma_c = \rho_c$ and we get (19). Another interesting bound can be found by applying the Cauchy-Schwarz inequality $\mathrm{Tr}[\sqrt{X}]^2 \le \mathrm{Tr}[X]\,\mathrm{Tr}[\mathbb{1}]$ to (10). We find $B \le D\sum_x P(x)\mathrm{Tr}[\rho(x)^2]$, where $D$ is the dimension of the embedding Hilbert space. We now study the embedding $\rho(x)$ using tools from quantum information. We define an extended tripartite mixed state $\rho_{CXQ}$ as in Eq. (11), where $CX$ is the data Hilbert space, $C$ is spanned by the classes $|c\rangle$ and $X$ by the labels $|x\rangle$, while $Q$ is the Hilbert space of the quantum embedding $\rho(x)$. We now introduce the Rényi conditional mutual information $I_\alpha(X{:}Q|C)$ of $\rho_{CXQ}$ following Prop. 8 from [39], where $\rho_{CX} = \mathrm{Tr}_Q[\rho_{CXQ}] = \sum_{x,c} P(c,x)|cx\rangle\langle cx|$, and $\rho_C = \mathrm{Tr}_X[\rho_{XC}] = \sum_c P(c)|c\rangle\langle c|$.
When multiplying operators that span different Hilbert spaces, it is implicit that the operators take a tensor product with the identity on the spaces that they do not span [e.g., $\rho_{CX}\,\rho_{CXQ} = (\rho_{CX} \otimes \mathbb{1}_Q)\,\rho_{CXQ}$]. By explicit computation, one obtains a closed-form expression for the conditional mutual information. We note a similarity between this expression and the quantities that are found in the generalization bound (B5). Indeed, for a uniform distribution over $N_C$ classes, $P(c) = 1/N_C$, i.e., when all classes are equally likely, we find from (B5) that $B \le 2^{I_2(X:Q|C)}\,N_C$, and thus show a direct link between the generalization bound and the Rényi conditional mutual information of $\rho_{CXQ}$. Therefore, good generalization is possible whenever $I_2(X{:}Q|C)$ is small or for large training sets, $T \gg 2^{I_2(X:Q|C)}\,N_C$.

We can interpret the space $Q$ in (11) as a compression of the input into a quantum state. Assuming that the conclusions from Ref. [48] (which were originally formulated for the classical Shannon entropy) can be trivially extended to Rényi entropies, optimal compression happens when

$$I_\alpha(X{:}Q|C) = I_\alpha(X{:}C|Q) = I_\alpha(C{:}Q|X) = 0. \tag{B16}$$

A zero conditional mutual information means that the three systems form a Markov chain: conditioning over one of the three systems makes the other two mutually independent. Quantum mechanically, $I(X{:}Q|C) = 0$ holds under a condition that is not generally satisfied by the state (11).

According to Ref. [48], we should both minimize $I(X{:}Q|C)$ and maximize $I(C{:}Q)$. The Rényi generalization of $I(C{:}Q)$ can be obtained using a similar expression to (B11), by applying the expression for conditional mutual information to a state that is independent of the conditioning system [i.e., by calculating $I(C{:}Q|X)$ for the state $\rho_{CQ} \otimes \rho_X$]. We get an expression of the form $I_\alpha(C{:}Q) = \frac{\alpha}{\alpha-1}\log_2 \mathrm{Tr}[\cdots]$, written in terms of the states $\rho_c$ introduced in Eq. (1). Simpler expressions that are directly connected with the fidelity may be obtained for $\alpha = 1/2$.
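To make the Rényi mutual information between classes and embeddings concrete, the sketch below evaluates it for a small ensemble. It assumes the standard closed form for classical-quantum states, $I_\alpha(C{:}Q) = \frac{\alpha}{\alpha-1}\log_2 \mathrm{Tr}\big[\big(\sum_c P(c)\,\rho_c^\alpha\big)^{1/\alpha}\big]$; this closed form, the function names, and the two-class qubit examples are assumptions made for illustration and may differ in presentation from the expression used in the text.

```python
import numpy as np

def psd_power(a, power):
    """Positive fractional power of a Hermitian positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # remove tiny negative round-off
    return (vecs * vals**power) @ vecs.conj().T

def renyi_mutual_information_cq(class_probs, class_states, alpha):
    """Assumed closed form of I_alpha(C:Q) for a classical-quantum ensemble.

    class_probs[c] is P(c), class_states[c] is the average state rho_c;
    alpha must be positive and different from 1.
    """
    mixture = sum(p * psd_power(rho, alpha)
                  for p, rho in zip(class_probs, class_states))
    trace = np.trace(psd_power(mixture, 1.0 / alpha)).real
    return float(alpha / (alpha - 1.0) * np.log2(trace))

# Two equally likely classes embedded in a qubit (illustrative choice).
zero = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)  # |0><0|
one = np.array([[0.0, 0.0], [0.0, 1.0]], dtype=complex)   # |1><1|
probs = [0.5, 0.5]

i_orth_half = renyi_mutual_information_cq(probs, [zero, one], alpha=0.5)
i_orth_two = renyi_mutual_information_cq(probs, [zero, one], alpha=2.0)
i_same = renyi_mutual_information_cq(probs, [zero, zero], alpha=0.5)
print(i_orth_half, i_orth_two, i_same)
```

Perfectly distinguishable class states carry one bit about a uniformly distributed binary class, while identical class states carry none, in line with the requirement that $I(C{:}Q)$ be large for a good embedding.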
For pure state embeddings $\rho(x)^\alpha = \rho(x)$, and we get an expression for $I_{1/2}(X{:}Q|C)$ in terms of $F(c) = \sum_{x,y} P(x|c)\,P(y|c)\,F(\rho(x),\rho(y))^2$, the average squared fidelity between embeddings of inputs from the same class $c$. Therefore, $F(c)$ should be maximized to minimize $I_{1/2}(X{:}Q|C)$. In other words, low conditional mutual information is possible when $F(\rho(x),\rho(y)) \simeq 1$ for $x$ and $y$ belonging to the same class. Moreover, $I_{1/2}(C{:}Q)$ is related to the fidelity between the average states $\rho_c$ and $\rho_{c'}$ for different classes. Since the logarithm is monotonic, the minimization of $-I_{1/2}(C{:}Q)$ is possible by minimizing $F(\rho_c,\rho_{c'})$. Therefore, using Rényi entropies with $\alpha = 1/2$, we recover the same conclusions as in the previous section: a good embedding is one for which the fidelity between two embedded states is small if the inputs are from different classes and high if the inputs are from the same class.

Here the operator $\hat K$ has entries $\hat K_{x,y} = \sqrt{p(x)p(y)}\,\langle\psi(x)|\psi(y)\rangle$.
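The within-class quantity $F(c) = \sum_{x,y} P(x|c)P(y|c)F(\rho(x),\rho(y))^2$ is easy to evaluate for pure state embeddings, where $F(\rho(x),\rho(y)) = |\langle\psi(x)|\psi(y)\rangle|$. The sketch below is a hedged illustration: the single-qubit embedding `qubit_state`, the angle ranges, and the uniform $P(x|c)$ are hypothetical choices, not taken from the text. It also checks the elementary identity that, for pure states, $F(c)$ equals the purity $\mathrm{Tr}[\rho_c^2]$ of the class-average state.

```python
import numpy as np

def qubit_state(theta):
    """Hypothetical embedding: cos(theta/2)|0> + sin(theta/2)|1>."""
    return np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)

def average_squared_fidelity(cond_probs, states):
    """F(c) = sum_{x,y} P(x|c) P(y|c) |<psi(x)|psi(y)>|^2 for pure states.

    states[x] is the state vector |psi(x)> stored as a row.
    """
    overlaps = states.conj() @ states.T  # overlaps[x, y] = <psi(x)|psi(y)>
    return float(np.real(cond_probs @ (np.abs(overlaps) ** 2) @ cond_probs))

def purity_of_average_state(cond_probs, states):
    """Tr[rho_c^2] with rho_c = sum_x P(x|c) |psi(x)><psi(x)|."""
    rho_c = np.einsum("x,xi,xj->ij", cond_probs, states, states.conj())
    return float(np.real(np.trace(rho_c @ rho_c)))

cond_probs = np.full(5, 1 / 5)  # uniform P(x|c) over five inputs
tight = np.array([qubit_state(t) for t in np.linspace(0.0, 0.1, 5)])
spread = np.array([qubit_state(t) for t in np.linspace(0.0, np.pi, 5)])

f_tight = average_squared_fidelity(cond_probs, tight)
f_spread = average_squared_fidelity(cond_probs, spread)
purity_tight = purity_of_average_state(cond_probs, tight)
print(f_tight, f_spread, purity_tight)
```

Inputs of the same class that are mapped to nearly identical states give $F(c)$ close to 1, which is the regime where the conditional mutual information is low; spreading the same class over distant states lowers $F(c)$.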
When $|z-1| < 1$, the square root function admits the expansion

$$\sqrt{z} = \sum_{k=0}^{\infty} \frac{(-1/2)_k}{k!}\,(1-z)^k,$$

where $(a)_k = a(a+1)\cdots(a+k-1)$ is the Pochhammer symbol. Since $(z-1)^k$ can be expressed as a sum of $z^n$ for $n \le k$, we can apply the above lemma to show that, for an ensemble of pure states, Eq. (21) holds. Note that the matrices $\hat K$ and $K$ are related by a similarity transformation and, accordingly, have the same eigenvalues.

For almost diagonal kernel matrices, we may write $\hat K = K_d + K_o$, where $(K_d)_{xy} = p(x)\,\delta_{xy}$ is the diagonal part and $K_o$ the off-diagonal part, as in Eq. (C2) with $x \neq y$. If the off-diagonal elements are much smaller [$O(\epsilon)$] than the diagonal ones, then we may expand $\sqrt{\hat K} = \sqrt{K_d} + \epsilon K_1 + \epsilon^2 K_2 + O(\epsilon^3)$. Squaring both sides, we can find $K_1$ and $K_2$, and

$$\mathrm{Tr}\sqrt{K} = 2^{H_{1/2}(X)/2} - \frac{1}{4}\sum_{x\neq y} \frac{\sqrt{p(x)p(y)}}{\sqrt{p(x)}+\sqrt{p(y)}}\,F_{xy}^2 + O(\epsilon^4), \tag{C7}$$

where $F_{xy} = |\langle\psi(x)|\psi(y)\rangle|$ and $2^{H_{1/2}(X)/2} = \mathrm{Tr}[\sqrt{K_d}]$.
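The almost-diagonal expansion can be probed numerically. The sketch below rests on several assumptions that are introduced here for illustration: the kernel is taken as $\hat K_{xy} = \sqrt{p(x)p(y)}\,\langle\psi(x)|\psi(y)\rangle$ (so that its diagonal is $p(x)$), the ensemble consists of four real states whose pairwise overlaps all equal a small `eps`, and the second-order correction is taken as $\frac{1}{4}\sum_{x\neq y}\frac{\sqrt{p(x)p(y)}}{\sqrt{p(x)}+\sqrt{p(y)}}F_{xy}^2$. It compares the exact $\mathrm{Tr}\sqrt{\hat K}$ with the zeroth- and second-order approximations, and checks that the kernel and the average state give the same value.

```python
import numpy as np

def trace_sqrt(matrix):
    """Tr sqrt(M) for a Hermitian positive semidefinite matrix M."""
    vals = np.linalg.eigvalsh(matrix)
    return float(np.sqrt(np.clip(vals, 0.0, None)).sum())

# Illustrative ensemble (assumption): all pairwise overlaps equal eps.
p = np.array([0.1, 0.2, 0.3, 0.4])
eps = 0.01
n = len(p)
overlap = (1.0 - eps) * np.eye(n) + eps * np.ones((n, n))  # <psi(x)|psi(y)>

# Rows of the Cholesky factor are real unit vectors with these overlaps.
states = np.linalg.cholesky(overlap)

sqrt_p = np.sqrt(p)
k_hat = np.outer(sqrt_p, sqrt_p) * overlap  # assumed kernel matrix
rho_bar = states.T @ np.diag(p) @ states    # average state sum_x p(x)|psi><psi|

exact = trace_sqrt(k_hat)
exact_from_state = trace_sqrt(rho_bar)

# Zeroth order: Tr sqrt(K_d) = sum_x sqrt(p(x)).
zeroth_order = float(sqrt_p.sum())

# Second-order correction built from the squared overlaps F_xy^2.
weights = np.outer(sqrt_p, sqrt_p) / (sqrt_p[:, None] + sqrt_p[None, :])
off_diagonal = ~np.eye(n, dtype=bool)
second_order = 0.25 * float(
    np.sum(weights[off_diagonal] * overlap[off_diagonal] ** 2)
)
approx = zeroth_order - second_order
print(exact, exact_from_state, zeroth_order, approx)
```

In this example the overlaps lower $\mathrm{Tr}\sqrt{\hat K}$ below its diagonal value, and including the second-order term leaves a residual that is much smaller than the correction itself.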