Group-Invariant Quantum Machine Learning

Quantum Machine Learning (QML) models are aimed at learning from data encoded in quantum states. Recently, it has been shown that models with little to no inductive biases (i.e., with no assumptions about the problem embedded in the model) are likely to have trainability and generalization issues, especially for large problem sizes. As such, it is fundamental to develop schemes that encode as much information as is available about the problem at hand. In this work we present a simple, yet powerful, framework where the underlying invariances in the data are used to build QML models that, by construction, respect those symmetries. These so-called group-invariant models produce outputs that remain invariant under the action of any element of the symmetry group $\mathfrak{G}$ associated with the dataset. We present theoretical results underpinning the design of $\mathfrak{G}$-invariant models, and exemplify their application through several paradigmatic QML classification tasks, including cases when $\mathfrak{G}$ is a continuous Lie group and also when it is a discrete symmetry group. Notably, our framework allows us to recover, in an elegant way, several well-known algorithms from the literature, as well as to discover new ones. Taken together, we expect that our results will help pave the way towards a more geometric and group-theoretic approach to QML model design.


I. INTRODUCTION
Symmetries have always held a special place in the imaginarium of scientists seeking to understand the universe through physical theories. As such, it is not strange for a scientist to equate a theory's beauty and elegance with its symmetry and harmony [1]. Still, the role of symmetries in science is more than simply aesthetic, as in many cases they constitute the underlying force behind a theory. For instance, Galilean invariance is pivotal in Newton's laws of motion [2], and Lorentz and gauge invariances were fundamental for Maxwell to unify electricity and magnetism into the general theory of electromagnetism [3]. In the 20th century, symmetries would take the center stage as Einstein's theory of general relativity provided the first geometrization of symmetries [4,5]. Soon after, Noether's theorem showed a connection between differentiable symmetries and conserved quantities [6], proving that symmetries have defining implications in nature.
More recently, the importance of symmetries has been explored in the context of machine learning, and is core to the development of the field of geometric deep learning [7]. Here, the key insight was to note that the most successful neural network architectures can be viewed as models with inductive biases that respect the underlying structure and symmetries of the domain over which they act. The inductive bias refers to the fact that the model explores only a subset of the space of functions due to the assumptions imposed on its definition. Geometric deep learning not only constitutes a unifying mathematical framework for studying neural network architectures, but also provides guidelines to incorporate prior physical (and geometrical) knowledge into new architectures with better generalization performance, more efficient data requirements, as well as favorable optimization landscapes [7][8][9][10].
In this work, we propose to import ideas from the field of geometric deep learning to the realm of Quantum Machine Learning (QML). QML has recently emerged as a leading candidate to make practical use of near-term quantum devices [11,12]. QML formally generalizes classical machine learning by embedding it into the formalism of quantum mechanics and quantum computation. This formal generalization can lead to practical speedups with the potential to significantly outperform classical machine learning [13], since quantum computers can efficiently manipulate information stored in quantum states living in exponentially large Hilbert spaces.
Similar to their classical counterparts, the ability of QML models to solve a given task hinges on several factors, with one of the most important being the choice of the model itself. If the inductive biases [14] of a model are uninformed, its expressibility is large, leading to issues such as barren plateaus in the training landscape [15][16][17][18][19][20]. Adding sharp priors to the model narrows the effective search space, enhancing its trainability and improving its generalization performance [21][22][23][24]. As such, a great deal of effort has been recently put forward towards designing more problem-specific schemes with strong inductive biases [25][26][27][28][29][30][31][32][33]. Despite these efforts, most architectures currently used in the literature remain problem-agnostic, as there is no overarching theoretical framework that provides guidelines for how to embed symmetries of the problem into the QML model.
The main contribution of this work is a series of Propositions (Propositions 1, 2, 3 and 4) characterizing the landscape of possible quantum machine learning models that are invariant under a given symmetry group G. By identifying different types of strategies leading to group invariant models, we pave the way for a more systematic and efficient symmetry-informed QML model design. The power of our framework is showcased by applying it to classifying datasets based on purity, time-reversal dynamics, multipartite entanglement, and graph isomorphism, where we are able to recover, in an effortless manner, many celebrated quantum protocols and algorithms including very recent ones [34][35][36][37][38][39][40][41][42][43][44][45][46]. We also discuss the extension of our framework to the design of equivariant quantum neural networks. Finally, we highlight the exciting outlook for the new field of geometric quantum machine learning, for which our article lays some of the groundwork.

II. PRELIMINARIES
Here we provide the background and definitions needed for our group-invariant QML framework.

A. Symmetry groups in supervised QML
In this work we consider supervised binary classification tasks on quantum data. We remark, however, that the methods derived here can be readily applied to more general supervised learning scenarios or to unsupervised learning tasks. In addition, our work applies to classical data that has been encoded into quantum states.
For our purposes, we consider the case where one is given repeated access to a set of labeled training data from a dataset of the form S = {(ρ_i, y_i)}_{i=1}^N. Here, the ρ_i are n-qubit quantum states from a data domain R in a d-dimensional Hilbert space (with d = 2^n), while the y_i are binary labels from a label domain Y = {0, 1}. The data in S are drawn i.i.d. from a distribution defined over R × Y, and we assume that the labels associated with each quantum state are assigned according to some (unknown) function f : R → Y, so that f(ρ_i) = y_i. As shown in Fig. 1(a), the goal is to train a parameterized model (or hypothesis) h_θ to produce labels that match those of the target function f with high probability. Here θ denotes the set of trainable parameters in the model.

For a given dataset, a fundamental question to ask is: what is the set of unitary operations on the states ρ_i that leave their respective labels y_i unchanged? Such a set of operations forms a group [47] G ⊆ U(d), a subgroup of U(d), the unitary group of degree d. In the following, G is referred to as the symmetry group of the dataset. Explicitly, for every element V in G, the label associated with any transformed state V ρ_i V† is exactly the same as the label associated with the original ρ_i. Hence, it is natural to require that the model we are training should produce labels that also remain invariant under the action of G on the data. To capture such invariance, we introduce the following definition.

Definition 1 (G-invariance). A model h_θ is said to be G-invariant if h_θ(V ρ V†) = h_θ(ρ) for all V ∈ G, all states ρ in the data domain, and all θ.

Figure 2. Group invariance in QML. Consider a QML task of classifying single-qubit states according to their purity. The symmetry group G of this task is the unitary group U(2), as any unitary V ∈ U(2) preserves the spectral properties of the data. By imposing G-invariance into the quantum model, so that h_θ(V ρ V†) = h_θ(ρ) for all V ∈ U(2), we can rediscover several known algorithms such as those listed in the figure.
In principle, such G-invariance could be heuristically learned by h_θ via data augmentation [48], i.e., by including additional training instances of the form (V ρ_i V†, y_i). However, this approach is undesirable for two main reasons. First, it obviously implies an increased algorithmic run-time cost. Second, and most importantly, such invariance learning is not guaranteed to be completely successful (especially when G is large or continuous) [7]. Instead, as shown in Fig. 2, our main approach here is to design QML models h_θ that are, by construction, G-invariant for all θ. To achieve this, we introduce biases in the structure of h_θ, for instance, by carefully choosing the architecture of the quantum neural network employed and the physical observable measured.
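The invariance just described is easy to probe numerically. The sketch below is our own illustration (not from the paper; all helper names are ours): it draws a random mixed state and checks that its purity, and hence its label in a purity-classification task, is unchanged under conjugation by a Haar-random unitary, assuming numpy is available.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density_matrix(d, rank):
    """Random rank-`rank` density matrix built from a Gaussian factor A A^dagger."""
    A = rng.normal(size=(d, rank)) + 1j * rng.normal(size=(d, rank))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def random_unitary(d):
    """Haar-random unitary via QR decomposition of a Ginibre matrix."""
    Z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(Z)
    return Q * (np.diag(R) / np.abs(np.diag(R)))  # normalize column phases

def purity(rho):
    return np.trace(rho @ rho).real

d = 4
rho = random_density_matrix(d, rank=2)   # a mixed state
V = random_unitary(d)                    # arbitrary element of U(d)
# The label-determining quantity (purity) is invariant under V rho V^dagger:
print(abs(purity(V @ rho @ V.conj().T) - purity(rho)) < 1e-10)  # True
```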
In the context of binary classification, there are two main scenarios to consider. In the first scenario, the data in both classes are invariant under the same symmetry group G, and thus h_θ needs to be invariant under this sole symmetry group. In the second scenario, the data in different classes have different symmetries, and we denote as G_0 and G_1 the symmetry groups corresponding to data with labels y_i = 0 and y_i = 1, respectively. In such a case, we have the freedom to consider QML models that are either G_0-invariant, G_1-invariant, or both. As shown below, it is often convenient to build models that are invariant only under the action of one of the symmetry groups, as this is sufficient for data classification.

B. Conventional and quantum-enhanced experiments
Thus far, we have not defined what constitutes the parameterized model h_θ. This choice is tied to the physical resources one may have access to, i.e., the way quantum data can be stored, accessed and measured. Since there is a large amount of freedom in this regard, we find it useful to restrict ourselves to two scenarios. Following the demarcation proposed in [49,50], we consider a conventional and a quantum-enhanced setting, defined as follows.

Definition 2 (Conventional experiment). In conventional experiments, each data instance ρ_i is processed in a quantum computer and measured individually.
Definition 3 (Quantum-enhanced experiment). In quantum-enhanced experiments, multiple copies of each data instance ρ i can be stored in a quantum memory, and later simultaneously processed and measured in a quantum computer.
In both settings, the model predictions are obtained from the outcomes of the quantum experiment. However, as illustrated in Fig. 3, the key difference between the conventional and quantum-enhanced settings is that, in the latter, the QML model is allowed to act coherently on multiple copies of ρ_i. This is in contrast with the conventional setting, where the QML model can only operate on a single copy of ρ_i at a time.

C. QML model structure
Throughout this work we consider models consisting of a quantum neural network U(θ) (i.e., a parameterized unitary that can be realized on a quantum computer) operating on k copies of an input state ρ, followed by a measurement on the resulting state. In other words, we work with models belonging to the following hypothesis class.
Hypothesis Class 1. We define the Hypothesis Class H_1 as composed of functions of the form

h^(k)_θ(ρ) = Tr[ρ^{⊗k} U†(θ) O U(θ)] ,   (2)

where k is the number of copies of the data state ρ, U(θ) is a quantum neural network, and O is a Hermitian operator.

Figure 3. Conventional and quantum-enhanced experiments. a) In a conventional experiment, each data instance ρ_i in the dataset is sent to a quantum device. The state obtained at the output of the quantum computation is measured, with the measurement outcomes used to make predictions. b) In a quantum-enhanced experiment, a quantum memory allows us to store several copies of each state ρ_i in the dataset. These copies can be simultaneously sent to a quantum device. The state obtained at the output of the quantum computation is measured, with the measurement outcomes used to make predictions.
For k = 1 copy, the models belonging to the Hypothesis Class H_1 correspond to those that can be computed in a conventional experiment according to Definition 2. On the other hand, k ≥ 2 copies lead to models in quantum-enhanced experiments according to Definition 3. Arguably, models from the Hypothesis Class 1 are not of the most general form. For instance, these could be extended to allow for non-trivial classical post-processing of the measurement outcomes, and also to involve more than one circuit or observable. Still, the Hypothesis Class 1 already encompasses most current QML frameworks [51] and can serve as a basis for more expressive QML models. In the following we restrict our attention to models pertaining to H_1 and leave the study of more general models for future work.
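As a concrete rendering of this hypothesis class, the snippet below (our own sketch; the helper names and the particular choices of U and O are ours, not prescribed by the text) evaluates h^(k)_θ(ρ) = Tr[ρ^{⊗k} U†(θ) O U(θ)] by brute-force matrix algebra. The choice O = SWAP with a trivial network anticipates the purity example of Sec. IV.

```python
import numpy as np
from functools import reduce

def model(rho, U, O, k):
    """h(rho) = Tr[rho^{tensor k} U^dagger O U] for a k-copy model in H_1."""
    rho_k = reduce(np.kron, [rho] * k)      # rho ⊗ rho ⊗ ... (k factors)
    O_theta = U.conj().T @ O @ U            # rotated observable O(theta)
    return np.trace(rho_k @ O_theta).real

# Example: k = 2 copies of a single qubit, trivial network U = identity,
# and O = SWAP, for which the model outputs the purity Tr[rho^2].
d = 2
rho = np.array([[0.75, 0.0], [0.0, 0.25]])
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)
U = np.eye(d ** 2)
print(model(rho, U, SWAP, k=2))  # 0.75**2 + 0.25**2 = 0.625
```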

D. Classification accuracy
Let us define some terminology that will allow us to assess the accuracy of a model's classification. First, we remark that we do not consider precision issues when discussing classification accuracy. Recall that the model's predictions in Eq. (2) are expectation values, which in practice need to be estimated via measurements on a quantum computer. Hence, given a finite number of shots (measurement repetitions), these can only be resolved up to some additive errors. However, for the sake of simplicity, we here assume the limit of zero shot noise (i.e., infinite precision), and we will challenge this assumption when appropriate in the results section. With this remark in hand, consider the following definitions of different degrees of classification accuracy.
Definition 4 (Classification Accuracy). i) We say that a model provides no information that can classify the data if its outputs are always the same irrespective of the label associated to the input quantum state. ii) We say that a model performs noisy classification if its outputs are the same for some, but not all, data in different classes. iii) We say that a model perfectly classifies the data if its outputs are never the same for data in different classes.
We note that in some cases a model can, at best, only perform noisy classification as its accuracy will be fundamentally limited by the distinguishability of the quantum states in the dataset. Note that this is typically not an issue for classical datasets, although the issue does arise for noisy classical data. In some cases, for example as in the time-reversal dataset that we consider below, the quantum data states associated with different output labels are nonorthogonal. In this case, perfect classification cannot be achieved, regardless of the form of the model.

E. Useful definitions
In this rather mathematical section we present definitions that will be used throughout the main text. For further reading we refer to [52,53].
While G describes the symmetries in the data, it will be crucial to characterize the symmetries of G itself. The symmetries of a group are captured by the commutant, which is the vector space of all d × d complex matrices that commute with every element of G. More generally, one can also consider the space of matrices in C^{d^k × d^k} commuting with the k-fold tensor power of the elements in G [54].
Definition 5 (k-th order symmetries). Given a unitary representation G ⊆ U(d) of a group, its k-th order symmetries are

C^(k)(G) = {A ∈ C^{d^k × d^k} | [A, V^{⊗k}] = 0, ∀ V ∈ G} ,

for all positive integers k.
First-order symmetries (k = 1) are known as the linear symmetries, while second-order ones (k = 2) are known as the quadratic symmetries of G. In general, there may be k-th order symmetries that are not Hermitian (and thus not physical observables). However, as proved in Appendix A, any matrix in C^(k)(G) has a non-zero projection onto the Hermitian subspace of C^(k)(G). Hence, one can always associate any non-Hermitian element of C^(k)(G) with a Hermitian one that also belongs to C^(k)(G).
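The Hermitian-projection remark can be checked numerically. Below is our own toy example (not from the paper): we take a deliberately non-Hermitian element of C^(2)(U(2)) and verify that both it and its Hermitian part commute with V ⊗ V for a random unitary V.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
Q, R = np.linalg.qr(Z)
V = Q * (np.diag(R) / np.abs(np.diag(R)))    # Haar-random V in U(2)
W = np.kron(V, V)                            # its 2-fold tensor power

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
A = SWAP + 2j * np.eye(4)                    # non-Hermitian 2nd-order symmetry

herm = (A + A.conj().T) / 2                  # Hermitian part of A (here: SWAP)
print(np.allclose(A @ W, W @ A))             # True: A is in C^(2)(U(2))
print(np.allclose(herm @ W, W @ herm))       # True: so is its Hermitian part
```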
While the k-th order symmetries can be defined for any group, when G is a Lie group there is additional structure that one can exploit. In particular, there exists an associated Lie algebra g ⊆ u(d) such that e^g = G, that is, g = {g ∈ u(d) | e^g ∈ G}. Here, u(d) denotes the Lie algebra of d × d skew-Hermitian (anti-Hermitian) matrices. We also find it convenient to introduce the following definition:

Definition 6 (Orthogonal complement). Given a Lie algebra g ⊆ u(d), its orthogonal complement with respect to the Hilbert-Schmidt inner product is defined as

g⊥ = {A ∈ u(d) | Tr[A†B] = 0, ∀ B ∈ g} .

Note that g⊥ is in general not a Lie algebra.

III. GENERAL RESULTS FOR G-INVARIANCE
In this section we determine conditions leading to models that are G-invariant by design. These results are stated in a general problem-agnostic way and will be applied to specific datasets in Secs. IV and V.

A. A single symmetry group
Let us first consider the case when there is a single symmetry group G associated with all the instances in the dataset. We aim at finding models h^(k)_θ from Hypothesis Class H_1 that are G-invariant, i.e., models such that h^(k)_θ(V ρ V†) = h^(k)_θ(ρ) for all V ∈ G and all ρ. Evidently, the model will be invariant under G if [O(θ), V^{⊗k}] = 0 for all V ∈ G, where O(θ) = U†(θ) O U(θ). Thus, the following proposition holds:

Proposition 1. Let h^(k)_θ ∈ H_1 be a model in Hypothesis Class 1, and let G be the symmetry group associated with the dataset. The model will be G-invariant if O(θ) ∈ C^(k)(G).

Proof. The proof of this proposition follows from Definition 5. If O(θ) belongs to the vector space of the k-th order symmetries of G, then for any V ∈ G

h^(k)_θ(V ρ V†) = Tr[V^{⊗k} ρ^{⊗k} (V†)^{⊗k} O(θ)] = Tr[ρ^{⊗k} O(θ)] = h^(k)_θ(ρ) ,

where we have used the cyclicity of the trace together with [O(θ), V^{⊗k}] = 0.

Furthermore, as previously discussed, we can guarantee that O(θ) in Proposition 1 can always be taken as a Hermitian operator, and thus as an observable.
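A quick numerical sanity check of this invariance mechanism (our own sketch, with a trivial U(θ) and names of ours): for k = 2 and G = U(d), the SWAP operator commutes with V^{⊗2}, and the resulting model value is unchanged when the input state is conjugated by a random unitary V.

```python
import numpy as np

rng = np.random.default_rng(2)

def haar_unitary(d):
    Z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(Z)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def h2(rho, O):
    """Two-copy model prediction Tr[(rho ⊗ rho) O], with trivial U(theta)."""
    return np.trace(np.kron(rho, rho) @ O).real

d = 2
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
rho = np.diag([0.9, 0.1]).astype(complex)
V = haar_unitary(d)

# O = SWAP is a 2nd-order symmetry of U(d) ...
assert np.allclose(SWAP @ np.kron(V, V), np.kron(V, V) @ SWAP)
# ... hence the model is invariant under conjugation of the data:
rho_rot = V @ rho @ V.conj().T
print(abs(h2(rho_rot, SWAP) - h2(rho, SWAP)) < 1e-10)  # True
```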
Complementary to Proposition 1, we now prescribe a second way of ensuring G-invariance of the model when G forms a Lie group. This is achieved when the operator O(θ) can be taken orthogonal to (V ρ V†)^{⊗k} for all V in G and for all ρ in S. We formalize this statement in the following proposition, proved in Appendix B.
Proposition 2. Let h^(k)_θ ∈ H_1 be a model in Hypothesis Class 1. Then, let G be the symmetry Lie group associated with the dataset, and let g ⊆ u(d) be its Lie algebra with i𝟙 ∈ g. The model will be G-invariant when ρ ∈ ig and O(θ) ∈ span({A_j ⊗ Ā_j}_j). Here, A_j ∈ ig⊥, an element of the orthogonal complement of g, is a Hermitian operator acting on the j-th copy of ρ, and Ā_j is an operator acting on all copies of ρ but the j-th one.
Note that in Proposition 2 we have assumed that i𝟙 ∈ g, where 𝟙 denotes the d × d identity matrix. However, it could happen that i𝟙 is in g⊥ instead. In this case, the proposition will hold if ρ ∈ ig⊥ and A_j ∈ ig, as ρ needs to have support on a vector space containing the identity.

B. Multiple symmetry groups
Let us now consider the case when each of the two classes in the dataset has a different symmetry group associated with it, which we denote as G_0 and G_1. The concepts used in the previous section to obtain group-invariant models, i.e., the commutant and the orthogonal complement, can also be leveraged to derive conditions under which a model h^(k)_θ is G_0-invariant, G_1-invariant, or both. The following proposition, proved in Appendix C, generalizes Proposition 1 to the case of two symmetry groups.
Proposition 3. Let h^(k)_θ ∈ H_1 be a model in Hypothesis Class 1, and let G_0 and G_1 be the symmetry groups associated with the dataset. The model will be G_0- and G_1-invariant if O(θ) ∈ C^(k)(G_0) ∩ C^(k)(G_1). In addition, the model will be G_0-invariant but not necessarily G_1-invariant if O(θ) ∈ C^(k)(G_0) but O(θ) ∉ C^(k)(G_1).

Conversely, while not stated explicitly in Proposition 3, the model will be G_1-invariant but not necessarily G_0-invariant if O(θ) ∈ C^(k)(G_1) but O(θ) ∉ C^(k)(G_0). Additionally, when G_0 and G_1 are Lie groups, with associated Lie algebras g_0 and g_1, we can generalize Proposition 2 to the case of two symmetry groups. Then, the following proposition, proved in Appendix C, holds.
Proposition 4. Let h^(k)_θ ∈ H_1 be a model in Hypothesis Class 1, and let G_0 and G_1 be the symmetry Lie groups associated with the dataset, with g_0 and g_1 their respective Lie algebras with i𝟙 ∈ g_0, g_1. The model will be G_0-invariant and G_1-invariant when ρ ∈ ig_0, ig_1 and when O(θ) ∈ span({A_j ⊗ Ā_j}_j), where A_j ∈ ig_0⊥ ∩ ig_1⊥ is a Hermitian operator acting on the j-th copy of ρ, and Ā_j is an operator acting on all copies of ρ but the j-th one. In addition, the model will be G_{0(1)}-invariant but not necessarily G_{1(0)}-invariant when ρ ∈ ig_0, ig_1 and A_j ∈ ig⊥_{0(1)} but A_j ∉ ig⊥_{1(0)}.
In Proposition 4 we have assumed that i𝟙 belongs to both g_0 and g_1. However, if i𝟙 instead belongs to g_i⊥ (with i = 0, 1), then the proposition will hold by replacing g_i with g_i⊥, and conversely.
Propositions 1-4 provide conditions under which one can guarantee that a QML model in Hypothesis Class 1 is G-invariant. While the results presented in this section are valid for the case of one or two symmetry groups, the previous propositions can be readily extended to more general scenarios (such as multi-class classification) where one has a set {G_i}_i of symmetry groups. For instance, one could generalize Proposition 3 to show that a model will be invariant under all symmetry groups in {G_i}_i if O(θ) ∈ ∩_i C^(k)(G_i).

IV. LIE GROUP-INVARIANT MODELS
We now apply the general results presented in the previous section to identify G-invariant models that can classify states originating from several paradigmatic quantum datasets whose invariances are captured by Lie groups.
These include the purity dataset (Sec. IV A), the time-reversal dataset (Sec. IV B), and the multipartite entanglement dataset (Sec. IV C). Our results are stated in the form of theorems. For pedagogical reasons, we include in the main text the proofs of most of these theorems, as they provide a constructive introduction to our framework.

A. Purity dataset
As a first application, we consider the QML task of classifying n-qubit states according to their purity [49]. Given the dataset S = {(ρ_i, y_i)}_{i=1}^N, we want to discriminate those ρ_i that are pure from those that are not. That is, we assign labels to states ρ_i according to the values of their purities Tr[ρ_i²]. The symmetry group G associated with the data in both classes is the group of unitaries U(d). This follows from the fact that unitaries preserve the spectral properties of quantum states, and thus their purity remains unchanged under the action of U(d).

Conventional experiments
Let us first consider the case of conventional experiments (see Definition 2), i.e., when the model h^(1)_θ in Eq. (2) has access to k = 1 copy of each data state at a time. For such a case, we can derive the following theorem.

Theorem 1. Let h^(1)_θ ∈ H_1 be a model with k = 1 copy, and let G = U(d) be the symmetry group of the purity dataset. Then, any G-invariant model h^(1)_θ provides no information that can classify the data (according to Definition 4).

Proof. The strategy of this proof is as follows. First, we identify the possible G-invariant models arising from Propositions 1 and 2. Then, we show that these models cannot be used to perform classification for the purity dataset. Finally, we prove that no other G-invariant model within H_1 (with k = 1 copy) exists.
Recall from Proposition 1 that a model is G-invariant if O(θ) belongs to the commutant C(G). Since the representation of U(d) acts irreducibly [55] on C^d, we know from Schur's Lemma [52] that

C(U(d)) = span{𝟙} .

It follows that if O(θ) is in C(G), it takes the form O(θ) = λ𝟙. Moreover, we impose λ ∈ R to ensure the Hermiticity of O(θ). This yields a constant model prediction h^(1)_θ(ρ) = λ Tr[ρ] = λ for any ρ. Hence, the G-invariant models of Proposition 1 do not provide any information about the purity of a state, and thus cannot classify the data.
Let us now analyze the models arising from Proposition 2. Since g = u(d), its orthogonal complement is trivial, g⊥ = {0}, so that the only available choice is O(θ) = 0, which yields h^(1)_θ(ρ) = 0 for any ρ. This shows that the G-invariant models arising from Proposition 2 do not provide any information about the purity of a state and cannot classify the data.
So far, we have seen that G-invariant models obtained by applying Propositions 1 and 2 do not allow for classification of the purity dataset. Still, this does not preclude the existence of other G-invariant models within H_1 that may be adequate for classification. However, we now prove that no other G-invariant models exist beyond those already considered. Given that h^(1)_θ(V ρ V†) = h^(1)_θ(ρ) for all V ∈ G, the latter also needs to hold when uniformly averaging over the group, i.e., ∫ dμ(V) h^(1)_θ(V ρ V†) = h^(1)_θ(ρ). The left-hand side of the equality is evaluated as

∫ dμ(V) h^(1)_θ(V ρ V†) = ∫ dμ(V) Tr[V ρ V† O(θ)] = Tr[(∫ dμ(V) V ρ V†) O(θ)] = Tr[(∫ dμ(V) (WV) ρ (WV)†) O(θ)] = Tr[O(θ)]/d ,   (11)

where the integral denotes the Haar average over the unitary group and W is an arbitrary fixed unitary. In the second equality, we have used the linearity of the trace and of the integral. Then, in the third equality, we have used the left-invariance of the Haar measure. Finally, in the fourth equality, we have explicitly computed the integration via the Weingarten calculus [56,57], which yields ∫ dμ(V) V ρ V† = Tr[ρ] 𝟙/d. From Eq. (11) we can see that the only way for h^(1)_θ to be G-invariant is for its prediction to equal the constant Tr[O(θ)]/d, independent of ρ.

The previous result shows that one cannot classify the data in the purity dataset with a model in H_1 operating in a conventional experiment. In hindsight, one could have foreseen this result. Indeed, computing the purity requires evaluating a polynomial of order two in the matrix elements of ρ, and thus linear functions such as the ones considered here are doomed to fail. From a QML perspective, h^(1)_θ is ultimately a linear classifier where the parameterized quantum neural network U(θ) defines a hyperplane such that the expectation value of O(θ) is positive for one class and negative for the other. However, the manifolds of quantum states with different purities are not linearly separable in state space. This can be exemplified by single-qubit states in the Bloch sphere, where no plane can be drawn across the sphere that linearly separates pure and mixed states.
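The Weingarten-calculus step ∫ dμ(V) V ρ V† = Tr[ρ] 𝟙/d used above can be checked by Monte-Carlo sampling over Haar-random unitaries. This is a numerical illustration of ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def haar_unitary(d):
    Z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(Z)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

d = 2
rho = np.array([[0.8, 0.1], [0.1, 0.2]], dtype=complex)

# Monte-Carlo estimate of the twirl  int dmu(V)  V rho V^dagger
samples = 10000
twirl = np.zeros((d, d), dtype=complex)
for _ in range(samples):
    V = haar_unitary(d)
    twirl += V @ rho @ V.conj().T
twirl /= samples

print(np.round(twirl.real, 1))  # approximately identity / d
```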
In the spirit of kernel tricks [58], one can introduce nonlinearities by allowing the models h^(k)_θ to coherently access multiple copies of ρ. This is precisely the setting of quantum-enhanced experiments, which we now explore.

Quantum-enhanced experiments
We now consider the case of quantum-enhanced experiments (see Definition 3), where multiple copies of a state in the dataset can be operated over in a coherent manner. As we now see, k = 2 copies is already enough for classifying states according to their purity.

Theorem 2. Let h^(2)_θ ∈ H_1 be a model with k = 2 copies, and let G = U(d). The model is G-invariant if O(θ) = a_1 𝟙 ⊗ 𝟙 + a_2 SWAP with a_1, a_2 ∈ R, in which case it perfectly classifies the purity dataset for any choice of a_2 ≠ 0.

Proof. Recall from Proposition 1 that a model h^(k)_θ is G-invariant if O(θ) ∈ C^(k)(G). From the Schur-Weyl duality [59] we know that the k-th order symmetries of U(d) are given by

C^(k)(U(d)) = span(S_k) ,

with S_k the representation of the Symmetric group that acts by permuting subsystems of the k-fold tensor product of the input state (depicted in Fig. 4).

Figure 4. Elements of the Symmetric group. The Symmetric group S_k is composed of the k! distinct permutations over k indices. Here we illustrate its representation acting on tensor-product systems for the cases of k = 1, 2 and 3 copies of n-qubit states (each represented as a line). For example, the element SWAP ⊗ 𝟙 ∈ S_3, depicted second from left to right, acts on a tensor-product state as SWAP ⊗ 𝟙 |ψ_1⟩|ψ_2⟩|ψ_3⟩ = |ψ_2⟩|ψ_1⟩|ψ_3⟩.
As shown in Fig. 4, for the case of k = 2 copies, this group contains only two elements [60]: 𝟙 ⊗ 𝟙, the identity acting on each of the two copies of ρ, and SWAP, the operator exchanging these copies. As a consequence, h^(2)_θ is G-invariant if

O(θ) = a_1 𝟙 ⊗ 𝟙 + a_2 SWAP ,

with a_1, a_2 ∈ R. The latter is diagrammatically presented in Fig. 5. This yields predictions

h^(2)_θ(ρ) = Tr[ρ ⊗ ρ (a_1 𝟙 ⊗ 𝟙 + a_2 SWAP)] = a_1 + a_2 Tr[ρ²] ,

showing that the model will be able to perfectly classify the states in the purity dataset (according to Definition 4) for any choice of a_2 ≠ 0.
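The prediction formula h^(2)_θ(ρ) = a_1 + a_2 Tr[ρ²] can be verified directly. In this minimal sketch of ours, with a_1 = 0 and a_2 = 1, the model returns 1 for a pure state and 1/2 for the maximally mixed single-qubit state:

```python
import numpy as np

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

def h2(rho, a1=0.0, a2=1.0):
    """Prediction Tr[(rho ⊗ rho) (a1 * I ⊗ I + a2 * SWAP)] = a1 + a2 Tr[rho^2]."""
    O = a1 * np.eye(4) + a2 * SWAP
    return np.trace(np.kron(rho, rho) @ O).real

pure = np.array([[1, 0], [0, 0]], dtype=complex)     # Tr[rho^2] = 1
mixed = np.eye(2, dtype=complex) / 2                 # Tr[rho^2] = 1/2
print(h2(pure), h2(mixed))   # pure -> 1.0, maximally mixed -> 0.5
```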
We now make several remarks regarding the results in Theorem 2, and regarding our framework in general. First, we note that while Proposition 1 provides a straightforward guideline to obtain G-invariant models, it does not prescribe how to actually build the quantum neural network U(θ) and the measurement operator O ensuring that O(θ) = U†(θ) O U(θ) is a k-th order symmetry. All that we know is the specific form that the resulting O(θ) needs to have. Thus, it is still necessary to find an adequate ansatz for U(θ) and an appropriate observable O that can be efficiently measured. For instance, it is clear that simply choosing U(θ) = 𝟙 ⊗ 𝟙 for all θ and O = SWAP satisfies the conditions of Theorem 2. This has the issue that one cannot efficiently estimate the expectation value of the SWAP operator using a model such as h^(2)_θ, as its Pauli decomposition has a number of terms that scales exponentially with n. However, it is well known that by adding an ancilla qubit and by using the Hadamard test [34] one can efficiently estimate the expectation value of the SWAP operator. In Appendix D, we show how our present formalism can be applied to models that include an ancillary qubit. Surprisingly, the latter allows us to discover a new connection between the Swap Test [34] and the ancilla-based algorithm of Ref. [61].

Figure 5. Diagrammatic tensor representation of: a) the G-invariance condition, where each line corresponds to a d-dimensional Hilbert space hosting a copy of ρ and boxes represent unitary operations; b) the quadratic symmetries of U(d), which are spanned by the identity 𝟙 ⊗ 𝟙 and the SWAP operator (depicted on the right); c) the verification that the resulting models satisfy the conditions of Theorem 2.
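The swap-test circuit mentioned above (a Hadamard on an ancilla, a controlled-SWAP, a second Hadamard, then measurement of the ancilla) can be simulated at the density-matrix level. The snippet below is our own brute-force sketch and checks the textbook relation p(0) = (1 + Tr[ρσ])/2:

```python
import numpy as np

def cswap(d):
    """Controlled-SWAP on ancilla ⊗ (two d-dimensional registers)."""
    D = d * d
    swap = np.zeros((D, D))
    for i in range(d):
        for j in range(d):
            swap[i * d + j, j * d + i] = 1.0   # |j>|i> -> |i>|j>
    P0 = np.diag([1.0, 0.0])                   # ancilla projectors
    P1 = np.diag([0.0, 1.0])
    return np.kron(P0, np.eye(D)) + np.kron(P1, swap)

def swap_test_p0(rho, sigma):
    """Probability of measuring the ancilla in |0> in the swap test."""
    d = rho.shape[0]
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    anc0 = np.diag([1.0, 0.0])
    state = np.kron(anc0, np.kron(rho, sigma))        # |0><0| ⊗ rho ⊗ sigma
    U = np.kron(H, np.eye(d * d)) @ cswap(d) @ np.kron(H, np.eye(d * d))
    state = U @ state @ U.conj().T
    proj0 = np.kron(anc0, np.eye(d * d))
    return np.trace(proj0 @ state).real

rho = np.array([[0.75, 0.0], [0.0, 0.25]])
# p(0) = (1 + Tr[rho sigma]) / 2; with sigma = rho this encodes the purity:
print(abs(swap_test_p0(rho, rho) - (1 + 0.625) / 2) < 1e-10)  # True
```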

B. Time-reversal dataset
In this section we are interested in classifying states according to whether they are obtained from a time-reversal-symmetric [62] dynamics or from an arbitrary one. That is, the corresponding dataset is again of the form S = {(ρ_i, y_i)}_{i=1}^N. Specifically, the states in the dataset have a label y_i = 1 if they are generated by evolving some (fixed) real-valued fiducial state with a time-reversal-symmetric unitary (and thus are real-valued too), and a label y_i = 0 if they are generated by evolving the same reference state with a Haar-random unitary.
In contrast to the case of the purity dataset previously considered, one can now associate a distinct symmetry group with each of the two classes. On one hand, the states with label y_i = 0 have G_0 = U(d) as a symmetry group. On the other hand, the states with label y_i = 1 have, as a symmetry group, G_1 = O(d), the orthogonal Lie group of degree d. This is because the unitaries in O(d) preserve the time-reversal symmetry of the states (and thus their label). This group can be obtained by exponentiation of the orthogonal Lie algebra o(d), which consists of d × d real skew-symmetric matrices. Here, note that o(d) corresponds to the purely real-valued subspace of the unitary algebra u(d). Its orthogonal complement o(d)⊥ corresponds to the purely imaginary subspace of the unitary Lie algebra.
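The claim that O(d) arises from exponentiating real skew-symmetric matrices, and that such unitaries preserve real-valued states, can be illustrated numerically. This is our own sketch; it uses a truncated Taylor series for the matrix exponential to stay dependency-free:

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Matrix exponential via truncated Taylor series (fine for small ||A||)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k
        out = out + term
    return out

rng = np.random.default_rng(4)
d = 4
M = rng.normal(size=(d, d))
g = M - M.T                      # real skew-symmetric: an element of o(d)
V = expm_taylor(0.5 * g)         # corresponding group element of O(d)

print(np.allclose(V @ V.T, np.eye(d)))        # True: V is orthogonal
rho = np.full((d, d), 1.0 / d)                # a real-valued (symmetric) state
print(np.allclose((V @ rho @ V.T).imag, 0))   # True: realness is preserved
```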
Having two symmetry classes allows for the design of a new classification strategy. Namely, one can classify the data using a model that is G_1-invariant (but not G_0-invariant), whose predictions satisfy

h^(k)_θ(ρ_i) = c  if y_i = 1 ,   h^(k)_θ(ρ_i) ∈ [b_1, b_2]  if y_i = 0 ,   (18)

with c, b_1 and b_2 real values determined by the measurement operator and the states in the dataset. If c ∉ [b_1, b_2], then Eq. (18) suffices for perfect classification according to Definition 4. If c ∈ [b_1, b_2], we can still use Eq. (18) for noisy classification (see Definition 4), but there could be a chance of misclassification, as one cannot perfectly distinguish between states in different classes yielding the same prediction. Such misclassification events will remain unlikely as long as the probability that h^(k)_θ(ρ_i) is close to c for states with label y_i = 0 remains small. In Appendix E, we present a Lemma that formalizes the previous statement. In any case, for now we assume that a model satisfying Eq. (18) can classify the data in the dataset with high enough probability, and we will challenge this assumption in due course.

Conventional experiments
For the case of conventional experiments, i.e., k = 1 copy in Eq. (2), the results in Proposition 3 cannot be used to find G_1-invariant models classifying the time-reversal dataset. Indeed, since the representation of G_1 = O(d) is irreducible, using Schur's Lemma [52] we know that C(O(d)) = span{𝟙}, i.e., G_1 has no non-trivial linear symmetries that could be exploited for the purpose of classification. Still, we can use Proposition 4 to find group-invariant models. First, we note that since i𝟙 ∈ g_1⊥, the input states ρ_i belong to ig_1⊥ (the space of real symmetric matrices) when they are time-reversal-symmetric, but only to iu(d) when they are Haar random. Hence, one can build models h^(1)_θ that are G_1-invariant but not G_0-invariant. This is formalized below.

Theorem 3. Let h^(1)_θ ∈ H_1 be a model with k = 1 copy and O(θ) ∈ ig_1, i.e., a purely imaginary Hermitian observable. Then the model is G_1-invariant and performs noisy classification (according to Definition 4) of the time-reversal dataset.

Proof. We aim at finding models that are G_1-invariant but not G_0-invariant, distinguishing time-reversal-symmetric states from Haar random ones. According to Proposition 4 (in the form discussed after Proposition 2, since i𝟙 ∈ g_1⊥), the model will be G_1-invariant when O(θ) ∈ ig_1, as V O(θ) V† is also contained in ig_1 for any V ∈ G_1, since a Lie algebra is closed under the adjoint action of its associated Lie group. Because time-reversal-symmetric states ρ_i are exclusively contained in ig_1⊥, it follows from Eq. (5) that h^(1)_θ(ρ_i) = Tr[ρ_i O(θ)] = 0 for all ρ_i with label y_i = 1. Moreover, the previous equation is not satisfied for Haar random states, as these will generally have both real and imaginary parts. As such, h^(1)_θ(ρ_i) will not necessarily be zero for states with label y_i = 0. Hence, the model satisfies Eq. (18), such that it can perform noisy classification (according to Definition 4) of the states in the time-reversal dataset.
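Theorem 3 in action, in our own single-qubit illustration: with O(θ) = Y, a purely imaginary Hermitian observable in ig_1, any real-valued state yields a prediction of exactly zero, while a generic complex state does not.

```python
import numpy as np

Y = np.array([[0, -1j], [1j, 0]])    # purely imaginary Hermitian: element of i*o(2)
rng = np.random.default_rng(5)

# A random real-valued (time-reversal-symmetric) single-qubit state
a = rng.uniform(0, 1)
b = rng.uniform(-1, 1) * np.sqrt(a * (1 - a))   # keeps the state positive
rho_real = np.array([[a, b], [b, 1 - a]])

print(abs(np.trace(rho_real @ Y)) < 1e-12)       # True: prediction is exactly 0

# A generic complex state gives a nonzero prediction
rho_c = np.array([[0.5, 0.3j], [-0.3j, 0.5]])
print(abs(np.trace(rho_c @ Y)))                  # nonzero
```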
So far, we have identified models that yield predicted values of 0 for time-reversal-symmetric states, but yield values in a continuous range for states drawn from the Haar distribution. As such, when taking into account noise in the prediction of the model, any non-time-reversal state with prediction values close to zero may be misclassified. In fact, as proven in Appendix F, Haar random states lead to prediction values that (with probability close to one) lie in a range that becomes exponentially concentrated around zero with the number of qubits $n$. In turn, it can be shown that to classify states in the dataset with a success probability of at least 2/3, one would need to repeat the experiment a number of times that scales as $\Omega(2^{2n/7})$ [49,50,63].
This draws attention to a practical aspect in the design of QML models that we have not previously considered: the scaling in the number of experiment repetitions required for accurate classification. Our framework allows us to identify $G$-invariant models, but we are not guaranteed that such models are practical for large system sizes $n$. In fact, we have seen that an exponential number of repetitions is needed to make practical use of the models in Theorem 3. This motivates us to continue the search for $G$-invariant models in quantum-enhanced experiments, in the hope that these might avoid the exponential scaling present in conventional experiments.

Quantum-enhanced experiments
For quantum-enhanced experiments, i.e., $k = 2$ copies in Eq. (2), we can show that the following theorem holds.

Theorem 4. Let $h^{(2)}_{\boldsymbol{\theta}} \in H_1$ be a model in Hypothesis Class 1, computable in a quantum-enhanced experiment. There always exist quantum neural networks $U(\boldsymbol{\theta})$ and operators $O$, resulting in $O(\boldsymbol{\theta}) = |\Phi^+\rangle\langle\Phi^+|$, with $|\Phi^+\rangle$ the Bell state on $2n$ qubits, such that $h^{(2)}_{\boldsymbol{\theta}}$ is $O(d)$-invariant, but not $U(d)$-invariant, and can perform noisy classification of the states in the time-reversal dataset.
Figure 6. Elements of the Brauer algebra. A basis for the Brauer algebra $B_k$ is composed of the $(2k)!/(2^k k!)$ possible pairings on a set of $2k$ elements, where any element may be matched to any other. Here we illustrate its representation acting on tensor product systems for the cases of $k = 1, 2$ and $3$ copies.
From the Schur-Weyl duality for the orthogonal group, we know that the $k$-th order symmetries of $O(d)$ are given by the Brauer algebra $B_k$ [64]. The elements of $B_k$ are depicted in Fig. 6. As shown in Fig. 7, for $k = 2$ the Brauer algebra is spanned by three elements: the identity $\mathbb{1} \otimes \mathbb{1}$, the SWAP operator, and the projector onto the Bell state $|\Phi^+\rangle\langle\Phi^+|$, where

$|\Phi^+\rangle = \frac{1}{\sqrt{d}} \sum_{j=1}^{d} |j\rangle|j\rangle$

denotes the Bell state on $2n$ qubits. It can be verified that $|\Phi^+\rangle\langle\Phi^+|$ is indeed a quadratic symmetry for $O(d)$. To see this, recall the ricochet property (also called the transpose trick), which states that for any linear operator $A$ acting on a $d$-dimensional Hilbert space

$(A \otimes \mathbb{1})|\Phi^+\rangle = (\mathbb{1} \otimes A^T)|\Phi^+\rangle. \quad (23)$

Using Eq. (23) one can assert that $(V \otimes V)|\Phi^+\rangle\langle\Phi^+|(V \otimes V)^{\dagger} = |\Phi^+\rangle\langle\Phi^+|$ for any $V \in O(d)$ (see Fig. 7(b) for a diagrammatic proof).
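Both the ricochet property of Eq. (23) and the resulting $O(d)$-invariance of the Bell state can be checked numerically. The following numpy sketch (our own illustration, with our own variable names) verifies them for a random operator and a Haar random orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4

# |Phi+> = (1/sqrt d) sum_j |j>|j>, written as a vector of length d^2
phi = np.eye(d).reshape(-1) / np.sqrt(d)
I = np.eye(d)

# Ricochet property: (A (x) 1)|Phi+> = (1 (x) A^T)|Phi+> for any linear A
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
lhs = np.kron(A, I) @ phi
rhs = np.kron(I, A.T) @ phi
print(np.linalg.norm(lhs - rhs))  # ~0

# Consequence: |Phi+> is invariant under V (x) V for any orthogonal V,
# since (V (x) V)|Phi+> = (1 (x) V V^T)|Phi+> = |Phi+>.
q, r = np.linalg.qr(rng.standard_normal((d, d)))
V = q * np.sign(np.diag(r))  # Haar random orthogonal matrix
print(np.linalg.norm(np.kron(V, V) @ phi - phi))  # ~0
```

The second check fails for a generic (non-orthogonal) unitary $V$, which is precisely why $|\Phi^+\rangle\langle\Phi^+|$ lies in $C^{(2)}(O(d))$ but not in $C^{(2)}(U(d))$.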
Figure 7. Quadratic symmetries of $O(d)$. a) The ricochet property of Eq. (23); we schematically show this fundamental property on the left. b) We know that the quadratic symmetries of $O(d)$ are elements of the Brauer algebra $B_2$, whose basis contains three elements: the identity $\mathbb{1} \otimes \mathbb{1}$, the SWAP operator, and the projector onto the Bell state $|\Phi^+\rangle\langle\Phi^+|$. We depict these on the right. c) Using the diagrammatic tensor representation we verify that $|\Phi^+\rangle\langle\Phi^+|$ is invariant under conjugation by $V \otimes V$ for any $V \in O(d)$.

The only element that is in $C^{(2)}(O(d))$ but not in $C^{(2)}(U(d))$ is the projector onto the Bell state $|\Phi^+\rangle\langle\Phi^+|$. Hence, according to Proposition 3, choosing $O(\boldsymbol{\theta}) = |\Phi^+\rangle\langle\Phi^+|$ yields the model

$h^{(2)}_{\boldsymbol{\theta}}(\rho_i) = \operatorname{Tr}\big[|\Phi^+\rangle\langle\Phi^+| \, \rho_i \otimes \rho_i\big], \quad (25)$

which is $O(d)$-invariant but not $U(d)$-invariant. In Fig. 8(a) we show a circuit that could be used to measure this overlap.
Recall that the time-reversal-symmetric states $\rho_i$ are obtained by evolving a real-valued fiduciary state, taken to be $|0\rangle^{\otimes n}$, under a real (orthogonal) unitary. Using the ricochet property, one finds that the model output takes the constant value $h^{(2)}_{\boldsymbol{\theta}}(\rho_i) = \operatorname{Tr}[\rho_i \rho_i^T]/d = 1/d$ for all $\rho_i$ with labels $y_i = 1$. On the other hand, the model output will not be constant for states with labels $y_i = 0$, i.e., for states obtained by evolving $|0\rangle^{\otimes n}$ under a Haar random unitary $W_i$. In this case one has $h^{(2)}_{\boldsymbol{\theta}}(\rho_i) = |\langle\Phi^+| W_i^{\otimes 2} |0\rangle^{\otimes 2n}|^2$, which is generically small.

Figure 8. Circuits for classifying the time-reversal dataset. a) Circuit measuring the overlap between $\rho_i \otimes \rho_i$ and the Bell state. b) Circuit for classifying time-reversal-symmetric dynamics, where we are free to choose the $2n$-qubit initial state $|\Psi_{in}\rangle$ that will be evolved under the action of $W_i^{\otimes 2}$. In both panels we have indicated with a red dashed box the circuit for implementing a Bell-basis measurement. In particular, an all-zero measurement outcome corresponds to the probability of measuring $|\Phi^+\rangle$.
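The two behaviors, a constant value $1/d$ for real states versus a concentrated, much smaller value for Haar random states, are easy to reproduce numerically. Below is a minimal numpy sketch (our own illustration; the function `bell_overlap` is our name, not the paper's) using the closed form $\langle\Phi^+|\,\rho\otimes\rho\,|\Phi^+\rangle = |\sum_j \psi_j^2|^2/d$ for a pure state $|\psi\rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
d = 2 ** n

def bell_overlap(psi):
    # <Phi+| (rho x rho) |Phi+> = |sum_j psi_j^2|^2 / d  for pure rho = |psi><psi|
    return abs(psi @ psi) ** 2 / d  # note: no conjugation in psi @ psi

# Time-reversal-symmetric state: any real unit vector
# (i.e., an orthogonal evolution of the real state |0...0>)
psi_real = rng.standard_normal(d)
psi_real /= np.linalg.norm(psi_real)

# Haar random state: normalized complex Gaussian vector
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
psi_haar = z / np.linalg.norm(z)

print(bell_overlap(psi_real))   # exactly 1/d = 0.0625
print(bell_overlap(psi_haar))   # generically much smaller than 1/d

# Cross-check against the explicit Bell vector on 2n qubits
phi = np.eye(d).reshape(-1) / np.sqrt(d)
assert np.isclose(abs(phi @ np.kron(psi_haar, psi_haar)) ** 2,
                  bell_overlap(psi_haar))
```

Already at $n = 4$ qubits the gap between the two classes is visible; as $n$ grows, the Haar-random value concentrates exponentially close to zero, which is the origin of the sampling problem discussed next.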
Theorem 4 shows that measuring the Bell state allows us to do classification. However, this does not solve the scaling issue discussed earlier. Indeed, as proven in Appendix F, the prediction values of the model given in Eq. (25) still concentrate exponentially close to zero as a function of the number of qubits. This implies that we still need an exponential number of experiment repetitions to accurately classify the data. However, as we now show, this problem can be overcome if we slightly modify the task at hand, from the classification of time-reversal-symmetric states to the classification of time-reversal-symmetric dynamics.
For this new task, rather than being given states, we assume instead access to the unitaries used to produce these states. The corresponding dataset has the form $S = \{(W_i, y_i)\}_{i=1}^{N}$, which has the same two symmetry groups $G_1 = O(d)$ and $G_0 = U(d)$ as before. As shown in Fig. 8(b), the main advantage of this scenario is that we are now allowed to initialize the $2n$-qubit register to any global state $|\Psi_{in}\rangle$, and to simultaneously evolve the first and the second $n$ qubits according to the same unitary $W_i$. To capture this additional freedom, we consider models in a new hypothesis class, defined as:

Hypothesis Class 2. We define the Hypothesis Class $H_2$, computable in a quantum-enhanced experiment, as composed of functions of the form

$h_{\boldsymbol{\theta}}(W_i) = \operatorname{Tr}\big[O(\boldsymbol{\theta})\, W_i^{\otimes 2} |\Psi_{in}\rangle\langle\Psi_{in}| (W_i^{\otimes 2})^{\dagger}\big], \quad (27)$

where $O(\boldsymbol{\theta}) = U^{\dagger}(\boldsymbol{\theta}) O U(\boldsymbol{\theta})$, $U(\boldsymbol{\theta})$ is a quantum neural network acting on the $2n$ qubits, $O$ is a Hermitian operator, and $|\Psi_{in}\rangle$ is an initial state on $2n$ qubits.
In this context, we can still use Proposition 3 to show that the following theorem holds.
Theorem 5. Let $h_{\boldsymbol{\theta}} \in H_2$ be a model in Hypothesis Class 2, computable in a quantum-enhanced experiment. There always exist quantum neural networks $U(\boldsymbol{\theta})$ and operators $O$, resulting in $O(\boldsymbol{\theta}) = |\Phi^+\rangle\langle\Phi^+|$, with $|\Phi^+\rangle$ being the Bell state on $2n$ qubits, such that $h_{\boldsymbol{\theta}}$ is $O(d)$-invariant, but not $U(d)$-invariant, and can perfectly classify the dynamics in the time-reversal dataset. The special choice of $|\Psi_{in}\rangle = |\Phi^+\rangle$ recovers the algorithm for classifying time-reversal-symmetric dynamics presented in [49].

Proof. Recall from Proposition 3 that $h_{\boldsymbol{\theta}}$ is invariant under the action of $O(d)$ if $O(\boldsymbol{\theta}) \in C^{(2)}(O(d))$. Following the proof of Theorem 4, we know that this can be achieved with the choice of $O(\boldsymbol{\theta}) = |\Phi^+\rangle\langle\Phi^+|$. Moreover, a straightforward calculation shows that if we choose $|\Psi_{in}\rangle = |\Phi^+\rangle$, we have

$h_{\boldsymbol{\theta}}(W_i) = |\langle\Phi^+| W_i^{\otimes 2} |\Phi^+\rangle|^2 = 1, \quad \forall W_i \text{ with label } y_i = 1,$

recovering the algorithm in [49]. On the other hand, $h_{\boldsymbol{\theta}}(W_i)$ is $W_i$-dependent and will concentrate around zero if $y_i = 0$ (see Appendix). This means that the model outputs a value of 1 if the unitary has label $y_i = 1$, and outputs a value of 0 (with high probability) if the unitary has label $y_i = 0$. Thus, the models in Hypothesis Class 2 can perform perfect classification (according to Definition 4) of time-reversal-symmetric dynamics.
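The separation claimed in the proof can be seen directly in a small simulation. The sketch below (our own numpy illustration; `model`, `haar_unitary` and `haar_orthogonal` are our helper names) evaluates $h_{\boldsymbol{\theta}}(W) = |\langle\Phi^+|W^{\otimes 2}|\Phi^+\rangle|^2$ for an orthogonal versus a Haar random unitary:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
d = 2 ** n
phi = np.eye(d).reshape(-1) / np.sqrt(d)  # |Phi+> on 2n qubits

def haar_unitary(dim, rng):
    z = (rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))  # fix phases for Haar measure

def haar_orthogonal(dim, rng):
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def model(W):
    # h(W) = |<Phi+| W^{(x)2} |Phi+>|^2
    return abs(phi.conj() @ np.kron(W, W) @ phi) ** 2

W_real = haar_orthogonal(d, rng)  # time-reversal-symmetric dynamics, y = 1
W_haar = haar_unitary(d, rng)     # generic dynamics, y = 0

print(model(W_real))  # exactly 1 (up to float error)
print(model(W_haar))  # small, concentrating toward zero with growing d

# Ricochet-based closed form: <Phi+|W(x)W|Phi+> = Tr[W W^T]/d
assert np.isclose(model(W_haar), abs(np.trace(W_haar @ W_haar.T)) ** 2 / d**2)
```

Because the $y_i = 1$ class now gives the deterministic output 1 rather than $1/d$, a constant number of repetitions suffices to separate the classes, which is the quantum-enhanced advantage highlighted below.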
As shown in the proof of Theorem 5, now the model gives non-overlapping predictions for the data in different classes, meaning that we can now perform classification with O(1) experiment repetitions. This is in contrast to the model defined in Theorem 4, which requires an exponential number of experiments for accurate classification. This illustrates how QML models capable of achieving a quantum advantage naturally emerge in our framework as G-invariant models.

C. Multipartite entanglement dataset
In this section, we consider the more involved task of classifying pure quantum states according to the amount of multipartite entanglement they possess. Entanglement has been shown to be a fundamental resource [65,66] in quantum information, quantum computation and quantum sensing [67-75]. Hence, its study and characterization are essential for the quantum sciences.
Here, we recall that entanglement is relatively well understood for bipartite pure quantum states (e.g., via the Schmidt decomposition for pure states [76]), and that group-invariance arguments have been previously used to characterize entanglement in bipartite mixed states [77,78]. However, the same cannot be said for multipartite entanglement [79-82]. In this case, the entanglement complexity scales exponentially with the number of parties and there is no unique measure to quantify it. Thus, we employ our framework not only to obtain $G$-invariant QML models that can accurately classify multipartite entangled states, but also to better understand the unique nature of multipartite entanglement. In this context, we also note that the availability of public datasets, such as the NTangled dataset [83] of quantum states with varying amounts of multipartite entanglement, makes this an extremely rich application for our framework and for benchmarking QML models.
Let $E$ be a multipartite entanglement measure satisfying $E(\rho) \in [0, 1]$, with $E(\rho) = 0$ if the state is separable, and $E(\rho) > 0$ if the state contains multipartite entanglement between the $n$ qubits (for instance, see the entanglement measures in Refs. [36,84-86]). The multipartite entanglement dataset is then of the form $S = \{(\rho_i, y_i)\}_{i=1}^{N}$, with labels $y_i$ determined by the amount of entanglement $E(\rho_i)$. Here, the symmetry group $G$ associated with the data in both classes is the Lie group $G = U(2)^{\times n}$, with an associated Lie algebra $\mathfrak{g} = \bigoplus_{j=1}^{n} \mathfrak{su}(2)$. This is due to the fact that local unitaries $\bigotimes_{j=1}^{n} V_j$ do not change the multipartite entanglement in a quantum state.

Conventional experiments
Since computing the entanglement typically requires evaluating a non-linear function of the quantum state [65], it is expected that models in conventional experiments will not be able to classify the states in this dataset. This intuition can be confirmed with the following theorem:

Theorem 6. There exist no models $h^{(1)}_{\boldsymbol{\theta}} \in H_1$ in Hypothesis Class 1, computable in a conventional experiment, that are $G$-invariant and can classify (i.e., provide any relevant information about) the data in the multipartite entanglement dataset.
Proof. First let us verify that Propositions 1 and 2 do not yield any adequate model for classification purposes. To identify the linear symmetries of $G$ required for the application of Proposition 1, we apply the Commutation Theorem for tensor products [87,88], which states that the commutant of a tensor product of operators is the tensor product of the commutants of each operator. Hence

$C(G) = \bigotimes_{j=1}^{n} C(U(2)) = \operatorname{span}\{\mathbb{1}_2^{\otimes n}\},$

where $\mathbb{1}_2$ denotes the $2 \times 2$ identity. This results in the choice $O(\boldsymbol{\theta}) = \lambda\mathbb{1}$ ($\lambda \in \mathbb{R}$) and constant model predictions $h^{(1)}_{\boldsymbol{\theta}}(\rho_i) = \lambda$ that cannot distinguish between states. Additionally, one can verify that the orthogonal complement of $\mathfrak{g}$ is trivial, $\mathfrak{g}^{\perp} = \{0_2^{\oplus n}\}$, with $0_2$ the $2 \times 2$ null matrix, such that models designed under Proposition 2 would also result in uninformative constant-value predictions. Hence, using Propositions 1 and 2 to obtain $G$-invariant models from Hypothesis Class 1 (with $k = 1$) will lead to trivial models that cannot classify the states in the multipartite entanglement dataset. Following a similar argument as the one developed in the last part of the proof of Theorem 1, one can also verify that no other $G$-invariant models exist with $k = 1$. Indeed, if $h^{(1)}_{\boldsymbol{\theta}}(V\rho V^{\dagger})$ is invariant for any $V = \bigotimes_{j=1}^{n} V_j$, it also has to be invariant when uniformly averaged over every $V_j$ in $U(2)$. Performing this averaging, we obtain an effective measurement operator proportional to the identity, and hence constant model predictions.
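The averaging argument that closes the proof, namely that twirling any observable over independent local unitaries collapses it onto the identity, can be checked by Monte Carlo sampling. The following numpy sketch is our own illustration (helper names like `haar_u2` are ours) for $n = 2$ qubits:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2
d = 2 ** n

# A random Hermitian measurement operator on n qubits
Z = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
O = (Z + Z.conj().T) / 2

def haar_u2(rng):
    z = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))  # Haar random 2x2 unitary

# Monte Carlo estimate of the local twirl  E_V[V^dag O V],  V = V_1 (x) ... (x) V_n
samples = 4000
acc = np.zeros((d, d), dtype=complex)
for _ in range(samples):
    V = haar_u2(rng)
    for _ in range(n - 1):
        V = np.kron(V, haar_u2(rng))
    acc += V.conj().T @ O @ V
avg = acc / samples

# A trivial commutant forces the average onto (Tr[O]/2^n) * identity
target = (np.trace(O) / d) * np.eye(d)
print(np.linalg.norm(avg - target))  # -> 0 as samples grows
print(np.linalg.norm(O - target))    # the original O is far from trivial
```

The residual deviation shrinks as $1/\sqrt{\text{samples}}$, confirming that the only $G$-invariant $k = 1$ measurement is (a multiple of) the identity.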

Quantum-enhanced experiments
Let us first consider the case of $k = 2$ copies in Eq. (2). We can show that the following theorem holds.

Theorem 7. Let $h^{(2)}_{\boldsymbol{\theta}} \in H_1$ be a model in Hypothesis Class 1, computable in a quantum-enhanced experiment. There always exist quantum neural networks $U(\boldsymbol{\theta})$ and operators $O$, resulting in $O(\boldsymbol{\theta}) \in \operatorname{span}\big(\bigotimes_{j=1}^{n}\{\mathbb{1}_4^{(j)}, \mathrm{SWAP}^{(j)}\}\big)$, such that $h^{(2)}_{\boldsymbol{\theta}}$ is invariant under the action of $U(2)^{\times n}$ and can perfectly classify the data in the multipartite entanglement dataset. Here, $\mathbb{1}_4^{(j)}$ denotes the $4 \times 4$ identity matrix on the $j$-th qubit of each copy of $\rho$, and $\mathrm{SWAP}^{(j)}$ denotes the operator that swaps the $j$-th qubits of each copy of $\rho$. There exist special choices of $O(\boldsymbol{\theta})$ which recover all the multipartite entanglement measures proposed in Refs. [36-42].
Proof. From Proposition 1 we know that $h^{(2)}_{\boldsymbol{\theta}}$ will be $G$-invariant if $O(\boldsymbol{\theta})$ is a quadratic symmetry of $G$. Here we can again invoke the Commutation Theorem for tensor products to obtain the space of these symmetries:

$C^{(2)}(G) = \operatorname{span}\big(\bigotimes_{j=1}^{n}\{\mathbb{1}_4^{(j)}, \mathrm{SWAP}^{(j)}\}\big),$

where $\mathbb{1}_4^{(j)}$ denotes the $4 \times 4$ identity matrix acting on the $j$-th qubit of each of the two copies of $\rho$, and where $\mathrm{SWAP}^{(j)}$ denotes the operator that swaps the $j$-th qubits of the copies of $\rho$. Note that $C^{(2)}(G)$ is spanned by $2^n$ elements, meaning that there exists an exponentially large freedom in choosing $O(\boldsymbol{\theta})$. Evidently, some choices of $O(\boldsymbol{\theta})$ will not be useful for characterizing multipartite entanglement. For instance, $O(\boldsymbol{\theta}) = 2\big(\mathbb{1} - \mathrm{SWAP}^{(j)} \otimes \mathbb{1}^{(\bar{j})}\big)$, with $\mathbb{1}^{(\bar{j})}$ the identity on all qubits but the $j$-th ones, leads to $h^{(2)}_{\boldsymbol{\theta}}(\rho) = 2(1 - \operatorname{Tr}[\rho_j^2])$, which is the impurity of $\rho_j = \operatorname{Tr}_{\bar{j}}[\rho]$, i.e., the impurity of the reduced state on the $j$-th qubit, and a bipartite entanglement measure across the cut of the $j$-th qubit versus the rest. Averaging over each of the $n$ qubits, i.e., choosing

$O(\boldsymbol{\theta}) = \frac{2}{n}\sum_{j=1}^{n}\big(\mathbb{1} - \mathrm{SWAP}^{(j)} \otimes \mathbb{1}^{(\bar{j})}\big), \quad (34)$

recovers the multipartite entanglement measures of [37,38]. Notably, the result in Eq. (34) can be further generalized as follows. First, let us define $S = \{1, 2, \ldots, n\}$ as the set of integers indexing each qubit, and let $P(S)$ be its power set (i.e., the set of subsets of $S$, with cardinality $|P(S)| = 2^n$).

Defining the operator

$O(\boldsymbol{\theta}) = 2\Big(\mathbb{1} - \bigotimes_{j \in Q} \mathrm{SWAP}^{(j)} \otimes \mathbb{1}^{(\bar{Q})}\Big)$

for any $Q \in P(S)\setminus\{\emptyset\}$ leads to the generalized version of the Concurrence measure for multipartite pure states in [39,40]. Even more generally, for any choice of $Q$, the operator

$O(\boldsymbol{\theta}) = \mathbb{1} - \bigotimes_{j \in Q} \frac{\mathbb{1}_4^{(j)} + \mathrm{SWAP}^{(j)}}{2} \otimes \mathbb{1}^{(\bar{Q})},$

where $\bar{Q} = S\setminus Q$, leads to the Concentratable Entanglement family of multipartite entanglement measures introduced in [36] (see Proposition 3 in [36]), where the special case $Q = S$ also leads to the measure of [41]. On the other hand, the operator

$O(\boldsymbol{\theta}) = \bigotimes_{j=1}^{n} \big(\mathbb{1}_4^{(j)} - \mathrm{SWAP}^{(j)}\big)$

leads (up to normalization) to the n-tangle measure [42] (see Proposition 5 in [36]).
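As a concrete check of the Concentratable Entanglement family, recall that the expectation value $\operatorname{Tr}[\mathrm{SWAP}^{(\alpha)} \rho\otimes\rho] = \operatorname{Tr}[\rho_\alpha^2]$ turns these SWAP-operator expectations into sums of reduced-state purities. The numpy sketch below (our own classical illustration, computing the purities directly rather than via a two-copy circuit) evaluates $CE(S) = 1 - 2^{-n}\sum_{\alpha \subseteq S}\operatorname{Tr}[\rho_\alpha^2]$ for a product state and a GHZ state:

```python
import numpy as np
from itertools import chain, combinations

n = 3
dims = [2] * n

def partial_purity(psi, keep):
    # Purity Tr[rho_keep^2] of the reduced state on the qubit subset `keep`
    psi = psi.reshape(dims)
    keep = sorted(keep)
    rest = [q for q in range(n) if q not in keep]
    # Move the kept qubits to the front and flatten into a (kept, rest) matrix
    M = np.transpose(psi, keep + rest).reshape(2 ** len(keep), -1)
    rho = M @ M.conj().T
    return np.trace(rho @ rho).real

def concentratable_entanglement(psi):
    # CE(S) = 1 - (1/2^n) sum over all subsets alpha of S of Tr[rho_alpha^2]
    subsets = chain.from_iterable(combinations(range(n), r) for r in range(n + 1))
    return 1 - sum(partial_purity(psi, s) for s in subsets) / 2 ** n

ghz = np.zeros(2 ** n); ghz[0] = ghz[-1] = 1 / np.sqrt(2)
product = np.zeros(2 ** n); product[0] = 1.0

print(concentratable_entanglement(product))  # 0.0
print(concentratable_entanglement(ghz))      # 0.375 for n = 3
```

For the GHZ state every nonempty proper subset has purity $1/2$, giving $CE = 1 - (2 + 6/2)/8 = 3/8$, while the separable state yields exactly zero, illustrating how these $G$-invariant models separate the two classes.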
Since several of the previous choices for O(θ) lead to entanglement monotones, the model's output will be different for data in different classes. Hence, one can perfectly classify the data in the multipartite entanglement dataset.
The results in Theorem 7 showcase how Propositions 1-4 can lead to extremely powerful and non-trivial results. By simply imposing the G-invariance condition on the model one is able to naturally find an exponentially large manifold of solutions capable of classifying the states in the multipartite entanglement dataset. The latter leads to the intriguing possibility that C (2) (G) contains solutions leading to new entanglement measures.
Going further, one could also investigate the potential of models acting on more than $k = 2$ copies, a prospect that has been largely unexplored. Here we know from Proposition 1 that if $O(\boldsymbol{\theta})$ is a $k$-th order symmetry of $G$, then the model $h^{(k)}_{\boldsymbol{\theta}}$ will be $G$-invariant. Invoking once more the Commutation Theorem together with the Schur-Weyl duality, we find

$C^{(k)}(G) = \operatorname{span}\big(\bigotimes_{j=1}^{n} S_k^{(j)}\big),$

where $S_k^{(j)}$ denotes the Symmetric group acting on the $k$ copies of the $j$-th qubit. Since the dimension of $C^{(k)}(G)$ scales as $(k!)^n$, i.e., exponentially with $k$, the manifold of $G$-invariant models is likely to lead to a rich variety of entanglement measures.

V. DISCRETE GROUP-INVARIANT MODELS
In the previous section we focused solely on situations where the symmetry group associated with the dataset was a unitary representation of some compact Lie group. Still, our formalism can be equally applied in the case of representations of discrete groups. Discrete groups are the relevant mathematical structure, for instance, when the quantum data is invariant under a finite set of permutations. This covers cases involving spatial invariance of condensed matter states on a lattice, or structural invariances in states of molecular systems. To illustrate such potential applications, we now address the task of classifying states belonging to a dataset with symmetry group G = S n , i.e., the Symmetric group consisting of all the n! permutations over a set of n indices.

A. Graph isomorphism dataset
In this section, we consider a dataset related to the so-called graph isomorphism problem, where the goal is to determine if two graphs are isomorphic. This classification task has a rich history in computational sciences [89], and is known to be in the NP complexity class (although it has not been shown to be NP-complete). To solve this problem, several classical algorithms (with quasipolynomial complexity in the graph size [90]), and also quantum heuristics [43-46], have been proposed. When using a quantum model for graph classification purposes, the first step is to encode graphs onto quantum states. Here we take such encoding to be fixed and start by detailing how it is performed and how the "graph isomorphism" dataset is generated.
Recall that a graph is specified as $G = (V, E)$, where $V$ is a collection of $n$ nodes, and $E$ is a collection of edges. Two graphs $G$ and $G'$ are said to be isomorphic (denoted $G \cong G'$) if there exists an edge-preserving bijection between their vertex sets. To build the dataset, we fix two reference non-isomorphic graphs $G_0$ and $G_1$ ($G_0 \not\cong G_1$), and generate graphs $G_i$ that are isomorphic to either $G_0$ or $G_1$. To each of these graphs we assign the label $y_i = 0$ or $y_i = 1$, if it is isomorphic to $G_0$ or $G_1$, respectively.
Next, we encode these graphs $\{G_i\}$ into quantum states $\{\rho_i\}$. This is achieved by evolving an initial $S_n$-invariant fiduciary state $\rho_{in}$ (e.g., $\rho_{in} = |+\rangle\langle+|^{\otimes n}$) as

$\rho_i = W(G_i)\, \rho_{in}\, W^{\dagger}(G_i), \quad (40)$

where $W(G_i) = e^{-itH(G_i)}$. Here, $t > 0$ is a fixed evolution time chosen such that the action of $W(G_0)$ and $W(G_1)$ over $\rho_{in}$ is different, and $H(G_i)$ is a Hamiltonian whose topology is that of the graph $G_i$. Specifically, we take it to be defined as

$H(G_i) = \sum_{(j,k) \in E_i} Z_j Z_k + \sum_{j=1}^{n} X_j,$

with $Z_j$ and $X_j$ denoting the Pauli $Z$ and $X$ operators acting on qubit $j$, respectively. We note that there exist more general ways to encode and process graph information in quantum states. In particular, the quantum graph convolutional neural network introduced in Ref. [32] generalizes the unitary $W(G_i)$ used here and is specifically tailored to represent quantum systems that have a graph structure (see Appendix G for a detailed description).

Figure 9. Permutation invariance in the graph isomorphism dataset. Consider the state $\rho$ representing a quantum system of $n$ spins interacting under some Hamiltonian that follows the topology of an underlying graph $G$. By conjugating the state with an element $P \in S_n$, one obtains a new quantum state $P\rho P^{\dagger}$ whose interaction graph $G'$ is isomorphic to $G$. That is, $\rho$ and $P\rho P^{\dagger}$ have the same label in the dataset.

Taken together, the graph generation and the encoding in Equation (40) allow us to define the graph isomorphism dataset as a collection $S = \{(\rho_i, y_i)\}_{i=1}^{N}$ of states $\rho_i$ with labels $y_i \in \{0, 1\}$. As shown in Fig. 9, the states $\rho_i$ can be thought of as representing an $n$-qubit quantum system whose interaction topology follows that of a graph $G_i$. The symmetry group $G$ associated with both classes in the dataset is the Symmetric group $S_n$, as one can map states within the same class via the action of elements in $S_n$ (see Fig. 9). Explicitly, let $P$ be an operator representing a permutation in $S_n$, and define the state $\rho_i' = P\rho_i P^{\dagger}$, which can be expressed as

$\rho_i' = P W(G_i) P^{\dagger}\,(P\rho_{in}P^{\dagger})\,P W^{\dagger}(G_i) P^{\dagger} = W(G_i')\,\rho_{in}\,W^{\dagger}(G_i'),$

where $G_i'$ is the graph obtained by permuting the nodes of $G_i$ accordingly. Since $\rho_{in}$ is $S_n$-invariant, $P\rho_{in}P^{\dagger} = \rho_{in}$, and given that $G_i' \cong G_i$, we conclude that the state $\rho_i' = W(G_i')\,\rho_{in}\,W^{\dagger}(G_i')$ shares the same label as $\rho_i$.
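The key covariance property, that permuting the qubits of the encoding unitary is the same as relabeling the nodes of the graph, i.e., $P\,W(G)\,P^{\dagger} = W(G')$, can be verified numerically. The sketch below is our own illustration, assuming an Ising-type graph Hamiltonian $H(G) = \sum_{(j,k)\in E} Z_j Z_k + \sum_j X_j$ consistent with the Pauli operators mentioned above (helper names are ours):

```python
import numpy as np

n = 3
I2 = np.eye(2)
Zp = np.diag([1.0, -1.0])
Xp = np.array([[0.0, 1.0], [1.0, 0.0]])

def op_on(qubit_ops):
    out = np.array([[1.0]])
    for op in qubit_ops:
        out = np.kron(out, op)
    return out

def hamiltonian(edges):
    # H(G) = sum_{(j,k) in E} Z_j Z_k + sum_j X_j
    H = np.zeros((2 ** n, 2 ** n))
    for j, k in edges:
        H += op_on([Zp if q in (j, k) else I2 for q in range(n)])
    for j in range(n):
        H += op_on([Xp if q == j else I2 for q in range(n)])
    return H

def evolve(H, t=0.7):
    w, v = np.linalg.eigh(H)  # exact expm for Hermitian H
    return v @ np.diag(np.exp(-1j * t * w)) @ v.conj().T

def perm_matrix(pi):
    # Unitary relabeling the qubits: the value of qubit q moves to position pi[q]
    P = np.zeros((2 ** n, 2 ** n))
    for b in range(2 ** n):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        new_bits = [0] * n
        for q in range(n):
            new_bits[pi[q]] = bits[q]
        P[sum(bit << (n - 1 - q) for q, bit in enumerate(new_bits)), b] = 1.0
    return P

edges = [(0, 1)]
pi = [2, 1, 0]                                   # swap qubits 0 and 2
edges_perm = [(pi[j], pi[k]) for j, k in edges]  # relabeled graph: [(2, 1)]

P = perm_matrix(pi)
lhs = P @ evolve(hamiltonian(edges)) @ P.T
rhs = evolve(hamiltonian(edges_perm))
print(np.linalg.norm(lhs - rhs))  # ~0
```

Together with the $S_n$-invariance of $\rho_{in}$, this is exactly why $P\rho_i P^{\dagger}$ carries the same label as $\rho_i$.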
Having defined the dataset of interest, we proceed to show that models in conventional experiments suffice to classify the data:

Theorem 8. Let $h^{(1)}_{\boldsymbol{\theta}} \in H_1$ be a model in Hypothesis Class 1, computable in a conventional experiment. There always exist quantum neural networks $U(\boldsymbol{\theta})$ and operators $O$, resulting in $O(\boldsymbol{\theta}) \in \operatorname{span}(\{A^{\otimes n} \mid A \in U(2)\})$, such that $h^{(1)}_{\boldsymbol{\theta}}$ is invariant under the action of $S_n$.
Proof. From Proposition 1 we know that $h^{(1)}_{\boldsymbol{\theta}}$ will be invariant under the action of $S_n$ if $O(\boldsymbol{\theta})$ is a linear symmetry of $S_n$. Using the Schur-Weyl duality leads to [91]

$C(S_n) = \operatorname{span}(\{A^{\otimes n} \mid \forall A \in U(2)\}), \quad (43)$

meaning that the operator $O(\boldsymbol{\theta})$ has to be a Hermitian linear combination of $n$-fold tensor products of single-qubit unitaries. Notably, the dimension of this solution manifold grows polynomially with $n$, as the dimension of $C(S_n)$ can be shown to follow the Tetrahedral numbers.
Finally, it remains to identify an adequate $O(\boldsymbol{\theta})$ among the space of operators that was just defined. This choice should be made such that the model predictions are (maximally) distinct for states belonging to different classes, that is, $h^{(1)}_{\boldsymbol{\theta}}(\rho_i) \neq h^{(1)}_{\boldsymbol{\theta}}(\rho_{i'})$ whenever $y_i \neq y_{i'}$. For an adequate parameterization of $O(\boldsymbol{\theta})$, this search could be turned into an optimization task. Given that, by construction, the model predicts the same value for any states $\rho_i$ belonging to the same class, this optimization would only require a single representative state for each class. Otherwise, one could employ heuristically defined or physically motivated operators. In particular, we highlight that the operators studied in [43-46] to distinguish non-isomorphic graphs belong to the family of $O(\boldsymbol{\theta})$ yielded by Theorem 8.
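The $S_n$-invariance of measurements of the form $A^{\otimes n}$ is straightforward to check numerically. The sketch below is our own illustration (not the paper's code): we take a Hermitian single-qubit operator $A$, which lies in the span of Eq. (43) since single-qubit unitaries span all $2 \times 2$ matrices, and verify that the prediction is unchanged under a qubit permutation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
d = 2 ** n

# Hermitian single-qubit operator A; A^{(x)n} is permutation invariant
M = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
A = (M + M.conj().T) / 2

O = A
for _ in range(n - 1):
    O = np.kron(O, A)  # O = A^{(x) n}

def perm_matrix(pi):
    # Unitary relabeling the qubits: the value of qubit q moves to position pi[q]
    P = np.zeros((d, d))
    for b in range(d):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        new_bits = [0] * n
        for q in range(n):
            new_bits[pi[q]] = bits[q]
        P[sum(bit << (n - 1 - q) for q, bit in enumerate(new_bits)), b] = 1.0
    return P

# Random pure state and a cyclic permutation of the three qubits
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
z /= np.linalg.norm(z)
rho = np.outer(z, z.conj())
P = perm_matrix([1, 2, 0])

print(np.trace(O @ rho).real)             # model prediction h(rho)
print(np.trace(O @ P @ rho @ P.T).real)   # identical: h is S_n-invariant
```

Because conjugation by any permutation merely reorders the identical tensor factors of $A^{\otimes n}$, the two predictions agree exactly, for every permutation and every state.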
Overall, Theorem 8 exemplifies how our present framework can be readily applied to datasets with discrete symmetries. While studied here for the case of $G = S_n$, this can be specialized to subgroups of $S_n$, such as the groups of reflections and translations, or more subtle discrete symmetries which naturally arise in condensed-matter models and quantum chemistry problems.

VI. EQUIVARIANT QUANTUM NEURAL NETWORKS
While the framework laid out here provides the ultimate form that a $G$-invariant model $h_{\boldsymbol{\theta}}$ should adopt, it neither prescribes how to actually parameterize the quantum neural network $U(\boldsymbol{\theta})$, nor tells us how to choose the measurement operator $O$ that realizes $O(\boldsymbol{\theta}) = U^{\dagger}(\boldsymbol{\theta})OU(\boldsymbol{\theta})$ such that it complies with our theorems.
For this purpose, it is convenient to consider the action of the quantum neural network and the measurement process separately, and to note that these can generally be thought of as concatenated maps. More generally, a constructive way to achieve $G$-invariance of general QML models starts by decomposing the model as

$h_{\boldsymbol{\theta}} = \mathcal{E}^{(M)}_{\boldsymbol{\theta}^{(M)}} \circ \cdots \circ \mathcal{E}^{(1)}_{\boldsymbol{\theta}^{(1)}}, \quad (44)$

i.e., as a composition of $M$ maps $\mathcal{E}^{(m)}_{\boldsymbol{\theta}^{(m)}}$, with integer labels $m = 1, \ldots, M$, each parameterized by a subset of parameters $\boldsymbol{\theta}^{(m)} \subset \boldsymbol{\theta}$. For instance, one such map could represent the action of a QNN (or of its individual constituents, i.e., its layers), the final measurement operation, steps of post-processing, or even the process of encoding classical data into quantum states in the first place.
Although imposing invariance at the map level effectively results in global invariance of the model, this is quite restrictive. A more relaxed approach towards the construction of group invariant models involves the concept of equivariance [7][8][9]92] which is now defined.
Definition 7 ($G$-equivariance). Given a group $G$ with an action on spaces $A$ and $B$, a function $\mathcal{E}: A \to B$ is called $G$-equivariant if it commutes with the action of the group, that is, if

$\mathcal{E}(V \cdot x) = V \cdot \mathcal{E}(x)$

for all elements $V \in G$ and inputs $x \in A$.
That is, a function is $G$-equivariant if group-shifting the input ($x \to V\cdot x$) produces a group-shifted output ($\mathcal{E}(x) \to V\cdot\mathcal{E}(x)$). It can be verified that (i) $G$-equivariance of the intermediary maps ($m < M$) along with (ii) $G$-invariance of the final map ($m = M$) is sufficient to ensure $G$-invariance of the composed model. Intuitively, equivariance permits the propagation of the action of elements of $G$ up to the final map, so that the symmetries are preserved.

Accordingly, a model $h_{\boldsymbol{\theta}}$ belonging to Hypothesis Class 1 could be split in terms of the transformation of the input state and the final measurement. That is, $\mathcal{E}^{(1)}_{\boldsymbol{\theta}}(\rho) = U(\boldsymbol{\theta})\rho^{\otimes k}U^{\dagger}(\boldsymbol{\theta})$ and $\mathcal{E}^{(2)}(\sigma) = \operatorname{Tr}[O\sigma]$. Thus, the model $h_{\boldsymbol{\theta}}$ will be $G$-invariant if $\mathcal{E}^{(2)}$ is invariant and $\mathcal{E}^{(1)}_{\boldsymbol{\theta}}$ is equivariant. As before, invariance of $\mathcal{E}^{(2)}$ can be achieved by choosing the measurement operator $O$ to be a Hermitian operator satisfying one of the Propositions 1-4. On the other hand, equivariance of $\mathcal{E}^{(1)}_{\boldsymbol{\theta}}$ can be achieved in a way very close in essence to Proposition 1. Specializing Definition 7 to the case of unitary maps acting on $k$ copies of $\rho$, we see that equivariance of such a QNN (or of its layers) is equivalent to requiring that

$U(\boldsymbol{\theta})\, V^{\otimes k}\rho^{\otimes k}(V^{\dagger})^{\otimes k}\, U^{\dagger}(\boldsymbol{\theta}) = V^{\otimes k}\, U(\boldsymbol{\theta})\rho^{\otimes k}U^{\dagger}(\boldsymbol{\theta})\, (V^{\dagger})^{\otimes k}, \quad \forall V \in G.$

That is, a unitary $U(\boldsymbol{\theta})$ is $G$-equivariant if $U(\boldsymbol{\theta}) \in C^{(k)}(G)$ (it belongs to the $k$-th commutant of $G$). For instance, for graph classification, the quantum graph convolutional neural network presented in Appendix G can be verified to be equivariant under the action of $S_n$.

Overall, approaching $G$-invariance through the decomposition in Eq. (44) has the advantage of modularity (equivariant maps could be more easily identified and reused across different models), and also allows for the study of more general models than the ones in Hypothesis Class 1. For instance, additional steps of post-processing, or of encoding of classical data into quantum states, can be described as additional maps to be composed, and readily fit in such a framework.
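The composition rule, an equivariant layer followed by an invariant measurement yields an invariant model, can be demonstrated in a few lines. The numpy sketch below is our own illustration for $G = S_n$ (all helper names are ours): the QNN generator is the permutation-invariant sum of all two-qubit SWAPs, so $U(\theta)$ commutes with every qubit permutation, and the measurement is the permutation-invariant average magnetization:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
d = 2 ** n

def swap(j, k):
    # SWAP between qubits j and k on n qubits (a basis permutation matrix)
    S = np.zeros((d, d))
    for b in range(d):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        bits[j], bits[k] = bits[k], bits[j]
        S[sum(bit << (n - 1 - q) for q, bit in enumerate(bits)), b] = 1.0
    return S

# Equivariant layer: its generator (sum of all SWAPs) is permutation invariant,
# so U commutes with every element of S_n.
Gen = sum(swap(j, k) for j in range(n) for k in range(j + 1, n))
w, v = np.linalg.eigh(Gen)
U = v @ np.diag(np.exp(-1j * 0.4 * w)) @ v.conj().T

# Invariant measurement: O = (1/n) sum_j Z_j
Zp = np.diag([1.0, -1.0])
def z_on(j):
    out = np.array([[1.0]])
    for q in range(n):
        out = np.kron(out, Zp if q == j else np.eye(2))
    return out
O = sum(z_on(j) for j in range(n)) / n

def model(rho):
    return np.trace(O @ U @ rho @ U.conj().T).real

# Random pure state and a qubit permutation (here: swapping qubits 0 and 2)
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
z /= np.linalg.norm(z)
rho = np.outer(z, z.conj())
P = swap(0, 2)

print(abs(model(P @ rho @ P.T) - model(rho)))  # ~0: the composed model is S_n-invariant
```

Swapping either ingredient for a non-invariant one (e.g., measuring $Z_1$ alone) breaks the equality, which is the practical content of conditions (i) and (ii) above.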
Finally, although in this manuscript we have exclusively focused on classification tasks, i.e., where the model outputs scalars, one can consider the more general setting of a model producing operator-valued outputs (e.g., in the case of quantum generative modeling [33,93,94]). In such a case, one would be interested in global equivariance of the model, rather than invariance.

VII. CONCLUSIONS
In this work, we presented a theoretical framework to design QML models that, by construction, respect the symmetries of a group $G$ associated to the dataset. This approach has several benefits [7]: it is more data efficient, it reduces the model's search space (fewer parameters), it often leads to better generalization, and the classification accuracy is robust under perturbations drawn from the symmetry group. Our main contributions are as follows. First, in Propositions 1-4 we leveraged properties from representation theory to determine the conditions that lead to $G$-invariance. These results constitute guidelines for designing group-invariant models and were used to show how models such as those in Hypothesis Class 1 can be made $G$-invariant and accurately solve certain supervised learning tasks. We then showcased the power of our framework for several QML tasks, where we find how embedding symmetry information into the model allows us to recover in an elegant and formal way several algorithms from the literature that were heuristically obtained, or that were obtained through trial-and-error.
As a first application, we addressed the task of distinguishing pure states from mixed states in conventional and quantum-enhanced experiments (i.e., experiments without and with access to a quantum memory, respectively). For this case, the symmetry group is the unitary group, since applying a unitary to a quantum state preserves its spectral properties. Theorem 1 showed that there exist no conventional experiments that are $G$-invariant and that can classify the data in the purity dataset. However, by allowing the QML model to coherently act on two copies of each state in the dataset, we showed in Theorem 2 that there exist models that are $G$-invariant and can classify the data. These are based on taking the expectation value of the SWAP operator, which naturally appeared through the Schur-Weyl duality as an element of the Symmetric group.
The second task we considered was that of classifying time-reversal-symmetric states versus Haar random states. In Theorems 3 and 4 we showed that models in both conventional and quantum-enhanced experiments can be used to distinguish such states. Surprisingly, we recovered the well-known Bell basis measurement scheme for detecting time-reversal, where the Bell measurement operator appeared naturally as an element of the basis of the Brauer algebra. Moreover, by the seemingly innocuous change of allowing the model to access the unitaries that prepare the states in the dataset (rather than the states themselves), we obtained in Theorem 5 the model used in Ref. [49] to show that quantum-enhanced experiments can classify the data with exponentially fewer experiments than conventional experiments. This connects our work with recent research showing that QML models are capable of exponential advantages in some tasks of data classification.
We then applied our framework to a dataset composed of pure states with different amounts of multipartite entanglement. Here, the symmetry group preserving entanglement is the $n$-fold direct product of the local unitary group. This example proved the power of our framework, as we showed that all the entanglement measures of Refs. [36-42] are special cases of the family of $G$-invariant models defined in Theorem 7. Moreover, we conjectured that allowing the QML model to access more than two copies of each quantum state and measuring the expectation value of local permutation operators can lead to new entanglement measures. Interestingly, we recently became aware of the work in Ref. [95], where it was shown that such expectation values can indeed detect the presence of entanglement.
Additionally, we showed how our results extend beyond continuous Lie groups, by studying a classification problem on a quantum graph isomorphism dataset. That is, we addressed the task of determining if a given graph-encoded quantum state belongs to one isomorphism class or the other. In this case, the symmetry group associated with the dataset is the Symmetric group. In Theorem 8, we identified $G$-invariant models capable of classifying the data in such a graph isomorphism dataset.
Our results take one of the first steps towards a general theory of QML models with sharp geometric priors based on the dataset symmetries. Since our work is inspired by the theory and success of geometric deep learning, we envision that soon enough the field of geometric quantum machine learning will be a thriving and exciting field.

VIII. OUTLOOK
Here we overview some questions left unanswered by our results, and propose different paths forward.

A. Equivariance
As detailed in Section VI, the concept of equivariance in quantum neural networks may play a central role when building models that respect the symmetries of a given dataset. While a few examples of equivariant quantum neural networks have been proposed, such as the quantum convolutional neural network [21], which respects translational symmetry, the quantum graph convolutional neural network [32], which respects $S_n$ symmetry in graphs, the $U(d)$-equivariant ansatz of [96], or the graph automorphism group-invariant ansatz in [97], it is worth noting that these are the exception to the rule. Most quantum neural networks in the literature are not equivariant, and do not use information about symmetries in their design. Hence, much work remains to be done on the path towards general equivariant architectures, especially to guarantee that they have circuit depth and connectivity requirements compatible with near-term quantum hardware.

B. Trainability: Expressibility and gradient magnitudes
Arguably, one of the main threats to the trainability of QNNs is Barren Plateaus (BPs), a phenomenon by which gradients along the parameter landscape become exponentially concentrated around zero as the system size grows [15-20]. In the presence of BPs, an exponential number of measurement shots is required to correctly identify a minimizing direction on the landscape. Given such limitations, understanding the conditions that lead to their presence has been the subject of extensive work [17,22,23,25,98].
Naively, one would be tempted to choose QNNs to be highly expressive so that good approximations of the relevant unitary transformation can be achieved. Nevertheless, Ref. [15] unveiled a connection between the expressibility of an ansatz and the magnitudes of the gradients: highly expressive ansätze were shown to exhibit BPs, suggesting that expressibility should be limited to give room for trainability. Later on, Ref. [25] pointed towards the Lie closure of the gate generators of an ansatz as a measure of its ultimate expressibility. Most importantly, when the dimension of such Lie closure grows exponentially with the system size (as is the case for problem-agnostic architectures such as the hardware efficient ansatz [16,17,99]), there exists some critical number of layers beyond which barren plateaus are known to dominate the parameter landscapes. Finally, we note that the size of the Lie closure has also been related to the number of parameters needed to overparametrize a QNN [26]. Therefore, given a fixed number of parameters we expect, in general, less expressive ansätze to have more favourable landscapes. Overall, all these results point towards the importance of reducing as much as possible (in a sensible way) the expressivity of QNNs.
In this context, building models with strong geometric priors, such as equivariant QNNs, constitutes a sensible choice. By constraining the expressibility of the ansatz to the relevant region only, these symmetry-based proposals emerge as goldilocks candidates for trainability-aware ansatz design. While the exact improvement in trainability will certainly be problem-dependent, there is already evidence that equivariant QNNs do indeed lead to better performance and trainability in several archetypal near-term algorithms [97].

C. Generalization
Complementary to its trainability, the ability of a model to generalize to unseen data is key to its applicability in realistic scenarios. While errors evaluated on a training dataset are the main metric when training a model, its practical success should be gauged on new testing data. The generalization error quantifies the gap between training and testing errors. In the realm of QML, recent results [24] have shown that this generalization error is upper bounded by a quantity scaling (up to logarithmic factors) as $\sqrt{T/N}$, where $T$ denotes the number of parameters and $N$ the number of training data points. As such, given a fixed training dataset, reducing the expressibility of a model (by means of equivariant QNNs with appropriate geometric priors), and thus the number of free parameters, is expected to yield better model generalization.
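As a purely classical analogy (not from Ref. [24]; the model and quantities are our own illustrative choices), one can watch the train/test gap shrink as the training-set size $N$ grows for a fixed number of parameters $T$:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 10  # number of trainable parameters (polynomial degree + 1)

def avg_gap(N, trials=200):
    """Average |train MSE - test MSE| for least-squares polynomial fits."""
    gaps = []
    for _ in range(trials):
        def sample(m):
            x = rng.uniform(-1, 1, size=m)
            y = np.sin(np.pi * x) + 0.1 * rng.normal(size=m)
            return np.vander(x, T), y
        A_tr, y_tr = sample(N)
        A_te, y_te = sample(5000)
        w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
        e_tr = np.mean((A_tr @ w - y_tr) ** 2)
        e_te = np.mean((A_te @ w - y_te) ** 2)
        gaps.append(abs(e_te - e_tr))
    return np.mean(gaps)

# For fixed T, increasing N shrinks the generalization gap
gap_small_N = avg_gap(20)
gap_large_N = avg_gap(500)
```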

D. Quantum advantage
The gold standard for quantum machine learning models, and for quantum algorithms in general, is to solve a given task faster than any classical method. As exemplified by our main results (see Sec. IV B), the concept of G-invariance is not tied to that of computational advantage: there exist G-invariant models that are capable, but also others that are incapable, of achieving a quantum advantage. Hence, it will be fundamental to determine the key features that lead to models with favourable scalings.

E. More general models and learning scenarios
In this work we considered QML models of the form in Hypothesis Class 1. However, these are not the most general models one can have. For instance, the ability to perform non-trivial post-processing on the measurement outcomes [61], or to employ randomized measurement techniques [100], can greatly increase the model's power and performance [13]. We expect that the principles presented here can be applied to more general settings, opening up the possibility of obtaining G-invariance with methods beyond those described in Propositions 1-4. Moreover, while the concepts of k-th order symmetries and orthogonal complements played a key role in our derivation of G-invariant models, we expect that other properties will be needed to understand group-invariance in more general settings.
Finally, we highlight that we have mainly focused on supervised binary classification tasks. Nevertheless, the ideas of G-invariance should also apply to more general supervised learning scenarios (including regression problems), as well as to unsupervised learning scenarios.

IX. ACKNOWLEDGMENTS
We thank Robert Zeier for helpful and insightful discussions.

where $\mathrm{Tr}_j$ denotes the trace over all qubits except those in the $j$-th copy of $\rho$. Equation (B3) shows that $h_{\theta}^{(k)} = 0$ for all states $\rho$ having support exclusively on $i\mathfrak{g}$. Since a Lie algebra is closed under the action of its Lie group, it follows that $V \rho V^{\dagger} \in i\mathfrak{g}$ for all $V \in G$. Thus, by a similar argument, we have for all $V \in G$, where we again used Definition 6. The latter shows that $h_{\theta}^{(k)}$ is $G$-invariant. Finally, we can generalize the previous results to the case where $O(\theta) \in \mathrm{span}(\{A_j \otimes A_j\}_j)$. Now we can expand $O(\theta)$, which completes the proof.
We finally note that if $i\mathbb{1} \in \mathfrak{g}^{\perp}$, then we can obtain $G$-invariance when $\rho \in i\mathfrak{g}^{\perp}$ and $A_j \in i\mathfrak{g}$. The proof follows similarly to that previously presented. Now we have Here we need to use the fact that $V^{\dagger} A_j V \in i\mathfrak{g}$ for all $V \in G$ (a Lie algebra is closed under the action of its associated Lie group). Moreover, we have used Definition 6 to show that $\mathrm{Tr}[\rho V^{\dagger} A_j V] = 0$.
Appendix D: Ancilla-based models for the purity dataset

In the main text we considered the task of classifying the data in the purity dataset with models in Hypothesis Class 1, i.e., with models of the form $h_{\theta}(\rho) = \mathrm{Tr}[U(\theta) \rho^{\otimes k} U^{\dagger}(\theta) O]$. As we saw in Theorem 1 of the main text, there are no such models with $k = 1$ that allow for classification. On the other hand, Theorem 2 shows that models with $k = 2$ can indeed classify the data according to their purity. In this case, $O(\theta)$ has to be the SWAP operator (up to some additive and multiplicative constants).
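The identity underlying Theorem 2, $\mathrm{Tr}[\mathrm{SWAP}\,(\rho \otimes \rho)] = \mathrm{Tr}[\rho^2]$, together with its invariance under $\rho \to V\rho V^{\dagger}$, can be checked numerically; the following is a minimal sketch under our own conventions, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_density_matrix(dim, rank, rng):
    """Random mixed state from a Ginibre matrix of the given rank."""
    G = rng.normal(size=(dim, rank)) + 1j * rng.normal(size=(dim, rank))
    rho = G @ G.conj().T
    return rho / np.trace(rho)

def swap_operator(dim):
    """SWAP on H x H: SWAP |i>|j> = |j>|i>."""
    S = np.zeros((dim * dim, dim * dim))
    for i in range(dim):
        for j in range(dim):
            S[j * dim + i, i * dim + j] = 1.0
    return S

dim = 4
rho = random_density_matrix(dim, 2, rng)
S = swap_operator(dim)

purity_direct = np.trace(rho @ rho).real
purity_swap = np.trace(S @ np.kron(rho, rho)).real  # Tr[SWAP (rho x rho)]

# G-invariance: the value is unchanged under rho -> V rho V^dagger
V, _ = np.linalg.qr(rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim)))
rho_v = V @ rho @ V.conj().T
purity_swap_rotated = np.trace(S @ np.kron(rho_v, rho_v)).real
```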
The previous discussion raises the issue of how to efficiently evaluate the expectation value of the SWAP operator. Taking inspiration from the Hadamard test, which computes the expectation value of a unitary by controlling its action with an ancillary qubit, we envision a new family of ancilla-based hypotheses. Here, one appends to the QNN an extra qubit that is used along with the two copies of $\rho$, so that the $(2n+1)$-qubit state $|0\rangle\langle 0| \otimes \rho \otimes \rho$ is fed into a quantum neural network that acts globally on all qubits, and only the ancilla qubit is measured. This defines the following Hypothesis Class.
Hypothesis Class 3. We define the Hypothesis Class $H_3$, computable in a quantum-enhanced experiment, as composed of functions of the form $h_{\theta}(\rho) = \mathrm{Tr}[U(\theta)(|0\rangle\langle 0| \otimes \rho \otimes \rho) U^{\dagger}(\theta) O_A]$, where $U(\theta)$ is a quantum neural network acting on $2n+1$ qubits, and $O_A = O \otimes \mathbb{1} \otimes \mathbb{1}$ with $O$ a one-qubit Hermitian operator acting on the ancilla qubit.
The models in Hypothesis Class $H_3$ should now be invariant under the action of $\mathbb{1} \otimes V \otimes V$ for any $V$ in $U(d)$. In the spirit of Proposition 1, and defining $O_A(\theta) = U^{\dagger}(\theta) O_A U(\theta)$, we know that this can be readily achieved when $[O_A(\theta), \mathbb{1} \otimes V \otimes V] = 0$ for all $V \in U(d)$. This results in the following theorem.
Theorem 9. Let $h_{\theta} \in H_3$ be a model in Hypothesis Class 3, computable in a quantum-enhanced experiment. There always exist quantum neural networks $U(\theta)$ and operators $O_A$, resulting in $O_A(\theta) = A \otimes S$, where $A|0\rangle = |0\rangle$ and $S \in \mathrm{span}(\{\mathbb{1} \otimes \mathbb{1}, \mathrm{SWAP}\})$ with non-zero component on $\mathrm{SWAP}$, such that $h_{\theta}$ is invariant under the action of $U(d)$ and can perfectly classify the data in the purity dataset. The special choice $O_A(\theta) = Z \otimes \mathrm{SWAP}$ corresponds to the operator measured in the Swap Test [34] and in the ancilla-based algorithm of [61].
Proof. We first recall that the models in $H_3$ will be $G$-invariant if $h_{\theta}(V \rho V^{\dagger}) = h_{\theta}(\rho)$ for all $V \in U(d)$, i.e., when Eq. (D2) holds. This is the case when Eq. (D3) is satisfied for the choice $O_A(\theta) = A \otimes S$, with $A$ an operator acting on the ancillary qubit and $S$ an operator in $C^{(2)}(U(d)) = \mathrm{span}(S_2)$, where $S_2 = \{\mathbb{1} \otimes \mathbb{1}, \mathrm{SWAP}\}$ is a representation of the symmetric group of two elements. One can then evaluate the left-hand side of (D2) with $O_A(\theta)$ replaced by $A \otimes S$: in the second equality we have used the fact that $S$ commutes with $V \otimes V$, while in the third line we have simply separated the trace over the different subsystems. Since replacing $O_A(\theta)$ by $A \otimes S$ in the left-hand side of (D2) leads to $\mathrm{Tr}[A|0\rangle\langle 0|]\,\mathrm{Tr}[(\rho \otimes \rho) S]$, we can see that the model will be $G$-invariant iff $A|0\rangle = |0\rangle$, i.e., if $A$ has $|0\rangle$ as an eigenvector with eigenvalue equal to one. Finally, we remark that a direct calculation with the circuits in Fig. 1 verifies that both circuits satisfy $U_1^{\dagger}(Z \otimes \mathbb{1} \otimes \mathbb{1}) U_1 = U_2^{\dagger}(Z \otimes \mathbb{1} \otimes \mathbb{1}) U_2 = Z \otimes \mathrm{SWAP}$. Hence, we recover the operator measured in the Swap Test [34] and in the ancilla-based algorithm of [61].

FIG. 1. Ancilla-based circuits for computing the purity. (a) The circuit $U_1$ corresponds to the canonical SWAP test, compiled into gates native to IBM's superconducting-qubit devices [102]. Here $H$ and $T$ denote the Hadamard and $\pi/8$ phase gates, respectively. At the end of the circuit one measures the expectation value of the Pauli $Z$ operator on the ancilla qubit. (b) The circuit $U_2$ corresponds to the ancilla-based algorithm for computing the purity discovered in [61] through a machine-learning subroutine designed to minimize the number of CNOTs required (see also [103]). Here $W = T^{\dagger} H$. At the end of the circuit one measures the expectation value of the Pauli $Z$ operator on the ancilla qubit.
Let us analyze the remarkable fact that the special choice $A = Z$ leads to the exact operator measured in two distinct ancilla-based circuits computing the purity: the Swap Test [34] and the algorithm discovered in [61]. For completeness, these two circuits are shown in Fig. 1 for the case when $\rho$ is a single-qubit state. While both circuits end up computing the purity of $\rho$, they do so by implementing distinct unitaries. This can be seen by evaluating the Schmidt rank across the ancilla/data-qubit cut of the two circuits: this rank is found to be 2 for the unitary $U_1$ displayed in Fig. 1(a), and 3 for the unitary $U_2$ displayed in Fig. 1(b). This indicates that both circuits are fundamentally different, in the sense that there are no local operations mapping one to the other [61]. Still, one can verify that $U_1^{\dagger}(A \otimes \mathbb{1} \otimes \mathbb{1}) U_1 = U_2^{\dagger}(A \otimes \mathbb{1} \otimes \mathbb{1}) U_2$ where $A = Z$. Hence, our results shed new light on the connection between these two circuits.
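The conjugation relation $U^{\dagger}(Z \otimes \mathbb{1} \otimes \mathbb{1}) U = Z \otimes \mathrm{SWAP}$ can be verified explicitly for the textbook swap-test unitary (Hadamard, controlled-SWAP, Hadamard); the sketch below is our own reconstruction of that canonical circuit and does not reproduce the compiled circuits of Fig. 1:

```python
import numpy as np

I4 = np.eye(4)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Z = np.diag([1.0, -1.0])

# SWAP on two qubits: SWAP |i>|j> = |j>|i>
SWAP = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        SWAP[j * 2 + i, i * 2 + j] = 1.0

# Controlled-SWAP: identity on the |0> ancilla branch, SWAP on |1>
CSWAP = np.block([[I4, np.zeros((4, 4))], [np.zeros((4, 4)), SWAP]])

# Canonical swap test: H on ancilla, controlled-SWAP, H on ancilla
U = np.kron(H, I4) @ CSWAP @ np.kron(H, I4)

# Heisenberg-evolved ancilla observable equals Z x SWAP
relation_holds = np.allclose(U.conj().T @ np.kron(Z, I4) @ U, np.kron(Z, SWAP))

# Feeding |0><0| x rho x rho and measuring Z on the ancilla yields Tr[rho^2]
rng = np.random.default_rng(2)
G = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = G @ G.conj().T
rho = rho / np.trace(rho)
state = np.kron(np.diag([1.0, 0.0]), np.kron(rho, rho))
z_expectation = np.trace(np.kron(Z, I4) @ U @ state @ U.conj().T).real
```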
Appendix E: Classifying with Eq. (18)

In the main text we argued that, when the classification task has two symmetry groups associated with the two different classes, one can classify the data if there exists a $G_1$-invariant model $h_{\theta}^{(k)}$. When $c \notin [b_1, b_2]$ one can readily use this model for unambiguous classification: when the model returns the value $c$ (a value different from $c$), one assigns the label $y = 1$ ($y = 0$), which corresponds to the true label of the state to be classified. However, when $c \in [b_1, b_2]$ we can still perform classification, but at the cost of misclassifying some of the states. Indeed, there will now be cases where one measures a value $h_{\theta}^{(k)}(\rho_i) = c$ despite the fact that the underlying state has true label $y_i = 0$, yet assigns it the label $y = 1$. The probability of such a misclassification event is quantified in the following.

Lemma 2. Let $P(0|c)$ be the probability of misclassification, which occurs when the true label of a state $\rho_i$ is $y_i = 0$ given a model value $h_{\theta}^{(k)}(\rho_i) = c$. Accordingly, let $P(c|0)$ be the probability that the model takes the value $c$ when the data has label $y_i = 0$. Assuming equal probability of sampling states belonging to each of the two classes, we have
$$P(0|c) = \frac{P(c|0)}{1 + P(c|0)}. \quad \mathrm{(E2)}$$

Lemma 2 shows that the probability of misclassification will remain small as long as the probability that $h_{\theta}^{(k)}(\rho_i) = c$ given $y_i = 0$ is small. Let us now provide a proof of Lemma 2.
By Bayes' rule and the assumption of equal priors, $P(0|c) = P(c|0)/(P(c|0) + P(c|1))$. Finally, recalling that when $y_i = 1$ the model outcome is always $c$, we have $P(c|1) = 1$, such that we recover (E2).
Going further, we can bound this misclassification probability if we know the expectation value and variance of $h_{\theta}^{(k)}(\rho_i)$ for states with label $y_i = 0$. First, note that $P(c|0) \geq 0$, such that, according to (E2), $P(0|c) \leq P(c|0)$. Next, let us define $X$ to be the random variable corresponding to the QML model output, i.e., $X = h_{\theta}^{(k)}(\rho_i)$ for a random state $\rho_i$. We denote by $\overline{X}_0 = \mathbb{E}_{y_i=0}[h_{\theta}^{(k)}(\rho_i)]$ the expectation value of this variable conditioned on the sampled states having label $y_i = 0$. Assuming (without loss of generality) that this expectation value is greater than $c$, and defining $\delta = \overline{X}_0 - c > 0$, we find via Cantelli's inequality that
$$P(X - \overline{X}_0 \leq -\delta) \leq \frac{\mathrm{Var}_0[X]}{\mathrm{Var}_0[X] + \delta^2},$$
where $\mathrm{Var}_0[X] = \overline{X^2}_0 - \overline{X}_0^2$ is the variance of $X$ when sampling states with label $y_i = 0$. It follows that
$$P(0|c) \leq P(c|0) \leq P(X - \overline{X}_0 \leq -\delta) \leq \frac{\mathrm{Var}_0[X]}{\mathrm{Var}_0[X] + \delta^2}, \quad \mathrm{(E6)}$$
showing that for a separation $\delta$ large relative to the variance $\mathrm{Var}_0[X]$, the misclassification error will be small. Finally, note that when taking into account additive errors $\epsilon$ in estimating the output of the model, one would assign the label 1 to model values estimated in a range $C = [c - \epsilon, c + \epsilon]$. In this case, misclassification would arise when the model value is estimated in $C$ despite the true label of the state being 0, and Equations (E2) and (E6) could readily be extended to this scenario.
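Both Eq. (E2) and the Cantelli bound can be sanity-checked with a toy Monte Carlo; the class-0 output distribution below is an arbitrary Gaussian chosen purely for illustration, not derived from any model in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
c = 1.0
p_c_given_0 = 0.1   # assumed probability that a class-0 state outputs c
n = 200_000

# Equal priors; class-1 states always output c, class-0 states output c
# with probability p_c_given_0.
labels = rng.integers(0, 2, size=n)
outputs_c = np.where(labels == 1, True, rng.random(n) < p_c_given_0)

# Empirical P(0 | output = c) vs the closed form P(c|0) / (1 + P(c|0))
p0_given_c_emp = np.mean(labels[outputs_c] == 0)
p0_given_c_formula = p_c_given_0 / (1 + p_c_given_0)

# Cantelli: P(X - mean <= -delta) <= Var / (Var + delta^2) for class-0 outputs
x = rng.normal(loc=2.0, scale=0.5, size=n)  # toy class-0 model outputs
delta = x.mean() - c
cantelli_bound = x.var() / (x.var() + delta**2)
tail_prob = np.mean(x - x.mean() <= -delta)
```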