A Unified Framework for Quantum Supervised Learning

Quantum machine learning is an emerging field that combines machine learning with advances in quantum technologies. Many works have suggested great possibilities of using near-term quantum hardware in supervised learning. Motivated by these developments, we present an embedding-based framework for supervised learning with trainable quantum circuits. We introduce both explicit and implicit approaches. The aim of these approaches is to map data from different classes to separated locations in the Hilbert space via the quantum feature map. We will show that the implicit approach is a generalization of a recently introduced strategy, so-called "quantum metric learning." In particular, with the implicit approach, the number of separated classes (or their labels) in supervised learning problems can be arbitrarily high with respect to the number of given qubits, which surpasses the capacity of some current quantum machine learning models. Compared to the explicit method, this implicit approach exhibits certain advantages for small training sizes. Furthermore, we establish an intrinsic connection between the explicit approach and other quantum supervised learning models. Combined with the implicit approach, this connection provides a unified framework for quantum supervised learning. The utility of our framework is demonstrated by performing both noise-free and noisy numerical simulations. Moreover, we conduct classification tests with both implicit and explicit approaches on several IBM Q devices.


I. INTRODUCTION
Quantum computation has been intensively studied over the past few decades and is expected to outperform its classical counterpart in certain computational tasks [1][2][3]. In this novel approach for computation, information is stored in the quantum states of an appropriately chosen and designed physical system, which resides in a complex Hilbert space H, and quantum bits (qubits) are used as the underlying building blocks and processing units. The power of a quantum computer is in its ability to store and process information coherently in the tensor-product Hilbert space [1] with entanglement being a characteristic byproduct or even a potential resource for quantum information processing [4,5]. Quantum computations have been shown to provide dramatic speedup in solving some important computational problems, such as factorization of a large number via Shor's algorithm [6] and the unstructured search using Grover's algorithm [7], which are two prominent examples among many that have been discovered.
At the same time, machine learning (ML) has become a powerful tool in modern computation. For example, ML has been successful in computer vision [8][9][10], natural language processing [11], and drug discovery [12]. Building on this history, a natural question is whether quantum computers may also provide substantial speedups for ML tasks [13][14][15][16]. Several previous works have revealed potential quantum advantages in the field of unsupervised learning [17][18][19][20][21]. For example, in Ref. [20], the authors provide quantum algorithms for clustering problems, which could, in principle, yield an exponential speedup. In Ref. [21], the authors introduce a quantum version of k-means clustering, namely, q-means, and present an efficient quantum procedure.
Using quantum computation in supervised learning also has garnered increased attention [22][23][24]. For example, the authors of Ref. [25] present a quantum version of support vector machines (SVM) that showed possible exponential speedup. For near-term applications, variational strategies have been proposed to classify real-world data [22,[26][27][28][29]. Classification is among the standard problems in supervised learning [30,31], and variational methods using short-depth quantum circuits with trainable parameters have given rise to a quantum-classical hybrid optimization procedure. Such frameworks have proven to be capable of performing complex classification tasks [26,27,29,[32][33][34], and many more likely will appear. It is probable that such variational methods will be able to learn complex representations while still being robust to noise in near-term quantum devices (e.g., noisy intermediate-scale quantum [NISQ]) [29,35,36].
"Traditional" quantum supervised learning (QSL) models rely on the encoding of classical data x into some quantum state |ψ(x)⟩. This state then undergoes a parameterized quantum circuit U(θ). At the end, the state is measured, and the outcome of the measurement usually is interpreted as the output of the learning model. Although the procedures of previously proposed works [15,[27][28][29] appear similar, the motivations underlying their strategies vary. For example, the quantum circuit learning algorithm proposed in [29] is inspired by classical neural networks. Meanwhile, in [28,37], the authors exploit and formally establish the connection between quantum computation and the kernel method, where they interpret the step of encoding classical data x into a quantum state as a quantum feature map. Thus, a clear picture emerges: classical data x are embedded into some quantum state |x⟩, i.e., a data point in the Hilbert space H. (This space also is called the quantum feature space, analogous to the feature space in classical ML.) Then, a decision boundary is learned by training the variational circuit to adapt the measurement basis, which is analogous to the classical approach where a decision boundary is learned to separate classes.
Metric learning is a well-known method in the classical ML context [38]. The aim is to learn an appropriate distance function over data points. This method recently has been extended to the quantum context by Lloyd et al. [39]. Instead of focusing on training the variational layer that adapts the measurement basis, the authors propose to train the embedding circuit and proffer a remarkable notion of "well-separation" of data points. Per their argument, a significant amount of computational power spent on processing classically embedded data can be eased using such a strategy.
Aside from such an advantage, we posit that the ability to use a quantum circuit to represent data in a complex Hilbert space and the idea of "well-separation" have further remarkable consequences. We argue that previous "traditional" QSL methods [15,[27][28][29] essentially achieve a certain "well-separation" of data points. Here, we provide a unified, generic framework and categorize approaches into two different types, implicit and explicit, which are described in detail, backed up by numerical simulations, and tested on real quantum devices. The goal is to train the embedding circuit to produce clusters of data from different classes. In the implicit approach, the "centers" of these clusters are random. In the explicit approach, the cluster "centers" are constrained to lie in or near some predetermined subspaces of the Hilbert space. We show that both explicit and implicit approaches exhibit promising classification ability. In particular, the method proposed by Lloyd et al. [39] is a binary version of the implicit approach. We point out that the explicit approach can conceptually unify "traditional" QSL methods, such as [15,27,29]. These two approaches then constitute our unified framework for QSL.
The following summarizes the contributions of this work: • We introduce two approaches for QSL, implicit and explicit, that constitute a generic embedding-based framework.
• We show that the implicit approach is a generalization of the quantum metric learning method proposed in [39]. Such generalization allows us to manage the multi-class classification problem: the number of separated classes (or labels) is independent of the number of qubits used in the quantum circuit. Therefore, it sheds light on constructing a universal quantum classifier.
• We demonstrate that the explicit approach can conceptually unify other models for QSL. Along with the generalization provided by the implicit approach, our work provides a complete unification of QSL frameworks.
• We implement both learning approaches on NISQ devices and compare the results with noisy simulations. We demonstrate the framework's success and clarify the cases where the results on real devices and noisy simulations do not agree well.
The structure of the paper is as follows: Section II A presents the main conceptual tool of our framework. In Sections II B 1 and II C 1, we discuss the implicit and explicit approaches, present results from numerical experiments and runs from real devices, and provide numerical evidence that the implicit approach is especially robust with small training size. Some discussions regarding our framework's prospects in the near-term era are presented in Section IV. Section V concludes the primary work. Appendix B provides an additional example to illustrate the unification. Appendix C discusses a remarkable consequence of focusing on the embeddings part instead of the measuring part in QSL, which could avoid a systematic issue of misclassification of the one-versus-all strategy.

A. Basic concept
We first introduce the basic concept of classification, which can be illustrated by a simple map

f : x ↦ f(x, θ) = (f_0, f_1, . . . , f_{L−1}) ∈ R^L,

where f(x, θ) is called a classifying vector of some input data x, and θ refers to the network's or circuit's parameters. Each component f_i is generally of the form

f_i(x, θ) = Tr(|x⟩⟨x| M_i),

where |x⟩ is the corresponding quantum state of classical data x, and M_i, in general, is some Hermitian operator. We generally assume that N-dimensional classical data x are mapped to |x⟩ via a k-qubit parameterized circuit. The specific form of M_i depends on the implicit or explicit approach (to be discussed later).
The value of f i depends on circuit parameters θ and the input feature x.
In a supervised learning problem with L separated labels, we are given a training set together with corresponding labels, X × Y = {(x, i)}, where i ∈ {0, 1, . . . , L − 1} is the label of the data point x. We need to predict the label for some other unseen data x. In the quantum setting, we simply use its representation by a quantum state |x⟩ instead of the classical data x. The number of components f_i, or equivalently the dimension of f(x, θ), is denoted by L. The value f_i (for convenience, assumed to be in the range 0 ≤ f_i ≤ 1) quantifies the likelihood that the input x has label i. In this sense, the method is somewhat similar to a classical neural network, where information is fed forward from the input layer to the output layer. Numerous works have explored the relation between quantum computation and neural networks [26,27,[40][41][42][43]. The key relation extends from the building block of the quantum circuit model: quantum gates. These gates carry out unitary transformations on the input quantum state, which is a vector in some Hilbert space H. In the graphical representation (distinctively illustrated in Ref. [27]), the action of a quantum gate on an input state |ψ⟩ produces an output state |ρ⟩ and can be represented as a fully connected two-layer network. Hence, a full quantum circuit generally can be represented by such a fully connected network with a certain number of layers. Measuring quantum states then corresponds to a non-linear activation function.
Thus, to make a prediction, we "forward" x to such a classifying vector f(x, θ) and assign to it a label according to the highest value among the f_i. The accuracy of correct assignment depends on the circuit parameters θ. Now, we provide a strategy to train the circuit. For each label i, assume there are N_i training points with that label, and there are a total of N data points, where N = Σ_i N_i. Let y_i be the so-called label vector of class i (which has dimension L), with components {y_i^j}_{j=1}^L = δ_ij, where δ_ij is the Kronecker delta. Let f_i^j be the classifying vector of the j-th data point in class i. We minimize the following cost or loss function:

C(θ) = (1/N) Σ_{i=0}^{L−1} Σ_{j=1}^{N_i} ||f(x_i^j, θ) − y_i||²,   (4)

where x_i^j is the j-th data point in class i. We finally note that any reasonable form of the loss function should work.
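The one-hot label vectors and a least-squares loss over classifying vectors can be sketched in plain numpy. This is a minimal illustration only; the text notes that any reasonable loss form should work, and the specific normalization used here is our assumption:

```python
import numpy as np

def one_hot(i, L):
    """Label vector y_i with components (y_i)_j = delta_ij."""
    y = np.zeros(L)
    y[i] = 1.0
    return y

def least_squares_cost(f_vecs, labels, L):
    """Mean squared distance between classifying vectors and their label vectors."""
    return float(np.mean([np.sum((f - one_hot(i, L)) ** 2)
                          for f, i in zip(f_vecs, labels)]))

# two toy classifying vectors for an L = 3 problem
f_vecs = [np.array([0.9, 0.05, 0.05]),   # confidently class 0
          np.array([0.2, 0.7, 0.1])]     # leaning toward class 1
cost = least_squares_cost(f_vecs, [0, 1], L=3)   # 0.0775
```

A perfectly classified point (f equal to its label vector) contributes zero to the cost, so minimizing it pushes each f toward its one-hot target.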
State Overlaps: Given two pure quantum states represented by density matrices ρ ≡ |ρ⟩⟨ρ| and φ ≡ |φ⟩⟨φ|, the overlap, i.e., a similarity measure, of these two states is given by Tr(ρφ) = |⟨ρ|φ⟩|². State overlaps play an important role in our subsequent construction of the implicit and explicit approaches. In the general case of mixed states, the SWAP test quantum procedure [1] can be used to evaluate Tr(ρφ) up to an additive error ε. In the special case of pure states, the inversion test [39] can be used to evaluate the overlap |⟨φ|ρ⟩|² between two quantum states, provided the circuit U used to create either |φ⟩ or |ρ⟩ can be efficiently inverted. Both schemes require only shallow circuits. In our subsequent experiments, we implement both schemes for classification on real devices.
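The overlap identity Tr(ρφ) = |⟨ρ|φ⟩|² for pure states is easy to verify numerically; this numpy snippet stands in for the SWAP/inversion test circuits and is not part of the paper's implementation:

```python
import numpy as np

def density(psi):
    """Normalize and form the pure-state density matrix |psi><psi|."""
    psi = psi / np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

psi = np.array([1.0, 0.0], dtype=complex)               # |0>
phi = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)  # (|0> + |1>)/sqrt(2)

rho, var_phi = density(psi), density(phi)
overlap_tr = np.trace(rho @ var_phi).real   # Tr(rho phi)
overlap_amp = abs(np.vdot(psi, phi)) ** 2   # |<psi|phi>|^2; both equal 1/2 here
```

For these two states the overlap is 1/2, confirming that the trace form and the amplitude form agree.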

B. Implicit approach

1. Construction
In this approach, the data from the same class, after going through the quantum circuit Φ(x, θ), produce clusters (groups of nearby data points) in the Hilbert space H. Clusters corresponding to different classes should become maximally separated after minimizing the cost function (see Figs. 1 and 2).
To describe our supervised learning problem, we assume that for each label i, there are N_i training points that will be transformed to quantum states {|x_i^j⟩}_{j=1}^{N_i}. The formula for M_i in this case is

M_i = σ_i ≡ (1/N_i) Σ_{j=1}^{N_i} |x_i^j⟩⟨x_i^j|,

which is exactly an ensemble of quantum states. This ensemble may be interpreted as the collection of the corresponding training points from class i on H, and it can be obtained by sampling from the training set {x_i^j}_{j=1}^{N_i}. The classifying vector f(x, θ) now becomes

f(x, θ) = (Tr(|x⟩⟨x| σ_0), Tr(|x⟩⟨x| σ_1), . . . , Tr(|x⟩⟨x| σ_{L−1})).
We will focus on optimizing those quantities {f i } and using the values to assign a label to x.
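A minimal numpy sketch of the implicit classifying vector: each class is represented by the ensemble σ_i of its embedded training states, and an unseen point is assigned the label with the largest average overlap. The states below are hand-picked toys rather than outputs of a trained embedding:

```python
import numpy as np

def density(psi):
    """Pure-state density matrix |psi><psi|."""
    return np.outer(psi, psi.conj())

def class_ensemble(states):
    """sigma_i = (1/N_i) sum_j |x_i^j><x_i^j|."""
    return sum(density(s) for s in states) / len(states)

def classifying_vector(x, ensembles):
    """f_i = Tr(|x><x| sigma_i): average overlap with class i's training states."""
    rho = density(x)
    return np.array([np.trace(rho @ sig).real for sig in ensembles])

# toy embedded training states: class 0 sits at |0>, class 1 at |1>
ket0 = np.array([1.0, 0.0], dtype=complex)
ket1 = np.array([0.0, 1.0], dtype=complex)
sigmas = [class_ensemble([ket0, ket0]), class_ensemble([ket1, ket1])]

f = classifying_vector(ket0, sigmas)   # unseen point embedded at |0>
label = int(np.argmax(f))              # assigned label 0
```

Because the test point coincides with the class-0 cluster, f = (1, 0) and the argmax rule returns label 0.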

FIG. 2: Illustration of the implicit approach. After the training procedure, the "distance" between any two clusters (represented by dotted lines) becomes maximal. The "center" of each cluster is not fixed, as the training process will move the centers to produce maximally separated clusters. Note: the data points in this picture are not related to the Iris dataset in our subsequent experiment or to the data points in Fig. 3.

After applying Eq. (4), the cost function becomes (up to constant terms)

C = Σ_{i<j} Tr(σ_i σ_j) − [(L − 1)/2] Σ_i Tr(σ_i²).   (7)

We note that each cross term is a sum of moduli squared of overlaps:

Tr(σ_i σ_j) = (1/(N_i N_j)) Σ_{k=1}^{N_i} Σ_{l=1}^{N_j} |⟨x_i^k|x_j^l⟩|².   (8)

Hence, in the training procedure, we can use either the SWAP or inversion test to evaluate the cost. Consider a binary classification problem (L = 2). The cost function is

C = Tr(σ_0 σ_1) − (1/2)[Tr(σ_0²) + Tr(σ_1²)] = −(1/2) Tr[(σ_0 − σ_1)²].   (9)

The optimization thus amounts to maximizing the Hilbert-Schmidt distance between the two data clusters, which is highlighted in Ref. [39]. Therefore, we have shown this implicit approach is a generalization of the binary method discussed in Ref. [39].
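The Hilbert-Schmidt distance Tr[(σ_0 − σ_1)²] between two class ensembles can be computed directly; well-separated clusters correspond to a large value. A numpy check on the two extreme cases (orthogonal pure ensembles versus identical ensembles):

```python
import numpy as np

def hs_distance(a, b):
    """Hilbert-Schmidt distance Tr[(a - b)^2] = Tr(a^2) + Tr(b^2) - 2 Tr(ab)."""
    d = a - b
    return np.trace(d @ d).real

p0 = np.array([[1.0, 0.0], [0.0, 0.0]])  # |0><0|
p1 = np.array([[0.0, 0.0], [0.0, 1.0]])  # |1><1|

d_max = hs_distance(p0, p1)   # orthogonal pure ensembles: 1 + 1 - 0 = 2
d_min = hs_distance(p0, p0)   # identical ensembles: 0
```

For single-qubit pure ensembles the distance ranges from 0 (fully overlapping clusters) to 2 (orthogonal clusters), the regime the implicit training drives toward.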

Training with QRAM
FIG. 3: Illustration of the explicit approach. The "center" of each cluster is fixed. For class 0 (red points), the position of the center is v_0 = (0, 0, 1). For class 1 (green points), the position is v_1 = (1, 0, 0). For class 2, the position is v_2 = (0, 1, 0). Notably, this separation is not exactly the same as desired in Eq. (10) because the Hilbert space associated with a single qubit has dimension 2, and it can be decomposed into, at most, two orthogonal subspaces. Nevertheless, this figure illustrates the idea of the explicit approach.

If quantum random access memory (QRAM) [44] is available, the cost of the training and testing procedures will be reduced by factors of ∼ Σ_{i≤j} N_i N_j and Σ_i N_i, respectively, where N_i is the number of data points in each training set i. The calculation of the cost function and data classification can then be done in time O(1). The data corresponding to class i can be loaded to a quantum state

(1/√N_i) Σ_{j=1}^{N_i} |j⟩ |"x_i^j"⟩,
where the index j represents the address of the memory cell where x_i^j resides. We use |"x_i^j"⟩ to denote the classical data (not the embedded feature state) loaded into a quantum register.
A third register is initialized in |0 . . . 0⟩ and is used to implement the quantum feature map. The application of R_y(x) (refer to Fig. 6) on this register with its argument being x_i^j can be done by performing the rotation conditioned on the register holding the classical values x_i^j, i.e., a conditional rotation c-R_y(x). In this way, at the expense of a more complicated circuit to implement the conditional rotation, we obtain the entangled state

(1/√N_i) Σ_{j=1}^{N_i} |j⟩ |"x_i^j"⟩ |x_i^j⟩.

Tracing over the first and second registers (i.e., without doing anything on them afterwards), the third register is in the mixed state σ_i. With the QRAM, we do not need to repeat the circuits (with different rotations) to sample every data point individually from the training set to obtain an effective σ_i (about N_i times).
Using the controlled-SWAP gate on this third register and another embedded state |x⟩ for an unknown data point (i.e., the SWAP test), we can directly measure the fidelity ⟨x|σ_i|x⟩. However, computing the pairwise overlaps in Eq. (8) without the QRAM requires using the SWAP test N_i N_j times. As such, it would be useful to design an efficient quantum subroutine of low-depth circuits that evaluates the cost in Eq. (7) directly, exploiting the QRAM and reducing the iterative evaluation steps.

Classifying over a large number of classes
In most current quantum ML models, the measurement outcome usually is interpreted as the output of the learning model. In a binary classification problem, a one-qubit measurement suffices to classify the data, as there are only two possible outcomes, and one can draw inferences from such a measurement. For example, if the probability of obtaining class zero satisfies P(outcome = 0) ≥ 0.5, we assign the data to class 0; otherwise, we assign it to class 1. For multi-class classification, multi-qubit measurements need to be employed: a circuit with k qubits can classify up to 2^k different labels. Our implicit approach surpasses this because we only aim to obtain the classifying vector f. The dimension of f, i.e., the number of classes, can be quite large. Real-world supervised learning problems may contain overwhelmingly numerous classes, such as in face recognition. Thus, our framework may prove useful for these practical tasks.
Still, even with a single qubit, multi-class classification can be done using this approach (see Fig. 2). Relevant work has been carried out in Ref. [45], showing that a single qubit is sufficient to construct a universal classifier and is able to deal with multidimensional input data and multi-label output of supervised learning problems. To handle a multidimensional input, the authors propose a data re-uploading strategy. They achieve multi-class classification by introducing a bias term λ in the measurement outcome. Our approach differs from [45] because we use the parameterized quantum circuit to represent the data in the Hilbert space and exploit its "vastness": data from different classes are "aligned" in separate locations. To handle multidimensional data, one may opt to follow the same strategy as in Ref. [45] or engage a different embedding routine. Addressing the problem of efficient data encoding is beyond the scope of this work.
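To illustrate that the dimension of the classifying vector f is decoupled from the qubit count, here is a hypothetical single-qubit example with L = 4 class "centers" fixed by hand (a trained embedding would place them instead):

```python
import numpy as np

def bloch_state(theta, phi):
    """Single-qubit pure state at Bloch angles (theta, phi)."""
    return np.array([np.cos(theta / 2),
                     np.exp(1j * phi) * np.sin(theta / 2)])

# hand-picked class "centers": L = 4 labels on ONE qubit, i.e., more
# classes than the 2^1 = 2 outcomes of a single-qubit measurement
centers = [bloch_state(np.pi / 2, k * np.pi / 2) for k in range(4)]

def f_vec(x):
    """Classifying vector f_i = |<center_i|x>|^2; its length is L, not 2^k."""
    return np.array([abs(np.vdot(c, x)) ** 2 for c in centers])

x = bloch_state(np.pi / 2, 0.1)      # a point on the equator near center 0
label = int(np.argmax(f_vec(x)))     # assigned to class 0
```

The four equatorial centers are not mutually orthogonal (impossible in dimension 2), yet the argmax over overlaps still distinguishes them, which is the point of decoupling L from the qubit count.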

C. Explicit approach

1. Construction
The implicit approach emphasizes training the embedding circuit to produce separated clusters on H. However, the "centers" of these clusters are somewhat random. As long as relative distances among these clusters are maximal and those among data points within the same cluster are minimal, the method achieves its goal.
FIG. 6: Unit embedding circuit Φ. In our implementation, this unit is repeated 4 times; hence, the total number of trainable parameters is 20. At the end, the feature layer is repeated once more. The repetition of both feature and parameter layers has been used in Ref. [45] as a data re-uploading strategy that yielded better classification ability.

With the explicit approach, the cluster "positions" are designed to be fixed and separated into orthogonal subspaces. The main intuition is that with enough qubits, the Hilbert space is vast and complex and can accommodate many smaller subspaces where data clusters can reside. These subspaces are well defined and well separated. If the data from different classes are "approximately" mapped to their proper subspaces, they are well separated by construction. The approximation here means that the embedded data might not lie completely within the desired subspace, but possibly only in its vicinity. Then, we can "measure" the distance from a data point in H to the different subspaces, and classification of such a data point can be done accordingly. Again, considering the supervised learning problem with L labels, we decompose H into

H = H_0 ⊕ H_1 ⊕ · · · ⊕ H_{L−1},   (10)
assuming that L ≤ dim(H). We can always achieve this condition by adding more qubits to the circuit. Our aim is to approximately map a data point to its "label subspace" H_i. Without loss of generality, let dim(H_i) = k and let H_i be spanned by {|ψ_i^j⟩}_{j=1}^k. The set of operators associated with label i, or equivalently with the subspace H_i, is {|ψ_i^j⟩⟨ψ_i^j|}_{j=1}^k. The likelihood of given data |x⟩ having label i may be quantified by the projection of |x⟩ onto the corresponding "label" subspace. Therefore, M_i could take the form

M_i = Σ_{j=1}^{k} |ψ_i^j⟩⟨ψ_i^j|,

which essentially is the targeted ensemble density operator (up to a normalization) associated with label i. The same strategy is then followed: we minimize the cost in Eq. (4) and assign some unseen data x according to the values of f_i.

As an example, consider the binary supervised learning problem with L = 2 and a 1-dim dataset X = X_A ∪ X_B, where X_A and X_B are the training sets with labels 0 and 1, respectively, as depicted in Fig. 4. For simplicity, we use one qubit in the embedding circuit (refer to the 1-qubit toy model in [39]); hence, dim H = 2. We then make the decomposition H = H_0 ⊕ H_1, where H_0 and H_1 are spanned by |0⟩ and |1⟩, respectively. The classifying vector f then has the form

f(x, θ) = (Tr(|x⟩⟨x| · |0⟩⟨0|), Tr(|x⟩⟨x| · |1⟩⟨1|)) = (|⟨0|x⟩|², |⟨1|x⟩|²).   (12)

Following the same procedure gives (up to an overall normalization) the cost value

C = 1 − (1/2) Tr[(ρ_A − ρ_B) σ_z],   (13)

where ρ_A and ρ_B are the state ensembles of the two respective training sets A and B, and σ_z is the Pauli-Z matrix. Minimization of this cost C with respect to the circuit's parameters will give an embedding Φ(x, θ) that maps the data from A to the vicinity of |0⟩ in H and the data from B to the vicinity of |1⟩ in H, as illustrated in Fig. 4. An alternative picture also can be drawn from the described 1-dim dataset: if we choose the label space to be {H_0, H_1}, then the classifying vector in Eq. (12) can be obtained by simply performing a measurement on the embedded state |x⟩ in the z basis.
Choosing a different label space, e.g., by decomposing H = H_+ ⊕ H_−, where H_+ and H_− are spanned by (|0⟩ + |1⟩)/√2 and (|0⟩ − |1⟩)/√2, respectively, a measurement in the x basis (i.e., of the observable σ_x) would need to be used instead. The classifying vector f in this latter case is a direct result of such a σ_x measurement. Instead of clustering around the north and south poles of the Bloch sphere (Fig. 4), the quantum circuit training would then make data points cluster around the "+x pole" and "−x pole," where x̂ is the unit vector pointing along the positive x direction.
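The two label-space choices differ only in which projectors enter M_i. A quick numpy check on a toy pure state (no trained embedding involved) shows how the same point yields different classifying vectors under the z-basis and x-basis label spaces:

```python
import numpy as np

ket0 = np.array([1.0, 0.0], dtype=complex)
ket1 = np.array([0.0, 1.0], dtype=complex)
plus = (ket0 + ket1) / np.sqrt(2)    # spans H_+
minus = (ket0 - ket1) / np.sqrt(2)   # spans H_-

def proj(v):
    """Rank-1 projector |v><v|."""
    return np.outer(v, v.conj())

def f_explicit(x, label_states):
    """f_i = Tr(|x><x| M_i), with M_i the projector onto label subspace H_i."""
    rho = proj(x)
    return np.array([np.trace(rho @ proj(b)).real for b in label_states])

x = plus                            # an embedded data point sitting at |+>
fz = f_explicit(x, [ket0, ket1])    # z-basis label spaces: [0.5, 0.5]
fx = f_explicit(x, [plus, minus])   # x-basis label spaces: [1.0, 0.0]
```

The point |+⟩ is ambiguous under the {H_0, H_1} decomposition but perfectly classified under {H_+, H_−}, which is why the label-space choice must match where the training drives the clusters.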

Connection to "traditional" QSL models
By closely examining f in Eq. (12), the value f_0 = Tr(|x⟩⟨x| · |0⟩⟨0|) = |⟨0|x⟩|² turns out to be the probability of obtaining the state |0⟩ when measuring |x⟩ in the computational basis. Most current quantum classifiers [27,28] rely on these measurement outcomes after applying a general circuit Φ(x, θ) = W(θ)U(x) to some initial state |0⟩ for classification. Hence, under the view of embeddings, by choosing the appropriate label space (specifically, the standard computational basis states), there is an intrinsic connection between this explicit approach and other traditional models [27,29,37]. More precisely, "traditional" approaches can be unified by this explicit approach.
Such unification offers a two-fold advantage. First, the evaluation of overlaps between the input data |x⟩ and |0⟩ or |1⟩, as in Eq. (12), can be done simply by letting x undergo the embedding circuit Φ(x, θ) once and performing measurements, instead of invoking the embedding circuit twice (as in the SWAP test subroutine). Additionally, the cost evaluation in Eq. (13) does not necessarily need to be done in an iterative manner: in Refs. [46,47], the authors provide elegant and efficient methods to encode the cost evaluation directly into quantum circuits. Hence, the training time can be reduced.

III. NUMERICAL SIMULATIONS AND REAL-DEVICE EXPERIMENTS
With each approach, we train on the ideal simulator then use the optimized circuit to test on the ideal simulator, noisy simulator, and several real devices.

A. Implicit Approach experiment
Datasets: For illustration purposes, we target the Iris dataset [48,49] with L = 3 labels (Fig. 5). There are 50 data points in each class, for a total of N = 150 data points. Ten data points are taken from each class to serve as the training data, and the remaining 40 per class are used for testing. Aside from classification, our aim is to demonstrate the formation of clusters in the feature Hilbert space H.
Quantum Embeddings: We use the same so-called QAOA-like ansatz as in [39] (Fig. 6) for the embedding. The unit circuit Φ(x, θ) is composed of a feature layer U(x) followed by the parameter layer W(θ). Hence, we have Φ(x, θ) = W(θ)U(x), and the model is compact. A possibly useful design of this embedding unit is to mix the feature parameters x and the tunable parameters θ to reduce the depth while maintaining the efficiency (see [45]). For instance, instead of R_y(x_1), one can consider R_y(θ_1 x_1) or, more generally, R_y(g(θ_1, x_1)), where g is some function.
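A single-qubit caricature of the layered structure Φ(x, θ) = W(θ)U(x) with data re-uploading, written in numpy. The actual ansatz of Fig. 6 acts on more qubits and includes entangling gates; only the layer alternation, the 4 repetitions, and the final extra feature layer follow the text:

```python
import numpy as np

def ry(a):
    """Single-qubit Y rotation R_y(a)."""
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def embed(x, thetas, reps=4):
    """Phi(x, theta): alternate feature layer R_y(x) and parameter layer
    R_y(theta_r), then repeat the feature layer once more at the end."""
    psi = np.array([1.0, 0.0])               # start from |0>
    for r in range(reps):
        psi = ry(thetas[r]) @ ry(x) @ psi    # W(theta_r) U(x)
    return ry(x) @ psi                       # final feature layer

state = embed(x=0.3, thetas=[0.1, 0.2, 0.3, 0.4])
# note: since every rotation here shares the Y axis, the whole circuit
# collapses to ry(5x + sum(thetas)); the real multi-qubit ansatz with
# entanglers does not simplify this way
```

The collapse noted in the comment is exactly why single-axis toy circuits are only illustrative: expressive embeddings need rotations about different axes or entangling layers between repetitions.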
Training Stage: Because there are L = 3 labels in our problem, the cost function is

C = Tr(σ_0 σ_1) + Tr(σ_0 σ_2) + Tr(σ_1 σ_2) − [Tr(σ_0²) + Tr(σ_1²) + Tr(σ_2²)].   (14)

The training procedure is as follows:
• Data from the training set are mapped to quantum states.
• Define and use the SWAP test subprogram to evaluate Tr(σ_i²) and Tr(σ_i σ_j). Later, we also use the inversion test.
• Minimize the cost C in Eq. (14) over the circuit parameters.

FIG. 8: (a) Top panel: Initial distribution of data in H, in which the parameters in the variational quantum circuit are randomized. (b) Bottom panel: After 100 training epochs, the data from the same class form a cluster on H, as the overlaps between their quantum states are high (brighter color). The visualization clearly shows that class 0 (red points) is more separated from the other two classes, whereas classes 1 (blue points) and 2 (green points) are less separated from each other. This observation is in agreement with the testing results, as all testing points from class 0 are predicted with absolute accuracy, and false predictions mainly come from classes 1 and 2.
Our simulation uses the PennyLane software package [50] and the optimization of C over circuit parameters is done using the RMSprop [51] optimizer with a learning rate of 0.01. Figure 7 shows the training curve. As minimization takes place, the embedded data in H are expected to form clusters, while those from different classes separate from each other. This is confirmed, as shown in Fig. 8, in the comparison of the overlap of embedded data before and after the training. In particular, the overlaps between the embedded data from different classes become small after the training. This is especially the case for the overlap of class 0 with both class 1 and class 2. Thus, we have verified the well-separation of the embedded data from different classes.
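The implicit training loop can be mimicked end-to-end on a toy 1-dim, two-class problem. This sketch swaps PennyLane and RMSprop for plain finite-difference gradient descent and uses a hypothetical one-parameter embedding x ↦ R_y(θx)|0⟩ (a fixed rotation applied after R_y(x) would leave all overlaps unchanged, so the trainable parameter must enter nonlinearly). For simplicity only the cross term is minimized; including the purity terms of Eq. (9) would additionally reward tight clusters:

```python
import numpy as np

def ry(a):
    """Single-qubit Y rotation."""
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def ensemble(xs, th):
    """Class ensemble sigma: average of the embedded states R_y(th * x)|0>."""
    states = [ry(th * x) @ np.array([1.0, 0.0]) for x in xs]
    return sum(np.outer(s, s) for s in states) / len(states)

def cost(th, xa, xb):
    """Cross term Tr(sigma_A sigma_B): small when the two clusters separate."""
    return np.trace(ensemble(xa, th) @ ensemble(xb, th)).real

xa, xb = [0.1, 0.2], [2.9, 3.1]      # two toy 1-dim classes
th, lr, eps = 0.5, 0.2, 1e-5
for _ in range(200):                  # finite-difference gradient descent
    grad = (cost(th + eps, xa, xb) - cost(th - eps, xa, xb)) / (2 * eps)
    th -= lr * grad
# the trained embedding now maps the two classes to nearly orthogonal states
```

Starting from a large inter-class overlap, the loop drives the cross term close to zero, i.e., the two ensembles end up nearly orthogonal, which is the "well-separation" the cost is designed to produce.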

Results of noiseless simulations
After obtaining the optimized circuit parameters, we use the optimal circuit to perform a test on classifying the remaining unused data (i.e., the test dataset). The overall accuracy obtained is 92.5%. Notably, only 30 data points (10 for each class) are used as the training set, which corresponds to 20% of the total data points. This demonstrates that the classifier can classify unseen data with high accuracy, despite being trained with a relatively small training dataset (more details in Sec. III C).

Testing results from noisy simulations
In real quantum hardware, noise and errors are important factors that reduce accuracy. To examine our method in the presence of noise, we test our classification with noise models acquired from IBM Q "backends." The device noise model is generated from the device calibration data and accounts for the gate error probability, gate length, the T_1 relaxation and T_2 dephasing times, as well as the readout error probability. For convenience, Table I shows the average gate errors for the four backends considered in this work.
We test the classification with the SWAP test circuit via the noisy simulations, and the results are tabulated in Table II. The accuracy seems largely unaffected by the noise: the values from the noisy simulator using the four noise models from the respective devices are 92.5%, 90.83%, 92.5%, and 92.5%. The circuit parameters used are obtained from the noiseless optimization. The reason for not using noisy simulators to obtain the parameters is that we will perform the same testing on the actual hardware; it would be impractical and too time consuming to perform the training directly on the hardware, as the job queue could be long and execution of the training circuits would have to be split over many jobs.

Run on quantum computers
Beyond the noisy simulations, we also test our ideally trained model on real quantum backends. Table II summarizes the detailed results, including simulations and real devices. For the same classification task, we run two different methods to obtain overlaps: the SWAP test, which uses five qubits, and the inversion test, which uses only two qubits.
Accuracy with the SWAP test ranges from 28% to 75% on the various backends. However, the inversion test accuracy remains stable at around 90%. This clearly shows substantial performance differences between the SWAP and inversion tests. Likely, the main factor that accounts for such a discrepancy is the controlled-SWAP gate (c-SWAP in Fig. 16) in the SWAP test circuit. The number of c-SWAP gates required for two n-qubit states scales as O(n). Each c-SWAP gate then is decomposed into many CNOT gates (which are noisy), as shown in Fig. 9. Despite the noisy simulations yielding accuracy around 90%, runs on the actual machines suffer accumulated errors not captured in the noise model used in the simulations. On the other hand, the inversion test does not need the c-SWAP gate and, hence, requires fewer CNOTs, but at the cost of doubling the quantum circuit depth. In our classification model, there is a trade-off between using either the SWAP or inversion test. As our small-size experiments have shown, the inversion test should be used for better classification on NISQ machines. However, it requires the ability to invert the embedding circuit and can only evaluate the overlaps between two data points (of pure states). Hence, classifying unseen data (by obtaining the classifying vector f) must be done in an iterative manner. Conversely, the SWAP test can handle mixed states, and the classification can be sped up with a QRAM. In addition, SWAP test performance can be further improved by using the methods introduced in [52]. Such an approach is hardware-dependent (as the authors examined IBM Q and Rigetti separately), and it requires fewer CNOT gates.
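The inversion test itself is simple to state: |⟨x_1|x_2⟩|² equals the probability of the all-zeros outcome after preparing U(x_2)|0⟩ and then running the inverted circuit U(x_1)†. A numpy check with a toy one-parameter single-qubit embedding (a stand-in, not the ansatz of Fig. 6):

```python
import numpy as np

def ry(a):
    """Single-qubit Y rotation."""
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def U(x):
    """Toy one-parameter embedding circuit."""
    return ry(x)

ket0 = np.array([1.0, 0.0])
x1, x2 = 0.4, 1.3

# inversion test: prepare U(x2)|0>, apply the inverse circuit U(x1)^dagger,
# then read out the probability of the all-zeros outcome
psi = U(x1).conj().T @ (U(x2) @ ket0)
p_zeros = abs(psi[0]) ** 2

# sanity check against the directly computed overlap |<x1|x2>|^2
direct = abs(np.vdot(U(x1) @ ket0, U(x2) @ ket0)) ** 2
```

No ancilla or c-SWAP gate appears anywhere, which is why the inversion test tolerates NISQ noise better; the price is that the circuit depth doubles and U must be invertible on hardware.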

B. Explicit Approach experiment
Datasets: To illustrate this approach, we target the make circles dataset with L = 2 labels (Fig. 10). Fifteen points are taken from each class to serve as the training data. For testing, we generate an additional 50 points for each class.
Training Stage: We choose the two subspaces spanned by |00⟩ and |11⟩, respectively, as label spaces and train the circuit to map data from class 0 (blue points) to H_00 and from class 1 (red points) to H_11. Given some input data x, the classifying vector is then

f(x, θ) = (Tr(|x⟩⟨x| · |00⟩⟨00|), Tr(|x⟩⟨x| · |11⟩⟨11|)) = (|⟨00|x⟩|², |⟨11|x⟩|²).

TABLE II: Noisy simulations and runs on actual backends for both the Iris and make circles datasets. The result on the top of each block corresponds to the noisy simulation (n.s.), and the bottom one corresponds to the real device (r.d.). "ST" denotes SWAP Test, and "IT" denotes Inversion Test.
FIG. 10: The make_circles dataset with N = 2 labels used in our explicit approach for classification.

The cost function is defined accordingly, with A and B referring to class 0 and class 1, respectively. Figure 11 presents the training curve in the noiseless simulation. The overall accuracy is 96%. As in the implicit case, we visualize the overlaps between training points before and after training with 100 epochs (Fig. 12). This confirms the well-separation of the embedded data from different classes in our explicit approach.
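The read-out of the explicit approach can be illustrated with a toy sketch: the embedded 2-qubit state is measured in the computational basis, and the label is decided by comparing the probabilities of the two label subspaces. The embedding here is a stand-in (hard-coded states near |00⟩ and |11⟩), not the paper's trained circuit.

```python
import numpy as np

def classify_explicit(state):
    """Toy read-out for the explicit approach: measure the embedded
    2-qubit state in the computational basis and assign the label by
    comparing P(|00>) and P(|11>)."""
    probs = np.abs(state) ** 2        # amplitudes in basis order 00, 01, 10, 11
    return 0 if probs[0] > probs[3] else 1

# Stand-in embedded states: one clustered near |00>, one near |11>.
near_00 = np.array([0.95, 0.2, 0.2, 0.1]); near_00 /= np.linalg.norm(near_00)
near_11 = np.array([0.1, 0.2, 0.2, 0.95]); near_11 /= np.linalg.norm(near_11)
assert classify_explicit(near_00) == 0
assert classify_explicit(near_11) == 1
```

Because only computational-basis probabilities are needed, this read-out requires no extra overlap circuit, which is why the explicit approach is comparatively cheap on hardware.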

Noisy simulations
As in the previous implicit case, we also use the trained parameters from the noiseless simulation for the quantum embedding but perform the noisy simulation to classify the testing dataset. The accuracy of these noisy simulations is tabulated in Table II. Noise details can be found in Table I.

Run on quantum computers
We perform the classification experiments of this explicit approach on the real quantum backends ibmq_16_melbourne, ibmq_5_yorktown, ibmq_bogota, and ibmq_rome and compare their accuracy to the noisy simulation with the noise model from the same backend. Table II summarizes the results.
The accuracy from the noisy simulators and the real machines agrees well, achieving values above 90%. This indicates that the explicit approach is less affected by noise than the implicit approach using the SWAP test, which is reasonable as the explicit approach requires fewer resources, such as fewer two-qubit gates. We simply let the input data run through the circuit and perform computational-basis measurements, as the classification only depends on the probabilities of obtaining |00⟩ and |11⟩.

C. Training over small samples
After the training process, data from different classes become separated. Compared with the implicit approach, the "well-separation" is less apparent; in other words, the clusters are less tight. This can be reasonably explained by the way the two methods work: in the implicit approach, the optimization procedure focuses on directly separating data points from different classes, while in the explicit approach the data are separated indirectly, i.e., through the pre-defined subspaces. Hence, in theory, the implicit approach has certain advantages over the explicit method in terms of attaining complete separation. In practice, both approaches have strong classification ability, as demonstrated in our experiments.

We observe that both approaches produce surprisingly good testing accuracy despite training with small data points. This motivates us to determine whether one approach is more robust than the other in terms of learning capacity, and to investigate how the testing accuracy varies with the training size. We choose the make_moons dataset with L = 2 labels to perform the numerical experiment (Fig. 13).
The procedure is as follows:
• For each class, we generate a fixed set of 50 points (hence, 100 points in total), serving as testing instances.
• For each class, we then randomly choose 5, 10, 20, and 25 points, serving as training instances. For each number of training instances, we average over multiple training runs to obtain the testing accuracy.
• We train the quantum circuit using the implicit and explicit approaches separately and compare the testing accuracy after 100 epochs of training.

Evidently, the implicit approach performs well, even with a small training size, which implies a certain advantage of this approach. As the training size increases, both approaches tend to achieve equal performance.
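The evaluation protocol above can be sketched in a few lines. The sketch substitutes a simple classical nearest-centroid classifier for the trained quantum circuit (purely as a placeholder) and uses a minimal hand-rolled moons generator instead of scikit-learn's make_moons, so that only the averaging-over-random-training-draws structure is illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_moons(n):
    # Minimal stand-in for sklearn's make_moons: two interleaving half-circles.
    t = rng.uniform(0, np.pi, n)
    upper = np.c_[np.cos(t), np.sin(t)]
    lower = np.c_[1 - np.cos(t), 0.5 - np.sin(t)]
    X = np.vstack([upper, lower]) + rng.normal(scale=0.05, size=(2 * n, 2))
    y = np.r_[np.zeros(n, int), np.ones(n, int)]
    return X, y

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    # Placeholder for the trained quantum classifier.
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return (pred == yte).mean()

Xte, yte = make_moons(50)                  # fixed test set of 100 points
for m in (5, 10, 20, 25):                  # training points per class
    accs = [nearest_centroid_acc(*make_moons(m), Xte, yte) for _ in range(10)]
    mean_acc = float(np.mean(accs))        # average over repeated random draws
    assert 0.0 <= mean_acc <= 1.0
```

The same loop structure applies when the placeholder is replaced by either the implicit or the explicit quantum training routine.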

IV. PROSPECT IN THE NISQ ERA
To maximally enhance the performance of any quantum algorithm or, more generally, any quantum procedure, we also need to take into account the hardware structure, e.g., the connectivity of the qubits in the system and the specific qubits chosen. Figure 14 shows the topology of the machine ibmq_bogota used in our work.
FIG. 14: Topology of ibmq_bogota. The picture was acquired at the time of our experiment. Topologies of other machines used in our experiment can be found in Fig. 15.

Table II shows the result of testing the Iris dataset on ibmq_bogota with the inversion test done using the qubits labeled "3" and "4." We also carried out the same test using qubits "1" and "2." The resulting accuracy of 65.83% is dramatically lower than that obtained using qubits 3 and 4 (91.67%). This deviation can be reasonably attributed to the noise rates of the qubit pair involved and their CNOT gates. As depicted in Fig. 14, qubits 3 and 4, as well as the connection between them, have much lower error rates than qubits 1 and 2. Hence, in practice, any quantum procedure needs to be hardware-aware to reach its maximum efficiency. Of course, our experiment requires very few qubits and simple gates, so we can simply choose specific qubits to obtain better accuracy. More complicated quantum circuits generally require more careful qubit specification. The quantum hardware topology also varies, e.g., ibmq_16_melbourne and ibmq_5_yorktown versus ibmq_bogota. Such differences can affect the decomposition of a multi-qubit gate into the available one- and, in particular, two-qubit gates. A hardware-aware compiler that optimizes this selection is also a necessity for future large-scale tasks. This hardware-specific optimization is important in practice and requires additional development. Given an arbitrary quantum backend's topology and a description of some quantum procedure, such as the circuit's depth, width, and number of one- or two-qubit gates, it may be worth determining whether a systematic procedure exists that can decide which qubits, and in what order, should be used to maximize performance.
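A minimal version of such hardware-aware qubit selection can be sketched as follows. The calibration numbers below are illustrative placeholders for a 5-qubit linear topology like ibmq_bogota, not real device data; the scoring rule (CNOT error plus the two readout errors) is one simple heuristic among many.

```python
# Hypothetical calibration data for a 5-qubit linear chain (illustrative only).
cnot_error = {(0, 1): 0.012, (1, 2): 0.035, (2, 3): 0.010, (3, 4): 0.007}
readout_error = [0.02, 0.05, 0.03, 0.015, 0.012]

def best_pair(cnot_error, readout_error):
    # Score each connected pair by its CNOT error plus both readout errors,
    # then pick the minimum: a crude hardware-aware qubit selection.
    def score(pair):
        a, b = pair
        return cnot_error[pair] + readout_error[a] + readout_error[b]
    return min(cnot_error, key=score)

assert best_pair(cnot_error, readout_error) == (3, 4)
```

With these (made-up) numbers the heuristic picks the pair (3, 4), mirroring our observation that qubits 3 and 4 on ibmq_bogota outperform qubits 1 and 2. A production compiler would of course score full sub-graphs against the whole circuit, not just single pairs.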
Other works have also demonstrated the ability and feasibility of using actual quantum computers to classify real-world data [28,53]. In addition to providing a unified framework for QSL, we have performed simulations and cloud-based real-device experiments. These experiments on real quantum backends extend the prospect of applying quantum computers to ML one step further, demonstrating explicitly that current noisy quantum computers can achieve high accuracy in classifying data. The low-accuracy results obtained via the SWAP test routine may be improved by using the inversion test routine, and we have emphasized that the inversion test is more appropriate in the NISQ era. Real quantum systems are undoubtedly more complicated, and their noise on long circuits (especially those with many CNOTs) may result in worse accuracy than in noisy simulations. In our experiments, the error rate of 1-qubit gates is ∼10^{-3} or smaller, and that of the 2-qubit CNOT is ∼10^{-2} for current hardware [54]. Full implementation of quantum error correction remains elusive. Along with the development of precise, high-fidelity gates, efforts have been made in error mitigation methods [55][56][57][58] to obtain useful outcomes. Some of these mitigation methods require repetition of the same circuit but with different overall error rates, possibly by stretching the gate pulses; this allows observables to be extrapolated to the noiseless limit [55,56]. Measurement error mitigation is also necessary to infer correct readout outcomes [57,58]. The experiments in our work do not employ any mitigation technique. As such, our results can be further improved with these techniques, especially the results from the SWAP test using gate-error mitigation.  Additionally, our classification model has been shown to do well with a small training pool (compared to the testing set), achieving very high accuracy.
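The extrapolation-based mitigation idea referenced above (zero-noise extrapolation [55,56]) amounts to measuring the same observable at several artificially amplified noise levels and extrapolating back to zero noise. The sketch below shows the classical post-processing step only, on a toy linear-decay noise model; the function name and decay rate are our illustrative choices.

```python
import numpy as np

def zero_noise_extrapolate(scales, values, order=1):
    # Fit <O>(lambda) with a low-order polynomial in the noise-scale
    # factor lambda, then read off the intercept at lambda = 0.
    coeffs = np.polyfit(scales, values, order)
    return np.polyval(coeffs, 0.0)

# Toy model: an ideal expectation value of 1.0 decays linearly with the
# pulse-stretch factor (the artificial noise amplification).
scales = [1.0, 2.0, 3.0]
values = [1.0 - 0.08 * s for s in scales]   # "measured" expectations
assert np.isclose(zero_noise_extrapolate(scales, values), 1.0)
```

On real hardware the decay is generally not exactly polynomial, so the extrapolation order and stretch factors must be chosen with care; the SWAP test results in our experiments are the natural first candidates for such treatment.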
Hence, one can reasonably expect that the model can be trained on real machines to achieve comparable performance.

V. CONCLUSION
In our framework for quantum supervised learning, the main conceptual tool is the idea that the input data x are "forwarded" to the classifying vector f, from which the classification follows. A hybrid optimization step then trains the circuit. Once trained, the embedding circuit maps the data from the input space X to the proper subspaces in H.
Our work emphasizes that the quantum feature map, equipped with a learning procedure, is an especially powerful tool for supervised learning. With the implicit approach, the number of separated classes (labels) in a supervised learning problem can, ideally, be arbitrarily high.
Thus, it provides a means to construct a universal quantum classifier. Compared to the explicit approach, the implicit approach has also demonstrated its learning capacity with a small training pool, which is encouraging. Moreover, we show that the explicit approach can intrinsically unify other traditional QSL models (detailed in Appendix B). The fact that our framework can be divided into explicit and implicit approaches demonstrates its flexibility, affording the option to choose how data are embedded and analyzed in the Hilbert space H. These two approaches constitute a unified framework for supervised learning methods using a quantum computer.
Along with classification, we note that the trained quantum circuit can possibly be employed as a subroutine of other quantum ML algorithms, as we know with high confidence that the embedded data from different classes are well separated from each other (trained with the implicit approach) or approximately well contained in some subspaces of H (trained with the explicit approach). Figure 16 describes two alternative methods to evaluate the overlaps between two data points: the first uses the controlled-SWAP gate, acting on five qubits, and the second is the inversion test.

FIG. 16: Circuit representation for SWAP test (top) and inversion test (bottom).
There is an abuse of notation Φ: in both the SWAP and inversion test circuits, the actual embedding circuit Φ_{i,j} already includes the repetition of the unit embedding in Fig. 6.

Appendix B: More on the Explicit Approach
In Section II C 1, we illustrated the intrinsic relation between the explicit approach and "traditional" quantum supervised learning (QSL) models via the measurement outcomes. Here, we offer an alternative explanation. We consider a binary classification problem (two classes, A and B) and a single-qubit quantum circuit (Fig. 17).
In traditional approaches, the first block U(x) embeds the classical information x into a quantum state in the Hilbert space H. Then, the variational layer W(θ) is trained to distinguish those embedded states. For instance, x can be assigned to class A if the probability of measuring 0 is greater than the probability of measuring 1 (P_0 > P_1). The training step should then focus on maximizing the probability of measuring 0 for data in class A and of measuring 1 otherwise.
Recall that our embedding-based framework exploits the ability of a quantum circuit to represent data in a complex space. A simpler, alternative perspective emerges if we interpret the whole circuit as an encoding of the classical data x.
Without loss of generality, we assume the classical data x are mapped to a quantum state |ψ⟩ (refer to Fig. 18). When we measure this state, the probability of measuring 0 is cos²(θ), which means that if this probability is high, the state |ψ⟩ is close to the "north pole" |0⟩. If we follow the explicit approach and decompose H = H_0 ⊕ H_1, where H_0 and H_1 are spanned by |0⟩ and |1⟩, respectively, we end up maximizing the probability of measuring 0 for data from class A and of measuring 1 for data from class B. Pictorially, the data from class A form a cluster around |0⟩, and the data from class B cluster around |1⟩. After optimization, these clusters are "well-separated." To classify unseen data, we can use the optimized circuit to map them to some states and measure. The measurement probability may be understood as a "closeness" to either of the two data cluster "centers" from classes A and B. This example also illustrates that our embedding-based framework, or, more specifically, the explicit approach, conceptually unifies other traditional QSL models.
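This single-qubit picture can be made concrete in a short sketch. Following the parametrization used above, |ψ⟩ = cos(θ)|0⟩ + e^{iφ} sin(θ)|1⟩, so P(0) = cos²(θ); the classifier then assigns class A whenever the state is closer to the north pole. The function names are ours.

```python
import numpy as np

def embed(theta, phi=0.0):
    # |psi> = cos(theta)|0> + e^{i phi} sin(theta)|1>, so P(0) = cos^2(theta)
    return np.array([np.cos(theta), np.exp(1j * phi) * np.sin(theta)])

def classify(theta, phi=0.0):
    p0 = abs(embed(theta, phi)[0]) ** 2
    return 'A' if p0 > 0.5 else 'B'       # closer to |0>  ->  class A

assert classify(0.1) == 'A'               # state near the north pole |0>
assert classify(np.pi / 2 - 0.1) == 'B'   # state near |1>
```

Training the embedding amounts to pushing the θ values of class-A data toward 0 and those of class-B data toward π/2, which is exactly the clustering around |0⟩ and |1⟩ described above.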
Appendix C: One-versus-all strategy

The one-versus-all strategy (Fig. 19) has often been used to transform a binary classifier into a multi-class classifier, especially for models that, in essence, can only deal with binary classification. Here, we review this strategy and discuss its drawbacks.

FIG. 19:
One-versus-all Strategy. There are three classes (blue, orange, and red) and the corresponding decision boundary (blue, orange, and red lines). All classes are linearly separable for simplicity.
The underlying mechanism of the one-versus-all strategy is to repeatedly assume there are only two classes (or labels) in the supervised learning problem and to learn the corresponding decision boundary. For example, in Fig. 19, all blue circles and red triangles can be treated as one class when learning the decision boundary that distinguishes them from the orange rectangles; likewise for the boundary between the blue circles and the rest, and between the red triangles and the rest.
The caveat of the one-versus-all strategy is clear from Fig. 19: the black stars (unseen data) struggle to find a class (label). A similar issue appears in QSL when the data are embedded by a fixed circuit in the Hilbert space H and a subsequent variational circuit is trained to draw the decision boundary. Notably, our framework naturally surpasses this issue, as the representation of the data in H is learned; a measure is then employed to compare data directly, as in the implicit approach, or indirectly, as in the explicit approach. Hence, a label for unseen data is always guaranteed.
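The ambiguity can be demonstrated with three hypothetical one-versus-all linear classifiers whose positive regions do not cover the plane; the weight vectors below are invented for illustration and have no relation to Fig. 19's actual boundaries.

```python
import numpy as np

# Three hypothetical one-versus-all linear classifiers (w, b): a point x
# is claimed by class k whenever w_k . x + b_k > 0.
classifiers = {
    'blue':   (np.array([ 1.0,  0.0]), -2.0),   # claims the region x > 2
    'orange': (np.array([-1.0,  0.0]), -2.0),   # claims the region x < -2
    'red':    (np.array([ 0.0,  1.0]), -2.0),   # claims the region y > 2
}

def claimed_labels(x):
    return [k for k, (w, b) in classifiers.items() if w @ x + b > 0]

assert claimed_labels(np.array([3.0, 0.0])) == ['blue']
# The "black star" region near the origin is claimed by no classifier:
assert claimed_labels(np.array([0.0, 0.0])) == []
```

A point in the uncovered central region receives no label at all (and analogous constructions produce points claimed by several classes at once), which is precisely the failure mode our embedding-based framework avoids.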