Quantum Mean Embedding of Probability Distributions

The kernel mean embedding of probability distributions is commonly used in machine learning as an injective mapping from distributions to functions in an infinite dimensional Hilbert space. It allows us, for example, to define a distance measure between probability distributions, called maximum mean discrepancy (MMD). In this work, we propose to represent probability distributions in a pure quantum state of a system that is described by an infinite dimensional Hilbert space. This enables us to work with an explicit representation of the mean embedding, whereas classically one can only work implicitly with an infinite dimensional Hilbert space through the use of the kernel trick. We show how this explicit representation can speed up methods that rely on inner products of mean embeddings and discuss the theoretical and experimental challenges that need to be solved in order to achieve these speedups.


INTRODUCTION
In machine learning, kernel methods are used to implicitly evaluate inner products in high dimensional feature spaces. Popular algorithms such as the support vector machine [1,2] or principal component analysis [3], which are linear methods, can be expressed solely in terms of inner products between data points. These methods become more expressive if the data is first mapped onto a high dimensional feature space. Instead of evaluating the inner product explicitly in the feature space, whose cost scales linearly with the feature space dimension, a more efficient evaluation can be done implicitly in the original space using a positive definite kernel function. This is known as the kernel trick [4]. Since it does not require an explicit feature map, the kernel trick even allows us to work with infinite dimensional feature spaces, e.g., using a Gaussian kernel. The downside of most kernel-based methods is that they scale polynomially with the size of the data set. This problem has been tackled in the realm of quantum computation, and exponential speedups have been conjectured [5,6]. Such speedups are, however, still highly controversial [7,8]. Only recently has the cost of a single kernel evaluation become the target of quantum computing research [9-11]. Speedups might be possible, since the cost of explicitly evaluating inner products of quantum states grows only logarithmically with the system size [12], as opposed to linearly on a classical computer. Schuld and Killoran further propose using continuous variable quantum systems to work with classically intractable, i.e., hard to compute, kernels in infinite dimensions [10], but it is unclear whether problems exist for which such kernels can lead to an improvement. Furthermore, the recent suggestions do not address the polynomial scaling of kernel methods with the sample size, leaving the application of quantum computing in large-scale kernel methods a challenging problem. The idea of explicitly representing an
infinite dimensional feature vector as a quantum state opens a way to tackle this problem. While it is impossible classically to sum two infinite dimensional vectors, a quantum mechanical superposition of two states can be constructed explicitly, even for infinite dimensional systems, see, e.g., [13]. Moreover, the evaluation of inner products in an infinite dimensional quantum Hilbert space is independent of the number of states in a superposition. We identify methods involving the kernel mean embedding [14-16] as a branch of machine learning techniques that suffer from the fact that, on a classical computer, the cost of evaluating inner products of sums of feature maps is not independent of the number of data points involved. The contribution of this letter is to adopt the notion of kernel mean embedding in quantum mechanics, point out how quantum mechanics can lead to speedups, and make transparent the challenges that must be overcome to realize this in an experiment. The letter is organized as follows. We start by introducing the kernel mean embedding from a classical perspective, point out the main problem it faces in big data applications, and present its relevance in current machine learning research through some real-world applications. We then define the quantum mean embedding as a modified version of the kernel mean embedding, which makes it suitable for investigation in the context of quantum computation, and show that this modification still allows for its use in conventional applications. We present how the quantum mean embedding can be used, in principle, to overcome the problems faced when working with the kernel mean embedding on a classical computer. Since this cannot be done on today's hardware, we discuss the necessary quantum routines in the CHALLENGES section. Finally, we sum up with a discussion of our results.

KERNEL MEAN EMBEDDING
Let X be a locally compact Hausdorff space. A function k : X × X → C is called a positive definite kernel function, or kernel function for brevity, if for all n ∈ N, x_1, ..., x_n ∈ X, and c_1, ..., c_n ∈ C, it holds that Σ_{i,j=1}^n c_i* c_j k(x_i, x_j) ≥ 0 [4]. For every kernel function there exists a unique reproducing kernel Hilbert space (RKHS) H_k such that k(·, x) ∈ H_k for all x ∈ X and the reproducing property f(x) = ⟨f, k(·, x)⟩_{H_k} holds for all f ∈ H_k and x ∈ X. We call the mapping ϕ : X → H_k given by ϕ(x) := k(·, x) the canonical feature map of k, i.e., k(x, y) = ⟨ϕ(y), ϕ(x)⟩ [17]. Let P be a probability measure over X. The kernel mean embedding (KME) of P is defined as [14,15]

    µ_P := ∫ k(·, x) dP(x) = E_{x∼P}[ϕ(x)].    (1)

The embedding µ_P exists and is a function in H_k whenever E_{x∼P}[√k(x, x)] < ∞ [16]. Given an i.i.d. sample X = {x_1, ..., x_n} from P, the empirical embedding

    µ_X := (1/n) Σ_{i=1}^n k(·, x_i)    (2)

converges to the true embedding of P in the Hilbert space metric at a rate of n^(−1/2) [16]. The kernel function k is said to be characteristic if the map µ : P ↦ µ_P is injective [18,19]. In other words, working with a characteristic kernel enables us to represent (all properties of) a probability distribution by a function in the RKHS, which is why the notion of characteristic kernels plays an important role in kernel methods [20]. The notion of characteristic kernels is closely related to the notion of universal kernels [21]. Here we call a kernel universal if the corresponding RKHS is dense in the space of continuous functions over X that vanish at infinity, which corresponds to c_0-universality [20]. Simon-Gabriel and Schölkopf [20] show that for universal kernels the embedding (1) is injective even when extended to finite signed measures. Popular universal kernels include the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)) and the Laplacian kernel k(x, y) = exp(−‖x − y‖₁/σ), where σ is a bandwidth parameter [19,22]. The expressiveness of characteristic kernels comes at a price. Since there exist distributions with infinitely many independent moments, the corresponding RKHS must be infinite dimensional to ensure no information
loss. Consequently, it is impossible for a classical computer to represent and manipulate µ_P directly. However, if we only care about inner products of mean embeddings, which is usually the case in most algorithms, we can resort to the "kernel trick" and replace inner products with kernel evaluations [4]. That is, given i.i.d. samples X = {x_1, ..., x_n} from P and Y = {y_1, ..., y_n} from Q [23], we can evaluate

    K(X, Y) := ⟨µ_X, µ_Y⟩_{H_k} = (1/n²) Σ_{i,j=1}^n k(x_i, y_j).    (3)

The inevitable drawback of this trick is that algorithms based on K(X, Y) have a runtime complexity that scales at least quadratically with the number of data points n. This is the limiting factor in the applications presented in the next section.
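For concreteness, the kernel-trick evaluation of K(X, Y) can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original text: the Gaussian bandwidth, sample sizes, and toy data are arbitrary assumptions, and the explicit double loop makes the O(n²) cost visible.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a universal kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kme_inner(X, Y, sigma=1.0):
    """<mu_X, mu_Y> = (1/(n m)) sum_{i,j} k(x_i, y_j): one kernel call per pair."""
    return float(np.mean([gaussian_kernel(x, y, sigma) for x in X for y in Y]))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))  # i.i.d. sample from P
Y = rng.normal(0.0, 1.0, size=(100, 2))  # i.i.d. sample from Q
K_XY = kme_inner(X, Y)
```

Since the Gaussian kernel takes values in (0, 1], the estimate K_XY lies in that interval as well, and Cauchy-Schwarz bounds it by the self-similarities K(X, X) and K(Y, Y).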

Applications and Limitations
We highlight essential applications of the kernel mean embedding and the limitations of its use on classical computers.
Learning on Probability Distributions. Classical machine learning algorithms were originally developed for training data consisting of points in some vector space. In several domains such as astronomy and high-energy physics, however, data are naturally represented as probability distributions, e.g., clusters of galaxies and groups of collision events. The kernel mean embedding (1) allows us to generalize these algorithms to the space of probability distributions [24-27]. For example, [24] proposed an algorithm called the support measure machine (SMM), which generalizes the SVM [1] to the space of probability distributions by means of the kernel function

    K(P, Q) := ⟨µ_P, µ_Q⟩_{H_k} = ∫∫ k(x, y) dP(x) dQ(y),    (4)

which is well defined over a space of probability distributions. For certain classes of distributions and kernel functions, the kernel (4) can be evaluated analytically [24, Table 1]. This form of kernel function has been used extensively in many machine learning applications, see, e.g., [16] for a review. Given i.i.d. samples X = {x_1, ..., x_n} from P and Y = {y_1, ..., y_n} from Q, K(P, Q) can be approximated by

    K(X, Y) = (1/n²) Σ_{i,j=1}^n k(x_i, y_j).    (5)

The main drawback of (5) is that, given samples X_1, ..., X_N from N input distributions, each of size n, the runtime complexity of evaluating the kernels K(X_i, X_j) for all i, j = 1, ..., N is O(N²n²). This is prohibitive for many real-world applications of learning problems on probability distributions.

Maximum Mean Discrepancy (MMD). The MMD is a discrepancy measure between any two distributions P and Q [28,29]. It is given by the distance of the corresponding mean embeddings of the distributions [29, Lemma 4] and can be expressed solely in terms of inner products of mean embeddings (assuming a real kernel):

    MMD²[H_k, P, Q] := ‖µ_P − µ_Q‖²_{H_k} = ⟨µ_P, µ_P⟩ − 2⟨µ_P, µ_Q⟩ + ⟨µ_Q, µ_Q⟩.    (6)

If the kernel is characteristic, the following implication holds: MMD[H_k, P, Q] = 0 ⇔ P = Q [29, Theorem 5]. Given i.i.d. samples X = {x_1, ..., x_n} drawn from P and Y = {y_1, ..., y_n} drawn from Q, it is possible to design a biased, but consistent, estimator of the MMD by simply evaluating (6) with the embeddings µ_X and µ_Y [29, Eq. (5)]. Here one uses the kernel trick to evaluate the inner products to get

    MMD²[H_k, X, Y] = K(X, X) − 2 K(X, Y) + K(Y, Y),    (7)

whose cost is determined by that of evaluating K(X, X), K(X, Y), and K(Y, Y).

Deep Learning. The applications of kernel mean embeddings in deep learning have gained a lot of attention in the past few years. Notably, the MMD has been used as an objective function for training deep generative models [30-32]. For a deep generative model G_θ parametrized by a parameter vector θ, the idea is to learn θ by minimizing MMD[H_k, P, Q_θ]², where P is the data distribution and Q_θ is the distribution induced by the generative model G_θ. Again, the downside of the MMD in this area is its computational cost, as one usually has to deal with huge amounts of data [33].
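The biased MMD estimate, K(X, X) − 2 K(X, Y) + K(Y, Y), can be sketched as follows. This is a hedged NumPy illustration with an assumed Gaussian kernel and arbitrary toy data; the three Gram-matrix means are exactly the three kernel-trick terms whose O(n²) cost is discussed above.

```python
import numpy as np

def gram(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel matrix with entries k(x_i, y_j)."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased, consistent MMD^2 estimate: K(X,X) - 2 K(X,Y) + K(Y,Y)."""
    return (gram(X, X, sigma).mean()
            - 2.0 * gram(X, Y, sigma).mean()
            + gram(Y, Y, sigma).mean())

rng = np.random.default_rng(1)
P_sample = rng.normal(0.0, 1.0, size=(500, 1))
Q_same = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as P
Q_diff = rng.normal(3.0, 1.0, size=(500, 1))   # shifted distribution
mmd_same = mmd2_biased(P_sample, Q_same)
mmd_diff = mmd2_biased(P_sample, Q_diff)
```

The biased estimate is a squared RKHS norm and hence nonnegative; it is small for samples from the same distribution and large for clearly different ones.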
Limitations. All of the above applications require the estimation of terms like K(X, Y), which scale quadratically with the sample size n and hence become prohibitive for large n. To enable large-scale learning with kernel mean embeddings, a common approach is to approximate µ_X by a finite dimensional representation, e.g., using random Fourier features [34] or the Nyström method [35], after which it can be manipulated directly on a classical computer without resorting to the kernel trick. For a d dimensional approximation, the cost drops to O(n + d), which is linear in n. The downside is that the embedding defined in terms of this representation can no longer be injective, which is an essential requirement in most applications of the KME. Recent work [10,11] showed how one can in principle evaluate a d dimensional approximation of the kernel function using only O(log d) qubits. However, the quadratic scaling when using an infinite dimensional feature map has not been addressed so far in the quantum community.
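As a sketch of this finite dimensional approximation route, the random Fourier feature construction [34] for the Gaussian kernel replaces µ_X by an explicit d dimensional mean vector whose cost is linear in n. The feature count, bandwidth, and data below are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, d, sigma = 2, 2000, 1.0

# Random Fourier features for the Gaussian kernel: frequencies W drawn from
# the kernel's spectral measure, features z(x) = sqrt(2/d) * cos(W x + b).
W = rng.normal(0.0, 1.0 / sigma, size=(d, dim))
b = rng.uniform(0.0, 2.0 * np.pi, size=d)

def rff(X):
    return np.sqrt(2.0 / d) * np.cos(X @ W.T + b)

X = rng.normal(0.0, 1.0, size=(300, dim))
Y = rng.normal(0.0, 1.0, size=(300, dim))

# Explicit d-dimensional mean embeddings: cost O(n d), linear in n.
mu_X = rff(X).mean(axis=0)
mu_Y = rff(Y).mean(axis=0)
approx_K = float(mu_X @ mu_Y)  # approximates (1/n^2) sum_{i,j} k(x_i, y_j)
```

The approximation error decays as O(1/√d), which is exactly the trade-off discussed above: the explicit representation is cheap but no longer injective.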
In the next section we introduce the quantum mean embedding and show how this in principle allows us to explicitly work with the mean embedding even for an infinite dimensional feature map.

QUANTUM MEAN EMBEDDING
Let H be the Hilbert space of a quantum system and ϕ : X → H, x ↦ |ϕ(x)⟩, a quantum feature map that assigns a quantum state |ϕ(x)⟩, i.e., a normalized function in H, to each point x ∈ X of the input domain [36]. This defines a kernel k(x, x') = ⟨ϕ(x)|ϕ(x')⟩ [10,11] with the constraint k(x, x) = 1 for all x ∈ X, due to the normalization of quantum states [37]. Let P be a probability distribution over the input domain. We define the quantum mean embedding (QME) as

    |ν_P⟩ := (1/N_P) E_{x∼P}[|ϕ(x)⟩],    (8)

where the normalization N_P ensures the physicality of the state and is given by the norm of the corresponding kernel mean embedding (1):

    N_P = ‖µ_P‖_{H_k} = √(E_{x,x'∼P}[k(x, x')]).    (9)

The QME exists for all probability distributions due to the constraint k(x, x) = 1. A subtle difference between the KME and the QME are the spaces in which the embeddings live. While the KME is a function in the RKHS H_k and is uniquely defined by the kernel k, the QME depends on the quantum system's Hilbert space H and the choice of the feature map ϕ.
Even though the embeddings live in different spaces, for any two probability distributions P and Q we have

    ⟨ν_P|ν_Q⟩ = ⟨µ_P, µ_Q⟩_{H_k} / (N_P N_Q).    (10)

That is, their inner products have a fixed relation independent of H. Hence, the important difference is that the QME maps every probability distribution onto the unit sphere in a Hilbert space, whereas the KME does not enforce this, see FIG. 1. In the following theorem we show that if the kernel is universal, we do not lose information about a probability measure when using the QME.
Theorem 1 (Injectivity of the QME). Let ϕ : X → H, x ↦ |ϕ(x)⟩, be a mapping such that k(x, y) = ⟨ϕ(x)|ϕ(y)⟩ is a universal kernel for the space C_0(X) of continuous functions over X that converge to zero at infinity. Let P be the space of Borel probability measures over the measurable space (X, A), where A denotes the Borel sigma algebra. For a universal kernel k, the QME (8) is injective over P, i.e., |ν_P⟩ = |ν_Q⟩ ⇔ P = Q for any P, Q ∈ P. The proof is included in the supplementary information.
For a finite sample X we define an empirical QME as

    |ν_X⟩ := (1/(n N_X)) Σ_{i=1}^n |ϕ(x_i)⟩,    (11)

where the normalization N_X is given by the norm of µ_X in (2):

    N_X = ‖µ_X‖_{H_k} = (1/n) √(Σ_{i,j=1}^n k(x_i, x_j)).    (12)

As discussed before, for infinite dimensional feature maps the KME cannot be described explicitly and can only be used via inner products. The advantage of the QME is that it is possible, in principle, to explicitly create |ν_X⟩ in the lab, even in infinite dimensional cases. Here it is important to note that an experimenter only needs to create a state that is proportional to Σ_{i=1}^n |ϕ(x_i)⟩. The prefactor (12) is enforced by the laws of physics and is not required for the state preparation. This explicit representation allows us to decouple the cost of the inner product evaluation from the sample size n, see FIG. 2.
Conjecture 1. Suppose we are given a routine that prepares states of the form (11) with cost O(n) for a feature map ϕ. In addition, we are given a routine that can evaluate inner products of arbitrary states in H in constant time. Then for two samples X = {x_1, ..., x_n} and Y = {y_1, ..., y_n} one can evaluate K(X, Y), defined in (3), with cost O(n), whereas a classical computer scales with O(n²).
Proof. By assumption we can prepare |ν_X⟩ and |ν_Y⟩ with cost linear in n. Furthermore, we can evaluate ⟨ν_X|ν_Y⟩ in constant time, given the individual states. Together, the cost of evaluating the term ⟨ν_X|ν_Y⟩ scales at most with O(n). The normalizations N_X and N_Y can also be estimated with cost O(n), see the CHALLENGES section.
FIG. 2. The quantum approach separates the creation of the QME from the inner product estimation. It requires two subroutines. First, on the left, an experimental setup E_ϕ that creates the QME efficiently. Second, on the right, a circuit to estimate inner products of arbitrary states in H whose runtime is independent of the states. Here we chose the swap test, which uses an ancillary qubit. This approach detaches the estimation of the inner product from the sample size.
Using the relation (10) between the KME and the QME, we can calculate

    K(X, Y) = N_X N_Y ⟨ν_X|ν_Y⟩.    (13)

Given an efficient evaluation of K(X, Y), it is possible to speed up the methods presented earlier, which rely on inner products of KMEs. We discuss the assumptions of Conjecture 1 in the CHALLENGES section. Apart from using the QME to speed up the evaluation of inner products of KMEs, it follows from the proof of Theorem 1 that the QME is also important in its own right, as it can uniquely represent probability distributions. However, it is unclear to what extent the applications of the KME could be rephrased solely in terms of inner products of the QME, instead of taking the detour over K(X, Y), for which we additionally need to determine the normalizations. This has not been investigated in the machine learning literature so far.
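The identity K(X, Y) = N_X N_Y ⟨ν_X|ν_Y⟩, which follows directly from the definitions, can be checked classically in a finite dimensional toy model. The two dimensional feature map ϕ(x) = (cos x, sin x), with k(x, y) = cos(x − y) and k(x, x) = 1, is an assumption made purely for illustration, standing in for a genuine quantum feature map.

```python
import numpy as np

def phi(x):
    """Toy normalized feature map: k(x, y) = <phi(x), phi(y)> = cos(x - y), k(x, x) = 1."""
    return np.array([np.cos(x), np.sin(x)])

def empirical_qme(X):
    """Return the normalized state nu_X and the normalization N_X = ||mu_X||."""
    mu = sum(phi(x) for x in X) / len(X)   # empirical KME mu_X
    N = np.linalg.norm(mu)                  # N_X, enforced "by physics" in the lab
    return mu / N, N

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=50)
Y = rng.uniform(0.5, 1.5, size=50)
nu_X, N_X = empirical_qme(X)
nu_Y, N_Y = empirical_qme(Y)

K_via_qme = N_X * N_Y * float(nu_X @ nu_Y)                        # quantum route
K_kernel_trick = float(np.mean(np.cos(X[:, None] - Y[None, :])))  # classical route
```

Both routes agree to numerical precision; the point of the quantum proposal is that the left-hand computation replaces the n² kernel evaluations by one state preparation per sample and a single overlap estimate.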

CHALLENGES

Efficient preparation of |ν_X⟩
In order to harvest a potential quantum speedup it is necessary to create the QME efficiently, i.e., with resources and time linear in the sample size. We phrase this as the first challenge: given a quantum feature map ϕ, find an experimental strategy, denoted E_ϕ, that for an arbitrary input sample X = {x_1, ..., x_n}, with n ∈ N, creates |ν_X⟩ using resources that scale at most linearly in n. In the case of coherent states as the feature map (see supplementary information), superpositions similar to |ν_X⟩ have already been experimentally realized for specific cases and are known as "cat states" [13,38,39]. However, it is an open question how these approaches scale, even theoretically, when superposing a large number of states; see [40] for an overview of similar experimental approaches. In general, the rigorous study of the resources required to construct superpositions of quantum states, and the connections to entanglement, is the subject of current research [41,42]. Particularly in the case of superpositions of nonorthogonal states, as for our proposed embedding, the theory becomes more involved, see III.K.4 of [42]. Note that we explicitly allow for an experimental setup E_ϕ that is specific to the given quantum feature map ϕ, i.e., to a specific kernel function. This is necessary because a universal machine that builds a superposition of completely arbitrary and unknown quantum states cannot exist [43,44]. Furthermore, we emphasize that this work does not require a quantum random access memory (qRAM) [45]. Firstly, a qRAM would entangle each single state with a state of an additional data register, whereas this work requires a pure superposition without any entanglement. Secondly, we aim for a polynomial speedup, hence there is no necessity for a logarithmic scaling of the preparation.

Estimation of inner products
At the core of the advantage in using the QME is the estimation of the inner product of two arbitrary quantum states in H. Formally, this can be done using the swap test routine of [46], see the right side of FIG. 2. The swap test works independently of the input states, which for our purpose we denote by |ν_X⟩, |ν_Y⟩ ∈ H. These inputs are each placed in one register, together with a single ancilla qubit in the state |0⟩ in an additional register. The test itself consists of a Hadamard transformation H on the qubit, followed by a controlled swap of the two states conditioned on the state of the qubit, and another Hadamard transformation on the qubit. This circuit maps the initial state

    |0⟩|ν_X⟩|ν_Y⟩ ↦ (1/2) |0⟩(|ν_X⟩|ν_Y⟩ + |ν_Y⟩|ν_X⟩) + (1/2) |1⟩(|ν_X⟩|ν_Y⟩ − |ν_Y⟩|ν_X⟩),

see [46, Eq. (4)]. At the end, the qubit is measured in the computational basis. This results in outcome 1 with probability

    p_1 = (1 − |⟨ν_X|ν_Y⟩|²)/2.

If we cannot guarantee the positivity of ⟨ν_X|ν_Y⟩, we need a phase sensitive estimation of inner products, as discussed in the supplementary information of [10]. Crucially, the swap test works independently of the size of the samples X and Y. For finite dimensional systems, Cincio et al. [12] recently proposed an implementation that scales linearly with the number of qubits and hence logarithmically with the dimension of the Hilbert space. But this approach does not translate to systems of infinite dimension. The infinite dimensional case has been studied in [47-49]. However, these works do not give an explicit solution, and we are not aware of any experimental realization of a universal swap test for the infinite dimensional case. This marks the second challenge arising from this letter.
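The swap-test statistics can be simulated classically for small systems. The sketch below is an assumption-laden toy: it takes two single-qubit input states, computes only the ancilla-|1⟩ branch (|a⟩|b⟩ − |b⟩|a⟩)/2 produced by the Hadamard, controlled-SWAP, Hadamard sequence, and returns its squared norm, i.e., p_1 = (1 − |⟨a|b⟩|²)/2.

```python
import numpy as np

def swap_test_p1(a, b):
    """Pr[ancilla measured as 1] for the swap test on normalized states |a>, |b>.

    After Hadamard -> controlled SWAP -> Hadamard, the ancilla-|1> branch of
    the joint state is (|a>|b> - |b>|a>)/2, so p1 = (1 - |<a|b>|^2)/2.
    """
    branch1 = (np.kron(a, b) - np.kron(b, a)) / 2.0
    return float(np.vdot(branch1, branch1).real)

ket0 = np.array([1.0, 0.0])                     # |0>
ket1 = np.array([0.0, 1.0])                     # |1>
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2.0)  # |+>
p1 = swap_test_p1(ket0, ket_plus)               # (1 - 1/2)/2 = 0.25
```

Identical inputs give p_1 = 0 and orthogonal inputs give p_1 = 1/2, matching the two extremes of the formula; repeated runs estimate p_1 and hence |⟨a|b⟩|².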

Estimation of the normalization N_X
At the stage of preparing superpositions of the form (11) on a quantum device, it is not necessary to know the value of the normalization N_X. However, if the goal is to estimate K(X, Y) with the help of a quantum device, then knowledge of the normalizations is needed, see (13). The naive approach, using the definition (12), takes O(n²) operations and would forfeit the polynomial advantage.
To evade this, we can evaluate N_X by estimating the inner product with a reference state |ψ_ref⟩ = |ϕ(x_ref)⟩ for some reference value x_ref ∈ X. To this end, we analytically calculate

    c := ⟨ψ_ref, µ_X⟩ = (1/n) Σ_{i=1}^n k(x_ref, x_i)

using O(n) operations. Now, given the preparation of |ν_X⟩ and of |ψ_ref⟩, we can experimentally evaluate the inner product ⟨ψ_ref|ν_X⟩ and from this obtain the normalization N_X = c ⟨ψ_ref|ν_X⟩^(−1). Obviously, in order to make this well defined, we need to choose the reference state such that ⟨ψ_ref|ν_X⟩ ≠ 0. This strategy relies on the challenges phrased in the previous two paragraphs but apart from this does not pose an extra difficulty by itself. We emphasize again that, due to Theorem 1, it should be possible to come up with algorithms that work directly with the QME and hence make the estimation of the normalization superfluous.
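The reference-state strategy can be checked in the same kind of finite dimensional toy model. The feature map ϕ(x) = (cos x, sin x) and the reference value x_ref = 0 are arbitrary assumptions for illustration; the overlap that would be estimated experimentally is computed directly here.

```python
import numpy as np

def phi(x):
    """Toy normalized feature map with k(x, y) = cos(x - y) and k(x, x) = 1."""
    return np.array([np.cos(x), np.sin(x)])

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=40)
x_ref = 0.0                        # arbitrary reference value in the input domain
psi_ref = phi(x_ref)

mu_X = sum(phi(x) for x in X) / len(X)
N_X_true = np.linalg.norm(mu_X)    # ground truth, unknown to the experimenter
nu_X = mu_X / N_X_true             # the state the device would prepare

# c = <psi_ref, mu_X> = (1/n) sum_i k(x_ref, x_i): O(n) classical operations.
c = float(np.mean(np.cos(x_ref - X)))
# <psi_ref | nu_X> would be estimated via the swap-test-style routine; computed here.
overlap = float(psi_ref @ nu_X)
N_X_est = c / overlap
```

Since c = ⟨ψ_ref, µ_X⟩ = N_X ⟨ψ_ref|ν_X⟩, the quotient recovers N_X exactly whenever the overlap is nonzero, which the choice of x_ref inside the sample's range guarantees here.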

CONCLUSION
In this work, we adapted the concept of kernel mean embeddings to quantum mechanics by defining what we call the quantum mean embedding. While the kernel mean embedding maps a probability distribution to a function in a reproducing kernel Hilbert space, the quantum mean embedding can only map onto the unit sphere of a Hilbert space, a necessity that arises from the normalization of quantum states. Despite this additional constraint, we showed that the quantum mean embedding is still injective if the induced kernel is universal. Since the quantum mean embedding can, in principle, be created in the lab, it allows for a polynomial speedup when computing inner products between mean embeddings of empirical distributions. We highlighted the relevance of this task by describing use cases in recent machine learning applications. We made explicit which requirements need to be fulfilled by the quantum hardware in order to harvest the polynomial advantage. This work opens multiple paths for further research. On the quantum side, these include the experimental creation of superpositions of a large number of states and the estimation of inner products thereof. Furthermore, the quantum mean embedding is a new way of encoding probability distributions in quantum states, which allows us to use results known from kernel theory. For machine learning research, it remains an open question what the possible applications of embedding probability distributions onto the unit sphere in the reproducing kernel Hilbert space could be. J.K. would like to thank C.J. Simon-Gabriel for his advice on universal and characteristic kernels.
FIG. 1. Schematic comparison of the classical KME and the QME: The KME maps probability distributions P onto functions in the RKHS H_k. The QME additionally enforces that the mapping is onto the unit sphere (denoted by the circle) in the RKHS. Theorem 1 shows the injectivity of the QME for universal kernels. For visualization we choose H = H_k.
The swap test yields outcome 0 with probability p_0 = (1 + |⟨ν_X|ν_Y⟩|²)/2. Repeated application of this routine allows for an estimation of p_0 and p_1, from which one can infer |⟨ν_X|ν_Y⟩|² = 2p_0 − 1.