Unsupervised classification of quantum data

We introduce the problem of unsupervised classification of quantum data, namely, of systems whose quantum states are unknown. We derive the optimal single-shot protocol for the binary case, where the states in a disordered input array are of two types. Our protocol is universal and able to automatically sort the input under minimal assumptions, yet partially preserving information contained in the states. We quantify analytically its performance for arbitrary size and dimension of the data. We contrast it with the performance of its classical counterpart, which clusters data that has been sampled from two unknown probability distributions. We find that the quantum protocol fully exploits the dimensionality of the quantum data to achieve a much higher performance, provided data is at least three-dimensional. For the sake of comparison, we discuss the optimal protocol when the classical and quantum states are known.


I. INTRODUCTION
Quantum-based communication and computation technologies promise unprecedented applications and unforeseen speed-ups for certain classes of computational problems. Originally, the advantages of quantum computing were showcased through exemplary instances of problems that are hard to solve on a classical computer, such as integer factorization [1], unstructured search [2], discrete optimization [3,4], and simulation of many-body Hamiltonian dynamics [5]. In recent times, the field has ventured one step further: quantum computers are now also envisioned as nodes in a network of quantum devices, where connections are established via quantum channels, and data are quantum systems that flow through the network [6,7]. The design of future quantum networks in turn brings up new theoretical challenges, such as devising universal information processing protocols optimized to work with generic quantum inputs, without the need of human intervention.
Quantum learning algorithms are by design well suited for this class of automated tasks [8]. Generalizing classical machine learning ideas to operate with quantum data, algorithms have been devised for quantum template matching [9], quantum anomaly detection [10,11], learning unitary transformations [12] and quantum measurements [13], and classifying quantum states [14][15][16][17]. These works fall under the broad category of supervised learning [18,19], where the aim is to learn an unknown conditional probability distribution Pr(y|x) from a number of given samples x_i and associated values or labels y_i, called training instances. The performance of a trained learning algorithm is then evaluated by applying the learned function over new data x_i, called test instances. In the quantum extension of supervised learning [20], the training instances are quantum: say, copies of the quantum state templates, a potential anomalous state, or a number of uses of an unknown unitary transformation. The separation between training and testing steps is sometimes not as sharp: in reinforcement learning, training occurs on an instance basis via the interaction of an agent with an environment, and the learning process itself may alter the underlying probability distribution [21].
In contrast, unsupervised learning aims at inferring structure in an unknown distribution Pr(x) given random, unlabeled samples x_i. Typically, this is done by grouping the samples in clusters, according to a preset definition of similarity. Unsupervised learning is a versatile form of learning, attractive in scenarios where appropriately labeled training data is not available or too costly. But it is also, generically, a much more challenging problem [22,23]. To our knowledge, a quantum extension of unsupervised learning in the sense described above has not yet been considered in the literature. In this paper, we take a first step into this branch of quantum learning by introducing the problem of unsupervised binary classification of quantum states. We consider the following scenario: a source prepares quantum systems in two possible pure states that are completely unknown; after some time, N such systems have been produced, and we ask ourselves whether there exists a quantum device that is able to cluster them in two groups according to their states (see Fig. 1). This scenario represents a quantum clustering task in its simplest form, where the single feature defining a cluster of quantum systems is that their states are identical. While clustering classical data under this definition of cluster (a set of equal data instances) yields a trivial algorithm, merely observing such a simple feature in a finite quantum data set involves a nontrivial stochastic process and gives rise to a primitive of operational relevance for quantum information. Moreover, in some sense our scenario actually contains a classical binary clustering problem: if we were to measure each quantum system separately, we would obtain a set of N data points (the measurement outcomes). The points would be effectively sampled from the two probability distributions determined by the quantum states and the choice of measurement. The task would then be to identify which points were sampled from the same distribution. Reciprocally, we can interpret our quantum clustering task as a natural extension of a classical clustering problem with completely unstructured data, where the only feature that identifies a cluster is that its data points are sampled from a fixed, but arbitrary, categorical probability distribution (i.e., with neither order nor metric in the underlying space). The quantum generalization is then to consider (non-commuting) quantum states instead of probability distributions.
We require two important features in our quantum clustering device: (i) it has to be universal, that is, it should be designed to take any possible pair of types of input states, and (ii) it has to provide a classical description of the clustering, that is, of which particles belong to each cluster. Feature (i) ensures general purpose use and versatility of the clustering device, in a similar spirit to programmable quantum processors [24]. Feature (ii) allows us to assess the performance of the device purely in terms of the accuracy of the clustering, which in turn facilitates the comparison with classical clustering strategies. Also due to (ii), we can justifiably say that the device has not only performed the clustering task but also "learned" that the input is (most likely) partitioned as specified by the output description. Note that relaxing feature (ii) in principle opens the door to a more general class of sorting quantum devices, where the goal could be, e.g., to minimize the distance (under some norm) between the global output state and the state corresponding to a perfect clustering of the input. Such devices, however, fall beyond the scope of unsupervised learning.
Requiring the description of the clusters as a classical outcome induces structure in the device. To generate this information, a quantum measurement shall be performed over all N systems, with as many outcomes as there are possible clusterings. Then, the systems will be sorted according to this outcome (see Fig. 1). Depending on the context, e.g., on whether or not the systems will be further used after the clustering, different figures of merit shall be considered in the optimization of the device. In this paper we focus on the clustering part: our goal is to find the quantum measurement that maximizes the success probability of a correct clustering.
Features (i) and (ii) allow us to formally regard quantum clustering as a state discrimination task [25][26][27][28][29][30], albeit with important differences with respect to the standard setting. In quantum state discrimination [25], we want to determine the state of a quantum system among a set of known hypotheses (i.e., classical descriptions of quantum states). We can phrase this problem in machine learning terminology as follows. We have a test state (or several copies of it [29]) and we decide its label based on infinite training data. In other words, we have full knowledge about the meaning of the possible labels. Supervised quantum learning algorithms for quantum state classification [14][15][16][17] consider the intermediate scenario with limited training data. In this case, no classical description of the states is available. Instead, we are provided with a finite number of copies of systems in each of the possible quantum states, and thus we have only partial classical knowledge about the labels. Extracting the label information from the quantum training data then becomes a key step in the protocol. Following this line of thought, the problem we consider in this paper is a type of unsupervised learning, that is, one with no training. There is no information whatsoever about what state each label represents.
We obtain analytical expressions for the performance of the optimal clustering protocol for arbitrary values of the local dimension d of the systems, both for a finite number of systems N and in the asymptotic limit of many systems. We show that, in spite of the fact that the number of possible clusterings grows exponentially with N, the success probability decays only as O(1/N^2). Furthermore, we contrast these results with an optimal clustering algorithm designed for the classical version of the task. We observe a striking phenomenon when analyzing the performance of the two protocols for d > 2: whereas increasing the local dimension has a rapid negative impact on the success probability of the classical protocol (clustering becomes, naturally, harder), it turns out to be beneficial for its quantum counterpart.
We also see, through numerical analysis, that the quantum measurement that maximizes the success probability is also optimal for a more general class of cost functions that are more natural for clustering problems, including the Hamming distance. In other words, this provides evidence that our entire analysis does not depend strongly on the chosen figure of merit, but rather on the structure of the problem itself.
Measuring the systems will in principle degrade the information encoded in their states; hence, intuitively, there should be a trade-off between how good a clustering is and how much information about the original states is left in the clusters. Remarkably, our analysis reveals that the measurement that clusters optimally actually preserves information regarding the type of states that form each cluster. This feature adds to the usability of our device as a universal quantum data sorting processor. It can be regarded as the quantum analogue of a sorting network (or sorting memory) [31], used as a fixed network architecture that automatically orders generic inputs coming from an aggregated data pipeline. The details of this second step are however left for a subsequent publication.
The paper is organized as follows. In Section II, we formalize the problem and derive the optimal clustering protocol and its performance. In Section III, we consider a classical clustering protocol and contrast it with the optimal quantum one. We present the proofs of the main results of our work and the necessary theoretical tools to derive them in Section IV. We end in Section V by discussing the features of our quantum clustering device and other cost functions, and giving an outlook on future extensions.

II. CLUSTERING QUANTUM STATES
Let us suppose that a source prepares quantum systems randomly in one of two pure d-dimensional states |φ_0⟩ and |φ_1⟩ with equal prior probabilities. Given a sequence of N systems produced by the source, and with no knowledge of the states |φ_{0/1}⟩, we are required to assign labels '0' or '1' to each of the systems. The labeling can be achieved via a generalized quantum measurement that tries to distinguish among all the possible global states of the N systems. Each outcome of the measurement will then be associated to a possible label assignment, that is, to a clustering.
Consider the case of four systems. All possible clusterings that we may arrange are depicted in Fig. 2 as strings of red and blue balls. Since the individual states of the systems are unknown, what is labeled as "red" or "blue" is arbitrary, thus interchanging the labels leads to an equivalent clustering. For arbitrary N, there will be 2^{N−1} such clusterings. Fig. 2 also illustrates a natural way to label each clustering as (n, σ). The index n counts the number of systems in the smallest cluster. The index σ is a permutation that brings a reference clustering, defined as that in which the systems belonging to the smallest cluster fall all on the right, into the desired form. To make this labeling unambiguous, σ is chosen from a restricted set S_n ⊂ S_N, where S_N stands for the permutation group of N elements and e denotes its identity element. We will see that the optimal clustering procedure consists in measuring first the value of n, and, depending on the outcome, performing a second measurement that identifies σ among the relevant permutations with a fixed n.
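As a quick combinatorial aside (our own illustration, not part of the protocol), the count 2^{N−1} and the grouping of clusterings by n can be verified by brute force:

```python
from itertools import product

def clusterings(N):
    """Enumerate the distinct clusterings of N two-state systems.

    A clustering is a bit string up to global label exchange (a string and
    its complement describe the same clustering), so we keep the
    representative whose first bit is 0.
    """
    reps = set()
    for bits in product((0, 1), repeat=N):
        rep = bits if bits[0] == 0 else tuple(1 - b for b in bits)
        reps.add(rep)
    return sorted(reps)

def smallest_cluster_size(rep):
    """The index n of a clustering: the size of its smallest cluster."""
    ones = sum(rep)
    return min(ones, len(rep) - ones)

reps = clusterings(4)
assert len(reps) == 2 ** (4 - 1)          # 2^(N-1) clusterings for N = 4
counts = {}
for rep in reps:                           # group by n, as in Fig. 2
    n = smallest_cluster_size(rep)
    counts[n] = counts.get(n, 0) + 1
print(counts)                              # {0: 1, 1: 4, 2: 3}
```

The three groups of sizes 1, 4, and 3 match the boxes of Fig. 2.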
Thus, unsupervised clustering has been cast as a multihypothesis discrimination problem, which can be solved for an arbitrary number of systems N with local dimension d. Below, we outline the derivation of our main result: the expression of the maximum average success probability achievable by a quantum clustering protocol. In the limit of large N and for arbitrary d (not necessarily constant with N), we show that this probability behaves as stated in Eq. (1): it decays only as O(1/N^2), with a prefactor that grows linearly in d.

FIG. 2. All possible clusterings of N = 4 systems when each can be in one of two possible states, depicted as blue and red. The pair of indices (n, σ) identifies each clustering, where n is the size of the smallest cluster, and σ is a permutation of the reference clusterings (those on top of each box), wherein the smallest cluster falls on the right. The symbol e denotes the identity permutation, and (ij) the transposition of systems in positions i and j. Note that the choice of σ is not unique.
Naturally, P_s goes to zero with N, since the total number of clusterings increases exponentially and it becomes much harder to discriminate among them. What may perhaps come as a surprise is that, despite this exponential growth, the scaling of P_s is only of order O(1/N^2). Furthermore, increasing the local dimension yields a linear improvement in the asymptotic success probability. As we will later see, whereas the asymptotic behavior in N is not an exclusive feature of the optimal quantum protocol (we observe the same scaling in its classical counterpart, albeit only when d = 2), the ability to exploit extra dimensions to enhance distinguishability is.
Let us present an outlined derivation of the optimal quantum clustering protocol. Each input can be described by a string of 0's and 1's, x = (x_1 ⋯ x_N), so that the global state of the systems entering the device is |Φ_x⟩ = |φ_{x_1}⟩ ⊗ ⋯ ⊗ |φ_{x_N}⟩. The clustering device can generically be defined by a positive operator valued measure (POVM) with elements {E_x}, fulfilling E_x ≥ 0 and Σ_x E_x = 𝟙, where each operator E_x is associated to the statement "the measured global state corresponds to the string x". We want to find a POVM that maximizes the average success probability

P_s = 2^{1−N} ∫ dφ_0 dφ_1 Σ_x tr(|Φ_x⟩⟨Φ_x| E_x),

where we assumed that each clustering is equally likely at the input, and we are averaging over all possible pairs of states {|φ_0⟩, |φ_1⟩} and strings x. Since our goal is to design a universal clustering protocol, the operators E_x cannot depend on |φ_{0,1}⟩, and we can take the integral inside the trace. The clustering problem can then be regarded as the optimization of a POVM that distinguishes between effective density operators of the form

ρ_x = ∫ dφ_0 dφ_1 |Φ_x⟩⟨Φ_x|.    (2)

It now becomes apparent that ρ_x = ρ_x̄, where x̄ is the complementary string of x (i.e., the values 0 and 1 are exchanged).
The key that reveals the structure of the problem and allows us to deduce the optimal clustering protocol resides in computing the integral in Eq. (2). Averaging over the states leaves out only the information relevant to identify a clustering, that is, n and σ. Certainly, identifying x ≡ (n, σ), we can rewrite ρ_x as

ρ_{n,σ} = c_n U_σ (𝟙_sym^{N−n} ⊗ 𝟙_sym^{n}) U_σ^† = c_n ⊕_λ 𝟙_{(λ)} ⊗ Ω^{n,σ}_{{λ}}.    (3)

By applying Schur lemma, one readily obtains the first equality, where 𝟙_sym^k is a projector onto the completely symmetric subspace of k systems, c_n is a normalization factor, and U_σ is a unitary matrix representation of σ. The second equality follows from using the Schur basis (see Section IV A), in which the states ρ_{n,σ} are block-diagonal. Here λ labels the irreducible representations (irreps for short) of the joint action of the groups SU(d) and S_N over the vector space (C^d)^{⊗N}, and is usually identified with the shape of Young diagrams (or partitions of N). A pair of parentheses, () [brackets, {}], surrounding the subscript λ, e.g., in Eq. (3), is used when λ refers exclusively to irreps of SU(d) [S_N]; we stick to this convention throughout the paper. Note that averaging over all SU(d) transformations erases the information contained in the representation subspace (λ). It also follows from Eq. (3) and the rules of the Clebsch-Gordan decomposition that (i) only two-row Young diagrams (partitions of length two) show up in the direct sum above, and (ii) the operators Ω^{n,σ}_{{λ}} are rank-1 projectors (see Appendix B). They carry all the information relevant for the clustering, and are understood to be zero for irreps λ outside the support of ρ_{n,σ}.
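The Schur-lemma average behind the first equality in Eq. (3) can be checked numerically in the smallest instance. The sketch below (our own illustration, not part of the derivation) Monte Carlo averages |φ⟩⟨φ|^{⊗2} over Haar-random qubit states and compares the result with 𝟙_sym/3, i.e., with c_0 = 1/3 for N = 2, d = 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_ket(d):
    """A Haar-random pure state: a normalized complex Gaussian vector."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

# Average |phi><phi|^{x2} over Haar-random qubit states; Schur lemma
# predicts 1_sym / 3, the maximally mixed state on the symmetric subspace.
d, samples = 2, 20_000
acc = np.zeros((d * d, d * d), dtype=complex)
for _ in range(samples):
    phi = haar_ket(d)
    psi = np.kron(phi, phi)
    acc += np.outer(psi, psi.conj())
rho = acc / samples

swap = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)
sym = (np.eye(4) + swap) / 2        # projector onto the symmetric subspace
assert np.allclose(rho, sym / 3, atol=0.02)
```

The same average with two independent states |φ_0⟩, |φ_1⟩ yields the n = 1 hypothesis state with c_1 = 1/d^2.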
With Eq. (3) at hand, the optimal clustering protocol can be succinctly described as two successive measurements; we state the result here and present an optimality proof in Section IV A. The first measurement is a projection onto the irrep subspaces λ, described by the set {𝟙_{(λ)} ⊗ 𝟙_{{λ}}}. The outcome of this measurement provides an estimate of n, as λ is one-to-one related to the size of the clusters. More precisely, we have from (i) that λ = (λ_1, λ_2), where λ_1 and λ_2 are nonnegative integers such that λ_1 + λ_2 = N and λ_1 ≥ λ_2. Then, given the outcome λ = (λ_1, λ_2) of this first measurement, the optimal guess turns out to be n = λ_2. Very roughly speaking, the "asymmetry" in the subspace λ = (λ_1, λ_2) increases with λ_2. We recall that λ = (N, 0) is the fully symmetric subspace of (C^d)^{⊗N}. Naturally, ρ_{0,σ} has support only in this subspace, as all states in the data are of one type. As λ_2 increases from zero, more states of the alternative type are necessary to achieve the increasing asymmetry of λ = (λ_1, λ_2). Hence, for a given λ_2, there is a minimum value of n for which ρ_{n,σ} can have support in the subspace λ = (λ_1, λ_2). This minimum n is the optimal guess.
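As an aside, this bookkeeping is easy to tabulate. The sketch below (our own illustration, using the standard hook length and hook content formulas) lists the two-row diagrams for N = 4 together with the corresponding guess n = λ_2, and checks the Schur-Weyl dimension count Σ_λ s_λ ν_λ = d^N for d = 2, where only two-row diagrams occur:

```python
from math import factorial

def hooks(l):
    """Hook lengths of a two-row Young diagram l = (l1, l2), row by row."""
    l1, l2 = l
    first = [(l1 - 1 - j) + (1 if j < l2 else 0) + 1 for j in range(l1)]
    second = [l2 - j for j in range(l2)]
    return first + second

def nu(l):
    """Dimension of the S_N irrep {l}: hook length formula."""
    p = 1
    for h in hooks(l):
        p *= h
    return factorial(sum(l)) // p

def s(l, d):
    """Dimension of the SU(d) irrep (l): hook content formula."""
    cells = [(0, j) for j in range(l[0])] + [(1, j) for j in range(l[1])]
    num = den = 1
    for (i, j), h in zip(cells, hooks(l)):
        num *= d + j - i
        den *= h
    return num // den

N, d = 4, 2
two_row = [(N - k, k) for k in range(N // 2 + 1)]
print({l: l[1] for l in two_row})   # optimal guess n = lambda_2 per outcome
assert sum(s(l, d) * nu(l) for l in two_row) == d ** N   # Schur-Weyl count
```

For d > 2 the full space also contains diagrams with more rows, but the input states ρ_{n,σ} only populate the two-row blocks.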
Once we have obtained a particular λ = λ* as an outcome (and guessed n), a second measurement is performed over the subspace {λ*} to produce a guess for σ. Since the states ρ_{n,σ} are covariant under S_N, the optimal measurement to guess the permutation σ is also covariant, and its seed is the rank-1 operator Ω^{n,e}_{{λ*}}, where λ* = (N − n, n). Put together, these two successive measurements yield a joint optimal POVM whose elements take the form given in Eq. (4), where (n, σ) is the guess for the clustering and ξ^n_{λ*} is a coefficient that guarantees the POVM resolution of the identity. The success probability of the optimal protocol can then be computed as in Eq. (5), from which the asymptotic limit Eq. (1) follows (see Appendix C).
Before closing this section we would like to briefly discuss the case when some information about the possible states |φ_0⟩ and |φ_1⟩ is available. A clustering device that incorporates this information into its design should succeed with a probability higher than Eq. (5), at the cost of universality. To explore the extent of this performance enhancement, we study the extreme case where we have full knowledge of the states |φ_0⟩ and |φ_1⟩. We find that in the large N limit the maximum improvement is by a factor of N. The optimal success probability then scales as given in Eq. (6) (see Section IV B for details).

III. CLUSTERING CLASSICAL STATES
To grasp the significance of our quantum clustering protocol, a comparison with a classical analogue is called for. First, in place of a quantum system whose state is either |φ_0⟩ or |φ_1⟩, an input would be an instance of a d-dimensional random variable sampled from one of two categorical probability distributions, P = {p_s}_{s=1}^d and Q = {q_s}_{s=1}^d. Then, given a string of samples s = (s_1 ⋯ s_N), s_i ∈ {1, …, d}, the clustering task would consist in grouping the data points s_i in two clusters so that all points in a cluster have a common underlying probability distribution.
Second, in analogy with the quantum protocol, our goal would be to find the optimal universal (i.e., independent of P and Q) protocol that performs this task. Here, optimality means attaining the maximum average success probability, where the average is over all N-length sequences x of distributions P and Q from which the string s is sampled, and over all such distributions.
It should be emphasized that this is a very hard classical clustering problem, with absolutely minimal assumptions, where there is no metric in the domain of the random variables and, in consequence, no exploitable notion of distance. Therefore, one should expect the optimal algorithm to have a rather low performance and to differ significantly from well-known algorithms for classical unsupervised classification problems.
As a further remark, we note that a choice of prior is required to perform the average over P and Q. We will assume that the two are uniformly distributed over the simplex on which they are both defined. This reflects our lack of knowledge about the distributions underlying the string of samples s.
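Operationally, this prior is straightforward to sample. A minimal sketch (our own illustration): drawing a distribution uniformly from the simplex amounts to normalizing i.i.d. Exp(1) variates, i.e., sampling from Dirichlet(1, …, 1):

```python
import random

def uniform_simplex(d, rng=None):
    """Draw a categorical distribution uniformly from the (d-1)-simplex by
    normalizing i.i.d. Exp(1) variates (equivalently, Dirichlet(1, ..., 1))."""
    rng = rng or random
    e = [rng.expovariate(1.0) for _ in range(d)]
    t = sum(e)
    return [x / t for x in e]

P = uniform_simplex(3, random.Random(0))
assert abs(sum(P) - 1.0) < 1e-12 and all(p > 0 for p in P)
```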
Under all these specifications, the classical clustering problem we just defined naturally connects with the quantum scenario in Section II as follows. We can interpret s as a string of outcomes obtained upon performing the same projective measurement on each individual quantum state |φ_{x_i}⟩ of our original problem. Furthermore, such local measurements can also be interpreted as a decoherence process affecting the pure quantum states at the input, whereby they decay into classical probability distributions over a fixed basis. We might think of this as the semiclassical analogue of our original problem, since quantum resources are not fully exploited.
Let us first lay out the problem in the special case of d = 2, where the underlying distributions are Bernoulli, and we can write P = {p, 1 − p}, Q = {q, 1 − q}. Given an N-length string of samples s, our intuition tells us that the best we can do is to assign the same underlying probability distribution to equal values in s. So if, e.g., s = (00101⋯), we will guess that the underlying sequence of distributions is x̂ = (P P Q P Q ⋯) [or, equivalently, the complementary sequence (Q Q P Q P ⋯)]. Thus, data points will be clustered according to their value 0 or 1. The optimality of this guessing rule is a particular case of the result for d-dimensional random variables in Appendix F.
The probability that a string of samples s, with l zeros and N − l ones, arises from the guessed sequence x̂ is given by

Pr(s|x̂) = ∫_0^1 dp ∫_0^1 dq p^l (1 − q)^{N−l} = 1/[(l + 1)(N − l + 1)].    (7)
The average success probability can then be readily computed as P^cl_s = 2 Σ_{x,s} δ_{x,x̂} Pr(x) Pr(s|x) (recall that the guess x̂ depends on s), where Pr(x) = 2^{−N} is the prior probability of the sequence x, which we assume to be uniform. The factor 2 takes into account that guessing the complementary sequence leads to the same clustering. It is now quite straightforward to derive the asymptotic expression of P^cl_s for large N. In this limit x will typically have the same number of P and Q distributions, so the guess x̂ will be right only if l = N/2. Then P^cl_s takes the asymptotic form of Eq. (8). This expression coincides with the quantum asymptotic result in Eq. (1) for d = 2. As we now see, this is however a particularity of Bernoulli distributions.
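As a sanity check on these averages, the value-grouping rule is easy to simulate. The sketch below is our own illustration; the reference value 7/12 for N = 2 is a short exercise under the uniform prior (not a number quoted from the text):

```python
import random

def partition_by_value(bits):
    """Unordered two-block partition of indices induced by a bit string."""
    N = len(bits)
    blocks = {frozenset(i for i in range(N) if bits[i] == v) for v in (0, 1)}
    return frozenset(blocks - {frozenset()})

def classical_success_prob(N, trials=100_000, seed=1):
    """Monte Carlo estimate of the success probability of the value-grouping
    rule for d = 2: Bernoulli parameters p, q drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        p, q = rng.random(), rng.random()
        x = [rng.randrange(2) for _ in range(N)]          # true labels: 0=P, 1=Q
        s = [int(rng.random() < (q if xi else p)) for xi in x]
        hits += (partition_by_value(s) == partition_by_value(x))
    return hits / trials

est = classical_success_prob(2)
assert abs(est - 7 / 12) < 0.02    # exact N = 2 value under the uniform prior
```

The estimate drops quickly with N, consistent with the O(1/N^2) asymptotics.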
The derivation for d > 2 is more involved, since the optimal guessing rule is not so obvious (see Appendix F for details). Loosely speaking, we should still assign samples with the same value to the same cluster. By doing so, we obtain up to d preliminary clusters. We next merge them into two clusters in such a way that their final sizes are as balanced as possible. This last step, known as the partition problem [33], is weakly NP-complete. Namely, its complexity is polynomial in the magnitudes of the data involved (the sizes of the preliminary clusters, which depend on N) but non-polynomial in the input size (the number of such clusters, determined by d). This means that the classical and semiclassical protocols cannot be implemented efficiently for arbitrary d. In the asymptotic limit of large N, and for arbitrary fixed values of d, we obtain the scaling of Eq. (9). There is a huge difference between this result and Eq. (1). Whereas increasing the local dimension provides an asymptotic linear advantage in the optimal quantum clustering protocol (states become more orthogonal), it has the opposite effect in its classical and semiclassical analogues, as it reduces the success probability exponentially.
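The merging step can be sketched with the textbook pseudo-polynomial subset-sum dynamic program; this is a generic illustration of the partition problem, not the protocol of Appendix F:

```python
def most_balanced_split(sizes):
    """Split preliminary cluster sizes into two groups with sums as equal as
    possible: the partition problem, solved by the standard pseudo-polynomial
    subset-sum dynamic program (polynomial in sum(sizes), i.e., in N)."""
    total = sum(sizes)
    reachable = {0: []}                    # subset sum -> indices achieving it
    for k, c in enumerate(sizes):
        for t, subset in list(reachable.items()):
            reachable.setdefault(t + c, subset + [k])
    best = min(reachable, key=lambda t: abs(total - 2 * t))
    group = set(reachable[best])
    rest = [i for i in range(len(sizes)) if i not in group]
    return sorted(group), rest

# e.g. preliminary clusters of sizes 4, 3, 2, 1 merge into 4+1 versus 3+2
a, b = most_balanced_split([4, 3, 2, 1])
assert sum([4, 3, 2, 1][i] for i in a) == sum([4, 3, 2, 1][i] for i in b)
```

The table of reachable sums has at most N + 1 entries, while the number of subsets grows exponentially in d, which is the sense in which the problem is only weakly hard.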
In the opposite regime, i.e., for d asymptotically large and fixed values of N, the optimal classical and semiclassical strategies provide no improvement over random guessing, and the clustering task becomes exceedingly hard and somewhat uninteresting. This follows from observing that the guessing rule relies on grouping repeated data values. In this regime, the typical string of samples s has no repeated elements, thus we are left with no alternative but to randomly guess the right clustering of the data, and P^cl_s ∼ 2^{1−N}. To complete the picture, we end this section by considering known classical probability distributions. Akin to the quantum case, one would expect an increase in the success probability of clustering. An immediate consequence of knowing the distributions P and Q is that the rule for assigning a clustering given a string of samples s becomes trivial. Each symbol s_i ∈ {1, …, d} will be assigned to the most likely distribution, that is, to P (Q) if p_{s_i} > q_{s_i} (p_{s_i} < q_{s_i}). It is clear that knowing P and Q helps to better classify the data. This becomes apparent by considering the example of two three-dimensional distributions and the data string s = (112). If the distributions are unknown, such a sequence leads to the guess x̂ = (P P Q) [or equivalently to (Q Q P)]. In contrast, if P and Q are known and, e.g., p_1 > q_1 and p_2 > q_2, the same sequence leads to the better guess x̂ = (P P P). The advantage of knowing the distributions, however, vanishes in the large N limit, and the asymptotic performance of the optimal clustering algorithm is again given by Eq. (9). The interested reader can find the details of the proof in Appendix G.
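The known-distribution rule from this example is one line of code. The sketch below is our own illustration, with made-up numbers chosen to satisfy p1 > q1 and p2 > q2 (symbols are 0-indexed, so the data string s = (112) becomes [0, 0, 1]):

```python
def cluster_known(s, P, Q):
    """With P and Q known, assign each symbol to the distribution under which
    it is more likely (ties broken arbitrarily in favor of Q here)."""
    return ['P' if P[si] > Q[si] else 'Q' for si in s]

# hypothetical three-dimensional distributions with p1 > q1 and p2 > q2
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]
assert cluster_known([0, 0, 1], P, Q) == ['P', 'P', 'P']
```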

IV. METHODS
Here we give the full proof of optimality of our quantum clustering protocol, which leads to our main result in Eq. (1). The proof relies on the representation theory of the special unitary and the symmetric groups. In particular, Schur-Weyl duality is used to efficiently represent the structure of the input quantum data and the action of the device. We then leverage this structure to find the optimal POVM and compute the minimum cost. Basic notions of representation theory that we use in the proof are covered in Appendices A and B. We close the Methods section by proving Eq. (6) for the optimal success probability of clustering known quantum states.

A. Clustering quantum states: unknown input states
In this Section we obtain the optimal POVM for quantum clustering and compute the minimum cost. First, we present a formal optimality proof for an arbitrary cost function f(x, x′), which specifies the penalty for guessing x when the input is x′. Second, we particularize to the case of the success probability, as discussed in the main text, for which explicit expressions are obtained.

Generic cost functions
We say a POVM is optimal if it minimizes the average cost

f̄ = Σ_{x,x′} η_{x′} f(x, x′) Pr(x|x′),    (10)

where η_{x′} is the prior probability of the input string x′, and Pr(x|x′) = tr(E_x ρ_{x′}) is the probability of obtaining measurement outcome (and guess) x given input x′; an average is taken over all possible pairs of states {|φ_0⟩, |φ_1⟩}, hence x and its complementary x̄ define the same clustering. A convenient way to identify the different clusterings is by counting the number n, 0 ≤ n ≤ ⌊N/2⌋, of zeros in x (so, strings with more 0s than 1s are discarded) and giving a unique representative σ of the equivalence class of permutations that turn the reference string (0^n 1^{n̄}), n̄ = N − n, into x. We will denote the subset of these representatives by S_n ⊂ S_N, and the number of elements in each equivalence class by b_n. A simple calculation gives us b_n = 2(n!)^2 if n = n̄, and b_n = n! n̄! otherwise.
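These class sizes are consistent with the total count of clusterings: summing the N!/b_n representatives over n must recover the 2^{N−1} clusterings of the main text. A quick check (our own illustration):

```python
from math import factorial

def b(n, N):
    """Size of the equivalence class of permutations that map the reference
    string (0^n 1^(N-n)) to a fixed clustering with smallest cluster size n."""
    nbar = N - n
    return 2 * factorial(n) ** 2 if n == nbar else factorial(n) * factorial(nbar)

def num_clusterings(N):
    """Summing the class counts N!/b_n over n recovers all clusterings."""
    return sum(factorial(N) // b(n, N) for n in range(N // 2 + 1))

for N in range(1, 12):
    assert num_clusterings(N) == 2 ** (N - 1)
```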
As discussed in the main text, the clustering problem above is equivalent to a multi-hypothesis discrimination problem, where the hypotheses are given by

ρ_{n,σ} = c_n U_σ (𝟙_sym^{n̄} ⊗ 𝟙_sym^{n}) U_σ^†,    (11)

and we have used Schur lemma to compute the integral.
Here, U_σ is a unitary matrix representation of the permutation σ, 𝟙_sym^k is a projector onto the completely symmetric subspace of k systems, and c_n is a normalization factor, given by the inverse product of the dimensions of the two symmetric subspaces. The states (11) are block-diagonal in the Schur basis, which decouples the commuting actions of the groups SU(d) and S_N over product states of the form of |Φ_x⟩. More precisely, Schur-Weyl duality states that the representations of the two groups acting on the common space (C^d)^{⊗N} are each other's commutant. Moreover, it provides a decomposition of this space into decoupled subspaces associated to irreducible representations (irreps) of both SU(d) and S_N. We can then express the states ρ_x, where x is specified as (n, σ) [x = (n, σ) for short], in the Schur basis as

ρ_{n,σ} = c_n ⊕_λ 𝟙_{(λ)} ⊗ Ω^{n,σ}_{{λ}}.    (12)

In this direct sum, λ is a label attached to the irreps of the joint action of SU(d) and S_N and is usually identified with a partition of N or, equivalently, a Young diagram. As explained in the main text, a pair of parentheses surrounding this type of label, like in (λ), means that it refers specifically to irreps of SU(d). Likewise, a pair of brackets, e.g., {λ}, indicates that the label refers to irreps of S_N. In accordance with this convention, Schur-Weyl duality implies that Ω^{n,σ}_{{λ}} = U^λ_σ Ω^{n,e}_{{λ}} (U^λ_σ)^†, where U^λ_σ is the matrix of the irrep λ that represents σ ∈ S_N, and e denotes the identity permutation (for simplicity, we omit the index e when no confusion arises). In other words, the family of states ρ_{n,σ} is covariant with respect to S_N. One can easily check that Ω^{n,σ}_{{λ}} is always a rank-1 projector (see Appendix B). In Eq. (12) it is understood that Ω^{n,σ}_{{λ}} = 0 outside of the range of ρ_{n,σ}. With no loss of generality, the optimal measurement that discriminates the states ρ_{n,σ} can be represented by a POVM whose elements have the form shown in Eq. (12). Moreover, we can assume it to be covariant under S_N [34]. So, such POVM elements can be written as

E_{n,σ} = ⊕_λ 𝟙_{(λ)} ⊗ U^λ_σ Ξ^n_{{λ}} (U^λ_σ)^†,    (13)

where Ξ^n_{{λ}} is some positive operator. The resolution of the identity condition imposes constraints on these operators. The condition reads Σ_{n,σ∈S_n} E_{n,σ} = 𝟙, which takes the form of Eq. (14), where we have used the factor b_n to extend the sum over S_n to the entire group S_N and applied Schur lemma.
Taking the trace on both sides of the equation, we find the POVM constraint

Σ_n N! tr(Ξ^n_{{λ}})/(b_n ν_λ) = 1 for every λ,    (15)

where ν_λ is the dimension of 𝟙_{{λ}} or, equivalently, the multiplicity of the irrep λ of SU(d) [see Eq. (B5)]. So far we have analyzed the structure that the symmetries of the problem impose on the states ρ_{n,σ} and the measurements. We have learned that for any choice of operators Ξ^n_{{λ}} that fulfill Eq. (15), the set of operators (13) defines a valid POVM, but it need not be optimal. So, we now proceed to derive optimality conditions for Ξ^n_{{λ}}. Those are provided by the Holevo-Yuen-Kennedy-Lax [35,36] necessary and sufficient conditions for minimizing the average cost. For our clustering problem in Eq. (10) they read

(W_x − Γ) E_x = 0,    (16)
W_x − Γ ≥ 0.    (17)

They must hold for all x, where Γ = Σ_x W_x E_x = Σ_x E_x W_x, and W_x = Σ_{x′} f(x, x′) η_{x′} ρ_{x′}. We will assume that the prior distribution η_x is flat and that the cost function is nonnegative and covariant with respect to the permutation group, i.e., f(x, x′) = f(τx, τx′) for all τ ∈ S_N. Then, W_{τx} = U_τ W_x U_τ^† and we only need to ensure that conditions (16) and (17) are met for reference strings, for which x = (n, e). In the Schur basis, their corresponding operators, which we simply call W_n, and the matrix Γ take the form

W_n = ⊕_λ 𝟙_{(λ)} ⊗ ω^n_{{λ}},    (18)
Γ = ⊕_λ k_λ 𝟙_{(λ)} ⊗ 𝟙_{{λ}},    (19)

where we have used Schur lemma to obtain Eq. (19) and defined k_λ ≡ Σ_n N! tr(ω^n_{{λ}} Ξ^n_{{λ}})/(b_n ν_λ). Note that Γ is a diagonal matrix, in spite of the fact that the ω^n_{{λ}} are, at this point, arbitrary full-rank positive operators.
With Eqs. (18) and (19), the optimality conditions (16) and (17) can be made explicit. First, we note that the subspace (λ) is irrelevant in this calculation, and that there is an independent condition for each irrep λ. Taking these considerations into account, Eq. (16) becomes Eq. (20). This equation tells us two things: (i) since the matrices ω_n^{{λ}} and Ξ_n^{{λ}} commute, they have a common eigenbasis, and (ii) Eq. (20) is a set of eigenvalue equations for ω_n^{{λ}} with a common eigenvalue k_λ, one equation for each eigenvector of Ξ_n^{{λ}}. Therefore, the support of Ξ_n^{{λ}} is necessarily restricted to a single eigenspace of ω_n^{{λ}}. Denoting by ϑ^n_{λ,a}, a = 1, 2, . . ., the eigenvalues of ω_n^{{λ}} sorted in increasing order, we have k_λ = ϑ^n_{λ,a} for some a, which may depend on λ and n, or else Ξ_n^{{λ}} = 0. The second Holevo condition, Eq. (17), under the same considerations regarding the block-diagonal structure, leads to Eq. (21). This condition induces further structure in the POVM. Given λ, Eq. (21) has to hold for every value of n.
In particular, we must have min_n ϑ^n_{λ,1} ≥ k_λ. Therefore, min_n ϑ^n_{λ,1} ≥ ϑ^n_{λ,a} for some a, or else Ξ_n^{{λ}} = 0. Since Ξ_n^{{λ}} cannot vanish for all n because of Eq. (15), we readily see that Ξ_n^{{λ}} takes the form of Eq. (22), where n(λ) = argmin_n ϑ^n_{λ,1}, Π_1(ω_n^{{λ}}) is a projector onto the eigenspace of ω_n^{{λ}} (not necessarily the whole subspace) corresponding to the minimum eigenvalue ϑ^n_{λ,1}, and ξ^n_λ is a suitable coefficient that can be read off from Eq. (15). This completes the construction of the optimal POVM.
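As an aside, the Holevo-Yuen-Kennedy-Lax conditions can be verified directly in the simplest nontrivial instance: two equiprobable real qubit states discriminated by a Helstrom measurement. The following sketch is our own illustration, not part of the derivation above; it uses the success-probability form Γ = Σ_x η_x ρ_x E_x with Γ − η_x ρ_x ⪰ 0, equivalent to conditions (16) and (17) for the cost f = 1 − δ.

```python
import math

def mul(A, B):  # 2x2 matrix product
    return [[sum(A[i][k]*B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

def eigvals2(A):  # eigenvalues of a symmetric 2x2 matrix, increasing order
    tr, det = A[0][0] + A[1][1], A[0][0]*A[1][1] - A[0][1]*A[1][0]
    disc = math.sqrt(max(tr*tr - 4*det, 0.0))
    return (tr - disc)/2, (tr + disc)/2

theta = 0.3  # states (cos t, ±sin t); overlap cos(2t)
phi0, phi1 = (math.cos(theta), math.sin(theta)), (math.cos(theta), -math.sin(theta))
rho = [[[v[i]*v[j] for j in range(2)] for i in range(2)] for v in (phi0, phi1)]
# Helstrom POVM: projectors onto the eigenvectors of rho0 - rho1
E = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, -0.5], [-0.5, 0.5]]]
eta = 0.5  # equal priors
W = [[[eta*r[i][j] for j in range(2)] for i in range(2)] for r in rho]
Gamma = [[sum(mul(W[x], E[x])[i][j] for x in range(2)) for j in range(2)] for i in range(2)]
# YKL conditions for maximum success probability:
assert abs(Gamma[0][1] - Gamma[1][0]) < 1e-12        # Gamma is Hermitian
for x in range(2):
    assert eigvals2(sub(Gamma, W[x]))[0] > -1e-12    # Gamma - eta*rho_x >= 0
# The achieved success probability, tr(Gamma), matches the Helstrom value
overlap = abs(phi0[0]*phi1[0] + phi0[1]*phi1[1])
p_helstrom = 0.5*(1 + math.sqrt(1 - overlap**2))
assert abs((Gamma[0][0] + Gamma[1][1]) - p_helstrom) < 1e-12
```

Once the conditions hold, tr Γ directly gives the optimal success probability, which is the same mechanism exploited for the covariant POVM above.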
For a generic cost function, we can now write down a closed, implicit formula for the minimum average cost achievable by any quantum clustering protocol, Eq. (24), where s_λ is the dimension of 𝟙_{(λ)} or, equivalently, the multiplicity of the irrep λ of S_N [see Eq. (B6)]. The only object that remains to be specified is the function n(λ), which ultimately depends on the choice of the cost function f(x, x′).

Success probability
We now make Eq. (24) explicit by considering the success probability P_s as the figure of merit; that is, we choose f(x, x′) = 1 − δ_{x,x′}, hence P_s = 1 − f̄. We also assume that the source producing the input sequence is equally likely to prepare either state, so each string x has the same prior probability, η_x = 2^{1−N} ≡ η. In this case, W_n takes a simple form, Eq. (25), where the μ_λ are positive coefficients and we recall that the expression in parentheses corresponds to ω_n^{{λ}} in Eq. (18). From this expression one can easily derive the explicit forms of ϑ^n_{λ,1} and n(λ). We just need to consider the maximum eigenvalue of the rank-one projector Ω_n^{{λ}}, which can be either one or zero depending on whether or not the input state ρ_{n,σ} has support in the irrep λ space. So, among the values of n for which ρ_{n,σ} does have support there, n(λ) is one that maximizes c_n. Since c_n is a decreasing function of n in its allowed range (recall that n ≤ ⌊N/2⌋), n(λ) is the smallest such value.
For the problem at hand, the irreps in the direct sum can be labeled by Young diagrams of at most two rows or, equivalently, by partitions of N of length at most two (see Appendix B); hence λ = (λ_1, λ_2), where λ_1 + λ_2 = N and λ_2 runs from 0 to ⌊N/2⌋. Given λ, only states ρ_n with n = λ_2, . . ., ⌊N/2⌋ have support on the irrep λ space, as readily follows from the Clebsch-Gordan decomposition rules. Then, Eq. (26) gives the optimal guess for the size n of the smallest cluster. The rule agrees with our intuition. The irrep (N, 0), i.e., λ_2 = 0, corresponding to the fully symmetric subspace, is naturally associated with the value n = 0, i.e., with all N systems being in the same state/cluster; the irrep with one antisymmetrized index has λ_2 = 1 and hints at one system being in a different state than the others, i.e., at a cluster of size one; and so on.
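The labeling of the two-row irreps and the guessing rule n(λ) = λ_2 can be sketched in a few lines (a toy enumeration of ours; the function names are hypothetical):

```python
# Irreps appearing for N systems of two types: Young diagrams with at most
# two rows, lambda = (N - k, k) with k = 0, ..., floor(N/2).
def two_row_irreps(N):
    return [(N - k, k) for k in range(N // 2 + 1)]

# States rho_{n,sigma} with cluster size n have support on irrep (N - k, k)
# iff n >= k (Clebsch-Gordan rules).
def supported_cluster_sizes(lam, N):
    return list(range(lam[1], N // 2 + 1))

# Since c_n decreases with n, the optimal guess is the smallest supported n.
def optimal_guess(lam):
    return lam[1]  # n(lambda) = lambda_2

N = 8
for lam in two_row_irreps(N):
    sizes = supported_cluster_sizes(lam, N)
    assert optimal_guess(lam) == min(sizes)
```

For instance, for N = 8 the outcome λ = (6, 2) is interpreted as a smallest cluster of size n = 2, even though sizes 2, 3, and 4 are all compatible with it.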
We now have all the ingredients to compute the optimal success probability from Eq. (24). It reads as in Eq. (27), where we have used the relation Σ_λ s_λ ν_λ μ_λ = 1, which follows from tr Σ_x η_x ρ_x = 1, and the expressions for ν_λ and s_λ from Eqs. (B5) and (B6) in Appendix B.

B. Clustering quantum states: known input states
If the two possible states |φ_0⟩ and |φ_1⟩ are known, the optimal clustering protocol must use this information. It is then expected that the average performance will be much higher than for the universal protocol. In this context it is natural not to identify a given string x with its complement x̄ (we stick to the notation in the main text), since mistaking one state for the other should clearly count as an error if the two preparations are specified. In this case, then, clustering is equivalent to discriminating the 2^N known pure states |Φ_x⟩ = |φ_{x_1}⟩ ⊗ ··· ⊗ |φ_{x_N}⟩, where with no loss of generality we can write |φ_0⟩ and |φ_1⟩ in a convenient choice of basis [Eq. (28)]. Here c = |⟨φ_0|φ_1⟩| is the overlap of the two states.
The Gram matrix G encapsulates all the information needed to discriminate the states of the set. It is defined by its elements G_{x,x′} = ⟨Φ_x|Φ_{x′}⟩. It is known that when the diagonal elements of its square root are all equal, i.e., (√G)_{x,x} ≡ S for all x, the square-root measurement is optimal [37,38] and the probability of successful identification reads simply P_s = S². Notice that we have implicitly assumed uniformly distributed hypotheses. For the case at hand, G = g^{⊗N}, where g is the Gram matrix of {|φ_0⟩, |φ_1⟩}. Thus √G = (√g)^{⊗N}; as expected, the diagonal elements of √G are all equal, and the success probability follows immediately. We call the reader's attention to the fact that one could have attained the very same success probability by performing an individual Helstrom measurement [25], with basis {|ψ_0⟩, |ψ_1⟩} [Eq. (33)], on each state of the input sequence and guessing that the label of that state was the outcome value. In other words, for the problem at hand, global quantum measurements do not provide any improvement over individual fixed measurements.
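Because G = g^{⊗N}, its square root factorizes as √G = (√g)^{⊗N}, and the agreement between the square-root measurement and the local Helstrom strategy can be verified directly. The sketch below is our own check: the closed form it tests, P_s = [(1 + √(1 − c²))/2]^N, follows from the 2 × 2 algebra and is our simplification, not quoted from the text.

```python
import math

def sqrt_gram_2x2(c):
    # square root of g = [[1, c], [c, 1]]: eigenvalues 1 ± c,
    # eigenvectors (1, ±1)/sqrt(2), so sqrt(g) = [[a, b], [b, a]] with:
    a = (math.sqrt(1 + c) + math.sqrt(1 - c)) / 2
    b = (math.sqrt(1 + c) - math.sqrt(1 - c)) / 2
    return a, b

c, N = 0.6, 5
a, b = sqrt_gram_2x2(c)
# sanity check: (sqrt g)^2 = g
assert abs(a*a + b*b - 1) < 1e-12 and abs(2*a*b - c) < 1e-12
# G = g^{(x)N}, so every diagonal element of sqrt(G) equals S = a**N
S = a**N
srm_success = S**2
# a local Helstrom measurement succeeds per copy with prob (1 + sqrt(1-c^2))/2
helstrom_each = (1 + math.sqrt(1 - c*c)) / 2
assert abs(srm_success - helstrom_each**N) < 1e-12
```

The tensor-product structure is what makes the two strategies coincide here; for the mixed-state variant below this shortcut is no longer available.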
In order to compare with the results of the main text, we compute the average performance for a uniform distribution of the states |φ_0⟩ and |φ_1⟩, i.e., the average of P_s over the overlap c, where we have inserted a resolution of the identity and used the invariance of the measure dφ under SU(d) transformations. The marginal distribution μ(c²) is derived in Appendix E. Using this result, the asymptotic behavior of the last integral can be worked out. As expected, knowing the two possible states in the input string leads to a better behavior of the success probability: it decreases only linearly in 1/N, as compared to the best universal quantum clustering protocol, which exhibits a quadratic decrease.
To make a fairer comparison with universal quantum clustering, guessing the complementary string x̄ instead of x will now be counted as a success; that is, the clusterings are now defined by the equal mixtures of the states |Φ_x⟩ and |Φ_x̄⟩. For this variation of the problem, the optimal measurement is still local, and it is given by a POVM whose elements are built from the (local) Helstrom measurement basis {|ψ_0⟩, |ψ_1⟩} in Eq. (33). Note that the {E_x} are orthogonal projectors.
To prove the statement in the last paragraph, we show that the Holevo-Yuen-Kennedy-Lax conditions, Eq. (16), hold (recall that the Gram matrix technique does not apply to mixed states). For the success probability and assuming equal priors, these conditions take the simpler forms (38) and (39), where we have dropped the irrelevant factor η = 2^{1−N}. Condition (38) is trivially satisfied. To check that condition (39) also holds, we recall the Weyl inequalities for the eigenvalues of Hermitian n × n matrices A and B [39], Eq. (40), which hold for j = 0, 1, . . ., n − i, where the eigenvalues are labeled in increasing order ϑ_1 ≤ ϑ_2 ≤ ··· ≤ ϑ_n. We use Eq. (40) to write the required bound (note that effectively all these operators act on the 2^N-dimensional subspace spanned by {|0⟩, |1⟩}^{⊗N}). As will be proved below, Γ > 0, which implies that ϑ_1(Γ) > 0.
To show the positivity of Γ, which was assumed in the previous paragraph, we use Eqs. (28) and (33) to compute its eigenvalues ϑ_k explicitly, in terms of coefficients a_1, b_1, a_2, b_2 for which a_1 > b_1 and a_2 > |b_2|. Thus, if 0 ≤ c < 1, we have ϑ_k > 0 for k = 1, 2, . . ., 2^N. The special case c = 1 is degenerate: Eq. (39) is trivially saturated, rendering P_s = 2^{1−N}, as it should. The maximum success probability can now be computed recalling that P_s(c) = 2^{1−N} tr Γ. We obtain a sum of two terms, where the first corresponds to guessing correctly all the states in the input string, whereas the second results from guessing the other possible state all along the string. One can easily check that the average over c of the second term vanishes exponentially for large N, and we end up with a success probability given again by Eq. (35). Finally, we would like to mention that one could consider a simple unambiguous protocol [40-43] whereby each state of the input string would be identified with no error with probability P_s(c) = 1 − c, i.e., the protocol would give an inconclusive answer with probability 1 − P_s = c. The average unambiguous probability of sorting the data is then obtained by averaging over c.

V. DISCUSSION

Unsupervised learning, which assumes virtually nothing about the distributions underlying the data, is already a hard problem [22,23]. Lifting the notion of classical data to quantum data (i.e., states) factors in additional obstacles, such as the impossibility of repeatedly operating on the quantum data without degrading it. The most prominent classical clustering algorithms rely heavily on the iterative evaluation of a function on the input data (e.g., pairwise distances between points in a feature vector space, as in k-means [44]); hence they are not equipped to deal with degrading data and would be expected to fail in our scenario. The unsupervised quantum classification algorithm we present is thus, by necessity, far removed from its classical analogues. In particular, since we are concerned with the optimal quantum strategy, we need to consider the most general collective measurement, which is inherently single-shot: it yields a single sample of a stochastic action, namely, a posterior state and an outcome of a quantum measurement, where the latter provides the description of the clustering. The main lesson stemming from our investigation is that, despite these limitations, clustering unknown quantum states is a feasible task. The optimal protocol that solves it showcases some interesting features.
It does not completely erase the information about a given preparation of the input data after clustering. This is apparent from Eq. (4), since the action of the POVM on the subspaces (λ) is the identity. After the input data string in the global state |Φ_x⟩ is measured and outcome λ* is obtained (recall that λ* gives us information about the size of the clusters), information relative to the particular states φ_{0/1} remains in the subspace (λ*) of the global postmeasurement state. Therefore, one could potentially use the posterior (clustered) states further down the line as approximations of the two classes of states. This opens the door for our clustering device to be used as an intermediate processor in a quantum network. This notwithstanding, the amount of information that can be retrieved after optimal clustering is currently under investigation.
It outperforms the classical and semiclassical protocols. If the local dimension of the quantum data is larger than two, the dimensionality of the symmetric subspaces spanned by the global states of the strings of data can be exploited by means of collective measurements, with a twofold effect: enhanced distinguishability of the states, resulting in improved clustering performance (exemplified by a linear increase in the asymptotic success probability), and information-preserving data handling (to some extent, as discussed above). This should be contrasted with the semiclassical protocol, which essentially obliterates the information content of the data (as a von Neumann measurement is performed on each system) and whose success probability vanishes exponentially with the local dimension. In addition, the optimal classical and semiclassical protocols require solving an NP-complete problem, and their implementation is thus inefficient. In contrast, we observe that the first part of the quantum protocol, which consists in guessing the size of the clusters, n, runs efficiently on a quantum computer: this step involves a Schur transform, which runs in time polynomial in N and log d [45,46], followed by a projective measurement with no computational cost. The second part, guessing the permutation σ, requires implementing a group-covariant POVM. The complexity of this step, and hence the overall computational complexity of our protocol, is an open question currently under investigation.
It is optimal for a range of different cost functions. There are various cost functions that could arguably be better suited to quantum clustering, e.g., the Hamming distance between the guessed and the true clusterings or, likewise, the trace distance or the infidelity between the corresponding effective states ρ_{n,σ} and ρ_{n′,σ′}. They are, however, hard to deal with analytically. The question arises as to whether our POVM is still optimal for such cost functions. To answer it, we formulate an optimality condition that can be checked numerically for problems of finite size (see Appendix D). Our numerics show that the POVM remains optimal for all these examples. This is an indication that the optimality of our protocol stems from the structure of the problem, independently of the cost function.
It stands as a landmark in multi-hypothesis state discrimination. Analytical solutions to multi-hypothesis state discrimination exist only in a few specific cases [26-28, 30, 38, 47]. Our set of hypotheses arises from arguably the minimal set of assumptions about a pure-state source: it produces two states randomly. Variants of this problem with much more restrictive assumptions have been considered in Refs. [11,48,49].
Our clustering protocol departs from other notions of quantum unsupervised machine learning found in the literature [50-53]. In these references, data coming from a classical problem are encoded in quantum states that are available on demand via a quantum random access memory [54]. The goal is to surpass classical performance in the number of required operations. In contrast, we deal with unprocessed quantum data as input and aim at performing a task that is genuinely quantum. This is a notably harder scenario, where known heuristics for classical algorithms simply cannot work.
Other extensions of this work currently under investigation are: clustering systems whose states can be of more than two types, where we expect a similar two-step measurement for the optimal protocol; and clustering of quantum processes, where the aim is to classify instances of unknown processes by letting them run on some input test state of our choice (see Ref. [11] for related work on identifying malfunctioning devices). In this last case, an interesting application arises when considering causal relations as the defining feature of a cluster. A clustering algorithm would then aim to identify, within a set of unknown processes, which ones are causally connected. Identifying causal structures has recently attracted attention in the quantum information community [55].
are each other's commutants. It follows that this reducible representation decomposes into irreps λ, so that the joint action of the two groups can be expressed as in Eq. (B1), where R^λ and U^λ_σ are the matrices that represent R and U_σ, respectively, on the irrep λ. To resolve any ambiguity that may arise, we write λ in parentheses, (λ), when it refers to the irreps of SU(d), or in brackets, {λ}, when it refers to those of S_N. Eq. (B1) tells us that the dimension of (λ), s_λ, coincides with the multiplicity of {λ}, and conversely, the dimension of {λ}, ν_λ, coincides with the multiplicity of (λ).
This block-diagonal structure provides a decomposition of the Hilbert space H^{⊗N} = (C^d)^{⊗N} into subspaces that are invariant under the action of SU(d) and S_N, as H^{⊗N} = ⊕_λ H_λ, with H_λ = H_{(λ)} ⊗ H_{{λ}}. The basis in which H^{⊗N} takes this form is known as the Schur basis, and the unitary transformation that changes from the computational to the Schur basis is called the Schur transform.
To conclude this appendix, let us recall the rules for reducing the tensor product of two SU(d) representations as a Clebsch-Gordan series of the form of Eq. (B2), where dim(𝟙_{λ′}) is the multiplicity of the irrep λ′. The same rules also apply to the reduction of the outer product of representations of S_n and S_{n′} into irreps of S_{n″}, where n″ = n + n′; in this case one has Eq. (B3). Note the different meanings of ⊗ in the last two equations (this is, however, standard notation). The rules are most easily stated in terms of the Young diagrams that label the irreps. They are as follows: 1. In one of the diagrams that label the irreps on the left-hand side of Eq. (B2) or Eq. (B3) (preferably the smaller one), write the symbol a in all boxes of the first row, the symbol b in all boxes of the second row, c in all boxes of the third one, and so on.
2. Attach the boxes with a to the second Young diagram in all possible ways, subject to the rules that no two a's appear in the same column and that the resulting arrangement of boxes is still a Young diagram. Repeat this process with the b's, the c's, and so on.
3. For each Young diagram obtained in step 2, read the added symbols in the first row from right to left, then those in the second row in the same order, and so on. The resulting sequence of symbols, e.g., abaabc..., must be a lattice permutation; namely, to the left of any point in the sequence there are no fewer a's than b's, no fewer b's than c's, and so on. Discard all diagrams that do not comply with this rule.
The Young diagrams λ′ that result from this procedure specify the irreps on the right-hand side of Eqs. (B2) and (B3). The same diagram can appear a number M of times, in which case λ′ has multiplicity dim(𝟙_{λ′}) = M.

Particularities of quantum clustering
Since the density operators [cf. Eq. (11)] and POVM elements [cf. Eq. (13)] associated with each possible clustering emerge from the joint action of a permutation σ ∈ S_N and a group average over SU(d), it is most convenient to work in the Schur basis, where the mathematical structure is much simpler. A further simplification, specific to quantum clustering of two types of states, is that the irreps that appear in the block-diagonal decomposition of the states (and, hence, of the POVM elements) are labeled by partitions of length at most 2, i.e., by bipartitions λ = (λ_1, λ_2), and correspond to Young diagrams of at most two rows. This is because the ρ_{n,σ} arise from the tensor product of two completely symmetric projectors, 𝟙^sym_n and 𝟙^sym_{n′}, of n and n′ = N − n systems [cf. Eq. (11)]. These project onto the irrep λ = (n, 0) and λ′ = (n′, 0) subspaces, respectively. According to the reduction rules above, in the Schur basis this tensor product reduces to a direct sum of two-row irreps (N − k, k), with k = 0, . . ., min(n, n′). This proves our statement.
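The two-row structure can be checked against a dimension count. The sketch below is an illustration of ours, assuming the standard Clebsch-Gordan series (n,0) ⊗ (n′,0) = ⊕_k (n+n′−k, k), k = 0, …, min(n, n′); it verifies it for SU(2), where dim(λ_1, λ_2) = λ_1 − λ_2 + 1.

```python
# (n,0) (x) (n',0) = direct sum over k of (n+n'-k, k): only two-row diagrams.
def sym_times_sym(n, nprime):
    N = n + nprime
    return [(N - k, k) for k in range(min(n, nprime) + 1)]

def su2_dim(lam):
    return lam[0] - lam[1] + 1  # dimension of the SU(2) irrep (lambda1, lambda2)

# Dimension check for SU(2): dim of the product must equal the sum of the parts.
for n in range(6):
    for np_ in range(6):
        lhs = (n + 1) * (np_ + 1)                         # dim (n,0) * dim (n',0)
        rhs = sum(su2_dim(lam) for lam in sym_times_sym(n, np_))
        assert lhs == rhs
```

The same decomposition is what restricts the POVM elements in Eq. (13) to two-row Young diagrams.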
However, it is worth mentioning that this difficulty might actually be overcome. The fundamental objects needed for testing Eq. (D5) are the operators Ω_n^{{λ}}. Their computation would, in principle, not require the full Schur transform, as they can be expressed in terms of generalized Racah coefficients, which give a direct relation between Schur bases arising from different coupling schemes of the tensor product space. It is indeed possible to calculate generalized Racah coefficients directly, without going through a Clebsch-Gordan transform [60], and should this method be implemented, clustering problems of larger sizes could be tested. However, an extensive numerical analysis was not the aim of this paper.

Appendix E: Prior distributions
In the interest of making the paper self-contained, in this appendix we include the derivation of some results about the prior distributions used in the paper.
Let S_d = {p_s ≥ 0 | Σ_{s=1}^{d} p_s = 1} denote the standard (d − 1)-dimensional probability simplex. As shown below, the moments of the distribution of categorical distributions (CDs) induced by the uniform distribution of pure states agree with Eq. (E2), i.e., with the moments of a flat distribution of CDs on S_d. Since the moments uniquely determine distributions with compact support [61] (and S_d is compact), we conclude that the two distributions are identical.
As a byproduct, we can compute the marginal distribution μ(c²), where c is the overlap of |φ⟩ with a fixed state |ψ⟩. Since we can always find a basis whose first element is |ψ⟩, we have c = |⟨1|φ⟩|. Because of the results above, the marginal distribution follows directly, in agreement with Ref. [62].
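The moments of c² implied by Eq. (E2) with a single nonzero occupation, E[c^{2k}] = k!(d−1)!/(d−1+k)!, can be sanity-checked by Monte Carlo sampling of Haar-random states. This is a sketch of ours, not part of the paper; the tolerances are statistical.

```python
import math, random

random.seed(7)

def haar_state(d):
    # complex Gaussian vector, normalized -> Haar-random pure state in C^d
    z = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(d)]
    norm = math.sqrt(sum(abs(v)**2 for v in z))
    return [v / norm for v in z]

d, samples = 3, 200_000
u = [abs(haar_state(d)[0])**2 for _ in range(samples)]  # u = c^2 = |<1|phi>|^2

def moment_formula(k, d):
    # Eq. (E2) with n = (k, 0, ..., 0): E[c^{2k}] = k! (d-1)! / (d-1+k)!
    return math.factorial(k) * math.factorial(d - 1) / math.factorial(d - 1 + k)

for k in (1, 2):
    empirical = sum(x**k for x in u) / samples
    assert abs(empirical - moment_formula(k, d)) < 0.01
```

For d = 3 this gives E[c²] = 1/3 and E[c⁴] = 1/6, consistent with the flat distribution of CDs on S_d.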
Appendix F: Optimal clustering protocol for unknown classical states

In this appendix we provide details on the derivation of the optimal protocol for a classical clustering problem analogous to the quantum problem discussed in the main text. The results here also apply to quantum systems when the measurement performed on each of them is restricted to be local, projective, d-dimensional, and fixed. We call this type of protocol semiclassical.
Here, we envision a device that takes input strings of N data points s = (s_1 s_2 ··· s_N), with the promise that each s_i is a symbol out of an alphabet of d symbols, say the set {1, 2, . . ., d}, and has been drawn either from roulette P or from roulette Q, with corresponding categorical probability distributions P = {p_s}_{s=1}^d and Q = {q_s}_{s=1}^d. To simplify the notation, we use the same symbols for the roulettes and their corresponding probability distributions, and for the stochastic variables and their possible outcomes. Also, the range of values of the index s will always be understood to be {1, 2, . . ., d}, unless specified otherwise. The device's task is to group the data points into two clusters so that all points in either cluster have a common underlying probability distribution (either P or Q). We wish the machine to be universal, meaning that it shall operate without knowledge of the distributions P and Q. Accordingly, we choose as figure of merit the probability of correctly classifying all data points, averaged over every possible sequence of roulettes x = (x_1 x_2 ··· x_N), x_i ∈ {P, Q}, and over every possible distribution P and Q. The latter are assumed to be uniformly distributed over the common probability simplex S_d on which they are defined. Formally, this success probability is given by Eq. (F1), where x̂ is the guess of x emitted by the machine, which, by the universality requirement, can only depend on the data string s. The sums are carried out over all possible data strings s and over all 2^N sequences of roulettes x. The factor of 2 in the second equality takes into account that P and Q are unknown, hence identifying the complementary string x̄ leads to the same clustering. By emitting x̂, the device suggests a classification of the N data points s_i into two clusters. In the above equation we have used the notation of Appendix E for the integral over the probability simplex. An expression for the optimal success probability can be obtained from the trivial
upper bound, Eq. (F2), where Pr(s, x) is the joint marginal distribution of s and x; the bound is attained by the guessing rule x̂ = argmax_x Pr(s, x). For two specific distributions P and Q, the probability that a given roulette sequence x gives rise to a particular data string s is Pr(s|x; P, Q) = ∏_s p_s^{n_s} q_s^{m_s}, where n_s (m_s) is the number of occurrences of symbol s in s [i.e., how many s_i ∈ s satisfy s_i = s] arising from roulettes of type P (Q). For later convenience, we define M_s = n_s + m_s, which gives the total number of such occurrences. Note that {M_s} is independent of x, whereas {n_s} and {m_s} are not. Performing the integral over P and Q, we obtain Eq. (F4), where we have used Eq. (E2) and, in the first equality, we have assumed that the two types of roulette, P and Q, are equally probable, hence each possible sequence x occurs with equal prior probability 2^{−N}. We have also introduced the notation d′ ≡ d − 1 to shorten the expressions throughout this appendix. Note that all the dependence on x is through the occurrence numbers m_s and n_s. According to (F2), for each string s we need to maximize the joint probability Pr(s, x) in (F4) over all possible sequences of roulettes x. We first note that, given a total of M_s occurrences of a symbol s in s, Pr(s, x) is maximized by a sequence x in which all these occurrences come from the same type of roulette; in other words, by a sequence x such that either m_s = M_s and n_s = 0, or else m_s = 0 and n_s = M_s.
In order to prove the above claim, we single out a particular symbol r that occurs a total number of times μ = M_r in s. We focus on the dependence of Pr(s, x) on the occurrence number t = m_r (so n_r = μ − t) by writing it as a function f(t), where the coefficients a, b, and c are defined in Eq. (F8) and are independent of t. The function f(t) can be extended to t ∈ R using the Euler gamma function and the relation Γ(t + 1) = t!. This enables us to compute the second derivative of f(t) and show that it is a convex function of t in the interval [0, μ]. Indeed, the result, Eq. (F9), involves the generalized harmonic numbers H_n(t). For positive integer values of t, they are H_n(t) = Σ_{k=1}^{t} k^{−n}; the relation H_n(t) = ζ(n) − Σ_{k=1}^{∞} (k + t)^{−n}, where ζ(n) is the Riemann zeta function, allows us to extend the domain of H_n(t) to real (and complex) values of t.
The positivity of f″(t) follows from the positivity of both f(t) and the two differences of harmonic numbers in the second line of Eq. (F9). Note that H_2(x) is an increasing function of x. Since, obviously, b + t > t, and c − t > Σ_s n_s = Σ_s (M_s − m_s) ≥ μ − t [as follows from the definition of c in Eq. (F8)], we see that the two differences are positive.
The convexity of f(t) for t ∈ [0, μ] implies that its maximum is attained either at t = 0 or at t = μ. This holds for every value of M_r and every symbol r in the data string, so our claim holds. In summary, the optimal guessing rule must assign the same type of roulette to all the M_s occurrences of a symbol s, i.e., it must group all data points that show the same symbol into the same cluster. This is in full agreement with our intuition.
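The grouping claim can be verified by brute force for small instances. The sketch below is our own check; it evaluates Pr(s, x) by applying Eq. (E2) separately to the P and Q occurrence numbers and maximizes over all 2^N roulette sequences.

```python
import math
from itertools import product

def joint_prob(s, x, d):
    # Pr(s, x) after integrating P, Q over the flat prior, cf. Eq. (E2):
    # 2^{-N} * [(d-1)! prod n_s! / (d-1+N_P)!] * [(d-1)! prod m_s! / (d-1+N_Q)!]
    N = len(s)
    n = [0] * d; m = [0] * d
    for sym, roulette in zip(s, x):
        (n if roulette == 'P' else m)[sym] += 1
    dfac = math.factorial(d - 1)
    fp = dfac * math.prod(math.factorial(v) for v in n) / math.factorial(d - 1 + sum(n))
    fq = dfac * math.prod(math.factorial(v) for v in m) / math.factorial(d - 1 + sum(m))
    return 2**(-N) * fp * fq

d, s = 3, (0, 0, 1, 2, 1)  # symbols relabeled 0..d-1 for convenience
best = max(product('PQ', repeat=len(s)), key=lambda x: joint_prob(s, x, d))
# the optimizer assigns the same roulette to all occurrences of each symbol
for sym in set(s):
    labels = {best[i] for i in range(len(s)) if s[i] == sym}
    assert len(labels) == 1
```

Since the convexity argument applies symbol by symbol with the other occurrence numbers held fixed, the global maximizer must sit at an endpoint for every symbol, which is exactly what the brute-force search finds.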
The description of the optimal protocol that runs on our device is not yet complete. We need to specify how to reduce the current number of clusters down to two, since at this point we may (and typically will) have up to d clusters, as many as there are different symbols. The reduction, or merging, of the d clusters can only be based on their relative sizes, as nothing is known about the underlying probability distributions. This is quite clear: let P be the subset of symbols (i.e., the subset of {1, 2, . . ., d}) for which n_s = M_s, and let Q be its complement, i.e., Q contains the symbols for which m_s = M_s. The claim we just proved tells us that, in order to find the maximum of Pr(s, x), it is enough to consider sequences of roulettes x that comply with the above conditions on the occurrence numbers. For those, the joint probability Pr(s, x) can be written as in Eq. (F11), where a now simplifies to 2^{−N} d′!² ∏_s M_s!. Thus, it just remains to find the partition {P, Q} that maximizes this expression. It can also be written as a function of x ≡ Σ_{s∈Q} M_s alone. The maximum of this function is located at x = N/2, and one can easily check that it is monotonic on either side of its peak. Note that, depending on the values of the occurrence numbers {M_s}, the optimal value x = N/2 may not be attained.
In such cases, the maximum of Pr(s, x) is located at x* = N/2 ± ∆, where ∆ is the bias defined in Eq. (F12); the subset Q that minimizes this bias determines the optimal clustering. In summary (and not very surprisingly), the optimal guessing rule consists in first partitioning the data s into up to d groups according to the symbol of the data points and, second, merging those groups (without splitting them) into two clusters in such a way that their sizes are as similar as possible. We have stumbled upon the so-called partition problem [33], which is known to be weakly NP-complete. In particular, a large set of distinct occurrence counts {M_s} rapidly hinders the efficiency of known algorithms, a situation likely to occur for large d. It follows that the optimal clustering protocol for the classical problem cannot be implemented efficiently in all instances of the problem.
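The merging step is exactly an instance of the partition problem, solvable by brute force for small d. A minimal sketch of ours, with hypothetical function names:

```python
from itertools import combinations

def best_merge(counts):
    """Merge symbol groups with sizes `counts` into two clusters of sizes
    as balanced as possible (brute force over subsets: the partition
    problem, weakly NP-complete in general)."""
    N = sum(counts)
    idx = range(len(counts))
    best_subset, best_bias = (), N
    for r in range(len(counts) + 1):
        for subset in combinations(idx, r):
            x = sum(counts[i] for i in subset)
            bias = abs(2 * x - N)  # = 2*Delta; integer arithmetic even for odd N
            if bias < best_bias:
                best_subset, best_bias = subset, bias
    return best_subset, best_bias / 2

# occurrence counts M = [5, 3, 2, 2]: no subset sums to N/2 = 6,
# so the best achievable bias is Delta = 1 (cluster sizes 5 and 7)
subset, delta = best_merge([5, 3, 2, 2])
assert delta == 1
```

The exhaustive search costs O(2^d) evaluations, which is the inefficiency referred to above; faster known algorithms are pseudo-polynomial and degrade as the counts {M_s} become large and distinct.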
To obtain the maximum success probability P_s^cl, Eq. (F2), we need to sum the maximum joint probability, given by (F11) with x = x*, over all possible strings s. Strings with the same set of occurrence counts {M_s} give the same contribution. Moreover, all the dependence on {M_s} is through the bias ∆. Therefore, if we define ξ_∆ to be the number of sets {M_s} that give rise to a bias ∆, then the corresponding number of data strings is ξ_∆ N!/∏_s M_s!. We thus arrive at Eq. (F13). This is as far as we can go, as no explicit formula for the combinatorial factor ξ_∆ is likely to exist in general. However, it is possible to work out the asymptotic expression of the maximum success probability for large data sizes N. We first note that a generic term in the sum (F13) can be written as the factor 2^{2d′+1} ξ_∆ d′!² N!/(2d′ + N)! times a binomial distribution that peaks at ∆ = 0 for large N. Hence, the dominant contribution in this limit is that of ∆ = 0. From the definition of ξ_∆, given above Eq. (F13), and that of ∆ in Eq. (F12), we readily see that ξ_0 is the number of ordered partitions (i.e., the order matters) of N into d addends or parts (the occurrence counts M_s) such that a subset of these addends is an ordered partition of N/2 as well.
Young diagrams come in handy to compute ξ_0. First, we draw pairs of diagrams, [λ, λ′], each of N/2 boxes and such that λ ≥ λ′ (in lexicographical order; see Appendix A), and l(λ) + l(λ′) ≡ r + r′ ≤ d, i.e., the total number of rows should not exceed d. Next, we fill the boxes with symbols s_i (representing possible data points) so that all the boxes in each row have the same symbol. We readily see that the number of different fillings gives us ξ_0. An example is provided in Fig. 4 for clarity.

Appendix G: Optimal clustering protocol for known classical states

When the distributions P and Q are known, the success probability takes the form of Eq. (G1), where the term in the second line arises because assigning the wrong probability distribution to all data points in s still gives a correct clustering. In order to compare with our results for unknown classical states, we average the success probability over a uniform distribution of categorical probability distributions. This yields Eq. (G2), where the integration over the simplex S_d, shared by P and Q, is defined in Appendix E.
To perform the integral in Eq. (G2), we need to partition S_d × S_d into different regions according to whether p_s ≤ q_s or p_s > q_s for the various symbols. By symmetry, the integral can only depend on the number r of symbols for which p_s ≤ q_s (not on their particular values). Hence, r = 1, . . ., d − 1 labels the different types of integrals that we need to compute to evaluate P_s^cl. Notice that we have the additional symmetry r ↔ d − r, corresponding to exchanging p_s and q_s for all s. Since the value of these integrals does not depend on the specific value of s, we can choose all p_s with s = 1, 2, . . ., r to satisfy p_s > q_s and all p_s with s = r + 1, r + 2, . . ., d to satisfy p_s ≤ q_s. To shorten the expressions below, we define the integrals I_r^d, where the binomial coefficient that appears is the number of equivalent integration regions for the given r.

TABLE II. The success probability P_s^cl for d = 3 and data string lengths N = 2, . . ., 6 in the cases of known and unknown distributions P and Q. For unknown distributions, the values are computed using Eq. (F13) in Appendix F. For known distributions, the values are given by Eq. (G7). The table shows that knowing P and Q increases the success probability of clustering.

Low data dimension
We can now discuss the lowest-dimensional cases, for which explicit closed formulas for I_r^d can be derived. For d = 2 one obtains Eq. (G6). This result coincides with that for unknown probability distributions given in Eq. (F13) with ξ_∆ = 1. This is an expected result, as the optimal protocol for known and unknown probability distributions is exactly the same: assign to the same cluster all data points that show the same symbol s. Therefore, knowing the probability distributions does not provide any advantage for d = 2. For d > 2, however, knowledge of the distributions P and Q helps in classifying the data points. If d = 3, the success probability (G5) can be computed to be Eq. (G7). In Table II we compare the values of P_s^cl in Eq. (G7) for N = 2, 3, . . ., 6 with those for unknown distributions P and Q given by Eq. (F13). As expected, the success probability is larger if P and Q are known. The source of the increase is illustrated by the string s = (112), which would be labeled as PPQ (or QQP) if P and Q were unknown. However, if they are known and, e.g., p_1 > q_1 and p_2 > q_2, the string will more appropriately be labeled as PPP.
Arbitrary data dimension. Large N limit

For increasing $N$, however, the advantage of knowing $P$ and $Q$ becomes less significant and vanishes asymptotically. This can be checked explicitly for $d = 2, 3$ by expanding Eqs. (G6) and (G7) in inverse powers of $N$. In this regime the average is dominated by distributions for which $p_r \approx 1$ and $q_r \approx 0$. Since in a typical string approximately half of the data will come from the distribution $P$ and the other half from $Q$, the optimal clustering protocol will essentially coincide with that for unknown distributions, i.e., it will collect the data points showing the same symbol into the same cluster.

FIG. 1. Pictorial representation of the clustering device for an input of eight quantum states.States of the same type have the same color.States are clustered according to their type by performing a suitable collective measurement, which also provides a classical description of the clustering.

Let $\mathcal{S}_d = \{P : p_s \ge 0, \sum_{s=1}^{d} p_s = 1\}$ denote the standard $(d-1)$-dimensional (probability) simplex. Every categorical distribution (CD) $P = \{p_s\}_{s=1}^{d}$ is a point in $\mathcal{S}_d$. The flat distribution of CDs is the volume element divided by the volume of $\mathcal{S}_d$, the latter denoted by $V_d$. Choosing coordinates $p_1, \ldots, p_{d-1}$, the flat distribution is $\prod_{s=1}^{d-1} dp_s / V_d \equiv dP$. Let us compute the moments of the flat distribution; as a byproduct, we will obtain $V_d$. We have Eq. (E1) [the integral becomes straightforward by iterating the change of variables $p_r \to x$, where $p_r = (1 - \sum_{s=1}^{r-1} p_s)\,x$, for $r = d-2, d-3, \ldots, 2, 1$]. In particular, setting $n_s = 0$ for all $s$ in Eq. (E1), we obtain $V_d = 1/(d-1)!$. Then

$$\int_{\mathcal{S}_d} dP \prod_{s=1}^{d} p_s^{n_s} = \frac{(d-1)!\,\prod_{s=1}^{d} n_s!}{(d-1+N)!}, \quad (E2)$$

where $N = \sum_{s=1}^{d} n_s$.

Next, we provide a simple proof that any fixed von Neumann measurement on a uniform distribution of pure states in $\mathbb{C}^d$ gives rise to CDs whose probability distribution is flat. As a result, the classical and semiclassical strategies discussed in the main text have the same success probability. Take $|\phi\rangle \in \mathbb{C}^d$ and let $\{|s\rangle\}_{s=1}^{d}$ be an orthonormal basis of $\mathbb{C}^d$. By performing the corresponding von Neumann measurement, the probability of outcome $s$ is $p_s = |\langle s|\phi\rangle|^2$. Thus, any distribution of pure states induces a distribution of CDs $\{p_s = |\langle s|\phi\rangle|^2\}_{s=1}^{d}$ on $\mathcal{S}_d$. Let us compute the moments of the induced distribution, namely Eq. (E3), where $D^{\mathrm{sym}}_N$ is the dimension of (the projector onto) the symmetric subspace of $(\mathbb{C}^d)^{\otimes N}$ and we have used Schur's lemma. A basis of the symmetric subspace is $\{|v_{\mathbf{n}}\rangle\}$, labeled by $\mathbf{n} = (n_1, n_2, \ldots, n_d)$. Note that there are $\binom{N+d-1}{d-1}$ different strings $\mathbf{n}$ (weak compositions of $N$ into $d$ parts), which agrees with $D^{\mathrm{sym}}_N = s_{(N,0)}$ [recall Eq. (B6)], as it should. Since $\mathbb{1}^{\mathrm{sym}}_N = \sum_{\mathbf{n}} |v_{\mathbf{n}}\rangle\langle v_{\mathbf{n}}|$, we can easily compute the trace in Eq. (E3) to obtain the same result as in Eq. (E2), proving that the induced distribution of CDs is flat.

Here $\Pr(s, x)$ is the joint marginal distribution of $s$ and $x$. This bound is attained by the guessing rule $x = \operatorname{argmax}_x \Pr(s, x)$.
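The flatness claim is also easy to check numerically. The sketch below is our own illustration (the helper names `haar_probs` and `flat_moment` are assumptions): it samples Haar-random pure states by normalizing complex Gaussian vectors, computes the outcome probabilities $p_s = |\langle s|\phi\rangle|^2$ of a fixed von Neumann measurement, and verifies that their moments match the flat-prior formula of Eq. (E2).

```python
import numpy as np
from math import factorial, prod

rng = np.random.default_rng(1)

def haar_probs(d, n_samples, rng=rng):
    """Outcome probabilities p_s = |<s|phi>|^2 for Haar-random pure states,
    obtained by normalizing complex Gaussian vectors."""
    z = rng.normal(size=(n_samples, d)) + 1j * rng.normal(size=(n_samples, d))
    p = np.abs(z) ** 2
    return p / p.sum(axis=1, keepdims=True)

def flat_moment(n):
    """Right-hand side of Eq. (E2): (d-1)! prod_s n_s! / (d-1+N)!."""
    d, N = len(n), sum(n)
    return factorial(d - 1) * prod(factorial(k) for k in n) / factorial(d - 1 + N)

d, n = 3, (2, 1, 0)
p = haar_probs(d, 500_000)
empirical = np.mean(np.prod(p ** np.array(n), axis=1))
# For d = 3 and n = (2, 1, 0), Eq. (E2) gives 2! * 2 / 5! = 1/30.
assert abs(empirical - flat_moment(n)) < 1e-3
```

The agreement holds for any choice of exponents $n_s$, consistent with the induced distribution being flat (i.e., Dirichlet$(1,\ldots,1)$) on $\mathcal{S}_d$.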

FIG. 4. Use of Young diagrams for computing $\xi_0$. In the example, $N = 8$ and $d = 4$. The fraction before each pair gives the number of different fillings and hints at how it has been computed.

With these definitions, $p_d = q_d = 1$, $\sum_{s=r+1}^{d} q_s = 1 - q_r$, and likewise $\sum_{s=r+1}^{d} p_s = 1 - p_r$. The integrals that we need to compute are then

$$I^d_r = \cdots\left[(1+p_r-q_r)^N + (1+q_r-p_r)^N\right], \quad (G4)$$

and we note that, as anticipated, $I^d_r = I^d_{d-r}$. The average probability of successful clustering then reads