Counting the learnable functions of structured data

Cover's function counting theorem is a milestone in the theory of artificial neural networks. It provides an answer to the fundamental question of determining how many binary assignments (dichotomies) of $p$ points in $n$ dimensions can be linearly realized. Regrettably, it has proved hard to extend the same approach to more advanced problems than the classification of points. In particular, an emerging necessity is to find methods to deal with structured data, and specifically with non-pointlike patterns. A prominent case is that of invariant recognition, whereby identification of a stimulus is insensitive to irrelevant transformations on the inputs (such as rotations or changes in perspective in an image). An object is therefore represented by an extended perceptual manifold, consisting of inputs that are classified similarly. Here, we develop a function counting theory for structured data of this kind, by extending Cover's combinatorial technique, and we derive analytical expressions for the average number of dichotomies of generically correlated sets of patterns. As an application, we obtain a closed formula for the capacity of a binary classifier trained to distinguish general polytopes of any dimension. These results may help extend our theoretical understanding of generalization, feature extraction, and invariant object recognition by neural networks.


I. INTRODUCTION
Machine learning and deep learning demonstrate astonishing results in applications [1-3], sometimes beyond our theoretical reach. This poses a formidable challenge for theorists who wish to develop a framework for understanding them [4,5]. A landmark achievement in learning theory is Cover's function counting theorem, which counts the number of binary classification functions, or "dichotomies", that can be realized by given architectures [6]. This foundational result made it possible to quantify the complexity of a learning model and the advantage gained by using non-linear kernels, provided a benchmark for the performance of both artificial and natural neural networks, and is a handy tool for several applications [7-12].
Other commonly used methods in this area come from statistical physics (pioneered by E. Gardner [13,14]). Compared with these, Cover's method has the advantage of offering a simple geometric insight and of being valid for a finite number of dimensions, while statistical physics methods typically apply in the "thermodynamic limit" of infinite dimensions. Yet, despite its benefits and relative simplicity, Cover's analytical technique has so far eluded efforts to extend it [8].
Uncorrelated random patterns are commonly taken as a simplifying assumption for the theoretical investigation of artificial neural networks. Yet, it is becoming apparent that providing a theoretical framework that includes structure in the input data is essential. This need is emerging in different contexts: (a) The invariant representation of perceptual stimuli by brains (e.g., the coherent perception of differently rotated and rescaled objects in vision, or the recognition of the same sound in different acoustic environments in audition) prompted the formalization of perceptual manifolds as extended patterns [12,15-22]. Perceptual manifolds are the regions in input space corresponding to all variations of a stimulus that do not modify the object's identification. (b) The discovery of spatial maps in rodent brains [23] motivated extensions of associative memory models to attractors that are not point-like but occupy a region in configuration space [24]. (c) The problem of local generalization and robustness to noise, a main theme of machine learning, can be cast as a problem of non-pointlike patterns [25-27]. (d) The description of the input patterns as modular combinations of elementary features (a well-studied aspect of empirical datasets [28,29]) was shown to induce a multi-layer structure in certain network architectures [30].
* Corresponding author: marco.gherardi@mi.infn.it
Here, we develop a theory that extends Cover's approach to non-pointlike patterns, by counting only those dichotomies that assign the same label to different variants of the same input. Our theory (i) enables the exact computation of the (average) number of dichotomies of structured data, (ii) gives direct access to quantities at finite size, and (iii) naturally disentangles combinatorial and geometric aspects, thus lending itself to further generalizations.

II. NUMBER OF ADMISSIBLE DICHOTOMIES
The central quantity obtained by Cover's function counting method is the number $C_{n,p}$ of linearly realizable dichotomies of $p$ points $\xi_1, \ldots, \xi_p$ in $n$ dimensions. A dichotomy of this set is a function $\phi$ mapping each point $\xi_i$ to its $\{0, 1\}$ binary label (see Fig. 1). A linearly realizable dichotomy is identified by a vector $w \in \mathbb{R}^n$:
$$\phi(\xi_i) = \theta(w \cdot \xi_i), \tag{1}$$
where $\theta$ is the Heaviside theta function. The hyperplane perpendicular to the vector $w$ separates the space into two half-spaces, where the points mapped to 0 and 1 lie respectively. There are $2^p$ dichotomies, but only $C_{n,p}$ of them are linearly realizable. We focus on linearly realizable dichotomies, and will therefore omit this specification when it is clear from the context. It turns out that $C_{n,p}$ does not depend on the $\xi_i$'s, as long as they are in general position (meaning that no subset of $n$ points is linearly dependent) [6]. Structure in the data may thus appear not to affect $C_{n,p}$ at all. However, in general we do not wish to admit all possible dichotomies. For instance, among the hand-written digits in MNIST we could choose to admit dichotomies separating "1" and "I", but not two similar-looking "0"s. Our definition of structure is based on such a restriction: a data set is qualified as structured whenever only a subset of all possible dichotomies is considered admissible. $C_{n,p}$ will then be the number of admissible dichotomies that can be realized linearly.

FIG. 1. (Top) A dichotomy is identified by a hyperplane (in grey), separating differently labeled data points (e.g., mammals from birds, mapped respectively to 1 and 0). Data are structured in multiplets of $k$ input variants (here $k = 3$). Each input variant is a point in $\mathbb{R}^n$, and each multiplet is characterized by the $k(k-1)/2$ overlaps between its points (here $\rho_{12}$, $\rho_{23}$, and $\rho_{13}$). A dichotomy is admissible only if it is constant on each multiplet, i.e., if the separating hyperplane does not intersect any polytope (triangles here). (Bottom) Given a structured data set (here $p = 4$ multiplets of $k = 3$ points in $n = 2$ dimensions), we count the number $C^{(k)}_{n,p}$ of admissible dichotomies.

Here we focus on a rather general definition of admissibility, inspired by the literature cited above. We consider datasets of $kp$ points, structured as $p$ multiplets of $k$ points each. A dichotomy $\phi$ is admissible if different points $\xi$ in the same multiplet are classified coherently, i.e., if $\phi(\xi)$ is constant on each multiplet. We will restrict the points $\xi$ to lie on the unit sphere $S^{n-1}$, meaning that $\|\xi\|^2 = 1$, but this technical requirement can be easily relaxed. (A useful consequence of this normalization is that fixing the overlap between two points determines their distance.) The ensemble we consider fixes all the overlaps between the points in a multiplet, equally for all multiplets, but the relative positions and orientations of the multiplets are unspecified. The quantities we will compute are averages over all possible positions and orientations of the multiplets.
Because of the convexity of linear separability, separating the multiplets is equivalent to separating the polytopes whose vertices are the points in the multiplets. (These polytopes play the role of the perceptual manifolds of Ref. [12].) For instance, k = 2 corresponds to segments, k = 3 to triangles, k = 4 to tetrahedra.
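The equivalence rests on linearity: if $w \cdot \xi$ has the same sign on every vertex of a polytope, it keeps that sign on every convex combination of the vertices. The following minimal Python sketch checks this numerically on sampled interior points. The helper names (`dot`, `label`, `convex_combination`, `polytope_is_unsplit`) are illustrative and do not come from the paper.

```python
import random

def dot(w, x):
    """Euclidean scalar product of two vectors given as lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def label(w, x):
    """Heaviside label of x under the hyperplane normal w: 1 if w.x > 0, else 0."""
    return 1 if dot(w, x) > 0 else 0

def convex_combination(vertices, rng):
    """Random point of the polytope spanned by `vertices` (random convex weights)."""
    coeffs = [rng.random() for _ in vertices]
    total = sum(coeffs)
    dim = len(vertices[0])
    return [sum(c / total * v[d] for c, v in zip(coeffs, vertices))
            for d in range(dim)]

def polytope_is_unsplit(w, vertices, rng, samples=200):
    """Return None if w already separates the vertices; otherwise check that
    sampled points of the polytope all carry the vertices' common label."""
    labels = {label(w, v) for v in vertices}
    if len(labels) > 1:
        return None  # the dichotomy is not admissible on this multiplet
    (common,) = labels
    return all(label(w, convex_combination(vertices, rng)) == common
               for _ in range(samples))
```

For a triangle whose vertices all receive the same label, `polytope_is_unsplit` returns `True`; when the hyperplane cuts between vertices it returns `None`, signalling a non-admissible dichotomy.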

III. SINGLE POINTS (k = 1)
Let us first outline Cover's original computation. Imagine starting with $p$ points $\xi_1, \ldots, \xi_p$ and adding the $(p+1)$th point $\xi_{p+1}$. For each dichotomy $\phi$ of the $p$ points $\xi_1, \ldots, \xi_p$, one of two possibilities holds: either (i) $\phi$ can be realized by a hyperplane passing through $\xi_{p+1}$ (equivalently, $\phi$ can be realized by a vector $w$ such that $\xi_{p+1} \cdot w = 0$), or (ii) it cannot. If (i) is true, then $w$ can be rotated infinitesimally to yield both $\xi_{p+1} \cdot w > 0$ and $\xi_{p+1} \cdot w < 0$; otherwise, the half-space where $\xi_{p+1}$ lies is fixed. Therefore, for each dichotomy $\phi$ of $\xi_1, \ldots, \xi_p$ satisfying (i) there are 2 different dichotomies $\phi_1$ and $\phi_2$ of $\xi_1, \ldots, \xi_p, \xi_{p+1}$ agreeing with $\phi$ on the common points [i.e., such that $\phi_{1,2}(\xi_i) = \phi(\xi_i)$ for $i = 1, \ldots, p$]. If the number of dichotomies satisfying (i) is $M$, then the number of those satisfying (ii) is $C_{n,p} - M$, and one can write $C_{n,p+1} = 2M + C_{n,p} - M$. The condition (i) is in the form of a single linear constraint, therefore $M$ is the number of dichotomies of $p$ points in $n - 1$ dimensions, $M = C_{n-1,p}$. Thus $C_{n,p}$ satisfies the recursion
$$C_{n,p+1} = C_{n,p} + C_{n-1,p}, \tag{2}$$
with boundary conditions $C_{n>0,1} = 2$ (a single point can be classified either way) and $C_{0,p} = 0$. The solution to Eq. (2) can be obtained by observing that the contribution of the boundary value $C_{n-i,1}$ to $C_{n,p}$ is given by the number of directed paths $\{\gamma_j\}_{j=1,\ldots,p}$, with $\gamma_j \in \mathbb{N}$, that start from $\gamma_1 = n - i$ and end in $\gamma_p = n$, where at each step $\gamma_{j+1}$ can be either $\gamma_j$ or $\gamma_j + 1$. The number of such paths is simply the binomial coefficient $\binom{p-1}{i}$. Summing over the boundary gives
$$C_{n,p} = 2 \sum_{i=0}^{n-1} \binom{p-1}{i}, \tag{3}$$
where it is assumed that $\binom{p-1}{i} = 0$ whenever $i > p - 1$. Let us consider the fraction $c_{n,p}$ of linearly realizable dichotomies, $c_{n,p} = C_{n,p}/2^p$. For finite $n$ and $p$, the capacity $\alpha_c$ can be defined as the ratio $p/n$ at which half of all dichotomies can be realized: $c_{n,n\alpha_c} = 1/2$. From the explicit expression (3) one sees that $c_{n,p} = 1$ if $p \le n$, $c_{n,p} \to 0$ for $p \to \infty$, and $c_{n,2n} = 1/2$, which pinpoints the well-known capacity $\alpha_c = 2$.
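The counting argument can be checked numerically. The sketch below evaluates Cover's closed formula, $C_{n,p} = 2\sum_{i=0}^{n-1}\binom{p-1}{i}$, and compares it with a direct implementation of the recursion $C_{n,p+1} = C_{n,p} + C_{n-1,p}$ with the boundary conditions stated above. The function names are illustrative choices, not part of the original work.

```python
from math import comb

def cover_count(n: int, p: int) -> int:
    """Closed form: number of linearly realizable dichotomies of p points
    in general position in n dimensions (hyperplanes through the origin)."""
    if n <= 0 or p <= 0:
        return 0
    # math.comb(p - 1, i) is 0 when i > p - 1, as the formula requires
    return 2 * sum(comb(p - 1, i) for i in range(n))

def cover_count_recursive(n: int, p: int) -> int:
    """Same quantity from the recursion C(n, p+1) = C(n, p) + C(n-1, p)."""
    if n <= 0:
        return 0          # boundary: C(0, p) = 0
    if p == 1:
        return 2          # boundary: a single point can be labeled either way
    return cover_count_recursive(n, p - 1) + cover_count_recursive(n - 1, p - 1)

def realizable_fraction(n: int, p: int) -> float:
    """Fraction c(n, p) = C(n, p) / 2**p of dichotomies that are realizable."""
    return cover_count(n, p) / 2 ** p
```

For example, `realizable_fraction(n, 2 * n)` equals 0.5 for any `n >= 1`, which is the statement $\alpha_c = 2$ at finite size.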

IV. SEGMENTS (DOUBLETS, k = 2)
The first step towards the general problem is the case where data are structured as pairs of points. Alongside the set of points $\xi = \{\xi_1, \ldots, \xi_p\}$, let us consider another set $\bar\xi = \{\bar\xi_1, \ldots, \bar\xi_p\}$. The multiplets discussed above are the doublets $\{\xi_i, \bar\xi_i\}$. Each doublet is such that the overlap between the two partners is fixed:
$$\xi_i \cdot \bar\xi_i = \rho \tag{4}$$
for all $i$. The admissible dichotomies $\phi$ are those for which $\phi(\xi_i) = \phi(\bar\xi_i)$ for all $i$; their total number is $2^p$. The recursion step now corresponds to the addition of the $(p+1)$th doublet $\{\xi_{p+1}, \bar\xi_{p+1}\}$. Repeating Cover's reasoning for the point $\bar\xi_{p+1}$ alone gives a number of dichotomies equal to $Q_{n,p} = C_{n,p} + C_{n-1,p}$. This is the number of dichotomies of the set $\{\xi_1, \bar\xi_1, \xi_2, \bar\xi_2, \ldots, \xi_p, \bar\xi_p, \bar\xi_{p+1}\}$ that are admissible on the first $p$ doublets [meaning that $\phi(\xi_i) = \phi(\bar\xi_i)$ for all $i = 1, \ldots, p$]. A number $R_{n,p}$ of such dichotomies are realizable by a hyperplane passing through the point $\xi_{p+1}$. These are all admissible, thanks to the freedom in the choice of $\phi(\xi_{p+1})$ by an infinitesimal adjustment of the hyperplane. Among the other $Q_{n,p} - R_{n,p}$ dichotomies, on average, a fraction $\Psi_2$ will happen to assign the same label to $\xi_{p+1}$ and $\bar\xi_{p+1}$. $\Psi_2$ can be computed as the fraction of hyperplanes keeping $\xi_{p+1}$ and $\bar\xi_{p+1}$ in the same half-space; the calculation is carried out in the Appendix. Importantly, $\Psi_2$ is a function of the overlap $\rho$ alone:
$$\Psi_2(\rho) = \frac{2}{\pi} \arctan\sqrt{\frac{1+\rho}{1-\rho}} = 1 - \frac{\arccos\rho}{\pi}. \tag{5}$$
Note that $\Psi_2(\rho) = 1 - \Psi_2(-\rho)$, as expected from its definition. The foregoing argument leads to the following estimate of the total number of admissible dichotomies:
$$C_{n,p+1} = R_{n,p} + \Psi_2 \left( Q_{n,p} - R_{n,p} \right). \tag{6}$$
In order to compute $R_{n,p}$ it suffices to repeat Cover's reasoning with respect to the point $\bar\xi_{p+1}$, this time in $n - 1$ dimensions because of the constraint imposed by the hyperplane passing through $\xi_{p+1}$, thereby obtaining
$$R_{n,p} = C_{n-1,p} + C_{n-2,p}. \tag{7}$$
Finally, the recursion for $C_{n,p}$ reads
$$C_{n,p+1} = \Psi_2\, C_{n,p} + C_{n-1,p} + (1 - \Psi_2)\, C_{n-2,p}. \tag{8}$$
The boundary conditions are now slightly different from those for the case $k = 1$ in Eq. (2).
In fact, in $n = 1$ dimension the number of admissible dichotomies of a single pair of points ($p = 1$) is 2 only when both points lie on the same half-line, otherwise it is 0; on average, it is $2\Psi_2(\rho)$. The boundary conditions are then
$$C_{n \le 0, p} = 0, \qquad C_{1,1} = 2\Psi_2(\rho), \qquad C_{n \ge 2, 1} = 2. \tag{9}$$
To find the solution of the recursion (8), similarly to the single point case, consider all the directed paths $\{\gamma_j\}_{j=1,\ldots,p}$ propagating from the boundary to $C_{n,p}$, where $\gamma_{j+1}$ at each step can be $\gamma_j$, $\gamma_j + 1$, or $\gamma_j + 2$. Contrary to the one point case, different paths with the same endpoints can now give different contributions to $C_{n,p}$, since the three types of steps correspond to three different factors ($\Psi_2$, 1, and $1 - \Psi_2$ respectively). The total contribution $K_{i,p}$ of the paths from $\gamma_1 = n - i$ to $\gamma_p = n$ is
$$K_{i,p} = \sum_{j \ge 0} \binom{p-1}{p-1-i+j,\; i-2j,\; j} \Psi_2^{\,p-1-i+j} (1 - \Psi_2)^{j}, \tag{10}$$
where $j$ counts the steps of length 2 and the multinomial coefficient is defined as
$$\binom{a}{b,\; c,\; d} = \frac{a!}{b!\, c!\, d!}, \qquad a = b + c + d, \tag{11}$$
(with the obvious analytical extension for negative factorials). Summation over the non-zero boundary $i = 0, \ldots, n - 1$ yields the number of admissible dichotomies
$$C_{n,p} = \sum_{i=0}^{n-1} C_{n-i,1}\, K_{i,p} = 2 \sum_{i=0}^{n-2} K_{i,p} + 2 \Psi_2\, K_{n-1,p}. \tag{12}$$
It is easy to see (by the multinomial theorem) that $C_{n,p} = 2^p$ if $p \le n/2$; this locates the usual Vapnik-Chervonenkis dimension [31], $d_{VC} = n$, as the total number of points is $2p$. An estimate for the capacity, valid for large $n$, can be obtained by approximating Eq. (12) as
$$C_{n,p} \approx 2 \sum_{i=0}^{n-1} K_{i,p}. \tag{13}$$
The capacity $\alpha_c$ is such that
$$\sum_{i=0}^{n-1} K_{i,\alpha_c n} = \frac{1}{2} \sum_{i \ge 0} K_{i,\alpha_c n}, \tag{14}$$
i.e., it corresponds to the value of $n$ for which the sum of $K_{i,p}$ takes half its maximum value. The quantity $K_{i,p}$ can be interpreted as the partition function of an ensemble of directed random walks $\{\gamma_j\}_{j=1,\ldots,p}$ of $p - 1$ steps, with the same boundary conditions as for $k = 1$, and the following transition probabilities: $P(\gamma_j \to \gamma_j) = \Psi_2/2$, $P(\gamma_j \to \gamma_j + 1) = 1/2$, $P(\gamma_j \to \gamma_j + 2) = (1 - \Psi_2)/2$. The normalization factor 2 in the denominator is the sum of the weights $\Psi_2$, 1, and $1 - \Psi_2$. The capacity therefore corresponds to the median of the distribution function of the walk's endpoint $i$. We approximate the median with the mean
$$\bar{i} = \frac{\sum_i i\, K_{i,p}}{\sum_i K_{i,p}}, \tag{15}$$
which evaluates to $\bar{i} = (3/2 - \Psi_2)(p - 1)$, and finally we obtain
$$\alpha_c = \frac{2}{3 - 2\Psi_2(\rho)}. \tag{16}$$
This result, with $\Psi_2$ given by Eq. (5), was found in [32] by means of replica calculations, and appeared more recently in other contexts in [21,27]. Our derivation is somewhat more elementary, and naturally highlights the role of the geometric quantity $\Psi_2(\rho)$. Figure 2 compares the analytical formulas (12) and (16) with numerical results obtained by training a linear classifier with random doublets at varying dimension $n$, number of points $p$, and overlap $\rho$. Equation (12) matches perfectly, as expected. Equation (16) is surprisingly precise even at very small sizes; deviations are less than 1% already for $n = 5$.
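The doublet recursion lends itself to a direct numerical evaluation. The sketch below is a minimal illustration under stated assumptions: it takes $\Psi_2(\rho) = 1 - \arccos(\rho)/\pi$ (the probability that a random hyperplane through the origin does not separate two unit vectors with overlap $\rho$), iterates $C_{n,p+1} = \Psi_2 C_{n,p} + C_{n-1,p} + (1-\Psi_2) C_{n-2,p}$, and assumes the boundary values $C_{n\le 0,p} = 0$, $C_{1,1} = 2\Psi_2$, $C_{n\ge 2,1} = 2$ described in the text. The capacity estimate is obtained by equating the mean endpoint $(3/2 - \Psi_2)\,p$ to $n$. All function names are hypothetical.

```python
from math import acos, pi

def psi2(rho: float) -> float:
    """Probability that a random hyperplane through the origin keeps two
    unit vectors with overlap rho in the same half-space."""
    return 1.0 - acos(rho) / pi

def doublet_count(n: int, p: int, psi: float) -> float:
    """Average number of admissible dichotomies of p doublets in n dimensions,
    from C(n, p+1) = psi*C(n, p) + C(n-1, p) + (1 - psi)*C(n-2, p)."""
    def boundary(m: int) -> float:
        # assumed boundary at p = 1: 0 for m <= 0, 2*psi for m = 1, 2 otherwise
        if m <= 0:
            return 0.0
        return 2.0 * psi if m == 1 else 2.0

    # row[m + 1] stores C(m, current p) for m = -1, 0, ..., n
    row = [boundary(m) for m in range(-1, n + 1)]
    for _ in range(p - 1):
        new = [0.0, 0.0]  # C(-1, p) = C(0, p) = 0
        for m in range(1, n + 1):
            new.append(psi * row[m + 1] + row[m] + (1.0 - psi) * row[m - 1])
        row = new
    return row[n + 1]

def capacity_estimate(psi: float) -> float:
    """Mean-endpoint estimate of the capacity: alpha_c = 2 / (3 - 2*psi)."""
    return 2.0 / (3.0 - 2.0 * psi)
```

With `psi = 1.0` the two partners coincide and `doublet_count` reduces to Cover's count for single points; with `rho = 0` the step weights are symmetric and `doublet_count(n, n, psi2(0.0)) / 2**n` is 0.5, in line with `capacity_estimate(0.5) == 1.0`.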

V. POLYTOPES (MULTIPLETS, GENERIC k)
Let us now move to the general case where the data are structured in multiplets of $k$ points. We consider dichotomies of $k$ sets of points $\xi^\mu = \{\xi^\mu_1, \ldots, \xi^\mu_p\}$, with $\mu = 1, \ldots, k$. The $i$th multiplet is the set $\xi_i = \{\xi^1_i, \ldots, \xi^k_i\}$. A dichotomy $\phi$ is admissible if the images of all $k$ partner points in each multiplet are equal: $\phi(\xi^\mu_i) = \phi(\xi^\nu_i)$ for all $\mu, \nu = 1, \ldots, k$, separately for all $i = 1, \ldots, p$. For clarity, we denote the number of admissible dichotomies by $C^{(k)}_{n,p}$, as shown in Fig. 1.
A recursion relation for $C^{(k)}_{n,p}$ can be obtained by carefully extending the method used for the doublet case. At the $(p+1)$th step, we consider the multiplet $\xi_{p+1}$, composed of the $k$ points $\xi^1_{p+1}, \ldots, \xi^k_{p+1}$. Let us momentarily exclude the point $\xi^1_{p+1}$, and suppose we know how to apply Cover's method to the set of $k - 1$ points
$$\bar\xi_{p+1} = \{\xi^2_{p+1}, \ldots, \xi^k_{p+1}\}. \tag{17}$$
This would give an expression, let us call it
$$Q_{k-1}\!\left(C^{(k)}_{n,p}, C^{(k)}_{n-1,p}, \ldots, C^{(k)}_{n-k+1,p}\right). \tag{18}$$
The fact that $Q_{k-1}$ is a function of $C^{(k)}_{n-l,p}$ with $l = 0, \ldots, k - 1$ will be clear in the following. Intuitively, the case $k = 1$ involves only $l = 0$ and $l = 1$, the case $k = 2$ adds $l = 2$ because it uses the expression for $k = 1$ in $n - 1$ dimensions, and the same pattern repeats inductively up to $k - 1$ points.
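The quantity $Q_{k-1}$ represents the number of dichotomies of the set $\xi_1 \cup \xi_2 \cup \cdots \cup \xi_p \cup \bar\xi_{p+1}$ that are admissible on the first $p$ multiplets [meaning that $\phi(\xi^\mu_i) = \phi(\xi^\nu_i)$ for all $\mu, \nu = 1, \ldots, k$ and all $i = 1, \ldots, p$] and admissible on the $k - 1$ points in $\bar\xi_{p+1}$ [meaning that $\phi(\xi^\mu_{p+1}) = \phi(\xi^\nu_{p+1})$ for all $\mu, \nu = 2, \ldots, k$]. A number $R^{k-1}_{n,p}$ of these dichotomies are realizable by a hyperplane passing through the excluded point $\xi^1_{p+1}$, and are therefore all admissible. Of the remaining $Q_{k-1}(\ldots) - R^{k-1}_{n,p}$ ones, a fraction $\tilde\Psi_k$ assign the same value to $\xi^1_{p+1}$ and to the points in $\bar\xi_{p+1}$, and are therefore admissible on the whole multiplet $\xi_{p+1}$. Therefore,
$$C^{(k)}_{n,p+1} = R^{k-1}_{n,p} + \tilde\Psi_k \left[ Q_{k-1}\!\left(C^{(k)}_{n,p}, \ldots, C^{(k)}_{n-k+1,p}\right) - R^{k-1}_{n,p} \right]. \tag{19}$$
While $\Psi_2$ was a probability (over all possible hyperplanes), $\tilde\Psi_k$ is a conditional probability, namely the probability that a uniform vector $w$ on the sphere $S^{n-1}$ does not separate the multiplet $\xi_{p+1}$, conditioned on the event that $w$ does not separate the set $\bar\xi_{p+1}$:
$$\tilde\Psi_k = P\!\left(w \text{ does not separate } \xi_{p+1} \,\middle|\, w \text{ does not separate } \bar\xi_{p+1}\right). \tag{20}$$
The dependence of $\tilde\Psi_k$ on the relative positions of the points is discussed in the Appendix, where it is shown that (i) the calculation of $\tilde\Psi_k$ can be reduced from $n$-dimensional to $k$-dimensional integrals, and (ii) $\tilde\Psi_k$ depends on $n$ only through the $k(k-1)/2$ overlaps $\rho_{\mu\nu}$ between the points in a multiplet, which we fix for all multiplets:
$$\xi^\mu_i \cdot \xi^\nu_i = \rho_{\mu\nu}, \qquad i = 1, \ldots, p. \tag{21}$$
This property allows us to treat $\tilde\Psi_k$ as a constant in the recursions, thus simplifying the computations. Note that, since it is a conditional probability, $\tilde\Psi_k$ can be written as a ratio of probabilities:
$$\tilde\Psi_k = \frac{\Psi_k}{\Psi_{k-1}}, \tag{22}$$
where $\Psi_k$ depends on the $k(k-1)/2$ overlaps between $k$ points, and denotes the fraction of hyperplanes not separating the $k$ points. This definition, together with the identity $\Psi_1 = 1$, implies that the geometric quantity computed above for $k = 2$ is $\Psi_2(\rho) = \tilde\Psi_2(\rho)$. The number $R^{k-1}_{n,p}$ can be obtained by applying Cover's method again with respect to the set $\bar\xi_{p+1}$, this time in $n - 1$ dimensions because the hyperplane is constrained to pass through $\xi^1_{p+1}$. Hence
$$R^{k-1}_{n,p} = Q_{k-1}\!\left(C^{(k)}_{n-1,p}, \ldots, C^{(k)}_{n-k,p}\right). \tag{23}$$
Finally, from Eqs. (19) and (23), the recursion for $C^{(k)}_{n,p}$ reads
$$C^{(k)}_{n,p+1} = Q_k\!\left(C^{(k)}_{n,p}, C^{(k)}_{n-1,p}, \ldots, C^{(k)}_{n-k,p}\right), \tag{24}$$
where the functions $Q_k$ (having $k + 1$ arguments) satisfy the recursive functional relation
$$Q_k(x_n, \ldots, x_{n-k}) = \tilde\Psi_k\, Q_{k-1}(x_n, \ldots, x_{n-k+1}) + (1 - \tilde\Psi_k)\, Q_{k-1}(x_{n-1}, \ldots, x_{n-k}), \tag{25}$$
with the boundary $Q_1(x_n, x_{n-1}) = x_n + x_{n-1}$ given by the form of Eq. (2) for a single point. The recursion in $k$ can be solved, thus yielding again a recursion for $C^{(k)}_{n,p+1}$ in $n$ and $p$ only. Let us call $\theta_k(l)$ the coefficients in the solved recursion:
$$C^{(k)}_{n,p+1} = \sum_{l=0}^{k} \theta_k(l)\, C^{(k)}_{n-l,p}. \tag{26}$$
Equation (25) then becomes
$$\theta_k(l) = \tilde\Psi_k\, \theta_{k-1}(l) + (1 - \tilde\Psi_k)\, \theta_{k-1}(l-1), \tag{27}$$
with boundaries $\theta_1(0) = \theta_1(1) = 1$ and $\theta_k(-1) = \theta_k(k+1) = 0$. For instance, setting $k = 2$ in Eqs. (26) and (27) recovers the recursion for doublets, Eq. (8), as expected. For $k = 3$ one obtains
$$C^{(3)}_{n,p+1} = \tilde\Psi_3 \tilde\Psi_2\, C^{(3)}_{n,p} + \left[\tilde\Psi_3 + (1 - \tilde\Psi_3)\tilde\Psi_2\right] C^{(3)}_{n-1,p} + \left[\tilde\Psi_3 (1 - \tilde\Psi_2) + (1 - \tilde\Psi_3)\right] C^{(3)}_{n-2,p} + (1 - \tilde\Psi_3)(1 - \tilde\Psi_2)\, C^{(3)}_{n-3,p}. \tag{28}$$
In the process of deriving the foregoing recursion relations we considered the points $\xi^\mu_{p+1}$ in a particular order, therefore explicitly breaking invariance under permutations within the multiplets. We restore the invariance a posteriori, by prescribing that all $\tilde\Psi_l$ (with $l \le k$) be symmetrized with respect to all $k(k-1)/2$ overlaps. For instance, when $k = 3$, the $\Psi_2 = \tilde\Psi_2$ appearing in Eq. (28) is to be understood as $[\Psi_2(\rho_{12}) + \Psi_2(\rho_{13}) + \Psi_2(\rho_{23})]/3$. The validity of this prescription is supported by the numerical results shown in Fig. 3; see also the limit case (ii) in the Discussion below.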
The solution for $C^{(k)}_{n,p}$ (with the appropriate boundary conditions) can be obtained, for instance via generating functions, but we do not give it here. Instead, we focus on the capacity, which can be computed by the same approximate method used for $k = 2$ [Eqs. (15) and (16)]:
$$\alpha_c = \frac{\lambda_0(k)}{\lambda_1(k)}, \tag{29}$$
where we have defined the moments
$$\lambda_m(k) = \sum_{l=0}^{k} l^m\, \theta_k(l). \tag{30}$$
Summing Eq. (27) over $l$ shows that $\lambda_0(k) = \lambda_0(k-1)$ and therefore $\lambda_0(k) = \lambda_0(1) = 2$. By multiplying Eq. (27) by $l$ and summing over $l$, one obtains $\lambda_1(k) = \lambda_1(k-1) + (1 - \tilde\Psi_k)\lambda_0(k-1)$. The boundary condition $\lambda_1(1) = 1$ then fixes the solution
$$\lambda_1(k) = 1 + 2 \sum_{l=2}^{k} (1 - \tilde\Psi_l). \tag{31}$$
Finally, substituting $\lambda_0(k)$ and $\lambda_1(k)$ into Eq. (29) yields a remarkably simple formula for the capacity:
$$\alpha_c = \left[ \frac{1}{2} + \sum_{l=2}^{k} (1 - \tilde\Psi_l) \right]^{-1}. \tag{32}$$
Figure 3 compares our theory with numerical computations in the case of triplets ($k = 3$), for triangles with three, two, and no sides of the same length. The agreement is excellent. The function $\tilde\Psi_3$ is a double integral (given in the Appendix), which we evaluate numerically.
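The moment argument can be reproduced in a few lines. The sketch below assumes the coefficient recursion $\theta_k(l) = \tilde\Psi_k\,\theta_{k-1}(l) + (1-\tilde\Psi_k)\,\theta_{k-1}(l-1)$ with $\theta_1(0) = \theta_1(1) = 1$, which is the form consistent with the moment identities quoted above, and takes the capacity as the ratio $\lambda_0(k)/\lambda_1(k)$. The values of $\tilde\Psi_l$ are supplied by the caller as plain numbers; the code does not compute them from the overlaps, and the function names are illustrative.

```python
def theta_coefficients(psi_tilde):
    """Coefficients [theta_k(0), ..., theta_k(k)] of the recursion in n and p.

    psi_tilde is the list [psi~_2, ..., psi~_k]; an empty list means k = 1.
    """
    theta = [1.0, 1.0]  # theta_1(0) = theta_1(1) = 1
    for psi in psi_tilde:
        # pad with the vanishing boundary values theta_{k-1}(-1), theta_{k-1}(k)
        prev = [0.0] + theta + [0.0]
        # prev[l + 1] is theta_{k-1}(l); prev[l] is theta_{k-1}(l - 1)
        theta = [psi * prev[l + 1] + (1.0 - psi) * prev[l]
                 for l in range(len(theta) + 1)]
    return theta

def moment(theta, m):
    """lambda_m = sum over l of l**m * theta(l)."""
    return sum((l ** m) * t for l, t in enumerate(theta))

def capacity(psi_tilde):
    """Capacity estimate lambda_0 / lambda_1 for the given psi~_2 ... psi~_k."""
    theta = theta_coefficients(psi_tilde)
    return moment(theta, 0) / moment(theta, 1)

def capacity_closed_form(psi_tilde):
    """Equivalent closed form: 1 / (1/2 + sum of (1 - psi~_l))."""
    return 1.0 / (0.5 + sum(1.0 - psi for psi in psi_tilde))
```

For a single $\tilde\Psi_2$ the coefficients are $[\Psi_2, 1, 1-\Psi_2]$ and the capacity is $2/(3 - 2\Psi_2)$, matching the doublet case; an empty list gives Cover's value 2.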

VI. DISCUSSION
Our extension of Cover's combinatorial technique to structured data makes it possible to obtain closed expressions of $C^{(k)}_{n,p}$ at finite $n$ and $p$, for any $k$ [we have written explicitly the result for $k = 2$ in Eq. (12)]. Besides this, our main result is Eq. (32), which expresses the capacity as a simple function of the quantities $\tilde\Psi_l$. Regarding these quantities, the merit of our method is twofold: first, the $\tilde\Psi_l$'s are revealed to be the only relevant parameters characterizing the linear separability of the multiplets; second, they have a very simple geometric interpretation in terms of probabilities.
Another interesting, albeit less elementary, limit case would be k → ∞, taken in such a way that the points generate a sphere of radius κ; then Eq. (32) should reproduce the well-known capacity with margin κ [13], which has never been obtained by combinatorial methods [8,12].
Other applications and extensions of the theory appear possible. First, the capacity is written in Eq. (29) as a combination of the zeroth and first moments, but higher-order moments can be computed similarly and give access to other useful quantities. For instance, the second moment is related to the width of the crossover region separating the regimes where $c_{n,p} \approx 1$ and $c_{n,p} \approx 0$. Second, it would be interesting to express our results for general (non-linear) separating surfaces, in the same spirit as Cover's original work, and in view of useful applications.

ACKNOWLEDGMENTS
We would like to dedicate this work to the memory of Bruno Bassetti. P.R. acknowledges funding by the European Union through the H2020 -MCIF Grant No. 766442.
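
Appendix: Computation of $\Psi_k$
a. Computation of $\Psi_2(\rho)$. The fraction of hyperplanes assigning the same value to two points $\xi$ and $\bar\xi$ is given by
$$\Psi_2 = \frac{2}{Z_n} \int d^n w\; \delta\!\left(\|w\|^2 - 1\right) \theta(w \cdot \xi)\, \theta(w \cdot \bar\xi), \tag{A.1}$$
where the factor 2 accounts, by the symmetry $w \to -w$, for the case in which both points are mapped to 0. The normalization factor is
$$Z_n = \int d^n w\; \delta\!\left(\|w\|^2 - 1\right) = \frac{\Omega_n}{2}, \tag{A.2}$$
where $\Omega_n$ is the solid angle in $n$ dimensions. Gram-Schmidt (GS) orthonormalization of $\xi$ and $\bar\xi$ yields
$$\hat\xi_1 = \xi, \tag{A.3}$$
$$\hat\xi_2 = \frac{\bar\xi - \rho\, \xi}{\sqrt{1 - \rho^2}}. \tag{A.4}$$
Having orthonormalized the points allows one to safely exploit the $(n-2)$-dimensional spherical symmetry of the integral in the space orthogonal to $\hat\xi_1$ and $\hat\xi_2$, and to reduce it to an integral over the two-dimensional solid angle:
$$\Psi_2 = \int \frac{d\Omega_2}{\pi}\; \theta(\cos\varphi)\; \theta\!\left(\rho \cos\varphi + \sqrt{1 - \rho^2}\, \sin\varphi\right), \tag{A.5}$$
which evaluates to the result in Eq. (5), and shows that $\Psi_2 = \Psi_2(\rho)$.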
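As a sanity check on the two-point integral, one can estimate the same fraction by sampling. The sketch below draws Gaussian vectors $w$ (whose direction is uniform on the sphere, and whose norm does not affect the labels) and counts how often two unit vectors with overlap $\rho$ fall in the same half-space. It compares the estimate with the value $1 - \arccos(\rho)/\pi$, the standard result for random hyperplanes through the origin. Function names, sample size, and seed are arbitrary illustrative choices.

```python
import random
from math import acos, pi, sqrt

def psi2_exact(rho: float) -> float:
    """Closed form for the fraction of hyperplanes not separating two unit
    vectors with overlap rho."""
    return 1.0 - acos(rho) / pi

def psi2_monte_carlo(rho: float, n: int = 5, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same fraction in n >= 2 dimensions."""
    rng = random.Random(seed)
    # two unit vectors with overlap rho, written in a Gram-Schmidt basis
    xi = [1.0] + [0.0] * (n - 1)
    xi_bar = [rho, sqrt(1.0 - rho * rho)] + [0.0] * (n - 2)
    same_side = 0
    for _ in range(samples):
        # Gaussian components give an isotropic direction for w
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a = sum(wi * x for wi, x in zip(w, xi))
        b = sum(wi * x for wi, x in zip(w, xi_bar))
        if (a > 0) == (b > 0):
            same_side += 1
    return same_side / samples
```

Only the first two components of `w` enter the labels, which mirrors the reduction of the $n$-dimensional integral to a two-dimensional one; changing `n` should therefore leave the estimate unchanged within statistical error.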