Classification and Geometry of General Perceptual Manifolds

Perceptual manifolds arise when a neural population responds to an ensemble of sensory signals associated with different physical features (e.g., orientation, pose, scale, location, and intensity) of the same perceptual object. Object recognition and discrimination require classifying the manifolds in a manner that is insensitive to variability within a manifold. How neuronal systems give rise to invariant object classification and recognition is a fundamental problem in brain theory as well as in machine learning. Here we study the ability of a readout network to classify objects from their perceptual manifold representations. We develop a statistical mechanical theory for the linear classification of manifolds with arbitrary geometry, revealing a remarkable relation to the mathematics of conic decomposition. Novel geometrical measures of manifold radius and manifold dimension are introduced which can explain the classification capacity for manifolds of various geometries. The general theory is demonstrated on a number of representative manifolds, including ℓ2 ellipsoids prototypical of strictly convex manifolds, ℓ1 balls representing polytopes consisting of finite sample points, and orientation manifolds which arise from neurons tuned to respond to a continuous angle variable, such as object orientation. The effects of label sparsity on the classification capacity of manifolds are elucidated, revealing a scaling relation between label sparsity and manifold radius. Theoretical predictions are corroborated by numerical simulations using recently developed algorithms to compute maximum margin solutions for manifold dichotomies. Our theory and its extensions provide a powerful and rich framework for applying statistical mechanics of linear classification to data arising from neuronal responses to object stimuli, as well as to artificial deep networks trained for object recognition tasks.

A fundamental cognitive task performed by animals and humans is the invariant perception of objects, requiring the nervous system to discriminate between different objects despite substantial variability in each object's physical features. For example, in vision, the mammalian brain is able to recognize objects despite variations in their orientation, position, pose, lighting, and background. Such impressive robustness to physical changes is not limited to vision; other examples include speech processing, which requires the detection of phonemes despite variability in the acoustic signals associated with individual phonemes, and the discrimination of odors in the presence of variability in odor concentrations. Sensory systems are organized as hierarchies consisting of multiple layers that transform sensory signals into a sequence of distinct neural representations. Studies of high-level sensory systems, e.g., the inferotemporal cortex (IT) in vision [1], auditory cortex in audition [2], and piriform cortex in olfaction [3], reveal that even the late sensory stages exhibit significant sensitivity of neuronal responses to physical variables. This suggests that sensory hierarchies generate representations of objects that, although not entirely invariant to changes in physical features, are still readily decoded by a downstream system. This hypothesis is formalized by the notion of the untangling of perceptual manifolds [4][5][6]. This viewpoint underlies a number of studies of object recognition in deep neural networks for artificial intelligence [7][8][9][10].
To conceptualize perceptual manifolds, consider a set of N neurons responding to a specific sensory signal associated with an object, as shown in Fig. 1. The neural population response to that stimulus is a vector in R^N. Changes in the physical parameters of the input stimulus that do not change the object identity modulate the neural state vector. The set of all state vectors corresponding to responses to all possible stimuli associated with the same object can be viewed as a manifold in the neural state space. In this geometrical perspective, object recognition is equivalent to the task of discriminating manifolds of different objects from each other. Presumably, as signals propagate from one processing stage to the next in the sensory hierarchy, the geometry of the manifolds is reformatted so that they become "untangled," namely they are more easily separated by a biologically plausible decoder [1]. In this paper, we model the decoder as a simple single layer network (the perceptron) and ask how the geometrical properties of the perceptual manifolds influence their ability to be separated by a linear classifier. The issue of quantifying linear separability has previously been studied in the context of the classification of points, by a perceptron, using combinatorics [11] and statistical mechanics [12,13]. Gardner's statistical mechanics theory is extremely important as it provides accurate estimates of the perceptron capacity beyond function counting by incorporating robustness measures. The robustness of a linear classifier is quantified by the margin, which measures the distance between the separating hyperplane and the closest point. Maximizing the margin of a classifier is a critical objective in machine learning, providing Support Vector Machines (SVMs) with their good generalization performance guarantees [14].
The above theories focus on separating a finite set of points with no underlying geometrical structure and are not applicable to the problem of manifold classification, which deals with separating an infinite number of points geometrically organized as manifolds. This paper addresses the important question of how to quantify the capacity of the perceptron for dichotomies of input patterns described by manifolds. In an earlier paper, we presented the analysis for the classification of manifolds of extremely simple geometry, namely balls [15]. However, those results have limited applicability, as the neural manifolds arising from realistic physical variations of objects can exhibit much more complicated geometries. Can statistical mechanics deal with the classification of manifolds with complex geometry?
In this paper, we develop a theory of linear classification of general manifolds, formalized in Sec. II. The theory is exact in the thermodynamic limit where the dimension of the manifolds is finite while both the number of neurons in the representation and the number of manifolds grow to infinity. Extending Gardner's approach, we solve the mean field equations of the replica theory (Sec. III) and employ the Karush-Kuhn-Tucker conditions for convex quadratic optimization to delineate the qualitative and quantitative consequences of the mean field theory. This theory enables not only the evaluation of the capacity but also the support structure of the solution that maximizes the margin, extending the theory of SVMs for the classification of a finite number of random vectors to general manifolds. In addition to the study of the classification capacity, we introduce (Sec. IV) novel geometrical measures of manifolds which are motivated by the mean field theory; in particular, we provide new definitions of manifold dimension and radius, denoted as D_M and R_M, respectively. Interestingly, we show (Sec. V) that in the limit of high dimensional manifolds, these quantities are sufficient for estimating the classification capacity. We demonstrate the general theory in several prototypical examples for different classes of data: strictly convex manifolds such as ℓ2 ellipsoids (Sec. VI), polytopes represented by ℓ1 balls (Sec. VII), and smooth but non-convex manifolds such as orientation manifolds (Sec. VIII). Another important issue is the effect of sparsity in the labels, which arises in many biological and computational contexts. For a finite set of uncorrelated points, it is well known that a significant imbalance in the binary labels increases the classification capacity. Here we study (Sec. IX) the effect of label sparsity on manifold classification and analyze the rich consequences of the interaction between manifold size, dimension and class sparsity.
Our theory extends the statistical mechanics of linear classification to complex, realistic data structures and serves as a basis for further developments in this field. At the same time, our analysis reveals surprisingly rich connections between the manifold replica theory and other mathematical domains, including the theory of conic decompositions, high dimensional statistics of convex bodies, and the properties of trigonometric moment curves. Finally, our results are summarized and discussed in Sec. X.

II. MODEL OF MANIFOLDS
Manifolds in affine subspaces: We model a set of P perceptual manifolds corresponding to P perceptual objects. Each manifold M^µ for µ = 1, ..., P consists of a compact subset of an affine subspace of R^N with affine dimension D, where D < N. A point on the manifold x^µ ∈ M^µ can be parameterized as

x^µ = Σ_{i=1}^{D+1} s_i u^µ_i,   (1)

where the u^µ_i are a set of orthonormal bases of the (D+1)-dimensional linear subspace containing M^µ. The D+1 components s_i represent the coordinates of the manifold point within this subspace and are constrained to lie in the set s⃗ ∈ S. The bold notation for x^µ and u^µ_i indicates that they are vectors in R^N, whereas the arrow notation for s⃗ indicates that it is a vector in R^{D+1}. The set S defines the shape of the manifolds and encapsulates the affine constraint. For simplicity, we will first assume that the manifolds have the same geometry, so that the coordinate set S is the same for all the manifolds; extensions that consider heterogeneous geometries are provided in Sec. III D.

We study the separability of P manifolds into two classes, denoted by binary labels y^µ = ±1, by a linear hyperplane passing through the origin. A hyperplane is described by a weight vector w ∈ R^N, normalized so that ‖w‖² = N, and the hyperplane correctly separates the manifolds with a margin κ ≥ 0 if it satisfies

y^µ ( w · x^µ ) ≥ κ   (2)

for all µ and for all x^µ ∈ M^µ. Since linear separability is a convex problem, separating the manifolds is equivalent to separating their convex hulls,

conv(M^µ) = { Σ_{i=1}^{D+1} s_i u^µ_i | s⃗ ∈ conv(S) }.   (3)

The position of an affine subspace relative to the origin can be defined via the translation vector that is closest to the origin. This orthogonal translation vector c^µ is perpendicular to all the affine displacement vectors in M^µ. Equivalently, all the points in the affine subspace have equal projections on c^µ, i.e., x^µ · c^µ = ‖c^µ‖² for all x^µ ∈ M^µ (Fig. 2(a)). We will assume that the norms of all the translation vectors are the same, denoted by ‖c^µ‖ = c.
It is convenient to represent the translation vectors in terms of their coordinates in the D+1 dimensional subspaces. We will denote this representation by the D+1 dimensional vector c⃗, common to all the manifolds, such that s⃗ · c⃗ = ‖c⃗‖² for all s⃗ ∈ S. One possible choice of coordinates would be c⃗ = (0, 0, ..., 0, c) with the coordinate set S = {(s_1, s_2, ..., s_D, c)}. This parameterization is convenient since it constrains the variability of the manifolds to the first D components of s⃗. However, the general theory developed below is independent of the choice of coordinates.

To investigate the separability properties of manifolds, it is helpful to consider scaling a manifold M^µ by an overall scale factor r without changing its shape. We define the scaling relative to a center s⃗⁰ ∈ S by a scalar r > 0 as

r M^µ = { Σ_{i=1}^{D+1} [ s⁰_i + r ( s_i − s⁰_i ) ] u^µ_i | s⃗ ∈ S }.   (4)

When r → 0, the manifold r M^µ converges to a point:

x^µ_0 = Σ_i s⁰_i u^µ_i.   (5)

On the other hand, when r → ∞, the manifold r M^µ spans the entire affine subspace. If the manifold has a symmetric shape (such as an ellipsoid), there is a natural choice for a center. We will see later that our theory suggests a natural definition of the center point for general, asymmetric manifolds. In general, the translation vector c⃗ and the center s⃗⁰ need not coincide, as shown in Fig. 2(a).

Bounds on linear separability of manifolds: For dichotomies of P input points in R^N at zero margin, κ = 0, the number of dichotomies that can be separated by a linear hyperplane through the origin is given by [11]:

C_0(P, N) = 2 Σ_{k=0}^{N−1} C^{P−1}_k,   (6)

where C^n_k = n! / [k! (n−k)!] is the binomial coefficient for n ≥ k, and zero otherwise. This result holds for P input vectors that obey the mild condition that the vectors are in general position, namely that all subsets of input vectors of size p ≤ N are linearly independent. For large P and N, the probability 2^{−P} C_0(P, N) of a dichotomy being linearly separable depends only upon the ratio P/N and exhibits a sharp transition at the critical value α_0 = 2.
Unfortunately, we are not aware of a comprehensive extension of Cover's counting theorem to general manifolds. Nevertheless, we can provide lower and upper bounds on the number of linearly realizable dichotomies by considering the limits r → 0 and r → ∞ under the following general conditions. First, in the limit of r → 0, the linear separability of P manifolds becomes equivalent to the separability of the P centers. This leads to the requirement that the centers of the manifolds, x^µ_0, are in general position in R^N. Second, we consider the conditions under which the manifolds are linearly separable when r → ∞, so that the manifolds span complete affine subspaces. For a weight vector w to consistently assign the same label to all points on an affine subspace, it must be orthogonal to all the displacement vectors in the affine subspace. In that case, the label assigned will be the same as that assigned to the manifold's orthogonal translation vector. Hence, to realize a dichotomy of P manifolds when r → ∞, the weight vector w must lie in a null space of dimension N − D_tot, where D_tot is the rank of the union of the affine displacement vectors. When the basis vectors u^µ_i are in general position, D_tot = min(DP, N). Then, for the affine subspaces to be separable, PD < N is required, and the projections of the P orthogonal translation vectors must also be separable in the (N − D_tot)-dimensional null space.
Under these general conditions, the number of dichotomies of D-dimensional affine subspaces that can be linearly separated, C_D(P, N), can be related to the number of dichotomies for a finite set of points via:

C_D(P, N) = C_0(P, N − P D).

From this relationship, we conclude that for affine subspaces: when P/N ≥ 1/D, C_D(P, N) = 0; when P/N ≤ 1/(D+1), C_D(P, N) = 2^P; and when P/N = 2/(1 + 2D), half of the dichotomies are linearly separable. Thus, the ability to linearly separate D-dimensional affine subspaces exhibits a transition from always separable to never separable at the critical ratio P/N = 2/(1 + 2D) for large P and N (see Supplementary Materials, SM, Sec. S1).
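Cover's counting function and its affine-subspace extension are easy to evaluate numerically. The sketch below (plain Python with exact integer combinatorics; function names are ours) shows the sharp transitions of the separable fraction at the critical loads quoted above, P/N = 2 for points and P/N = 2/(1+2D) for affine subspaces.

```python
from math import comb

def C0(P, N):
    # Cover's count of linearly separable dichotomies of P points in
    # general position in R^N (hyperplanes through the origin).
    if N <= 0:
        return 0
    return 2 * sum(comb(P - 1, k) for k in range(N))

def frac_separable(P, N, D=0):
    # Fraction of the 2^P dichotomies that are linearly separable for P
    # D-dimensional affine subspaces, using C_D(P, N) = C_0(P, N - P*D).
    return C0(P, N - P * D) / 2 ** P

# Points (D = 0): at exactly P = 2N, half of the dichotomies are separable.
print(frac_separable(100, 50))       # 0.5
# Affine subspaces with D = 2: transition at P/N = 2/(1+2D) = 0.4.
print(frac_separable(40, 100, D=2))  # 0.5
```

Away from the transition the fraction saturates quickly: well below the critical load essentially all dichotomies are separable, well above it essentially none are.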
For general D-dimensional manifolds with finite size, the number of linearly separable dichotomies is lower bounded by C_D(P, N) and upper bounded by C_0(P, N). We introduce the notation α_M(κ) to denote the maximal load P/N that allows for linear separability of randomly labeled manifolds with high probability. Therefore, from the above considerations, it follows that the critical load α_M(κ = 0) is bounded by

2/(1 + 2D) ≤ α_M(0) ≤ 2.   (7)

III. STATISTICAL MECHANICAL THEORY
In order to make theoretical progress beyond the bounds above, we need to make additional statistical assumptions about the manifold spaces and labels. Specifically, we will assume that the individual components of u^µ_i are drawn independently from identical Gaussian distributions with zero mean and variance 1/N, and that the binary labels y^µ = ±1 are randomly assigned to each manifold with equal probabilities. We will study the thermodynamic limit where N, P → ∞ with a finite load α = P/N. In addition, the manifold geometries, as specified by the set S in R^{D+1}, are held fixed in the thermodynamic limit. Under these assumptions, the bounds in Eq. (7) can be extended to the linear separability of general manifolds with finite margin κ, characterized by the reciprocal of the critical load ratio,

α_0^{−1}(κ) ≤ α_M^{−1}(κ) ≤ α_0^{−1}(κ) + D,   (8)

where α_0(κ) is the maximum load for separation of random i.i.d. points with a margin κ given by the Gardner theory [12],

α_0^{−1}(κ) = ∫_{−κ}^{∞} Dt ( t + κ )²,

with Gaussian measure Dt = (1/√(2π)) e^{−t²/2} dt. For many interesting cases, the affine dimension D is large and the gap in Eq. (8) is overly loose. In the following, we use the replica approach to study the classification problem for manifolds with finite sizes and evaluate the dependence of the capacity and the nature of the solution on the geometrical properties of the manifolds.
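The Gardner integral has a closed form in terms of the Gaussian density and cumulative distribution, ∫_{−κ}^{∞} Dt (t+κ)² = (1+κ²)Φ(κ) + κφ(κ), which makes α_0(κ) cheap to evaluate. A minimal sketch (function names are ours):

```python
from math import erf, exp, pi, sqrt

def gauss_pdf(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def gauss_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def alpha0(kappa):
    # Gardner capacity of random points at margin kappa:
    # 1/alpha0 = int_{-kappa}^{inf} Dt (t + kappa)^2
    #          = (1 + kappa^2) Phi(kappa) + kappa phi(kappa).
    inv = (1 + kappa ** 2) * gauss_cdf(kappa) + kappa * gauss_pdf(kappa)
    return 1.0 / inv

print(alpha0(0.0))  # 2.0: the classical zero-margin perceptron capacity
```

At κ = 0 the integral equals 1/2, recovering α_0 = 2, and the capacity decreases monotonically as the required margin grows.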
A. Mean field theory of the capacity

Following Gardner's framework [12,13], we compute the statistical average of log V, where V is the volume of the space of solutions, which in our case can be written as

V = ∫ d^N w δ( ‖w‖² − N ) Π_{µ, x^µ ∈ M^µ} Θ( y^µ w · x^µ − κ ).

Here, Θ(·) is the Heaviside function enforcing the margin constraints in Eq. (2), and the delta function ensures ‖w‖² = N. In the following, we focus on the properties of the maximum margin solution, namely the solution for the largest load α_M at a fixed margin κ, or, equivalently, for a given α_M, the solution when the margin κ is maximized.
As shown in Appendix A, we prove that the general form of the inverse capacity, exact in the thermodynamic limit, is

α_M^{−1}(κ) = ⟨ F(t⃗) ⟩_t⃗,

where

F(t⃗) = min_{v⃗} { ‖v⃗ − t⃗‖² | v⃗ · s⃗ + κ ≤ 0 ∀ s⃗ ∈ S }   (12)

and ⟨...⟩_t⃗ is an average over random (D+1)-dimensional vectors t⃗ whose components are i.i.d. normally distributed, t_i ∼ N(0, 1).
The inequality constraints in Eq. (12) can be written equivalently as a constraint on the point on the manifold with maximal projection on v⃗. We therefore consider the following convex function, known as the support function of S:

g_S(v⃗) = max_{s⃗ ∈ S} v⃗ · s⃗,

which can be used to write the constraint for F(t⃗) as

F(t⃗) = min_{v⃗} { ‖v⃗ − t⃗‖² | g_S(v⃗) + κ ≤ 0 }.   (14)

The components of the Gaussian vector t⃗ represent the quenched randomness in the solution due to the quenched variability in the manifolds' basis vectors and the labels; the difference v⃗ − t⃗ represents the scaled variability due to the entropy of the solution space (see Appendix A).
Together, these contributions comprise v⃗, which, up to a sign, represents the fields induced by the solution w on the basis vectors of the manifolds, u^µ_i. To conform to the standard form of convex optimization, as in Eq. (12), the components of v⃗ are reversed with respect to these fields; hence, they obey inequality constraints that appear opposite to the original constraints in Eq. (2).

Karush-Kuhn-Tucker (KKT) conditions: To gain a deeper understanding of the nature of the maximum margin solution, it is useful to consider the KKT conditions of the convex optimization in Eq. (14) [16]. For each t⃗, the KKT conditions characterizing the unique solution v⃗ for F(t⃗) are given by

v⃗ = t⃗ − λ s̃(t⃗),   (15)

where

λ ≥ 0,  g_S(v⃗) + κ ≤ 0,  λ [ g_S(v⃗) + κ ] = 0.   (16)

The vector s̃(t⃗) is one of the subgradients of the support function at v⃗, s̃(t⃗) ∈ ∂g_S(v⃗). As a subgradient, it satisfies the requirement

g_S(v⃗′) − g_S(v⃗) ≥ s̃(t⃗) · ( v⃗′ − v⃗ )

for all vectors v⃗′, and lies within the convex hull, ∂g_S(v⃗) ⊆ conv(S). When the support function is differentiable, the subgradient ∂g_S(v⃗) is unique and is equivalent to the gradient of the support function:

s̃(t⃗) = ∇g_S(v⃗).   (18)

In this case, s̃(t⃗) corresponds to the unique point obeying

s̃(t⃗) = argmax_{s⃗ ∈ S} v⃗ · s⃗.

Since the support function is positively homogeneous, ∇g_S(v⃗) depends only upon the unit direction v̂. For values of v⃗ such that g_S(v⃗) is not differentiable, the subgradient is not unique, but s̃(t⃗) can be defined uniquely as the particular subgradient that obeys the KKT conditions. We call s̃(t⃗) the projection of t⃗ on the convex hull of the manifold S, or simply the S-projection of t⃗.

B. Conic decomposition
The KKT conditions can also be interpreted geometrically in terms of the conic decomposition of t⃗, which generalizes the notion of the decomposition of vectors onto linear subspaces and their null spaces via Euclidean projection. The shifted polar cone, S°_κ, of the manifold S is defined as the convex set of points

S°_κ = { v⃗ ∈ R^{D+1} | v⃗ · s⃗ + κ ≤ 0 ∀ s⃗ ∈ S },   (20)

and is illustrated for κ = 0 and κ > 0 in Fig. 3. For κ = 0, Eq. (20) is simply the polar cone of S [17]. Equation (15) can then be interpreted as the decomposition of t⃗ into the sum of two component vectors: v⃗ is its Euclidean projection onto S°_κ, and the other component, λ s̃(t⃗), is located in the convex cone of S,

cone(S) = { Σ_i λ_i s⃗_i | s⃗_i ∈ S, λ_i ≥ 0 }.

When κ = 0, the Moreau decomposition theorem states that the two components are perpendicular: v⃗ · ( λ s̃(t⃗) ) = 0 [17,18]. For nonzero κ, the two components need not be perpendicular but obey v⃗ · ( λ s̃(t⃗) ) = −κλ.

C. Types of supports
The various t⃗ vectors contribute in varying amounts to the inverse capacity and represent qualitatively different solutions for s̃ and v⃗. They can be distinguished by the dimension of the span of the set of subgradients {∂g_S(v⃗)}, or, equivalently, the span of the manifold's intersection with the margin planes of the solution hyperplane. We call this dimension the embedding dimension, denoted by 0 ≤ k ≤ D + 1, where the interior regime is characterized by k = 0 (see SM, Sec. S2, for additional details and examples).

First, we note that there is a regime of vectors t⃗ lying inside the shifted polar cone S°_κ of the manifold. In this case, the optimal v⃗ is t⃗ itself, hence v⃗ = t⃗ and λ = 0, and this regime does not contribute to the inverse capacity. In terms of the original P manifolds, this regime represents those manifolds that are interior to the margin planes of the solution; hence, it is denoted the interior regime. The condition for t⃗ to belong to the interior regime is

g(t⃗) + κ ≤ 0,

where we drop the explicit subscript S in the support function g(v⃗) when the reference manifold is unambiguous. When λ = 0, Eq. (15) does not specify a unique point s⃗ on the manifold; in this interior regime, we define a unique S-projection of t⃗ using Eq. (18):

s̃(t⃗) = ∇g(t⃗) = argmax_{s⃗ ∈ S} t⃗ · s⃗.

Non-zero contributions to the inverse capacity occur only outside the interior regime, when g(t⃗) + κ > 0, in which case λ > 0. The solution for v⃗ is then active, satisfying the equality condition

g(v⃗) + κ = v⃗ · s̃(t⃗) + κ = 0.

With regard to the original classification problem, this indicates that the corresponding manifolds touch or overlap the margin planes of the solution w. From Eq. (15) we obtain λ = ( t⃗ · s̃(t⃗) + κ ) / ‖s̃(t⃗)‖², so that F(t⃗) = ‖v⃗ − t⃗‖² = λ² ‖s̃(t⃗)‖². The following combined expression is valid for both the interior and non-interior regimes:

F(t⃗) = [ t⃗ · s̃(t⃗) + κ ]_+² / ‖s̃(t⃗)‖²,

where [x]_+ = max(x, 0). Thus, the inverse capacity, Eq. (19), can be written in the useful form

α_M^{−1}(κ) = ⟨ [ t⃗ · s̃(t⃗) + κ ]_+² / ‖s̃(t⃗)‖² ⟩_t⃗,

where the expectation is over (D+1)-dimensional Gaussian vectors t⃗ with zero mean and unit covariance matrix, and s̃(t⃗) is the S-projection of t⃗.

Regimes with k > 0 can be further classified as follows.

Touching (k = 1): In this case, t⃗ lies slightly outside S°_κ, and the corresponding v⃗ is on the boundary of S°_κ near t⃗. The subgradient ∂g_S(v⃗) is unique, and the projection s̃(t⃗) is on the boundary of the manifold. This solution corresponds to manifolds that touch the solution margin plane at a single support vector.

Fully embedded (k = D + 1): In this regime, t⃗ lies within the convex cone, cone(S). The corresponding v⃗ is the point of S°_κ nearest the origin, i.e., v⃗ = −(κ/c) ĉ. The set of subgradients ∂g_S(v⃗) is the entire convex hull, conv(S), and the S-projection s̃(t⃗) is in the interior of conv(S). This case corresponds to the fraction of manifolds that completely overlap the margin planes, which we denote as fully embedded support manifolds. In this case, the contribution to the inverse capacity is given by

F(t⃗) = ‖ t⃗ + (κ/c) ĉ ‖².

Partially embedded (1 < k ≤ D): If the convex hull conv(S) is strictly convex, its boundary does not contain faces with dimension larger than unity. Hence, only interior, touching, or fully embedded regimes exist; an example of such manifolds are the ℓ2 ellipsoids discussed in Sec. VI.
On the other hand, if conv(S) is not strictly convex, partially embedded regimes with 1 < k ≤ D may exist. An example, described in Sec. VII, is ℓ1 balls. It is important to note that even smooth manifolds with finite differentiable curvature everywhere may have convex hulls which contain faces. An example, described in Sec. VIII, is that of orientation manifolds, defined by x^µ(θ) which are periodic differentiable functions of a single angle, and which display interesting partially embedded support structures.
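As a sanity check on the combined mean field expression for the inverse capacity, consider the simplest possible manifold: a single point s⃗ = (0, ..., 0, 1). Then s̃(t⃗) is that point for every t⃗, t⃗ · s̃ = t_0, and the formula must reduce to the Gardner point-capacity integral, giving α_0(0) = 2. A Monte Carlo sketch (names and interface are ours):

```python
import numpy as np

def inverse_capacity_mc(s_tilde_fn, D, kappa, n_samples=200_000, seed=1):
    # alpha_M^{-1}(kappa) = < [t.s(t) + kappa]_+^2 / |s(t)|^2 >_t  with t a
    # (D+1)-dim standard Gaussian and s(t) the S-projection supplied by caller.
    rng = np.random.default_rng(seed)
    t = rng.normal(size=(n_samples, D + 1))
    s = s_tilde_fn(t)                      # shape (n_samples, D+1)
    proj = np.einsum('ij,ij->i', t, s)
    norm2 = np.einsum('ij,ij->i', s, s)
    return np.mean(np.maximum(proj + kappa, 0.0) ** 2 / norm2)

# Point manifold: s(t) is always the fixed point (0, ..., 0, 1).
D = 4
point = lambda t: np.tile(np.eye(D + 1)[-1], (len(t), 1))
est = inverse_capacity_mc(point, D, kappa=0.0)
# For a point, <[t_0]_+^2> = 1/2, i.e. alpha_0(0) = 2 (Gardner).
print(est)
```

The same driver can be reused for any manifold for which an S-projection routine is available, which is exactly the structure of the numerical procedure described in Sec. III E.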

D. Mixtures of manifold geometries
We previously assumed that each of the P manifolds has the same affine dimension and geometry, described by the set S. We now extend our theory to heterogeneous manifolds, described by a set {S_l}, where l = 1, ..., L denotes the different manifold shapes. In the replica theory, the shape of the manifolds appears only in the free energy term G_1 (see Appendix, Eq. (A.13)). For a mixture of shapes, the combined free energy is given by simply averaging the individual free energy terms for the S_l.
Recall that this free energy term determines the capacity for each shape, giving an individual inverse critical load α_{S_l}^{−1}. The inverse capacity of the heterogeneous mixture is then

α_M^{−1} = ⟨ α_{S_l}^{−1} ⟩_l,   (29)

where the average is over the fractional proportions of the manifold geometries. This remarkably simple but generic theoretical result enables the analysis of diverse manifold classification problems consisting of mixtures of manifolds with varying dimensions, shapes, and sizes.
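The mixture rule above — inverse capacities, not capacities, average over the shape proportions — can be sketched in a few lines (function name ours):

```python
def mixture_capacity(fractions, capacities):
    # alpha_M^{-1} = sum_l f_l * alpha_{S_l}^{-1}: the *inverse*
    # capacities average over the mixture proportions f_l.
    assert abs(sum(fractions) - 1.0) < 1e-12
    inv = sum(f / a for f, a in zip(fractions, capacities))
    return 1.0 / inv

# Equal mix of point-like manifolds (alpha = 2) with low-capacity
# manifolds (alpha = 0.5): the harmonic-style average is pulled
# strongly toward the low-capacity component.
print(mixture_capacity([0.5, 0.5], [2.0, 0.5]))  # 0.8
```

Because the rule is harmonic rather than arithmetic, a small fraction of hard (low-capacity) manifolds can dominate the capacity of the whole mixture.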
Eq. (29) assumes that each manifold is assigned a binary label independently of its geometrical shape or dimension. In more complex scenarios, however, the binary label of a manifold may be correlated with its underlying geometry; for instance, the positively labeled manifolds may consist of one geometry and the negatively labeled manifolds of a different geometry. How do structural differences between the two classes affect the capacity of the linear classification? A linear classifier can take advantage of these correlations by adding a non-zero bias. Previously, it was assumed that the optimal separating hyperplane passes through the origin; this is reasonable when the two classes are statistically the same. However, when there are statistical differences between the two classes, Eq. (2) should be replaced by y^µ ( w · x^µ − b ) ≥ κ, where the bias b is chosen to maximize the capacity. The effect of optimizing the bias is discussed in more detail in Sec. IX and in SM (Sec. S3).

E. Numerical methods
The solution of the mean field equations consists of two stages. First, s̃(t⃗) is computed for a given t⃗, and then the relevant contributions to the inverse capacity are averaged over the Gaussian distribution of t⃗. For simple geometries, such as ℓ2 ellipsoids, the first step may be solved analytically. However, for more complicated geometries, both steps need to be performed numerically. The first step involves determining v⃗ and s̃(t⃗) for a given t⃗ by solving the quadratic optimization problem, Eq. (14), over the manifold S. This is a quadratic semi-infinite programming (QSIP) problem, since the manifold S may contain infinitely many points. We have developed a novel "cutting plane" method to solve the QSIP problem efficiently; the details of the algorithm are given in the SM (Sec. S4). Gaussian expectations are computed by sampling t⃗ in D+1 dimensions and taking the appropriate averages, similar to the procedure for other mean field methods. The relevant quantities corresponding to the capacity are quite concentrated and converge quickly with relatively few samples.
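The cutting-plane idea can be sketched as follows. This is a simplified stand-in for the SM algorithm, with our own names and design choices: keep a finite working set of manifold points, solve the resulting finite QP for v⃗ by Hildreth's dual coordinate method, then query a support-function oracle for the most violated point of the manifold and add it, stopping when no constraint is violated beyond a tolerance. The example manifold is a 2D ellipse, for which the support oracle is exact.

```python
import numpy as np

def solve_finite_qp(t, S, kappa, iters=3000):
    # min |v - t|^2  s.t.  v.s + kappa <= 0 for all rows s of S,
    # via Hildreth's dual coordinate ascent: v = t - sum_i lam_i s_i.
    lam = np.zeros(len(S))
    v = t.copy()
    for _ in range(iters):
        for i, s in enumerate(S):
            new = max(0.0, lam[i] + (s @ v + kappa) / (s @ s))
            v -= (new - lam[i]) * s
            lam[i] = new
    return v

def ellipse_support(v, R1, R2):
    # Oracle: argmax over the ellipse {(R1 cos a, R2 sin a, 1)} of v.s.
    a = np.arctan2(v[1] * R2, v[0] * R1)
    return np.array([R1 * np.cos(a), R2 * np.sin(a), 1.0])

def qsip_cutting_plane(t, R1, R2, kappa=0.0, tol=1e-6, max_cuts=50):
    # Cutting-plane solution of F(t): start with no constraints and add
    # the most violated manifold point until g(v) + kappa <= tol.
    S, v = [], t.copy()
    for _ in range(max_cuts):
        s = ellipse_support(v, R1, R2)
        if s @ v + kappa <= tol:      # no violated constraint left
            return v
        S.append(s)
        v = solve_finite_qp(t, np.array(S), kappa)
    return v

t = np.array([1.5, -0.7, 2.0])
v = qsip_cutting_plane(t, R1=1.0, R2=0.5)
print(v)  # projection of t onto the polar cone of the ellipse
```

Each added point contributes one tangent halfspace, so only a small working set is typically needed even though the manifold contains infinitely many points.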
In the experimental sections, we show how the mean field theory compares with computer simulations that numerically solve for the maximum margin solution of realizations of P manifolds in R^N, as described by Eq. (2), for a variety of manifold geometries. Finding this solution is challenging, as standard methods for solving SVM problems are limited to a finite number of input points. In our simulations, we approximate the maximum margin solution using the method described in [19] and in SM (Sec. S5).

IV. MANIFOLD GEOMETRY
Our theory describes the capacity and the properties of the maximum margin solution in terms of a mapping from Gaussian distributed vectors t⃗ to their S-projections s̃(t⃗) in the convex hull, conv(S). In this section, we use this mapping to explore geometric properties of conv(S), such as its effective size and dimension. Our definitions of these geometric quantities are motivated by the linear classification theory and shed light on the relation between geometry and linear separability. More generally, these quantities can be viewed as defining novel generalized signal-to-noise parameters of the manifolds.

Manifold centers:
The theory of manifold classification described in Sec. III is completely general and does not require the notion of a manifold center. However, in order to understand how scaling the manifold sizes by a parameter r, Eq. (4), affects their capacity, the center points about which the manifolds are scaled need to be defined. For many geometries, the center is a point of symmetry, such as for an ellipsoid. For general manifolds, there are various ways to define a center point by averaging points on the manifold with respect to a particular measure. In the present theory, a natural definition of the manifold center is provided by the Steiner point for convex bodies [20]:

s⃗_0 = ⟨ ∇g_S(t⃗) ⟩_t⃗,   (30)

where the expectation is over the Gaussian measure of t⃗ ∈ R^{D+1}.
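A minimal illustration of Eq. (30), with names of our choosing: for a one-dimensional manifold whose intrinsic coordinate is the segment [−a, b], the support point is b when the Gaussian component is positive and −a otherwise, so the Steiner point of that coordinate is exactly (b − a)/2. A Monte Carlo average recovers it.

```python
import numpy as np

def steiner_point_mc(support_fn, dim, n_samples=100_000, seed=2):
    # s_0 = < grad g_S(t) >_t: average of the support (touching) point
    # over standard Gaussian vectors t.
    rng = np.random.default_rng(seed)
    t = rng.normal(size=(n_samples, dim))
    return np.mean([support_fn(ti) for ti in t], axis=0)

# Segment manifold: intrinsic coordinate s_perp in [-a, b].
a, b = 1.0, 3.0
segment = lambda t: np.array([b if t[0] > 0 else -a])
center = steiner_point_mc(segment, 1)
print(center)  # ~ (b - a)/2 = 1.0
```

Note that for this asymmetric body the Steiner point differs from the midpoint of the hull's extreme points only by symmetry of the Gaussian measure; for bodies with a point of symmetry, it coincides with that point.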
The geometric properties of the manifolds important for linear classification are determined by the displacements of the points on the manifold's convex hull relative to this center point. The set of relative displacements from the center is defined as s⃗_⊥ = s⃗ − s⃗_0, where s⃗ ∈ conv(S).
The displacement vectors span a D-dimensional linear space. In general, a manifold may be shifted along its D-dimensional linear span, so that the center need not coincide with the orthogonal translation vector c⃗ (Fig. 2(a)). It is, however, natural to characterize the manifold geometry with respect to a configuration where the center and the orthogonal translation vector coincide: s⃗_0 = c⃗. We call manifolds with this configuration centered manifolds, as illustrated in Fig. 2(b). For simplicity, we will also normalize the coordinates so that ‖s⃗_0‖ = c = 1; this means that all lengths are defined relative to the distance of the centers from the origin. We can then decompose vectors in R^{D+1} into their projection on the center direction and their perpendicular components, t⃗ = (t⃗_⊥, t_0) with t_0 = t⃗ · ĉ, and similarly v⃗ = (v⃗_⊥, v_0) and s⃗ = (s⃗_⊥, s_0). For brevity, we will use the notation s⃗_⊥ ∈ S to mean (s⃗_⊥, 1) ∈ S, and similarly for other vectors.
For manifolds with centered configurations, the capacity can be written as

α_M^{−1}(κ) = ⟨ [ t_0 + t⃗_⊥ · s̃_⊥(t⃗) + κ ]_+² / ( 1 + ‖s̃_⊥(t⃗)‖² ) ⟩_t⃗,

implying that the geometric characterization of the manifolds should focus on the statistics of ‖s̃_⊥(t⃗)‖ and t⃗_⊥ · ŝ_⊥(t⃗), where ŝ_⊥(t⃗) denotes a unit vector in the direction of s̃_⊥(t⃗).

A. Gaussian geometry
A simple mapping from t⃗ to points s̃_g ∈ conv(S), suggested by our theory, is

s̃_g(t⃗) = argmax_{s⃗ ∈ S} s⃗ · t̂_⊥,

where t̂_⊥ is a unit vector in the direction of t⃗_⊥. Here, each Gaussian vector is mapped onto the gradient of the support function, given by the displacement point in S that has maximal overlap with t̂_⊥. We will call the statistics defined by this procedure the Gaussian geometry, motivated by its relationship with the well-known Gaussian mean width of a convex body [21]. We denote Gaussian geometric quantities with the subscript g, and note that the parameters t_0 and κ have no influence on these quantities, since the point of maximal overlap with t⃗ depends only upon t̂_⊥.
Gaussian manifold radius: The Gaussian manifold radius, denoted by R_g, measures the mean square amplitude of s̃_{g,⊥}, i.e.,

R_g² = ⟨ ‖s̃_{g,⊥}(t⃗_⊥)‖² ⟩_{t⃗_⊥}.

Gaussian manifold dimension: The Gaussian manifold dimension, D_g, characterizes the number of orthogonal directions containing the variability of the manifold, and is defined as

D_g = ⟨ ( t⃗_⊥ · ŝ_{g,⊥}(t⃗_⊥) )² ⟩_{t⃗_⊥},

where ŝ_{g,⊥}(t⃗_⊥) is the unit vector in the direction of s̃_{g,⊥}(t⃗_⊥).
Thus, D_g scales the affine dimension D by the averaged square of the cosine of the angle between the Gaussian vector t⃗_⊥ and its maximal projection point. While R_g² measures the total variance of the manifold, D_g measures its angular spread.
The geometric intuition behind these definitions is shown in Fig. 4. For each Gaussian vector t⃗_⊥ ∈ R^D, the hyperplane normal to t⃗_⊥ is translated until it just touches the manifold. Aside from a set of measure zero, for each t⃗_⊥ the touching point is a unique point on the boundary of conv(S). The radius and dimension measure the second order statistics of the touching points and of their projections on t⃗_⊥. The Gaussian geometry is related to the well-known Gaussian mean width of convex bodies,

w(S) = 2 ⟨ g_S(t̂_⊥) ⟩_{t⃗_⊥}.

For the case of D-dimensional ℓ2 balls with radius R, s̃_g(t⃗_⊥) is simply the point on the boundary of the ball in the direction of t⃗_⊥; hence, R_g = R and D_g = D. However, for general manifolds, D_g can be much smaller than the manifold affine dimension D, as will be seen in some examples later.
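The ball statement above is easy to check numerically, and the same sketch handles a general ℓ2 ellipsoid, whose support point in a direction t⃗_⊥ has the standard Lagrange-multiplier closed form s_i = R_i² t_i / √(Σ_j R_j² t_j²) (implementation and names are ours):

```python
import numpy as np

def gaussian_geometry(radii, n_samples=200_000, seed=3):
    # Monte Carlo estimate of (R_g, D_g) for an ellipsoid with semi-axes
    # radii: support point s(t) = (R_i^2 t_i) / sqrt(sum_j R_j^2 t_j^2).
    rng = np.random.default_rng(seed)
    R2 = np.asarray(radii) ** 2
    t = rng.normal(size=(n_samples, len(radii)))
    s = R2 * t / np.sqrt((R2 * t ** 2).sum(axis=1, keepdims=True))
    norm_s = np.linalg.norm(s, axis=1)
    Rg2 = np.mean(norm_s ** 2)                    # <|s_g|^2>
    overlap = np.einsum('ij,ij->i', t, s) / norm_s  # t_perp . s_hat
    Dg = np.mean(overlap ** 2)                    # <(t_perp . s_hat)^2>
    return np.sqrt(Rg2), Dg

# A ball of radius R = 2 in D = 5 dimensions: R_g = R exactly, D_g ~ D.
Rg, Dg = gaussian_geometry([2.0] * 5)
print(Rg, Dg)  # ~2.0, ~5.0
```

For strongly anisotropic ellipsoids, the support point concentrates along the long axes, and the estimated D_g drops well below the affine dimension D, illustrating the remark above.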

B. Polar-constrained geometry
In the limit of small manifold sizes, the Gaussian geometry is sufficient for describing the linear classification capacity, as will be shown in Sec. IV D. However, for manifolds with substantial extent, t may violate the constraint g( t) ≤ −κ, and we need to consider statistics that represent valid solution vectors, not simply the maximal projection point ∇g( t). A natural extension of the Gaussian geometry is provided by the KKT analysis, namely s̃( t), where s̃( t) is the S-projection of t, defined by Eq. (16).
The associated vector v( t) is the closest vector in the shifted polar cone S • κ to t that satisfies the separability conditions.

Figure 5. Polar-constrained geometry. Determining s̃ ⊥ for different values of t 0 for (a) a strictly convex manifold and (b) a polytope manifold. In the interior regime (k = 0), s̃ ⊥ is the same as ∇g( t ⊥ ) used for the Gaussian geometry. In other regimes, v needs to be determined self-consistently. In the fully embedded regime, v ⊥ = 0 and s̃ ⊥ lies in the interior of the manifold.
and the constraint g( v ⊥ ) ≤ −(v 0 + κ). Fig. 5 illustrates how the polar-constrained geometry captures the statistics of the subgradient that obeys the margin inequality. We will keep the definition of the center as the Steiner point in Eq. (30), since it represents the limit point on the manifold when the manifold size shrinks to zero. However, other properties of the polar-constrained geometry will differ from their Gaussian counterparts. Polar-constrained manifold radius: The manifold radius R M is given by R M ² = ⟨‖ s̃ ⊥ ( t)‖²⟩ t . Polar-constrained manifold dimension: The manifold dimension D M is given by D M = ⟨( t ⊥ · ŝ ⊥ ( t))²⟩ t , where ŝ ⊥ is the unit vector in the direction of s̃ ⊥ . The polar-constrained geometric properties offer a richer description of the manifold properties relevant for classification. Since v ⊥ and s̃ ⊥ depend in general on t 0 + κ, the above quantities are averaged not only over t ⊥ but also over t 0 . For the same reason, the quantities also depend upon the imposed margin, since the shifted polar cone S • κ depends upon κ. Under a change in scale, the Gaussian radius R g scales linearly with the scale factor r and D g is invariant. In contrast, the polar-constrained R M need not scale linearly with r, and D M may vary as well. This is because the classification properties of manifolds are not invariant to a change in scale if the center distances are kept fixed.
Gaussian properties ignore the effect of the relative scale between manifold size and center distances, whereas the polar-constrained geometrical quantities account for the effect of this scale on classification performance. Thus, the polar-constrained quantities can be viewed as describing the general relationship between signal (center distance) and noise (manifold variability) in the classification capacity.

C. Embedding regimes
As discussed previously, the interior region of t does not contribute to the capacity. In this region, the definition of s̃ ⊥ ( t) for the polar-constrained geometry is equivalent to the Gaussian geometry s̃ g ( t). However, for the non-interior regions, the vector v does not correspond to t and lies in the relative interior of a face of the shifted polar cone. We have described a partitioning of these regions into corresponding embedded regions with dimension k ≥ 1. This has a major effect on the geometrical quantities, since the induced measure becomes concentrated on sets of s̃ ⊥ ( t) that have zero measure in the Gaussian geometry. In particular, when the manifold becomes fully embedded, s̃ ⊥ ( t) is in the interior of the convex hull and not confined to its boundary. This effect becomes very prominent in the limit of large manifolds. To understand the consequences of the geometry defined above, it is useful to follow changes in the nature of the mapping in Eq. (35) as t 0 increases from −∞ to +∞ for a fixed t ⊥ . These changes correspond to changing the dimension of the embedding. Here we summarize the differences in embedding for centered manifolds, depicted in Fig. 5(b). 1. Interior (k = 0): For sufficiently negative t 0 , we see from Eq. (21) that the interior regime is valid, i.e., t 0 + κ < t touch ( t ⊥ ), with the threshold value t touch ( t ⊥ ) = −g( t ⊥ ). 2. Touching (k = 1): As t 0 increases to t touch ( t ⊥ ) − κ, the solution corresponds to the touching region with k = 1. This solution is valid as long as the touching point is unique. In this case, s̃ ⊥ lies on the boundary of the manifold.
3. Fully embedded (k = D + 1): This region occurs for t such that the corresponding v is orthogonal to all the affine displacement vectors of the manifold, i.e., v ⊥ = 0, v 0 = −κ, and the subgradient set is the entire convex hull, conv(S). In this case, the specific subgradient satisfying the KKT conditions is given by s̃ ⊥ = t ⊥ /(t 0 + κ). For a fixed t ⊥ , t 0 must be large enough, t 0 + κ > t embed ( t ⊥ ), where t embed ( t ⊥ ) is the gauge norm of t ⊥ with respect to conv(S). This norm is given by the smallest scalar λ such that t ⊥ ∈ λ conv(S) [21].
Partially embedded (1 < k ≤ D): When conv(S) is not strictly convex, other types of solutions exist, which correspond to intermediate ranges of t 0 such that v ⊥ is perpendicular to a (k − 1)-dimensional face of the convex hull, and s̃ ⊥ is the unique S-projected point on that face. For instance, k = 2 implies that s̃ ⊥ lies on an edge, whereas k = 3 implies that s̃ ⊥ lies on a planar 2-face of the convex hull. The transition values of t 0 that give the different embeddings depend upon the specific geometry of the manifold and the given t ⊥ .
We illustrate these geometrical quantities in Fig. 6 with two simple examples: a D = 2 ℓ2 ball and a D = 2 ℓ2 ellipse. In both cases we consider the distribution of ‖ s̃ ⊥ ‖ and of θ = cos⁻¹( t̂ ⊥ · ŝ ⊥ ), the angle between t ⊥ and s̃ ⊥ . For the ball with radius r, the vectors t ⊥ and s̃ ⊥ are parallel, so the angle is always zero. For the Gaussian statistics, the distribution of ‖ s̃ ⊥ ‖ is a delta function at r. On the other hand, for the polar-constrained geometry, s̃ ⊥ lies inside the ball in the fully embedded region. Thus the distribution of ‖ s̃ ⊥ ‖ consists of a mixture of a delta function at r, corresponding to the interior and touching regions, and a smoothly varying distribution corresponding to the fully embedded region. Fig. 6 also shows the corresponding densities for a two-dimensional ellipsoid with major and minor radii R 1 = r and R 2 = ½r. For the Gaussian geometry, the distribution of ‖ s̃ ⊥ ‖ has finite support between R 2 and R 1 , whereas the polar-constrained geometry has support also below R 2 . Since t ⊥ and s̃ ⊥ need not be parallel, the distribution of the angle varies between zero and π/4, with the polar-constrained geometry being more concentrated near zero due to contributions from the embedded region.

D. Size and margin effects
We now discuss the effect of the manifold size on its geometric properties. Scaling the manifolds as in Eq. (4) corresponds to scaling all s̃ ⊥ by a scalar r. Here we focus on the limits of small and large r. In the limit of small size, s̃ ⊥ → 0, giving t · s̃ ≈ t 0 and ‖ s̃‖² ≈ 1. Here, only the interior and touching regimes contribute to the statistics, and v ⊥ = t ⊥ − λ s̃ ⊥ ≈ t ⊥ . Therefore, in this limit, the polar-constrained statistics are equivalent to the Gaussian quantities, R M → R g and D M → D g . Thus, for small scales r, the manifold dimension is D M = D g , which can be substantially smaller than D.
In the large size limit, ‖ s̃ ⊥ ‖ → ∞, and the manifolds are almost parallel to the margin planes, either due to being fully embedded (when v ⊥ = 0) or partially embedded when v ⊥ is close to zero. The interior regime is negligible, so the statistics are dominated by the embedded regimes. In fact, the fully embedded transition is given by t embed ≈ −κ, so that the fractional volume of the fully embedded regime is H(−κ) = ∫ −κ ∞ Dt 0 . The remaining summed probability of the touching and partially embedded regimes (k ≥ 1) is therefore H(κ). The radius R M also increases with r, but the ratio R M /r decreases with r because, for large r, the distribution of s̃ ⊥ has support in the interior of the convex hull (due to the contribution of the fully embedded regime).
For manifolds with large sizes, there are two main contributions to the capacity. The fully embedded regime contributes to the inverse capacity via Eq. (28), and the touching and partially embedded regimes contribute via Eq. (25). Combining these two contributions, we obtain for large sizes α M ⁻¹ = D + α 0 ⁻¹(κ), consistent with Eq. (8). The polar-constrained geometry also depends on the margin κ, through the dependence of v ⊥ and s̃ ⊥ on t 0 + κ. For a fixed t, Eq. (42) implies that a larger κ increases the probability of being in the embedded regimes, influencing the statistics of s̃ ⊥ . Also, increasing κ shrinks the magnitude of s̃ ⊥ according to Eq. (41). Hence, when κ ≫ 1, the capacity becomes similar to that of P random points and is given by α M ≈ κ⁻², independent of manifold geometry.

V. HIGH DIMENSIONAL MANIFOLDS
We expect that in many applications the affine dimension of the manifold, D, is large. High dimension can be reflected in the data having a large number of non-zero singular value decomposition (SVD) components. However, second-order statistics may not be sufficient to characterize generic manifolds, and we therefore define high-dimensional manifolds as manifolds where the manifold dimension is large, i.e., D M ≫ 1. Thus, in general, this regime depends upon both the shape and the size of the manifold, through their effect on D M . For high-dimensional manifolds, the classification capacity can be described in terms of the statistics of R M and D M alone. Our analysis below elucidates the interplay between size and dimension, namely, how small R M needs to be for high-dimensional manifolds to have a substantial classification capacity. Sufficient statistics for the mean-field solution: In the high-dimensional regime, the mean-field equations simplify due to self-averaging of terms involving sums of components t i and s̃ i . The two key quantities, t · s̃( t) and ‖ s̃( t)‖², that appear in the capacity, Eq. (27), can be approximated by their self-averaged values, t ⊥ · s̃ ⊥ ≈ κ M = R M √D M and ‖ s̃ ⊥ ‖² ≈ R M ², and we obtain for the capacity α M (κ) ≈ α 0 ((κ + κ M )/√(1 + R M ²)), where α 0 ⁻¹ is the inverse capacity of P random points (details in SM, Sec. S6). Note that in the case of ℓ2 balls with D ≫ 1 and radius R, we have shown previously that α B ≈ (1 + R²) α 0 (κ + R√D) [15]; Eq. (45) is consistent with this result with R M = R and D M = D. Scaling regime: In the scaling regime, where the manifold radius is small, the calculation of the capacity and the geometric properties is particularly simple. As shown in Sec. IV D, when the radius is small, the components of s̃ ⊥ are small, hence v ⊥ ≈ t ⊥ , and the Gaussian statistics for the geometry suffice: R M ≈ R g and D M ≈ D g . The capacity is simply α M (κ) ≈ α 0 (κ + κ g ), where κ g = R g √D g = ½ w(S) is equal to half the Gaussian mean width [21]. As for the support structure, since the manifold size is small, the only significant contributions arise from the interior (k = 0) and touching (k = 1) regimes.
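The point capacity α 0 and the high-dimensional approximation above are straightforward to evaluate numerically. The sketch below (function names are ours) uses the standard Gardner result α 0 ⁻¹(κ) = ∫ −κ ∞ Dt (t + κ)², with Dt the Gaussian measure, together with the mean-field substitution of an effective margin κ M = R M √D M and norm √(1 + R M ²):

```python
import numpy as np
from scipy.integrate import quad

def alpha0_inv(kappa):
    """Gardner inverse capacity of P random points at margin kappa:
    alpha_0^{-1}(kappa) = int_{-kappa}^{inf} Dt (t + kappa)^2."""
    gauss = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
    return quad(lambda t: gauss(t) * (t + kappa)**2, -kappa, np.inf)[0]

def alpha_manifold(kappa, R_M, D_M):
    """High-dimensional mean-field estimate: manifolds behave like points
    with effective margin kappa_M = R_M*sqrt(D_M) and norm sqrt(1+R_M^2)."""
    kappa_M = R_M * np.sqrt(D_M)
    return 1.0 / alpha0_inv((kappa + kappa_M) / np.sqrt(1 + R_M**2))

# Point limit: alpha_0(0) = 2, Cover's classic result.
print(1.0 / alpha0_inv(0.0))  # → 2.0 (up to quadrature error)
```

Increasing either R M or D M raises the effective margin and therefore lowers the capacity, as described in the text.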
Note that in the scaling regime, the factor proportional to 1 + R g ² in Eq. (46) is the next-order correction to the overall capacity. Beyond the scaling regime: When R g is not small, the manifold geometrical parameters R M and D M cannot be adequately described by the Gaussian statistics R g and D g . In this case, the manifold margin κ M = R M √D M is large, and Eq. (45) reduces to α M ≈ (1 + R M ²) κ M ⁻².

VI. STRICTLY CONVEX MANIFOLDS: ℓ2 ELLIPSOIDS

In this section, we consider the problem of binary classification of strictly convex manifolds described by D-dimensional ellipsoids under the ℓ2 norm. We assume here, and in the examples of the following sections, that the manifolds are centered, as defined at the beginning of Sec. IV. The ellipsoids can therefore be parameterized by the set of points x µ 0 + Σ i=1..D s i u µ i with Σ i=1..D (s i /R i )² ≤ 1. The components of the u µ i and of the ellipsoids' centers x µ 0 are i.i.d. Gaussian distributed with zero mean and standard deviation 1/√N, so that they are orthonormal in the large-N limit. The radii R i are the principal radii of the ellipsoids relative to the center.

A. Support function
For ellipsoids, the support function in Eq. (13) can be computed explicitly. For a vector v = ( v ⊥ , v 0 ) with non-zero v ⊥ , the support function g( v ⊥ ) is maximized by a point s ⊥ on the boundary of the ellipsoid. Maximizing v ⊥ · s ⊥ subject to the equality constraint in Eq. (48) yields g( v ⊥ ) = ‖ v ⊥ • R‖, where we denote by z • R the vector whose components are z i R i (see SM, Sec. S7 for details). For a given ( t ⊥ , t 0 ), the vector ( v ⊥ , v 0 ) is determined by Eq. (15), and the analytic solution above can be used to derive an explicit expression for s̃ ⊥ . The different solution regimes can be categorized as follows.
Interior regime: This regime holds for t 0 obeying the inequality in Eq. (39), with t touch ( t ⊥ ) = −g( t ⊥ ) = −‖ t ⊥ • R‖. Here λ = 0, resulting in zero contribution to the inverse capacity, F = 0. The subgradient is given by the boundary point of the ellipsoid s̃ i ⊥ ( t ⊥ ) = t i R i ²/‖ t ⊥ • R‖. Touching regime: The ellipsoid touches the margin plane at a single point in the range t touch < t 0 + κ < t embed defined in Eqs. (51) and (42). The subgradient is the boundary point of the ellipsoid given by the S-projection s̃ ⊥ , with v ⊥ = t ⊥ − λ s̃ ⊥ , where the parameter λ is determined by the condition that s̃ ⊥ lies on the boundary of the ellipsoid. In this regime, the contribution to the capacity is given by Eq. (25) with s̃ as in Eq. (53). When t 0 approaches the value t embed − κ, v ⊥ → 0, signalling the transition to the fully embedded regime.
Fully embedded regime: When t 0 + κ > t embed , we have v ⊥ = 0, v 0 = −κ, and λ = t 0 + κ, implying that the center as well as the entire ellipsoid is embedded in the margin plane. In this case, the S-projection of t is given by Eq. (41) which is located in the interior of the ellipsoid and gives a contribution to the capacity in Eq. (28).
Combining the contributions from the touching and embedded regimes, the capacity for ellipsoids can be written as a sum of the corresponding integrals, where the limits in the integrals are given by Eqs. (51) and (52).
The behavior of the capacity for 2D ellipsoids is illustrated in Fig. 7 for margins κ = 0 and κ = 0.5. In general, the capacity is lower for κ = 0.5 than for κ = 0. Along the diagonal, where R 1 = R 2 , the capacity is equivalent to that of ℓ2 balls; in particular, for large R 1 = R 2 the capacity for κ = 0 approaches 0.4. Along each of the axes, one of the ellipse's radii vanishes and the capacity approaches that of a line, e.g., α(R 1 → 0, R 2 , κ) = α 1 (R 2 , κ), where α 1 is the capacity of one-dimensional line segments with radius R 2 [15]. In particular, when R 2 is large, the capacity approaches 2/3.

B. High dimensional ellipsoids
It is instructive to apply the general analysis of high-dimensional manifolds to ellipsoids with D ≫ 1. We distinguish between different size regimes by assuming that all the radii of the ellipsoid are scaled by r, which controls the overall size of the ellipsoid without changing its shape. In the high-dimensional regime, due to self-averaging, the boundaries of the touching and embedding transitions can be approximated by constants, both independent of t ⊥ . Then, as long as the R i √D are not large (as described below), t embed → ∞ and the probability of embedding vanishes. Discounting the embedded regime, the manifold geometry is described by Eq. (53), where the normalization factor Z is determined by the boundary condition on s̃ ⊥ . The capacity for high-dimensional ellipsoids, Eq. (45), is determined by the manifold radius R M and the manifold dimension D M . We can also compute the covariance matrix of s̃ ⊥ . This matrix is diagonal in the basis corresponding to the principal directions of the ellipsoid, with eigenvalues λ i . The above geometrical measures are related to these eigenvalues by R M ² = Σ i λ i , while D M is equivalent to the participation ratio of a covariance matrix with eigenvalues λ i , D M = (Σ i λ i )²/Σ i λ i ² [23,24]. For a spherical ball, where all the eigenvalues are equal, D M = D; however, when one eigenvalue dominates, the dimension is D M ≈ 1. Note that R M and D M are not invariant to scaling the ellipsoid by a global factor r, reflecting the important role of the centers, which do not scale. We discuss several important limits below. Scaling regime: In the scaling regime, where the radii are small, the radius and dimension are equivalent to the Gaussian geometry, R g ² = Σ i R i ⁴/Σ i R i ² and D g = (Σ i R i ²)²/Σ i R i ⁴, and the effective margin is given by κ g = R g √D g = √(Σ i R i ²). In this regime, the dimension is given by the participation ratio of a covariance matrix whose eigenvalues are the ellipsoid squared radii R i ². D g is invariant to scaling of the radii by r, and R g ∝ r, as expected.
The capacity is given by α M ≈ (1 + R g ²) α 0 (κ + κ g ), and the ellipsoids are either interior or touching. Beyond the scaling regime: When R i = O(1), all the ellipsoids are touching manifolds, since t touch → −∞ and t embed → ∞. The capacity is small because κ M ≫ 1, and is given by α M ≈ (1 + R M ²) κ M ⁻², with R M and κ M given in Eqs. (59)-(60). When all R i ≫ 1, we have D M = D and α M ⁻¹ = D. The ellipsoids become fully embedded when R i ≈ √D or larger. In this case, the transition value t embed is of order one and the probability of embedding becomes significant.
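The scaling-regime expressions are closed-form in the radii, so they are easy to sanity-check. The helper below (its name is ours) evaluates R g , D g , and κ g from a list of radii: with equal radii the participation ratio returns the full affine dimension, while one dominant radius collapses it toward 1.

```python
import numpy as np

def ellipsoid_gaussian_geometry(R):
    """Scaling-regime (Gaussian) geometry of an ellipsoid with radii R_i:
    R_g^2 = sum R_i^4 / sum R_i^2, D_g = (sum R_i^2)^2 / sum R_i^4,
    and kappa_g = R_g * sqrt(D_g) = sqrt(sum R_i^2)."""
    R2 = np.asarray(R, dtype=float)**2
    D_g = R2.sum()**2 / (R2**2).sum()      # participation ratio of R_i^2
    R_g = np.sqrt((R2**2).sum() / R2.sum())
    kappa_g = R_g * np.sqrt(D_g)
    return R_g, D_g, kappa_g

# Equal radii: D_g equals the affine dimension D = 50.
_, D_g_ball, _ = ellipsoid_gaussian_geometry(np.full(50, 0.3))
# One dominant radius: D_g collapses toward 1.
_, D_g_dom, _ = ellipsoid_gaussian_geometry([10.0] + [0.01] * 49)
```

Note that D g is invariant under a global rescaling of the radii, whereas R g and κ g scale with it, matching the behavior described in the text.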

C. Numerical examples
We illustrate the behavior of high-dimensional ellipsoids with a bimodal distribution of radii. In Fig. 8, the properties of ellipsoids with radii R i = r for 1 ≤ i ≤ 95 and R i = 0.1r for 96 ≤ i ≤ 100 are shown as a function of the overall scale r. Fig. 8(a) shows that the mean-field theory agrees with the spherical approximation with the corresponding R M and D M . As seen in (b)-(d), the system is in the scaling regime for r < 0.3. In this regime, the manifold dimension is constant and equals D g ≈ 14, as predicted by the participation ratio, Eq. (63). In this regime, the manifold radius R g is linear in r, as expected from Eq. (62). The ratio R g /r ≈ 0.9 is close to unity, indicating that in the scaling regime the system is dominated by the largest radii. For r > 0.3, the ellipsoid margin becomes larger than 1 and the system's properties become increasingly affected by the full dimension of the ellipsoid, as seen by the marked increase in dimension as well as a decrease in R M /r. For larger r, D M approaches D and α M ⁻¹ = D. Fig. 8(e1)-(e3) shows the distributions of the embedding dimension 0 ≤ k ≤ D + 1. In the scaling regime, the interior and touching regimes each have probability close to ½ and the embedded regime is negligible. As r increases beyond the scaling regime, the interior probability decreases and the solution is almost exclusively in the touching regime. For very high values of r, the embedded support solutions gain a substantial probability. Note that the capacity decreases to approximately 1/D at a value of r below that at which a substantial fraction of solutions are embedded. In this case, the touching ellipsoids all have a very small angle with the margin plane.
We also computed the capacity for ellipsoids with a more realistic distribution of radii. As an example, we have taken a class of images from ImageNet [25] and analyzed the SVD spectrum of the representations of these images in the last readout layer of a deep convolutional network, GoogLeNet [26]. The computed radii are shown in Fig. 9(a) and are scaled by an overall factor r in our analysis. Because of the fall-off in the distribution of radii, the Gaussian dimension of the ellipsoid is only about D g ≈ 15, much smaller than the affine dimension D = 1023. As r increases above r ≈ 0.03, κ M becomes larger than 1 and the solution leaves the scaling regime, resulting in a rapid increase in D M and a rapid fall-off in capacity, as shown in Fig. 9(b)-(c). Finally, for very large r ≈ 10, we have α M ⁻¹ ≈ D M ≈ D, approaching the lower bound for capacity, as expected.

VII. CONVEX POLYTOPES: ℓ1 BALLS
The family of ℓ2 ellipsoids is a prototypical example of manifolds whose convex hulls are strictly convex, with boundaries consisting only of extreme points and no higher-dimensional faces. This implies that ℓ2 ellipsoids can display only three types of support: interior, touching at a single point, or fully embedded. There are, however, other types of manifolds whose convex hulls are not strictly convex. In this section, we consider manifolds that are convex polytopes, formed by the convex hulls of finite numbers of points in R^N . A particularly simple example is a D-dimensional ℓ1 ellipsoid, parameterized by radii {R i } and specified by the convex set of points x µ 0 + Σ i=1..D s i u µ i with Σ i=1..D |s i /R i | ≤ 1. Each manifold M µ ∈ R^N is centered at x µ 0 and is a convex polytope with a finite number (2D) of vertices: { x µ 0 ± R k u µ k , k = 1, ..., D}. The vectors u µ i specify the principal axes of the ℓ1 ellipsoid. For simplicity, we consider ℓ1 balls, where all the radii are equal: R i = r. We concentrate on the case where the ℓ1 balls are high-dimensional; ℓ1 balls with D = 2 were briefly described in [15].
High-dimensional ℓ1 balls, scaling regime: In the scaling regime, we have v ≈ t, and we can write the solution for the subgradient as s̃ ⊥ ( t ⊥ ) = r sgn(t k ) u k , where k = arg max i |t i |. In other words, s̃ ⊥ ( t ⊥ ) is the vertex of the polytope corresponding to the component of t ⊥ with the largest magnitude. The components of t ⊥ are i.i.d. Gaussian random variables, and for large D, the maximum component is concentrated around √(2 log D); hence D g = 2 log D, which is much smaller than D. This result is consistent with the fact that the Gaussian mean width of a D-dimensional ℓ1 ball scales with √(log D) and not with √D [21]. Since all the vertices have norm r, we have R g = r, and the effective margin is then given by κ M = r √(2 log D), which is of order one in the scaling regime. In this regime, the capacity is given by the simple relation α M = (1 + r²) α 0 (κ + κ M ).
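This leading-order estimate is easy to check by Monte Carlo (the variable names below are ours): since the touching point is the vertex aligned with the largest-magnitude component of t ⊥ , the Gaussian dimension is just the mean squared maximum of D i.i.d. Gaussians, which approaches 2 log D from below.

```python
import numpy as np

# Monte Carlo check of the scaling-regime geometry of a high-dimensional
# l1 ball: the touching point is the vertex aligned with the component of
# t of largest magnitude, so D_g = <max_i t_i^2> ~ 2 log D (leading order).
rng = np.random.default_rng(1)
D, r, n = 1000, 0.5, 20000
T = rng.standard_normal((n, D))
k = np.argmax(np.abs(T), axis=1)      # index of the largest |component|
t_max = T[np.arange(n), k]            # its signed value; s~ = r sgn(t_k) u_k
R_g = r                               # every vertex has norm r
D_g = np.mean(t_max**2)               # Gaussian manifold dimension estimate
```

At finite D the estimate sits somewhat below 2 log D because of the well-known slow (log log D) corrections to the Gaussian maximum; only the leading scaling is asserted here.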
High-dimensional ℓ1 balls, r = O(1): When the radius r is small, as in the scaling regime, the only contributing solution is the touching solution (k = 1). When r increases, solutions with all values of k, 1 ≤ k ≤ D + 1, occur, and the support can be any of the faces of the convex polytope with dimension k. As r increases, the probability distribution p(k) of the solution shifts to larger values of k. Finally, for large r, only two regimes dominate: fully embedded (k = D + 1) with probability H(−κ), and partially embedded with k = D with probability H(κ).
For convex polytopes at zero margin, κ = 0, the capacity can be related to the probability distribution p(k) through α⁻¹ = Σ k=1..D+1 p(k) k [27], so the shift to higher-dimensional support is related to a decrease in capacity. Unfortunately, this simple relationship does not hold for smooth manifolds or for convex polytopes with non-zero margin, κ ≠ 0.
Numerical examples: We illustrate the behavior of ℓ1 balls with radius r and affine dimension D = 100. Fig. 10(a) shows the linear classification capacity as a function of r. When r → 0, the manifold capacity approaches the point capacity, α M ≈ 2, and when r → ∞, α M ≈ 1/D = 0.01. As the size grows, the distribution of support dimensions shifts upward, and finally, for very large manifolds (d3), most polytope manifolds are nearly fully embedded.

VIII. ORIENTATION MANIFOLDS
An important example of an object manifold is the set of neuronal responses to an object subject to a one-dimensional rotation. Here we consider the simplest case of the rotation of an image, parameterized by an orientation angle θ. We model the neuronal responses as smooth periodic functions of θ, which can be parameterized in terms of Fourier modes in s = ( s ⊥ , 1) ∈ S, with perpendicular components (R n cos nθ, R n sin nθ), where R n is the magnitude of the n-th Fourier component for 1 ≤ n ≤ D/2. The neural responses in Eq. (1) are determined by projecting onto the corresponding basis vectors, with centers at u µ 2D+1 . The parameters θ µ n are the preferred orientation angles of the corresponding neurons and are assumed to be evenly distributed between −π ≤ θ µ n ≤ π. Our statistical analysis assumes that the different manifolds are randomly positioned and oriented with respect to each other. For this orientation manifold model, this implies that the mean responses u µ 2D+1 are independent random Gaussian vectors and that the preferred orientation angles θ µ n are uncorrelated. The neuronal tuning curves are unimodal and symmetric, since the Fourier components have zero phase relative to each other. We could also have chosen random phases for the Fourier components, which would have resulted in asymmetric tuning curves. With this definition of the orientation manifold, all the vectors s ⊥ ∈ S obey the normalization ‖ s ⊥ ‖ = r, where r² = Σ n=1..D/2 R n ². The orientation manifolds are thus smooth closed curves, parameterized by a single angle θ, that lie on the surface of a D-dimensional sphere with radius r. For the simplest case of D = 2, the orientation manifold is equivalent to a circle in two dimensions. However, for larger D, the manifold is not convex, and its convex hull is composed of faces with varying dimensions. In Fig. 11, we investigate the geometrical properties of these manifolds relevant for classification as a function of the overall scale factor r, where for simplicity we have chosen R n = r for all n. The most striking feature is the small dimension in the scaling regime, scaling roughly as D M ≈ 2 log D. This dependence is similar to that of the ℓ1 ball polytope in the previous section. Thus, we see that as r increases, D M increases dramatically from 2 log D to D.
The similarity of the orientation manifold's convex hull to a convex polytope is also seen in the embedding dimension k of the manifolds. Support faces of dimension k = 0, 1, ..., D + 1 are seen, implying the presence of partially embedded solutions. Interestingly, D/2 < k ≤ D are excluded, indicating that the maximal face dimension of the convex hull is D/2. Each face is the convex hull of a set of k ≤ D/2 points, where each point is drawn from a pair of Fourier harmonics. The orientation manifolds are closely related to the trigonometric moment curve, whose convex hull geometrical properties have been extensively studied [28,29].
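The closed-curve structure of these manifolds is easy to reproduce numerically. The sketch below (helper name ours) builds the Fourier parameterization described above and checks that every point lies on a sphere of radius √(Σ n R n ²):

```python
import numpy as np

def orientation_manifold(theta, R):
    """Points s_perp(theta) of the orientation manifold: one pair of
    coordinates (R_n cos(n*theta), R_n sin(n*theta)) per Fourier harmonic."""
    n = np.arange(1, len(R) + 1)
    ang = np.outer(theta, n)                 # shape (num_angles, D/2)
    return np.concatenate([R * np.cos(ang), R * np.sin(ang)], axis=1)

R = np.full(10, 0.5)                         # D = 20, equal harmonics
S = orientation_manifold(np.linspace(-np.pi, np.pi, 500), R)
norms = np.linalg.norm(S, axis=1)            # constant: sqrt(sum R_n^2)
```

Because cos²(nθ) + sin²(nθ) = 1 for every harmonic, the norm of each point is exactly √(Σ n R n ²), making the manifold a closed curve on a D-dimensional sphere as stated in the text.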
As expected, the classification capacity of the orientation manifolds drops as the size r relative to the mean activity increases. An interesting question is the effect of the tuning width on the classification capacity. To explore this question, we have modeled a family of tuning curves in which the Fourier components R n have a Gaussian fall-off with n, with a normalization factor A such that Σ n R n = f 0 . These tuning curves are characterized by two parameters: σ, which controls the width of the tuning, and f 0 , which controls the modulation amplitude; example tuning curves are shown in Fig. 12(a). Note that in the large-N limit, n is unbounded and σ determines the number of non-zero components, see below.
In Fig. 12(b), we study the effect of the width σ on the capacity. At high modulation, we find that the classification capacity monotonically increases with tuning width. This is consistent with the fact that the Fisher information of the orientation angle decreases with the tuning width [30]. Interestingly, for moderate or weak tuning modulation, e.g., f 0 = 1, the capacity decreases with increasing tuning width. This demonstrates the complex balance between size and dimension in determining the capacity (further details in SM, Sec. S8). We also examine the dependence of R M and D M on the tuning width in Fig. 12(c)-(d). D M decreases with tuning width, since for large widths there are only a few dominant Fourier components. On the other hand, R M increases with the tuning width. This is because more power is concentrated in the low-frequency Fourier components when the overall size, given by the sum of the R n , is held fixed. For narrower tuning, the embedded regimes also contribute to decreasing R M , as explained in Sec. IV. Finally, we examine in more detail the scaling of the manifold dimension with the tuning-curve width as σ → 0. A general argument based on SVD decompositions suggests that the manifold dimension scales inversely with the tuning width (for σ ≪ 1) [31]. Consistent with this argument, the participation ratio for this system (see Sec. VI B and SM, Sec. S9) is inversely related to σ for small σ, as seen in Fig. 12(e). Surprisingly, the manifold dimension does not scale with 1/σ. For instance, the Gaussian dimension D g , shown in Fig. 12(f), appears to be inversely related to σ^(1/3), and not to σ, reflecting the complexity of the underlying convex hull.

IX. MANIFOLDS WITH SPARSE LABELS
So far, we have assumed that the number of manifolds with positive labels is approximately equal to the number with negative labels. In this section, we consider the case where the two classes are unbalanced, such that the number of positively labeled manifolds is far smaller than the number of negatively labeled manifolds (the opposite scenario is equivalent). We define the sparsity parameter f as the fraction of positively labeled manifolds, so that f = 0.5 corresponds to balanced labels. From the theory of the classification of a finite set of random points, it is known that sparse labels with f ≪ 0.5 can drastically increase the capacity [12]. In this section, we investigate how sparsity of the manifold labels improves the manifold classification capacity. If the separating hyperplane is constrained to go through the origin and the distribution of inputs is symmetric around the origin, the labeling y µ is immaterial to the capacity. Thus, the effect of sparse labels is closely tied to having a non-zero bias. We therefore consider inequality constraints of the form y µ ( w · x µ − b) ≥ κ, and define the bias-dependent capacity of general manifolds with label sparsity f , margin κ, and bias b as α M (κ, f, b). Next, we observe that the bias acts as a positive contribution to the margin for the positively labeled population and as a negative contribution for the negatively labeled population. Thus, α M ⁻¹(κ, f, b) = f α M ⁻¹(κ + b) + (1 − f ) α M ⁻¹(κ − b), where α M (x) is the classification capacity with zero bias (and hence equivalent to the capacity at f = 0.5) for the same manifolds. Note that Eq. (71) is similar to Eq. (29) for mixtures of manifolds. The actual capacity with sparse labels is given by optimizing the above expression with respect to b, i.e., α M (κ, f ) = max b α M (κ, f, b). In the following, we consider for simplicity the effect of sparsity at zero margin, κ = 0. Small manifolds are expected to have a capacity that increases with decreasing f as α M (0, f ) ∝ 1/(f |log f |), similar to P uncorrelated points.
On the other hand, when the manifolds are large, the solution has to be orthogonal to the manifold directions, so that α M (0, f ) ≈ 1/D. Thus, the geometry of the manifolds plays an important role in controlling the effect of sparse labels on capacity.
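The bias optimization above can be carried out numerically for the point limit. The sketch below (function names are ours; the forms are the Gardner-theory expressions quoted in the text) averages the inverse capacities of the two populations at margins κ ± b and maximizes over b:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def alpha0_inv(kappa):
    # Gardner inverse capacity of random points at margin kappa.
    gauss = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
    return quad(lambda t: gauss(t) * (t + kappa)**2, -kappa, np.inf)[0]

def alpha_sparse_points(f, kappa=0.0):
    """Capacity of sparsely labeled random points: the bias b adds to the
    margin of the minority (+) class and subtracts from the majority class;
    the capacity is then optimized over b."""
    inv = lambda b: f * alpha0_inv(kappa + b) + (1 - f) * alpha0_inv(kappa - b)
    res = minimize_scalar(inv, bounds=(0.0, 20.0), method='bounded')
    return 1.0 / res.fun

# Balanced labels recover alpha_0(0) = 2; sparse labels boost capacity.
print(alpha_sparse_points(0.5), alpha_sparse_points(1e-3))
```

At f = 0.5 the optimal bias is zero by symmetry, while for f ≪ 1 the optimal bias grows and the capacity increases roughly as 1/(f |log f |), as stated above.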

A. Sparsely labeled 2 balls
We first investigate the effect of sparse labels for ℓ2 balls with dimension D, radius r, and f ≪ 1. For these manifolds, the full mean-field equations are given in Appendix B. These equations reveal the following regimes. Small-r regime: For r < 1, the ℓ2 ball capacity α B (f, r, D) is equivalent to the capacity of sparsely labeled random points with an effective margin given by r√D, where the capacity of sparsely labeled points is α 0 (κ, f ) = max b α 0 (κ, f, b) from the Gardner theory. Large-r regime: For r ≳ 2, the dependence of the capacity on f and r is only through the scaled sparsity parameter f̃ = f (1 + r²). The capacity as a function of f̃ and D is given by Eq. (74). For f̃ ≪ 1, the capacity grows roughly as 1/(f̃ |log f̃ |), with a proportionality constant depending on D. For large f̃ , α B ≈ 1/D, as expected. The reason for the scaling of f with r² is as follows. When the labels are sparse, the dominant contribution to the inverse capacity comes from the minority class. On the other hand, the optimal value of b depends on the balance between the contributions from both classes and scales linearly with r, as it needs to overcome the local fields from the spheres. Thus, α B ⁻¹ ∝ f r²; we have introduced the factor 1 + r² instead of r² in our scaled variables f̃ and b̃ for a smoother crossover from small to large r. Fig. 13 illustrates these regimes for ℓ2 balls with D = 100 and D = 10, respectively. The capacity is calculated for a sparsity range 10⁻⁴ ≤ f ≤ 10⁻¹ and radii 10⁻² ≤ r ≤ 10³, and is plotted as a function of the scaled sparsity parameter f̃ = f (1 + r²). As predicted by Eq. (73), for small values of r, the capacity and the scaled optimal bias depend upon both f and r, seen as "vertical branches" in Fig. 13(a)-(b). When r ≳ 2, the capacity follows a universal curve, described by Eq. (74), that depends only upon the scaled parameter f̃ . For values of f̃ smaller than 1, even for large r, the capacity is substantially enhanced due to sparsity.
For large values of f̃, the capacity eventually drops to 1/D, as expected. Fig. 13 also shows the remarkable agreement between the full mean field theory (solid lines) and the approximations of Eqs. (73) and (75) in both the small r and large r regimes (dotted lines).
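The small r regime above reduces to the Gardner capacity of sparsely labeled points with an effective margin. As a minimal numerical sketch, one can evaluate α₀(κ, f) = max_b α₀(κ, f, b) by direct quadrature. The inverse-capacity form used below, with the minority class facing margin κ + b and the majority class κ − b, is our reading of the standard Gardner theory with bias, not a formula quoted from the text:

```python
import numpy as np

def inv_alpha0(kappa, f, b, n_grid=20001):
    # Assumed Gardner-style inverse capacity with bias b:
    #   1/alpha0 = f <(kappa + b - t)_+^2> + (1 - f) <(kappa - b - t)_+^2>,
    # t standard Gaussian; minority (+1) fraction f pays ~ b^2 for large b.
    t = np.linspace(-12.0, 12.0, n_grid)
    dt = t[1] - t[0]
    gauss = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
    minority = np.clip(kappa + b - t, 0.0, None) ** 2
    majority = np.clip(kappa - b - t, 0.0, None) ** 2
    return dt * np.sum(gauss * (f * minority + (1.0 - f) * majority))

def alpha0(kappa, f):
    # Capacity alpha0(kappa, f) = max_b alpha0(kappa, f, b), b scanned on a grid.
    return max(1.0 / inv_alpha0(kappa, f, b) for b in np.linspace(0.0, 8.0, 401))

print(round(alpha0(0.0, 0.5), 2))  # → 2.0  (balanced labels: Gardner's classic result)
print(alpha0(0.0, 0.01))           # sparse labels: roughly an order of magnitude larger
```

With balanced labels the optimum sits at b = 0 and the classic capacity α = 2 is recovered; at f = 0.01 the optimal bias is large and the capacity is far higher, illustrating the sparsity enhancement discussed above.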

B. Sparsely labeled general manifolds
A remarkable result for sparsely labeled manifolds with general geometries is that their capacity is well approximated by a sparse spherical approximation, in which the capacity is given by Eqs. (73) and (75) for ℓ2 balls with dimension D = D_g and radius r = R_g, where R_g and D_g are the Gaussian radius and dimension of the manifold defined in Sec. IV. Interestingly, even though R_g is not necessarily small, we find that the relevant sufficient statistics with sparse labels are given by the Gaussian geometry rather than by the polar-constrained values D_M and R_M. The reason is that for small f, the bias is large. In that case, the positively labeled manifolds have a large positive margin b and are fully embedded, contributing b² to the inverse capacity regardless of their detailed geometry. On the other hand, the negatively labeled manifolds have a large negative margin and exist only in the interior or touching regimes; hence, their geometry is well approximated by the Gaussian quantities D_g and R_g.
The equivalence between general manifolds and balls with parameters R_g and D_g holds as long as the scaled optimal bias b̃ = b/√(1 + R_g²) ≫ 1, which occurs when the scaled sparsity is less than 1. Thus, we conclude that for all geometries with f̃ = f(1 + R_g²) ≪ 1, the capacity is well approximated by the sparse spherical approximation, Eq. (78): α_M(f) ≈ α_B(f, R_g, D_g). When f̃ increases above 1, b̃ is small and the capacity is affected by the detailed geometry of the manifold. In particular, as f̃ ≫ 1, the capacity of the manifold approaches α_M(f) → 1/D, not 1/D_g.
To demonstrate the theory on general manifolds, we have evaluated the capacity of two classes of manifolds, orientation manifolds and ℓ1 ellipsoids, as shown in Fig. 14. For the first example, we have used the orientation manifold model (Sec. VIII). Fig. 14(a) shows that the capacity evaluated by the full mean field theory (solid lines), as well as by numerical simulations (markers), agrees well with the sparse spherical approximation (dotted) for f̃ < 1. Note the similarity with the capacity of the ℓ2 balls (Fig. 13) in both the small and large R_g regimes. The spherical approximation, Eq. (78), breaks down for f̃ > 1, with the true capacity rapidly decreasing below the saturation value of the approximation, 1/D_g, at large f̃.
For the example of D = 100 ℓ1 ellipsoids, we used two different values of the radii: D₁ = 10 components with R₁ = r and D − D₁ components with R₂ = ½r. This choice is interesting because, in this case, D_g is roughly 2 log D₁, much smaller than D. Moreover, R_g is close to r, rather than to the average value of the R_i, which is close to ½r. Fig. 14(b) shows the capacity as a function of f̃ (obtained by varying f and r). For the entire range of f̃ below 1, the spherical approximation agrees reasonably well with the full mean field theory as well as with simulations. As expected, when f̃ > 1, the true capacity rapidly decreases and saturates at 1/D, rather than at the 1/D_g limit of the spherical approximation.
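The Gaussian radius and dimension that enter the sparse spherical approximation can be estimated directly by Monte Carlo. The sketch below assumes the definitions R_g² = ⟨‖s̃(t)‖²⟩ and D_g = ⟨(t·ŝ(t))²⟩, with s̃(t) the manifold point of maximal overlap with a Gaussian vector t (our reading of Sec. IV); for an ℓ1 ball it reproduces the weak, logarithmic growth D_g ≈ 2 log D noted for faceted geometries:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_geometry(argmax_s, D, n_samples=20000):
    # Monte Carlo estimate of (R_g, D_g), assuming the definitions
    #   R_g^2 = <||s~(t)||^2>_t,  D_g = <(t . s^(t))^2>_t,  s^ = s~/||s~||,
    # where s~(t) = argmax_{s in S} s . t and t is a standard Gaussian vector.
    T = rng.standard_normal((n_samples, D))
    S = np.array([argmax_s(t) for t in T])
    norms = np.linalg.norm(S, axis=1)
    Rg = np.sqrt(np.mean(norms ** 2))
    Dg = np.mean((np.sum(T * S, axis=1) / norms) ** 2)
    return Rg, Dg

D, r = 200, 1.0

# l2 ball of radius r: s~(t) = r t/||t||, so R_g = r and D_g = <||t||^2> = D.
Rg2, Dg2 = gaussian_geometry(lambda t: r * t / np.linalg.norm(t), D)

# l1 ball of radius r: s~(t) is the vertex aligned with the largest |t_i|,
# so R_g = r while D_g = <max_i t_i^2> grows only as ~ 2 log D.
def l1_argmax(t):
    s = np.zeros_like(t)
    i = int(np.argmax(np.abs(t)))
    s[i] = r * np.sign(t[i])
    return s

Rg1, Dg1 = gaussian_geometry(l1_argmax, D)
print(Rg2, Dg2)  # R_g = r, D_g close to D = 200 for the l2 ball
print(Rg1, Dg1)  # R_g = r, D_g of order 2 log D, far below D, for the l1 ball
```

The same routine applies to any manifold for which the support point s̃(t) can be computed, e.g., the ℓ1 ellipsoids of Fig. 14(b), where the vertex search runs over the weighted components R_i t_i.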
As discussed above, in the regimes of high and moderate sparsity, a large bias alters the geometry of the two classes in different ways. To illustrate this important aspect, we show in Fig. 15 the effect of sparsity and bias on the geometry of the ℓ1 ellipsoids studied in Fig. 14(b). Here we show the evolution of R_M and D_M for the majority and minority classes as f̃ increases. Note that the manifolds have the same shape; nevertheless, their polar-constrained geometry depends on both class membership and sparsity level, because these measures depend on the margin. When f̃ is small, the minority class has D_M ≈ D, as seen in Fig. 15(c); i.e., the minority-class manifolds are close to being fully embedded due to the large positive margin. This can also be seen in the distributions of embedding dimension shown in Fig. 15(a)-(b). On the other hand, the majority class has D_M ≈ D_g = 2 log D₁ ≪ D, and these manifolds are mostly in the interior regime. As f increases, the geometrical statistics of the two classes become more similar. This is seen in Fig. 15(c)-(d), where D_M and R_M for both the majority and minority classes converge to the zero-margin values at large f̃ = f(1 + R_g²).

Summary:
We have developed a statistical mechanical theory for the linear classification of points organized in perceptual manifolds, where all points in a manifold share the same label. The notion of perceptual manifolds is critical in a variety of contexts in computational neuroscience modeling and in signal processing. Our theory is not restricted to manifolds with smooth or regular geometries; it applies to any compact subset of a D-dimensional affine subspace. Thus, the theory is applicable to manifolds arising from any variation in neuronal responses with a continuously varying physical variable, or from any sampled set arising from experimental measurements on a limited number of stimuli.
The theory describes the capacity of a linear classifier to separate a dichotomy of general manifolds (with a given margin) by a universal set of mean field equations. These equations may be solved analytically for simple geometries, but for more complex geometries we have developed iterative algorithms to solve the self-consistent equations. The algorithms are efficient and converge quickly, since they involve solving for only O(D) variables per manifold, rather than simulating the full system of P manifolds embedded in R^N.
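The claim that only O(D) variables per manifold are needed can be illustrated with a toy solver. The sketch below (hypothetical code, not the authors' algorithm) estimates the inverse capacity as ⟨F(t)⟩_t with F(t) = min_v {‖v − t‖² : v·s ≥ κ for all manifold points s}, solving each small constrained problem with SciPy's SLSQP; a manifold shrunk to a single unit-norm point recovers Gardner's α = 2:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def F(t, S, kappa=0.0):
    # F(t) = min_v ||v - t||^2  subject to  v . s >= kappa for every point s
    # (rows of S) of the manifold: a small O(D) problem, one per manifold.
    cons = [{"type": "ineq", "fun": (lambda v, s=s: v @ s - kappa)} for s in S]
    res = minimize(lambda v: float(np.sum((v - t) ** 2)), x0=t.copy(),
                   method="SLSQP", constraints=cons)
    return res.fun

def inv_capacity(S, kappa=0.0, n_samples=2000):
    # Monte Carlo estimate of 1/alpha_M(kappa) = <F(t)>_t.
    d = S.shape[1]
    return float(np.mean([F(rng.standard_normal(d), S, kappa)
                          for _ in range(n_samples)]))

# Sanity check: a manifold consisting of a single unit-norm point gives
# 1/alpha = <t_1^2 theta(-t_1)> = 1/2, i.e. the Gardner point capacity alpha = 2.
S_point = np.array([[1.0, 0.0, 0.0]])
print(1.0 / inv_capacity(S_point))  # close to 2
```

For point-cloud manifolds the same routine applies with more rows in S; the cost stays polynomial in D and the number of sample points, independent of N and P.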
The main goal of our work has been to characterize the dependence of the separation capacity, and of the nature of the maximum-margin separating hyperplane, on key features of the geometry of the convex hull of the manifold. In the limit of small manifolds, or of manifolds with sparse labels, we have shown how the capacity can be statistically described by the Gaussian radius R_g and Gaussian dimension D_g. These geometric measures are induced by choosing points on the boundary of the manifold that have maximum projection onto Gaussian random vectors t. The Gaussian geometry is related to well-known size measures in the theory of convex bodies, and differs from standard second-order statistics based upon SVD analysis with a uniform distribution. In particular, we show how the Gaussian dimension D_g can be much smaller than either the affine dimension D or the dimension computed from SVD.
When the manifolds are large and not sparsely labeled, linear separability is not well described by the Gaussian geometry. Instead, we have introduced new geometrical quantities: the polar-constrained manifold radius R_M and dimension D_M, which take the geometry of the manifold cone and its polar cone into account. Unlike the Gaussian statistics, the polar-constrained statistics depend upon the relative scale r between the manifolds and the size of their centers. In particular, as r increases from zero to infinity, D_M increases from D_g to the full affine dimension D; the latter reflects the fact that for infinitely large manifolds, the solution needs to be orthogonal to all D directions spanned by the manifold relative to its center. Interestingly, we show that the capacity of high dimensional manifolds is well approximated by that of ℓ2 balls with dimension D_M and radius R_M. We also explained how the various manifolds are positioned relative to the separating margin planes at the max-margin solution. There are interior manifolds that do not contribute to the overall capacity, manifolds that touch a margin plane at a single point, and fully embedded manifolds that are wholly contained within a margin plane. For manifold convex hulls that are not strictly convex, there are also partially embedded support manifolds, which display the richness of the convex hull geometry through the existence of embedded k-dimensional faces.
We have applied the general theory to some prototypical examples. The first is a set of manifolds described by ℓ2 ellipsoids. These manifolds are analogous to a generative model consisting of a mixture of low-dimensional Gaussians, but with bounded extent so that the ellipsoids have zero overlap. These manifolds are strictly convex, and we show how the interior, touching, and fully embedded support regimes contribute to the overall capacity as a function of the principal radii of the ellipsoids.
We have also applied the theory to manifolds with nonsmooth convex hulls. One example is ℓ1 balls, which exhibit a polytope geometry. The other example is orientation manifolds, which are parameterized by a single angular variable. Because of the faceted nature of these convex hulls, the Gaussian random vector t induces a distribution over the embedding dimensions, k, of the support faces, an interesting signature of the geometry of the convex hull. Increasing the overall size of the manifolds shifts the distribution of k towards higher values, as the manifolds become increasingly embedded in the margin planes. Despite the qualitative differences in the underlying geometries of the two examples, both exhibit a weak, logarithmic growth of the Gaussian dimension with the affine dimension, D, reflecting the large anisotropy in the shape of their convex hulls.
We have extended the theory to label imbalance, where the vast majority of the manifolds belong to a single class, which can be taken as y^µ = −1 without loss of generality.
In the extreme limit where only one manifold is positively labeled, the task is related to object recognition, since the classifier is trained to indicate the presence of a single object. Combining sparse classifiers to solve multi-class recognition problems would be an interesting extension of the present work. We show that classification with sparse labels enjoys a higher capacity, or larger margins at the same load, compared to balanced classes. The enhanced capacity is particularly important for large, high dimensional manifolds, as their balanced capacity is quite small. Our analysis indeed reveals an important interplay between label sparsity and manifold geometry in determining the classification capacity.
We have elucidated the major effects of the geometrical properties of the manifold convex hulls on their classification capacity. However, to apply the theory quantitatively to experimental data, several outstanding issues need to be addressed, including: Correlations: In the present work we have assumed that the directions of the affine subspaces of the different manifolds are uncorrelated. In realistic situations we expect to see correlations in the manifold geometries, mainly of two types. One is center-center correlations; such correlations can be harmful for linear separability [32,33]. Another is correlated variability, in which the directions of the affine subspaces are correlated but the centers are not. Positive correlations of the latter form are beneficial for separability. In the extreme case where the manifolds share a common affine subspace, the rank of the union of the subspaces is D_tot = D rather than D_tot = PD, and the solution weight vector need only lie in the null space of this smaller subspace. Further work is needed to extend the present theory to incorporate more general correlations.
Generalization performance: We have studied the separability of manifolds with known geometries. In many realistic problems, this information is not readily available and only samples reflecting the natural variability of input patterns are provided. These samples can be used to estimate the underlying manifold model (using manifold learning techniques [34,35]) and/or to train a classifier based upon a finite training set. Generalization error [36] describes how well a classifier trained on a finite number of samples performs on other test points drawn from the manifolds. It would be important to extend our theory to calculate the expected generalization error achieved by the maximum margin solution trained on point cloud manifolds, as a function of the size of the training set and the geometry of the underlying full manifolds.
Unrealizable classification: Throughout the present work, we have assumed that the manifolds are separable by a linear classifier. In realistic problems, the load may be above the capacity for linear separation, i.e., α > α_M(κ = 0). Alternatively, neural noise may cause the manifolds to be unbounded in extent, with the tails of their distributions overlapping, so that they are not separable with zero error. There are several ways to handle this issue in supervised learning problems. One possibility is to map the unrealizable representation nonlinearly to a higher dimensional feature space, via a multi-layer network or a nonlinear kernel function, where the classification can be performed with zero error. The design of multilayer networks could be facilitated using the manifold processing principles uncovered by our theory.
Another possibility is to introduce an optimization problem allowing a small training error, for example, using an SVM with complementary slack variables [14]. These procedures raise interesting theoretical challenges, including understanding how the geometry of manifolds changes as they undergo nonlinear transformations, as well as investigating, via statistical mechanics, the performance of a linear classifier of manifolds with slack variables [37].
Concluding remarks: The statistical mechanical theory of perceptron learning has long provided a basis for understanding the performance and fundamental limitations of single layer neural architectures and their kernel extensions. However, the previous theory considered only a finite number of random points with no underlying geometric structure, and could not explain the performance of linear classifiers on large, possibly infinite, sets of inputs organized into distinct manifolds by the variability due to changes in the physical parameters of objects. The statistical mechanical theory presented in this work explains the capacity and limitations of linear classification of general manifolds, and can be used to elucidate changes in neural representations across hierarchical sensory systems. We believe the application of this theory and its extensions will yield novel insights into how perceptual systems, biological or artificial, can efficiently code and process sensory information.
APPENDIX A: REPLICA THEORY OF MANIFOLD CAPACITY
In this section, we outline the derivation of the mean field replica theory summarized in Eqs. (11)-(12). We define the capacity of linear classification of manifolds, α_M(κ), as the maximal load, α = P/N, for which, with high probability, a solution to y^µ w·x^µ ≥ κ exists for a given κ. Here the x^µ are points on the P manifolds M^µ, Eq. (1), and we assume that all NP(D + 1) components of {u_i^µ} are drawn independently from a Gaussian distribution with zero mean and variance 1/N, and that the binary labels y^µ = ±1 are randomly assigned to each manifold with equal probabilities. We consider the thermodynamic limit where N, P → ∞, but α = P/N and D are finite.
Note that the geometric margin, κ_g, defined as the distance from the solution hyperplane, obeys y^µ w·x^µ ≥ κ_g ‖w‖ = κ_g √N. However, this distance depends on the scale of the input vectors x^µ; the correct scaling of the margin in the thermodynamic limit is κ_g = (‖x‖/√N) κ. Since we adopted the normalization ‖x^µ‖ = O(1), the correctly scaled margin constraint is y^µ w·x^µ ≥ κ.
Evaluation of solution volume: Following Gardner's replica framework, we first consider the volume V of the solution space for α < α_M(κ). We define the signed projections of the i-th direction vector u_i^µ on the solution weight vector as h_i^µ = √N y^µ w·u_i^µ, where i = 1, ..., D + 1 and µ = 1, ..., P. The separability constraints can then be written as Σ_{i=1}^{D+1} s_i h_i^µ ≥ κ for all s ∈ S. Hence the volume can be written as
V = ∫ d^N w δ(‖w‖² − N) Π_{µ=1}^{P} Θ( min_{s∈S} h^µ·s − κ ),
where Θ(·) is the Heaviside step function. The constraint may equivalently be expressed through the support function of S, defined in Eq. (13) as g_S(v) = max { v·s | s ∈ S }, since min_{s∈S} h·s = −g_S(−h).
The volume defined above depends on the quenched random variables u_i^µ and y^µ through h_i^µ. It is well known that, in order to obtain the typical behavior in the thermodynamic limit, we need to average log V, which we carry out using the replica trick, ⟨log V⟩ = lim_{n→0} (⟨V^n⟩ − 1)/n, where ⟨·⟩ refers to the average over u_i^µ and y^µ. For natural n, we evaluate ⟨V^n⟩ by introducing n replicated weight vectors w^α. Using the Fourier representation of the delta functions and performing the average over the Gaussian distribution of u_i^µ (each of the N components has zero mean and variance 1/N) introduces the replica overlap order parameters q_αβ = (1/N) w^α·w^β; all manifolds contribute the same factor. We proceed by making the replica symmetric ansatz on the order parameters at their saddle point, q_αβ = (1 − q)δ_αβ + q, from which one obtains in the n → 0 limit
log det q = n log(1 − q) + nq/(1 − q). (A.9)
Using the Hubbard-Stratonovich transformation, completing the square in the exponential, and using ⟨∫Dt A^n⟩ = exp( n ∫Dt log A ) in the n → 0 limit, the replicated interaction factor becomes
X = exp( n [ q/(2(1−q)) + ∫Dt log Z(t) ] ), Z(t) ∝ ∫ dh exp( −‖h − √q t‖²/(2(1−q)) ) Θ( min_{s∈S} h·s − κ ). (A.12)
Combining these terms, we write the last factor in ⟨V^n⟩, Eq. (A.6), as exp(nP G_1), where G_1(q) = ∫Dt log Z(t). The first factor in ⟨V^n⟩, Eq. (A.6), can be written as exp(nN G_0), where, as in the Gardner theory, the entropic term in the thermodynamic limit is
G_0(q) = ½ ln(1 − q) + q/(2(1 − q)) (A.14)
and represents the constraints on the volume of w^α due to the normalization and the order parameter q. Combining the G_0 and G_1 contributions, we have ⟨log V⟩/N = G_0(q) + α G_1(q). Note that √q t_i represents the quenched random component due to the randomness in the u_i^µ, and z_i = (h_i − √q t_i)/√(1 − q) is the "thermal" component due to the variability within the solution space. The order parameter q is calculated via the saddle-point condition 0 = ∂G_0/∂q + α ∂G_1/∂q.
Capacity: In the limit α → α_M(κ), the overlap between the solutions approaches unity and the volume shrinks to zero.
It is convenient to define Q = q/(1 − q) and study the limit Q → ∞. In this limit, the leading order is
⟨log V⟩/(nN) ≈ (Q/2)( 1 − α ⟨F(t)⟩_t ),
where the first term is the contribution from G_0 → Q/2. The second term comes from G_1 → −(Q/2) α ⟨F(t)⟩_t, where the average is over the Gaussian distribution of the (D + 1)-dimensional vector t, and F(t) is independent of Q; it is obtained by replacing the integrals in Eq. (A.16) by their saddle point. It is convenient to change variables from h to v and, noting that the distribution of t is invariant under t → −t, we obtain the final form
F(t) = min_v { ‖v − t‖² | min_{s∈S} v·s ≥ κ }.
At capacity, ⟨log V⟩ vanishes, and the capacity of a general manifold with margin κ is given by
α_M⁻¹(κ) = ⟨F(t)⟩_t.
Finally, we note that the mean squared 'annealed' variability in the fields due to the entropy of solutions vanishes at the capacity limit as 1/Q, see Eq. (A.16). Thus, the quantity ‖v − t‖² in the above equation represents the annealed variability multiplied by Q, which remains finite in the limit Q → ∞.
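In condensed form, the replica computation above can be summarized as follows (a sketch reconstructed to be consistent with the surrounding text; Dt denotes the Gaussian measure):

```latex
% Replica-symmetric free energy per neuron in the n -> 0 limit:
\frac{\langle \log V \rangle}{N} = G_0(q) + \alpha\, G_1(q),
\qquad
G_0(q) = \frac{1}{2}\ln(1-q) + \frac{q}{2(1-q)},
\qquad
G_1(q) = \int D\vec t\, \log Z(\vec t\,),

% with the single-manifold partition function
Z(\vec t\,) \propto \int d\vec h\,
  \exp\!\left(-\frac{\lVert \vec h - \sqrt{q}\,\vec t \rVert^{2}}{2(1-q)}\right)
  \Theta\!\left(\min_{\vec s \in S} \vec h \cdot \vec s - \kappa\right).

% Capacity limit: q -> 1 (Q = q/(1-q) -> infinity), <log V> -> 0 yields
\alpha_M^{-1}(\kappa) = \big\langle F(\vec t\,) \big\rangle_{\vec t},
\qquad
F(\vec t\,) = \min_{\vec v}\left\{ \lVert \vec v - \vec t \rVert^{2}
  \;\middle|\; \min_{\vec s \in S} \vec v \cdot \vec s \ge \kappa \right\}.
```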

APPENDIX B: CAPACITY OF ℓ2 BALLS WITH SPARSE LABELS
Here we outline the mean field theory of the capacity of sparsely labeled ℓ2 balls. The capacity of balls with balanced labels (and zero bias) is given in the main text; in the expressions below, χ_D(t) denotes the D-dimensional Chi probability density function. Eq. (71) then yields the inverse capacity α_B⁻¹(f, r, D) at bias b (Eq. (A.25)), and the optimal bias b is given by ∂α_B/∂b = 0.
Small r: First, we note that the capacity of sparsely labeled points is α₀(f, κ) = max_b α₀(f, κ, b), where α₀(f, κ, b) is the Gardner capacity with bias (Eq. (A.26)), and optimizing over b yields a self-consistent equation for b (Eq. (A.27)). Next, for balls with small radius, such that r√D ≪ 1, we can use the general theory of manifolds in the scaling regime, Sec. V, which predicts that
α_B(f, r, D) ≈ α₀(f, κ = r√D). (A.28)
Small f and large r: Here we analyze the above equations in the limit of small f and large r. We assume that f is sufficiently small so that the optimal bias b is large. Furthermore, since r is large, in order to be effective, b must be of order r or larger. Hence we assume that the scaled bias, b̃ = b/√(1 + r²), is of order 1 or larger. Under these conditions, the contribution of the minority class to the inverse capacity α_B⁻¹ is dominated by
f ∫ Dt₀ ((t₀ + b)² + t²) ≈ f b², (A.29)
where we have assumed κ = 0 for simplicity. The dominant contribution of the majority class to α_B⁻¹ is, (1 − f)