
PHYSICAL REVIEW X 8, 031003 (2018)

Classification and Geometry of General Perceptual Manifolds


SueYeon Chung,1,2,6 Daniel D. Lee,2,3 and Haim Sompolinsky2,4,5
1 Program in Applied Physics, School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
2 Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
3 School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
4 Racah Institute of Physics, Hebrew University, Jerusalem 91904, Israel
5 Edmond and Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem 91904, Israel
6 Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

(Received 17 October 2017; published 5 July 2018)

Perceptual manifolds arise when a neural population responds to an ensemble of sensory signals
associated with different physical features (e.g., orientation, pose, scale, location, and intensity) of the same
perceptual object. Object recognition and discrimination require classifying the manifolds in a manner that
is insensitive to variability within a manifold. How neuronal systems give rise to invariant object
classification and recognition is a fundamental problem in brain theory as well as in machine learning.
Here, we study the ability of a readout network to classify objects from their perceptual manifold
representations. We develop a statistical mechanical theory for the linear classification of manifolds with
arbitrary geometry, revealing a remarkable relation to the mathematics of conic decomposition. We show
how special anchor points on the manifolds can be used to define novel geometrical measures of radius and
dimension, which can explain the classification capacity for manifolds of various geometries. The general
theory is demonstrated on a number of representative manifolds, including $\ell_2$ ellipsoids prototypical of
strictly convex manifolds, $\ell_1$ balls representing polytopes with finite samples, and ring manifolds
exhibiting nonconvex continuous structures that arise from modulating a continuous degree of freedom.
The effects of label sparsity on the classification capacity of general manifolds are elucidated, displaying a
universal scaling relation between label sparsity and the manifold radius. Theoretical predictions are
corroborated by numerical simulations using recently developed algorithms to compute maximum margin
solutions for manifold dichotomies. Our theory and its extensions provide a powerful and rich framework
for applying statistical mechanics of linear classification to data arising from perceptual neuronal responses
as well as to artificial deep networks trained for object recognition tasks.
DOI: 10.1103/PhysRevX.8.031003 Subject Areas: Biological Physics, Complex Systems,
Statistical Physics

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

I. INTRODUCTION

One fundamental cognitive task performed by animals and humans is the invariant perception of objects, requiring the nervous system to discriminate between different objects despite substantial variability in each object's physical features. For example, in vision, the mammalian brain is able to recognize objects despite variations in their orientation, position, pose, lighting, and background. Such impressive robustness to physical changes is not limited to vision; other examples include speech processing, which requires the detection of phonemes despite variability in the acoustic signals associated with individual phonemes, and the discrimination of odors in the presence of variability in odor concentrations. Sensory systems are organized as hierarchies, consisting of multiple layers, transforming sensory signals into a sequence of distinct neural representations. Studies of high-level sensory systems, e.g., the inferotemporal cortex (IT) in vision [1], auditory cortex in audition [2], and piriform cortex in olfaction [3], reveal that even the late sensory stages exhibit significant sensitivity of neuronal responses to physical variables. This suggests that sensory hierarchies generate representations of objects that, although not entirely invariant to changes in physical features, are still readily decoded in an invariant manner by a downstream system. This hypothesis is formalized
by the notion of the untangling of perceptual manifolds [4–6]. This viewpoint underlies a number of studies of object recognition in deep neural networks for artificial intelligence [7–10].

To conceptualize perceptual manifolds, consider a set of N neurons responding to a specific sensory signal associated with an object, as shown in Fig. 1. The neural population response to that stimulus is a vector in $\mathbb{R}^N$. Changes in the physical parameters of the input stimulus that do not change the object identity modulate the neural state vector. The set of all state vectors corresponding to responses to all possible stimuli associated with the same object can be viewed as a manifold in the neural state space. In this geometrical perspective, object recognition is equivalent to the task of discriminating manifolds of different objects from each other. Presumably, as signals propagate from one processing stage to the next in the sensory hierarchy, the geometry of the manifolds is reformatted so that they become "untangled"; namely, they are more easily separated by a biologically plausible decoder [1]. In this paper, we model the decoder as a simple single layer network (the perceptron) and ask how the geometrical properties of the perceptual manifolds influence their ability to be separated by a linear classifier.

FIG. 1. Perceptual manifolds in neural state space. (a) Firing rates of neurons responding to images of a dog shown at various orientations $\theta$ and scales $s$. The response to a particular orientation and scale can be characterized by an N-dimensional population response. (b) The population responses to the images of the dog form a continuous manifold representing the complete set of invariances in the $\mathbb{R}^N$ neural activity space. Other object images, such as those corresponding to a cat in various poses, are represented by other manifolds in this vector space.

Linear separability has previously been studied in the context of the classification of points by a perceptron, using combinatorics [11] and statistical mechanics [12,13]. Gardner's statistical mechanics theory is extremely important as it provides accurate estimates of the perceptron capacity beyond function counting by incorporating robustness measures. The robustness of a linear classifier is quantified by the margin, which measures the distance between the separating hyperplane and the closest point. Maximizing the margin of a classifier is a critical objective in machine learning, providing support vector machines (SVM) with their good generalization performance guarantees [14].

The above theories focus on separating a finite set of points with no underlying geometrical structure and are not applicable to the problem of manifold classification, which deals with separating an infinite number of points geometrically organized as manifolds. This paper addresses the important question of how to quantify the capacity of the perceptron for dichotomies of input patterns described by manifolds. In an earlier paper, we have presented the analysis for classification of manifolds of extremely simple geometry, namely, balls [15]. However, the previous results have limited applicability as the neural manifolds arising from realistic physical variations of objects can exhibit much more complicated geometries. Can statistical mechanics deal with the classification of manifolds with complex geometry, and what specific geometric properties determine the separability of manifolds?

In this paper, we develop a theory of the linear separability of general, finite-dimensional manifolds. The summary of our key results is as follows:

(i) We begin by introducing a mathematical model of general manifolds for binary classification (Sec. II). This formalism allows us to generate generic bounds on the manifold separability capacity from the limits of small manifold sizes (classification of isolated points) to that of large sizes (classification of entire affine subspaces). These bounds highlight the fact that for large ambient dimension N, the maximal number P of separable finite-dimensional manifolds is proportional to N, even though each consists of an infinite number of points, setting the stage for a statistical mechanical evaluation of the maximal $\alpha = P/N$.

(ii) Using replica theory, we derive mean-field equations for the capacity of linear separation of finite-dimensional manifolds (Sec. III) and for the statistical properties of the optimal separating weight vector. The mean-field solution is given in the form of self-consistent Karush-Kuhn-Tucker (KKT) conditions involving the manifold anchor point. The anchor point is a representative support vector for the manifold. The position of the anchor point on a manifold changes as the orientations of the other manifolds are varied, and the ensuing statistics of the distribution of anchor points plays a key role in our theory. The optimal separating plane intersects a fraction of the manifolds (the supporting manifolds). Our theory categorizes the dimension of the span of the intersecting sets (e.g., points, edges, faces, or full manifolds) in relation to the position of the anchor points in the manifolds' convex hulls.

(iii) The mean-field theory motivates a new definition of manifold geometry, which is based on the measure induced by the statistics of the anchor points.
In particular, we define the manifold anchor radius and dimension, $R_M$ and $D_M$, respectively. These quantities are relevant since the capacity of general manifolds can be well approximated by the capacity of $\ell_2$ balls with radii $R_M$ and dimensions $D_M$. Interestingly, we show that, in the limit of small manifolds, the anchor point statistics are dominated by points on the boundary of the manifolds which have minimal overlap with Gaussian random vectors. The resultant Gaussian radius $R_g$ and dimension $D_g$ are related to the well-known Gaussian mean width of convex bodies (Sec. IV). Beyond understanding fundamental limits for classification capacity, these geometric measures offer new quantitative tools for assessing how perceptual manifolds are reformatted in brain and artificial systems.

(iv) We apply the general theory to three examples, representing distinct prototypical manifold classes. One class consists of manifolds with strictly smooth convex hulls, which do not contain facets and are exemplified by $\ell_2$ ellipsoids. Another class is that of convex polytopes, which arise when the manifolds consist of a finite number of data points, and are exemplified by $\ell_1$ ellipsoids. Finally, ring manifolds represent an intermediate class: smooth but nonconvex manifolds. Ring manifolds are continuous nonlinear functions of a single intrinsic variable, such as object orientation angle. The differences between these manifold types show up most clearly in the distinct patterns of the support dimensions. However, as we show, they share common trends. As the size of the manifold increases, the capacity and geometrical measures vary smoothly, exhibiting a smooth crossover from a small radius and dimension with high capacity to a large radius and dimension with low capacity. This crossover occurs at $R_g \propto 1/\sqrt{D_g}$. Importantly, for many realistic cases, when the size is smaller than the crossover value, the manifold dimensionality is substantially smaller than that computed from naive second-order statistics, highlighting the saliency and significance of our measures for the anchor geometry.

(v) Finally, we treat the important case of the classification of manifolds with imbalanced (sparse) labels, which commonly arise in problems of object recognition. It is well known that, with highly sparse labels, the classification capacity of random points increases dramatically as $(f |\log f|)^{-1}$, where $f \ll 1$ is the fraction of the minority labels. Our analysis of sparsely labeled manifolds highlights the interplay between manifold size and sparsity. In particular, it shows that sparsity enhances the capacity only when $f R_g^2 \ll 1$, where $R_g$ is the (Gaussian) manifold radius. Notably, for a large regime of parameters, sparsely labeled manifolds can approximately be described by a universal capacity function equivalent to sparsely labeled $\ell_2$ balls with radii $R_g$ and dimensions $D_g$, as demonstrated by our numerical evaluations (Sec. VI). Conversely, when $f R_g^2 \gg 1$, the capacity is low and close to $1/D$, where $D$ is the dimensionality of the manifold affine subspace, even for extremely small $f$.

(vi) Our theory provides, for the first time, quantitative and qualitative predictions for the perceptron classification of realistic data structures. However, applications to real data may require further extensions of the theory, which are discussed in Sec. VII. Together, the theory makes an important contribution to the development of statistical mechanical theories of neural information processing in realistic conditions.

II. MODEL OF MANIFOLDS

Manifolds in affine subspaces:—We model a set of P perceptual manifolds corresponding to P perceptual objects. Each manifold $M^\mu$ for $\mu = 1, \ldots, P$ consists of a compact subset of an affine subspace of $\mathbb{R}^N$ with affine dimension $D$, with $D < N$. A point on the manifold $\mathbf{x}^\mu \in M^\mu$ can be parametrized as

$$\mathbf{x}^\mu(\vec{S}) = \sum_{i=1}^{D+1} S_i \mathbf{u}_i^\mu, \qquad (1)$$

where $\mathbf{u}_i^\mu$ are a set of orthonormal bases of the $(D+1)$-dimensional linear subspace containing $M^\mu$. The $D+1$ components $S_i$ represent the coordinates of the manifold point within this subspace and are constrained to be in the set $\vec{S} \in \mathcal{S}$. The bold notation for $\mathbf{x}^\mu$ and $\mathbf{u}_i^\mu$ indicates that they are vectors in $\mathbb{R}^N$, whereas the arrow notation for $\vec{S}$ indicates that it is a vector in $\mathbb{R}^{D+1}$. The set $\mathcal{S}$ defines the shape of the manifolds and encapsulates the affine constraint. For simplicity, we will first assume that the manifolds have the same geometry so that the coordinate set $\mathcal{S}$ is the same for all the manifolds; extensions that consider heterogeneous geometries are provided in Sec. VI A.

We study the separability of P manifolds into two classes, denoted by binary labels $y^\mu = \pm 1$, by a linear hyperplane passing through the origin. A hyperplane is described by a weight vector $\mathbf{w} \in \mathbb{R}^N$, normalized so $\|\mathbf{w}\|^2 = N$, and the hyperplane correctly separates the manifolds with a margin $\kappa \ge 0$ if it satisfies

$$y^\mu \mathbf{w} \cdot \mathbf{x}^\mu \ge \kappa \qquad (2)$$

for all $\mu$ and $\mathbf{x}^\mu \in M^\mu$. Since linear separability is a convex problem, separating the manifolds is equivalent to separating the convex hulls, $\mathrm{conv}(M^\mu) = \{ \mathbf{x}^\mu(\vec{S}) \,|\, \vec{S} \in \mathrm{conv}(\mathcal{S}) \}$, where

$$\mathrm{conv}(\mathcal{S}) = \left\{ \sum_{i=1}^{D+1} \alpha_i \vec{S}_i \,\Big|\, \vec{S}_i \in \mathcal{S},\ \alpha_i \ge 0,\ \sum_{i=1}^{D+1} \alpha_i = 1 \right\}. \qquad (3)$$
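To make the parametrization concrete, the following minimal sketch (ours, not from the paper; the function names, the Gaussian bases of Sec. III, and the ellipse-shaped coordinate set are illustrative choices) constructs manifold points $\mathbf{x}^\mu(\vec{S})$ per Eq. (1) and evaluates the margin condition of Eq. (2) on sampled points:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D = 1000, 50, 2   # ambient dimension, number of manifolds, affine dimension

# Bases u_i^mu with i.i.d. N(0, 1/N) components (the statistical assumption of
# Sec. III); each basis vector is approximately unit norm in the large-N limit.
U = rng.normal(0.0, 1.0 / np.sqrt(N), size=(P, D + 1, N))

def manifold_point(mu, s):
    """Eq. (1): x^mu(S) = sum_i S_i u_i^mu with S = (s, 1), i.e., intrinsic
    coordinates s in R^D plus a fixed longitudinal coordinate S_{D+1} = 1."""
    S = np.append(s, 1.0)
    return S @ U[mu]

def worst_margin(w, y, mu, samples):
    """Smallest signed field y w . x over sampled points of manifold mu; the
    hyperplane w separates it with margin kappa when this exceeds kappa (Eq. 2)."""
    return min(y * np.dot(w, manifold_point(mu, s)) for s in samples)

# Illustrative coordinate set S: an ellipse in R^D (D = 2), sampled densely.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
samples = np.stack([np.cos(theta), 0.5 * np.sin(theta)], axis=1)

w = rng.standard_normal(N)
w *= np.sqrt(N) / np.linalg.norm(w)   # normalization ||w||^2 = N
print(worst_margin(w, +1, 0, samples))
```

Because separability is a convex problem, checking Eq. (2) on a dense sample of the boundary of $\mathcal{S}$ is, for convex $\mathcal{S}$, a good proxy for checking it on the full continuous manifold.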

The position of an affine subspace relative to the origin can be defined via the translation vector that is closest to the origin. This orthogonal translation vector $\mathbf{c}^\mu$ is perpendicular to all the affine displacement vectors in $M^\mu$, so that all the points in the affine subspace have equal projections on $\mathbf{c}^\mu$, i.e., $\mathbf{x}^\mu \cdot \mathbf{c}^\mu = \|\mathbf{c}^\mu\|^2$ for all $\mathbf{x}^\mu \in M^\mu$ (Fig. 2). We will further assume for simplicity that the norms $\|\mathbf{c}^\mu\|$ are all the same and normalized to 1.

FIG. 2. Model of manifolds in affine subspaces. A $D = 2$ manifold embedded in $\mathbb{R}^N$. $\mathbf{c}$ is the orthogonal translation vector for the affine space, and $\mathbf{x}_0$ is the center of the manifold. As the scale $r$ is varied, the manifold shrinks to the point $\mathbf{x}_0$ or expands to fill the entire affine space.

To investigate the separability properties of manifolds, it is helpful to consider scaling a manifold $M^\mu$ by an overall scale factor $r$ without changing its shape. We define the scaling relative to a center $\vec{S}^0 \in \mathcal{S}$ by a scalar $r > 0$, by

$$r M^\mu = \left\{ \sum_{i=1}^{D+1} \left[ S_i^0 + r \left( S_i - S_i^0 \right) \right] \mathbf{u}_i^\mu \,\Big|\, \vec{S} \in \mathcal{S} \right\}. \qquad (4)$$

When $r \to 0$, the manifold $r M^\mu$ converges to a point: $\mathbf{x}_0^\mu = \sum_i S_i^0 \mathbf{u}_i^\mu$. On the other hand, when $r \to \infty$, the manifold $r M^\mu$ spans the entire affine subspace. If the manifold is symmetric (such as for an ellipsoid), there is a natural choice for a center. We will later provide an appropriate definition for the center point for general, asymmetric manifolds. In general, the translation vector $\vec{c}$ and center $\vec{S}^0$ need not coincide, as shown in Fig. 2. However, we will also discuss later the special case of centered manifolds, in which the translation vector and center do coincide.

Bounds on linear separability of manifolds:—For dichotomies of P input points in $\mathbb{R}^N$ at zero margin, $\kappa = 0$, the number of dichotomies that can be separated by a linear hyperplane through the origin is given by [11]

$$C_0(P, N) = 2 \sum_{k=0}^{N-1} C_k^{P-1} \le 2^P, \qquad (5)$$

where $C_k^n = n! / [k!(n-k)!]$ is the binomial coefficient for $n \ge k$, and zero otherwise. This result holds for P input vectors that obey the mild condition that the vectors are in general position, namely, that all subsets of input vectors of size $p \le N$ are linearly independent.

For large P and N, the probability $2^{-P} C_0(P, N)$ of a dichotomy being linearly separable depends only upon the ratio $P/N$ and exhibits a sharp transition at the critical value of $\alpha_0 = 2$. We are not aware of a comprehensive extension of Cover's counting theorem for general manifolds; nevertheless, we can provide lower and upper bounds on the number of linearly realizable dichotomies by considering the limits of $r \to 0$ and $r \to \infty$ under the following general conditions.

First, in the limit of $r \to 0$, the linear separability of P manifolds becomes equivalent to the separability of the P centers. This leads to the requirement that the centers of the manifolds $\mathbf{x}_0^\mu$ are in general position in $\mathbb{R}^N$. Second, we consider the conditions under which the manifolds are linearly separable when $r \to \infty$, so that the manifolds span complete affine subspaces. For a weight vector $\mathbf{w}$ to consistently assign the same label to all points on an affine subspace, it must be orthogonal to all the displacement vectors in the affine subspace. Hence, to realize a dichotomy of P manifolds when $r \to \infty$, the weight vector $\mathbf{w}$ must lie in a null space of dimension $N - D_{\mathrm{tot}}$, where $D_{\mathrm{tot}}$ is the rank of the union of affine displacement vectors. When the basis vectors $\mathbf{u}_i^\mu$ are in general position, then $D_{\mathrm{tot}} = \min(DP, N)$. Then, for the affine subspaces to be separable, $PD < N$ is required, and the projections of the P orthogonal translation vectors also need to be separable in the $(N - D_{\mathrm{tot}})$-dimensional null space. Under these general conditions, the number of dichotomies for D-dimensional affine subspaces that can be linearly separated, $C_D(P, N)$, can be related to the number of dichotomies for a finite set of points via

$$C_D(P, N) = C_0(P, N - PD). \qquad (6)$$

From this relationship, we conclude that the ability to linearly separate D-dimensional affine subspaces exhibits a transition from always being separable to never being separable at the critical ratio $P/N = 2/(1 + 2D)$ for large P and N [see Supplemental Material (SM) [16], Sec. S1].

For general D-dimensional manifolds with finite size, the number of dichotomies that are linearly separable will be lower bounded by $C_D(P, N)$ and upper bounded by $C_0(P, N)$. We introduce the notation $\alpha_M(\kappa)$ to denote the maximal load $P/N$ such that randomly labeled manifolds are linearly separable with a margin $\kappa$, with high probability. Therefore, from the above considerations, it follows that the critical load at zero margin, $\alpha_M(\kappa = 0)$, is bounded by

$$\frac{1}{2} \le \alpha_M^{-1}(\kappa = 0) \le \frac{1}{2} + D. \qquad (7)$$
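The counting bounds above are straightforward to evaluate numerically. The sketch below (ours; the parameter values are illustrative) computes $C_0(P, N)$ of Eq. (5), the separability probability $2^{-P} C_0(P, N)$, and $C_D(P, N)$ of Eq. (6), exhibiting the sharp transitions at $P/N = 2$ for points and at $2/(1 + 2D)$ for affine subspaces:

```python
from math import comb

def C0(P, N):
    """Cover's counting function, Eq. (5): the number of linearly realizable
    dichotomies of P points in general position in R^N at zero margin."""
    return 2 * sum(comb(P - 1, k) for k in range(min(N, P)))

def prob_separable(P, N):
    """Probability (1/2^P) C_0(P, N) that a random dichotomy is separable."""
    return C0(P, N) / 2 ** P

def CD(P, N, D):
    """Eq. (6): C_D(P, N) = C_0(P, N - PD) for D-dimensional affine subspaces
    (zero once the null space of dimension N - PD is exhausted)."""
    return C0(P, N - P * D) if N - P * D > 0 else 0

# Sharp transition at alpha = P/N = 2 for points (probability 1/2 exactly at 2)...
N = 400
for alpha in (1.9, 2.0, 2.1):
    print("alpha =", alpha, " P(separable) =", prob_separable(int(alpha * N), N))

# ...and at alpha = 2/(1 + 2D) for D-dimensional affine subspaces:
D = 3
alpha_c = 2 / (1 + 2 * D)
for alpha in (0.9 * alpha_c, alpha_c, 1.1 * alpha_c):
    P = int(alpha * N)
    print("alpha =", round(alpha, 3), " P(separable) =", CD(P, N, D) / 2 ** P)
```

At $\alpha = 2$ the point-separability probability equals exactly $1/2$, the classic signature of the Cover transition, and the subspace counts reproduce the shifted critical load of Eq. (6).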

These bounds highlight the fact that, in the large N limit, the maximal number of separable finite-dimensional manifolds is proportional to N, even though each consists of an infinite number of points. This sets the stage for a statistical mechanical evaluation of the maximal $\alpha = P/N$, where P is the number of manifolds, described in the following section.

III. STATISTICAL MECHANICAL THEORY

In order to make theoretical progress beyond the bounds above, we need to make additional statistical assumptions about the manifold spaces and labels. Specifically, we will assume that the individual components of $\mathbf{u}_i^\mu$ are drawn independently from identical Gaussian distributions with zero mean and variance $1/N$, and that the binary labels $y^\mu = \pm 1$ are randomly assigned to each manifold with equal probabilities. We will study the thermodynamic limit where $N, P \to \infty$, but with a finite load $\alpha = P/N$. In addition, the manifold geometries as specified by the set $\mathcal{S}$ in $\mathbb{R}^{D+1}$ and, in particular, their affine dimension D are held fixed in the thermodynamic limit. Under these assumptions, the bounds in Eq. (7) can be extended to the linear separability of general manifolds with finite margin $\kappa$, characterized by the reciprocal of the critical load ratio $\alpha_M^{-1}(\kappa)$,

$$\alpha_0^{-1}(\kappa) \le \alpha_M^{-1}(\kappa) \le \alpha_0^{-1}(\kappa) + D, \qquad (8)$$

where $\alpha_0(\kappa)$ is the maximum load for separation of random independent and identically distributed (i.i.d.) points with a margin $\kappa$, given by the Gardner theory [12],

$$\alpha_0^{-1}(\kappa) = \int_{-\infty}^{\kappa} Dt\, (t - \kappa)^2, \qquad (9)$$

with Gaussian measure $Dt = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt$. For many interesting cases, the affine dimension D is large, and the gap in Eq. (8) is overly loose. Hence, it is important to derive an estimate of the capacity for manifolds with finite sizes and evaluate the dependence of the capacity and the nature of the solution on the geometrical properties of the manifolds, as shown below.

A. Mean-field theory of manifold separation capacity

Following Gardner's framework [12,13], we compute the statistical average of $\log Z$, where Z is the volume of the space of the solutions, which in our case can be written as

$$Z = \int d^N \mathbf{w}\, \delta(\|\mathbf{w}\|^2 - N) \prod_{\mu,\, \mathbf{x}^\mu \in M^\mu} \Theta(y^\mu \mathbf{w} \cdot \mathbf{x}^\mu - \kappa). \qquad (10)$$

$\Theta(\cdot)$ is the Heaviside function enforcing the margin constraints in Eq. (2), and the delta function ensures $\|\mathbf{w}\|^2 = N$. In the following, we focus on the properties of the maximum margin solution, namely, the solution for the largest load $\alpha_M$ for a fixed margin $\kappa$, or, equivalently, the solution when the margin $\kappa$ is maximized for a given $\alpha_M$.

As shown in Appendix A, we prove that the general form of the inverse capacity, exact in the thermodynamic limit, is

$$\alpha_M^{-1}(\kappa) = \langle F(\vec{T}) \rangle_{\vec{T}}, \qquad (11)$$

where $F(\vec{T}) = \min_{\vec{V}} \{ \|\vec{V} - \vec{T}\|^2 \,|\, \vec{V} \cdot \vec{S} - \kappa \ge 0,\ \forall\, \vec{S} \in \mathcal{S} \}$ and $\langle \ldots \rangle_{\vec{T}}$ is an average over random $(D+1)$-dimensional vectors $\vec{T}$, whose components are i.i.d. normally distributed, $T_i \sim \mathcal{N}(0, 1)$. The components of the vector $\vec{V}$ represent the signed fields induced by the solution vector $\mathbf{w}$ on the $D + 1$ basis vectors of the manifold. The Gaussian vector $\vec{T}$ represents the part of the variability in $\vec{V}$ due to quenched variability in the manifold's basis vectors and the labels, as will be explained in detail below.

The inequality constraints in F can be written equivalently as a constraint on the point on the manifold with minimal projection on $\vec{V}$. We therefore consider the concave support function of $\mathcal{S}$, $g_{\mathcal{S}}(\vec{V}) = \min_{\vec{S}} \{ \vec{V} \cdot \vec{S} \,|\, \vec{S} \in \mathcal{S} \}$, which can be used to write the constraint for $F(\vec{T})$ as

$$F(\vec{T}) = \min_{\vec{V}} \{ \|\vec{V} - \vec{T}\|^2 \,|\, g_{\mathcal{S}}(\vec{V}) - \kappa \ge 0 \}. \qquad (12)$$

Note that this definition of $g_{\mathcal{S}}(\vec{V})$ is easily mapped to the conventional convex support function defined via the max operation [17].

KKT conditions:—To gain insight into the nature of the maximum margin solution, it is useful to consider the KKT conditions of the convex optimization in Eq. (12) [17]. For each $\vec{T}$, the KKT conditions that characterize the unique solution of $\vec{V}$ for $F(\vec{T})$ are given by

$$\vec{V} = \vec{T} + \lambda \tilde{S}(\vec{T}), \qquad (13)$$

where

$$\lambda \ge 0, \quad g_{\mathcal{S}}(\vec{V}) - \kappa \ge 0, \quad \lambda \left[ g_{\mathcal{S}}(\vec{V}) - \kappa \right] = 0. \qquad (14)$$

The vector $\tilde{S}(\vec{T})$ is a subgradient of the support function at $\vec{V}$, $\tilde{S}(\vec{T}) \in \partial g_{\mathcal{S}}(\vec{V})$ [17]; i.e., it is a point on the convex hull of $\mathcal{S}$ with minimal overlap with $\vec{V}$. When the support function is differentiable, the subgradient $\partial g_{\mathcal{S}}(\vec{V})$ is unique and is equivalent to the gradient of the support function,

$$\tilde{S}(\vec{T}) = \nabla g_{\mathcal{S}}(\vec{V}) = \arg\min_{\vec{S} \in \mathcal{S}} \vec{V} \cdot \vec{S}. \qquad (15)$$
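For intuition, the quantities in Eqs. (11)–(12) can be approximated numerically when the continuous manifold is replaced by a finite point sample, which turns the semi-infinite program into an ordinary quadratic program. The sketch below (ours; a generic SLSQP solve standing in for the authors' cutting-plane QSIP method referenced later in SM Sec. S4, and the segment test case is an illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize

def F_of_T(T, S_samples, kappa=0.0):
    """Sketch of Eq. (12): F(T) = min_V ||V - T||^2 subject to
    V . S - kappa >= 0 for every manifold point S, with the manifold
    approximated by a finite sample (rows of S_samples in R^{D+1})."""
    cons = [{"type": "ineq", "fun": lambda V, s=s: V @ s - kappa}
            for s in S_samples]
    res = minimize(lambda V: np.sum((V - T) ** 2), x0=np.asarray(T, float),
                   constraints=cons)  # scipy selects SLSQP for constrained problems
    return res.fun, res.x

def inverse_capacity(S_samples, kappa=0.0, n_mc=200, seed=1):
    """Eq. (11): alpha_M^{-1} = <F(T)>_T over Gaussian T in D+1 dimensions."""
    rng = np.random.default_rng(seed)
    d = S_samples.shape[1]
    return np.mean([F_of_T(rng.standard_normal(d), S_samples, kappa)[0]
                    for _ in range(n_mc)])

# Example: a D = 1 segment S = {(s, 1) : |s| <= 0.5}.  Because the constraints
# are linear in S, its two endpoints represent the full continuous manifold.
segment = np.array([[-0.5, 1.0], [0.5, 1.0]])
print(1.0 / inverse_capacity(segment))   # Monte Carlo capacity estimate
```

For convex manifolds, enforcing the constraint on extreme points of $\mathcal{S}$ is exact, which is why the segment needs only its endpoints; general smooth manifolds require either dense sampling or the iterative constraint generation of the cutting-plane approach.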

Since the support function is positively homogeneous, $g_{\mathcal{S}}(\gamma \vec{V}) = \gamma g_{\mathcal{S}}(\vec{V})$ for all $\gamma > 0$; thus, $\nabla g_{\mathcal{S}}(\vec{V})$ depends only on the unit vector $\hat{V}$. For values of $\vec{V}$ such that $g_{\mathcal{S}}(\vec{V})$ is not differentiable, the subgradient is not unique, but $\tilde{S}(\vec{T})$ is defined uniquely as the particular subgradient that obeys the KKT conditions, Eqs. (13)–(14). In the latter case, $\tilde{S}(\vec{T}) \in \mathrm{conv}(\mathcal{S})$ may not reside in $\mathcal{S}$ itself.

From Eq. (13), we see that the capacity can be written in terms of $\tilde{S}(\vec{T})$ as

$$F(\vec{T}) = \|\lambda \tilde{S}(\vec{T})\|^2. \qquad (16)$$

The scale factor $\lambda$ is either zero or positive, corresponding to whether $g_{\mathcal{S}}(\vec{V}) - \kappa$ is positive or zero. If $g_{\mathcal{S}}(\vec{V}) - \kappa$ is positive, then $\lambda = 0$, meaning that $\vec{V} = \vec{T}$ and $\vec{T} \cdot \tilde{S}(\vec{T}) - \kappa > 0$. If $\vec{T} \cdot \tilde{S}(\vec{T}) - \kappa < 0$, then $\lambda > 0$ and $\vec{V} \ne \vec{T}$. In this case, multiplying Eq. (13) with $\tilde{S}(\vec{T})$ yields $\lambda \|\tilde{S}(\vec{T})\|^2 = -\vec{T} \cdot \tilde{S}(\vec{T}) + \kappa$. Thus, $\lambda$ obeys the self-consistent equation

$$\lambda = \frac{[-\vec{T} \cdot \tilde{S}(\vec{T}) + \kappa]_+}{\|\tilde{S}(\vec{T})\|^2}, \qquad (17)$$

where the function $[x]_+ = \max(x, 0)$.

B. Mean-field interpretation of the KKT relations

The KKT relations have a nice interpretation within the framework of the mean-field theory. The maximum margin solution vector can always be written as a linear combination of a set of support vectors. Although there are infinite numbers of input points in each manifold, the solution vector can be decomposed into P vectors, one per manifold,

$$\mathbf{w} = \sum_{\mu=1}^{P} \lambda_\mu y^\mu \tilde{\mathbf{x}}^\mu, \quad \lambda_\mu \ge 0, \qquad (18)$$

where $\tilde{\mathbf{x}}^\mu \in \mathrm{conv}(M^\mu)$ is a vector in the convex hull of the $\mu$th manifold. In the large N limit, these vectors are uncorrelated with each other. Hence, squaring this equation and ignoring correlations between the different contributions yields $\|\mathbf{w}\|^2 = N = \sum_{\mu=1}^{P} \lambda_\mu^2 \|\tilde{S}^\mu\|^2$, where $\tilde{S}^\mu$ are the coordinates of $\tilde{\mathbf{x}}^\mu$ in the affine subspace of the $\mu$th manifold; see Eq. (1). (Using the "cavity" method, one can show that the $\lambda$'s appearing in the mean-field KKT equations and the $\lambda$'s appearing in Eq. (18) differ by an overall global scaling factor which accounts for the presence of correlations between the individual terms in Eq. (18), which we have neglected here for brevity; see [18,19].) From this equation, it follows that $\alpha^{-1} = N/P = \langle \lambda^2 \|\tilde{S}\|^2 \rangle$, which yields the KKT expression for the capacity; see Eqs. (11) and (16).

The KKT relations above are self-consistent equations for the statistics of $\lambda_\mu$ and $\tilde{S}^\mu$. The mean-field theory derives the appropriate statistics from self-consistent equations of the fields on a single manifold. To see this, consider projecting the solution vector $\mathbf{w}$ onto the affine subspace of one of the manifolds, say, $\mu = 1$. We define a $(D+1)$-dimensional vector $\vec{V}^1$ as $V_i^1 = y^1 \mathbf{w} \cdot \mathbf{u}_i^1$, $i = 1, \ldots, D + 1$, which are the signed fields of the solution on the affine basis vectors of the $\mu = 1$ manifold. Then, Eq. (18) reduces to $\vec{V}^1 = \lambda_1 \tilde{S}^1 + \vec{T}$, where $\vec{T}$ represents the contribution from all the other manifolds; since their subspaces are randomly oriented, this contribution is well described as a random Gaussian vector. Finally, self-consistency requires that, for fixed $\vec{T}$, $\tilde{S}^1$ is such that it has a minimal overlap with $\vec{V}^1$ and represents a point residing on the margin hyperplane; otherwise, it will not contribute to the maximum margin solution. Thus, Eq. (13) is just the decomposition of the field induced on a specific manifold into the contribution induced by that specific manifold along with the contributions coming from all the other manifolds. The self-consistent Eqs. (14) and (17) relating $\lambda$ to the Gaussian statistics of $\vec{T}$ then naturally follow from the requirement that $\tilde{S}^1$ represents a support vector.

C. Anchor points and manifold supports

The vectors $\tilde{\mathbf{x}}^\mu$ contributing to the solution, Eq. (18), play a key role in the theory. We will denote them or, equivalently, their affine subspace components $\tilde{S}^\mu$ as the "manifold anchor points." For a particular configuration of manifolds, the manifolds could be replaced by an equivalent set of P anchor points to yield the same maximum margin solution. It is important to stress, however, that an individual anchor point is determined not only by the configuration of its associated manifold, but also by the random orientations of all the other manifolds. For a fixed manifold, the location of its anchor point will vary with the relative configurations of the other manifolds. This variation is captured in mean-field theory by the dependence of the anchor point $\tilde{S}$ on the random Gaussian vector $\vec{T}$.

In particular, the position of the anchor point in the convex hull of its manifold reflects the nature of the relation between that manifold and the margin planes. In general, a fraction of the manifolds will intersect with the margin hyperplanes; i.e., they have nonzero $\lambda$. These manifolds are the support manifolds of the system. The nature of this support varies and can be characterized by the dimension k of the span of the intersecting set of $\mathrm{conv}(\mathcal{S})$ with the margin planes. Some support manifolds, which we call "touching" manifolds, intersect with the margin hyperplane only with their anchor point. They have a support dimension $k = 1$, and their anchor point $\tilde{S}$ is on the boundary of $\mathcal{S}$. The other extreme is "fully supporting" manifolds, which completely reside in the margin hyperplane. They are characterized by $k = D + 1$. In this case, $\vec{V}$ is parallel to the translation vector $\vec{c}$ of $\mathcal{S}$. Hence, all the points in $\mathcal{S}$ are
support vectors, and all have the same overlap $\kappa$ with $\vec{V}$. The anchor point, in this case, is the unique point in the interior of $\mathrm{conv}(\mathcal{S})$ that obeys the self-consistent equation, Eq. (13), namely, that it balances the contribution from the other manifolds to zero out the components of $\vec{V}$ that are orthogonal to $\vec{c}$. In the case of smooth convex hulls (i.e., when $\mathcal{S}$ is strongly convex), no other manifold support configurations exist. For other types of manifolds, there are also "partially supporting" manifolds, whose convex hull intersections with the margin hyperplanes consist of k-dimensional faces with $1 < k < D + 1$. The associated anchor points then reside inside the intersecting face. For instance, $k = 2$ implies that $\tilde{S}$ lies on an edge, whereas $k = 3$ implies that $\tilde{S}$ lies on a planar two-face of the convex hull. Determining the dimension of the support structure that arises for various $\vec{T}$ is explained below.

D. Conic decomposition

The KKT conditions can also be interpreted in terms of the conic decomposition of $\vec{T}$, which generalizes the notion of the decomposition of vectors onto linear subspaces and their null spaces via Euclidean projection. The convex cone of the manifold $\mathcal{S}$ is defined as $\mathrm{cone}(\mathcal{S}) = \{ \sum_{i=1}^{D+1} \alpha_i \vec{S}_i \,|\, \vec{S}_i \in \mathcal{S},\ \alpha_i \ge 0 \}$; see Fig. 3. The shifted polar cone of $\mathcal{S}$, denoted $\mathcal{S}^{\circ}_{\kappa}$, is defined as the convex set of points given by

$$\mathcal{S}^{\circ}_{\kappa} = \{ \vec{U} \in \mathbb{R}^{D+1} \,|\, \vec{U} \cdot \vec{S} + \kappa \le 0,\ \forall\, \vec{S} \in \mathcal{S} \} \qquad (19)$$

and is illustrated for $\kappa = 0$ and $\kappa > 0$ in Fig. 3. For $\kappa = 0$, Eq. (19) is simply the conventional polar cone of $\mathcal{S}$ [20]. Equation (13) can then be interpreted as the decomposition of $-\vec{T}$ into the sum of two component vectors: One component is $-\vec{V}$, i.e., its Euclidean projection onto $\mathcal{S}^{\circ}_{\kappa}$; the other component, $\lambda \tilde{S}(\vec{T})$, is located in $\mathrm{cone}(\mathcal{S})$. When $\kappa = 0$, the Moreau decomposition theorem states that the two components are perpendicular: $\vec{V} \cdot (\lambda \tilde{S}(\vec{T})) = 0$ [21]. For nonzero $\kappa$, the two components need not be perpendicular but obey $\vec{V} \cdot (\lambda \tilde{S}(\vec{T})) = \kappa \lambda$.

FIG. 3. Conic decomposition of $-\vec{T} = -\vec{V} + \lambda \tilde{S}$ at (a) zero margin $\kappa = 0$ and (b) nonzero margin $\kappa > 0$. Given a random Gaussian $\vec{T}$, $\vec{V}$ is found such that $\|\vec{V} - \vec{T}\|$ is minimized, while $-\vec{V}$ is on the polar cone $\mathcal{S}^{\circ}_{\kappa}$; $\lambda \tilde{S}$ is in $\mathrm{cone}(\mathcal{S})$, and $\tilde{S}$ is the anchor point, a projection on the convex hull of $\mathcal{S}$.

The position of the vector $\vec{T}$ in relation to the cones $\mathrm{cone}(\mathcal{S})$ and $\mathcal{S}^{\circ}_{\kappa}$ gives rise to qualitatively different expressions for $F(\vec{T})$ and contributions to the solution weight vector and inverse capacity. These correspond to the different support dimensions mentioned above. In particular, when $-\vec{T}$ lies inside $\mathcal{S}^{\circ}_{\kappa}$, $\vec{T} = \vec{V}$ and $\lambda = 0$, so the support dimension $k = 0$. On the other hand, when $-\vec{T}$ lies inside $\mathrm{cone}(\mathcal{S})$, $\vec{V} = \kappa \vec{c}$, and the manifold is fully supporting, $k = D + 1$.

E. Numerical solution of the mean-field equations

The solution of the mean-field equations consists of two stages. First, $\tilde{S}$ is computed for a particular $\vec{T}$, and then the relevant contributions to the inverse capacity are averaged over the Gaussian distribution of $\vec{T}$. For simple geometries such as $\ell_2$ ellipsoids, the first step may be solved analytically. However, for more complicated geometries, both steps need to be performed numerically. The first step involves determining $\vec{V}$ and $\tilde{S}$ for a given $\vec{T}$ by solving the quadratic semi-infinite programming (QSIP) problem, Eq. (12), over the manifold $\mathcal{S}$, which may contain infinitely many points. A novel "cutting plane" method has been developed to efficiently solve the QSIP problem; see SM [16] (Sec. S4). Expectations are computed by sampling the Gaussian $\vec{T}$ in $D + 1$ dimensions and taking the appropriate averages, similar to procedures for other mean-field methods. The relevant quantities corresponding to the capacity are quite concentrated and converge quickly with relatively few samples.

In the following sections, we will also show how the mean-field theory compares with computer simulations that numerically solve for the maximum margin solution of realizations of P manifolds in $\mathbb{R}^N$, given by Eq. (2), for a variety of manifold geometries. Finding the maximum margin solution is challenging, as standard methods to solve SVM problems are limited to a finite number of input points. We have recently developed an efficient algorithm for finding the maximum margin solution in manifold classification and have used this method in the present work [see Ref. [22] and SM [16] (Sec. S5)].

IV. MANIFOLD GEOMETRY

A. Longitudinal and intrinsic coordinates

In this section, we address how the capacity to separate a set of manifolds can be related to their geometry, in particular, to their shape within the D-dimensional affine subspace. Since the projections of all points in a manifold onto the translation vector $\mathbf{c}^\mu$ are the same, $\mathbf{x}^\mu \cdot \mathbf{c}^\mu = \|\mathbf{c}^\mu\|^2 = 1$, it is convenient to parametrize the $D + 1$ affine
basis vectors such that $\mathbf{u}_{D+1}^\mu = \mathbf{c}^\mu$. In these coordinates, the $(D+1)$-dimensional vector representation of $\mathbf{c}^\mu$ is $\vec{C} = (0, 0, \ldots, 0, 1)$. This parametrization is convenient since it constrains the manifold variability to be in the first D components of $\vec{S}$, while the $D + 1$ coordinate is a longitudinal variable measuring the distance of the manifold affine subspace from the origin. We can then write the $(D+1)$-dimensional vectors $\vec{T} = (\vec{t}, t_0)$, where $t_0 = \vec{T} \cdot \vec{C}$, $\vec{V} = (\vec{v}, v_0)$, $\vec{S} = (\vec{s}, 1)$, and $\tilde{S} = (\tilde{s}, 1)$, where lowercase vectors denote vectors in $\mathbb{R}^D$. We will also refer to the D-dimensional intrinsic vector $\tilde{s}$ as the anchor point. In this notation, the capacity, Eqs. (16) and (17), can be written as

$$\alpha_M^{-1}(\kappa) = \int D\vec{t} \int Dt_0\, \frac{\left[ -t_0 - \vec{t} \cdot \tilde{s}(\vec{t}, t_0) + \kappa \right]_+^2}{1 + \|\tilde{s}(\vec{t}, t_0)\|^2}. \qquad (20)$$

It is clear from this form that, when $D = 0$, or when all the vectors in the affine subspace obey $\tilde{s}(\vec{t}, t_0) = 0$, the capacity reduces to the Gardner result, Eq. (9).

Since $\vec{V} \cdot \vec{S} = \vec{v} \cdot \vec{s} + v_0$ for all $\vec{S}$, $g(\vec{V}) = g(\vec{v}) + v_0$, and the support function can be expressed as

$$g(\vec{v}) = \min_{\vec{s}} \{ \vec{v} \cdot \vec{s} \,|\, \vec{s} \in \mathcal{S} \}. \qquad (21)$$

The KKT relations can be written as $\vec{v} = \vec{t} + \lambda \tilde{s}$, where $\lambda = v_0 - t_0 \ge 0$, $g(\vec{v}) \ge \kappa - v_0$, $\lambda [g(\vec{v}) + v_0 - \kappa] = 0$, and $\tilde{s}$ minimizes the overlap with $\vec{v}$. The resultant equation for $\lambda$ (or $v_0$) is $\lambda = [-t_0 - \vec{t} \cdot \tilde{s}(\vec{t}, t_0) + \kappa]_+ / (1 + \|\tilde{s}(\vec{t}, t_0)\|^2)$, which agrees with Eq. (17).

B. Types of supports

Using the above coordinates, we can elucidate the conditions for the different types of support manifolds defined in the previous section. To do this, we fix the random vector $\vec{t}$ and consider the qualitative change in the anchor point $\tilde{s}(\vec{t}, t_0)$ as $t_0$ decreases from $+\infty$ to $-\infty$.

Interior manifolds ($k = 0$):—For sufficiently positive $t_0$, the manifold is interior to the margin plane, i.e., $\lambda = 0$, with corresponding support dimension $k = 0$. Although not contributing to the inverse capacity and solution vector, it is useful to associate anchor points to these manifolds, defined as the closest point on the manifold to the margin plane: $\tilde{s}(\vec{t}) = \arg\min_{\vec{s} \in \mathcal{S}} \vec{t} \cdot \vec{s} = \nabla g(\vec{t})$, since $\vec{v} = \vec{t}$. This definition ensures continuity of the anchor point for all $\lambda \ge 0$.

This interior regime holds when $g(\vec{t}) > \kappa - t_0$ or, equivalently, for $t_0 - \kappa > t_{\mathrm{touch}}(\vec{t})$, where

$$t_{\mathrm{touch}}(\vec{t}) = -g(\vec{t}). \qquad (22)$$

Nonzero contributions to the capacity only occur outside this interior regime, when $g(\vec{t}) + t_0 - \kappa < 0$, in which case $\lambda > 0$. Thus, for all support dimensions $k > 0$, the solution for $\vec{v}$ is active, satisfying the equality condition $g(\vec{v}) + v_0 - \kappa = 0$, so that, from Eq. (13),

$$g(\vec{v}) + t_0 - \kappa + \lambda = 0 \qquad (23)$$

outside the interior regime.

Fully supporting manifolds ($k = D + 1$):—When $t_0$ is sufficiently negative, $v_0 = \kappa$, $\vec{v} = 0$, and $\lambda = -t_0 + \kappa$. The anchor point $\tilde{s}(\vec{T})$, which obeys the KKT equations, resides in the interior of $\mathrm{conv}(\mathcal{S})$,

$$\tilde{s}(\vec{t}, t_0) = \frac{-\vec{t}}{\kappa - t_0}. \qquad (24)$$

For a fixed $\vec{t}$, $t_0$ must be negative enough, $t_0 - \kappa < t_{\mathrm{fs}}$, where

$$t_{\mathrm{fs}}(\vec{t}) = \arg\max \left\{ t_0 \,\Big|\, \frac{-\vec{t}}{\kappa - t_0} \in \mathrm{conv}(\mathcal{S}) \right\}, \qquad (25)$$

guaranteeing that $\tilde{s}(\vec{t}, t_0) \in \mathrm{conv}(\mathcal{S})$. The contribution of this regime to the capacity is

$$F(\vec{T}) = \|\vec{v} - \vec{t}\|^2 + (v_0 - t_0)^2 = \|\vec{t}\|^2 + (\kappa - t_0)^2; \qquad (26)$$

see Eq. (12). Finally, for values of $t_0$ such that $t_{\mathrm{fs}}(\vec{t}) \le t_0 - \kappa \le t_{\mathrm{touch}}(\vec{t})$, the manifolds are partially supporting with support dimension $1 \le k \le D$. Examples of different supporting regimes are illustrated in Fig. 4.

FIG. 4. Determining the anchor points of the Gaussian distributed vector $(\vec{t}, t_0)$ onto the convex hull of the manifold, denoted as $\tilde{s}(\vec{t}, t_0)$. Here, we show, for the same vector $\vec{t}$, the change in $\tilde{s}(\vec{t}, t_0)$ as $t_0$ decreases from $+\infty$ to $-\infty$. (a) $D = 2$ strictly convex manifold [panels (a1)–(a3): interior ($k = 0$), touching ($k = 1$), fully supporting ($k = 3$)]. For sufficiently positive $t_0$, the vector $-\vec{t}$ obeys the constraint $g(\vec{t}) + t_0 > \kappa$; hence, $-\vec{t} = -\vec{v}$, and the configuration corresponds to an interior manifold (support dimension $k = 0$). For intermediate values, where $t_{\mathrm{touch}} > t_0 - \kappa > t_{\mathrm{fs}}$, $(-\vec{t}, -t_0)$ violates the above constraints, and $\tilde{s}$ is a point on the boundary of the manifold that maximizes the projection on the manifold of a vector $(-\vec{v}, -v_0)$ that is the closest to $(-\vec{t}, -t_0)$ and obeys $g(\vec{v}) + t_0 + \lambda = \kappa$. Finally, for larger values of $-t_0$, $\vec{v} = 0$, and $\tilde{s}$ is a point in the interior of the manifold in the direction of $-\vec{t}$ (fully supporting with $k = 3$). (b) $D = 2$ square manifold [panels (b1)–(b4): interior ($k = 0$), touching ($k = 1$), partially supporting ($k = 2$), fully supporting ($k = 3$)]. Here, in both the interior and touching regimes, $\tilde{s}$ is a vertex of the square. In the fully supporting regime, the anchor point is in the interior and collinear to $-\vec{t}$. There is also a partially supporting regime when $-t_0$ is slightly below $t_{\mathrm{fs}}$. In this regime, $-\vec{v}$ is perpendicular to one of the edges and $\tilde{s}$ resides on an edge, corresponding to manifolds whose intersections with the margin planes are edges ($k = 2$).
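As a concrete instance of these regimes, consider a D-dimensional $\ell_2$ ball of radius R, for which $g(\vec{t}) = -R \|\vec{t}\|$; Eq. (22) then gives $t_{\mathrm{touch}} = R \|\vec{t}\|$, and requiring $-\vec{t}/(\kappa - t_0)$ to lie in the ball in Eq. (25) gives $t_{\mathrm{fs}} = -\|\vec{t}\|/R$. The sketch below (ours; the ball specialization and Monte Carlo parameters are illustrative choices, not the paper's code) classifies Gaussian draws $(\vec{t}, t_0)$ into the three support regimes at zero margin:

```python
import numpy as np

def support_regime(t_vec, t0, R, kappa=0.0):
    """Support regime of a D-dimensional l2 ball of radius R for a draw
    (t_vec, t0), following Sec. IV B.  For the ball, g(t) = -R ||t||, so
    t_touch = R ||t|| (Eq. 22) and t_fs = -||t|| / R (from Eq. 25).  The ball
    is strictly convex, so no partially supporting regime exists."""
    tnorm = np.linalg.norm(t_vec)
    if t0 - kappa > R * tnorm:      # interior: lambda = 0, k = 0
        return "interior"
    if t0 - kappa > -tnorm / R:     # touching: anchor on the boundary, k = 1
        return "touching"
    return "fully supporting"       # k = D + 1, anchor in the interior

rng = np.random.default_rng(2)
D, R = 2, 1.0
draws = [support_regime(rng.standard_normal(D), rng.standard_normal(), R)
         for _ in range(10_000)]
print({regime: draws.count(regime) / len(draws) for regime in set(draws)})
```

Varying R in this sketch reproduces the qualitative picture discussed next: small balls are almost never fully supporting, while large balls are dominated by the fully supporting regime.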

C. Effects of size and margin

We now discuss the effect of changing manifold size and the imposed margin on both capacity and geometry. As described by Eq. (4), a change in the manifold size corresponds to scaling every $\tilde{s}$ by a scalar r.

Small size:—If $r \to 0$, the manifold shrinks to a point (the center), whose capacity reduces to that of isolated points. However, in the case where $D \gg 1$, the capacity may be affected by the manifold structure even if $r \ll 1$; see Sec. IV F. Nevertheless, the underlying support structure is simple. When r is small, the manifolds have only two support configurations. For $t_0 - \kappa > 0$, the manifold is interior with $\lambda = 0$, $v_0 = t_0$, and $\vec{v} = \vec{t}$. When $t_0 - \kappa < 0$, the manifold becomes touching with support dimension $k = 1$. In that case, because $\tilde{s}$ has a small magnitude, $\vec{v} \approx \vec{t}$, $\lambda \approx \kappa - t_0$, and $v_0 \approx \kappa$. Thus, in both cases, $\vec{v}$ is close to the Gaussian vector $\vec{t}$. The probability of other configurations vanishes.

Large size:—In the large size limit $r \to \infty$, separating the manifolds is equivalent to separating the affine subspaces. As we show in Appendix C, when $r \gg 1$, there are two main support structures. With probability $H(-\kappa) = \int_{-\kappa}^{\infty} Dt_0$, the manifolds are fully supporting; namely, the underlying affine subspaces are parallel to the margin plane. This is the first regime, corresponding to fully supporting manifolds, which contributes to the inverse capacity an amount $H(-\kappa) D + \alpha_0^{-1}(\kappa)$. The second regime corresponds to touching or partially supporting manifolds, such that the angle between the affine subspace and the margin plane is almost zero, and contributes an amount $H(\kappa) D$ to the inverse capacity. Combining the two contributions, we obtain, for large sizes, $\alpha_M^{-1} = D + \alpha_0^{-1}(\kappa)$, consistent with Eq. (8).

Large margin:—For a fixed $\vec{T}$, Eq. (25) implies that larger $\kappa$ increases the probability of being in the supporting regimes. Increasing $\kappa$ also shrinks the magnitude of $\tilde{s}$, according to Eq. (24). Hence, when $\kappa \gg 1$, the capacity becomes similar to that of P random points, and the corresponding capacity is given by $\alpha_M \approx \kappa^{-2}$, independent of manifold geometry.

Manifold centers:—The theory of manifold classification described in Sec. III does not require the notion of a manifold center. However, once we understand how scaling the manifold sizes by a parameter r in Eq. (4) affects their capacity, the center points about which the manifolds are scaled need to be defined. For many geometries, the center is a point of symmetry, such as for an ellipsoid. For general manifolds, a natural definition would be the center of mass of the anchor points $\tilde{S}(\vec{T})$, averaging over the Gaussian measure of $\vec{T}$. We will adopt here a simpler definition for the center, provided by the Steiner point for convex bodies [23], $\vec{S}^0 = (\vec{s}_0, 1)$, with

$$\vec{s}_0 = \langle \nabla g_{\mathcal{S}}(\vec{t}) \rangle_{\vec{t}}, \qquad (27)$$

where the expectation is over the Gaussian measure of $\vec{t} \in \mathbb{R}^D$. This definition coincides with the center of mass of the anchor points when the manifold size is small. Furthermore, it is natural to define the geometric properties of manifolds in terms of centered manifolds, where the manifolds are shifted within their affine subspace so that the center and orthogonal translation vector coincide, i.e., $\vec{s}_0 = \vec{c}$ with $\|\vec{s}_0\| = \|\vec{c}\| = 1$. This means that all lengths are defined relative to the distance of the centers from the origin, and the D-dimensional intrinsic vectors $\vec{s}$ give the offset relative to the manifold center.

D. Manifold anchor geometry

The capacity equation, Eq. (20), motivates defining geometrical measures of the manifolds, which we call "manifold anchor geometry." Manifold anchor geometry is based on the statistics of the anchor points $\tilde{s}(\vec{t}, t_0)$ induced by the Gaussian random vector $(\vec{t}, t_0)$, which are relevant to the capacity in Eq. (20). These statistics are sufficient for determining the classification properties and supporting structures associated with the maximum margin solution. We accordingly define the manifold anchor radius and dimension as follows.

Manifold anchor radius:—Denoted $R_M$, it is defined by the mean squared length of $\tilde{s}(\vec{t}, t_0)$,

$$R_M^2 = \langle \|\tilde{s}(\vec{t}, t_0)\|^2 \rangle_{\vec{t}, t_0}. \qquad (28)$$

Manifold anchor dimension:—$D_M$ is given by

$$D_M = \langle (\vec{t} \cdot \hat{s}(\vec{t}, t_0))^2 \rangle_{\vec{t}, t_0}, \qquad (29)$$

where $\hat{s}$ is a unit vector in the direction of $\tilde{s}$. The anchor dimension measures the angular spread between $\vec{t}$ and the corresponding anchor point $\tilde{s}$ in D dimensions. Note that the manifold dimension obeys $D_M \le \langle \|\vec{t}\|^2 \rangle = D$. Whenever there is no ambiguity, we will call $R_M$ and $D_M$ the manifold radius and dimension, respectively.

These geometric descriptors offer a rich description of the manifold properties relevant for classification. Since $\vec{v}$ and $\tilde{s}$ depend in general on $t_0 - \kappa$, the above quantities are averaged not only over $\vec{t}$ but also over $t_0$. For the same reason, the manifold anchor geometry also depends upon the imposed margin.

E. Gaussian geometry

We have seen that, for small manifold sizes, $\vec{v} \approx \vec{t}$, and the anchor points $\tilde{s}$ can be approximated by $\tilde{s}(\vec{t}) = \nabla g_{\mathcal{S}}(\vec{t})$. Under these conditions, the geometry simplifies, as shown in Fig. 5. For each Gaussian vector $\vec{t} \in \mathbb{R}^D$, $\tilde{s}(\vec{t})$ is the point on the manifold that first touches a hyperplane normal to $-\vec{t}$ as it is translated from infinity towards the manifold. Aside from a set of measure zero, this touching point is a unique point on the boundary of $\mathrm{conv}(\mathcal{S})$. This procedure is similar to that used to define the well-known Gaussian mean width, which in our notation equals $w(\mathcal{S}) = -2 \langle g_{\mathcal{S}}(\vec{t}) \rangle_{\vec{t}}$ [24].
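Because the Gaussian anchor point is simply the minimizer of $\vec{t} \cdot \vec{s}$ over the manifold, the Gaussian radius and dimension defined next [Eqs. (30) and (31) below], together with the mean width, are easy to estimate by Monte Carlo once the manifold is approximated by a finite point sample. A minimal sketch (ours; the function name and the circle sanity check are illustrative choices):

```python
import numpy as np

def gaussian_geometry(pts, n_mc=20_000, seed=3):
    """Monte Carlo estimates of the Gaussian radius and dimension [Eqs. (30)
    and (31) below] and of the Gaussian mean width w(S) = -2 <g_S(t)>, for a
    manifold approximated by a finite point sample `pts` (rows in R^D).
    The Gaussian anchor point for each draw t is argmin_s t . s."""
    rng = np.random.default_rng(seed)
    ts = rng.standard_normal((n_mc, pts.shape[1]))
    overlaps = ts @ pts.T                        # t . s for every draw and point
    anchors = pts[np.argmin(overlaps, axis=1)]   # s_g(t), one anchor per draw
    Rg = np.sqrt(np.mean(np.sum(anchors ** 2, axis=1)))
    unit = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    Dg = np.mean(np.sum(ts * unit, axis=1) ** 2)
    width = -2.0 * np.mean(np.min(overlaps, axis=1))
    return Rg, Dg, width

# Sanity check on a D = 2 circle of radius R: expect R_g = R and D_g = D = 2.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
circle = 5.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(gaussian_geometry(circle))
```

For anisotropic samples (e.g., an ellipse), the same estimator exhibits $D_g < D$, illustrating the gap between the Gaussian dimension and the affine dimension discussed below.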

FIG. 5. Gaussian anchor points, with mapping from $\vec{t}$ to points $\tilde{s}_g(\vec{t})$, showing the relation between $-\vec{t}$ and the point on the manifold that touches a hyperplane orthogonal to $\vec{t}$. The $D = 2$ manifolds shown are (a) circle, (b) $\ell_2$ ellipsoid, (c) polytope manifold. In (c), only for values of $\vec{t}$ of measure zero (when $\vec{t}$ is exactly perpendicular to an edge) will $\tilde{s}_g(\vec{t})$ lie along the edge; otherwise it coincides with a vertex of the polytope. In all cases, $\tilde{s}_g(\vec{t})$ will be in the interior of the convex hulls only for $\vec{t} = 0$. Otherwise, it is restricted to their boundary.

Note that, for small sizes, the touching point does not depend on $t_0$ or $\kappa$, so its statistics are determined only by the shape of $\mathrm{conv}(\mathcal{S})$ relative to its center. This motivates defining a simpler manifold Gaussian geometry, denoted by the subscript g, which highlights its dependence on a D-dimensional Gaussian measure for $\vec{t}$.

Gaussian radius:—Denoted by $R_g$, this measures the mean square amplitude of the Gaussian anchor point, $\tilde{s}_g(\vec{t}) = \nabla g_{\mathcal{S}}(\vec{t})$:

$$R_g^2 = \langle \|\tilde{s}_g(\vec{t})\|^2 \rangle_{\vec{t}}, \qquad (30)$$

where the expectation is over the D-dimensional Gaussian $\vec{t}$.

Gaussian dimension:—$D_g$ is defined by

$$D_g = \langle (\vec{t} \cdot \hat{s}_g(\vec{t}))^2 \rangle_{\vec{t}}, \qquad (31)$$

where $\hat{s}_g$ is the unit vector in the direction of $\tilde{s}_g$. While $R_g^2$ measures the total variance of the manifold, $D_g$ measures the angular spread of the manifold, through the statistics of the angle between $\vec{t}$ and its Gaussian anchor point. Note that $D_g \le \langle \|\vec{t}\|^2 \rangle = D$.

It is important to note that, even in this limit, our geometrical definitions are not equivalent to conventional geometrical measures such as the longest chord or second-order statistics induced from a uniform measure over the boundary of $\mathrm{conv}(\mathcal{S})$. For the special case of D-dimensional $\ell_2$ balls with radius R, $\tilde{s}_g(\vec{t})$ is the point on the boundary of the ball in the direction of $\vec{t}$, so that $R_g = R$ and $D_g = D$. However, for general manifolds, $D_g$ can be much smaller than the manifold affine dimension D, as illustrated in some examples later.

We recapitulate the essential difference between the Gaussian geometry and the full manifold anchor geometry. In the Gaussian case, the radius is an intrinsic property of the shape of the manifold in its affine subspace and is invariant to changing its distance from the origin. Thus, scaling the manifold by a global scale factor r as defined in Eq. (4) results in scaling $R_g$ by the factor r. Likewise, the dimensionality $D_g$ is invariant to a global scaling of the manifold size. In contrast, the anchor geometry does not obey such invariance for larger manifolds. The reason is that the anchor point depends on the longitudinal degrees of freedom, namely, on the size of the manifold relative to the distance from the center. Hence, $R_M$ need not scale linearly with r, and $D_M$ will also depend on r. Thus, the anchor geometry can be viewed as describing the general relationship between the signal (center distance) and noise (manifold variability) in classification capacity. We also note that the manifold anchor geometry automatically accounts for the rich support structure described in Sec. IV D. In particular, as $t_0$ decreases, the statistics of the anchor points change from being concentrated on the boundary of $\mathrm{conv}(\mathcal{S})$ to its interior. Additionally, for manifolds that are not strictly convex and intermediate values of $t_0$, the anchor statistics become concentrated on k-dimensional facets of the convex hull, corresponding to partially supporting manifolds.

We illustrate the difference between the two geometries in Fig. 6 with two simple examples: a $D = 2$ $\ell_2$ ball and a $D = 2$ $\ell_2$ ellipse. In both cases, we consider the distribution of $\|\tilde{s}\|$ and $\theta = \cos^{-1}(-\hat{t} \cdot \hat{s})$, the angle between $-\vec{t}$ and $\tilde{s}$. For the ball with radius r, the vectors $-\vec{t}$ and $\tilde{s}$ are parallel, so the angle is always zero. For the manifold anchor geometry, $\tilde{s}$ may lie inside the ball in the fully supporting region. Thus, the distribution of $\|\tilde{s}\|$ consists of a mixture of a delta function at r, corresponding to the interior and touching regions, and a smoothly varying distribution corresponding to the fully supporting region.

FIG. 6. Distribution of $\|\tilde{s}\|$ (norm of manifold anchor vectors) and $\theta = \angle(-\vec{t}, \tilde{s})$ for $D = 2$ ellipsoids. (a) Distribution of $\tilde{s}$ for an $\ell_2$ ball with $R = 5$. The Gaussian geometry is peaked at $\|\tilde{s}\| = r$ (blue), while $\|\tilde{s}\| < r$ has nonzero probability for the manifold anchor geometry (red). (b),(c) $D = 2$ ellipsoids, with radii $R_1 = 5$ and $R_2 = \frac{1}{2} R_1$. (b) Distribution of $\|\tilde{s}\|$ for Gaussian geometry (blue) and anchor geometry (red). (c) Corresponding distribution of $\theta = \angle(-\vec{t}, \tilde{s})$.

Figure 6 also shows the corresponding distributions for a two-dimensional ellipsoid with major and minor radii $R_1 = r$ and $R_2 = \frac{1}{2} r$. For the Gaussian geometry, the distribution of $\|\tilde{s}\|$ has finite support between $R_1$ and $R_2$, whereas the manifold anchor geometry also has support below $R_2$. Since $\vec{t}$ and $\tilde{s}$ need not be parallel, the distribution
of the angle varies between zero and $\pi/4$, with the manifold anchor geometry being more concentrated near zero due to contributions from the fully supporting regime. In Sec. VI, we will show how the Gaussian geometry becomes relevant even for larger manifolds when the labels are highly imbalanced, i.e., for sparse classification.

F. Geometry and classification of high-dimensional manifolds

In general, linear classification as expressed in Eq. (20) depends on high-order statistics of the anchor vectors $\tilde{s}(\vec{t}, t_0)$. Surprisingly, our analysis shows that, for high-dimensional manifolds, the classification capacity can be described in terms of the statistics of $R_M$ and $D_M$ alone. This is particularly relevant as we expect that, in many applications, the affine dimension of the manifolds is large. More specifically, we define high-dimensional manifolds as manifolds where the manifold dimension is large, i.e., $D_M \gg 1$ (but still finite in the thermodynamic limit). In practice, we find that $D_M \gtrsim 5$ is sufficient. Our analysis below elucidates the interplay between size and dimension, namely, how small $R_M$ needs to be for high-dimensional manifolds to have a substantial classification capacity.

In the high-dimensional regime, the mean-field equations simplify due to self-averaging of terms involving sums of components $t_i$ and $\tilde{s}_i$. The quantity $\vec{t} \cdot \tilde{s}$ that appears in the capacity, Eq. (20), can be approximated as $-\vec{t} \cdot \tilde{s} \approx \kappa_M$, where we introduce the effective manifold margin $\kappa_M = R_M \sqrt{D_M}$. Combined with $\|\tilde{s}\|^2 \approx R_M^2$, we obtain

$$\alpha_M(\kappa) \approx \alpha_0 \left( \frac{\kappa + \kappa_M}{\sqrt{1 + R_M^2}} \right), \qquad (32)$$

where $\alpha_0$ is the capacity for P random points, Eq. (9). To gain insight into this result, we note that the effective margin on the center is its mean distance from the point closest to the margin plane, $\tilde{s}$, which is roughly the mean of $-\vec{t} \cdot \tilde{s} \approx R_M \sqrt{D_M}$. The denominator in Eq. (32) indicates that the margin needs to be scaled by the input norm, $\sqrt{1 + R_M^2}$. In Appendix B, we show that Eq. (32) can also be written as

$$\alpha_M(\kappa) \approx \alpha_{\mathrm{Ball}}(\kappa; R_M, D_M); \qquad (33)$$

namely, the classification capacity of a general high-dimensional manifold is well approximated by that of $\ell_2$ balls with dimension $D_M$ and radius $R_M$.

Scaling regime:—Equation (32) implies that, to obtain a finite capacity in the high-dimensional regime, the effective margin $\kappa_M = R_M \sqrt{D_M}$ needs to be of order unity, which requires the radius to be small, scaling for large $D_M$ as $R_M = O(D_M^{-1/2})$. In this scaling regime, the calculation of the capacity and geometric properties is particularly simple. As argued above, when the radius is small, the components of $\tilde{s}$ are small; hence $\vec{v} \approx \vec{t}$, and the Gaussian statistics for the geometry suffice. Thus, we can replace $R_M$ and $D_M$ in Eqs. (32) and (33) by $R_g$ and $D_g$, respectively. Note that, in the scaling regime, the factor proportional to $1 + R_g^2$ in Eq. (32) is the next-order correction to the overall capacity, since $R_g$ is small. Notably, the margin in this regime, $\kappa_g = R_g \sqrt{D_g}$, is equal to half the Gaussian mean width of convex bodies, $\kappa_g \approx \frac{1}{2} w(\mathcal{S})$ [25]. As for the support structure, since the manifold size is small, the only significant contributions arise from the interior ($k = 0$) and touching ($k = 1$) supports.

Beyond the scaling regime:—When $R_g$ is not small, the anchor geometry, $R_M$ and $D_M$, cannot be adequately described by the Gaussian statistics, $R_g$ and $D_g$. In this case, the manifold margin $\kappa_M = R_M \sqrt{D_M}$ is large, and Eq. (32) reduces to

$$\alpha_M \approx \frac{1 + R_M^{-2}}{D_M}, \qquad (34)$$

where we have used $\alpha_0(\kappa) \approx \kappa^{-2}$ for large margins and assumed that $\kappa \ll \kappa_M$. For strictly convex high-dimensional manifolds with $R_M = O(1)$, only the touching regime ($k = 1$) contributes significantly to the geometry and, hence, to the capacity. For manifolds that are not strictly convex, partially supporting solutions with $k \ll D$ can also contribute to the capacity.

Finally, when $R_M$ is large, fully supporting regimes with $k \approx D$ contribute to the geometry, in which case the manifold anchor dimension approaches the affine dimension, $D_M \to D$, and Eqs. (32) and (34) reduce to $\alpha_M \approx 1/D$, as expected.

V. EXAMPLES

A. Strictly convex manifolds: $\ell_2$ ellipsoids

The family of $\ell_2$ ellipsoids are examples of manifolds that are strictly convex. Strictly convex manifolds have smooth boundaries that do not contain corners, edges, or flats (see Appendix B). Thus, the description of the anchor geometry and support structures is relatively simple. The reason is that the anchor vectors $\tilde{s}(\vec{t}, t_0)$ correspond to interior ($k = 0$), touching ($k = 1$), or fully supporting ($k = D + 1$) configurations, while partial support ($1 < k < D + 1$) is not possible. The $\ell_2$ ellipsoid geometry can be solved analytically; nevertheless, because it has less symmetry than the sphere, it exhibits some salient properties of manifold geometry, such as nontrivial dimensionality and nonuniform measure of the anchor points.

We assume the ellipsoids are centered relative to their symmetry centers, as described in Sec. IV, and they can be parametrized by the set of points $M^\mu = \mathbf{x}_0^\mu + \sum_{i=1}^{D} s_i \mathbf{u}_i^\mu$, where
$$\mathcal{S} = \left\{ \vec{s} \,\Big|\, \sum_{i=1}^{D} \left( \frac{s_i}{R_i} \right)^2 \le 1 \right\}. \qquad (35)$$

The components of $\mathbf{u}_i^\mu$ and of the ellipsoid centers $\mathbf{x}_0^\mu$ are i.i.d. Gaussian distributed with zero mean and variance $1/N$, so that they are orthonormal in the large N limit. The radii $R_i$ represent the principal radii of the ellipsoids relative to the center. The anchor points $\tilde{s}(\vec{t}, t_0)$ can be computed explicitly (details in Appendix B), corresponding to three regimes.

The interior regime ($k = 0$) occurs when $t_0 - \kappa \ge t_{\mathrm{touch}}(\vec{t})$, where

$$t_{\mathrm{touch}}(\vec{t}) = \sqrt{\sum_{i=1}^{D} R_i^2 t_i^2}. \qquad (36)$$

Here, $\lambda = 0$, resulting in zero contribution to the inverse capacity.

The touching regime ($k = 1$) holds when $t_{\mathrm{touch}}(\vec{t}) > t_0 - \kappa > t_{\mathrm{fs}}(\vec{t})$, where

$$t_{\mathrm{fs}}(\vec{t}) = -\sqrt{\sum_{i=1}^{D} \left( \frac{t_i}{R_i} \right)^2}. \qquad (37)$$

Finally, the fully supporting regime ($k = D + 1$) occurs when $t_{\mathrm{fs}}(\vec{t}) > t_0 - \kappa$. The full expression for the capacity for $\ell_2$ ellipsoids is given in Appendix B. Below, we focus on the interesting cases of ellipsoids with $D \gg 1$.

High-dimensional ellipsoids:—It is instructive to apply the general analysis of high-dimensional manifolds to ellipsoids with $D \gg 1$. We will distinguish among different size regimes by assuming that all the radii of the ellipsoid are scaled by a global factor r. In the high-dimensional regime, due to self-averaging, the boundaries of the touching and fully supporting transitions can be approximated by

$$t_{\mathrm{touch}} = \sqrt{\sum_{i=1}^{D} R_i^2}, \qquad (38)$$

$$t_{\mathrm{fs}} = -\sqrt{\sum_{i=1}^{D} R_i^{-2}}, \qquad (39)$$

both independent of $\vec{t}$. Then, as long as $R_i \sqrt{D}$ are not large (see below), $t_{\mathrm{fs}} \to -\infty$, and the probability of fully supporting vanishes. Discounting the fully supporting regime, the anchor vectors $\tilde{s}(\vec{t}, t_0)$ are given by

$$\tilde{s}_i = -\lambda_i t_i, \qquad (40)$$

$$\lambda_i = \frac{R_i^2}{Z (1 + R_i^2)}, \qquad (41)$$

where the normalization factor $Z^2 = \sum_{i=1}^{D} \left[ R_i^2 / (1 + R_i^2) \right]^2$ (see Appendix B). The capacity for high-dimensional ellipsoids is determined via Eq. (32), with manifold anchor radius

$$R_M^2 = \sum_{i=1}^{D} \lambda_i^2 \qquad (42)$$

and anchor dimension

$$D_M = \frac{\left( \sum_{i=1}^{D} \lambda_i \right)^2}{\sum_{i=1}^{D} \lambda_i^2}. \qquad (43)$$

Note that $R_M / r$ and $D_M$ are not invariant to scaling the ellipsoid by a global factor r, reflecting the role of the fixed centers.

The anchor covariance matrix:—We can also compute the covariance matrix of the anchor points. This matrix is diagonal in the principal directions of the ellipsoid, and its eigenvalues are $\lambda_i^2$. It is interesting to compare $D_M$ with a well-known measure of an effective dimension of a covariance matrix, the participation ratio [26,27]. Given a spectrum of eigenvalues of the covariance matrix, in our notation $\lambda_i^2$, we can define a generalized participation ratio as $\mathrm{PR}_q = \left( \sum_{i=1}^{D} \lambda_i^q \right)^2 / \sum_{i=1}^{D} \lambda_i^{2q}$, $q > 0$. The conventional participation ratio uses $q = 2$, whereas $D_M$ uses $q = 1$.

Scaling regime:—In the scaling regime where the radii are small, the radius and dimension are equivalent to the Gaussian geometry:

$$R_g^2 = \frac{\sum_{i=1}^{D} R_i^4}{\sum_{i=1}^{D} R_i^2}, \qquad (44)$$

$$D_g = \frac{\left( \sum_{i=1}^{D} R_i^2 \right)^2}{\sum_{i=1}^{D} R_i^4}, \qquad (45)$$

and the effective margin is given by

$$\kappa_g = \left( \sum_{i=1}^{D} R_i^2 \right)^{1/2}. \qquad (46)$$

As can be seen, $D_g$ and $R_g / r$ are invariant to scaling all the radii by r, as expected. It is interesting to compare the above Gaussian geometric parameters with the statistics induced by a uniform measure on the surface of the ellipsoid. In this case, the covariance matrix has eigenvalues $\lambda_i^2 = R_i^2$, and the total variance is equal to $\sum_i R_i^2$. In contrast, in the Gaussian geometry, the eigenvalues of the covariance matrix $\lambda_i^2$ are proportional to $R_i^4$. This and the
r=1
corresponding expression (44) for the squared radius is (a) (b)
2
1
the result of a nonuniform induced measure on the
0.8 1.5
surface of the ellipse in the anchor geometry, even in its
0.6

Ri
Gaussian limit. 1
Beyond the scaling regime:—When Ri ¼ Oð1Þ, the 0.4
0.5
0.2
high-dimensional ellipsoids all become touching manifolds
since ttouch → −∞ and tfs → ∞. The capacity is small 0
50 100 150 200
0
10
-2
10
-1 0
10 10
1
10
2

because κ M ≫ 1 and is given by Eq. (34) with Eqs. (42) and Dimension r
(43). Finally, when all Ri ≫ 1, we have (c) 200 (d) 1

0.8
1
150
D
R2M ¼ PD
0.6
¼ −2 : ð47Þ

M
−2
hRi ii 100

D
i¼1 Ri 0.4

50 0.2

Here, RM scales with the global scaling factor r, DM ¼ D, 0 0


10-2 100 102
and α−1 ¼ D. Although the manifolds are touching,
-2 -1 0 1 2
10 10 10 10 10
r r
their angle p relative
ffiffiffiffi to the margin plane is near zero. (e1) (e2) (e3)
When Ri ≈ D or larger, the fully supporting transition 0.6
r=0.01
1
r=1
1
r=100
tfs becomes order one, and the probability of fully support-
0.4
ing is significant. 0.5 0.5
In Fig. 7, we illustrate the behavior of high-D- 0.2

dimensional ellipsoids, using ellipsoids with a bimodal 0 0 0


distribution of their principal radii Ri : Ri ¼ r, for 0 1 D+1 0 1 D+1 0 1 D+1

1 ≤ i ≤ 10, and Ri ¼ 0.1r, for 11 ≤ i ≤ 200 [Fig. 7(a)]. FIG. 7. Bimodal l2 ellipsoids. (a) Ellipsoidal radii Ri with
Their properties are shown as a function of the overall scale r ¼ 1. (b) Classification capacity as a function of scaling factor r:
r. Figure 7(b) shows numerical simulations of the capacity, full mean-field theory capacity (blue lines); approximation of the
the full mean-field solution, and the spherical high-dimen- capacity given by equivalent ball with RM and DM (black dashed
sional approximation (with RM and DM ). These calcula- lines); simulation capacity (circles), averaged over five repeti-
tions are all in good agreement, showing the accuracy of the tions, measured with 50 dichotomies per repetition. (c) Manifold
mean-field theory and spherical approximation. As seen in dimension DM as a function of r. (d) Manifold radius RM relative
Figs. 7(b)–7(d), the system is in the scaling regime for to r. (e) Fraction of manifolds with support dimension k for
r < 0.3. In this regime, the manifold dimension is constant different values of r: k ¼ 0 (interior), k ¼ 1 (touching), k ¼
D þ 1 (fully supporting). (e1) Small r ¼ 0.01, where most
and equals Dg ≈ 14, as predicted by the participation ratio,
manifolds are interior or touching. (e2) r ¼ 1, where most
Eq. (45), and the manifold radius Rg is linear with r, as manifolds are in the touching regime. (e3) r ¼ 100, where the
expected from Eq. (44). The ratio ðRg =rÞ ≈ 0.9 is close to fraction of fully supporting manifolds is 0.085, predicted by
pffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P −2
unity, indicating that, in the scaling regime, the system is Hð−tfs Þ ¼ Hð i Ri Þ [Eq. (39)].
dominated by the largest radii. For r > 0.3, the effective
margin is larger than unity, and the system becomes dimension N. In realistic data, it is likely that the data
increasingly affected by the full affine dimensionality of manifolds are technically full rank, i.e., D þ 1 ¼ N, raising
the ellipsoid, as seen by the marked increase in dimension, the question of whether our mean-field theory is still valid
as well as a corresponding decrease in ðRM =rÞ. For larger r, in such cases. We investigate this scenario by computing
DM approaches D and α−1 M ¼ D. Figures 7(e1)–7(e3) show the capacity on ellipsoids containing a realistic distribution
the distributions of the support dimension 0 ≤ k ≤ D þ 1. of radii. We have taken, as examples, a class of images from
In the scaling regime, the interior and touching regimes the ImageNet data set [28], and analyzed the SVD spectrum
each have probability close to 12, and the fully supporting of the representations of these images in the last layer of a
regime is negligible. As r increases beyond the scaling deep convolutional network, GoogLeNet [29]. The com-
regime, the interior probability decreases, and the solution puted radii are shown in Fig. 8(a) and yield a value
is almost exclusively in the touching regime. For very high D ¼ N − 1 ¼ 1023. In order to explore the properties of
values of r, the fully supporting solution gains a substantial such manifolds, we have scaled all the radii by an overall
probability. Note that the capacity decreases to approx- factor r in our analysis. Because of the decay in the
imately D1 for a value of r below which a substantial fraction distribution of radii, the Gaussian dimension for the
of solutions are fully supporting. In this case, the touching ellipsoid is only about Dg ≈ 15, much smaller than D or
ellipsoids all have a very small angle with the margin plane. N, implying that, for small r, the manifolds are effectively
Until now, we have assumed that the manifold affine low dimensional and the geometry is dominated by a
subspace dimension D is finite in the limit of large ambient small number of radii. As r increases above r ≳ 0.03,

031003-13
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)

(a) r=1 (b) fxμ0  Rk uμk ; k ¼ 1; …; Dg. The vectors uμi specify the
0
10
1 10 principal axes of the l1 ellipsoids. For simplicity, we
consider the case of l1 balls when all the radii are equal:
10-1
Ri ¼ R. We will concentrate on the cases when l1 balls are
Ri

100
10
-2 high dimensional; the case for l1 balls with D ¼ 2 was
-1 briefly described in Ref. [15]. The analytical expression of
10
200 400 600 800 1000 the capacity is complex due to the presence of contributions
10-3 10-2 10-1 100 101 102
Dimension r from all types of supports, 1 ≤ k ≤ D þ 1. We address
(c) (d)
important aspects of the high-dimensional solution below.
1000 12 High-dimensional l1 balls, scaling regime:—In the
750 10
scaling regime, we have v⃗ ≈ ⃗ t. In this case, we can write
R /r
8
M

M
the solution for the subgradient as
D

500 6
250 4 
2 −Rsignðti Þ; jti j > jtj j ∀ j ≠ i
0 0 s̃i ð⃗ tÞ ¼ : ð49Þ
10-2 100 102 -3 -2
10 10 10
-1 0 1
10 10 10
2 0 otherwise
r r
In other words, s̃ð⃗ tÞ is a vertex of the polytope correspond-
FIG. 8. Ellipsoids with radii computed from realistic image ing to the component of ⃗ t with the largest magnitude; see
data. (a) SVD spectrum Ri , taken from the readout layer of Fig. 5(c). The components of ⃗ t are i.i.d. Gaussian random
GoogLeNet from a class of ImageNet images with N ¼ 1024 and
variables and, for large D, its maximum component tmax is
D ¼ 1023. The radii are scaled by a factor r: R0i ¼ rRi for pffiffiffiffiffiffiffiffiffiffiffiffiffiffi
(b) Classification capacity as a function of r: full mean-field concentrated around 2 log D. Hence, Dg ¼ hðŝ · ⃗ tÞ2 i ¼
theory capacity (blue lines); approximation of the capacity as that ht2max i ¼ 2 log D, which is much smaller than D. This result
of a ball with RM and DM from the theory for ellipsoids (black is consistent with the fact that the Gaussian
pffiffiffiffiffiffiffiffiffiffiffi mean width of a
dashed lines); simulation capacity (circles), averaged over 5 D-dimensional l1 ball scales with log D and not with D
repetitions, measured with 50 random dichotomies per each [25]. Since all the points have norm R, we have Rg ¼ R,
repetition. (c) Manifold dimension as a function of r. (d) Manifold pffiffiffiffiffiffiffiffiffiffiffiffiffiffi
radius relative to the scaling factor r, RM =r as a function of r. and the effective margin is then given by κ g ¼ R 2 log D,
which is order unity in the scaling regime. In this

κM becomes larger than 1 and the solution leaves the


regime, the capacitypisffiffiffiffiffiffiffiffiffiffiffiffiffi
pffiffiffiffiffiffiffiffiffiffiffiffiffiffi
given ffi by simple relation αM ¼
2
α0 ð½κ þ R 2 log D= 1 þ R Þ.
scaling regime, resulting in a rapid increase in DM and a High-dimensional l1 balls, R ¼ Oð1Þ:—When the
rapid falloff in capacity, as shown in Figs. 8(b) and 8(c). radius R is small as in the scaling regime, the only
Finally, for r > 10, we have α−1 M ≈ DM ≈ D approaching contributing solution is the touching solution (k ¼ 1).
the lower bound for capacity as expected. The agreement When R increases, solutions with all values of k,
between the numerical simulations and the mean-field 1 ≤ k ≤ D þ 1, occur, and the support can be any face
estimates of the capacity illustrates the relevance of the of the convex polytope with dimension k. As R increases,
theory for realistic data manifolds that can be full rank. the probability distribution pðkÞ over k of the solution
shifts to larger values. Finally, for large R, only two regimes
B. Convex polytopes: l1 ellipsoids dominate: fully supporting (k ¼ D þ 1) with probability
The family of l2 ellipsoids represents manifolds that Hð−κÞ and partially supporting with k ¼ D with proba-
are smooth and strictly convex. On the other hand, there bility HðκÞ.
are other types of manifolds whose convex hulls are not We illustrate the behavior of l1 balls with radius r and
strictly convex. In this section, we consider D-dimensional affine dimension D ¼ 100. In Fig. 9, (a) shows the linear
classification capacity as a function of r. When r → 0, the
l1 ellipsoids. They are prototypical of convex polytopes
manifold approaches the capacity for isolated points,
formed by the convex hulls of finite numbers of points
αM ≈ 2, and when r → ∞, αM ≈ ð1=DÞ ¼ 0.01. The
in RN . The D-dimensional l1 ellipsoid is parametrized by
μ μ numerical simulations demonstrate that, despite the differ-
PD fRi μg and specified by the convex set M ¼ x0 þ
radii
ent geometry, the capacity of the l1 polytope is similar to
i¼1 si ui , with that of a l2 ball with radius RM and dimension DM . In (b),
pffiffiffiffi
 X  for the scaling regime when r < 0.1 ¼ ð1= DÞ, we see
D
jsi j RM ≈ r, but when r ≫ 1, RM is much smaller than r,
S ¼ s⃗ j ≤1 : ð48Þ
i¼1
Ri despite the fact that all Ri of the polytope are equal. This is
because when r is not small, the various faces and
Each manifold Mμ ∈ RN is centered at xμ0 and consists of a eventually interior of the polytope contribute to the anchor
convex polytope with a finite number (2D) of vertices: geometry. In (c), we see DM ≈ 2 log D in the scaling

031003-14
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)
P
Mμ ¼ xμ0 þ D μ μ
(a) (b) (c)
i¼1 si ui , where x0 represents the mean (over
θ) of the population response to the object. The different D
components correspond to the different Fourier compo-
nents of the angular response, so that

(d1) (d2) (d3) R


s2n ðθÞ ¼ pnffiffiffi cosðnθÞ
2
R
s2n−1 ðθÞ ¼ pnffiffiffi sinðnθÞ; ð50Þ
2

where Rn is the magnitude of the nth Fourier component for


FIG. 9. Separability of l1 balls. (a) Linear classification 1 ≤ n ≤ D2 . The neural responses in Eq. (1) are determined
capacity of l1 balls as a function of radius r with D ¼ 100 by projecting onto the basis:
and N ¼ 200: MFT solution (blue lines), spherical approximation
(black dashed lines), full numerical simulations (circles). Inset: pffiffiffi
Illustration of l1 ball. (b) Manifold radius RM relative to the uμi
2n ¼ 2 cosðnθμi Þ
pffiffiffi
actual radius r. (c) Manifold dimension DM as a function of r. In uμi
2n−1 ¼ 2 sinðnθμi Þ: ð51Þ
the small r limit, DM is approximately 2 log D, while, in large r,
DM is close to D, showing how the solution is orthogonal to the
manifolds when their sizes are large. (d1)–(d3) Distribution of The parameters θμi are the preferred orientation angles for
support dimensions: (d1) r ¼ 0.001, where most manifolds are the corresponding neurons and are assumed to be evenly
either interior or touching; (d2) r ¼ 2.5, where the support distributed between −π ≤ θμi ≤ π (for simplicity, we
dimension has a peaked distribution; (d3) r ¼ 100, where most assume that the orientation tunings of the neurons are all
manifolds are close to being fully supporting. of the same shape and are symmetric around the preferred
angle). The statistical assumptions of our analysis assume
regime, while DM → D as r → ∞. In terms of the support that the different manifolds are randomly positioned and
structures, when r ¼ 0.001 in the scaling regime (d1), most oriented with respect to the others. For the ring manifold
manifolds are either interior or touching. For intermediate model, this implies that the mean responses xμ0 are
sizes (d2), the support dimension is peaked at an inter- independent random Gaussian vectors and also that the
mediate value, and, finally, for very large manifolds (d3), preferred orientation θμi angles are uncorrelated. With this
most polytope manifolds are nearly fully supporting. definition, all the vectors s⃗ ∈ S obey the normalization
PD2
k⃗sk ¼ r, where r2 ¼ n¼1 R2n . Thus, for each object, the
C. Smooth nonconvex manifolds: Ring manifolds ring manifold is a closed nonintersecting smooth curve
Many neuroscience experiments measure the responses residing on the surface of a D-dimensional sphere with
of neuronal populations to a continuously varying stimulus, radius r.
with one or a small number of degrees of freedom. A For the simplest case with D ¼ 2, the ring manifold is
prototypical example is the response of neurons in visual equivalent to a circle in two dimensions. However, for
cortical areas to the orientation (or direction of movement) larger D, the manifold is not convex and its convex hull is
of an object. When a population of neurons responds both composed of faces with varying dimensions. In Fig. 10, we
to an object identity and to a continuous physical variation, investigate the geometrical properties of these manifolds
the result is a set of smooth manifolds, each parametrized relevant for classification as a function of the overall scale
by a single variable, denoted θ, describing continuous factor r, where for simplicity we have chosen Rn ¼ r for all
curves in RN . Since, in general, the neural responses are not n. The most striking feature is the small dimension in the
linear in θ, the curve spans more than one linear dimension. scaling regime, scaling roughly as Dg ≈ 2 log D. This
This smooth curve is not convex and is endowed with a logarithmic dependence is similar to that of the l1 ball
complex nonsmooth convex hull. It is, thus, interesting to polytopes. Then, as r increases, DM increases dramatically
consider the consequences of our theory on the separability from 2 log D to D.
of smooth but nonconvex curves. The similarity of the ring manifold to a convex polytope
The simplest example considered here is the case is also seen in the support dimension k of the manifolds.
where θ corresponds to a periodic angular variable such Support faces of dimension k ≠ 0, 1, D þ 1 are seen,
as the orientation of an image, and we call the resulting implying the presence of partially supporting solutions.
nonconvex curve a ring manifold. We model the Interestingly, ðD=2Þ < k ≤ D are excluded, indicating that
neuronal responses as smooth periodic functions of θ, the maximal face dimension of the convex hull is ðD=2Þ.
which can be parametrized by decomposing the neuronal Each face is the convex hull of a set of k ≤ ðD=2Þ points,
responses in Fourier modes. Here, for each object, where each point resides in the 2D subspace spanned by a

031003-15
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)

D=20, Uniform R n
(a)
2
(b) demonstrate that, in many cases, when the size is smaller
20
than the crossover value, the manifold dimensionality is
1.5 15 substantially smaller than that expected from naive second-

DM
1 10 order statistics, highlighting the saliency and significance
0.5 5 of our anchor geometry.
0 0
-2 -1 0 1 2 1 2
10 10 10 10 10 10 -2 10 -1 10 0 10 10 VI. MANIFOLDS WITH SPARSE LABELS
r r
(c) 1 (d) So far, we have assumed that the number of manifolds
12
0.8
with positive labels is approximately equal to the number of
10 manifolds with negative labels. In this section, we consider
RM /r

0.6 DM
0.4 8 the case where the two classes are unbalanced such that the
0.2 6
number of positively labeled manifolds is far less than
the negatively labeled manifolds (the opposite scenario is
0
10-2 10-1 100 101 102 6 8 10 12 equivalent). This is a special case of the problem of
r classification of manifolds with heterogeneous statistics,
(e1) (e2) (e3) where manifolds have different geometries or label statis-
0.6 tics. We begin by addressing the capacity of mixtures of
0.4 0.3
0.4 manifolds and then focus on sparsely labeled manifolds.
0.2
0.2 0.2
0.1
A. Mixtures of manifold geometries
0 0 0
0 10 20 0 10 20 0 10 20
Our theory of manifold classification is readily extended
Support dimension (k) Support dimension (k) Support dimension (k)
to a heterogeneous ensemble of manifolds, consisting of
FIG. 10. Linear classification L distinct classes. In the replica theory, the shape of the
pffiffiffiffiffiffiffiffiffiffiffiffiffi of D-dimensional ring manifolds, manifolds appear only in the free energy term G1 [see
with uniform Rn ¼ ð2=DÞr. (a) Classification capacity for D ¼
20 as a function of r with m ¼ 200 test samples (for details of Appendix, Eq. (A13)]. For a mixture statistics, the com-
numerical simulations, see SM [16], Sec. S9): mean-field theory bined free energy is given by simply averaging the
(blue lines), spherical approximation (black dashed lines), individual free energy terms over each class l. Recall that
numerical simulations (black circles). Inset: Illustration of a ring this free energy term determines the capacity for each
manifold. (b) Manifold dimension DM , which shows DM ∼ D in shape, giving an individual inverse critical load α−1l . The
the large r limit, showing the orthogonality of the solution. inverse capacity of the heterogeneous mixture is then
(c) Manifold radius relative to the scaling factor r, ðRM =rÞ, as a
function of r. The fact that ðRM =rÞ becomes small implies that the α−1 ¼ hα−1
manifolds are “fully supporting” to the hyperplane, showing the l il ; ð52Þ
small radius structure. (d) Manifold dimension p grows with affine
ffiffiffiffiffiffiffiffiffiffiffiffiffiffi where the average is over the fractional proportions of the
dimension, D, as 2 log D for small r ¼ ð1=2 2 log DÞ in the
scaling regime. (e1)–(e3) Distribution of support dimensions. different manifold classes. This remarkably simple but
(d1) r ¼ 0.01, where most manifolds are either interior or generic theoretical result enables analyzing diverse mani-
touching; (d2) r ¼ 20, where the support dimension has a peaked fold classification problems, consisting of mixtures of
distribution, truncated at ðD=2Þ; (d3) r ¼ 200, where most manifolds with varying dimensions, shapes, and sizes.
support dimensions are ðD=2Þ or fully supporting at D þ 1. Equation (52) is adequate for classes that differ in their
geometry, independent of the assigned labels. In general,
classes may differ in the label statistics (as in the sparse case
pair of (real and imaginary) Fourier harmonics. The ring studied below) or in geometry that is correlated with labels.
manifolds are closely related to the trigonometric moment For instance, the positively labeled manifolds may consist
curve, whose convex hull geometrical properties have been of one geometry, and the negatively labeled manifolds may
extensively studied [30,31]. have a different geometry. How do structural differences
In conclusion, the smoothness of the convex hulls between the two classes affect the capacity of the linear
becomes apparent in the distinct patterns of the support classification? A linear classifier can take advantage of
dimensions [compare Figs. 8(d) and 10(e)]. However, we these correlations by adding a nonzero bias. Previously, it
see that all manifolds with Dg ∼ 5 or larger share common was assumed that the optimal separating hyperplane passes
trends. As the size of the manifold increases, the capacity through the origin; this is reasonable when the two classes
and geometry vary smoothly, exhibiting a smooth crossover are statistically the same. However, when there are stat-
from a high capacity with low radius and dimension istical differences between the label assignments of the two
to a low capacity with large radius
pffiffiffiffi and dimension. This classes, Eq. (2) should be replaced by yμ ðw · xμ − bÞ ≥ κ,
crossover occurs as Rg ∝ 1= Dg . Also, our examples where the bias b is chosen to maximize the mixture

031003-16
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)

capacity, Eq. (52). The effect of optimizing the bias is seen in the case of sparsely labeled l2 balls (Appendix E).
discussed in detail in the next section for the sparse labels Here, we summarize the main results for general manifolds.
and in other scenarios in the SM [16] (Sec. S3). Sparsity and size:—There is a complex interplay
between label sparsity and manifold size. Our analysis
B. Manifolds with sparse labels yields three qualitatively different regimes.
Low Rg :—When the Gaussian radius of the manifolds are
We define the sparsity parameter f as the fraction of small, i.e., Rg < 1, the extent of the manifolds is noticeable
positively labeled manifolds so that f ¼ 0.5 corresponds to only when the dimension is high. Similar to our previous
having balanced labels. From the theory of the classifica- analysis of high-dimensional manifolds, we find here that the
tion of a finite set of random points, it is known that having sparse capacity is equivalent to the capacity of sparsely
sparse labels with f ≪ 0.5 can drastically increase the labeled
capacity [12]. In this section, we investigate how the pffiffiffiffiffiffirandom points, with an effective margin given by
Rg Dg ,
sparsity of manifold labels improves the manifold classi-
fication capacity. pffiffiffiffiffiffi
If the separating hyperplane is constrained to go through αM ðfÞ ≈ α0 ðf; κ ¼ Rg Dg Þ; ð55Þ
the origin and the distribution of inputs is symmetric
around the origin, the labeling yμ is immaterial to the where α0 ðκ; fÞ ¼ maxb α0 ðκ; f; bÞ from the Gardner theory.
capacity. Thus, the effect of sparse labels is closely tied to It should be noted that κ g in the above equation has a
having a nonzero bias. We thus consider inequality con- noticeable effect only for moderate sparsity. It has a negli-
straints of the form yμ ðw · xμ − bÞ ≥ κ, and we define the gible effect when f → 0, since the bias is large and dominates
bias-dependent capacity of general manifolds with label over the margin.
sparsity f, margin κ, and bias b, as αM ðκ; f; bÞ. Next, we Moderate sizes, Rg > 2:—In this case, the equivalence to
observe that the bias acts as a positive contribution to the the capacity of points breaks down. Remarkably, we find
margin for the positively labeled population and as a that the capacity of general manifolds with substantial size
negative contribution to the negatively labeled population. is well approximated by that of equivalent l2 balls with the
Thus, the dependence of αM ðκ; f; bÞ on both f and b can be same sparsity f and with dimension and radius equal to the
expressed as Gaussian dimension and radius of the manifolds, namely,

α−1 −1 −1 αM ðfÞ ≈ αBall ðf; Rg ; Dg Þ: ð56Þ


M ðκ; f; bÞ ≡ fαM ðκ þ bÞ þ ð1 − fÞαM ðκ − bÞ; ð53Þ

where αM ðxÞ is the classification capacity with zero bias Surprisingly, unlike the nonsparse approximation, where
(and, hence, equivalent to the capacity with f ¼ 0.5) for the the equivalence of general manifold to balls, Eq. (33), is
same manifolds. Note that Eq. (53) is similar to Eq. (52) for valid only for high-dimensional manifolds, in the sparse
mixtures of manifolds. The actual capacity with sparse limit, the spherical approximation is not restricted to large
labels is given by optimizing the above expression with D. Another interesting result is that the relevant statistics
respect to b, i.e., are given by the Gaussian geometry, Rg and Dg , even when
Rg is not small. The reason is that, for small f, the bias is
αM ðκ; fÞ ¼ maxb αM ðκ; f; bÞ: ð54Þ large. In that case, the positively labeled manifolds have
large positive margin b and are fully supporting giving a
In the following, we consider for simplicity the effect of contribution to the inverse capacity of b2 , regardless of their
sparsity for zero margin, κ ¼ 0. detailed geometry. On the other hand, the negatively
Importantly, if D is not large, the effect of the manifold labeled manifolds have large negative margin, implying
geometry in sparsely labeled manifolds can be much larger that most of them are far from the separating plane (interior)
than that for nonsparse labels. For nonsparse labels, the and a small fraction have touching support. The fully
capacity ranges between 2 and ðD þ 0.5Þ−1 , while for sparse supporting configurations have negligible probability;
manifolds, the upper bound can be much larger. Indeed, hence, the overall geometry is well approximated by the
small-sized manifolds are expected to have capacity that Gaussian quantities Dg and Rg .
increases upon decreasing f as αM ð0; fÞ ∝ ð1=fj log fjÞ, Scaling relationship between sparsity and size:—A
similar to P uncorrelated points [12]. This potential increase further analysis of the capacity of balls with sparse labels
in the capacity of sparse labels is, however, strongly con- shows it retains a simple form αðf̄; DÞ [see Appendix E,
strained by the manifold size, since, when the manifolds are Eq. (E10)], which depends on R and f only through the
large, the solution has to be orthogonal to the manifold scaled sparsity f̄ ¼ fR2. The reason for the scaling of f with
directions so that αM ð0; fÞ ≈ ð1=DÞ. Thus, the geometry of R2 is as follows. When the labels are sparse, the dominant
the manifolds plays an important role in controlling the contribution to the inverse capacity comes from the minority
effect of sparse labels on capacity. These aspects are already class, and so the capacity is α−1 2
Ball ≈ fb for large b. On the

031003-17
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)

other hand, the optimal value of b depends on the balance scaled sparsity f̄ ¼ fð1 þ R2 Þ. In each example, there is a
between the contributions from both classes and scales good agreement between the three calculations for the
linearly with R, as it needs to overcome the local fields from range of f̄ < 1. Furthermore, the drop in α with increasing
the spheres. Thus, α−1 ∝ fR2 . f̄ is similar in all cases, except for an overall vertical shift,
Combining Eq. (E10) with Eq. (56) yields, for general which is due to the different Dg , similar to the effect of
sparsely labeled manifolds, dimension in l2 balls [Fig. 11(a)]. In the regime of
Z ∞ moderate radii, results for different f and r all fall on a
α−1
M ¼ f̄ b̄2þ dtχ Dg ðtÞðt − b̄Þ2 ; ð57Þ universal curve, which depends only on f̄, as predicted by

the theory. For small r, the capacity deviates from this
where scaled sparsity is f̄ ¼ fð1 þ R2g Þ ≲ 1. scaling, as it is dominated by f alone, similar to sparsely
Note that we have defined the scaled sparsity f̄ ¼ labeled points. When f̄ > 1, the curves deviate from the
fð1þR2g Þ rather than fR2 to yield a smoother crossover spherical approximation. The true capacity (as revealed by
to the small Rg regime. Similarly, we define the optimal simulations and full mean field) rapidly decreases with f̄
qffiffiffiffiffiffiffiffiffiffiffiffiffiffi
scaled bias as b̄ ¼ b 1 þ R2g ≫ 1. Qualitatively, the func- and saturates at ð1=DÞ, rather than to the ð1=Dg Þ limit of the
spherical approximation. Finally, we note that our choice of
tion α is roughly proportional to f̄ −1 with a proportionality parameters in Fig. 11(c) (SM [16] Sec. VII) was such that
constant that depends on Dg . In the extreme limit of Rg (entering in the scaled sparsity) was significantly
j log f̄j ≫ Dg , we obtain the sparse limit α ∝ jf̄ log f̄j−1 . different from simply an average radius. Thus, the agree-
Note that, if f is sufficiently small, the gain in capacity due ment with the theory illustrates the important role of the
to sparsity occurs even for large manifolds as long as f̄ < 1. Gaussian geometry in the sparse case.
Large Rg regime, f̄ > 1:—Finally, when Rg is suffi- As discussed above, in the (high and moderate) sparse
ciently large such that f̄ increases above 1, b̄ is of order 1 or regimes, a large bias alters the anchor geometry of the two
smaller, and the capacity is small with a value that depends classes in different ways. To illustrate this important aspect,
on the detailed geometry of the manifold. In particular, we show in Fig. 12 the effect of sparsity and bias on the
when f̄ ≫ 1, the capacity of the manifold approaches
αM ðfÞ → ð1=DÞ, not ð1=Dg Þ.
(a) (b)
To demonstrate these remarkable predictions, Eq. (56),
we present in Fig. 11 the capacity of three classes of
sparsely labeled manifolds: l2 balls, l1 ellipsoids, and ring
manifolds. In all cases, we show the results of numerical
simulations of the capacity, the full mean-field solution,
and the spherical approximation, Eq. (56), across several
orders of magnitude of sparsity and size as a function of the

(a) (b) (c) (c) (d)

FIG. 11. (a),(b) Classification of l2 balls with sparse labels.


(a) Capacity of l2 balls as a function of f̄ ¼ fð1 þ r2 Þ, for D ¼
100 (red) and D ¼ 10 (blue) mean-field theory results. Overlaid FIG. 12. Manifold configurations and geometries for classifi-
black dotted lines represent an approximation interpolating cation of l1 ellipsoids with sparse labels, analyzed separately in
between Eqs. (55) and (57) (details in SM [16], Sec. S9). (b), terms of the majority and minority classes. The radii for l1
(c) Classification of general manifolds with sparse labels. ellipsoids are Ri ¼ r for 1 ≤ i ≤ 10 and Ri ¼ 12 r for
(b) Capacity of l1 ellipsoids with D ¼ 100, where the first 10 11 ≤ i ≤ 100, with r ¼ 10. (a) Histogram of support dimensions
components are equal to r, and the remaining 90 components are for moderate sparsity f ¼ 0.1: minority (blue) and majority (red)
1 2 −3
2 r, as a function of f̄ ¼ fð1 þ Rg Þ. r is varied from 10 to 103 : manifolds. (b) Histogram of support dimensions for high sparsity
numerical simulations (circles), mean-field theory (lines), spheri- f ¼ 10−3 : minority (blue) and majority (red) manifolds. (c) Mani-
cal approximation (dotted lines). (c) Capacity of ring manifolds fold dimension as a function of f̄ ¼ fð1 þ R2g Þ as f is varied:
with a Gaussian falloff spectrum, with σ ¼ 0.1 and D ¼ 100 minority (blue) and majority (red) manifolds. (d) Manifold radius
(details in SM [16]). The Fourier components in these manifolds relative to the scaling factor r as a function of f̄: minority (blue)
have a Gaussian falloff, i.e., Rn ¼ A exp ð− 12 ½2πðn − 1Þσ2 Þ. and majority (red) manifolds.

031003-18
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)

geometry of the l1 ellipsoids studied in Fig. 11(c). Here, This new theory is important for the understanding of
we show the evolution of RM and DM for the majority and how sensory neural systems perform invariant perceptual
minority classes as f̄ increases. Note that, despite the fact discrimination and recognition tasks of realistic stimuli.
that the shape of the manifolds is the same for all f, their Furthermore, beyond estimating the classification capacity,
manifold anchor geometry depends on both class member- our theory provides theoretically based geometric measures
ship and sparsity levels, because these measures depend on for assessing the quality of the neural representations of the
the margin. When f̄ is small, the minority class has DM ¼ D, perceptual manifolds. There are a variety of ways for
as seen in Fig. 12(c); i.e., the minority class manifolds are defining the geometry of manifolds. Our geometric mea-
close to be fully supporting due to the large positive margin. sures are unique in that they determine the ability to linearly
This can also be seen in the distributions of support separate the manifold, as our theory shows.
dimension shown in Figs. 12(a) and 12(b). On the other Our theory focuses on linear classification. However, it
hand, the majority class has DM ≈ Dg ¼ 2 log D1 ≪ D, has broad implications for nonlinear systems and, in
and these manifolds are mostly in the interior regime. As particular, for deep networks. First, most models of sensory
f increases, the geometrical statistics for the two classes discrimination and recognition in biological and artificial
become more similar. This is seen in Figs. 12(c) and 12(d), deep architectures model the readouts of the networks as
where DM and RM for both majority and minority classes linear classifiers operating on the top sensory layers. Thus,
converge to the zero margin value for large f̄ ¼ fð1 þ R2g Þ. our manifold classification capacity and geometry can be
applied to understand the performance of the deep network.
Furthermore, the computational advantage of having multi-
VII. SUMMARY AND DISCUSSION ple intermediate layers can be assessed by comparing the
Summary.— We have developed a statistical mechanical performance of a hypothetical linear classifier operating on
theory for linear classification of inputs organized in these layers. In addition, the changes in the quality of
perceptual manifolds, where all points in a manifold share representations across the deep layers can be assessed by
the same label. The notion of perceptual manifolds is comparing the changes in the manifold’s geometries across
critical in a variety of contexts in computational neurosci- layers. Indeed, previous discussions of sensory processing
ence modeling and in signal processing. Our theory is not in deep networks hypothesized that neural object repre-
restricted to manifolds with smooth or regular geometries; sentations become increasingly untangled as the signal
it applies to any compact subset of a D-dimensional affine propagates along the sensory hierarchy. However, no
subspace. Thus, the theory is applicable to manifolds concrete measure of untangling has been provided. The
arising from any variation in neuronal responses with a geometric measures derived from our theory can be used to
continuously varying physical variable, or from any quantify the degree of entanglement, tracking how the
sampled set arising from experimental measurements on perceptual manifolds are nonlinearly reformatted as they
a limited number of stimuli. propagate through the multiple layers of a neural network
The theory describes the capacity of a linear classifier to to eventually allow for linear classification in the top layer.
separate a dichotomy of general manifolds (with a given Notably, the statistical, population-based nature of our
margin) by a universal set of mean-field equations. These geometric measures renders them ideally suited for com-
equations may be solved analytically for simple geom- parison between layers of different sizes and nonlinearities,
etries, but, for more complex geometries, we have devel- as well as between different deep networks or between
oped iterative algorithms to solve the self-consistent artificial networks and biological ones. Lastly, our theory
equations. The algorithms are efficient and converge fast, can suggest new algorithms for building deep networks, for
as they involve only solving for OðDÞ variables of a single instance, by imposing successive reduction of manifold
manifold, rather than invoking simulations of a full system dimensions and radii as part of an unsupervised learning
of P manifolds embedded in RN . strategy.
Applications:—The statistical mechanical theory of per- We have discussed above the domain of visual object
ceptron learning has long provided a basis for under- classification and recognition, which has received immense
standing the performance and fundamental limitations of attention in recent years. However, we would like to
single-layer neural architectures and their kernel exten- emphasize that our theory can be applied for modeling
sions. However, the previous theory only considered a neural sensory processing tasks in other modalities. For
finite number of random points, with no underlying geo- instance, it can be used to provide insight on how the
metric structure, and could not explain the performance of olfactory system performs discrimination and recognition
linear classifiers on a large, possibly infinite number of of odor identity in the presence of orders-of-magnitude
inputs organized as distinct manifolds by the variability due variations in odor concentrations. Neuronal responses to
to changes in the physical parameters of objects. The theory sensory signals are not static, but vary in time. Our theory
presented in this work can explain the capacity and can be applied to explain how the brain correctly decodes
limitations of the linear classification of general manifolds. the stimulus identity despite its temporal nonstationarity.

031003-19
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)

Some of these applications may require further extensions In conclusion, we believe the application of this theory
of the present theory. The most important ones (currently the and its corollary extensions will precipitate novel insights
subject of ongoing work) include the following. into how perceptual systems, biological or artificial, effi-
Correlations:— In the present work, we have assumed ciently code and process complex sensory information.
that the directions of the affine subspaces of the different
manifolds are uncorrelated. In realistic situations, we ACKNOWLEDGMENTS
expect to see correlations in the manifold geometries,
We would like to thank Uri Cohen, Ryan Adams, Leslie
mainly of two types. One is center-center correlations.
Valiant, David Cox, Jim DiCarlo, Doris Tsao, and Yoram
Such correlations can be harmful for linear separability Burak for helpful discussions. The work is partially sup-
[32,33]. Another is correlated variability, in which the ported by the Gatsby Charitable Foundation, the Swartz
directions of the affine subspaces are correlated but not the Foundation, the Simons Foundation (SCGB Grant
centers. Positive correlations of the latter form are benefi- No. 325207), the National Institutes of Health (Grant
cial for separability. In the extreme case when the manifolds No. 1U19NS104653), the MAFAT Center for Deep
share a common affine subspace, the rank of the union of Learning, and the Human Frontier Science Program
the subspaces is Dtot ¼ D, rather than Dtot ¼ PD, and the (Project No. RGP0015/2013). D. D. Lee also acknowledges
solution weight vector need only lie in the null space of this the support of the U.S. National Science Foundation, Army
smaller subspace. Further work is needed to extend the Research Laboratory, Office of Naval Research, Air Force
present theory to incorporate more general correlations. Office of Scientific Research, and Department of
Generalization performance:— We have studied the Transportation.
separability of manifolds with known geometries. In many
realistic problems, this information is not readily available APPENDIX A: REPLICA THEORY OF
and only samples reflecting the natural variability of input MANIFOLD CAPACITY
patterns are provided. These samples can be used to
estimate the underlying manifold model (using manifold In this section, we outline the derivation of the mean-
learning techniques [34,35]) and/or to train a classifier field replica theory summarized in Eqs. (11) and (12). We
based upon a finite training set. Generalization error define the capacity of linear classification of manifolds,
describes how well a classifier trained on a finite number αM ðκÞ, as the maximal load, α ¼ ðP=NÞ, for which with
of samples would perform on other test points drawn from high probability a solution to yμ w · xμ ≥ κ exists for a given
the manifolds [36]. It would be important to extend our κ. Here, xμ are points on the P manifolds Mμ , Eq. (1), and
theory to calculate the expected generalization error we assume that all NPðD þ 1Þ components of fuμi g are
achieved by the maximum margin solution trained on point drawn independently from a Gaussian distribution with
cloud manifolds, as a function of the size of the training set zero mean and variance ð1=NÞ, and that the binary labels
and the geometry of the underlying full manifolds. yμ ¼ 1 are randomly assigned to each manifold with
Unrealizable classification:—Throughout the present equal probabilities. We consider the thermodynamic limit,
work, we have assumed that the manifolds are separable where N, P → ∞, but α ¼ ðP=NÞ, and D is finite.
by a linear classifier. In realistic problems, the load Note that the geometric margin κ0 , defined as the distance
μ μ 0
may be above the capacity for linear separation, i.e., from
0
pffiffiffiffithe solution hyperplane, is given by y w·x ≥κ kwk¼
α > αM ðκ ¼ 0Þ. Alternatively, neural noise may cause κ N . However, this distance depends on the scale of the
the manifolds to be unbounded in extent, with the tails input vectors xμ . The correct scaling ofpthe ffiffiffiffi margin in the
of their distribution overlapping so that they are not thermodynamic limit is κ 0 ¼ ðkxk= N Þκ. Since we
separable with zero error. There are several ways to handle adopted the normalization of kxμ k ¼ Oð1Þ, the correct
this issue in supervised learning problems. One possibility scaling of the margin is yμ w · xμ ≥ κ.
is to nonlinearly map the unrealizable inputs to a higher- Evaluation of solution volume:—Following Gardner’s
dimensional feature space, via a multilayer network or replica framework, we first consider the volume Z of the
nonlinear kernel function, where the classification can be solution space for α < αM ðκÞ. We define the signed projec-
performed with zero error. The design of multilayer net- tions of the ith direction vector uμi on the solution weight as
pffiffiffiffi
works could be facilitated using manifold processing Hμi ¼ N yμ w · uμi , where i ¼ 1;…; D þ 1 and μ ¼ 1; …; P.
principles uncovered by our theory. Then, the separability constraints can be written as
Another possibility is to introduce an optimization PDþ1 μ
i¼1 Si H i ≥ κ. Hence, the volume can be written as
problem allowing a small training error, for example, using Z
a SVM with complementary slack variables [14]. These ⃗ μ Þ − κÞ;
Z ¼ dN wδðw2 − NÞΠPμ¼1 Θμ ðgS ðH ðA1Þ
procedures raise interesting theoretical challenges, includ-
ing understanding how the geometry of manifolds changes
as they undergo nonlinear transformations, as well as where ΘðxÞ is a Heaviside step function. gS is the
investigating, by statistical mechanics, the performance ⃗ ¼
support function of S defined for Eq. (12) as gS ðVÞ
of a linear classifier of manifolds with slack variables [37]. ⃗ S⃗ ∈ Sg.
minS⃗ fV⃗ · Sj

031003-20
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)

The volume defined above depends on the quenched and we have used the fact that all manifolds contribute the
random variables uμi and yμ through H μi . It is well known same factor.
that, in order to obtain the typical behavior in the thermo- We proceed by making the replica symmetric ansatz
dynamic limit, we need to average log Z, which we carry on the order parameter qαβ at its saddle point, qαβ ¼
out using the replica trick, hlog Zi ¼ limn→0 ðhZn i − 1Þ=n, ð1 − qÞδαβ þ q, from which one obtains, in the n → 0 limit,
where hi refers to the average over uμi and yμ . For natural n,
1 q
we need to evaluate q−1
αβ ¼ δαβ − ðA8Þ
Z Y 1−q ð1 − qÞ2
n YP Z
hZn i ¼ dwα δðw2α − NÞ DH⃗ μα
and
α μ
DY  nq
þ1 pffiffiffiffiffiffi μα log det q ¼ n logð1 − qÞ þ : ðA9Þ
μ T μ
× 2π δðHi − y wα ui Þ ; ðA2Þ 1−q
i uμi ;yμ
Thus, the exponential term in X can be written as
where we have used the notation
 pffiffiffi X 2

1 X ðHαi Þ2 1 X q α
DH Dþ1 dH i
⃗ ¼ Πi¼1 ⃗ − κÞ:
pffiffiffiffiffiffi ΘðgS ðHÞ ðA3Þ exp − þ H : ðA10Þ
2 αi 1 − q 2 i 1 − q α i

Using a Fourier representation of the delta functions, we Using the Hubbard-Stratonovich transformation, we
obtain obtain
Z Y Z Z  pffiffiffi 
n
n P Z
Y 1 H⃗ 2 q
X ¼ DT ⃗ ⃗
DH exp − þ ⃗ ⃗
hZn i ¼ dwα δðw2α − NÞ DH⃗ μα 21 − q 1 − q
H·T ;
α μ
þ1 Z
ðA11Þ
Y
D
dĤμα
ffi hexpfiĤ μα
pffiffiffiffiffi μα μ T μ pffiffiffiffiffiffi
i ðH i − y wα ui Þgiuμi ;yμ :
i
×
2π where DT⃗ ¼ Πi ðdT i = 2π Þ exp ½−ðT 2i =2Þ. Completing
i¼1 R
the square in the exponential and using ⃗ n¼
DTA
ðA4Þ R
exp n DT⃗ log A in the n → 0 limit, we obtain X ¼
R
Performing the average over the Gaussian distribution of exp f½nqðD þ 1Þ=2ð1 − qÞg þ n DT⃗ log zðTÞÞ,⃗ with
uμi (each of the N components has zero mean and variance Z n 1 o
ð1=NÞ) yields ⃗ ¼ DH
zðTÞ ⃗ exp − ⃗ − pffiffiffi
jjH ⃗ 2 :
qTjj ðA12Þ
 2ð1 − qÞ
Dþ1 X
X  X
N 

μα μ j μ
exp μα
iĤi −y wα ui;j Combining these terms, we write the last factor in
i¼1 j¼1 uμi ;yμ Eq. (A6) as exp nPG1 , where
 
1 X X μα μβ Z
¼ exp − qαβ Ĥi Ĥi ; ðA5Þ ðD þ 1Þ
2 αβ G1 ¼ DT⃗ log zðTÞ
⃗ − logð1 − qÞ: ðA13Þ

2
PN
where qαβ ¼ ð1=NÞ j¼1 wjα wjβ . Thus, integrating the var- The first factors in hZn i, Eq. (A6), can be written as
iables Ĥμα
i yields
exp nNG0 , whereas, in the Gardner theory, the entropic
term in the thermodynamic limit is
Z Y
n Z
hZn i ¼ dwα δðw2α − NÞ dqαβ Παβ 1 q
α¼1 G0 ðqÞ ¼ lnð1 − qÞ þ ðA14Þ
 
P 2 2ð1 − qÞ
ðD þ 1Þ
· δðNqαβ − wα wβ Þ exp −
T
logdetq X ;
2 and represents the constraints on the volume of wα due to
normalization and the order parameter q. Combining the G0
ðA6Þ
and G1 contributions, we have
where
hZn it0 ;t ¼ eNn½G0 ðqÞþαG1 ðqÞ : ðA15Þ
Z Y

⃗ α 1 X α −1 β
X¼ DH exp − H ðq Þαβ Hi ; ðA7Þ The classification constraints contribute αG1 , with
α
2 i;α;β i
Eq. (A13), and

031003-21
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)
Z
⃗ ¼ dY i APPENDIX B: STRICTLY CONVEX MANIFOLDS
zðTÞ Πi¼1
Dþ1
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

2πð1 − qÞ 1. General
 
Y⃗ 2 pffiffiffi Here, we evaluate the capacity of strictly convex mani-
× exp − Θ(gS ð qT⃗ þ YÞ
⃗ − κ); ðA16Þ
2ð1 − qÞ folds, starting from the expression for general manifolds,
Eq. (20). In a strictly convex manifold S, any point in
where we have written the fields Hi as the line segment connecting two points x⃗ and y⃗ , x⃗ , y⃗ ∈ S,
other than x⃗ and y⃗ , belongs to the interior of S. Thus,
pffiffiffi the boundary of the manifold does not contain edges or
Hi ¼ qT i þ Y i : ðA17Þ flats with spanning dimension k > 1. Indeed, the only
pffiffiffi possible spanning dimensions for the entire manifold are
Note that qT i represents the quenched random compo- k ¼ 0; 1, and D þ 1. Therefore, there are exactly two
nent due to the randomness in the uμi , and Y i is the “thermal” contributions to the inverse capacity. When t0 obeys
component due to the variability within the solution space. ttouch ð⃗ tÞ > t0 − κ > tfs ð⃗ tÞ, the integrand of Eq. (20) con-
The order parameter q is calculated via 0 ¼ ð∂G0 =∂qÞþ ⃗ − t0 þ κÞ2 =ð1 þ ks̃ðTÞk
tributes ½ð−⃗ t · s̃ðTÞ ⃗ 2 Þ. When t0 <
αð∂G1 =∂qÞ. κ þ tfs ð⃗ tÞ, the manifold is fully embedded. In this case,
Capacity:—In the limit where α → αM ðκÞ, the overlap v⃗ ¼ 0, and the integrand reduces to Eq. (26). In summary,
between the solutions becomes unity and the volume the capacity for convex manifolds can be written as
shrinks to zero. It is convenient to define Q ¼ q=ð1 − qÞ
Z Z κþt ð⃗tÞ
and study the limit of Q → ∞. In this limit, the leading touch ð−⃗ t · s̃ð⃗ t; t0 Þ − t0 þ κÞ2
order is α−1 ðκÞ ¼ D⃗ t Dt0
κþtfs ð⃗ tÞ 1 þ ks̃ð⃗ t; t0 Þk2
Z Z κþt ð⃗tÞ
fs

hlog Zi ¼
Q ⃗ ⃗ ;
½1 − αhFðTÞi ðA18Þ þ D⃗ t Dt0 ½ðt0 − κÞ2 þ k⃗ tk2 ; ðB1Þ
2 T −∞

where ttouch ð⃗ tÞ and tfs ð⃗ tÞ are given by Eqs. (22) and (25),
where the first term is the contribution from G0 → ðQ=2Þ. respectively, and s̃ð⃗ t; t0 Þ ¼ arg min⃗s ð⃗v · ⃗ sÞ, with v⃗ ¼
The second term comes from G1 → −ðQ=2ÞαhFðTÞi ⃗ ⃗ , ⃗ t þ ðv0 − t0 Þs̃.
T
where the average is over the Gaussian distribution of
the (D þ 1)-dimensional vector T, ⃗ and 2. l2 balls
In the case of l2 balls with D and radius R, gð⃗vÞ ¼
⃗ → − 2 log zðTÞ
FðTÞ ⃗ ðA19Þ −Rk⃗vk. Hence, ttouch ð⃗ tÞ ¼ Rk⃗ tk and tfs ð⃗ tÞ ¼ −R−1 k⃗ tk.
Q Thus, Eq. (B1) reduces to the capacity of balls is
Z ∞ Z κþtR
is independent of Q and is given by replacing the integrals ð−t0 þ tR þ κÞ2
α−1 ¼ dtχ ðtÞ Dt0
ð1 þ R2 Þ
Ball D
in Eq. (A16) by their saddle point, which yields 0 κ−tR−1
Z ∞ Z κ−tR−1
⃗ ¼ minfkV⃗ − Tk
⃗ 2 jgS ðVÞ
⃗ − κ ≥ 0g: þ dtχ D ðtÞ Dt0 ½ðt0 − κÞ2 þ t2 ; ðB2Þ
FðTÞ ðA20Þ 0 −∞
V⃗
where
At the capacity, log Z vanishes, and the capacity of a
21− 2 D−1 −1t2
D
general manifold with margin κ is given by
χ D ðtÞ ¼ t e 2 ; t≥0 ðB3Þ
ΓðD2 Þ
α−1 ⃗
M ðκÞ ¼ hFðTÞiT⃗ ðA21Þ
is the D-dimensional Chi probability density function,
reproducing the results of Ref. [15]. Furthermore,
⃗ ¼ minfkV⃗ − Tk
⃗ 2 jgS ðVÞ
⃗ − κ ≥ 0g: ðA22Þ Eq. (B2) can be approximated
FðTÞ
V⃗
pffiffiffiffiby the capacity of points
pffiffiffiffi of R D, i.e., αBall ðκ; RM ; DM Þ ≈
with a margin increase
ð1 þ R2 Þα0 ðκ þ R DÞ (details are given in Ref. [15]).
Finally, we note that the mean squared “annealed”
variability in the fields due to the entropy of solutions
3. l2 ellipsoids
vanishes at the capacity limit, as 1=Q; see Eq. (A16). Thus,
the quantity kV⃗ − Tk
⃗ 2 in the above equation represents the a. Anchor points and support regimes
annealed variability times Q, which remains finite in the With ellipsoids, Eq. (35), the support function in Eq. (15)
limit of Q → ∞. can be computed explicitly as follows. For a vector

031003-22
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)

V⃗ ¼ ð⃗v; v0 Þ, with nonzero v⃗ , the support function gð⃗vÞ is margin solution. In this case, the anchor point is antiparallel
minimized by a vector s⃗ , which occurs on the boundary of to ⃗ t at the interior point, Eq. (24), and its contribution to the
the ellipsoid;P i.e., it obeys the equality constraint fð⃗sÞ ¼ 0 capacity is as in Eq. (26).
2
with ρð⃗sÞ ¼ P i¼1 ðsi =Ri Þ − 1, Eq. (35). To evaluate g, we
D

differentiate − i¼1 si v⃗ i þ ρfð⃗sÞ) with respect to si , where


D
APPENDIX C: LIMIT OF LARGE MANIFOLDS
ρ is a Lagrange multiplier enforcing the constraint, yielding
In the large size limit, ks̃k → ∞, linear separation of
vi R2i manifolds reduces to linear separation of P random
s̃i ¼ ðB4Þ D-dimensional affine subspaces. When separating subspa-
gð⃗vÞ
ces, all of them must be fully embedded in the margin plane;
and otherwise, they would intersect it and violate the classifi-
cation constraints. However, the way large-size manifolds

gð⃗vÞ ¼ −k⃗v∘Rk; ðB5Þ approach this limit is subtle. To analyze this limit, we note
that, when ks̃k is large, gð⃗vÞ > −ks̃kk⃗vk and, from the
where R ⃗ is the D-dimensional vector of the ellipsoid
condition that gð ⃗vÞ≥ð−v0 þκÞ, we have k ⃗vk ≤ ½ð−v0 þ κÞ=
principal radii, and ∘ refers to the pointwise product of ks̃k; i.e., k⃗vk is Oðks̃k−1 Þ. A small k⃗vk implies that the
vectors, ð⃗v∘RÞ ⃗ i ¼ vi Ri . For a given ð⃗ t; t0 Þ, the vector affine basis vectors, except the center direction, are all either
ð⃗v; v0 Þ is determined by Eq. (13), and the analytic solution exactly or almost orthogonal to the solution weight vector.
above can be used to derive explicit expressions for s̃ð⃗ t; t0 Þ Since λs̃ ¼ −⃗ t þ v⃗ ≈ −⃗ t, it follows that s̃ is almost anti-
in the different regimes as follows. parallel to the Gaussian vector ⃗ t, and hence, DM → D; see
Interior regime:—In the interior regime λ ¼ 0, v⃗ ¼ ⃗ t, Eq. (29). To elucidate the manifold support structure, we
resulting in zero contribution to the inverse capacity. The note first that, by Eq. (22), ttouch ∝ −jj⃗ tjjjjs̃jj → −∞; hence,
anchor point is given by the following boundary point on the fractional volume of the interior regime is negligible, and
the ellipse, given by Eq. (B4) with v⃗ ¼ ⃗ t. This regime holds the statistics is dominated by the embedded regimes. In
for t0, obeying the inequality t0 − κ ≥ ttouch ð⃗ tÞ, with fact, the fully embedded transition is given by tembed ≈ −κ
Eq. (22), yielding [see Eq. (25)], so that the fractional volume of the fully
R∞
⃗ embedded regime is Hð−κÞ ¼ −κ Dt0 , and its contribution
ttouch ð⃗ tÞ ¼ k⃗ t ∘ Rk: ðB6Þ R∞
to inverse capacity is, therefore, −κ Dt0 ½hktk ⃗ 2 iþðt0 þκÞ2 ¼
Touching regime:—Here, the anchor point is given by Hð−κÞDþα−1 0 ðκÞ. The remaining summed probability of
Eq. (B4) where v⃗ ¼ ⃗ t þ λs̃. Substituting ti þ λs˜i for vi in the touching and partially embedded regimes (k ≥ 1)
the numerator of that equation, and gð⃗vÞ ¼ κ − v0 ¼ is, therefore, HðκÞ. In these regimes, λ2 ks̃k2 ≈ λ2 ks̃k2 ≈
κ − t0 − λ, yields kR⃗ tk2 , so that this regime contributes a factor of
−κ ⃗ 2
−∞ Dt0 hktk i ¼ HðκÞD. Combining these two contribu-
ti R2i
s̃touch;i ð⃗ t; t0 Þ ¼ − ; ðB7Þ tions, we obtain, for large sizes, α−1 −1
M ¼ D þ α0 ðκÞ, con-
λð1 þ R2i Þ þ ð−t0 þ κÞ sistent with Eq. (8).
where the parameter λ is determined by the ellipsoidal
constraint, APPENDIX D: HIGH-DIMENSIONAL
MANIFOLDS
X
D
t2i R2i
1¼ : ðB8Þ 1. High-dimensional l2 ball
i¼1
½λð1 þ R2i Þ þ ð−t0 þ κÞ2
Before we discuss general manifolds in high dimensions,
In this regime, the contribution to the capacity is given by we focus on the simple case of high-dimensional balls, the
Eqs. (16) and (17), with s̃ in Eq. (B7). capacity of which is given by Eq. (B2). pffiffiffiffiWhen D ≫ 1,
The touching regime holds for ttouch > t0 − κ > tfs, χ D ðtÞ, Eq. (B3) is centered around t ¼ D. Substituting
pffiffiffiffi
where λ → κ − t0 , v⃗ vanishes, and the anchor point is χ D ðtÞ ≈ δðt − D) yields
the point on the boundary of the ellipsoid antiparallel to ⃗ t. pffiffiffiffi
Z pffiffiffi
Substituting this value of λ in Eq. (B8) yields κþR D ðR D þ κ − tÞ20
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi α−1
Ball ðκ; R; DÞ ¼ p ffiffi Dt0
X ti 2 κ− D R2 þ 1
ffiffi
R
tfs ð⃗ tÞ ¼ − : ðB9Þ Z κ−
p
D

i
Ri þ
R
Dt0 ð½t0 − κ2 þ DÞ: ðD1Þ
−∞
Fully supporting regime:—When t0 − κ < tfs , we have pffiffiffiffi
v⃗ ¼ 0, v0 ¼ κ, and λ ¼ t0 − κ, implying that the center, as As long as R ≪ D, the second term in Eq. (D1) vanishes
well as the entire ellipsoid, is fully supporting the max and yields

031003-23
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)

As long as $R \ll \sqrt{D}$, the second term in Eq. (D1) vanishes, yielding

$$\alpha_{\mathrm{Ball}}^{-1}(\kappa, R, D) = \int_{-\infty}^{\kappa + R\sqrt{D}} Dt_0\, \frac{(R\sqrt{D} + \kappa - t_0)^2}{R^2 + 1}. \tag{D2}$$

Here, we note that the $t_0$ term in the numerator is significant only if $R = O(D^{-1/2})$ or smaller, in which case the denominator is just 1. When $R$ is of order 1, the term $t_0$ in the integrand is negligible. Hence, in both cases, we can write

$$\alpha_{\mathrm{Ball}}^{-1}(\kappa, R, D) \approx \alpha_0^{-1}\!\left(\frac{\kappa + R\sqrt{D}}{\sqrt{1 + R^2}}\right). \tag{D3}$$

In this form, the intuition behind the factor $\sqrt{1 + R^2}$ is clear, stemming from the fact that the distance of a point from the margin plane scales with its norm; hence, the margin entering the capacity should factor out this norm.

As stated above, Eq. (D3) implies a finite capacity only in the scaling regime, where $R\sqrt{D} = O(1)$. If, on the other hand, $R\sqrt{D} \gg 1$, Eq. (D2) implies

$$\alpha_{\mathrm{Ball}}^{-1} = \frac{R^2 D}{1 + R^2} \tag{D4}$$

[where we have used the asymptote $\alpha_0^{-1}(x) \to x^2$ for large $x$], reducing to $\alpha^{-1} = D$ for large $R$.

The analysis above also highlights the support structure of the balls at large $D$. As long as $R \ll \sqrt{D}$, the fraction of balls that lie fully in the margin plane is negligible, as implied by the fact that $t_{\mathrm{fs}} \approx -\sqrt{D}/R \to -\infty$. The overall fraction of interior balls is $H(\kappa + R\sqrt{D})$, whereas the fraction that touch the margin planes is $1 - H(\kappa + R\sqrt{D})$. Despite the fact that there are no fully supporting balls, the touching balls are almost parallel to the margin planes if $R \gg 1$; hence, the capacity reaches its lower bound.

Finally, the large manifold limit discussed in Appendix C is realized for $R \gg \sqrt{D}$. Here, $t_{\mathrm{fs}} = -\kappa$, and the system is either touching or fully supporting, with probabilities $H(\kappa)$ and $H(-\kappa)$, respectively.

2. General manifolds

To analyze the limit of high-dimensional general manifolds, we utilize the self-averaging of terms in Eq. (20) involving sums of the $D$ components of $\vec{t}$ and $\tilde{s}$, i.e.,

$$\vec{t} \cdot \tilde{s} \approx \langle\|\tilde{s}\|\rangle \langle\vec{t} \cdot \hat{s}\rangle \approx R_M \sqrt{D_M} = \kappa_M. \tag{D5}$$

Also, $t_{\mathrm{touch}} \approx \vec{t} \cdot \tilde{s} \approx \kappa_M$; hence, we obtain for the capacity

$$\alpha_M^{-1}(\kappa) \approx \frac{\langle[\kappa_M + \kappa - t_0]_+^2\rangle_{t_0}}{R_M^2 + 1} \approx \alpha_{\mathrm{Ball}}^{-1}(\kappa, R_M, D_M), \tag{D6}$$

where the average is with respect to the Gaussian $t_0$. Evaluating $R_M$ and $D_M$ involves calculation of the self-consistent statistics of the anchor points. This calculation is simplified in high dimensions. In particular, $\lambda$, Eq. (17), reduces to

$$\lambda \approx \frac{\langle[\kappa_M + \kappa - t_0]_+\rangle_{t_0}}{1 + R_M^2}. \tag{D7}$$

Hence, it is approximately constant, independent of $\vec{T}$.

In deriving the above approximations, we used self-averaging in summations involving the $D$ intrinsic coordinates. The full dependence on the longitudinal Gaussian $t_0$ should remain; thus, $R_M$ and $D_M$ should, in fact, be substituted by $R_M(t_0)$ and $D_M(t_0)$, denoting Eqs. (28) and (29) with averaging only over $\vec{t}$. This would yield more complicated expressions than Eqs. (D6) and (D7). The reason why we can replace them by the averaged quantities is the following: The potential dependence of the anchor radius and dimension on $t_0$ is via $\lambda(t_0)$. However, inspecting Eq. (D7), we note two scenarios. In the first one, $R_M$ is small and $\kappa_M$ is of order 1. In this case, because the manifold radius is small, the contribution $\lambda\tilde{s}$ is small and can be neglected. This is the same argument why, in this case, the geometry can be replaced by the Gaussian geometry, which does not depend on $t_0$ or $\kappa$. The second scenario is that $R_M$ is of order 1 and $\kappa_M \gg 1$, in which case the order-1 contribution from $t_0$ is negligible.

APPENDIX E: CAPACITY OF l2 BALLS WITH SPARSE LABELS

First, we note that the capacity of sparsely labeled points is $\alpha_0(f, \kappa) = \max_b \alpha_0(f, \kappa, b)$, where

$$\alpha_0^{-1}(f, \kappa, b) = f \int_{-\infty}^{\kappa + b} Dt\, (-t + \kappa + b)^2 + (1 - f) \int_{-\infty}^{\kappa - b} Dt\, (-t + \kappa - b)^2. \tag{E1}$$

Optimizing over $b$ yields the following stationarity equation for the bias:

$$0 = f \int_{-\infty}^{\kappa + b} Dt\, (-t + \kappa + b) - (1 - f) \int_{-\infty}^{\kappa - b} Dt\, (-t + \kappa - b). \tag{E2}$$

In the limit of $f \to 0$, $b \gg 1$. The first equation reduces to

$$\alpha_0^{-1} \approx f b^2 + \exp(-b^2/2), \tag{E3}$$

yielding the optimal $b \approx \sqrt{2|\log f|}$. With this $b$, the inverse capacity is dominated by the first term, which is $\alpha_0^{-1} \approx 2f|\log f|$.

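The optimal bias and the resulting capacity of sparsely labeled points can be obtained by maximizing Eq. (E1) over $b$ directly. A short Python sketch (our own check, not the paper's code; the solver bounds are arbitrary choices):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def alpha0_inv(f, kappa, b):
    """Inverse capacity of sparsely labeled points, Eq. (E1)."""
    phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
    minority, _ = quad(lambda t: phi(t) * (-t + kappa + b)**2, -np.inf, kappa + b)
    majority, _ = quad(lambda t: phi(t) * (-t + kappa - b)**2, -np.inf, kappa - b)
    return f * minority + (1 - f) * majority

def alpha0(f, kappa=0.0):
    """Capacity alpha_0(f, kappa) = max_b alpha_0(f, kappa, b)."""
    res = minimize_scalar(lambda b: alpha0_inv(f, kappa, b),
                          bounds=(0.0, 20.0), method='bounded')
    return 1.0 / res.fun, res.x

for f in [1e-2, 1e-3, 1e-4]:
    cap, b_opt = alpha0(f)
    print(f, b_opt, np.sqrt(2 * abs(np.log(f))), cap, 1.0 / (2 * f * abs(np.log(f))))
    # As f -> 0, b_opt approaches sqrt(2|log f|) and cap approaches [2 f |log f|]^(-1).

Maximizing the capacity is equivalent to minimizing the inverse capacity, which is what the sketch does; the printed columns let one compare the numerical optimum with the asymptotics below Eq. (E3).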
The capacity for l2 balls of radius $R$ with sparse labels of sparsity $f$ [Eqs. (53) and (B2)] is given by

$$\alpha_{\mathrm{Ball}}^{-1} = f \int_0^{\infty} dt\, \chi_D(t) \left[ \int_{\kappa - tR^{-1} + b}^{\kappa + tR + b} Dt_0\, \frac{(-t_0 + tR + b + \kappa)^2}{1 + R^2} + \int_{-\infty}^{b + \kappa - tR^{-1}} Dt_0\, \big[(t_0 - b - \kappa)^2 + t^2\big] \right]$$
$$+\, (1 - f) \int_0^{\infty} dt\, \chi_D(t) \left[ \int_{\kappa - tR^{-1} - b}^{\kappa + tR - b} Dt_0\, \frac{(-t_0 + tR + \kappa - b)^2}{1 + R^2} + \int_{-\infty}^{-b + \kappa - tR^{-1}} Dt_0\, \big[(t_0 - \kappa + b)^2 + t^2\big] \right], \tag{E4}$$

and the optimal bias $b$ is given by $\partial\alpha_{\mathrm{Ball}}/\partial b = 0$. Here, we analyze these equations in various size regimes, assuming $\kappa = 0$.

Small R: For balls with small radius, the capacity is that of points unless the dimensionality is high. Thus, if $D$ is large but $R\sqrt{D} \lesssim 1$,

$$\alpha_{\mathrm{Ball}}^{-1} = f \int_{-\infty}^{R\sqrt{D} + b} Dt_0\, \frac{(-t_0 + R\sqrt{D} + b)^2}{1 + R^2} + (1 - f) \int_{-\infty}^{R\sqrt{D} - b} Dt_0\, \frac{(-t_0 + R\sqrt{D} - b)^2}{1 + R^2}; \tag{E5}$$

that is,

$$\alpha_{\mathrm{Ball}}(f, R, D) \approx \alpha_0(f, \kappa = R\sqrt{D}). \tag{E6}$$

As noted above, when $f \to 0$, the optimal bias diverges; hence, the presence of an order-1 induced margin is noticeable only for moderate $f$, such that $R\sqrt{D} = O(|\log f|)$.

Small f and large R: Here, we analyze the above equations in the limit of small $f$ and large $R$. We assume that $f$ is sufficiently small so that the optimal bias $b$ is large. The contribution of the minority class to the inverse capacity $\alpha_{\mathrm{Ball}}^{-1}$ is dominated by

$$f \int_0^{\infty} dt\, \chi_D(t) \int_{-\infty}^{b} Dt_0\, \big[(-t_0 + b)^2 + t^2\big] \approx f b^2. \tag{E7}$$

The dominant contribution of the majority class to $\alpha_{\mathrm{Ball}}^{-1}$ is

$$(1 - f) \int_0^{\infty} dt\, \chi_D(t) \int_{-b - tR^{-1}}^{-b + tR} Dt_0\, \frac{(-t_0 + tR - b)^2}{1 + R^2} \tag{E8}$$

$$\approx (1 - f) \int_{\bar{b}}^{\infty} dt\, \chi_D(t)\, (t - \bar{b})^2, \tag{E9}$$

where $\bar{b} = b/R$. In deriving the last equation, we used, first, $(-t_0 + tR - b)^2/(1 + R^2) \to (t - \bar{b})^2$ as $R, b \to \infty$. Second, the integrals are of substantial value only if $t \ge \bar{b}$, in which case the integral over $t_0$ is $\int_{-R(\bar{b} - t)}^{b} Dt_0 \approx 1$, and the integral over $t$ runs from $\bar{b}$ to $\infty$. Combining the two results yields the following simple expression for the inverse capacity:

$$\alpha_{\mathrm{Ball}}^{-1} = \bar{f}\bar{b}^2 + \int_{\bar{b}}^{\infty} dt\, \chi_D(t)\, (t - \bar{b})^2, \tag{E10}$$

where the scaled sparsity is $\bar{f} = fR^2$. The optimal scaled bias is given by

$$\bar{f}\bar{b} = \int_{\bar{b}}^{\infty} dt\, \chi_D(t)\, (t - \bar{b}). \tag{E11}$$

Note that $R$ and $f$ affect the capacity only through the scaled sparsity. When $\bar{f} \to 0$, the capacity is proportional to $(\bar{f}|\log\bar{f}|)^{-1}$. In a realistic regime of small $\bar{f}$ (between $10^{-4}$ and 1), the capacity decreases with $\bar{f}$ roughly as $1/\bar{f}$, with a proportionality constant that depends on $D$, as shown in the examples of Fig. 12(a). Finally, when $R$ is sufficiently large that $\bar{f} > 1$, $\bar{b}$ is order 1 or smaller. In this case, the second term in Eq. (E10) dominates and contributes $\langle t^2 \rangle = D$, yielding a capacity that saturates as $D^{-1}$, as shown in Fig. 11(a). It should be noted, however, that when $b$ is not large, the approximations in Eqs. (E10) and (E11) no longer hold.
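The scaled self-consistent equations (E10) and (E11) are easy to solve numerically. Below is a small Python sketch (our own illustration; we take $\chi_D$ to be the density of the norm of a $D$-dimensional standard Gaussian vector, i.e., the chi distribution, which is what Eq. (B3) suggests, and the bracketing strategy is our choice). It solves Eq. (E11) for the optimal $\bar{b}$ and evaluates Eq. (E10), exhibiting the dependence on the scaled sparsity $\bar{f} = fR^2$ alone:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import chi

def solve_bbar(fbar, D):
    """Solve Eq. (E11): fbar * bbar = int_bbar^inf dt chi_D(t) (t - bbar)."""
    rhs = lambda bb: quad(lambda t: chi(D).pdf(t) * (t - bb), bb, np.inf)[0]
    g = lambda bb: fbar * bb - rhs(bb)
    hi = 1.0
    while g(hi) < 0:      # g(0+) < 0 since rhs(0) = <t> > 0; g grows with bbar
        hi *= 2.0
    return brentq(g, 1e-12, hi)

def alpha_ball_inv_sparse(fbar, D):
    """Inverse capacity from Eq. (E10) at the optimal scaled bias."""
    bb = solve_bbar(fbar, D)
    tail = quad(lambda t: chi(D).pdf(t) * (t - bb)**2, bb, np.inf)[0]
    return fbar * bb**2 + tail

for fbar in [1e-3, 1e-2, 1e-1]:
    print(fbar, 1.0 / alpha_ball_inv_sparse(fbar, D=20))
# The capacity grows roughly as 1/(fbar |log fbar|) as fbar -> 0,
# and saturates near 1/D once fbar exceeds 1.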
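As a consistency check on the large-$R$, small-$f$ approximation, one can also evaluate the full expression (E4) at $\kappa = 0$ by nested quadrature and compare it with the scaled result of Eq. (E10). The sketch below is again our own illustrative reconstruction of Eq. (E4), not the authors' simulation code, and it reuses solve_bbar from the previous sketch; it is slow but straightforward:

import numpy as np
from scipy.integrate import quad
from scipy.stats import chi

def alpha_inv_full(f, R, D, b):
    """Inverse capacity from the full expression (E4), with kappa = 0."""
    phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    def per_t(t):
        def one_class(sign):            # sign = +1 minority (+b), -1 majority (-b)
            bb = sign * b
            touching = quad(lambda t0: phi(t0) * (-t0 + t * R + bb)**2 / (1 + R**2),
                            -t / R + bb, t * R + bb)[0]
            supporting = quad(lambda t0: phi(t0) * ((t0 - bb)**2 + t**2),
                              -np.inf, bb - t / R)[0]
            return touching + supporting
        return f * one_class(+1) + (1 - f) * one_class(-1)

    return quad(lambda t: chi(D).pdf(t) * per_t(t), 0, np.inf, limit=200)[0]

# Example: f = 1e-3 and R = 10 give fbar = 0.1; with b = R * solve_bbar(0.1, 20),
# alpha_inv_full(1e-3, 10, 20, b) should approach the scaled value
# alpha_ball_inv_sparse(0.1, 20), illustrating that f and R enter only
# through the scaled sparsity fbar = f * R**2 in this regime.

The agreement degrades when $b$ is not large, in line with the caveat following Eq. (E11).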