Perceptual manifolds arise when a neural population responds to an ensemble of sensory signals
associated with different physical features (e.g., orientation, pose, scale, location, and intensity) of the same
perceptual object. Object recognition and discrimination require classifying the manifolds in a manner that
is insensitive to variability within a manifold. How neuronal systems give rise to invariant object
classification and recognition is a fundamental problem in brain theory as well as in machine learning.
Here, we study the ability of a readout network to classify objects from their perceptual manifold
representations. We develop a statistical mechanical theory for the linear classification of manifolds with
arbitrary geometry, revealing a remarkable relation to the mathematics of conic decomposition. We show
how special anchor points on the manifolds can be used to define novel geometrical measures of radius and
dimension, which can explain the classification capacity for manifolds of various geometries. The general
theory is demonstrated on a number of representative manifolds, including ℓ2 ellipsoids prototypical of
strictly convex manifolds, ℓ1 balls representing polytopes with finite samples, and ring manifolds
exhibiting nonconvex continuous structures that arise from modulating a continuous degree of freedom.
The effects of label sparsity on the classification capacity of general manifolds are elucidated, displaying a
universal scaling relation between label sparsity and the manifold radius. Theoretical predictions are
corroborated by numerical simulations using recently developed algorithms to compute maximum margin
solutions for manifold dichotomies. Our theory and its extensions provide a powerful and rich framework
for applying statistical mechanics of linear classification to data arising from perceptual neuronal responses
as well as to artificial deep networks trained for object recognition tasks.
DOI: 10.1103/PhysRevX.8.031003 Subject Areas: Biological Physics, Complex Systems,
Statistical Physics
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)
In particular, we define the manifold anchor radius and dimension, R_M and D_M, respectively. These quantities are relevant since the capacity of general manifolds can be well approximated by the capacity of ℓ2 balls with radii R_M and dimensions D_M. Interestingly, we show that, in the limit of small manifolds, the anchor point statistics are dominated by points on the boundary of the manifolds which have minimal overlap with Gaussian random vectors. The resultant Gaussian radius R_g and dimension D_g are related to the well-known Gaussian mean width of convex bodies (Sec. IV). Beyond understanding fundamental limits for classification capacity, these geometric measures offer new quantitative tools for assessing how perceptual manifolds are reformatted in brain and artificial systems.

(iv) We apply the general theory to three examples, representing distinct prototypical manifold classes. One class consists of manifolds with strictly smooth convex hulls, which do not contain facets and are exemplified by ℓ2 ellipsoids. Another class is that of convex polytopes, which arise when the manifolds consist of a finite number of data points, and are exemplified by ℓ1 ellipsoids. Finally, ring manifolds represent an intermediate class: smooth but nonconvex manifolds. Ring manifolds are continuous nonlinear functions of a single intrinsic variable, such as object orientation angle. The differences between these manifold types show up most clearly in the distinct patterns of the support dimensions. However, as we show, they share common trends. As the size of the manifold increases, the capacity and geometrical measures vary smoothly, exhibiting a smooth crossover from a small radius and dimension with high capacity to a large radius and dimension with low capacity. This crossover occurs as R_g ∝ 1/√(D_g). Importantly, for many realistic cases, when the size is smaller than the crossover value, the manifold dimensionality is substantially smaller than that computed from naive second-order statistics, highlighting the saliency and significance of our measures for the anchor geometry.

(v) Finally, we treat the important case of the classification of manifolds with imbalanced (sparse) labels, which commonly arise in problems of object recognition. It is well known that, for highly sparse labels, the classification capacity of random points increases dramatically as (f |log f|)^{-1}, where f ≪ 1 is the fraction of the minority labels. Our analysis of sparsely labeled manifolds highlights the interplay between manifold size and sparsity. In particular, it shows that sparsity enhances the capacity only when f R_g² ≪ 1, where R_g is the (Gaussian) manifold radius. Notably, for a large regime of parameters, sparsely labeled manifolds can approximately be described by a universal capacity function equivalent to sparsely labeled ℓ2 balls with radii R_g and dimensions D_g, as demonstrated by our numerical evaluations (Sec. VI). Conversely, when f R_g² ≫ 1, the capacity is low and close to 1/D, where D is the dimensionality of the manifold affine subspace, even for extremely small f.

(vi) Our theory provides, for the first time, quantitative and qualitative predictions for the perceptron classification of realistic data structures. However, application to real data may require further extensions of the theory, which are discussed in Sec. VII. Together, the theory makes an important contribution to the development of statistical mechanical theories of neural information processing in realistic conditions.

II. MODEL OF MANIFOLDS

Manifolds in affine subspaces:—We model a set of P perceptual manifolds corresponding to P perceptual objects. Each manifold M^μ for μ = 1, …, P consists of a compact subset of an affine subspace of R^N with affine dimension D with D < N. A point on the manifold x^μ ∈ M^μ can be parametrized as

    x^μ(S⃗) = Σ_{i=1}^{D+1} S_i u_i^μ,    (1)

where the u_i^μ are a set of orthonormal bases of the (D+1)-dimensional linear subspace containing M^μ. The D+1 components S_i represent the coordinates of the manifold point within this subspace and are constrained to be in the set S⃗ ∈ S. The bold notation for x^μ and u_i^μ indicates that they are vectors in R^N, whereas the arrow notation for S⃗ indicates that it is a vector in R^{D+1}. The set S defines the shape of the manifolds and encapsulates the affine constraint. For simplicity, we will first assume that the manifolds have the same geometry so that the coordinate set S is the same for all the manifolds; extensions that consider heterogeneous geometries are provided in Sec. VI A.

We study the separability of P manifolds into two classes, denoted by binary labels y^μ = ±1, by a linear hyperplane passing through the origin. A hyperplane is described by a weight vector w ∈ R^N, normalized so ‖w‖² = N, and the hyperplane correctly separates the manifolds with a margin κ ≥ 0 if it satisfies

    y^μ w · x^μ ≥ κ    (2)

for all μ and x^μ ∈ M^μ. Since linear separability is a convex problem, separating the manifolds is equivalent to separating the convex hulls, conv(M^μ) = { x^μ(S⃗) | S⃗ ∈ conv(S) }, where

    conv(S) = { Σ_{i=1}^{D+1} α_i S⃗_i | S⃗_i ∈ S, α_i ≥ 0, Σ_{i=1}^{D+1} α_i = 1 }.    (3)
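The parametrization of Eqs. (1)–(2) is easy to exercise numerically. The sketch below (pure Python with hypothetical sizes, not the authors' code) builds a Gaussian basis u_i for one manifold, forms points x(S⃗) = Σ_i S_i u_i, and evaluates the worst-case signed field min y w·x over a finite sample of coordinates, which is the quantity that Eq. (2) bounds from below by κ:

```python
import random

def make_manifold_basis(N, D, rng):
    # D+1 basis vectors u_i with i.i.d. Gaussian entries of variance 1/N,
    # so they are orthonormal in the large-N limit (Sec. III).
    return [[rng.gauss(0.0, 1.0 / N ** 0.5) for _ in range(N)]
            for _ in range(D + 1)]

def manifold_point(S, U):
    # Eq. (1): x(S) = sum_i S_i u_i -- linear in the coordinates S.
    return [sum(S[i] * U[i][k] for i in range(len(S)))
            for k in range(len(U[0]))]

def worst_case_margin(w, y, S_samples, U):
    # Eq. (2) demands y * (w . x) >= kappa for EVERY point of the manifold,
    # so the relevant quantity is the minimum signed field over the sample.
    margins = []
    for S in S_samples:
        x = manifold_point(S, U)
        margins.append(y * sum(wk * xk for wk, xk in zip(w, x)))
    return min(margins)
```

Because Eq. (1) is linear in S⃗, the minimum of y w·x over conv(S) is attained on S itself, which is the convex-hull equivalence stated around Eq. (3).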
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)
These bounds highlight the fact that, in the large N limit, the maximal number of separable finite-dimensional manifolds is proportional to N, even though each consists of an infinite number of points. This sets the stage for a statistical mechanical evaluation of the maximal α = P/N, where P is the number of manifolds, and is described in the following section.

III. STATISTICAL MECHANICAL THEORY

In order to make theoretical progress beyond the bounds above, we need to make additional statistical assumptions about the manifold spaces and labels. Specifically, we will assume that the individual components of u_i^μ are drawn independently and from identical Gaussian distributions with zero mean and variance 1/N, and that the binary labels y^μ = ±1 are randomly assigned to each manifold with equal probabilities. We will study the thermodynamic limit where N, P → ∞, but with a finite load α = P/N. In addition, the manifold geometries as specified by the set S in R^{D+1} and, in particular, their affine dimension D are held fixed in the thermodynamic limit. Under these assumptions, the bounds in Eq. (7) can be extended to the linear separability of general manifolds with finite margin κ, and characterized by the reciprocal of the critical load ratio α_M^{-1}(κ),

    α_0^{-1}(κ) ≤ α_M^{-1}(κ) ≤ α_0^{-1}(κ) + D,    (8)

where α_0(κ) is the maximum load for separation of random independent and identically distributed (i.i.d.) points, with a margin κ given by the Gardner theory [12],

    α_0^{-1}(κ) = ∫_{-∞}^{κ} Dt (t − κ)²,    (9)

with Gaussian measure Dt = (1/√(2π)) e^{−t²/2} dt. For many interesting cases, the affine dimension D is large, and the gap in Eq. (8) is overly loose. Hence, it is important to derive an estimate of the capacity for manifolds with finite sizes and evaluate the dependence of the capacity and the nature of the solution on the geometrical properties of the manifolds, as shown below.

A. Mean-field theory of manifold separation capacity

Following Gardner's framework [12,13], we compute the statistical average of log Z, where Z is the volume of the space of the solutions, which in our case can be written as

    Z = ∫ d^N w δ(‖w‖² − N) Π_{μ, x^μ ∈ M^μ} Θ(y^μ w · x^μ − κ).    (10)

Θ(·) is the Heaviside function to enforce the margin constraints in Eq. (2), along with the delta function to ensure ‖w‖² = N. In the following, we focus on the properties of the maximum margin solution, namely, the solution for the largest load α_M for a fixed margin κ, or equivalently, the solution when the margin κ is maximized for a given α_M.

As shown in Appendix A, we prove that the general form of the inverse capacity, exact in the thermodynamic limit, is

    α_M^{-1}(κ) = ⟨F(T⃗)⟩_T⃗,    (11)

where F(T⃗) = min_{V⃗} { ‖V⃗ − T⃗‖² | V⃗ · S⃗ − κ ≥ 0, ∀ S⃗ ∈ S } and ⟨…⟩_T⃗ is an average over random (D+1)-dimensional vectors T⃗, whose components are i.i.d. normally distributed, T_i ∼ N(0, 1). The components of the vector V⃗ represent the signed fields induced by the solution vector w on the D+1 basis vectors of the manifold. The Gaussian vector T⃗ represents the part of the variability in V⃗ due to quenched variability in the manifold's basis vectors and the labels, as will be explained in detail below.

The inequality constraints in F can be written equivalently as a constraint on the point on the manifold with minimal projection on V⃗. We therefore consider the concave support function of S, g_S(V⃗) = min_{S⃗} { V⃗ · S⃗ | S⃗ ∈ S }, which can be used to write the constraint for F(T⃗) as

    F(T⃗) = min_{V⃗} { ‖V⃗ − T⃗‖² | g_S(V⃗) − κ ≥ 0 }.    (12)

Note that this definition of g_S(V⃗) is easily mapped to the conventional convex support function defined via the max operation [17].

KKT conditions:—To gain insight into the nature of the maximum margin solution, it is useful to consider the KKT conditions of the convex optimization in Eq. (12) [17]. For each T⃗, the KKT conditions that characterize the unique solution of V⃗ for F(T⃗) are given by

    V⃗ = T⃗ + λ S̃(T⃗),    (13)

where

    λ ≥ 0,
    g_S(V⃗) − κ ≥ 0,
    λ [g_S(V⃗) − κ] = 0.    (14)

The vector S̃(T⃗) is a subgradient of the support function at V⃗, S̃(T⃗) ∈ ∂g_S(V⃗) [17]; i.e., it is a point on the convex hull of S with minimal overlap with V⃗. When the support function is differentiable, the subgradient ∂g_S(V⃗) is unique and is equivalent to the gradient of the support function,

    S̃(T⃗) = ∇g_S(V⃗) = arg min_{S⃗ ∈ S} V⃗ · S⃗.    (15)
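For a manifold S sampled by finitely many coordinate vectors S⃗_j (for a polytope the vertices suffice, since g_S is a minimum of linear functions), the optimization in Eq. (12) is a projection onto an intersection of half-spaces, which Dykstra's alternating-projection algorithm solves. A minimal pure-Python sketch (not the authors' cutting-plane QSIP solver described later):

```python
def project_halfspace(v, s, kappa):
    # Euclidean projection of v onto the half-space {u : u . s >= kappa}.
    dot = sum(vi * si for vi, si in zip(v, s))
    if dot >= kappa:
        return list(v)
    c = (kappa - dot) / sum(si * si for si in s)
    return [vi + c * si for vi, si in zip(v, s)]

def capacity_F(T, S_points, kappa, sweeps=200):
    # Eq. (12) with S sampled by S_points: F(T) = min ||V - T||^2 subject to
    # V . S_j >= kappa for all j. Dykstra's algorithm converges to the
    # Euclidean projection of T onto the intersection of the half-spaces.
    V = list(T)
    incr = [[0.0] * len(T) for _ in S_points]
    for _ in range(sweeps):
        for j, s in enumerate(S_points):
            z = [vi + ei for vi, ei in zip(V, incr[j])]
            V = project_halfspace(z, s, kappa)
            incr[j] = [zi - vi for zi, vi in zip(z, V)]
    return sum((vi - ti) ** 2 for vi, ti in zip(V, T))
```

When only a single constraint is active, this reproduces the closed form F = [κ − T⃗·S̃]_+² / ‖S̃‖² implied by Eqs. (16)–(17) below.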
Since the support function is positively homogeneous, g_S(γV⃗) = γ g_S(V⃗) for all γ > 0; thus, ∇g_S(V⃗) depends only on the unit vector V̂. For values of V⃗ such that g_S(V⃗) is not differentiable, the subgradient is not unique, but S̃(T⃗) is defined uniquely as the particular subgradient that obeys the KKT conditions, Eqs. (13)–(14). In the latter case, S̃(T⃗) ∈ conv(S) may not reside in S itself.

From Eq. (13), we see that the capacity can be written in terms of S̃(T⃗) as

    F(T⃗) = ‖λ S̃(T⃗)‖².    (16)

The scale factor λ is either zero or positive, corresponding to whether g_S(V⃗) − κ is positive or zero. If g_S(V⃗) − κ is positive, then λ = 0, meaning that V⃗ = T⃗ and T⃗ · S̃(T⃗) − κ > 0. If T⃗ · S̃(T⃗) − κ < 0, then λ > 0 and V⃗ ≠ T⃗. In this case, multiplying Eq. (13) with S̃(T⃗) yields λ‖S̃(T⃗)‖² = −T⃗ · S̃(T⃗) + κ. Thus, λ obeys the self-consistent equation,

    λ = [−T⃗ · S̃(T⃗) + κ]_+ / ‖S̃(T⃗)‖²,    (17)

where the function [x]_+ = max(x, 0).

B. Mean-field interpretation of the KKT relations

The KKT relations have a nice interpretation within the framework of the mean-field theory. The maximum margin solution vector can always be written as a linear combination of a set of support vectors. Although there are infinite numbers of input points in each manifold, the solution vector can be decomposed into P vectors, one per manifold,

    w = Σ_{μ=1}^{P} λ_μ y^μ x̃^μ,  λ_μ ≥ 0,    (18)

where x̃^μ ∈ conv(M^μ) is a vector in the convex hull of the μth manifold. In the large N limit, these vectors are uncorrelated with each other. Hence, squaring this equation, ignoring correlations between the different contributions (using the "cavity" method, one can show that the λ's appearing in the mean-field KKT equations and the λ's appearing in Eq. (18) differ by an overall global scaling factor which accounts for the presence of correlations between the individual terms in Eq. (18), which we have neglected here for brevity; see [18,19]), yields ‖w‖² = N = Σ_{μ=1}^{P} λ_μ² ‖S̃^μ‖², where S̃^μ are the coordinates of x̃^μ in the affine subspace of the μth manifold; see Eq. (1). From this equation, it follows that α^{-1} = N/P = ⟨λ²‖S̃‖²⟩, which yields the KKT expression for the capacity; see Eqs. (11) and (16).

The KKT relations above are self-consistent equations for the statistics of λ_μ and S̃^μ. The mean-field theory derives the appropriate statistics from self-consistent equations of the fields on a single manifold. To see this, consider projecting the solution vector w onto the affine subspace of one of the manifolds, say, μ = 1. We define a (D+1)-dimensional vector V⃗^1 as V_i^1 = y^1 w · u_i^1, i = 1, …, D+1, which are the signed fields of the solution on the affine basis vectors of the μ = 1 manifold. Then, Eq. (18) reduces to V⃗^1 = λ_1 S̃^1 + T⃗, where T⃗ represents the contribution from all the other manifolds, and since their subspaces are randomly oriented, this contribution is well described as a random Gaussian vector. Finally, self-consistency requires that, for fixed T⃗, S̃^1 is such that it has a minimal overlap with V⃗^1 and represents a point residing on the margin hyperplane; otherwise, it will not contribute to the max margin solution. Thus, Eq. (13) is just the decomposition of the field induced on a specific manifold into the contribution induced by that specific manifold along with the contributions coming from all the other manifolds. The self-consistent Eqs. (14) and (17) relating λ to the Gaussian statistics of T⃗ then naturally follow from the requirement that S̃^1 represents a support vector.

C. Anchor points and manifold supports

The vectors x̃^μ contributing to the solution, Eq. (18), play a key role in the theory. We will denote them or, equivalently, their affine subspace components S̃^μ as the "manifold anchor points." For a particular configuration of manifolds, the manifolds could be replaced by an equivalent set of P anchor points to yield the same maximum margin solution. It is important to stress, however, that an individual anchor point is determined not only by the configuration of its associated manifold, but also by the random orientations of all the other manifolds. For a fixed manifold, the location of its anchor point will vary with the relative configurations of the other manifolds. This variation is captured in mean-field theory by the dependence of the anchor point S̃ on the random Gaussian vector T⃗.

In particular, the position of the anchor point in the convex hull of its manifold reflects the nature of the relation between that manifold and the margin planes. In general, a fraction of the manifolds will intersect with the margin hyperplanes; i.e., they have nonzero λ. These manifolds are the support manifolds of the system. The nature of this support varies and can be characterized by the dimension k of the span of the intersecting set of conv(S) with the margin planes. Some support manifolds, which we call "touching" manifolds, intersect with the margin hyperplane only with their anchor point. They have a support dimension k = 1, and their anchor point S̃ is on the boundary of S. The other extreme is "fully supporting" manifolds, which completely reside in the margin hyperplane. They are characterized by k = D + 1. In this case, V⃗ is parallel to the translation vector c⃗ of S. Hence, all the points in S are
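The KKT structure of Eqs. (13)–(17) can be verified directly in the simplest case of a one-point "manifold" S = {S⃗_0}, where the anchor is necessarily S̃ = S⃗_0 and the projection of Eq. (12) has a closed form. An illustrative sketch under that assumption (not the general solver):

```python
def solve_single_point(T, S0, kappa):
    # One-point manifold S = {S0}: g_S(V) = V . S0, and Eq. (12) is a
    # projection onto a single half-space. Eq. (17) gives
    # lambda = [kappa - T . S0]_+ / ||S0||^2, and Eq. (13) gives V.
    dot = sum(t * s for t, s in zip(T, S0))
    lam = max(kappa - dot, 0.0) / sum(s * s for s in S0)
    V = [t + lam * s for t, s in zip(T, S0)]
    return V, lam
```

In the active case (λ > 0), the solution satisfies g_S(V⃗) = κ exactly (complementary slackness, Eq. (14)) and F = ‖V⃗ − T⃗‖² = λ²‖S⃗_0‖², matching Eq. (16).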
support vectors, and all have the same overlap κ with V⃗. The anchor point, in this case, is the unique point in the interior of conv(S) that obeys the self-consistent equation, Eq. (13), namely, that it balances the contribution from the other manifolds to zero out the components of V⃗ that are orthogonal to c⃗. In the case of smooth convex hulls (i.e., when S is strongly convex), no other manifold support configurations exist. For other types of manifolds, there are also "partially supporting" manifolds, whose convex hull intersections with the margin hyperplanes consist of k-dimensional faces with 1 < k < D + 1. The associated anchor points then reside inside the intersecting face. For instance, k = 2 implies that S̃ lies on an edge, whereas k = 3 implies that S̃ lies on a planar two-face of the convex hull. Determining the dimension of the support structure that arises for various T⃗ is explained below.

D. Conic decomposition

The KKT conditions can also be interpreted in terms of the conic decomposition of T⃗, which generalizes the notion of the decomposition of vectors onto linear subspaces and their null spaces via Euclidean projection. The convex cone of the manifold S is defined as cone(S) = { Σ_{i=1}^{D+1} α_i S⃗_i | S⃗_i ∈ S, α_i ≥ 0 }; see Fig. 3. The shifted polar cone of S, denoted S°_κ, is defined as the convex set of points given by

    S°_κ = { U⃗ ∈ R^{D+1} | U⃗ · S⃗ + κ ≤ 0, ∀ S⃗ ∈ S }    (19)

and is illustrated for κ = 0 and κ > 0 in Fig. 3. For κ = 0, Eq. (19) is simply the conventional polar cone of S [20]. Equation (13) can then be interpreted as the decomposition of −T⃗ into the sum of two component vectors: One component is −V⃗, i.e., its Euclidean projection onto S°_κ; the other component, λS̃(T⃗), is located in cone(S). When κ = 0, the Moreau decomposition theorem states that the two components are perpendicular: V⃗ · (λS̃(T⃗)) = 0 [21]. For nonzero κ, the two components need not be perpendicular but obey V⃗ · (λS̃(T⃗)) = κλ.

The position of vector T⃗ in relation to the cones, cone(S) and S°_κ, gives rise to qualitatively different expressions for F(T⃗) and contributions to the solution weight vector and inverse capacity. These correspond to the different support dimensions mentioned above. In particular, when −T⃗ lies inside S°_κ, T⃗ = V⃗ and λ = 0, so the support dimension k = 0. On the other hand, when −T⃗ lies inside cone(S), V⃗ = κc⃗, and the manifold is fully supporting, k = D + 1.

E. Numerical solution of the mean-field equations

The solution of the mean-field equations consists of two stages. First, S̃ is computed for a particular T⃗, and then the relevant contributions to the inverse capacity are averaged over the Gaussian distribution of T⃗. For simple geometries such as ℓ2 ellipsoids, the first step may be solved analytically. However, for more complicated geometries, both steps need to be performed numerically. The first step involves determining V⃗ and S̃ for a given T⃗ by solving the quadratic semi-infinite programming problem (QSIP), Eq. (12), over the manifold S, which may contain infinitely many points. A novel "cutting plane" method has been developed to efficiently solve the QSIP problem; see SM [16] (Sec. S4). Expectations are computed by sampling the Gaussian T⃗ in D + 1 dimensions and taking the appropriate averages, similar to procedures for other mean-field methods. The relevant quantities corresponding to the capacity are quite concentrated and converge quickly with relatively few samples.

In the following sections, we will also show how the mean-field theory compares with computer simulations that numerically solve for the maximum margin solution of realizations of P manifolds in R^N, given by Eq. (2), for a variety of manifold geometries. Finding the maximum margin solution is challenging, as standard methods to solve SVM problems are limited to a finite number of input points. We have recently developed an efficient algorithm for finding the maximum margin solution in manifold classification and have used this method in the present work [see Ref. [22] and SM [16] (Sec. S5)].
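As an end-to-end illustration of the two-stage procedure (a deterministic solve for each T⃗, then Gaussian averaging), the sketch below treats a hypothetical D = 1 segment manifold S = {(s, 1) : |s| ≤ r}, whose two vertex constraints make Eq. (12) an exact projection onto the intersection of two half-spaces in the plane, and estimates α_M^{-1}(κ) = ⟨F(T⃗)⟩ by Monte Carlo. It is a pure-Python sketch, not the cutting-plane solver of SM Sec. S4; as r → 0 the estimate approaches the Gardner value α_0^{-1}(0) = 1/2.

```python
import random

def F_segment(t1, t0, r, kappa):
    # D = 1 segment manifold: S = {(s, 1) : |s| <= r}. Since g_S is a
    # minimum of linear functions, the constraints of Eq. (12) at the two
    # vertices (+-r, 1) suffice, and F(T) is an exact projection onto the
    # intersection of two half-spaces.
    verts = [(r, 1.0), (-r, 1.0)]
    def feasible(v):
        return all(v[0] * s + v[1] * s0 >= kappa - 1e-9 for s, s0 in verts)
    if feasible((t1, t0)):
        return 0.0                      # interior regime: V = T, lambda = 0
    cands = []
    for s, s0 in verts:                 # projections onto each boundary plane
        dot = t1 * s + t0 * s0
        c = (kappa - dot) / (s * s + s0 * s0)
        cands.append((t1 + c * s, t0 + c * s0))
    cands.append((0.0, kappa))          # corner: both constraints active
    return min((v1 - t1) ** 2 + (v0 - t0) ** 2
               for v1, v0 in cands if feasible((v1, v0)))

def inverse_capacity(r, kappa, n_samples=20000, seed=0):
    # Eq. (11): alpha_M^{-1}(kappa) = <F(T)> over the Gaussian T = (t1, t0).
    rng = random.Random(seed)
    total = sum(F_segment(rng.gauss(0, 1), rng.gauss(0, 1), r, kappa)
                for _ in range(n_samples))
    return total / n_samples
```

Increasing r increases the inverse capacity, consistent with the size dependence analyzed in Sec. IV.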
C. Effects of size and margin

We now discuss the effect of changing manifold size and the imposed margin on both capacity and geometry. As described by Eq. (4), a change in the manifold size corresponds to scaling every s̃ by a scalar r.

Small size:—If r → 0, the manifold shrinks to a point (the center), whose capacity reduces to that of isolated points. However, in the case where D ≫ 1, the capacity may be affected by the manifold structure even if r ≪ 1; see Sec. IV F. Nevertheless, the underlying support structure is simple. When r is small, the manifolds have only two support configurations. For t_0 − κ > 0, the manifold is interior with λ = 0, v_0 = t_0, and v⃗ = t⃗. When t_0 − κ < 0, the manifold becomes touching with support dimension k = 1. In that case, because s̃ has a small magnitude, v⃗ ≈ t⃗, λ ≈ κ − t_0, and v_0 ≈ κ. Thus, in both cases, v⃗ is close to the Gaussian vector t⃗. The probability of other configurations vanishes.

Large size:—In the large size limit r → ∞, separating the manifolds is equivalent to separating the affine subspaces. As we show in Appendix C, when r ≫ 1, there are two main support structures. With probability H(−κ) = ∫_{−κ}^{∞} Dt_0, the manifolds are fully supporting; namely, the underlying affine subspaces are parallel to the margin plane. This is the first regime corresponding to fully supporting manifolds, which contributes to the inverse capacity an amount H(−κ)D + α_0^{-1}(κ). The second regime corresponds to touching or partially supporting manifolds, such that the angle between the affine subspace and the margin plane is almost zero, and contributes an amount H(κ)D to the inverse capacity. Combining the two contributions, we obtain, for large sizes, α_M^{-1} = D + α_0^{-1}(κ), consistent with Eq. (8).

Large margin:—For a fixed T⃗, Eq. (25) implies that larger κ increases the probability of being in the supporting regimes. Increasing κ also shrinks the magnitude of s̃ according to Eq. (24). Hence, when κ ≫ 1, the capacity becomes similar to that of P random points and the corresponding capacity is given by α_M ≈ κ^{−2}, independent of manifold geometry.

Manifold centers:—The theory of manifold classification described in Sec. III does not require the notion of a manifold center. However, once we understand how scaling the manifold sizes by a parameter r in Eq. (4) affects their capacity, the center points about which the manifolds are scaled need to be defined. For many geometries, the center is a point of symmetry such as for an ellipsoid. For general manifolds, a natural definition would be the center of mass of the anchor points S̃(T⃗), averaging over the Gaussian measure of T⃗. We will adopt here a simpler definition for the center provided by the Steiner point for convex bodies [23], S⃗_0 = (s⃗_0, 1), with

    s⃗_0 = ⟨∇g_S(t⃗)⟩_t⃗,    (27)

and the expectation is over the Gaussian measure of t⃗ ∈ R^D. This definition coincides with the center of mass of the anchor points when the manifold size is small. Furthermore, it is natural to define the geometric properties of manifolds in terms of centered manifolds where the manifolds are shifted within their affine subspace so that the center and orthogonal translation vector coincide, i.e., s⃗_0 = c⃗ with ‖s⃗_0‖ = ‖c⃗‖ = 1. This means that all lengths are defined relative to the distance of the centers from the origin and the D-dimensional intrinsic vectors s⃗ give the offset relative to the manifold center.

D. Manifold anchor geometry

The capacity equation, Eq. (20), motivates defining geometrical measures of the manifolds, which we call "manifold anchor geometry." Manifold anchor geometry is based on the statistics of the anchor points s̃(t⃗, t_0) induced by the Gaussian random vector (t⃗, t_0), which are relevant to the capacity in Eq. (20). These statistics are sufficient for determining the classification properties and supporting structures associated with the maximum margin solution. We accordingly define the manifold anchor radius and dimension as follows.

Manifold anchor radius:—Denoted R_M, it is defined by the mean squared length of s̃(t⃗, t_0),

    R_M² = ⟨‖s̃(t⃗, t_0)‖²⟩_{t⃗, t_0}.    (28)

Manifold anchor dimension:—D_M is given by

    D_M = ⟨( t⃗ · ŝ(t⃗, t_0) )²⟩_{t⃗, t_0},    (29)

where ŝ is a unit vector in the direction of s̃. The anchor dimension measures the angular spread between t⃗ and the corresponding anchor point s̃ in D dimensions. Note that the manifold dimension obeys D_M ≤ ⟨‖t⃗‖²⟩ = D. Whenever there is no ambiguity, we will call R_M and D_M the manifold radius and dimension, respectively.

These geometric descriptors offer a rich description of the manifold properties relevant for classification. Since v⃗ and s̃ depend in general on t_0 − κ, the above quantities are averaged not only over t⃗ but also over t_0. For the same reason, the manifold anchor geometry also depends upon the imposed margin.

E. Gaussian geometry

We have seen that, for small manifold sizes, v⃗ ≈ t⃗, and the anchor points s̃ can be approximated by s̃(t⃗) = ∇g_S(t̂). Under these conditions, the geometry simplifies as shown in Fig. 5. For each Gaussian vector t⃗ ∈ R^D, s̃(t⃗) is the point on the manifold that first touches a hyperplane normal to −t⃗ as it is translated from infinity towards the manifold. Aside from a set of measure zero, this touching point is a unique point on the boundary of conv(S). This procedure is similar to that used to define the well-known Gaussian mean width, which in our notation equals w(S) = −2⟨g_S(t⃗)⟩_t⃗ [24].
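When the touching point is known in closed form, the Gaussian geometry of this section can be estimated by direct sampling. For an ℓ2 ellipsoid { s⃗ : Σ_i (s_i/R_i)² ≤ 1 }, the point minimizing t⃗·s⃗ is s̃_i = −R_i² t_i / √(Σ_j R_j² t_j²), giving the sketch below (pure Python; sample sizes are arbitrary). For a sphere, R_i = R, it reproduces R_g = R and D_g = D:

```python
import math, random

def ellipsoid_touch_point(t, radii):
    # Small-size (Gaussian) anchor: the point of {s : sum_i (s_i/R_i)^2 <= 1}
    # minimizing t . s, i.e., the first point touched by a hyperplane with
    # normal -t translated in from infinity (Sec. IV E).
    g = math.sqrt(sum((R * ti) ** 2 for R, ti in zip(radii, t)))
    return [-(R ** 2) * ti / g for R, ti in zip(radii, t)]

def gaussian_radius_dimension(radii, n_samples=20000, seed=0):
    # Eq. (28): R_g^2 = <||s~||^2>;  Eq. (29): D_g = <(t . s^)^2>.
    rng = random.Random(seed)
    D = len(radii)
    r2_sum = d_sum = 0.0
    for _ in range(n_samples):
        t = [rng.gauss(0.0, 1.0) for _ in range(D)]
        s = ellipsoid_touch_point(t, radii)
        norm2 = sum(si * si for si in s)
        r2_sum += norm2
        dot = sum(ti * si for ti, si in zip(t, s))
        d_sum += dot * dot / norm2
    return math.sqrt(r2_sum / n_samples), d_sum / n_samples
```

For anisotropic radii the estimated D_g falls below the affine dimension D, illustrating the bound D_M ≤ D stated above.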
of the angle varies between zero and π/4, with the manifold anchor geometry being more concentrated near zero due to contributions from the fully supporting regime. In Sec. VI, we will show how the Gaussian geometry becomes relevant even for larger manifolds when the labels are highly imbalanced, i.e., sparse classification.

F. Geometry and classification of high-dimensional manifolds

In general, linear classification as expressed in Eq. (20) depends on high-order statistics of the anchor vectors s̃(t⃗, t_0). Surprisingly, our analysis shows that, for high-dimensional manifolds, the classification capacity can be described in terms of the statistics of R_M and D_M alone. This is particularly relevant as we expect that, in many applications, the affine dimension of the manifolds is large. More specifically, we define high-dimensional manifolds as manifolds where the manifold dimension is large, i.e., D_M ≫ 1 (but still finite in the thermodynamic limit). In practice, we find that D_M ≳ 5 is sufficient. Our analysis below elucidates the interplay between size and dimension, namely, how small R_M needs to be for high-dimensional manifolds to have a substantial classification capacity.

In the high-dimensional regime, the mean-field equations simplify due to self-averaging of terms involving sums of the components t_i and s̃_i. The quantity t⃗ · s̃ that appears in the capacity, Eq. (20), can be approximated as −t⃗ · s̃ ≈ κ_M, where we introduce the effective manifold margin κ_M = R_M √D_M. Combined with ‖s̃‖² ≈ R_M², we obtain

    α_M(κ) ≈ α_0( (κ + κ_M) / √(1 + R_M²) ),    (32)

where α_0 is the capacity for P random points, Eq. (9). To gain insight into this result, we note that the effective margin on the center is its mean distance from the point closest to the margin plane, s̃, which is roughly the mean of −t⃗ · s̃ ≈ R_M √D_M. The denominator in Eq. (32) indicates that the margin needs to be scaled by the input norm, √(1 + R_M²). In Appendix B, we show that Eq. (32) can also be written as

    α_M(κ) ≈ α_Ball(κ, R_M, D_M),    (33)

namely, the classification capacity of a general high-dimensional manifold is well approximated by that of ℓ2 balls with dimension D_M and radius R_M.

Scaling regime:—Equation (32) implies that, to obtain a finite capacity in the high-dimensional regime, the effective margin κ_M = R_M √D_M needs to be of order unity, which requires the radius to be small, scaling for large D_M as R_M = O(D_M^{−1/2}). In this scaling regime, the calculation of the capacity and geometric properties is particularly simple. As argued above, when the radius is small, the components of s̃ are small; hence v⃗ ≈ t⃗, and the Gaussian statistics for the geometry suffice. Thus, we can replace R_M and D_M in Eqs. (32) and (33) by R_g and D_g, respectively. Note that, in the scaling regime, the factor proportional to 1 + R_g² in Eq. (32) is the next-order correction to the overall capacity, since R_g is small. Notably, the margin in this regime, κ_g = R_g √D_g, is equal to half the Gaussian mean width of convex bodies, κ_g ≈ ½ w(S) [25]. As for the support structure, since the manifold size is small, the only significant contributions arise from the interior (k = 0) and touching (k = 1) supports.

Beyond the scaling regime:—When R_g is not small, the anchor geometry, R_M and D_M, cannot be adequately described by the Gaussian statistics, R_g and D_g. In this case, the manifold margin κ_M = R_M √D_M is large and Eq. (32) reduces to

    α_M ≈ (1 + R_M^{−2}) / D_M,    (34)

where we have used α_0(κ) ≈ κ^{−2} for large margins and assumed that κ ≪ κ_M. For strictly convex high-dimensional manifolds with R_M = O(1), only the touching regime (k = 1) contributes significantly to the geometry and, hence, to the capacity. For manifolds that are not strictly convex, partially supporting solutions with k ≪ D can also contribute to the capacity.

Finally, when R_M is large, fully supporting regimes with k ≈ D contribute to the geometry, in which case the manifold anchor dimension approaches the affine dimension, D_M → D, and Eqs. (32) and (34) reduce to α_M ≈ 1/D, as expected.

V. EXAMPLES

A. Strictly convex manifolds: ℓ2 ellipsoids

The family of ℓ2 ellipsoids are examples of manifolds that are strictly convex. Strictly convex manifolds have smooth boundaries that do not contain corners, edges, or flats (see Appendix B). Thus, the description of the anchor geometry and support structures is relatively simple. The reason is that the anchor vectors s̃(t⃗, t_0) correspond to interior (k = 0), touching (k = 1), or fully supporting (k = D + 1) configurations, while partial support (1 < k < D + 1) is not possible. The ℓ2 ellipsoid geometry can be solved analytically; nevertheless, because it has less symmetry than the sphere, it exhibits some salient properties of manifold geometry, such as nontrivial dimensionality and nonuniform measure of the anchor points. We assume the ellipsoids are centered relative to their symmetry centers as described in Sec. IV, and they can be parametrized by the set of points M^μ = x_0^μ + Σ_{i=1}^{D} s_i u_i^μ, where
031003-11
CHUNG, LEE, and SOMPOLINSKY PHYS. REV. X 8, 031003 (2018)
S = { s⃗ | Σ_{i=1}^D (s_i / R_i)² ≤ 1 }.  (35)

The components of u_i^μ and of the ellipsoid centers x_0^μ are i.i.d. Gaussian distributed with zero mean and variance 1/N, so that they are orthonormal in the large N limit. The radii R_i represent the principal radii of the ellipsoids relative to the center. The anchor points s̃(t⃗, t_0) can be computed explicitly (details in Appendix B), corresponding to three regimes.

The interior regime (k = 0) occurs when t_0 − κ ≥ t_touch(t⃗), where

t_touch(t⃗) = √( Σ_{i=1}^D R_i² t_i² ).  (36)

Here, λ = 0, resulting in zero contribution to the inverse capacity.

The touching regime (k = 1) holds when t_touch(t⃗) > t_0 − κ > t_fs(t⃗) and

t_fs(t⃗) = −√( Σ_{i=1}^D t_i² / R_i² ).  (37)

Finally, the fully supporting regime (k = D + 1) occurs when t_fs(t⃗) > t_0 − κ. The full expression for the capacity for l2 ellipsoids is given in Appendix B. Below, we focus on the interesting cases of ellipsoids with D ≫ 1.

High-dimensional ellipsoids:—It is instructive to apply the general analysis of high-dimensional manifolds to ellipsoids with D ≫ 1. We will distinguish among different-size regimes by assuming that all the radii of the ellipsoid are scaled by a global factor r. In the high-dimensional regime, due to self-averaging, the boundaries of the touching and fully supporting transitions can be approximated by

t_touch = √( Σ_{i=1}^D R_i² ),  (38)

t_fs = −√( Σ_{i=1}^D R_i^{−2} ),  (39)

both independent of t⃗. Then, as long as R_i √D are not large (see below), t_fs → −∞, and the probability of fully supporting vanishes. Discounting the fully supporting regime, the anchor vectors s̃(t⃗, t_0) are given by

s̃_i = −λ_i t_i,  (40)

λ_i = R_i² / [ Z (1 + R_i²) ],  (41)

where the normalization factor Z² = Σ_{i=1}^D R_i² / (1 + R_i²)² (see Appendix B). The capacity for high-dimensional ellipsoids is determined via Eq. (32) with manifold anchor radius

R_M² = Σ_{i=1}^D λ_i²  (42)

and anchor dimension

D_M = ( Σ_{i=1}^D λ_i )² / Σ_{i=1}^D λ_i².  (43)

Note that R_M/r and D_M are not invariant to scaling the ellipsoid by a global factor r, reflecting the role of the fixed centers.

The anchor covariance matrix:—We can also compute the covariance matrix of the anchor points. This matrix is diagonal in the principal directions of the ellipsoid, and its eigenvalues are λ_i². It is interesting to compare D_M with a well-known measure of an effective dimension of a covariance matrix, the participation ratio [26,27]. Given a spectrum of eigenvalues of the covariance matrix, in our notation λ_i², we can define a generalized participation ratio as PR_q = ( Σ_{i=1}^D λ_i^q )² / Σ_{i=1}^D λ_i^{2q}, q > 0. The conventional participation ratio uses q = 2, whereas D_M uses q = 1.

Scaling regime:—In the scaling regime where the radii are small, the radius and dimension are equivalent to the Gaussian geometry:

R_g² = Σ_{i=1}^D R_i^4 / Σ_{i=1}^D R_i²,  (44)

D_g = ( Σ_{i=1}^D R_i² )² / Σ_{i=1}^D R_i^4,  (45)

and the effective margin is given as

κ_g = ( Σ_{i=1}^D R_i² )^{1/2}.  (46)

As can be seen, D_g and R_g/r are invariant to scaling all the radii by r, as expected. It is interesting to compare the above Gaussian geometric parameters with the statistics induced by a uniform measure on the surface of the ellipsoid. In this case, the covariance matrix has eigenvalues λ_i² = R_i², and the total variance is equal to Σ_i R_i². In contrast, in the Gaussian geometry, the eigenvalues of the covariance matrix λ_i² are proportional to R_i^4. This and the
CLASSIFICATION AND GEOMETRY OF GENERAL … PHYS. REV. X 8, 031003 (2018)
corresponding expression (44) for the squared radius is the result of a nonuniform induced measure on the surface of the ellipse in the anchor geometry, even in its Gaussian limit.

Beyond the scaling regime:—When R_i = O(1), the high-dimensional ellipsoids all become touching manifolds, since t_touch → ∞ and t_fs → −∞. The capacity is small because κ_M ≫ 1 and is given by Eq. (34) with Eqs. (42) and (43). Finally, when all R_i ≫ 1, we have

R_M² = D / Σ_{i=1}^D R_i^{−2} = ⟨R_i^{−2}⟩_i^{−1}.  (47)

As an example, Fig. 7 considers ellipsoids with a bimodal distribution of radii: R_i = r for 1 ≤ i ≤ 10, and R_i = 0.1r for 11 ≤ i ≤ 200 [Fig. 7(a)]. Their properties are shown as a function of the overall scale r. Figure 7(b) shows numerical simulations of the capacity, the full mean-field solution, and the spherical high-dimensional approximation (with R_M and D_M). These calculations are all in good agreement, showing the accuracy of the mean-field theory and spherical approximation. As seen in Figs. 7(b)–7(d), the system is in the scaling regime for r < 0.3. In this regime, the manifold dimension is constant and equals D_g ≈ 14, as predicted by the participation ratio, Eq. (45), and the manifold radius R_g is linear with r, as expected from Eq. (44). The ratio R_g/r ≈ 0.9 is close to unity, indicating that, in the scaling regime, the system is dominated by the largest radii. For r > 0.3, the effective margin is larger than unity, and the system becomes increasingly affected by the full affine dimensionality of the ellipsoid, as seen by the marked increase in dimension, as well as a corresponding decrease in R_M/r. For larger r, D_M approaches D and α_M^{−1} = D. Figures 7(e1)–7(e3) show the distributions of the support dimension 0 ≤ k ≤ D + 1. In the scaling regime, the interior and touching regimes each have probability close to 1/2, and the fully supporting regime is negligible. As r increases beyond the scaling regime, the interior probability decreases, and the solution is almost exclusively in the touching regime. For very high values of r, the fully supporting solution gains a substantial probability. Note that the capacity decreases to approximately 1/D for a value of r below which a substantial fraction of solutions are fully supporting. In this case, the touching ellipsoids all have a very small angle with the margin plane.

FIG. 7. Bimodal l2 ellipsoids. (a) Ellipsoidal radii R_i with r = 1. (b) Classification capacity as a function of scaling factor r: full mean-field theory capacity (blue lines); approximation of the capacity given by equivalent ball with R_M and D_M (black dashed lines); simulation capacity (circles), averaged over five repetitions, measured with 50 dichotomies per repetition. (c) Manifold dimension D_M as a function of r. (d) Manifold radius R_M relative to r. (e) Fraction of manifolds with support dimension k for different values of r: k = 0 (interior), k = 1 (touching), k = D + 1 (fully supporting). (e1) Small r = 0.01, where most manifolds are interior or touching. (e2) r = 1, where most manifolds are in the touching regime. (e3) r = 100, where the fraction of fully supporting manifolds is 0.085, predicted by H(−t_fs) = H(√(Σ_i R_i^{−2})) [Eq. (39)].

Until now, we have assumed that the manifold affine subspace dimension D is finite in the limit of large ambient dimension N. In realistic data, it is likely that the data manifolds are technically full rank, i.e., D + 1 = N, raising the question of whether our mean-field theory is still valid in such cases. We investigate this scenario by computing the capacity on ellipsoids containing a realistic distribution of radii. We have taken, as examples, a class of images from the ImageNet data set [28], and analyzed the SVD spectrum of the representations of these images in the last layer of a deep convolutional network, GoogLeNet [29]. The computed radii are shown in Fig. 8(a) and yield a value D = N − 1 = 1023. In order to explore the properties of such manifolds, we have scaled all the radii by an overall factor r in our analysis. Because of the decay in the distribution of radii, the Gaussian dimension for the ellipsoid is only about D_g ≈ 15, much smaller than D or N, implying that, for small r, the manifolds are effectively low dimensional and the geometry is dominated by a small number of radii. As r increases above r ≳ 0.03, […]
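The anchor geometry of Eqs. (41)–(43), its Gaussian limit, Eqs. (44)–(45), and the spherical approximation of Eq. (32) are straightforward to evaluate numerically. The following sketch is our illustration, not code from the paper; it uses the closed-form Gardner inverse point capacity (1 + κ²)Φ(κ) + κφ(κ) for Eq. (9), and it reproduces the D_g ≈ 14 and R_g/r ≈ 0.9 quoted above for the bimodal radii of Fig. 7:

```python
import math

def alpha0(kappa):
    # Gardner point capacity, Eq. (9):
    # alpha_0(kappa)^-1 = int_{-kappa}^inf Dt (t + kappa)^2
    #                   = (1 + kappa^2) Phi(kappa) + kappa phi(kappa).
    Phi = 0.5 * (1.0 + math.erf(kappa / math.sqrt(2.0)))
    phi = math.exp(-kappa * kappa / 2.0) / math.sqrt(2.0 * math.pi)
    return 1.0 / ((1.0 + kappa * kappa) * Phi + kappa * phi)

def ellipsoid_geometry(radii):
    # Anchor geometry of a high-dimensional l2 ellipsoid, Eqs. (41)-(43).
    Z = math.sqrt(sum(R**2 / (1.0 + R**2) ** 2 for R in radii))
    lam = [R**2 / (Z * (1.0 + R**2)) for R in radii]
    R_M = math.sqrt(sum(l * l for l in lam))        # Eq. (42)
    D_M = sum(lam) ** 2 / sum(l * l for l in lam)   # Eq. (43)
    return R_M, D_M

def alpha_spherical(kappa, R_M, D_M):
    # Spherical (ball) approximation of the capacity, Eq. (32).
    k_M = R_M * math.sqrt(D_M)
    return alpha0((kappa + k_M) / math.sqrt(1.0 + R_M**2))

# Bimodal radii of Fig. 7: R_i = r for i <= 10, R_i = 0.1 r for 11 <= i <= 200.
for r in (0.01, 1.0, 100.0):
    radii = [r] * 10 + [0.1 * r] * 190
    R_M, D_M = ellipsoid_geometry(radii)
    print(r, R_M / r, D_M, alpha_spherical(0.0, R_M, D_M))

# Scaling-regime (Gaussian) geometry, Eqs. (44)-(45):
radii = [1.0] * 10 + [0.1] * 190
s2 = sum(R**2 for R in radii)
s4 = sum(R**4 for R in radii)
print(s2**2 / s4, math.sqrt(s4 / s2))   # D_g ~ 14.1, R_g/r ~ 0.92
```

At small r, the anchor values approach the Gaussian ones, while at large r the predicted capacity becomes small, consistent with the touching and fully supporting regimes described above.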
FIG. 8. Ellipsoids with radii computed from realistic image data. (a) SVD spectrum R_i, taken from the readout layer of GoogLeNet for a class of ImageNet images with N = 1024 and D = 1023. The radii are scaled by a factor r: R_i′ = rR_i. (b) Classification capacity as a function of r: full mean-field theory capacity (blue lines); approximation of the capacity as that of a ball with R_M and D_M from the theory for ellipsoids (black dashed lines); simulation capacity (circles), averaged over 5 repetitions, measured with 50 random dichotomies per each repetition. (c) Manifold dimension as a function of r. (d) Manifold radius relative to the scaling factor r, R_M/r, as a function of r.

{x_0^μ ± R_k u_k^μ, k = 1, …, D}. The vectors u_i^μ specify the principal axes of the l1 ellipsoids. For simplicity, we consider the case of l1 balls when all the radii are equal: R_i = R. We will concentrate on the cases when l1 balls are high dimensional; the case of l1 balls with D = 2 was briefly described in Ref. [15]. The analytical expression of the capacity is complex due to the presence of contributions from all types of supports, 1 ≤ k ≤ D + 1. We address important aspects of the high-dimensional solution below.

High-dimensional l1 balls, scaling regime:—In the scaling regime, we have v⃗ ≈ t⃗. In this case, we can write the solution for the subgradient as

s̃_i(t⃗) = −R sign(t_i) if |t_i| > |t_j| for all j ≠ i, and s̃_i(t⃗) = 0 otherwise.  (49)

In other words, s̃(t⃗) is a vertex of the polytope corresponding to the component of t⃗ with the largest magnitude; see Fig. 5(c). The components of t⃗ are i.i.d. Gaussian random variables and, for large D, its maximum component t_max is concentrated around √(2 log D). Hence, D_g = ⟨(ŝ · t⃗)²⟩ = ⟨t_max²⟩ = 2 log D, which is much smaller than D. This result is consistent with the fact that the Gaussian mean width of a D-dimensional l1 ball scales with √(log D) and not with √D [25]. Since all the points have norm R, we have R_g = R, and the effective margin is then given by κ_g = R√(2 log D), which is order unity in the scaling regime. In this […]
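The ⟨t_max²⟩ = 2 log D statistic above is easy to check empirically. Below is a small Monte Carlo sketch of our own (with a hypothetical choice of D and sample count); note that convergence to the asymptotic value is slow in D, with the well-known log-log corrections of extreme-value statistics:

```python
import math, random

# For an l1 ball, the anchor is the vertex picked by the largest-magnitude
# component of t [Eq. (49)], so D_g = <t_max^2> tracks 2 log D.
random.seed(0)
D, trials = 1000, 2000
acc = 0.0
for _ in range(trials):
    t_max = max(abs(random.gauss(0.0, 1.0)) for _ in range(D))
    acc += t_max * t_max
D_g_est = acc / trials
print(round(D_g_est, 1), round(2 * math.log(D), 1))   # ~12-14 vs 13.8; both << D
```

Even at D = 1000 the estimate sits somewhat below 2 log D, but both are two orders of magnitude smaller than the affine dimension, which is the point of the scaling argument.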
M^μ = x_0^μ + Σ_{i=1}^D s_i u_i^μ, where x_0^μ represents the mean (over θ) of the population response to the object. The different D components correspond to the different Fourier components of the angular response, so that […]
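Although the explicit ring-manifold expressions are cut off here, the Gaussian dimension of such a manifold can be estimated directly from its Fourier parametrization. The sketch below is our own construction, assuming uniform amplitudes R_n = √(2/D) r as in Fig. 10, and estimating D_g ≈ ⟨max_θ (t⃗ · ŝ(θ))²⟩ in analogy with the l1-ball argument, where the scaling-regime anchor is the point on the ring maximizing its overlap with t⃗:

```python
import math, random

# Ring manifold s(theta) = sum_n R_n (cos(n theta) u_{2n-1} + sin(n theta) u_{2n})
# with uniform R_n = sqrt(2/D) r, so |s(theta)| = r for every theta.
random.seed(2)
D, r, trials, grid = 20, 1.0, 400, 180
H = D // 2
Rn = math.sqrt(2.0 / D) * r
cosT = [[math.cos((n + 1) * 2 * math.pi * j / grid) for j in range(grid)]
        for n in range(H)]
sinT = [[math.sin((n + 1) * 2 * math.pi * j / grid) for j in range(grid)]
        for n in range(H)]
acc = 0.0
for _ in range(trials):
    t = [random.gauss(0.0, 1.0) for _ in range(D)]
    best = max(sum(Rn * (t[2 * n] * cosT[n][j] + t[2 * n + 1] * sinT[n][j])
                   for n in range(H)) for j in range(grid))
    acc += (best / r) ** 2          # normalize by |s(theta)| = r
D_g_est = acc / trials
print(round(D_g_est, 1), 2 * math.log(D))  # comparable to 2 log D ~ 6, << D = 20
```

The estimate comes out far below the affine dimension D, of the same order as 2 log D, in line with the low-dimensional scaling-regime behavior reported for ring manifolds in Fig. 10.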
FIG. 10. Linear classification of D-dimensional ring manifolds, with uniform R_n = √(2/D) r. (a) Classification capacity for D = 20 as a function of r with m = 200 test samples (for details of numerical simulations, see SM [16], Sec. S9): mean-field theory (blue lines), spherical approximation (black dashed lines), numerical simulations (black circles). Inset: Illustration of a ring manifold. (b) Manifold dimension D_M, which shows D_M ∼ D in the large r limit, showing the orthogonality of the solution. (c) Manifold radius relative to the scaling factor r, R_M/r, as a function of r. The fact that R_M/r becomes small implies that the manifolds are "fully supporting" to the hyperplane, showing the small radius structure. (d) Manifold dimension grows with affine dimension D as 2 log D for small r = 1/(2√(2 log D)) in the scaling regime. (e1)–(e3) Distribution of support dimensions: (e1) r = 0.01, where most manifolds are either interior or touching; (e2) r = 20, where the support dimension has a peaked distribution, truncated at D/2; (e3) r = 200, where most support dimensions are D/2 or fully supporting at D + 1.

pair of (real and imaginary) Fourier harmonics. The ring manifolds are closely related to the trigonometric moment curve, whose convex hull geometrical properties have been extensively studied [30,31].

In conclusion, the smoothness of the convex hulls becomes apparent in the distinct patterns of the support dimensions [compare Figs. 8(d) and 10(e)]. However, we see that all manifolds with D_g ∼ 5 or larger share common trends. As the size of the manifold increases, the capacity and geometry vary smoothly, exhibiting a smooth crossover from a high capacity with low radius and dimension to a low capacity with large radius and dimension. This crossover occurs as R_g ∝ 1/√D_g. Also, our examples demonstrate that, in many cases, when the size is smaller than the crossover value, the manifold dimensionality is substantially smaller than that expected from naive second-order statistics, highlighting the saliency and significance of our anchor geometry.

VI. MANIFOLDS WITH SPARSE LABELS

So far, we have assumed that the number of manifolds with positive labels is approximately equal to the number of manifolds with negative labels. In this section, we consider the case where the two classes are unbalanced such that the number of positively labeled manifolds is far smaller than the number of negatively labeled manifolds (the opposite scenario is equivalent). This is a special case of the problem of classification of manifolds with heterogeneous statistics, where manifolds have different geometries or label statistics. We begin by addressing the capacity of mixtures of manifolds and then focus on sparsely labeled manifolds.

A. Mixtures of manifold geometries

Our theory of manifold classification is readily extended to a heterogeneous ensemble of manifolds, consisting of L distinct classes. In the replica theory, the shape of the manifolds appears only in the free energy term G_1 [see Appendix A, Eq. (A13)]. For mixture statistics, the combined free energy is given by simply averaging the individual free energy terms over each class l. Recall that this free energy term determines the capacity for each shape, giving an individual inverse critical load α_l^{−1}. The inverse capacity of the heterogeneous mixture is then

α^{−1} = ⟨α_l^{−1}⟩_l,  (52)

where the average is over the fractional proportions of the different manifold classes. This remarkably simple but generic theoretical result enables analyzing diverse manifold classification problems, consisting of mixtures of manifolds with varying dimensions, shapes, and sizes.

Equation (52) is adequate for classes that differ in their geometry, independent of the assigned labels. In general, classes may differ in the label statistics (as in the sparse case studied below) or in geometry that is correlated with labels. For instance, the positively labeled manifolds may consist of one geometry, and the negatively labeled manifolds may have a different geometry. How do structural differences between the two classes affect the capacity of the linear classification? A linear classifier can take advantage of these correlations by adding a nonzero bias. Previously, it was assumed that the optimal separating hyperplane passes through the origin; this is reasonable when the two classes are statistically the same. However, when there are statistical differences between the label assignments of the two classes, Eq. (2) should be replaced by y^μ(w · x^μ − b) ≥ κ, where the bias b is chosen to maximize the mixture
capacity, Eq. (52). The effect of optimizing the bias is discussed in detail in the next section for sparse labels and in other scenarios in the SM [16] (Sec. S3).

B. Manifolds with sparse labels

We define the sparsity parameter f as the fraction of positively labeled manifolds, so that f = 0.5 corresponds to having balanced labels. From the theory of the classification of a finite set of random points, it is known that having sparse labels with f ≪ 0.5 can drastically increase the capacity [12]. In this section, we investigate how the sparsity of manifold labels improves the manifold classification capacity.

If the separating hyperplane is constrained to go through the origin and the distribution of inputs is symmetric around the origin, the labeling y^μ is immaterial to the capacity. Thus, the effect of sparse labels is closely tied to having a nonzero bias. We thus consider inequality constraints of the form y^μ(w · x^μ − b) ≥ κ, and we define the bias-dependent capacity of general manifolds with label sparsity f, margin κ, and bias b, as α_M(κ, f, b). Next, we observe that the bias acts as a positive contribution to the margin for the positively labeled population and as a negative contribution for the negatively labeled population. Thus, the dependence of α_M(κ, f, b) on both f and b can be expressed as

α_M^{−1}(κ, f, b) = f α_M^{−1}(κ + b) + (1 − f) α_M^{−1}(κ − b),  (53)

where α_M(x) is the classification capacity with zero bias (and, hence, equivalent to the capacity with f = 0.5) for the same manifolds. Note that Eq. (53) is similar to Eq. (52) for mixtures of manifolds. The actual capacity with sparse labels is given by optimizing the above expression with respect to b, i.e.,

α_M(κ, f) = max_b α_M(κ, f, b).  (54)

In the following, we consider for simplicity the effect of sparsity for zero margin, κ = 0.

Importantly, if D is not large, the effect of the manifold geometry in sparsely labeled manifolds can be much larger than that for nonsparse labels. For nonsparse labels, the capacity ranges between 2 and (D + 0.5)^{−1}, while for sparse manifolds, the upper bound can be much larger. Indeed, small-sized manifolds are expected to have capacity that increases upon decreasing f as α_M(0, f) ∝ 1/(f |log f|), similar to P uncorrelated points [12]. This potential increase in the capacity of sparse labels is, however, strongly constrained by the manifold size, since, when the manifolds are large, the solution has to be orthogonal to the manifold directions, so that α_M(0, f) ≈ 1/D. Thus, the geometry of the manifolds plays an important role in controlling the effect of sparse labels on capacity. These aspects are already seen in the case of sparsely labeled l2 balls (Appendix E). Here, we summarize the main results for general manifolds.

Sparsity and size:—There is a complex interplay between label sparsity and manifold size. Our analysis yields three qualitatively different regimes.

Low R_g:—When the Gaussian radius of the manifolds is small, i.e., R_g < 1, the extent of the manifolds is noticeable only when the dimension is high. Similar to our previous analysis of high-dimensional manifolds, we find here that the sparse capacity is equivalent to the capacity of sparsely labeled random points, with an effective margin given by R_g √D_g,

α_M(f) ≈ α_0(f, κ = R_g √D_g),  (55)

where α_0(κ, f) = max_b α_0(κ, f, b) from the Gardner theory. It should be noted that κ_g in the above equation has a noticeable effect only for moderate sparsity. It has a negligible effect when f → 0, since the bias is large and dominates over the margin.

Moderate sizes, R_g > 2:—In this case, the equivalence to the capacity of points breaks down. Remarkably, we find that the capacity of general manifolds with substantial size is well approximated by that of equivalent l2 balls with the same sparsity f and with dimension and radius equal to the Gaussian dimension and radius of the manifolds, namely,

α_M(f) ≈ α_Ball(f; R_g, D_g).  (56)

Surprisingly, unlike the nonsparse approximation, where the equivalence of general manifolds to balls, Eq. (33), is valid only for high-dimensional manifolds, in the sparse limit, the spherical approximation is not restricted to large D. Another interesting result is that the relevant statistics are given by the Gaussian geometry, R_g and D_g, even when R_g is not small. The reason is that, for small f, the bias is large. In that case, the positively labeled manifolds have a large positive margin b and are fully supporting, giving a contribution to the inverse capacity of b², regardless of their detailed geometry. On the other hand, the negatively labeled manifolds have a large negative margin, implying that most of them are far from the separating plane (interior) and a small fraction have touching support. The fully supporting configurations have negligible probability; hence, the overall geometry is well approximated by the Gaussian quantities D_g and R_g.

Scaling relationship between sparsity and size:—A further analysis of the capacity of balls with sparse labels shows that it retains a simple form α(f̄, D) [see Appendix E, Eq. (E10)], which depends on R and f only through the scaled sparsity f̄ = fR². The reason for the scaling of f with R² is as follows. When the labels are sparse, the dominant contribution to the inverse capacity comes from the minority class, and so the capacity is α_Ball^{−1} ≈ fb² for large b. On the
other hand, the optimal value of b depends on the balance between the contributions from both classes and scales linearly with R, as it needs to overcome the local fields from the spheres. Thus, α^{−1} ∝ fR².

Combining Eq. (E10) with Eq. (56) yields, for general sparsely labeled manifolds,

α_M^{−1} = f̄ b̄² + ∫_{b̄}^∞ dt χ_{D_g}(t) (t − b̄)²,  (57)

where the scaled sparsity is f̄ = f(1 + R_g²) ≲ 1.

Note that we have defined the scaled sparsity f̄ = f(1 + R_g²), rather than fR², to yield a smoother crossover to the small R_g regime. Similarly, we define the optimal scaled bias as b̄ = b/√(1 + R_g²) ≫ 1. Qualitatively, the function α is roughly proportional to f̄^{−1}, with a proportionality constant that depends on D_g. In the extreme limit of |log f̄| ≫ D_g, we obtain the sparse limit α ∝ |f̄ log f̄|^{−1}. Note that, if f is sufficiently small, the gain in capacity due to sparsity occurs even for large manifolds, as long as f̄ < 1.

Large R_g regime, f̄ > 1:—Finally, when R_g is sufficiently large such that f̄ increases above 1, b̄ is of order 1 or smaller, and the capacity is small, with a value that depends on the detailed geometry of the manifold. In particular, when f̄ ≫ 1, the capacity of the manifold approaches α_M(f) → 1/D, not 1/D_g.

To demonstrate these remarkable predictions, Eq. (56), we present in Fig. 11 the capacity of three classes of sparsely labeled manifolds: l2 balls, l1 ellipsoids, and ring manifolds. In all cases, we show the results of numerical simulations of the capacity, the full mean-field solution, and the spherical approximation, Eq. (56), across several orders of magnitude of sparsity and size, as a function of the scaled sparsity f̄ = f(1 + R²). In each example, there is good agreement between the three calculations in the range f̄ < 1. Furthermore, the drop in α with increasing f̄ is similar in all cases, except for an overall vertical shift, which is due to the different D_g, similar to the effect of dimension in l2 balls [Fig. 11(a)]. In the regime of moderate radii, results for different f and r all fall on a universal curve, which depends only on f̄, as predicted by the theory. For small r, the capacity deviates from this scaling, as it is dominated by f alone, similar to sparsely labeled points. When f̄ > 1, the curves deviate from the spherical approximation. The true capacity (as revealed by simulations and full mean field) rapidly decreases with f̄ and saturates at 1/D, rather than at the 1/D_g limit of the spherical approximation. Finally, we note that our choice of parameters in Fig. 11(c) (SM [16], Sec. VII) was such that R_g (entering the scaled sparsity) was significantly different from simply an average radius. Thus, the agreement with the theory illustrates the important role of the Gaussian geometry in the sparse case.

As discussed above, in the (high and moderate) sparse regimes, a large bias alters the anchor geometry of the two classes in different ways. To illustrate this important aspect, we show in Fig. 12 the effect of sparsity and bias on the
geometry of the l1 ellipsoids studied in Fig. 11(c). Here, we show the evolution of R_M and D_M for the majority and minority classes as f̄ increases. Note that, despite the fact that the shape of the manifolds is the same for all f, their manifold anchor geometry depends on both class membership and sparsity levels, because these measures depend on the margin. When f̄ is small, the minority class has D_M = D, as seen in Fig. 12(c); i.e., the minority-class manifolds are close to being fully supporting due to the large positive margin. This can also be seen in the distributions of support dimension shown in Figs. 12(a) and 12(b). On the other hand, the majority class has D_M ≈ D_g = 2 log D ≪ D, and these manifolds are mostly in the interior regime. As f increases, the geometrical statistics for the two classes become more similar. This is seen in Figs. 12(c) and 12(d), where D_M and R_M for both majority and minority classes converge to the zero-margin values for large f̄ = f(1 + R_g²).

VII. SUMMARY AND DISCUSSION

Summary:—We have developed a statistical mechanical theory for linear classification of inputs organized in perceptual manifolds, where all points in a manifold share the same label. The notion of perceptual manifolds is critical in a variety of contexts in computational neuroscience modeling and in signal processing. Our theory is not restricted to manifolds with smooth or regular geometries; it applies to any compact subset of a D-dimensional affine subspace. Thus, the theory is applicable to manifolds arising from any variation in neuronal responses with a continuously varying physical variable, or from any sampled set arising from experimental measurements on a limited number of stimuli.

The theory describes the capacity of a linear classifier to separate a dichotomy of general manifolds (with a given margin) by a universal set of mean-field equations. These equations may be solved analytically for simple geometries, but, for more complex geometries, we have developed iterative algorithms to solve the self-consistent equations. The algorithms are efficient and converge fast, as they involve only solving for O(D) variables of a single manifold, rather than invoking simulations of a full system of P manifolds embedded in R^N.

Applications:—The statistical mechanical theory of perceptron learning has long provided a basis for understanding the performance and fundamental limitations of single-layer neural architectures and their kernel extensions. However, the previous theory only considered a finite number of random points, with no underlying geometric structure, and could not explain the performance of linear classifiers on a large, possibly infinite number of inputs organized as distinct manifolds by the variability due to changes in the physical parameters of objects. The theory presented in this work can explain the capacity and limitations of the linear classification of general manifolds.

This new theory is important for the understanding of how sensory neural systems perform invariant perceptual discrimination and recognition tasks of realistic stimuli. Furthermore, beyond estimating the classification capacity, our theory provides theoretically based geometric measures for assessing the quality of the neural representations of the perceptual manifolds. There are a variety of ways for defining the geometry of manifolds. Our geometric measures are unique in that they determine the ability to linearly separate the manifold, as our theory shows.

Our theory focuses on linear classification. However, it has broad implications for nonlinear systems and, in particular, for deep networks. First, most models of sensory discrimination and recognition in biological and artificial deep architectures model the readouts of the networks as linear classifiers operating on the top sensory layers. Thus, our manifold classification capacity and geometry can be applied to understand the performance of the deep network. Furthermore, the computational advantage of having multiple intermediate layers can be assessed by comparing the performance of a hypothetical linear classifier operating on these layers. In addition, the changes in the quality of representations across the deep layers can be assessed by comparing the changes in the manifolds' geometries across layers. Indeed, previous discussions of sensory processing in deep networks hypothesized that neural object representations become increasingly untangled as the signal propagates along the sensory hierarchy. However, no concrete measure of untangling has been provided. The geometric measures derived from our theory can be used to quantify the degree of entanglement, tracking how the perceptual manifolds are nonlinearly reformatted as they propagate through the multiple layers of a neural network to eventually allow for linear classification in the top layer. Notably, the statistical, population-based nature of our geometric measures renders them ideally suited for comparison between layers of different sizes and nonlinearities, as well as between different deep networks or between artificial networks and biological ones. Lastly, our theory can suggest new algorithms for building deep networks, for instance, by imposing successive reduction of manifold dimensions and radii as part of an unsupervised learning strategy.
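As a concrete illustration of the sparse-label machinery of Sec. VI, Eqs. (53) and (54) reduce, for point-like manifolds (R_g → 0), to a one-dimensional optimization over the bias. The following is a minimal sketch of our own, using the closed-form Gardner inverse point capacity (1 + κ²)Φ(κ) + κφ(κ) and a simple grid scan over b:

```python
import math

def inv_alpha0(kappa):
    # Inverse Gardner point capacity at a (possibly negative) margin kappa.
    Phi = 0.5 * (1.0 + math.erf(kappa / math.sqrt(2.0)))
    phi = math.exp(-kappa * kappa / 2.0) / math.sqrt(2.0 * math.pi)
    return (1.0 + kappa * kappa) * Phi + kappa * phi

def sparse_point_capacity(f, kappa=0.0):
    # Eqs. (53)-(54): the bias adds to the margin of the minority (+) class
    # and subtracts from the majority (-) class; scan b, keep the best capacity.
    best = 0.0
    for i in range(2001):
        b = 8.0 * i / 2000.0
        inv = f * inv_alpha0(kappa + b) + (1.0 - f) * inv_alpha0(kappa - b)
        best = max(best, 1.0 / inv)
    return best

print(sparse_point_capacity(0.5))   # balanced labels: b = 0 optimal, alpha = 2
print(sparse_point_capacity(0.01))  # f << 0.5: capacity rises well above 2
```

The balanced case recovers α = 2, while f = 0.01 already yields an order-of-magnitude gain, in line with the α_M(0, f) ∝ 1/(f |log f|) scaling quoted in Sec. VI.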
Some of these applications may require further extensions of the present theory. The most important ones (currently the subject of ongoing work) include the following.

Correlations:—In the present work, we have assumed that the directions of the affine subspaces of the different manifolds are uncorrelated. In realistic situations, we expect to see correlations in the manifold geometries, mainly of two types. One is center-center correlations. Such correlations can be harmful for linear separability [32,33]. Another is correlated variability, in which the directions of the affine subspaces are correlated but not the centers. Positive correlations of the latter form are beneficial for separability. In the extreme case when the manifolds share a common affine subspace, the rank of the union of the subspaces is D_tot = D, rather than D_tot = PD, and the solution weight vector need only lie in the null space of this smaller subspace. Further work is needed to extend the present theory to incorporate more general correlations.
Generalization performance:—We have studied the separability of manifolds with known geometries. In many realistic problems, this information is not readily available and only samples reflecting the natural variability of input patterns are provided. These samples can be used to estimate the underlying manifold model (using manifold learning techniques [34,35]) and/or to train a classifier based upon a finite training set. Generalization error describes how well a classifier trained on a finite number of samples would perform on other test points drawn from the manifolds [36]. It would be important to extend our theory to calculate the expected generalization error achieved by the maximum margin solution trained on point cloud manifolds, as a function of the size of the training set and the geometry of the underlying full manifolds.
Unrealizable classification:—Throughout the present work, we have assumed that the manifolds are separable by a linear classifier. In realistic problems, the load may be above the capacity for linear separation, i.e., α > α_M(κ = 0). Alternatively, neural noise may cause the manifolds to be unbounded in extent, with the tails of their distribution overlapping so that they are not separable with zero error. There are several ways to handle this issue in supervised learning problems. One possibility is to nonlinearly map the unrealizable inputs to a higher-dimensional feature space, via a multilayer network or nonlinear kernel function, where the classification can be performed with zero error. The design of multilayer networks could be facilitated using manifold processing principles uncovered by our theory.

Another possibility is to introduce an optimization problem allowing a small training error, for example, using an SVM with complementary slack variables [14]. These procedures raise interesting theoretical challenges, including understanding how the geometry of manifolds changes as they undergo nonlinear transformations, as well as investigating, by statistical mechanics, the performance of a linear classifier of manifolds with slack variables [37].

In conclusion, we believe the application of this theory and its corollary extensions will precipitate novel insights into how perceptual systems, biological or artificial, efficiently code and process complex sensory information.

ACKNOWLEDGMENTS

We would like to thank Uri Cohen, Ryan Adams, Leslie Valiant, David Cox, Jim DiCarlo, Doris Tsao, and Yoram Burak for helpful discussions. The work is partially supported by the Gatsby Charitable Foundation, the Swartz Foundation, the Simons Foundation (SCGB Grant No. 325207), the National Institutes of Health (Grant No. 1U19NS104653), the MAFAT Center for Deep Learning, and the Human Frontier Science Program (Project No. RGP0015/2013). D. D. Lee also acknowledges the support of the U.S. National Science Foundation, Army Research Laboratory, Office of Naval Research, Air Force Office of Scientific Research, and Department of Transportation.

APPENDIX A: REPLICA THEORY OF MANIFOLD CAPACITY

In this section, we outline the derivation of the mean-field replica theory summarized in Eqs. (11) and (12). We define the capacity of linear classification of manifolds, α_M(κ), as the maximal load, α = P/N, for which with high probability a solution to y^μ w · x^μ ≥ κ exists for a given κ. Here, x^μ are points on the P manifolds M^μ, Eq. (1), and we assume that all NP(D + 1) components of {u_i^μ} are drawn independently from a Gaussian distribution with zero mean and variance 1/N, and that the binary labels y^μ = ±1 are randomly assigned to each manifold with equal probabilities. We consider the thermodynamic limit, where N, P → ∞, while α = P/N and D remain finite.

Note that the geometric margin κ′, defined as the distance from the solution hyperplane, is given by y^μ w · x^μ ≥ κ′‖w‖ = κ′√N. However, this distance depends on the scale of the input vectors x^μ. The correct scaling of the margin in the thermodynamic limit is κ′ = (‖x‖/√N)κ. Since we adopted the normalization ‖x^μ‖ = O(1), the correct scaling of the margin is y^μ w · x^μ ≥ κ.

Evaluation of solution volume:—Following Gardner's replica framework, we first consider the volume Z of the solution space for α < α_M(κ). We define the signed projections of the ith direction vector u_i^μ on the solution weight as H_i^μ = √N y^μ w · u_i^μ, where i = 1, …, D + 1 and μ = 1, …, P. Then, the separability constraints can be written as Σ_{i=1}^{D+1} S_i H_i^μ ≥ κ. Hence, the volume can be written as

Z = ∫ d^N w δ(w² − N) Π_{μ=1}^{P} Θ(g_S(H⃗^μ) − κ),  (A1)

where Θ(x) is the Heaviside step function, and g_S is the support function of S, defined as in Eq. (12) by g_S(V⃗) = min_{S⃗} {V⃗ · S⃗ | S⃗ ∈ S}.
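For a manifold known only through a finite set of samples, the support function entering Eq. (A1) is a single minimum over those samples. A small sketch (the function name and all sizes are our own choices; the check assumes the closed form g(V⃗) = v₀ − R‖v⃗‖ for an l₂ ball of radius R, derived later in Appendix B):

```python
import numpy as np

def support_g(V, S):
    # g_S(V) = min over manifold points S of V . S, the support function in Eq. (A1);
    # V has shape (D+1,), S has shape (M, D+1), holding M sampled points (s, 1).
    return float(np.min(S @ V))

# Check against the closed form for an l2 ball of radius R: g(V) = v0 - R*||v||.
rng = np.random.default_rng(1)
D, R, M = 5, 2.0, 200000
s = rng.standard_normal((M, D))
s = R * s / np.linalg.norm(s, axis=1, keepdims=True)  # samples on the sphere ||s|| = R
S = np.hstack([s, np.ones((M, 1))])

V = np.append(rng.standard_normal(D), 0.7)            # the vector (v, v0)
closed_form = V[-1] - R * np.linalg.norm(V[:-1])
print(support_g(V, S), closed_form)                   # nearly equal for large M
```

The sampled minimum always lies at or above the true support function of the continuous ball, and converges to it as the sampling density grows.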
The volume defined above depends on the quenched random variables u_i^μ and y^μ through H_i^μ. It is well known that, in order to obtain the typical behavior in the thermodynamic limit, we need to average log Z, which we carry out using the replica trick, ⟨log Z⟩ = lim_{n→0} (⟨Z^n⟩ − 1)/n, where ⟨·⟩ refers to the average over u_i^μ and y^μ. For natural n, we need to evaluate

⟨Z^n⟩ = ∫ Π_{α=1}^{n} dw_α δ(w_α² − N) Π_{μ=1}^{P} ∫ DH⃗^{μα} ⟨ Π_{i=1}^{D+1} √(2π) δ(H_i^{μα} − y^μ w_α^T u_i^μ) ⟩_{u_i^μ, y^μ},  (A2)

where we have used the notation

DH⃗ = Π_{i=1}^{D+1} (dH_i/√(2π)) Θ(g_S(H⃗) − κ).  (A3)

Using a Fourier representation of the delta functions, we obtain

⟨Z^n⟩ = ∫ Π_{α=1}^{n} dw_α δ(w_α² − N) Π_{μ=1}^{P} ∫ DH⃗^{μα} × ∫ Π_{i=1}^{D+1} (dĤ_i^{μα}/√(2π)) ⟨ exp{ iĤ_i^{μα}(H_i^{μα} − y^μ w_α^T u_i^μ) } ⟩_{u_i^μ, y^μ}.  (A4)

Performing the average over the Gaussian distribution of u_i^μ (each of the N components has zero mean and variance 1/N) yields

⟨ exp( −i Σ_{μα} Σ_{i=1}^{D+1} Ĥ_i^{μα} y^μ Σ_{j=1}^{N} w_α^j u_{i,j}^μ ) ⟩_{u_i^μ, y^μ} = exp( −(1/2) Σ_{αβ} q_{αβ} Σ_{iμ} Ĥ_i^{μα} Ĥ_i^{μβ} ),  (A5)

where q_{αβ} = (1/N) Σ_{j=1}^{N} w_α^j w_β^j. Thus, integrating the variables Ĥ_i^{μα} yields

⟨Z^n⟩ = ∫ Π_{α=1}^{n} dw_α δ(w_α² − N) ∫ dq_{αβ} Π_{αβ} δ(N q_{αβ} − w_α^T w_β) exp( −[P(D + 1)/2] log det q ) X,  (A6)

where

X = ∫ Π_α DH⃗^α exp( −(1/2) Σ_{i,α,β} H_i^α (q^{−1})_{αβ} H_i^β ),  (A7)

and we have used the fact that all manifolds contribute the same factor.

We proceed by making the replica symmetric ansatz on the order parameter q_{αβ} at its saddle point, q_{αβ} = (1 − q)δ_{αβ} + q, from which one obtains, in the n → 0 limit,

(q^{−1})_{αβ} = δ_{αβ}/(1 − q) − q/(1 − q)²  (A8)

and

log det q = n log(1 − q) + nq/(1 − q).  (A9)

Thus, the exponential term in X can be written as

exp( −(1/2) Σ_{αi} (H_i^α)²/(1 − q) + (1/2) Σ_i [ (√q/(1 − q)) Σ_α H_i^α ]² ).  (A10)

Using the Hubbard-Stratonovich transformation, we obtain

X = ∫ DT⃗ [ ∫ DH⃗ exp( −H⃗²/[2(1 − q)] + √q H⃗ · T⃗/(1 − q) ) ]^n,  (A11)

where DT⃗ = Π_i (dT_i/√(2π)) exp(−T_i²/2). Completing the square in the exponential and using ∫ DT⃗ A^n = exp( n ∫ DT⃗ log A ) in the n → 0 limit, we obtain X = exp( nq(D + 1)/[2(1 − q)] + n ∫ DT⃗ log z(T⃗) ), with

z(T⃗) = ∫ DH⃗ exp( −‖H⃗ − √q T⃗‖²/[2(1 − q)] ).  (A12)

Combining these terms, we write the last factor in Eq. (A6) as exp(nPG₁), where

G₁ = ∫ DT⃗ log z(T⃗) − [(D + 1)/2] log(1 − q).  (A13)

The first factors in ⟨Z^n⟩, Eq. (A6), can be written as exp(nNG₀), where, as in the Gardner theory, the entropic term in the thermodynamic limit is

G₀(q) = (1/2) log(1 − q) + q/[2(1 − q)]  (A14)

and represents the constraints on the volume of w_α due to the normalization and the order parameter q. Combining the G₀ and G₁ contributions, we have

⟨Z^n⟩ = e^{Nn[G₀(q) + αG₁(q)]}.  (A15)

The classification constraints contribute αG₁, with Eq. (A13), and
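Both replica-symmetric identities can be checked at finite n, where the matrix q_{αβ} = (1 − q)δ_{αβ} + q has the exact inverse δ_{αβ}/(1 − q) − q/[(1 − q)(1 − q + nq)] and log det q = (n − 1)log(1 − q) + log(1 − q + nq); expanding to first order in n recovers Eqs. (A8) and (A9). A quick numerical confirmation (our own check, not from the paper):

```python
import numpy as np

n, q = 5, 0.3
Q = (1 - q) * np.eye(n) + q * np.ones((n, n))  # replica-symmetric overlap matrix

# Finite-n inverse (Sherman-Morrison); its n -> 0 limit is Eq. (A8).
Q_inv = np.eye(n) / (1 - q) - q * np.ones((n, n)) / ((1 - q) * (1 - q + n * q))
assert np.allclose(Q_inv, np.linalg.inv(Q))

# Finite-n log-determinant; to first order in n this is Eq. (A9).
logdet = (n - 1) * np.log(1 - q) + np.log(1 - q + n * q)
assert np.isclose(logdet, np.linalg.slogdet(Q)[1])
print("replica-symmetric identities verified for n =", n)
```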
z(T⃗) = ∫ Π_{i=1}^{D+1} [ dY_i/√(2π(1 − q)) ] exp( −Y⃗²/[2(1 − q)] ) Θ( g_S(√q T⃗ + Y⃗) − κ ),  (A16)

where we have written the fields H_i as

H_i = √q T_i + Y_i.  (A17)

Note that √q T_i represents the quenched random component due to the randomness in the u_i^μ, and Y_i is the "thermal" component due to the variability within the solution space. The order parameter q is calculated via 0 = ∂G₀/∂q + α ∂G₁/∂q.

Capacity:—In the limit where α → α_M(κ), the overlap between the solutions becomes unity and the volume shrinks to zero. It is convenient to define Q = q/(1 − q) and study the limit of Q → ∞. In this limit, the leading order is

⟨log Z⟩ = (Q/2)[1 − α⟨F(T⃗)⟩_{T⃗}],  (A18)

where the first term is the contribution from G₀ → Q/2. The second term comes from αG₁ → −(Q/2)α⟨F(T⃗)⟩_{T⃗}, where the average is over the Gaussian distribution of the (D + 1)-dimensional vector T⃗, and

F(T⃗) → −(2/Q) log z(T⃗)  (A19)

is independent of Q and is given by replacing the integrals in Eq. (A16) by their saddle point, which yields

F(T⃗) = min_{V⃗} { ‖V⃗ − T⃗‖² | g_S(V⃗) − κ ≥ 0 }.  (A20)

At the capacity, log Z vanishes, and the capacity of a general manifold with margin κ is given by

α_M^{−1}(κ) = ⟨F(T⃗)⟩_{T⃗},  (A21)

F(T⃗) = min_{V⃗} { ‖V⃗ − T⃗‖² | g_S(V⃗) − κ ≥ 0 },  (A22)

reproducing the results of Ref. [15].

Finally, we note that the mean squared "annealed" variability in the fields due to the entropy of solutions vanishes at the capacity limit, as 1/Q; see Eq. (A16). Thus, the quantity ‖V⃗ − T⃗‖² in the above equation represents the annealed variability times Q, which remains finite in the limit of Q → ∞.
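For the special case of an l₂ ball of radius R (treated in Appendix B), the minimization in Eq. (A22) is explicit: the feasible set {V⃗ = (v⃗, v₀): v₀ − R‖v⃗‖ ≥ κ} is a second-order cone shifted by κ, and projecting T⃗ onto it yields the interior, touching, and fully supporting regimes. The Monte Carlo sketch below (parametrization, seed, and sample sizes are our own choices) estimates α_M^{−1}(κ) = ⟨F(T⃗)⟩_{T⃗} from Eq. (A21):

```python
import numpy as np

def F_ball(t_vec, t0, R, kappa=0.0):
    # Eq. (A22) for an l2 ball of radius R: the feasible set
    # {(v, v0): v0 - R*||v|| >= kappa} is a shifted second-order cone,
    # so the squared distance from T = (t_vec, t0) is piecewise explicit.
    rho = np.linalg.norm(t_vec)
    w0 = t0 - kappa
    if w0 >= R * rho:                    # interior: constraint already satisfied
        return 0.0
    if rho + R * w0 > 0:                 # touching: project onto the cone surface
        return (R * rho - w0) ** 2 / (1.0 + R ** 2)
    return rho ** 2 + w0 ** 2            # fully supporting: project onto (v=0, v0=kappa)

# alpha_M^{-1}(kappa) = <F(T)> over the (D+1)-dimensional Gaussian T, Eq. (A21)
rng = np.random.default_rng(2)
D, R, kappa, n = 20, 0.5, 0.0, 50000
t, t0 = rng.standard_normal((n, D)), rng.standard_normal(n)
inv_alpha = np.mean([F_ball(tv, s, R, kappa) for tv, s in zip(t, t0)])
print(1.0 / inv_alpha)  # Monte Carlo estimate of the ball capacity
```

The touching branch reproduces the first integrand of Eq. (B2) below, and the fully supporting branch the second; the condition separating them is t₀ − κ ≷ −‖t⃗‖/R, i.e., t_fs(t⃗) = −R^{−1}‖t⃗‖.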
APPENDIX B: STRICTLY CONVEX MANIFOLDS

1. General

Here, we evaluate the capacity of strictly convex manifolds, starting from the expression for general manifolds, Eq. (20). In a strictly convex manifold S, any point in the line segment connecting two points x⃗ and y⃗, x⃗, y⃗ ∈ S, other than x⃗ and y⃗, belongs to the interior of S. Thus, the boundary of the manifold does not contain edges or flats with spanning dimension k > 1. Indeed, the only possible spanning dimensions for the entire manifold are k = 0, 1, and D + 1. Therefore, there are exactly two contributions to the inverse capacity. When t₀ obeys t_touch(t⃗) > t₀ − κ > t_fs(t⃗), the integrand of Eq. (20) contributes [−t⃗ · s̃(T⃗) − t₀ + κ]²/(1 + ‖s̃(T⃗)‖²). When t₀ < κ + t_fs(t⃗), the manifold is fully embedded. In this case, v⃗ = 0, and the integrand reduces to Eq. (26). In summary, the capacity for strictly convex manifolds can be written as

α^{−1}(κ) = ∫ Dt⃗ ∫_{κ+t_fs(t⃗)}^{κ+t_touch(t⃗)} Dt₀ [−t⃗ · s̃(t⃗, t₀) − t₀ + κ]²/(1 + ‖s̃(t⃗, t₀)‖²) + ∫ Dt⃗ ∫_{−∞}^{κ+t_fs(t⃗)} Dt₀ [(t₀ − κ)² + ‖t⃗‖²],  (B1)

where t_touch(t⃗) and t_fs(t⃗) are given by Eqs. (22) and (25), respectively, and s̃(t⃗, t₀) = arg min_{s⃗} (v⃗ · s⃗), with v⃗ = t⃗ + (v₀ − t₀)s̃.

2. l₂ balls

In the case of l₂ balls with dimension D and radius R, g(v⃗) = −R‖v⃗‖. Hence, t_touch(t⃗) = R‖t⃗‖ and t_fs(t⃗) = −R^{−1}‖t⃗‖, and Eq. (B1) reduces to the capacity of balls,

α_Ball^{−1} = ∫₀^∞ dt χ_D(t) ∫_{κ−tR^{−1}}^{κ+tR} Dt₀ (−t₀ + tR + κ)²/(1 + R²) + ∫₀^∞ dt χ_D(t) ∫_{−∞}^{κ−tR^{−1}} Dt₀ [(t₀ − κ)² + t²],  (B2)

where

χ_D(t) = [2^{1−D/2}/Γ(D/2)] t^{D−1} e^{−t²/2},  t ≥ 0,  (B3)

is the D-dimensional Chi probability density function, reproducing the results of Ref. [15]. Furthermore, Eq. (B2) can be approximated by the capacity of points with a margin increase of R√D, i.e., α_Ball(κ, R, D) ≈ (1 + R²)α₀(κ + R√D) (details are given in Ref. [15]).
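Equation (B2) can also be evaluated by direct quadrature. The sketch below (truncation points and grid sizes are arbitrary choices of ours) implements the double integral and checks that χ_D integrates to one:

```python
import numpy as np
from math import lgamma, pi, sqrt

def chi_pdf(t, D):
    # D-dimensional Chi density, Eq. (B3), for t > 0
    t = np.asarray(t, dtype=float)
    return np.exp((1 - D / 2) * np.log(2.0) + (D - 1) * np.log(t) - t ** 2 / 2 - lgamma(D / 2))

def gauss(x):
    return np.exp(-x ** 2 / 2) / sqrt(2 * pi)

def trap(y, x):
    # trapezoidal rule on a fixed grid
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def inv_alpha_ball(kappa, R, D):
    # Quadrature of Eq. (B2); the Gaussian measure Dt0 is cut at -12,
    # where its remaining mass is negligible.
    tg = np.linspace(1e-6, 12.0, 2000)
    out = np.zeros_like(tg)
    for k, t in enumerate(tg):
        lo = max(kappa - t / R, -12.0)
        t0 = np.linspace(lo, kappa + t * R, 400)
        touch = trap((-t0 + t * R + kappa) ** 2 * gauss(t0), t0) / (1 + R ** 2)
        full = 0.0
        if kappa - t / R > -12.0:
            t0f = np.linspace(-12.0, kappa - t / R, 400)
            full = trap(((t0f - kappa) ** 2 + t ** 2) * gauss(t0f), t0f)
        out[k] = chi_pdf(t, D) * (touch + full)
    return trap(out, tg)

tg = np.linspace(1e-6, 12, 4000)
print(trap(chi_pdf(tg, 5), tg))          # ~1.0: chi_D is normalized
print(1 / inv_alpha_ball(0.0, 0.5, 20))  # ball capacity at kappa=0, R=0.5, D=20
```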
3. l₂ ellipsoids

a. Anchor points and support regimes

With ellipsoids, Eq. (35), the support function in Eq. (15) can be computed explicitly as follows. For a vector
V⃗ = (v⃗, v₀), with nonzero v⃗, the support function g(v⃗) is minimized by a vector s⃗ that occurs on the boundary of the ellipsoid; i.e., it obeys the equality constraint f(s⃗) = 0, with f(s⃗) = Σ_{i=1}^{D} (s_i/R_i)² − 1, Eq. (35). To evaluate g, we …

Fully supporting regime:—When t₀ − κ < t_fs, we have v⃗ = 0, v₀ = κ, and λ = t₀ − κ, implying that the center, as well as the entire ellipsoid, is fully supporting the max margin solution. In this case, the anchor point is antiparallel to t⃗ at the interior point, Eq. (24), and its contribution to the capacity is as in Eq. (26).

…

α_Ball^{−1}(κ, R, D) ≈ ∫_{−∞}^{κ+R√D} Dt₀ (κ + R√D − t₀)²/(1 + R²) + ∫_{−∞}^{κ−√D/R} Dt₀ [(t₀ − κ)² + D].  (D1)

As long as R ≪ √D, the second term in Eq. (D1) vanishes and yields
α_Ball^{−1}(κ, R, D) = ∫_{−∞}^{κ+R√D} Dt₀ (R√D + κ − t₀)²/(R² + 1).  (D2)

Here, we note that the t₀ term in the numerator is significant only if R = O(D^{−1/2}) or smaller, in which case the denominator is just 1. When R is of order 1, the t₀ term in the integrand is negligible. Hence, in both cases, we can write

α_Ball^{−1}(κ, R, D) ≈ α₀^{−1}( (κ + R√D)/√(1 + R²) ).  (D3)

In this form, the intuition behind the factor √(1 + R²) is clear, stemming from the fact that the distance of a point from the margin plane scales with its norm; hence, the margin entering the capacity should factor out this norm. As stated above, Eq. (D3) implies a finite capacity only in the scaling regime, where R√D = O(1). If, on the other hand, R√D ≫ 1, Eq. (D2) implies

α_Ball^{−1} = R²D/(1 + R²)  (D4)

[where we have used the asymptote α₀^{−1}(x) → x² for large x], reducing to α^{−1} = D for large R.

The analysis above also highlights the support structure of the balls at large D. As long as R ≪ √D, the fraction of balls that lie fully in the margin plane is negligible, as implied by the fact that t_fs ≈ −√D/R → −∞. The overall fraction of interior balls is H(κ + R√D), whereas the fraction that touch the margin planes is 1 − H(κ + R√D). Despite the fact that there are no fully supporting balls, the touching balls are almost parallel to the margin planes if R ≫ 1; hence, the capacity reaches its lower bound.

Finally, the large manifold limit discussed in Appendix C is realized for R ≫ √D. Here, t_fs ≈ 0, and the system is either touching or fully supporting with probabilities H(κ) and H(−κ), respectively.

…

Also, t_touch ≈ t⃗ · s̃ ≈ κ_M; hence, we obtain for the capacity

α_M^{−1}(κ) ≈ ⟨[κ_M + κ − t₀]₊²⟩_{t₀}/(R_M² + 1) ≈ α_Ball^{−1}(κ, R_M, D_M),  (D6)

where the average is with respect to the Gaussian t₀. Evaluating R_M and D_M involves calculations of the self-consistent statistics of the anchor points. This calculation is simplified in high dimensions. In particular, λ, Eq. (17), reduces to

λ ≈ ⟨[κ_M + κ − t₀]₊⟩_{t₀}/(1 + R_M²).  (D7)

Hence, it is approximately constant, independent of T⃗. In deriving the above approximations, we used self-averaging in summations involving the D intrinsic coordinates. The full dependence on the longitudinal Gaussian t₀ should remain. Thus, R_M and D_M should, in fact, be substituted by R_M(t₀) and D_M(t₀), denoting Eqs. (28) and (29), with averaging only over t⃗. This would yield more complicated expressions than Eqs. (D6) and (D7). The reason why we can replace them by average quantities is the following: The potential dependence of the anchor radius and dimension on t₀ is via λ(t₀). However, inspecting Eq. (D7), we note two scenarios. In the first one, R_M is small and κ_M is of order 1. In this case, because the manifold radius is small, the contribution λs̃ is small and can be neglected. This is the same argument why, in this case, the geometry can be replaced by the Gaussian geometry, which does not depend on t₀ or κ. The second scenario is that R_M is of order 1 and κ_M ≫ 1, in which case the order-1 contribution from t₀ is negligible.

APPENDIX E: CAPACITY OF l₂ BALLS WITH SPARSE LABELS

First, we note that the capacity of sparsely labeled points is α₀(f, κ) = max_b α₀(f, κ, b), where

α₀^{−1}(f, κ, b) = f ∫_{−∞}^{κ+b} Dt (−t + κ + b)² + (1 − f) ∫_{−∞}^{κ−b} Dt (−t + κ − b)².  (E1)

In the limit of f → 0, b ≫ 1, and the first equation reduces to

α₀^{−1} ≈ fb² + exp(−b²/2),  (E3)

yielding the optimal b ≈ √(2|log f|). With this b, the inverse capacity is dominated by the first term, which is α₀^{−1} ≈ 2f|log f|.
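Equation (E1) is easy to minimize over the bias b numerically. The sketch below (grids and the value of f are our choices) does so and compares with the f → 0 asymptotics; note that b ≈ √(2|log f|) and α₀^{−1} ≈ 2f|log f| are approached only logarithmically, so at moderate sparsity the agreement is rough:

```python
import numpy as np

def trap(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def gauss(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def inv_alpha0(f, kappa, b):
    # Eq. (E1): inverse capacity of sparsely labeled points at bias b
    t1 = np.linspace(-14.0, kappa + b, 4000)
    t2 = np.linspace(-14.0, kappa - b, 4000)
    return (f * trap((-t1 + kappa + b) ** 2 * gauss(t1), t1)
            + (1 - f) * trap((-t2 + kappa - b) ** 2 * gauss(t2), t2))

f = 1e-3
bs = np.linspace(0.0, 6.0, 600)
vals = np.array([inv_alpha0(f, 0.0, b) for b in bs])
b_star = bs[np.argmin(vals)]
print(b_star, np.sqrt(2 * abs(np.log(f))))  # optimal bias vs. its f -> 0 asymptote
print(vals.min(), 2 * f * abs(np.log(f)))   # minimal inverse capacity vs. 2 f |log f|
```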
The capacity for l₂ balls of radius R with sparse labels of sparsity f [Eqs. (53) and (B2)] is given by

α_Ball^{−1} = f ∫₀^∞ dt χ_D(t) [ ∫_{κ+b−tR^{−1}}^{κ+b+tR} Dt₀ (−t₀ + tR + b + κ)²/(1 + R²) + ∫_{−∞}^{b+κ−tR^{−1}} Dt₀ [(t₀ − b − κ)² + t²] ] + (1 − f) ∫₀^∞ dt χ_D(t) [ ∫_{κ−b−tR^{−1}}^{κ−b+tR} Dt₀ (−t₀ + tR + κ − b)²/(1 + R²) + ∫_{−∞}^{−b+κ−tR^{−1}} Dt₀ [(t₀ + b − κ)² + t²] ].

… in which case the integral over t₀ is ∫_{−R(b̄−t)}^{b} Dt₀ ≈ 1, and the integral over t is from b̄ to ∞. Combining the two results yields the following simple expression for the inverse capacity:

α_Ball^{−1} = f̄ b̄² + ∫_{b̄}^∞ dt χ_D(t) (t − b̄)²,  (E10)

where the scaled sparsity is f̄ = fR². The optimal scaled bias is given by setting the derivative of Eq. (E10) with respect to b̄ to zero,

f̄ b̄ = ∫_{b̄}^∞ dt χ_D(t) (t − b̄).
Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition, PLoS Comput. Biol. 10, e1003963 (2014).
[11] T. M. Cover, Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, IEEE Trans. Electron. Comput. EC-14, 326 (1965).
[12] E. Gardner, The Space of Interactions in Neural Network Models, J. Phys. A 21, 257 (1988).
[13] E. Gardner, Maximum Storage Capacity in Neural Networks, Europhys. Lett. 4, 481 (1987).
[14] V. Vapnik, Statistical Learning Theory (Wiley, New York, 1998).
[15] S. Chung, D. D. Lee, and H. Sompolinsky, Linear Readout of Object Manifolds, Phys. Rev. E 93, 060301 (2016).
[16] See Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevX.8.031003 for further details of the theory and of the numerics, such as the algorithms and the simulation parameters.
[17] S. Boyd and L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, England, 2004).
[18] F. Gerl and U. Krey, Storage Capacity and Optimal Learning of Potts-Model Perceptrons by a Cavity Method, J. Phys. A 27, 7353 (1994).
[19] W. Kinzel and M. Opper, Dynamics of Learning, in Models of Neural Networks (Springer, 1991), pp. 149–171.
[20] R. T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, NJ, 2015).
[21] J.-J. Moreau, Décomposition Orthogonale d'un Espace Hilbertien selon Deux Cônes Mutuellement Polaires, C. R. Hebd. Seances Acad. Sci. 255, 238 (1962).
[22] S. Chung, U. Cohen, H. Sompolinsky, and D. D. Lee, Learning Data Manifolds with a Cutting Plane Method, Neural Comput., DOI: 10.1162/neco_a_01119 (2018).
[23] B. Grünbaum and G. C. Shephard, Convex Polytopes, Bull. Lond. Math. Soc. 1, 257 (1969).
[24] A. A. Giannopoulos, V. D. Milman, and M. Rudelson, Convex Bodies with Minimal Mean Width, in Geometric Aspects of Functional Analysis (Springer, New York, 2000), pp. 81–93.
[25] R. Vershynin, Estimation in High Dimensions: A Geometric Perspective, in Sampling Theory, a Renaissance (Springer, New York, 2015), pp. 3–66.
[26] K. Rajan, L. F. Abbott, and H. Sompolinsky, Stimulus-Dependent Suppression of Chaos in Recurrent Neural Networks, Phys. Rev. E 82, 011903 (2010).
[27] A. Litwin-Kumar, K. D. Harris, R. Axel, H. Sompolinsky, and L. F. Abbott, Optimal Degrees of Synaptic Connectivity, Neuron 93, 1153 (2017).
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, in 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2009), pp. 248–255.
[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going Deeper with Convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE Computer Society, 2015), pp. 1–9.
[30] S.-H. Kye, Faces for Two-Qubit Separable States and the Convex Hulls of Trigonometric Moment Curves, arXiv:1302.2226.
[31] Z. Smilansky, Convex Hulls of Generalized Moment Curves, Isr. J. Math. 52, 115 (1985).
[32] R. Monasson, Properties of Neural Networks Storing Spatially Correlated Patterns, J. Phys. A 25, 3701 (1992).
[33] B. López, M. Schröder, and M. Opper, Storage of Correlated Patterns in a Perceptron, J. Phys. A 28, L447 (1995).
[34] S. T. Roweis and L. K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science 290, 2323 (2000).
[35] J. B. Tenenbaum, V. de Silva, and J. C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 290, 2319 (2000).
[36] S.-i. Amari, N. Fujita, and S. Shinomoto, Four Types of Learning Curves, Neural Comput. 4, 605 (1992).
[37] S. Risau-Gusman and M. B. Gordon, Statistical Mechanics of Learning with Soft Margin Classifiers, Phys. Rev. E 64, 031907 (2001).