
Deep learning: a statistical viewpoint

Peter L. Bartlett∗          Andrea Montanari†          Alexander Rakhlin‡
peter@berkeley.edu        montanar@stanford.edu        rakhlin@mit.edu

∗Departments of Statistics and EECS, UC Berkeley
†Departments of EE and Statistics, Stanford University
‡Department of Brain & Cognitive Sciences and Statistics & Data Science Center, MIT

March 17, 2021



Abstract
The remarkable practical success of deep learning has revealed some major surprises from a theoreti-
cal perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex
optimization problems, and despite giving a near-perfect fit to training data without any explicit effort
to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that
specific principles underlie these phenomena: that overparametrization allows gradient methods to find
interpolating solutions, that these methods implicitly impose regularization, and that overparametriza-
tion leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this
article, we survey recent progress in statistical learning theory that provides examples illustrating these
principles in simpler settings. We first review classical uniform convergence results and why they fall
short of explaining aspects of the behavior of deep learning methods. We give examples of implicit
regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly
fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on
regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a
simple component that is useful for prediction and a spiky component that is useful for overfitting but,
in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for
neural networks, where the network can be approximated by a linear model. In this regime, we demon-
strate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an
exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude
by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

Contents

1 Introduction
  1.1 Overview

2 Generalization and uniform convergence
  2.1 Preliminaries
  2.2 Uniform laws of large numbers
  2.3 Faster rates
  2.4 Complexity regularization
  2.5 Computational complexity of empirical risk minimization
  2.6 Classification
  2.7 Large margin classification
  2.8 Real prediction
  2.9 The mismatch between benign overfitting and uniform convergence

3 Implicit regularization

4 Benign overfitting
  4.1 Local methods: Nadaraya-Watson
  4.2 Linear regression in the interpolating regime
  4.3 Linear regression in Reproducing Kernel Hilbert Spaces
    4.3.1 The Laplace kernel with constant dimension
    4.3.2 Kernels on R^d with d ≍ n^α
    4.3.3 Kernels on R^d with d ≍ n

5 Efficient optimization
  5.1 The linear regime
  5.2 Beyond the linear regime?
  5.3 Other approaches

6 Generalization in the linear regime
  6.1 The implicit regularization of gradient-based training
  6.2 Ridge regression in the linear regime
  6.3 Random features model
    6.3.1 Polynomial scaling
    6.3.2 Proportional scaling
  6.4 Neural tangent model

7 Conclusions and future directions

A Kernels on R^d with d ≍ n
  A.1 Bound on the variance of the minimum-norm interpolant
  A.2 Exact characterization in the proportional asymptotics
    A.2.1 Preliminaries
    A.2.2 An estimate on the entries of the resolvent
    A.2.3 Proof of Theorem 4.13: Variance term
    A.2.4 Proof of Theorem 4.13: Bias term
    A.2.5 Consequences: Proof of Corollary 4.14

B Optimization in the linear regime

1 Introduction
The past decade has witnessed dramatic advances in machine learning that have led to major breakthroughs
in computer vision, speech recognition, and robotics. These achievements are based on a powerful and diverse
toolbox of techniques and algorithms that now bears the name ‘deep learning’; see, for example, [GBC16].
Deep learning has evolved from the decades-old methodology of neural networks: circuits of parametrized
nonlinear functions, trained by gradient-based methods. Practitioners have made major architectural and
algorithmic innovations, and have exploited technological advances, such as increased computing power,
distributed computing architectures, and the availability of large amounts of digitized data. The 2018
Turing Award celebrated these advances, a reflection of their enormous impact [LBH15].
Broadly interpreted, deep learning can be viewed as a family of highly nonlinear statistical models that
are able to encode highly nontrivial representations of data. A prototypical example is a feed-forward neural
network with $L$ layers, which is a parametrized family of functions $x \mapsto f(x;\theta)$ defined on $\mathbb{R}^d$ by
$$ f(x;\theta) := \sigma_L\bigl(W_L\,\sigma_{L-1}(W_{L-1}\cdots\sigma_1(W_1 x)\cdots)\bigr), \tag{1} $$

where the parameters are $\theta = (W_1, \dots, W_L)$ with $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ and $d_0 = d$, and $\sigma_l : \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$ are fixed
nonlinearities, called activation functions. Given a training sample $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}^{d_L}$, the
parameters $\theta$ are typically chosen by a gradient method to minimize the empirical risk,
$$ \widehat{L}(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i;\theta), y_i), $$

where ` is a suitable loss function. The aim is to ensure that this model generalizes well, in the sense that
f (x; θ) is an accurate prediction of y on a subsequent (x, y) pair. It is important to emphasize that deep
learning is a data-driven approach: these are rich but generic models, and the architecture, parametrization
and nonlinearities are typically chosen without reference to a specific model for the process generating the
data.
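As an illustration of this setup (not an example from the paper), the following minimal Python sketch implements a two-layer instance of (1) with a ReLU activation and a linear output layer, and minimizes the empirical risk $\widehat{L}(\theta)$ under the squared loss by full-batch gradient descent; the data, architecture and step size are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer network f(x; theta) = W2 @ relu(W1 @ x), cf. equation (1)
# (linear output layer for simplicity).
d, m, n = 10, 200, 100                        # input dim, hidden width, sample size
W1 = rng.normal(size=(m, d)) / np.sqrt(d)     # first-layer weights
W2 = rng.normal(size=(1, m)) / np.sqrt(m)     # second-layer weights

# Synthetic data: y = <w*, x> + noise (purely for illustration)
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f(X, W1, W2):
    return (W2 @ np.maximum(W1 @ X.T, 0.0)).ravel()

def empirical_risk(W1, W2):
    return np.mean((f(X, W1, W2) - y) ** 2)   # hat L(theta) with squared loss

# Full-batch gradient descent on hat L(theta)
lr = 5e-3
for t in range(3000):
    H = np.maximum(W1 @ X.T, 0.0)             # hidden activations, shape (m, n)
    r = (W2 @ H).ravel() - y                  # residuals
    gW2 = 2.0 / n * (r[None, :] @ H.T)        # gradient wrt W2
    gH = (W2.T @ r[None, :]) * (H > 0)        # backprop through the ReLU
    gW1 = 2.0 / n * (gH @ X)                  # gradient wrt W1
    W1 -= lr * gW1
    W2 -= lr * gW2

print("final empirical risk:", empirical_risk(W1, W2))
```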
While deep learning has been hugely successful in the hands of practitioners, there are significant gaps
in our understanding of what makes these methods successful. Indeed, deep learning reveals some major
surprises from a theoretical perspective: deep learning methods can find near-optimal solutions to highly
non-convex empirical risk minimization problems, solutions that give a near-perfect fit to noisy training data,
but despite making no explicit effort to control model complexity, these methods lead to excellent prediction
performance in practice.
To put these properties in perspective, it is helpful to recall the three competing goals that statistical
prediction methods must balance: they require expressivity, to allow the richness of real data to be effectively
modelled; they must control statistical complexity, to make the best use of limited training data; and they
must be computationally efficient. The classical approach to managing this trade-off involves a rich, high-
dimensional model, combined with some kind of regularization, which encourages simple models but allows
more complexity if that is warranted by the data. In particular, complexity is controlled so that performance
on the training data, that is, the empirical risk, is representative of performance on independent test data,
specifically so that the function class is simple enough that sample averages $\widehat{L}(\theta)$ converge to expectations
$L(\theta) := \mathbb{E}\,\ell(f(x;\theta), y)$ uniformly across the function class. And prediction methods are typically formulated
as convex optimization problems—for example with a convex loss ` and parameters θ that enter linearly—
which can be solved efficiently.
The deep learning revolution built on two surprising empirical discoveries that are suggestive of radically
different ways of managing these trade-offs. First, deep learning exploits rich and expressive models, with
many parameters, and the problem of optimizing the fit to the training data appears to simplify dramatically
when the function class is rich enough, that is, when it is sufficiently overparametrized. In this regime,
simple, local optimization approaches, variants of stochastic gradient methods, are extraordinarily successful
at finding near-optimal fits to training data, even though the nonlinear parametrization—see equation (1)—
implies that the optimization problems that these simple methods solve are notoriously non-convex. A
posteriori, the idea that overparametrization could lead to tractability might seem natural, but it would
have seemed completely foolish from the point of view of classical learning theory: the resulting models are
outside the realm of uniform convergence, and therefore should not be expected to generalize well.
The second surprising empirical discovery was that these models are indeed outside the realm of uni-
form convergence. They are enormously complex, with many parameters, they are trained with no explicit
regularization to control their statistical complexity, and they typically exhibit a near-perfect fit to noisy
training data, that is, empirical risk close to zero. Nonetheless this overfitting is benign, in that they produce
excellent prediction performance in a number of settings. Benign overfitting appears to contradict accepted
statistical wisdom, which insists on a trade-off between the complexity of a model and its fit to the data.
Indeed, the rule of thumb that models fitting noisy data too well will not generalize is found in most classical
texts on statistics and machine learning [FHT01, Was13]. This viewpoint has become so prevalent that the
word ‘overfitting’ is often taken to mean both fitting data better than should be expected and also giving
poor predictive accuracy as a consequence. In this paper, we use the literal meaning of the word ‘overfitting’;
deep learning practice has demonstrated that poor predictive accuracy is not an inevitable consequence.
This paper reviews some initial steps towards understanding these two surprising aspects of the success
of deep learning. We have two working hypotheses:

Tractability via overparametrization. Classically, tractable statistical learning is achieved by restricting
to linearly parametrized classes of functions and convex objectives. A fundamentally new principle
appears to be at work in deep learning. Although the objective is highly non-convex, we conjecture
that the hardness of the optimization problem depends on the relationship between the dimension of the
parameter space (the number of optimization variables) and the sample size (which, when we aim for
a near-perfect fit to training data, we can think of as the number of constraints), that is, tractability
is achieved if and only if we choose a model that is sufficiently under-constrained or, equivalently,
overparametrized.
Generalization via implicit regularization. Even if overparametrized models simplify the optimization
task, classically we would have believed that good generalization properties would be restricted to
either an underparametrized regime or a suitably regularized regime. Statistical wisdom suggests
that a method that takes advantage of too many degrees of freedom by perfectly interpolating noisy
training data will be poor at predicting new outcomes. In deep learning, training algorithms appear
to induce a bias that breaks the equivalence among all the models that interpolate the observed data.
Because these models interpolate noisy data, the classical statistical perspective would suggest that
this bias cannot provide sufficient regularization to give good generalization, but in practice it does.
We conjecture that deep learning models can be decomposed into a low-complexity component for
which classical uniform convergence occurs and a high-complexity component that enables a perfect
fit to training data, and if the model is suitably overparameterized, this perfect fit does not have a
significant impact on prediction accuracy.
As we shall see, both of these hypotheses are supported by results in specific scenarios, but there are
many intriguing open questions in extending these results to realistic deep learning settings.
It is worth noting that none of the results that we review here make a case for any optimization or
generalization benefits of increasing depth in deep learning. Although it is not the focus here, another
important aspect of deep learning concerns how deep neural networks can effectively and parsimoniously
express natural functions that are well matched to the data that arise in practice. It seems likely that depth
is crucial for these issues of expressivity.

1.1 Overview
Section 2 starts by reviewing some results from classical statistical learning theory that are relevant to
the problem of prediction with deep neural networks. It describes an explicit probabilistic formulation of
prediction problems. Consistent with the data-driven perspective of deep learning, this formulation assumes
little more than that the (x, y) pairs are sampled independently from a fixed probability distribution. We
explain the role played by uniform bounds on deviations between risk and empirical risk,

$$ \sup_{f \in \mathcal{F}} \bigl| L(f) - \widehat{L}(f) \bigr|, $$

in the analysis of the generalization question for functions chosen from a class F. We show how a partition
of a rich function class F into a complexity hierarchy allows regularization methods that balance the statis-
tical complexity and the empirical risk to enjoy the best bounds on generalization implied by the uniform
convergence results. We consider consequences of these results for general pattern classification problems,
for easier “large margin” classification problems and for regression problems, and we give some specific ex-
amples of risk bounds for feed-forward networks. Finally, we consider the implications of these results for
benign overfitting: If an algorithm chooses an interpolating function to minimize some notion of complexity,
what do the uniform convergence results imply about its performance? We see that there are very specific
barriers to analysis of this kind in the overfitting regime; an analysis of benign overfitting must make stronger
assumptions about the process that generates the data.
In Section 3, we review results on the implicit regularization that is imposed by the algorithmic approach
ubiquitous in deep learning: gradient methods. We see examples of function classes and loss functions where

gradient methods, suitably initialized, return the empirical risk minimizers that minimize certain parameter
norms. While all of these examples involve parameterizations of linear functions with convex losses, we shall
see in Section 5 that this linear/convex viewpoint can be important for nonconvex optimization problems
that arise in neural network settings.
Section 4 reviews analyses of benign overfitting. We consider extreme cases of overfitting, where the
prediction rule gives a perfect interpolating fit to noisy data. In all the cases that we review where this
gives good predictive accuracy, we can view the prediction rule as a linear combination of two components:
$\hat f = \hat f_0 + \Delta$. The first, $\hat f_0$, is a simple component that is useful for prediction, and the second, $\Delta$, is a spiky
component that is useful for overfitting. Classical statistical theory explains the good predictive accuracy
of the simple component. The other component is not useful for prediction, but equally it is not harmful
for prediction. The first example we consider is the classical Nadaraya-Watson kernel smoothing method
with somewhat strange, singular kernels, which lead to an interpolating solution that, for a suitable choice
of the kernel bandwidth, enjoys minimax estimation rates. In this case, we can view $\hat f_0$ as the prediction
of a standard kernel smoothing method and ∆ as a spiky component that is harmless for prediction but
allows interpolation. The other examples we consider are for high-dimensional linear regression. Here,
‘linear’ means linearly parameterized, which of course allows for the richness of highly nonlinear features, for
instance the infinite dimensional feature vectors that arise in reproducing kernel Hilbert spaces (RKHSs).
Motivated by the results of Section 3, we study the behavior of the minimum norm interpolating linear
function. We see that it can be decomposed into a prediction component and an overfitting component,
with the split determined by the eigenvalues of the data covariance matrix. The prediction component
corresponds to a high-variance subspace and the overfitting component to the orthogonal, low-variance
subspace. For sub-Gaussian features, benign overfitting occurs if and only if the high-variance subspace
is low-dimensional (that is, the prediction component is simple enough for the corresponding subspace of
functions to exhibit uniform convergence) and the low-variance subspace has high effective dimension and
suitably low energy. In that case, we see a self-induced regularization: the projection of the data on the
low-variance subspace is well-conditioned, just as it would be if a certain level of statistical regularization
were imposed, so that even though this subspace allows interpolation, it does not significantly deteriorate the
predictive accuracy. (Notice that this self-induced regularization is a consequence of the decay of eigenvalues
of the covariance matrix, and should not be confused with the implicit regularization, which is a consequence
of the gradient optimization method and leads to the minimum norm interpolant.) Using direct arguments
that avoid the sub-Gaussian assumption, we see similar behavior of the minimum norm interpolant in certain
infinite-dimensional RKHSs, including an example of an RKHS with fixed input dimension where benign
overfitting cannot occur and examples of RKHSs where it does occur for suitably increasing input dimension,
again corresponding to decompositions into a simple subspace—in this case, a subspace of polynomials, with
dimension low enough for uniform convergence—and a complex high-dimensional orthogonal subspace that
allows benign overfitting.
In Section 5, we consider a specific regime where overparametrization allows a non-convex empirical risk
minimization problem to be solved efficiently by gradient methods: a linear regime, in which a parameterized
function can be accurately approximated by its linearization about an initial parameter vector. For a suitable
parameterization and initialization, we see that a gradient method remains in the linear regime, enjoys linear
convergence of the empirical risk, and leads to a solution whose predictions are well approximated by the
linearization at the initialization. In the case of two-layer networks, suitably large overparametrization
and initialization suffice. On the other hand, the mean-field limit for wide two-layer networks, a limit
that corresponds to a smaller—and perhaps more realistic—initialization, exhibits an essentially different
behavior, highlighting the need to extend our understanding beyond linear models.
Section 6 returns to benign overfitting, focusing on the linear regime for two specific families of two-layer
networks: a random features model, with randomly initialized first-layer parameters that remain constant
throughout training, and a neural tangent model, corresponding to the linearization about a random ini-
tialization. Again, we see decompositions into a simple subspace (of low-degree polynomials) that is useful
for prediction and a complex orthogonal subspace that allows interpolation without significantly harming
prediction accuracy.

Section 7 outlines future directions. Specifically, for the two working hypotheses of tractability via over-
parametrization and generalization via implicit regularization, this section summarizes the insights from the
examples that we have reviewed—mechanisms for implicit regularization, the role of dimension, decompo-
sitions into prediction and overfitting components, data-adaptive choices of these decompositions, and the
tractability benefits of overparameterization. It also speculates on how these might extend to realistic deep
learning settings.

2 Generalization and uniform convergence


This section reviews uniform convergence results from statistical learning theory and their implications for
prediction with rich families of functions, such as those computed by neural networks. In classical statistical
analyses, it is common to posit a specific probabilistic model for the process generating the data and to
estimate the parameters of that model; see, for example, [BD07]. In contrast, the approach in this section
is motivated by viewing neural networks as defining rich, flexible families of functions that are useful for
prediction in a broad range of settings. We make only weak assumptions about the process generating the
data, for example, that it is sampled independently from an unknown distribution, and we aim for the best
prediction accuracy.

2.1 Preliminaries
Consider a prediction problem in a probabilistic setting, where we aim to use data to find a function f
mapping from an input space X (for example, a representation of images) to an output space Y (for example,
a finite set of labels for those images). We measure the quality of the predictions that f : X → Y makes on
an (x, y) pair using the loss `(f (x), y), which represents the cost of predicting f (x) when the actual outcome
is $y$. For example, if $f(x)$ and $y$ are real-valued, we might consider the square loss, $\ell(f(x), y) = (f(x) - y)^2$.
We assume that we have access to a training sample of input-output pairs (x1 , y1 ), . . . , (xn , yn ) ∈ X × Y,
chosen independently from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. These data are used to choose $\hat f : \mathcal{X} \to \mathcal{Y}$,
and we would like $\hat f$ to give good predictions of the relationship between subsequent $(x, y)$ pairs in the sense
that the risk of $\hat f$, denoted
$$ L(\hat f) := \mathbb{E}\,\ell(\hat f(x), y), $$
is small, where $(x, y) \sim P$ and $\mathbb{E}$ denotes expectation (and if $\hat f$ is random, for instance because it is chosen
based on random training data, we use $L(\hat f)$ to denote the conditional expectation given $\hat f$). We are interested
in ensuring that the excess risk of $\hat f$,
$$ L(\hat f) - \inf_{f} L(f), $$
is close to zero, where the infimum is over all measurable functions. Notice that we assume only that $(x, y)$
pairs are independent and identically distributed; in particular, we do not assume any functional relationship
between $x$ and $y$.
Suppose that we choose $\hat f$ from a set of functions $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$. For instance, $\mathcal{F}$ might be the set of functions
computed by a deep network with a particular architecture and with particular constraints on the parameters
in the network. A natural approach to using the sample to choose $\hat f$ is to minimize the empirical risk over
the class $\mathcal{F}$. Define
$$ \hat f_{\mathrm{erm}} \in \operatorname*{argmin}_{f \in \mathcal{F}} \widehat{L}(f), \tag{2} $$
where the empirical risk,
$$ \widehat{L}(f) := \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i), $$
is the expectation of the loss under the empirical distribution defined by the sample. Often, we consider
classes of functions $x \mapsto f(x;\theta)$ parameterized by $\theta$, and we use $L(\theta)$ and $\widehat{L}(\theta)$ to denote $L(f(\cdot;\theta))$ and
$\widehat{L}(f(\cdot;\theta))$, respectively.

We can split the excess risk of the empirical risk minimizer $\hat f_{\mathrm{erm}}$ into two components,
$$ L(\hat f_{\mathrm{erm}}) - \inf_{f} L(f) = \Bigl( L(\hat f_{\mathrm{erm}}) - \inf_{f \in \mathcal{F}} L(f) \Bigr) + \Bigl( \inf_{f \in \mathcal{F}} L(f) - \inf_{f} L(f) \Bigr), \tag{3} $$

the second reflecting how well functions in the class F can approximate an optimal prediction rule and the
first reflecting the statistical cost of estimating such a prediction rule from the finite sample. For a more
complex function class F, we should expect the approximation error to decrease and the estimation error to
increase. We focus on the estimation error, and on controlling it using uniform laws of large numbers.
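To make the estimation error concrete, here is an illustrative simulation (not from the paper) with a small class of threshold classifiers under the 0-1 loss: it computes the empirical risk minimizer (2), estimates its risk on a large held-out sample, and reports the uniform deviation between risks and empirical risks over the class. All numbers and the data-generating process are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite class of threshold classifiers F = {x -> sign(x - t)} (illustrative choice)
thresholds = np.linspace(-1.0, 1.0, 201)

def risk(t, X, Y):
    return np.mean(np.sign(X - t) != Y)          # 0-1 loss

def sample(n, noise=0.1):
    X = rng.uniform(-1, 1, size=n)
    Y = np.sign(X - 0.3)
    flip = rng.uniform(size=n) < noise           # label noise
    Y[flip] *= -1
    return X, Y

Xtr, Ytr = sample(n=50)
Xte, Yte = sample(n=100_000)                     # large test set approximates the true risk

emp = np.array([risk(t, Xtr, Ytr) for t in thresholds])   # hat L(f) for each f in F
true = np.array([risk(t, Xte, Yte) for t in thresholds])  # L(f), up to Monte Carlo error

t_erm = thresholds[np.argmin(emp)]                         # empirical risk minimizer (2)
print("hat L(f_erm)      =", emp.min())
print("L(f_erm)          =", risk(t_erm, Xte, Yte))
print("inf_F L           =", true.min())
print("sup_F |L - hat L| =", np.max(np.abs(true - emp)))   # uniform deviation
```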

2.2 Uniform laws of large numbers


Without any essential loss of generality, suppose that a minimizer $f^*_{\mathcal{F}} \in \operatorname{arg\,min}_{f \in \mathcal{F}} L(f)$ exists. Then we
can split the estimation error of an empirical risk minimizer $\hat f_{\mathrm{erm}}$ defined in (2) into three components:
$$ L(\hat f_{\mathrm{erm}}) - \inf_{f \in \mathcal{F}} L(f) = L(\hat f_{\mathrm{erm}}) - L(f^*_{\mathcal{F}})
   = \Bigl[ L(\hat f_{\mathrm{erm}}) - \widehat{L}(\hat f_{\mathrm{erm}}) \Bigr] + \Bigl[ \widehat{L}(\hat f_{\mathrm{erm}}) - \widehat{L}(f^*_{\mathcal{F}}) \Bigr] + \Bigl[ \widehat{L}(f^*_{\mathcal{F}}) - L(f^*_{\mathcal{F}}) \Bigr]. \tag{4} $$

The second term cannot be positive since $\hat f_{\mathrm{erm}}$ minimizes empirical risk. The third term converges to zero
by the law of large numbers (and if the random variable $\ell(f^*_{\mathcal{F}}(x), y)$ is sub-Gaussian, then with probability
exponentially close to 1 this term is $O(n^{-1/2})$; see, for example, [BLM13, Chapter 2] and [Ver18] for the
definition of sub-Gaussian and for a review of concentration inequalities of this kind). The first term is more
interesting. Since $\hat f_{\mathrm{erm}}$ is chosen using the data, $\widehat{L}(\hat f_{\mathrm{erm}})$ is a biased estimate of $L(\hat f_{\mathrm{erm}})$, and so we cannot simply
apply a law of large numbers. One approach is to use the crude upper bound
$$ L(\hat f_{\mathrm{erm}}) - \widehat{L}(\hat f_{\mathrm{erm}}) \le \sup_{f \in \mathcal{F}} \bigl| L(f) - \widehat{L}(f) \bigr|, \tag{5} $$

and hence bound the estimation error in terms of this uniform bound. The following theorem shows that such
uniform bounds on deviations between expectations and sample averages are intimately related to a notion
of complexity of the loss class $\ell_{\mathcal{F}} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}$ known as the Rademacher complexity. For
a probability distribution $P$ on a measurable space $\mathcal{Z}$, a sample $z_1, \dots, z_n \sim P$, and a function class $\mathcal{G} \subset \mathbb{R}^{\mathcal{Z}}$,
define the Rademacher complexity of $\mathcal{G}$ as
$$ \mathcal{R}_n(\mathcal{G}) := \mathbb{E} \sup_{g \in \mathcal{G}} \biggl| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i\, g(z_i) \biggr|, $$
where $\epsilon_1, \dots, \epsilon_n \in \{\pm 1\}$ are independent and uniformly distributed.


Theorem 2.1. For any $\mathcal{G} \subset [0,1]^{\mathcal{Z}}$ and any probability distribution $P$ on $\mathcal{Z}$,
$$ \mathcal{R}_n(\mathcal{G}) - \frac{1}{2}\sqrt{\frac{\log 2}{2n}} \;\le\; \mathbb{E} \sup_{g \in \mathcal{G}} \bigl| \mathbb{E} g - \widehat{\mathbb{E}} g \bigr| \;\le\; 2 \mathcal{R}_n(\mathcal{G}), $$
where $\widehat{\mathbb{E}} g = n^{-1} \sum_{i=1}^{n} g(z_i)$ and $z_1, \dots, z_n$ are chosen i.i.d. according to $P$. Furthermore, with probability
at least $1 - 2\exp(-2\epsilon^2 n)$ over $z_1, \dots, z_n$,
$$ \mathbb{E} \sup_{g \in \mathcal{G}} \bigl| \mathbb{E} g - \widehat{\mathbb{E}} g \bigr| - \epsilon \;\le\; \sup_{g \in \mathcal{G}} \bigl| \mathbb{E} g - \widehat{\mathbb{E}} g \bigr| \;\le\; \mathbb{E} \sup_{g \in \mathcal{G}} \bigl| \mathbb{E} g - \widehat{\mathbb{E}} g \bigr| + \epsilon. $$
Thus, $\mathcal{R}_n(\mathcal{G}) \to 0$ if and only if $\sup_{g \in \mathcal{G}} \bigl| \mathbb{E} g - \widehat{\mathbb{E}} g \bigr| \xrightarrow{\mathrm{a.s.}} 0$.

See [KP00, Kol01, BBL02, BM02] and [Kol06]. This theorem shows that for bounded losses, a uniform
bound
$$ \sup_{f \in \mathcal{F}} \bigl| L(f) - \widehat{L}(f) \bigr| $$
on the maximal deviations between risks and empirical risks of any $f$ in $\mathcal{F}$ is tightly concentrated around its
expectation, which is close to the Rademacher complexity $\mathcal{R}_n(\ell_{\mathcal{F}})$. Thus, we can bound the excess risk of
$\hat f_{\mathrm{erm}}$ in terms of the sum of the approximation error $\inf_{f \in \mathcal{F}} L(f) - \inf_f L(f)$ and this bound on the estimation
error.
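The Rademacher complexity of a small class can be estimated directly by Monte Carlo over the random signs. The following sketch (an illustrative companion to Theorem 2.1, using the same assumed finite class of threshold functions as above) computes the empirical, conditional-on-the-sample version and compares it with Massart's finite-class bound $\sqrt{2\log(2M)/n}$ for a class of $M$ functions taking values in $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative finite class G of {0,1}-valued functions g_t(z) = 1[z > t]
thresholds = np.linspace(-1.0, 1.0, 201)
n = 50
Z = rng.uniform(-1, 1, size=n)                        # a sample z_1, ..., z_n

G = (Z[None, :] > thresholds[:, None]).astype(float)  # G[j, i] = g_{t_j}(z_i)

def empirical_rademacher(G, n_mc=2000):
    """Monte Carlo estimate of E_eps sup_g |(1/n) sum_i eps_i g(z_i)| given the sample."""
    n = G.shape[1]
    vals = []
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)         # Rademacher signs
        vals.append(np.max(np.abs(G @ eps)) / n)
    return np.mean(vals)

M = len(thresholds)
print("Monte Carlo estimate of R_n(G):", empirical_rademacher(G))
print("Massart bound sqrt(2 log(2M)/n):", np.sqrt(2 * np.log(2 * M) / n))
```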

2.3 Faster rates


Although the approach (5) of bounding the deviation between the risk and empirical risk of $\hat f_{\mathrm{erm}}$ by the
maximum for any $f \in \mathcal{F}$ of this deviation appears to be very coarse, there are many situations where it
cannot be improved by more than a constant factor without stronger assumptions (we will see examples later
in this section). However, there are situations where it can be significantly improved. As an illustration,
provided $\mathcal{F}$ contains functions $f$ for which the variance of $\ell(f(x), y)$ is positive, it is easy to see that
$\mathcal{R}_n(\ell_{\mathcal{F}}) = \Omega(n^{-1/2})$. Thus, the best bound on the estimation error implied by Theorem 2.1 must go to zero
no faster than $n^{-1/2}$, but it is possible for the risk of the empirical minimizer to converge to the optimal
value $L(f^*_{\mathcal{F}})$ faster than this. For example, when $\mathcal{F}$ is suitably simple, this occurs for a nonnegative bounded
loss, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,1]$, when there is a function $f^*_{\mathcal{F}}$ in $\mathcal{F}$ that gives perfect predictions, in the sense that
almost surely $\ell(f^*_{\mathcal{F}}(x), y) = 0$. In that case, the following theorem is an example that gives a faster rate in
terms of the worst-case empirical Rademacher complexity,
$$ \bar{\mathcal{R}}_n(\mathcal{F}) = \sup_{x_1, \dots, x_n \in \mathcal{X}} \mathbb{E}\Biggl[ \sup_{f \in \mathcal{F}} \biggl| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i) \biggr| \;\Bigg|\; x_1, \dots, x_n \Biggr]. $$
Notice that, for any probability distribution on $\mathcal{X}$, $\mathcal{R}_n(\mathcal{F}) \le \bar{\mathcal{R}}_n(\mathcal{F})$.


Theorem 2.2. There is a constant $c > 0$ such that for a bounded function class $\mathcal{F} \subset [-1,1]^{\mathcal{X}}$, for $\ell(\hat y, y) = (\hat y - y)^2$, and for any distribution $P$ on $\mathcal{X} \times [-1,1]$, with probability at least $1 - \delta$, a sample $(x_1, y_1), \dots, (x_n, y_n)$
satisfies for all $f \in \mathcal{F}$,
$$ L(f) \le (1+c)\,\widehat{L}(f) + c\,(\log n)^4\, \bar{\mathcal{R}}_n^2(\mathcal{F}) + \frac{c \log(1/\delta)}{n}. $$
In particular, when $L(f^*_{\mathcal{F}}) = 0$, the empirical minimizer has $\widehat{L}(\hat f_{\mathrm{erm}}) = 0$, and so with high probability,
$L(\hat f_{\mathrm{erm}}) = \tilde{O}\bigl( \bar{\mathcal{R}}_n^2(\mathcal{F}) \bigr)$, which can be as small as $\tilde{O}(1/n)$ for a suitably simple class $\mathcal{F}$.


Typically, faster rates like these arise when the variance of the excess loss is bounded in terms of its
expectation, for instance
$$ \mathbb{E}\bigl[ \ell(f(x), y) - \ell(f^*_{\mathcal{F}}(x), y) \bigr]^2 \le c\, \mathbb{E}\bigl[ \ell(f(x), y) - \ell(f^*_{\mathcal{F}}(x), y) \bigr]. $$

For a bounded nonnegative loss with L(fF∗ ) = 0, this so-called Bernstein property is immediate, and it
has been exploited in that case to give fast rates for prediction with binary-valued [VC71, VC74] and real-
valued [Hau92, Pol95, BL99] function classes. Theorem 2.2, which follows from [SST10, Theorem 1] and the
AM-GM inequality1 , relies on the smoothness of the quadratic loss to give a bound for that case in terms of
the worst-case empirical Rademacher complexity. There has been a significant body of related work over the
last thirty years. First, for quadratic loss in this well-specified setting, that is, when f ∗ (x) = E[y|x] belongs
to the class F, faster rates have been obtained even without L(f ∗ ) = 0 [vdG90]. Second, the Bernstein
property can occur without the minimizer of L being in F; indeed, it arises for convex F with quadratic
loss [LBW96] or more generally strongly convex losses [Men02], and this has been exploited to give fast rates
1 The exponent on the log factor in Theorem 2.2 is larger than the result in the cited reference; any exponent larger than 3

suffices. See [RV06, Equation (1.4)].

based on several other notions of complexity [BBM05, Kol06, LRS15]. Recent techniques [Men20] eschew
concentration bounds and hence give weaker conditions for convergence of L(fberm ) to L(fF∗ ), without the
requirement that the random variables `(f (x), y) have light tails. Finally, while we have defined F as the
class of functions used by the prediction method, if it is viewed instead as the benchmark (that is, the aim
is to predict almost as well as the best function in F, but the prediction method can choose a prediction
rule fb that is not necessarily in F), then similar fast rates are possible under even weaker conditions, but
the prediction method must be more complicated than empirical risk minimization; see [RST17].

2.4 Complexity regularization


The results we have seen give bounds on the excess risk of fberm in terms of a sum of approximation error
and a bound on the estimation error that depends on the complexity of the function class F. Rather than
choosing the complexity of the function class F in advance, we could instead split a rich class F into a
complexity hierarchy and choose the appropriate complexity based on the data, with the aim of managing
this approximation-estimation tradeoff. We might define subsets Fr of a rich class F, indexed by a complexity
parameter r. We call each Fr a complexity class, and we say that it has complexity r.
There are many classical examples of this approach. For instance, support vector machines (SVMs) [CV95]
use a reproducing kernel Hilbert space (RKHS) H, and the complexity class Fr is the subset of functions in
H with RKHS norm no more than r. As another example, Lasso [Tib96] uses the set F of linear functions
on a high-dimensional space, with the complexity classes Fr defined by the `1 norm of the parameter vector.
Both SVMs and Lasso manage the approximation-estimation trade-off by balancing the complexity of the
prediction rule and its fit to the training data: they minimize a combination of empirical risk and some
increasing function of the complexity r.
The following theorem gives an illustration of the effectiveness of this kind of complexity regularization. In
the first part of the theorem, the complexity penalty for a complexity class is a uniform bound on deviations
between expectations and sample averages for that class. We have seen that uniform deviation bounds of
this kind imply upper bounds on the excess risk of the empirical risk minimizer in the class. In the second
part of the theorem, the complexity penalty appears in the upper bounds on excess risk that arise in settings
where faster rates are possible. In both cases, the theorem shows that when the bounds hold, choosing the
best penalized empirical risk minimizer in the complexity hierarchy leads to the best of these upper bounds.
Theorem 2.3. For each $\mathcal{F}_r \subseteq \mathcal{F}$, define an empirical risk minimizer
$$ \hat f^{\,r}_{\mathrm{erm}} \in \operatorname*{argmin}_{f \in \mathcal{F}_r} \widehat{L}(f). $$
Among these, select the one with complexity $\hat r$ that gives an optimal balance between the empirical risk and
a complexity penalty $p_r$:
$$ \hat f = \hat f^{\,\hat r}_{\mathrm{erm}}, \qquad \hat r \in \operatorname*{argmin}_{r} \Bigl\{ \widehat{L}(\hat f^{\,r}_{\mathrm{erm}}) + p_r \Bigr\}. \tag{6} $$

1. In the event that the complexity penalties are uniform deviation bounds:
$$ \text{for all } r, \qquad \sup_{f \in \mathcal{F}_r} \bigl| L(f) - \widehat{L}(f) \bigr| \le p_r, \tag{7} $$
then we have the oracle inequality
$$ L(\hat f) - \inf_{f} L(f) \le \inf_{r} \Bigl( \inf_{f \in \mathcal{F}_r} L(f) - \inf_{f} L(f) + 2 p_r \Bigr). \tag{8} $$

2. Suppose that the complexity classes and penalties are ordered, that is,
$$ r \le s \text{ implies } \mathcal{F}_r \subseteq \mathcal{F}_s \text{ and } p_r \le p_s, $$
and fix $f^*_r \in \operatorname{arg\,min}_{f \in \mathcal{F}_r} L(f)$. In the event that the complexity penalties satisfy the uniform relative
deviation bounds
$$ \text{for all } r, \qquad \sup_{f \in \mathcal{F}_r} \Bigl( L(f) - L(f^*_r) - 2\bigl( \widehat{L}(f) - \widehat{L}(f^*_r) \bigr) \Bigr) \le 2 p_r / 7 \tag{9} $$
$$ \text{and} \qquad \sup_{f \in \mathcal{F}_r} \Bigl( \widehat{L}(f) - \widehat{L}(f^*_r) - 2\bigl( L(f) - L(f^*_r) \bigr) \Bigr) \le 2 p_r / 7, $$
then we have the oracle inequality
$$ L(\hat f) - \inf_{f} L(f) \le \inf_{r} \Bigl( \inf_{f \in \mathcal{F}_r} L(f) - \inf_{f} L(f) + 3 p_r \Bigr). \tag{10} $$

These are called oracle inequalities because (8) (respectively (10)) gives the error bound that follows
from the best of the uniform bounds (7) (respectively (9)), as if we have access to an oracle who knows the
complexity that gives the best bound. The proof of the first part is a straightforward application of the same
decomposition as (4); see, for example, [BBL02]. The proof of the second part, which allows significantly
smaller penalties pr when faster rates are possible, is also elementary; see [Bar08]. In both cases, the broad
approach to managing the trade-off between approximation error and estimation error is qualitatively the
same: having identified a complexity hierarchy {Fr } with corresponding excess risk bounds pr , these results
show the effectiveness of choosing from the hierarchy a function $\hat f$ that balances the complexity penalty $p_r$
with the fit to the training data $\widehat{L}(\hat f^{\,r}_{\mathrm{erm}})$.
Later in this section, we will see examples of upper bounds on estimation error for neural network classes
Fr indexed by a complexity parameter r that depends on properties of the network, such as the size of the
parameters. Thus, a prediction method that trades off the fit to the training data with these measures of
complexity would satisfy an oracle inequality.
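As an illustration of the selection rule (6) (a toy sketch, not an example from the paper), consider a nested hierarchy of polynomial classes $\mathcal{F}_r$ of degree at most $r$ under squared loss. The penalty $p_r$ below is an arbitrary increasing function of $r$, standing in for a deviation bound of the form (7); the data-generating process is also an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(3)

n, sigma = 60, 0.3
X = rng.uniform(-1, 1, size=n)
f_star = lambda x: np.sin(2 * np.pi * x)
Y = f_star(X) + sigma * rng.normal(size=n)

Xte = rng.uniform(-1, 1, size=100_000)
Yte = f_star(Xte) + sigma * rng.normal(size=100_000)

def erm_poly(r):
    """Least-squares fit over F_r = {polynomials of degree <= r}."""
    coef = np.polyfit(X, Y, deg=r)
    return lambda x: np.polyval(coef, x)

results = []
for r in range(0, 11):
    fr = erm_poly(r)
    emp = np.mean((fr(X) - Y) ** 2)          # hat L(f_erm^r)
    pen = 2.0 * np.sqrt((r + 1) / n)         # illustrative penalty p_r, increasing in r
    test = np.mean((fr(Xte) - Yte) ** 2)     # Monte Carlo estimate of L(f_erm^r)
    results.append((r, emp, pen, test))

r_hat = min(results, key=lambda t: t[1] + t[2])[0]   # rule (6): balance fit and penalty
print("selected degree r_hat =", r_hat)
for r, emp, pen, test in results:
    print(f"r={r:2d}  hat L={emp:.3f}  p_r={pen:.3f}  L approx {test:.3f}")
```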

2.5 Computational complexity of empirical risk minimization


To this point, we have considered the statistical performance of the empirical risk minimizer fberm without
considering the computational cost of solving this optimization problem. The classical cases where it can
be solved efficiently involve linearly parameterized function classes, convex losses, and convex complexity
penalties, so that penalized empirical risk minimization is a convex optimization problem. For instance,
SVMs exploit a linear function class (an RKHS, $\mathcal{H}$), a convex loss,
$$ \ell(f(x), y) := (1 - y f(x)) \vee 0 \quad \text{for } f : \mathcal{X} \to \mathbb{R} \text{ and } y \in \{\pm 1\}, $$
and a convex complexity penalty,
$$ \mathcal{F}_r = \{ f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le r \}, \qquad p_r = r/\sqrt{n}, $$
and choosing $\hat f$ according to (6) corresponds to solving a quadratic program. Similarly, Lasso involves linear
functions on $\mathbb{R}^d$, quadratic loss, and a convex penalty,
$$ \mathcal{F}_r = \{ x \mapsto \langle x, \beta \rangle : \|\beta\|_1 \le r \}, \qquad p_r = r\sqrt{\log(d)/n}. $$
Again, minimizing complexity-penalized empirical risk corresponds to solving a quadratic program.
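For concreteness, here is a minimal sketch (illustrative, not from the text) of the convex problem underlying a soft-margin linear SVM: hinge loss plus a squared-norm penalty, minimized by subgradient descent with a decreasing step size. The penalized (Lagrangian) form stands in for the norm-constrained formulation above; data and step sizes are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# minimize (1/n) sum_i max(0, 1 - y_i <w, x_i>) + lam * ||w||^2 over w
n, d, lam = 200, 5, 0.01
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = np.sign(X @ w_true + 0.3 * rng.normal(size=n))

w = np.zeros(d)
for t in range(1, 3001):
    margins = Y * (X @ w)
    active = margins < 1                                   # points with nonzero hinge loss
    grad = -(Y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
    w -= grad / np.sqrt(t)                                 # subgradient step, eta_t = 1/sqrt(t)

hinge = np.mean(np.maximum(0.0, 1 - Y * (X @ w)))
err = np.mean(np.sign(X @ w) != Y)
print("training hinge loss:", hinge, " training 0-1 error:", err)
```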


On the other hand, the optimization problems that arise in a classification setting, where functions map
to a discrete set, have a combinatorial flavor, and are often computationally hard in the worst case. For
instance, empirical risk minimization over the set of linear classifiers

$$ \mathcal{F} = \bigl\{ x \mapsto \operatorname{sign}(\langle w, x \rangle) : w \in \mathbb{R}^d \bigr\} $$
is NP-hard [JP78, GJ79]. In contrast, if there is a function in this class that classifies all of the training
data correctly, finding an empirical risk minimizer is equivalent to solving a linear program, which can be

solved efficiently. Another approach to simplifying the algorithmic challenge of empirical risk minimization
is to replace the discrete loss for this family of thresholded linear functions with a surrogate convex loss for
the family of linear functions. This is the approach used in SVMs: replacing a nonconvex loss with a convex
loss allows for computational efficiency, even when there is no thresholded linear function that classifies all
of the training data correctly.
However, the corresponding optimization problems for neural networks appear to be more difficult. Even
when $\widehat{L}(\hat f_{\mathrm{erm}}) = 0$, various natural empirical risk minimization problems over families of neural networks are
NP-hard [Jud90, BR92, DSS95], and this is still true even for convex losses [Vu98, BBD02].
In the remainder of this section, we focus on the statistical complexity of prediction problems with neural
network function classes (we shall return to computational complexity considerations in Section 5). We
review estimation error bounds involving these classes, focusing particularly on the Rademacher complexity.
The Rademacher complexity of a loss class `F can vary dramatically with the loss `. For this reason, we
consider separately discrete losses, such as those used for classification, convex upper bounds on these losses,
like the SVM loss and other large margin losses used for classification, and Lipschitz losses used for regression.

2.6 Classification
We first consider loss classes for the problem of classification. For simplicity, consider a two-class classification
problem, where $\mathcal{Y} = \{\pm 1\}$, and define the $\pm 1$ loss, $\ell_{\pm 1}(\hat y, y) = -y\hat y$. Then for $\mathcal{F} \subset \{\pm 1\}^{\mathcal{X}}$, $\mathcal{R}_n(\ell_{\mathcal{F}}) = \mathcal{R}_n(\mathcal{F})$,
since the distribution of $\epsilon_i\, \ell_{\pm 1}(f(x_i), y_i) = -\epsilon_i y_i f(x_i)$ is the same as that of $\epsilon_i f(x_i)$. The following theorem
since the distribution of i `±1 (f (xi ), yi ) = −i yi f (xi ) is the same as that of i f (xi ). The following theorem
shows that the Rademacher complexity depends on a combinatorial dimension of F, known as the VC-
dimension [VC71].
Theorem 2.4. For $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ and for any distribution on $\mathcal{X}$,
$$ \mathcal{R}_n(\mathcal{F}) \le \sqrt{\frac{2 \log(2 \Pi_{\mathcal{F}}(n))}{n}}, $$
where
$$ \Pi_{\mathcal{F}}(n) = \max \bigl\{ \bigl| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \bigr| : x_1, \dots, x_n \in \mathcal{X} \bigr\}. $$
If $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$ and $n \ge d = d_{VC}(\mathcal{F})$, then
$$ \Pi_{\mathcal{F}}(n) \le (en/d)^d, $$
where $d_{VC}(\mathcal{F}) := \max\bigl\{ d : \Pi_{\mathcal{F}}(d) = 2^d \bigr\}$. In that case, for any distribution on $\mathcal{X}$,
$$ \mathcal{R}_n(\mathcal{F}) = O\Biggl( \sqrt{\frac{d \log(n/d)}{n}} \Biggr), $$
and conversely, for some probability distribution, $\mathcal{R}_n(\mathcal{F}) = \Omega\bigl( \sqrt{d/n} \bigr)$.

These bounds imply that, for the worst case probability distribution, the uniform deviations between
sample averages and expectations grow like $\tilde\Theta\bigl( \sqrt{d_{VC}(\mathcal{F})/n} \bigr)$, a result of [VC71]. The log factor in the
upper bound can be removed; see [Tal94]. Classification problems are an example where the crude upper
bound (5) cannot be improved without stronger assumptions: the minimax excess risk is essentially the
same as these uniform deviations. In particular, these results show that empirical risk minimization leads,
for any probability distribution, to excess risk that is $O\bigl( \sqrt{d_{VC}(\mathcal{F})/n} \bigr)$, but conversely, for every method that
predicts an $\hat f \in \mathcal{F}$, there is a probability distribution for which the excess risk is $\Omega\bigl( \sqrt{d_{VC}(\mathcal{F})/n} \bigr)$ [VC74].
When there is a prediction rule in $\mathcal{F}$ that predicts perfectly, that is $L(f^*_{\mathcal{F}}) = 0$, the upper and lower bounds
can be improved to $\tilde\Theta(d_{VC}(\mathcal{F})/n)$ [BEHW89, EHKV89].
These results show that dV C (F) is critical for uniform convergence of sample averages to probabilities,
and more generally for the statistical complexity of classification with a function class F. The following

theorem summarizes the known bounds on the VC-dimension of neural networks with various piecewise-
polynomial nonlinearities. Recall that a feed-forward neural network with L layers is defined by a sequence
of layer widths $d_1, \dots, d_L$ and functions $\sigma_l : \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$ for $l = 1, \dots, L$. It is a family of $\mathbb{R}^{d_L}$-valued functions
on $\mathbb{R}^d$ parameterized by $\theta = (W_1, \dots, W_L)$; see (1). We often consider scalar nonlinearities $\sigma : \mathbb{R} \to \mathbb{R}$
applied componentwise, that is, $\sigma_l(v)_i := \sigma(v_i)$. For instance, $\sigma$ might be the scalar nonlinearity used in the
ReLU (rectified linear unit), $\sigma(\alpha) = \alpha \vee 0$. We say that this family has $p$ parameters if there is a total of $p$
entries in the matrices $W_1, \dots, W_L$. We say that $\sigma$ is piecewise polynomial if it can be written as a sum of
a constant number of polynomials,
$$ \sigma(x) = \sum_{i=1}^{k} \mathbf{1}[x \in I_i]\, p_i(x), $$
where the intervals $I_1, \dots, I_k$ form a partition of $\mathbb{R}$ and the $p_i$ are polynomials.


Theorem 2.5. Consider feed-forward neural networks $\mathcal{F}_{L,\sigma}$ with $L$ layers, scalar output (that is, $d_L = 1$),
output nonlinearity $\sigma_L(\alpha) = \operatorname{sign}(\alpha)$, and scalar nonlinearity $\sigma$ at every other layer. Define
$$ d_{L,\sigma,p} = \max \bigl\{ d_{VC}(\mathcal{F}_{L,\sigma}) : \mathcal{F}_{L,\sigma} \text{ has } p \text{ parameters} \bigr\}. $$

1. For $\sigma$ piecewise constant, $d_{L,\sigma,p} = \tilde\Theta(p)$.

2. For $\sigma$ piecewise linear, $d_{L,\sigma,p} = \tilde\Theta(pL)$.

3. For $\sigma$ piecewise polynomial, $d_{L,\sigma,p} = \tilde{O}(pL^2)$.




Part 1 is from [BH89]. The upper bound in part 2 is from [BHLM19]. The lower bound in part 2
and the bound in part 3 are from [BMM98]. There are also upper bounds for the smooth sigmoid σ(α) =
1/(1 + exp(−α)) that are quadratic in p; see [KM97]. See Chapter 8 of [AB99] for a review.
The theorem shows that the VC-dimension of these neural networks grows at least linearly with the
number of parameters in the network, and hence to achieve small excess risk or uniform convergence of
sample averages to probabilities for discrete losses, the sample size must be large compared to the number
of parameters in these networks.
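To see the scale of this requirement, the following toy calculation (illustrative only; Theorem 2.5 gives the order $pL$ for piecewise-linear activations but not the constant, which is assumed to be 1 here) evaluates the VC-based excess-risk bound $\sqrt{d_{VC}\log(n/d_{VC})/n}$ for a ReLU network with $10^6$ parameters and 10 layers.

```python
import numpy as np

def vc_excess_risk_bound(p, L, n):
    d_vc = p * L                       # order of d_VC for piecewise-linear activations
    if n <= d_vc:
        return float("inf")            # bound is meaningless below the VC-dimension
    return np.sqrt(d_vc * np.log(n / d_vc) / n)

for n in [10**4, 10**6, 10**8, 10**10]:
    print(n, vc_excess_risk_bound(p=10**6, L=10, n=n))
```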
There is an important caveat to this analysis: it captures arbitrarily fine-grained properties of real-valued
functions, because the operation of thresholding these functions is very sensitive to perturbations, as the
following example shows.
Example 2.6. For α > 0, define the nonlinearity r̃(x) := (x + α sin x) ∨ 0 and the following one-parameter
class of functions computed by two-layer networks with these nonlinearities:

Fr̃ := {x 7→ sign(π + r̃(wx) − r̃(wx + π)) : w ∈ R} .

Then dV C (Fr̃ ) = ∞.
Indeed, provided wx ≥ α, r̃(wx) = wx + α sin(wx), hence π + r̃(wx) − r̃(wx + π) = 2α sin(wx). This
shows that the set of functions in Fr̃ restricted to N contains

{x 7→ sign(sin(wx)) : w ≥ α} = {x 7→ sign(sin(wx)) : w ≥ 0} ,

and the VC-dimension of the latter class of functions on N is infinite; see, for example, [AB99, Lemma 7.2].
Thus, with an arbitrarily small perturbation of the ReLU nonlinearity, the VC-dimension of this class changes
from a small constant to infinity. See also [AB99, Theorem 7.1], which gives a similar result for a slightly
perturbed version of a sigmoid nonlinearity.
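The algebraic identity driving Example 2.6 is easy to verify numerically (an illustrative sanity check, not part of the original text):

```python
import numpy as np

alpha = 0.1
rt = lambda x: np.maximum(x + alpha * np.sin(x), 0.0)    # the perturbed ReLU from Example 2.6

rng = np.random.default_rng(5)
w = rng.uniform(alpha, 50.0, size=1000)                  # weights with w >= alpha
x = rng.integers(1, 100, size=1000).astype(float)        # natural-number inputs, so wx >= alpha

lhs = np.pi + rt(w * x) - rt(w * x + np.pi)
assert np.allclose(lhs, 2 * alpha * np.sin(w * x))       # identity used in the example
assert np.all(np.sign(lhs) == np.sign(np.sin(w * x)))    # so the class contains sign(sin(wx))
print("identity verified on", len(w), "random (w, x) pairs")
```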
As we have seen, the requirement that the sample size grows with the number of parameters is at odds
with empirical experience: deep networks with far more parameters than the number of training examples
routinely give good predictive accuracy. It is plausible that the algorithms used to optimize these networks
are not exploiting their full expressive power. In particular, the analysis based on combinatorial dimensions

captures arbitrarily fine-grained properties of the family of real-valued functions computed by a deep network,
whereas algorithms that minimize a convex loss might not be significantly affected by such fine-grained
properties. Thus, we might expect that replacing the discrete loss `±1 with a convex surrogate, in addition
to computational convenience, could lead to reduced statistical complexity. The empirical success of gradient
methods with convex losses for overparameterized thresholded real-valued classifiers was observed both in
neural networks [MP90], [LGT97], [CLG01] and in related classification methods [DC95], [Qui96], [Bre98]. It
was noticed that classification performance can improve as the number of parameters is increased even after
all training examples are classified correctly [Qui96], [Bre98].2 These observations motivated large margin
analyses [Bar98], [SFBL98], which reduce classification problems to regression problems.

2.7 Large margin classification


Although the aim of a classification problem is to minimize the expectation of a discrete loss, if we consider
classifiers such as neural networks that consist of thresholded real-valued functions obtained by minimizing
a surrogate loss—typically a convex function of the real-valued prediction—then it turns out that we can
obtain bounds on estimation error by considering approximations of the class of real-valued functions. This
is important because the statistical complexity of that function class can be considerably smaller than that of
the class of thresholded functions. In effect, for a well-behaved surrogate loss, fine-grained properties of the
real-valued functions are not important. If the surrogate loss ` satisfies a Lipschitz property, we can relate
the Rademacher complexity of the loss class `F to that of the function class F using the Ledoux-Talagrand
contraction inequality [LT91, Theorem 4.12].
Theorem 2.7. Suppose that, for all $y$, $\hat y \mapsto \ell(\hat y, y)$ is $c$-Lipschitz and satisfies $\ell(0, y) = 0$. Then $\mathcal{R}_n(\ell_{\mathcal{F}}) \le 2c\,\mathcal{R}_n(\mathcal{F})$.
Notice that the assumption that $\ell(0, y) = 0$ is essentially without loss of generality: adding a fixed
function to $\ell_{\mathcal{F}}$ by replacing $\ell(\hat y, y)$ with $\ell(\hat y, y) - \ell(0, y)$ shifts the Rademacher complexity by $O(1/\sqrt{n})$.
For classification with $y \in \{-1, 1\}$, the hinge loss $\ell(\hat y, y) = (1 - y\hat y) \vee 0$ used by SVMs and the logistic
loss $\ell(\hat y, y) = \log(1 + \exp(-y\hat y))$ are examples of convex, 1-Lipschitz surrogate losses. The quadratic loss
$\ell(\hat y, y) = (\hat y - y)^2$ and the exponential loss $\ell(\hat y, y) := \exp(-y\hat y)$ used by AdaBoost [FS97] (see Section 3) are
also convex, and they are Lipschitz when functions in $\mathcal{F}$ have bounded range.
We can write all of these surrogate losses as $\ell_\phi(\hat y, y) := \phi(\hat y y)$ for some function $\phi : \mathbb{R} \to [0, \infty)$. The
following theorem relates the excess risk to the excess surrogate risk. It is simpler to state when $\phi$ is convex
and when, rather than $\ell_{\pm 1}$, we consider a shifted, scaled version, defined as $\ell_{01}(\hat y, y) := \mathbf{1}[\hat y \ne y]$. We use
$L_{01}(f)$ and $L_\phi(f)$ to denote $\mathbb{E}\,\ell_{01}(f(x), y)$ and $\mathbb{E}\,\ell_\phi(f(x), y)$ respectively.
Theorem 2.8. For a convex function $\phi : \mathbb{R} \to [0, \infty)$, define $\ell_\phi(\hat y, y) := \phi(\hat y y)$ and $C_\theta(\alpha) := (1+\theta)\phi(\alpha)/2 + (1-\theta)\phi(-\alpha)/2$, and define $\psi_\phi : [0,1] \to [0,\infty)$ as $\psi_\phi(\theta) := \inf\{C_\theta(\alpha) : \alpha \le 0\} - \inf\{C_\theta(\alpha) : \alpha \in \mathbb{R}\}$. Then
we have the following.

1. For any measurable $\hat f : \mathcal{X} \to \mathbb{R}$ and any probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$,
$$ \psi_\phi\Bigl( L_{01}(\hat f) - \inf_f L_{01}(f) \Bigr) \le L_\phi(\hat f) - \inf_f L_\phi(f), $$
where the infima are over measurable functions $f$.

2. For $|\mathcal{X}| \ge 2$, this inequality cannot hold if $\psi_\phi$ is replaced by any larger function:
$$ \sup_{\theta} \Bigl\{ \inf\Bigl\{ L_\phi(\hat f) - \inf_f L_\phi(f) : P, \hat f \text{ satisfy } L_{01}(\hat f) - \inf_f L_{01}(f) = \theta \Bigr\} - \psi_\phi(\theta) \Bigr\} = 0. $$

3. $\psi_\phi(\theta_i) \to 0$ implies $\theta_i \to 0$ if and only if both $\phi$ is differentiable at zero and $\phi'(0) < 0$.

2 Both phenomena were observed more recently in neural networks; see [ZBH+ 17] and [NTSS17].
For example, for the hinge loss $\phi(\alpha) = (1-\alpha) \vee 0$, the relationship between excess risk and excess $\phi$-risk
is given by $\psi_\phi(\theta) = |\theta|$; for the quadratic loss $\phi(\alpha) = (1-\alpha)^2$, $\psi_\phi(\theta) = \theta^2$; and for the exponential loss
$\phi(\alpha) = \exp(-\alpha)$, $\psi_\phi(\theta) = 1 - \sqrt{1 - \theta^2}$. Theorem 2.8 is from [BJM06]; see also [Lin04, LV04] and [Zha04].
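These closed forms can be checked numerically from the definition of $\psi_\phi$ in Theorem 2.8 by minimizing $C_\theta$ over a grid of values of $\alpha$ (an illustrative check; the grid range and resolution are arbitrary choices):

```python
import numpy as np

alphas = np.linspace(-20, 20, 400001)        # grid over which C_theta is minimized

def psi(phi, theta):
    C = (1 + theta) * phi(alphas) / 2 + (1 - theta) * phi(-alphas) / 2
    return C[alphas <= 0].min() - C.min()    # inf over alpha<=0 minus inf over all alpha

hinge = lambda a: np.maximum(1 - a, 0.0)
quad = lambda a: (1 - a) ** 2
expo = lambda a: np.exp(-a)

for th in np.linspace(0, 0.99, 12):
    print(f"theta={th:.2f}  hinge {psi(hinge, th):.3f} vs {th:.3f}  "
          f"quad {psi(quad, th):.3f} vs {th**2:.3f}  "
          f"exp {psi(expo, th):.3f} vs {1 - np.sqrt(1 - th**2):.3f}")
```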
Using (4), (5), and Theorems 2.1 and 2.7 to bound $\mathbb{E}\,\ell_{\phi,\hat f} - \inf_f \mathbb{E}\,\ell_{\phi,f}$ in terms of $\mathcal{R}_n(\mathcal{F})$ and combining
with Theorem 2.8 shows that, if $\phi$ is 1-Lipschitz then with high probability,
$$ \psi_\phi\Bigl( L_{01}(\hat f) - \inf_f L_{01}(f) \Bigr) \le 4\mathcal{R}_n(\mathcal{F}) + O\Bigl( \tfrac{1}{\sqrt{n}} \Bigr) + \inf_{f \in \mathcal{F}} L_\phi(f) - \inf_f L_\phi(f). \tag{11} $$
Notice that in addition to the Rademacher complexity of the real-valued class $\mathcal{F}$, this bound includes an
approximation error term defined in terms of the surrogate loss; the binary-valued prediction problem has
been reduced to a real-valued problem.

Alternatively, we could consider more naive bounds: If a loss satisfies the pointwise inequality $\ell_{01}(\hat y, y) \le \ell_\phi(\hat y, y)$, then we have an upper bound on risk in terms of surrogate risk: $L_{01}(\hat f) \le L_\phi(\hat f)$. In fact, Theorem 2.8 implies that pointwise inequalities like this are inevitable for any reasonable convex loss. Define a
surrogate loss $\phi$ as classification-calibrated if any $f$ that minimizes the surrogate risk $L_\phi(f)$ will also minimize the classification risk $L_{01}(f)$. Then part 3 of the theorem shows that if a convex surrogate loss $\phi$ is
classification-calibrated then it satisfies
$$ \text{for all } \hat y, y, \qquad \frac{\ell_\phi(\hat y, y)}{\phi(0)} = \frac{\phi(\hat y y)}{\phi(0)} \ge \mathbf{1}[\hat y y \le 0] = \ell_{01}(\hat y, y). $$
Thus, every classification-calibrated convex surrogate loss, suitably scaled so that $\phi(0) = 1$, is an upper
bound on the discrete loss $\ell_{01}$, and hence immediately gives an upper bound on risk in terms of surrogate
risk: $L_{01}(\hat f) \le L_\phi(\hat f)$. Combining this with Theorems 2.1 and 2.7 shows that, if $\phi$ is also 1-Lipschitz then
with high probability,
$$ L_{01}(\hat f) \le \widehat{L}_\phi(\hat f) + 4\mathcal{R}_n(\mathcal{F}) + O\Bigl( \tfrac{1}{\sqrt{n}} \Bigr). \tag{12} $$

2.8 Real prediction


For a real-valued function class F, there is an analog of Theorem 2.4 with the VC-dimension of F replaced
by the pseudodimension of F, which is the VC-dimension of {(x, y) 7→ 1 [f (x) ≥ y] : f ∈ F}; see [Pol90].
Theorem 2.5 is true with the output nonlinearity σL of FL,σ replaced by any Lipschitz nonlinearity and
with dV C replaced by the pseudodimension. However, using this result to obtain bounds on the excess risk
of an empirical risk minimizer would again require the sample size to be large compared to the number of
parameters.
Instead, we can bound Rn (`F ) more directly in many cases. With a bound on Rn (F) for a class F of
real-valued functions computed by neural networks, we can then apply Theorem 2.7 to relate Rn (`F ) to
Rn (F), provided the loss is a Lipschitz function of its first argument. This is the case, for example, for
absolute loss $\ell(\hat y, y) = |\hat y - y|$, or for quadratic loss $\ell(\hat y, y) = (\hat y - y)^2$ when $\mathcal{Y}$ and the range of functions in
$\mathcal{F}$ are bounded.
F are bounded.
The following result gives a bound on Rademacher complexity for neural networks that use a bounded,
Lipschitz nonlinearity, such as the sigmoid function
$$ \sigma(x) = \frac{1 - \exp(-x)}{1 + \exp(-x)}. $$

Theorem 2.9. For two-layer neural networks defined on $\mathcal{X} = [-1,1]^d$,
$$ \mathcal{F}_B = \Biggl\{ x \mapsto \sum_{i=1}^{k} b_i\, \sigma(\langle w_i, x \rangle) : \|b\|_1 \le 1,\ \|w_i\|_1 \le B,\ k \ge 1 \Biggr\}, $$
where the nonlinearity $\sigma : \mathbb{R} \to [-1,1]$ is 1-Lipschitz and has $\sigma(0) = 0$,
$$ \mathcal{R}_n(\mathcal{F}_B) \le B \sqrt{\frac{2 \log 2d}{n}}. $$
Thus, for example, applying (11) in this case with a Lipschitz convex loss $\ell_\phi$ and the corresponding $\psi_\phi$
defined by Theorem 2.8, shows that with high probability the minimizer $\hat f_{\mathrm{erm}}$ in $\mathcal{F}_B$ of $\widehat{\mathbb{E}}\,\ell_{\phi,f}$ satisfies
$$ \psi_\phi\Bigl( L_{01}(\hat f_{\mathrm{erm}}) - \inf_f L_{01}(f) \Bigr) \le O\Biggl( B \sqrt{\frac{\log d}{n}} \Biggr) + \inf_{f \in \mathcal{F}_B} L_\phi(f) - \inf_f L_\phi(f). $$
If, in addition, $\ell_\phi$ is scaled so that it is an upper bound on $\ell_{01}$, applying (12) shows that with high probability
every $f \in \mathcal{F}_B$ satisfies
$$ L_{01}(f) \le \widehat{L}_\phi(f) + O\Biggl( B \sqrt{\frac{\log d}{n}} \Biggr). $$
Theorem 2.9 is from [BM02]. The proof uses the contraction inequality (Theorem 2.7) and elementary
properties of Rademacher complexity.
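A quick numerical illustration of the scale of Theorem 2.9 (not from the paper): after the contraction step, the relevant quantity is the Rademacher complexity of the linear class $\{x \mapsto \langle w, x\rangle : \|w\|_1 \le B\}$ on points in $[-1,1]^d$, whose supremum over the $\ell_1$ ball can be computed exactly via the dual $\ell_\infty$ norm and then averaged over random signs. The sample size, dimension, and $B$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)

n, d, B = 200, 50, 3.0
X = rng.uniform(-1, 1, size=(n, d))

vals = []
for _ in range(5000):
    eps = rng.choice([-1.0, 1.0], size=n)
    # sup over {||w||_1 <= B} of |<w, (1/n) sum_i eps_i x_i>| = B * max_j |(1/n) sum_i eps_i x_ij|
    vals.append(B * np.max(np.abs(eps @ X)) / n)

print("Monte Carlo estimate      :", np.mean(vals))
print("bound B*sqrt(2 log(2d)/n) :", B * np.sqrt(2 * np.log(2 * d) / n))
```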
The following theorem gives similar error bounds for networks with Lipschitz nonlinearities that, like the
ReLU nonlinearity, do not necessarily have a bounded range. The definition of the function class includes
deviations of the parameter matrices W i from fixed ‘centers’ M i .
Theorem 2.10. Consider a feed-forward network with $L$ layers, fixed vector nonlinearities $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$
and parameter $\theta = (W_1, \dots, W_L)$ with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, for $i = 1, \dots, L$, which computes functions
$$ f(x;\theta) = \sigma_L\bigl(W_L\,\sigma_{L-1}(W_{L-1}\cdots\sigma_1(W_1 x)\cdots)\bigr), $$
where $d_0 = d$ and $d_L = 1$. Define $\bar d = d_0 \vee \cdots \vee d_L$. Fix matrices $M_i \in \mathbb{R}^{d_i \times d_{i-1}}$, for $i = 1, \dots, L$, and
define the class of functions on the unit Euclidean ball in $\mathbb{R}^d$,
$$ \mathcal{F}_r = \Biggl\{ f(\cdot;\theta) : \prod_{i=1}^{L} \|W_i\| \Biggl( \sum_{i=1}^{L} \frac{\|W_i^\top - M_i^\top\|_{2,1}^{2/3}}{\|W_i\|^{2/3}} \Biggr)^{3/2} \le r \Biggr\}, $$
where $\|A\|$ denotes the spectral norm of the matrix $A$ and $\|A\|_{2,1}$ denotes the sum of the 2-norms of its
columns. If the $\sigma_i$ are all 1-Lipschitz and the surrogate loss $\ell_\phi$ is a $b$-Lipschitz upper bound on the classification loss $\ell_{01}$, then with probability at least $1 - \delta$, every $f \in \mathcal{F}_r$ has
$$ L_{01}(f) \le \widehat{L}_\phi(f) + \tilde{O}\Biggl( \frac{r b \log \bar d + \sqrt{\log(1/\delta)}}{\sqrt{n}} \Biggr). $$

Theorem 2.10 is from [BFT17]. The proof uses different techniques (covering numbers rather than the
Rademacher complexity) to address the key technical difficulty, which is controlling the scale of vectors that
appear throughout the network.
When the nonlinearity has a 1-homogeneity property, the following result gives a simple direct bound on
the Rademacher complexity in terms of the Frobenius norms of the weight matrices (although it is worse than
Theorem 2.10, even with M i = 0, unless the ratios kW i kF /kW i k are close to 1). We say that σ : R → R is
1-homogeneous if σ(αx) = ασ(x) for all x ∈ R and α ≥ 0. Notice that the ReLU nonlinearity σ(x) = x ∨ 0
has this property.
Theorem 2.11. Let $\bar\sigma : \mathbb{R} \to \mathbb{R}$ be a fixed 1-homogeneous nonlinearity, and define the componentwise version
$\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ via $\sigma_i(x)_j = \bar\sigma(x_j)$. Consider a network with $L$ layers of these nonlinearities and parameters
$\theta = (W_1, \dots, W_L)$, which computes functions
$$ f(x;\theta) = \sigma_L\bigl(W_L\,\sigma_{L-1}(W_{L-1}\cdots\sigma_1(W_1 x)\cdots)\bigr). $$
Define the class of functions on the unit Euclidean ball in $\mathbb{R}^d$,
$$ \mathcal{F}_B = \{ f(\cdot;\theta) : \|W_i\|_F \le B \}, $$
where $\|W_i\|_F$ denotes the Frobenius norm of $W_i$. Then we have
$$ \mathcal{R}_n(\mathcal{F}_B) \lesssim \frac{\sqrt{L}\, B^L}{\sqrt{n}}. $$

This result is from [GRS18], which also shows that it is possible to remove the $\sqrt{L}$ factor at the cost of
a worse dependence on n. See also [NTS15].

2.9 The mismatch between benign overfitting and uniform convergence


It is instructive to consider the implications of the generalization bounds we have reviewed in this section for
the phenomenon of benign overfitting, which has been observed in deep learning. For concreteness, suppose
that $\ell$ is the quadratic loss. Consider a neural network function $\hat f \in \mathcal{F}$ chosen so that $\widehat{L}(\hat f) = 0$. For an
appropriate complexity hierarchy $\mathcal{F} = \bigcup_r \mathcal{F}_r$, suppose that $\hat f$ is chosen to minimize the complexity $r(\hat f)$,
defined as the smallest $r$ for which $\hat f \in \mathcal{F}_r$, subject to the interpolation constraint $\widehat{L}(\hat f) = 0$. What do the
bounds based on uniform convergence imply about the excess risk $L(\hat f) - \inf_{f \in \mathcal{F}} L(f)$ of this minimum-
complexity interpolant?
Theorems 2.9, 2.10, and 2.11 imply upper bounds on risk in terms of various notions of scale of network
parameters. For these bounds to be meaningful for a given probability distribution, there must be an
interpolating fb for which the complexity r(fb) grows suitably slowly with the sample size n so that the excess
risk bounds converge to zero.
An easy example is when there is an $f^* \in \mathcal{F}_r$ with $L(f^*) = 0$, where $r$ is a fixed complexity. Notice that
this implies not just that the conditional expectation is in $\mathcal{F}_r$, but that there is no noise, that is, almost
surely $y = f^*(x)$. In that case, if we choose $\hat f$ as the interpolant $\widehat{L}(\hat f) = 0$ with minimum complexity,
then its complexity will certainly satisfy $r(\hat f) \le r(f^*) = r$. And then as the sample size $n$ increases,
$L(\hat f)$ will approach zero. In fact, since $\widehat{L}(\hat f) = 0$, Theorem 2.2 implies a faster rate in this case: $L(\hat f) =
O((\log n)^4\, \bar{\mathcal{R}}_n^2(\mathcal{F}_r))$.
Theorem 2.3 shows that if we were to balance the complexity with the fit to the training data, then we
can hope to enjoy excess risk as good as the best bound for any Fr in the complexity hierarchy. If we always
choose a perfect fit to the data, there is no trade-off between complexity and empirical risk, but when there
is a prediction rule f ∗ with finite complexity and zero risk, then once the sample size is sufficiently large,
the best trade-off does correspond to a perfect fit to the data. To summarize: when there is no noise, that
is, when y = f ∗ (x), and f ∗ ∈ F, classical theory shows that a minimum-complexity interpolant fb ∈ F will
have risk L(fb) converging to zero as the sample size increases.
But what if there is noise, that is, there is no deterministic relationship between x and y? Then it turns
out that the bounds on the excess risk L(fb) − L(fF∗ ) presented in this section must become vacuous: they
can never decrease below a constant, no matter how large the sample size. This is because these bounds do
not rely on any properties of the distribution on X , and hence are also true in a fixed design setting, where
the excess risk is at least the noise level.
To make this precise, fix x1 , . . . , xn ∈ X and define the fixed design risk
n
1X
L|x (f ) := E [ `(f (xi ), y)| x = xi ] .
n i=1

Then the decomposition (4) extends to this risk: for any fb and f ∗ ,

L|x (fb) − L|x (f ∗ )


h i h i h i
= L|x (fb) − L( b fb) + L( b ∗ ) + L(f
b fb) − L(f b ∗ ) − L|x (f ∗ ) .

16
For a nonnegative loss, the second term is nonpositive when L( b fb) = 0, and the last term is small for any fixed
∗ ∗
f . Fix f (x) = E[y|x], and suppose we choose fb from a class Fr . The same proof as that of Theorem 2.1
gives a Rademacher complexity bound on the first term above, and [LT91, Theorem 4.12] implies the same
contraction inequality as in Theorem 2.7 when yb 7→ `(b y , y) is c-Lipschitz:
" #
1 X n
b ) ≤ 2E sup
E sup L|x (f ) − L(f i `(f (xi ), yi ) x1 , . . . , xn

f ∈Fr f ∈Fr n i=1

≤ 4cR̄n (Fr ).
Finally, although Theorems 2.9 and Theorem 2.11 are stated as bounds on the Rademacher complexity of
Fr , they are in fact bounds on R̄n (Fr ), the worst-case empirical Rademacher complexity of F.
Consider the complexity hierarchy defined in Theorem 2.9 or Theorem 2.11. For the minimum-complexity
interpolant fb, these theorems give bounds that depend on the complexity r(fb), that is, bounds of the form
L(fb) − L(f ∗ ) ≤ B(r(fb)) (ignoring the fact that that the minimum complexity r(fb) is random; making the
bounds uniform over r would give a worse bound). Then these observations imply that
h i
E L|x (fb) − L|x (f ∗ ) = EL|x (fb) − L(f ∗ ) ≤ EB(r(fb)).

But then
n 2
h i 1 X b
EB(r(fb)) ≥ E L|x (fb) − L(f ∗ ) = E f (xi ) − f ∗ (xi ) = L(f ∗ ).
n i=1

Thus, unless there is no noise, the upper bound on excess risk must be at least as big as a constant.
[BL20b] use a similar comparison between prediction problems in random design and fixed design settings
to demonstrate situations where benign overfitting occurs but a general family of excess risk bounds—those
that depend only on properties of fb and do not increase too quickly with sample size—must sometimes be very
loose. [NK19] present a scenario where, with high probability, a classification method gives good predictive
accuracy but uniform convergence bounds must fail for any function class that contains the algorithm’s
output. Algorithmic stability approaches—see [DW79] and [BE02]—also aim to identify sufficient conditions
for closeness of risk and empirical risk, and appear to be inapplicable in the interpolation regime. These
examples illustrate that to understand benign overfitting, new analysis approaches are necessary that exploit
additional information. We shall review results of this kind in Section 4, for minimum-complexity interpolants
in regression settings. The notion of complexity that is minimized is obviously of crucial importance here;
this is the topic of the next section.

3 Implicit regularization
When the model F is complex enough to ensure zero empirical error, such as in the case of overparametrized
neural networks, the set of empirical minimizers may be large. Therefore, it may very well be the case
that some empirical minimizers generalize well while others do not. Optimization algorithms introduce a
bias in this choice: an iterative method may converge to a solution with certain properties. Since this
bias is a by-product rather than an explicitly enforced property, we follow the recent literature and call it
implicit regularization. In subsequent sections, we shall investigate statistical consequences of such implicit
regularization.
Perhaps the simplest example of implicit regularization is gradient descent on the square-loss objective
with linear functions:
1 2
θ t+1 = θ t − ηt ∇L(θ
b t ), L(θ)
b = kXθ − yk2 , θ 0 = 0 ∈ Rd , (13)
n
where X = [x1 , . . . , xn ] T ∈ Rn×d and y = [y1 , . . . , yn ] T are the training data, and ηt > 0 is the step size.
While the set of minimizers of the square-loss objective in the overparametrized (d > n) regime is an affine

17
subspace of dimension at least d − n, gradient descent (with any choice of step size that ensures convergence)
converges to a very specific element of this subspace: the minimum-norm solution
n o
b = argmin kθk2 : hθ, xi i = yi for all i ≤ n .
θ (14)
θ

This minimum-norm interpolant can be written in closed form as


b = X † y,
θ (15)

where X † denotes the pseudoinverse. It can also be seen as a limit of ridge regression
1 2 2
θ λ = argmin kXθ − yk2 + λ kθk2 (16)
θ n

as λ → 0+ . The connection between minimum-norm interpolation (14) and the “ridgeless” limit of ridge
regression will be fruitful in the following sections when statistical properties of these methods are analyzed
and compared.
To see that the iterations in (13) converge to the minimum-norm solution, observe that the Karush-Kuhn-
Tucker (KKT) conditions for the constrained optimization problem (14) are Xθ = y and θ + X T µ = 0 for
Lagrange multipliers µ ∈ Rn . Both conditions are satisfied (in finite time or in the limit) by any procedure
that interpolates the data while staying in the span of the rows of X, including (13). It should be clear
= n−1 i `(hθ, xi i , yi ) under appropriate
P
that a similar statement holds for more general objectives L(θ) b
assumptions on `. Furthermore, if started from an arbitrary θ 0 , gradient descent (if it converges) selects a
solution that is closest to the initialization with respect to k·k2 .
Boosting is another notable example of implicit regularization arising from the choice of the optimization
algorithm, this time for the problem of classification. Consider the linear classification objective
n
b 01 (θ) = 1
X
L 1 [−yi hθ, xi i ≥ 0] (17)
n i=1

where y1 , . . . , yn ∈ {±1}. In the classical formulation of the boosting problem, the coordinates of vectors
xi correspond to features computed by functions in some class of base classifiers. Boosting was initially
proposed as a method for minimizing empirical classification loss (17) by iteratively updating θ. In particular,
AdaBoost [FS97] corresponds to coordinate descent on the exponential loss function
n
1X
θ 7→ exp{−yi hθ, xi i} (18)
n i=1

[Bre98, Fri01]. Notably, the minimizer of this surrogate loss does not exist in the general separable case,
and there are multiple directions along which the objective decreases to 0 as kθk → ∞. The AdaBoost
optimization procedure and its variants were observed empirically to shift the distribution of margins (the
values yi hθ t , xi i, i = 1, . . . , n) during the optimization process in the positive direction even after empirical
classification error becomes zero, which in part motivated the theory of large margin classification [SFBL98].
In the separable case, convergence to the direction of the maximizing `1 margin solution
n o
b = argmin kθk1 : yi hθ, xi i ≥ 1 for all i ≤ n
θ (19)
θ

was shown in [ZY05] and [Tel13] assuming small enough step size, where separability means positivity of the
margin

max min yi hθ, xi i . (20)


kθk1 =1 i∈[n]

18
More recently, [SHN+ 18] and [JT18] have shown that gradient (rather than coordinate) descent on (18)
and separable data lead to a solution with direction approaching that of the maximum `2 (rather than `1 )
margin separator
n o
b = argmin kθk2 : yi hθ, xi i ≥ 1 for all i ≤ n .
θ (21)
θ

We state the next theorem from [SHN+ 18] for the case of logistic loss, although essentially the same
statement—up to a slightly modified step size upper bound—holds for any smooth loss function that has
appropriate exponential-like tail behavior, including `(u) = e−u [SHN+ 18, JT18].
Theorem 3.1. Assume the data X, y are linearly separable. For logistic loss `(u) = log(1 + exp{−u}), any
step size η ≤ 8λ−1
max (n
−1
X T X), and any initialization θ 0 , the gradient descent iterations
n
1X
θ t+1 = θ t − η∇L(θ
b t ), L(θ)
b = `(yi hxi , θi)
n i=1
b · log t + ρt where θ
satisfy θ t = θ b is the `2 max-margin solution in (21). Furthermore, the residual grows at
most as kρt k = O(log log t), and thus
θt θ
b
lim = .
t→∞ kθ t k kθk
2
b 2
These results have been extended to multi-layer fully connected neural networks and convolutional neural
networks (without nonlinearities) in [GLSS18b]. On the other hand, [GLSS18a] considered the implicit bias
arising from other optimization procedures, including mirror descent, steepest descent, and AdaGrad, both
in the case when the global minimum is attained (as for the square loss) and when the global minimizers are
at infinity (as in the classification case with exponential-like tails of the loss function). We refer to [JT19]
and [NLG+ 19] and references therein for further studies on faster rates of convergence to the direction of
the max margin solution (with more aggressive time-varying step sizes) and on milder assumptions on the
loss function.
In addition to the particular optimization algorithm being employed, implicit regularization arises from
the choice of model parametrization. Consider re-parametrizing the least-squares objective in (13) as
2
min kXθ(u) − yk2 , (22)
u∈Rd

where θ(u)i = u2i is the coordinate-wise square. [GWB+ 17] show that if θ ∞ (α) is the limit point of gradient
flow on (22) with initialization α1 and the limit θ
b = limα→0 θ ∞ (α) exists and satisfies X θ
b = y, then it must
be that
n o
b ∈ argmin kθk : hθ, xi i = yi for all i ≤ n .
θ (23)
1
θ∈Rd
+

In other words, in that case, gradient descent on the reparametrized problem with infinitesimally small step
sizes and infinitesimally small initialization converges to the minimum `1 norm solution in the original space.
More generally, [GWB+ 17] and [LMZ18] proved an analogue of this statement for matrix-valued θ and xi ,
establishing convergence to the minimum nuclear-norm solution under additional assumptions on the xi .
The matrix version of the problem can be written as
n
X
min `(hU V T , xi i , yi ),
U ,V
i=1

which can be viewed, in turn, as an empirical risk minimization objective for a two-layer neural network
with linear activation functions.
In summary, in overparametrized problems that admit multiple minimizers of the empirical objective,
the choice of the optimization method and the choice of parametrization both play crucial roles in selecting a
minimizer with certain properties. As we show in the next section, these properties of the solution can ensure
good generalization properties through novel mechanisms that go beyond the realm of uniform convergence.

19
4 Benign overfitting
We now turn our attention to generalization properties of specific solutions that interpolate training data.
As emphasized in Section 2, mechanisms of uniform convergence alone cannot explain good statistical per-
formance of such methods, at least in the presence of noise.
For convenience, in this section we focus our attention on regression problems with square loss `(f (x), y) =
(f (x) − y)2 . In this case, the regression function f ∗ = E[y|x] is a minimizer of L(f ), and excess loss can be
written as
2
L(f ) − L(f ∗ ) = E(f (x) − f ∗ (x))2 = kf − f ∗ kL2 (P) .
We assume that for any x, conditional variance of the noise ξ = y − f ∗ (x) is at most σξ2 , and we write
ξi = yi − f ∗ (xi ).
As in the previous section, we say that a solution fb is interpolating if

fb(xi ) = yi , i = 1, . . . , n. (24)

For learning rules fb expressed in closed form—such as local methods and linear and kernel regression—it is
convenient to employ a bias-variance decomposition that is different from the approximation-estimation error
decomposition (3) in Section 2. First, for X = [x1 , . . . , xn ] T ∈ Rn×d and y = [y1 , . . . , yn ] T , conditionally on
X, define
 2  2
2
d = Ex f ∗ (x) − Ey fb(x) ,
bias d = Ex,y fb(x) − Ey fb(x) .
var (25)

It is easy to check that


h i
2
Ekfb − f ∗ k2L2 (P) = EX bias
 
d + EX vard . (26)
Pn
In this section we consider linear (in y) estimators of the form fb(x) = i=1 yi ωi (x). For such estimators we
have !2
n
2
X
d = Ex f (x) −∗ ∗
bias f (xi )ωi (x) (27)
i=1

and !2
n n
2
X X
d = Ex,ξ
var ξi ωi (x) ≤ σξ2 Ex (ωi (x)) , (28)
i=1 i=1

with equality if conditional noise variances are equal to σξ2 at each x.


In classical statistics, the balance between bias and variance is achieved by tuning an explicit parameter.
Before diving into the more unexpected interpolation results, where the behavior of bias and variance are
driven by novel self-regularization phenomena, we discuss the bias-variance tradeoff in the context of one of
the oldest statistical methods.

4.1 Local methods: Nadaraya-Watson


Consider arguably the simplest nontrivial interpolation procedure, the 1-nearest neighbour (1-NN) fb(x) =
ynn(x) , where nn(x) is the index of the datapoint closest to x in Euclidean distance. While we could view
fb as an empirical minimizer in some effective class F of possible functions (as a union for all possible
{x1 , . . . , xn }), this set is large and growing with n. Exploiting the particular form of 1-NN is, obviously,
crucial. Since typical distances to the nearest neighbor in Rd decay as n−1/d for i.i.d. data, in the noiseless
case (σξ = 0) one can guarantee consistency and nonparametric rates of convergence of this interpolation
procedure under continuity and smoothness assumptions on f ∗ and the underlying measure. Perhaps more
interesting is the case when the ξi have non-vanishing variance. Here 1-NN is no longer consistent in general

20
(as can be easily seen by taking f ∗ = 0 and independent Rademacher ξi at random xi ∈ [0, 1]), although
its asymptotic risk is at most 2L(f ∗ ) [CH67]. The reason for inconsistency is insufficient averaging of the
y-values, and this deficiency can be addressed by averaging over the k nearest neighbors with k growing with
n. Classical smoothing methods generalize this idea of local averaging; however, averaging forgoes empirical
fit to data in favor of estimating the regression function under smoothness assumptions. While this has been
the classical view, estimation is not necessarily at odds with fitting the training data for these local methods,
as we show next.
The Nadaraya-Watson (NW) smoothing estimator [Nad64, Wat64] is defined as
n
X K((x − xi )/h)
fb(x) = yi ωi (x), ωi (x) = Pn , (29)
i=1 j=1 K((x − xj )/h)

where K(u) : Rd → R≥0 is a kernel and h > 0 is a bandwidth parameter. For standard kernels used in
practice—such as the Gaussian, uniform, or Epanechnikov kernels—the method averages the y-values in
a local neighborhood around x, and, in general, does not interpolate. However, as noted by [DGK98], a
−d
kernel that is singular at 0 does interpolate the data. While the Hilbert kernel K(u) = kuk2 , suggested in
[DGK98], does not enjoy non-asymptotic rates of convergence, its truncated version
−a
K(u) = kuk2 1 [kuk2 ≤ 1] , u ∈ Rd (30)
with a smaller power 0 < a < d/2 was shown in [BRT19] to lead to minimax optimal rates of estimation
under the corresponding smoothness assumptions. Notably, the NW estimator with the kernel in (30) is
necessarily interpolating the training data for any choice of h.
Before stating the formal result, define the Hölder class H(β, L), for β ∈ (0, 1], as the class of functions
f : Rd → R satisfying
β
∀x, x0 ∈ Rd , |f (x) − f (x0 )| ≤ L kx − x0 k2 .
The following result appears in [BRT19]; see also [BHM18]:
Theorem 4.1. Let f ∗ ∈ H(β, L) for β ∈ (0, 1] and L > 0. Suppose the marginal density p of x satisfies
0 < pmin ≤ p(x) ≤ pmax for all x in its support. Then the estimator (29) with kernel (30) satisfies3
h i
2
d . h2β , d . σξ2 (nhd )−1 .
 
EX bias EX var (31)

The result can be extended to smoothness parameters β > 1 [BRT19]. The choice of h = n−1/(2β+d)
balances the two terms and leads to minimax optimal rates for Hölder classes [Tsy08].
In retrospect, Theorem 4.1 should not be surprising, and we mention it here for pedagogical purposes. It
should be clear from the definition (29) that the behavior of the kernel at 0, and in particular the presence
of a singularity, determines whether the estimator fits the training data exactly. This is, however, decoupled
from the level of smoothing, as given by the bandwidth parameter h. In particular, it is the choice of h alone
that determines the bias-variance tradeoff, and the value of the empirical loss cannot inform us whether the
estimator is over-smoothing or under-smoothing the data.
The NW estimator with the singular kernel can be also viewed as adding small “spikes” at the datapoints
on top of the general smooth estimate that arises from averaging the data in a neighborhood of radius h. This
suggests a rather obvious scheme for changing any estimator fb0 into an interpolating one by adding small
deviations around the datapoints: fb(x) := fb0 (x) + ∆(x), where ∆(xj ) = yi − fb0 (xj ) but k∆kL2 (P) = o(1).
The component fb0 is useful for prediction because it is smooth, whereas the spiky component ∆ is useful for
interpolation but does not harm the predictions of fb. Such combinations have been observed experimentally
in other settings and described as “spiked-smooth” estimates [WOBM17]. The examples that we see below
suggest that interpolation may be easier to achieve with high-dimensional data than with low-dimensional
data, and this is consistent with the requirement that the overfitting component ∆ is benign: it need not be
too “irregular” in high dimensions, since typical distances between datapoints in Rd scale at least as n−1/d .
3 In the remainder of this paper, the symbol . denotes inequality up to a multiplicative constant.

21
4.2 Linear regression in the interpolating regime
In the previous section, we observed that the spiky part of the NW estimator, which is responsible for
interpolation, does not hurt the out-of-sample performance when measured in L2 (P). The story for minimum-
norm interpolating linear and kernel regression is significantly more subtle: there is also a decomposition into
a prediction component and an overfitting component, but there is no explicit parameter that trades off bias
and variance. The decomposition depends on the distribution of the data, and the overfitting component
provides a self-induced regularization 4 , similar to the regularization term in ridge regression (16), and this
determines the bias-variance trade-off.
Consider the problem of linear regression in the over-parametrized regime. We assume that the regression
function f ∗ (x) = f (x; θ ∗ ) = hθ ∗ , xi with θ ∗ , x ∈ Rd . We also assume Ex = 0. (While we present the results
for finite d > n, all the statements in this section hold for separable Hilbert spaces of infinite dimension.)
It is easy to see that the excess square loss can be written as
 2
b − L(θ ∗ ) = E f (θ)
L(θ) b − f (θ ∗ ) = kθb − θ ∗ k2 ,
Σ

where we write kvk2Σ := v T Σv and Σ = Exx T . Since d > n, there is not enough data to learn all the d
directions of θ ∗ reliably, unless Σ has favorable spectral properties. To take advantage of such properties,
classical methods—as described in Section 2—resort to explicit regularization (shrinkage) or model com-
plexity control, which inevitably comes at the expense of not fitting the noisy data exactly. In contrast, we
are interested in estimates that interpolate the data. Motivated by the properties of the gradient descent
method (13), we consider the minimal norm linear function that fits the data X, y exactly:
n o
b = argmin kθk2 : hθ, xi i = yi for all i ≤ n .
θ (32)
θ

The solution has a closed form and yields the estimator


b xi = hX † y, xi = (Xx) T (XX T )−1 y,
fb(x) = hθ, (33)
Pn
which can also be written as fb(x) = i=1 yi ωi (x), with

ωi (x) = (x T X † )i = (Xx) T (XX T )−1 ei . (34)

Thus, from (27), the bias term can be written as

2
d = Ex P ⊥ x, θ ∗ =

2
2 1/2 ⊥ ∗
bias Σ P θ , (35)
2

where P ⊥ = Id − X T (XX T )−1 X, and from (28), the variance term is


2
d ≤ σξ2 · Ex (XX T )−1 (Xx) 2 = σξ2 · tr (XX T )−2 XΣX T .

var (36)

We now state our assumptions.


Assumption 4.2. Suppose z = Σ−1/2 x is 1-sub-Gaussian. Without loss of generality, assume Σ =
diag(λ1 , . . . , λd ) with λ1 ≥ · · · ≥ λd .
The central question now is: Are there mechanisms that can ensure small bias and variance of the
minimum-norm interpolant? Surprisingly, we shall see that the answer is yes. To this end, choose an index
k ∈ {1, . . . , d} and consider the subspace spanned by the top k eigenvectors corresponding to λ1 , . . . , λk .
Write x T = [x≤k T
, x>k
T
]. For an appropriate choice of k, it turns out the decomposition of the minimum-norm
4 This is not to be confused with implicit regularization, discussed in Section 3, which describes the properties of the particular

empirical risk minimizer that results from the choice of an optimization algorithm. Self-induced regularization is a statistical
property that also depends on the data-generating mechanism.

22
interpolant as hθ,
b xi = hθ
b≤k , x≤k i + hθ
b>k , x>k i corresponds to a decomposition into a prediction component
and an interpolation component. Write the data matrix as X = [X ≤k , X >k ] and

XX T = X ≤k X ≤k
T T
+ X >k X >k . (37)

Observe that if the eigenvalues of the second part were to be contained in an interval [γ/c, cγ] for some γ
and a constant c, we could write
T
X ≤k X ≤k + γM , (38)

where c−1 In  M  cIn . If we replace M with the approximation In and substitute this expression into
(33), we see that γ would have an effect similar to explicit regularization through a ridge penalty: if that
approximation were precise, the first k components of θ
b would correspond to

b≤k = argmin kX ≤k θ − yk2 + γ kθk2 ,


θ (39)
2 2
θ∈Rk

T
since this has the closed-form solution X ≤k T
(X ≤k X ≤k + γIn )−1 y. Thus, if γ is not too large, we might
expect this approximation to have a minimal impact on the bias and variance of the prediction component.
It is, therefore, natural to ask when to expect such a near-isotropic behavior arising from the “tail”
features. The following lemma provides an answer to this question [BLLT20]:

Lemma 4.3. Suppose coordinates of Σ−1/2 x are independent. Then there exists a constant c > 0 such that,
with probability at least 1 − 2 exp{−n/c},
1X T
λi − cλk+1 n ≤ λmin (X >k X >k )
c
i>k
!
X
T
≤ λmax (X >k X >k ) ≤ c λi + λk+1 n .
i>k

The condition of independence of coordinates in Lemma 4.3 is satisfied for Gaussian x. It can be relaxed
to the following small-ball assumption:
2 2
∃c > 0 : P(c kxk2 ≥ E kxk2 ) ≥ 1 − δ. (40)

Under this assumption, the conclusion of Lemma 4.3 still holds with probability at least 1−2 exp{−n/c}−nδ
[TB20].
T
P An appealing consequence of Lemma 4.3 is the small condition number of X >k X >k for any k such that
i>k λi & λk+1 n. Define the effective rank for a given index k by
P
λi
rk (Σ) = i>k .
λk+1
T
We see that rk (Σ) ≥ bn for some constant b implies that the set of eigenvalues of X >k X >k lies in the
interval [γ/c, cγ] for X
γ= λi ,
i>k

and thus the scale of the self-induced regularization in (38) is the sum of the tail eigenvalues of the covariance
T
operator. Interestingly, the reverse implication also holds: if for some k the condition number of X >k X >k
is at most κ with probability at least 1 − δ, then effective rank rk (Σ) is at least cκ n with probability at least
1 − δ − c exp{−n/c} for some constants c, cκ . Therefore, the condition rk (Σ) & n characterizes
P the indices k
T
such that X >k X >k behaves as a scaling of Id , and the scaling is proportional to i>k λi . We may call the
smallest such index k the effective dimension, for reasons that will be clear in a bit.

23
How do the estimates on tail eigenvalues help in controlling the variance of the minimum-norm inter-
polant? Define
Σ≤k = diag(λ1 , . . . , λk ), Σ>k = diag(λk+1 , . . . , λd ).
Then, omitting σξ2 for the moment, the variance upper bound in (36) can be estimated by

tr (XX T )−2 XΣX T . tr (XX T )−2 X ≤k Σ≤k X ≤k


 T


+ tr (XX T )−2 X >k Σ>k X >k


T

. (41)

The first term is further upper-bounded by

)−2 X ≤k Σ≤k X ≤k
T T

tr (X ≤k X ≤k , (42)

and its expectation corresponds to the variance of k-dimensional regression, which is of the order of k/n.
On the other hand, by Bernstein’s inequality, with probability at least 1 − 2 exp−cn ,
X
T
tr(X >k Σ>k X >k ).n λ2i , (43)
i>k

so we have that the second term in (41) is, with high probability, of order at most

n i>k λ2i
P
.
( i>k λi )2
P

Putting these results together, we have the following theorem [TB20]:


T
Theorem 4.4. Fix δ < 1/2. Under Assumption 4.2, suppose for some k the condition number of X >k X >k
is at most κ with probability at least 1 − δ. Then

λ2
P
n
  
1 k
d . σξ2 κ2 log
var + P i>k i2 (44)
δ n ( i>k λi )

with probability at least 1 − 2δ.


We now turn to the analysis of the bias term. Since the projection operator in (35) annihilates any vector
in the span of the rows of X, we can write
2 2
2
d = Σ1/2 P ⊥ θ ∗ = (Σ − Σ)
b 1/2 P ⊥ θ ∗
, (45)

bias
2 2

b = n−1 X T X is the sample covariance operator. Since projection contracts distances, we obtain an
where Σ
upper bound
2
b 1/2 θ ∗ 2
(Σ − Σ) ≤ kθ ∗ k2 × Σ − Σ . (46)
b
2

The rate of approximation of the covariance operator by its sample-based counterpart has been studied in
[KL17], and we conclude
(r )
2 ∗ 2 r0 (Σ) r0 (Σ)
bias . kθ kΣ max , (47)
n n

(see [BLLT20] for details).


The upper bound in (47) can be sharpened significantly by analyzing the bias in the two subspaces, as
proved in [TB20]:

24
Theorem 4.5. Under the assumptions of Theorem 4.4, for n & log(1/δ), with probability at least 1 − 2δ,
" P 2 #
2 2 λ i 2
d . κ4 θ ∗≤k −1 i>k
+ kθ ∗>k kΣ>k .

bias (48)
Σ≤k n

The following result shows that without further assumptions, the bounds on variance and bias given in
Theorems 4.4 and 4.5 cannot be improved by more than constant factors; see [BLLT20] and [TB20].
Theorem 4.6. There are absolute constants b and c such that for Gaussian x ∼ N(0, Σ), where Σ has
eigenvalues λ1 ≥ λ2 ≥ · · · , with probability at least 1 − exp(−n/c),

λ2
P
n
  
k
d & 1 ∧ σξ2
var + P i>k i2 ,
n ( i>k λi )
d
where k is the effective dimension, k = min {l : rl (Σ) ≥ bn}. Furthermore, for  any θ ∈ R , if the regression
∗ ∗ ∗ d
function f (·) = h·, θ i, where θi = i θi and  = (1 , . . . , d ) ∼ Unif {±1} , then with probability at least
1 − exp(−n/c),
" P 2 #
2 ∗ 2 λ i 2
d & θ ≤k −1
E bias i>k
+ kθ ∗>k kΣ>k .
Σ≤k n

A discussion of Theorems 4.4, 4.5 and 4.6 is in order. First, the upper and lower bounds match up to con-
stants, and in particular both involve the decomposition of fb into a prediction component fb0 (x) := hθ
b≤k , x≤k i
and an interpolation component ∆(x) := hθ >k , x>k i with distinct bias and variance contributions, so this
b
2
decomposition is not an artifact of our analysis. Second, the kθ ∗>k kΣ>k term in the bias and the k/n term
in the variance for the prediction component fb0 correspond to the terms we would get by performing or-
dinary least-squares (OLS) restricted to the first k coordinates of θ. Provided k is small compared to n,
there is enough data to estimate the signal in this k-dimensional component, and the bias contribution is
the approximation error due to truncation at k. The other aspect of the interpolating component ∆ that
could harm prediction accuracy is its variance term. The definition of the effective dimension k implies that
this is no more than a constant, and it is small if the tail eigenvalues decay slowly and d − k  n, for in
that case, the ratio of the squared `1 norm to the squared `2 norm of these eigenvalues is large compared
to n; overparametrization is important. Finally, the bias and variance terms are similar to those that arise
in ridge regression (16), with the regularization coefficient determined by the self-induced regularization.
Indeed, define
bX
λ= λi (49)
n
i>k

for the constant b in the definition of the effective dimension k. That definition implies that λk ≥ λ ≥ λk+1 ,
so we can write the bias and variance terms, within constant factors, as
d d  2
2 λi σξ2 X λi
θi∗ 2
X
d ≈
bias 2,
d ≈
var .
i=1 (1 + λi /λ) n i=1 λ + λi

These are reminiscent of the bias and variance terms that arise in ridge regression (16). Indeed, a ridge
regression estimate in a fixed design setting with X T X = diag(s1 , . . . , sd ) has precisely these bias and
variance terms with λi replaced by si ; see, for example, [DFKU13, Lemma 1]. In Section 4.3.3, we shall see
the same bias-variance decomposition arise in a related setting, but with the dimension growing with sample
size.

25
4.3 Linear regression in Reproducing Kernel Hilbert Spaces
Kernel methods are among the core algorithms in machine learning and statistics. These methods were
introduced to machine learning in the pioneering work of [ABR64] as a generalization of the Perceptron
algorithm to nonlinear functions by lifting the x-variable to a high- or infinite-dimensional feature space.
Our interest in studying kernel methods here is two-fold: on the one hand, as discussed in detail in Sections 5
and 6, sufficiently wide neural networks with random initialization stay close to a certain kernel-based solution
during optimization and are essentially equivalent to a minimum-norm interpolant; on the other hand, it
has been noted that kernel methods exhibit similar surprising behavior of benign interpolation to neural
networks [BMM18].
A kernel method in the regression setting amounts to choosing a feature map x 7→ φ(x) and computing a
(regularized) linear regression solution in the feature space. While Section 4.2 already addressed the question
of overparametrized linear regression, the non-linear feature map φ(x) might not satisfy Assumption 4.2.
In this section, we study interpolating RKHS regression estimates using a more detailed analysis of certain
random kernel matrices.
Since the linear regression solution involves inner products of φ(x) and φ(x0 ), the feature maps do not
need to be computed explicitly. Instead, kernel methods rely on a kernel function k : X × X → R that, in
turn, corresponds to an RKHS H. A classical method is kernel ridge regression (KRR)
1X 2
fb = argmin (f (xi ) − yi )2 + λ kf kH , (50)
f ∈H n i=1

which has been extensively analyzed through the lens of bias-variance tradeoff with an appropriately tuned
parameter λ > 0 [CDV07]. As λ → 0+ , we obtain a minimum-norm interpolant
n o
fb = argmin kf kH : f (xi ) = yi for all i ≤ n , (51)
f ∈H

which has the closed-form solution

fb(x) = K(x, X) T K(X, X)−1 y, (52)

assuming K(X, X) is invertible; see (32) and (33). Here K(X, X) ∈ Rn×n is the kernel matrix with

[K(X, X)]i,j = k(xi , xj ) and K(x, X) = [k(x, x1 ), . . . , k(x, xn )] T .

Alternatively, we can write the solution as


n
X
fb(x) = yi ωi (x) with ωi (x) = K(x, X)K(X, X)−1 ei ,
i=1

which makes it clear that ωi (xj ) = 1 [i = j]. We first describe a setting where this approach does not lead
to benign overfitting.

4.3.1 The Laplace kernel with constant dimension


We consider the Laplace (exponential) kernel on Rd with parameter σ > 0:

kσ (x, x0 ) = σ −d exp{− kx − x0 k2 /σ}.

The RKHS norm corresponding to this kernel can be related to a Sobolev norm, and its RKHS has been
shown [Bac17, GYK+ 20, CX21] to be closely related to the RKHS corresponding to the Neural Tangent
Kernel (NTK), which we study in Section 6.
To motivate the lower bound, consider d = 1. In this case, the minimum-norm solution with the Laplace
kernel corresponds to a rope hanging from nails at heights yi and locations xi ∈ R. If points are ordered

26
x(1) ≤ x(2) ≤ . . . ≤ x(n) , the form of the minimum-norm solution between two adjacent points x(i) , x(i+1)
is only affected by the values y(i) , y(i+1) at these locations. As σ → ∞, the interpolant becomes piece-wise
linear, while for σ → 0, the solution is a sum of spikes at the datapoints and zero everywhere else. In both
cases, the interpolant is not consistent: the error Ekfb − f ∗ k2L2 (P) does not converge to 0 as n increases.
Somewhat surprisingly, there is no choice of σ that can remedy the problem, even if σ is chosen in a data-
dependent manner.
The intuition carries over to the more general case, as long as d is a constant. The following theorem
appears in [RZ19]:
Theorem 4.7. Suppose f ∗ is a smooth function defined on a unit ball in Rd . Assume the probability
distribution of x has density that is bounded above and away from 0. Suppose the noise random variables ξi
are independent Rademacher.5 For fixed n and odd d, with probability at least 1 − O(n−1/2 ), for any choice
σ > 0,
kfb − f ∗ k2L2 (P) = Ωd (1).
Informally, the minimum-norm interpolant with the Laplace kernel does not have the flexibility to both
estimate the regression function and generate interpolating spikes with small L2 (P) norm if the dimension
d is small. For high-dimensional data, however, minimum-norm interpolation with the same kernel can be
more benign, as we see in the next section.

4.3.2 Kernels on Rd with d  nα


Since d = O(1) may lead to inconsistency of the minimum-norm interpolator, we consider here a scaling
d  nα for α ∈ (0, 1]. Some assumption on the independence of coordinates is needed to circumvent the lower
bound of the previous section, and we assume the simplest possible scenario: each coordinate of x ∈ Rd is
independent.
Assumption 4.8. Assume that x ∼ P = p⊗d such that z ∼ p is mean-zero, that for some C > 0 and ν > 1,
P(|z| ≥ t) ≤ C(1 + t)−ν for all t ≥ 0, and that p does not contain atoms.
We only state the results for the inner-product kernel

hx, x0 i
  X
k(x, x0 ) = h , h(t) = αi ti , αi ≥ 0
d i=0

and remark that more general rotationally invariant kernels (including NTK: see Section 6) exhibit the same
behavior under the independent-coordinate assumption [LRZ20].
For brevity, define K = n−1 K(X, X). Let r = (r1 , · · · , rd ) ≥ 0 be a multi-index, and write krk =
Pd
i=1 ri . With this notation, each entry of the kernel matrix can be expanded as

∞ ι
hxi , xj i
X  X
nK ij = αι = cr αkrk pr (xi )pr (xj )/dkrk
ι=0
d r

with
(r1 + · · · + rd )!
cr = ,
r1 ! · · · rd !
and the monomials are pr (xi ) = (xi [1])r1 · · · (xi [d])rd . If h has infinitely many positive coefficients α, each
x is lifted to an infinite-dimensional space. However, the resulting feature map φ(x) is not (in general) sub-
Gaussian. Therefore, results from Section 4.2 are not immediately applicable and a more detailed analysis
that takes advantage of the structure of the feature map is needed.
As before, we separate the high-dimensional feature map into two parts, one corresponding to the pre-
diction component, and the other corresponding to the overfitting part of the minimum-norm interpolant.
5 P(ξ = ±1) = 1/2.
i

27

More precisely, the truncated function h≤ι (t) = i=0 αi ti leads to the degree-bounded component of the
empirical kernel:
[≤ι]
X
nK ij := cr αkrk pr (xi )pr (xj )/dkrk , nK [≤ι] = ΦΦ>
krk≤ι

ι+d
with data X ∈ Rn×d transformed into polynomial features Φ ∈ Rn×( ι ) defined as
1/2
Φi,r = cr αkrk pr (xi )/dkrk/2 .

The following theorem reveals the staircase structure of the eigenvalues of the kernel, with Θ(dι ) eigen-
values of order Ω(d−ι ), as long as n is large enough to sketch these directions; see [LRZ20] and [GMMM20a].
Theorem 4.9. Suppose α0 , . . . , αι0 > 0 and dι0 log d = o(n). Under Assumption 4.8, with probability at
ι0
least 1 − exp−Ω(n/d ) , for any ι ≤ ι0 , K [≤ι] has ι+d nonzero eigenvalues, all of them larger than Cd−ι and

ι
[≤ι]
the range of K is the span of

{(p(x1 ), . . . , p(xn )) : p multivariable polynomial of degree at most ι} .

The component K [≤ι] of the kernel matrix sketches the low-frequency component of the signal in much the
T
same way as the corresponding X ≤k X ≤k in linear regression sketches the top k directions of the population
distribution (see Section 4.2).
Let us explain the key ideas behind the proof of Theorem 4.9. In correspondence with the sample
covariance operator n−1 X ≤kT
X ≤k in the linear case, we define the sample covariance operator Θ[≤ι] :=
−1 >
n Φ Φ. If the monomials pr (x) were orthogonal in L2 (P), then we would have:
h i 0
E Θ[≤ι] = diag(C(0), · · · , C(ι0 )d−ι , · · · , C(ι)d−ι )
| {z }
(d+ι−1
d−1 ) such entries

where C(ι) denotes constants that depend on ι. Since under our general assumptions on the distribution
this orthogonality does not necessarily hold, we employ the Gram-Schmidt process on the basis {1, t, t2 , . . .}
with respect to L2 (p) to produce an orthogonal polynomial basis q0 , q1 , . . .. This yields new features
1/2 Y
Ψi,r = cr αkrk qr (xi )/dkrk/2 , qr (x) = qrj (x[j]).
j∈[d]

As shown in [LRZ20], these features are weakly dependent and the orthogonalization process does not distort
the eigenvalues of the covariance matrix by more than a multiplicative constant. A small-ball method [KM15]
can then be used to prove the lower bound for the eigenvalues of ΨΨ T and thus establish Theorem 4.9.
We now turn to variance and bias calculations. The analogue of (36) becomes
2
d ≤ σξ2 · Ex K(X, X)−1 K(X, x) 2

var (53)

and, similarly to (37), we split the kernel matrix into two parts, according to the degree ι.
The following theorem establishes an upper bound on (53) [LRZ20]:
Theorem 4.10. Under Assumption 4.8 and the additional assumption of sub-Gaussianity of the distribution
p for the coordinates of x, if α1 , . . . , αι > 0, there exists ι0 ≥ 2ι + 3 with αι0 > 0, and dι log d . n . dι+1 ,
ι
then with probability at least 1 − exp−Ω(n/d ) ,
 ι 
d n
d . σξ2 ·
var + ι+1 . (54)
n d

28
Notice that the behavior of the upper bound changes as n increases from dι to dι+1 . At d  nι , variance
is large since there is not enough data to reliably estimate all the dι directions in the feature space. As n
increases, variance in the first dι directions decreases; new directions in the data appear (those corresponding
to monomials of degree ι + 1, with smaller population eigenvalues) but cannot be reliably estimated. This
second part of (54) grows linearly with n, similarly to the second term in (44). The split between these two
terms occurs at the effective dimension defined in Section 4.2.
Two aspects of the multiple-descent behavior of the upper bound (54) should be noted. First, variance
is small when dι  n  dι+1 , between the peaks; second, the valleys become deeper as d becomes larger,
with variance at most d−1/2 at n = dι+1/2 .
We complete the discussion of this section by exhibiting one possible upper bound on the bias term
[LRZ20]:
Theorem 4.11. Assume the regression function can be written as
Z Z
f ∗ (x) = k(x, z)ρ∗ (z)P(dz) with ρ4∗ (z)P(dz) ≤ c.

Let Assumption 4.8 hold, and suppose supx k(x, x) . 1. Then


 
2 2 1
d . δ −1/2 Ex K(X, X)−1 K(X, x) 2 +

bias (55)
n

d ξ2 and can be bounded as in Theo-


with probability at least 1 − δ. The above expectation is precisely var/σ
rem 4.10.

4.3.3 Kernels on Rd with d  n


We now turn our attention to the regime d  n and investigate the behavior of minimum norm interpolants
in the RKHS in this high-dimensional setting. Random kernel matrices in the d  n regime have been
extensively studied in the last ten years. As shown in [EK10], under assumptions specified below, the kernel
matrix can be approximated in operator norm by

XX T
K(X, X) ≈ c1 + c2 In ,
d
that is, a linear kernel plus a scaling of the identity. While this equivalence can be viewed as a negative
result about the utility of kernels in the d  n regime, the term c2 In provides implicit regularization for the
minimum-norm interpolant in the RKHS [LR20].
We make the following assumptions.
Assumption 4.12. We assume that coordinates of z = Σ−1/2 x are independent, with zero mean and unit
variance, so that Σ = Exx T . Further assume there are constants 0 < η, M < ∞, such that the following
hold.
(a) For all i ≤ d, E[|z i |8+η ] ≤ M .
Pd
(b) kΣk ≤ M , d−1 i=1 λ−1 i ≤ M , where λ1 , . . . , λd are the eigenvalues of Σ.

Note that, for i 6= j, the rescaled scalar products hxi , xj i /d are typically of order 1/ d. We can therefore
approximate the kernel function by its Taylor expansion around 0. To this end, define

tr(Σ2 )
α := h(0) + h00 (0) , β := h0 (0),
2d2
1 
h(tr(Σ)/d) − h(0) − h0 (0)tr(Σ/d) .

γ :=
h0 (0)

29
Under Assumption 4.12, a variant of a result of [EK10] implies that for some c0 ∈ (0, 1/2), the following
holds with high probability
K(X, X) − K lin (X, X) . d−c0

(56)

where
XX T
K lin (X, X) = β + βγIn + α11 T . (57)
d
To make the self-induced regularization due to the ridge apparent, we develop an upper bound on the
variance of the minimum-norm interpolant in (53). Up to an additive diminishing factor, this expression can
be replaced by

σξ2 · tr (XX T + dγIn )−2 XΣX T ,



(58)

where we assumed without loss of generality that α = 0. Comparing to (41), we observe that here implicit
regularization arises due to the ‘curvature’ of the kernel, in addition to any favorable tail behavior in the
spectrum of XX T . Furthermore, this regularization arises under rather weak assumptions on the random
variables even if Assumption 4.2 is not satisfied. A variant of the development in [LR20] yields a more
interpretable upper bound of
 
1 k
d . σξ2 ·
var + λk+1 (59)
γ n

for any k ≥ 1 [Lia20]; the proof is in the Supplementary Material. Furthermore, a high probability bound
on the bias
 r 
1 X 1 k
2
d . kf ∗ k2H · inf
bias λj ( XX T ) + γ + (60)
0≤k≤n  n d n
j>k

can be established with basic tools from empirical process theory under boundedness assumptions on
supx k(x, x) [LR20].
With more recent developments on the bias and variance of linear interpolants in [HMRT20], a signifi-
cantly more precise statement can be derived for the d  n regime. The proof of the following theorem is in
the Supplementary Material.
Theorem 4.13. Let 0 < M, η < ∞ be fixed constants and suppose that Assumption 4.12 holds with M −1 ≤
d/n ≤ M . Further assume that h is continuous on R and smooth in a neighborhood of 0 with h(0), h0 (0) > 0,
that kf ∗ kL4+η (P) ≤ M and that the zi are M -subgaussian. Let yi = f ∗ (xi ) + ξi , E(ξi2 ) = σξ2 , and β 0 :=
Σ−1 E[xf ∗ (x)]. Let λ∗ > 0 be the unique positive solution of
 γ   
n 1− = tr Σ(Σ + λ∗ I)−1 . (61)
λ∗
Define B(Σ, β 0 ) and V (Σ) by

tr Σ2 (Σ + λ∗ I)−2

V (Σ) := , (62)
n − tr Σ2 (Σ + λ∗ I)−2
λ2∗ hβ 0 , (Σ + λ∗ I)−2 Σβ 0 i
B(Σ, β 0 ) := . (63)
1 − n−1 tr Σ2 (Σ + λ∗ I)−2
2
Finally, let bias
d and var d denote the squared bias and variance for the minimum-norm interpolant (51).
Then there exist C, c0 > 0 (depending also on the constants in Assumption 4.12) such that the following

30
holds with probability at least 1 − Cn−1/4 (here P>1 denotes the projector orthogonal to affine functions in
L2 (P)):
2
d − B(Σ, β 0 ) − kP>1 f ∗ k2L2 (1 + V (Σ)) ≤ Cn−c0 ,

bias (64)
d − σξ V (Σ) ≤ Cn 0 .
2 −c

var (65)

A few remarks are in order. First, note that the left hand side of (61) is strictly increasing in λ∗ , while
the right hand side is strictly decreasing. By considering the limits as λ∗ → 0 and λ∗ → ∞, it is easy to see
that this equation indeed admits a unique solution. Second, the bias estimate in (60) requires f ∗ ∈ H, while
the bias calculation in (64) does not make this assumption, but instead incurs an approximation error for
non-linear components of f ∗ .
We now remark that the minimum-norm interpolant with kernel K lin is simply ridge regression with
respect to the plain covariates X and ridge penalty proportional to γ:
1 y − θ0 − Xθ 2 + γkθk22 .

(θb0 , θ)
b := argmin
2
(66)
θ0 ,θ d

The intuition is that the minimum-norm interpolant for the original kernel takes the form fb(x) = θb0 +
hθ,
b xi + ∆(x). Here θb0 + hθ, b xi is a simple component, and ∆(x) is an overfitting component: a function
that is small in L2 (P) but allows interpolation of the data.
The characterization in (61), (62), and (63) can be shown to imply upper bounds that are related to the
analysis in Section 4.2.
Corollary 4.14. Under the assumptions of Theorem 4.13, further assume that f ∗ (x) = hβ 0 , xi is linear
and that there is an integer k ∈ N, and a constant c∗ > 0 such that rk (Σ) + (nγ/c∗ λk+1 ) ≥ (1 + c∗ )n. Then
there exists c0 ∈ (0, 1/2) such that, with high probability, the following hold as long as the right-hand side is
less than one:
d 2
2
 1 X
bias ≤ 4 γ +
d λi kβ 0,≤k k2Σ−1 + kβ 0,>k k2Σ + n−c0 , (67)
n
i=k+1
Pd
2kσξ2 4nσξ2 i=k+1 λi
2
d ≤
var + Pd + n−c0 . (68)
n c∗ (nγ/c∗ + i=k+1 λi )2

Further, under the same assumptions, the effective regularization λ∗ (that is, the unique solution of (61)),
satisfies
d d
c∗ 1 X 2 X
γ+ λi ≤ λ∗ ≤ 2γ + λi . (69)
1 + c∗ n n
i=k+1 i=k+1

Note that apart from the n−c0 term, (67) recovers the result of Theorem 4.5, while (68) recovers Theorem
4.4 (setting γ = 0), both with improved constants but limited to the proportional regime. We remark that
analogues of Theorems 4.4, 4.5, and 4.6 for ridge regression with γ 6= 0 can be found in [TB20].
The formulas (61), (62), and (63) might seem somewhat mysterious. However, they have an appealing
interpretation in terms of a simpler model that we will refer to as a ‘sequence model’ (this terminology comes
from classical statistical estimation theory [Joh19]). As stated precisely in the remark below, the sequence
model is a linear regression model in which the design matrix is deterministic (and diagonal), and the noise
and regularization levels are determined via a fixed point equation.
Remark 4.15. Assume without loss of generality Σ = diag(λ1 , . . . , λd ). In the sequence model we observe
y seq ∈ Rd distributed according to
1/2 τ
i = λi β0,i + √ gi , (gi )i≤d ∼iid N(0, 1) ,
y seq (70)
n

31
where τ is a parameter given below. We then perform ridge regression with regularization λ∗ :
b seq (λ∗ ) := argmin y seq − Σ1/2 β 2 + λ∗ kβk2 ,

β 2 2 (71)
β

which can be written in closed form as


1/2
λ y seq
βbiseq (λ∗ ) = i i . (72)
λ∗ + λi
b seq (λ∗ ) − β k2 . Then under the assumption
The noise level τ 2 is then fixed via the condition τ 2 = σξ2 + Ekβ 0 2
that f ∗ is linear, Theorem 4.13 states that
b seq (λ∗ ) − β k2 + O(n−c0 )
E{(f ∗ (x) − fb(x))2 |X} = Ekβ (73)
0 2

with high probability.


To conclude this section, we summarize the insights gained from the analyses of several models in the
interpolation regime. First, in all cases, the interpolating solution fb can be decomposed into a prediction (or
simple) component and an overfitting (or spiky) component. The latter ensures interpolation without hurting
prediction accuracy. In the next section, we show, under appropriate conditions on the parameterization and
the initialization, that gradient methods can be accurately approximated by their linearization, and hence
can be viewed as converging to a minimum-norm linear interpolating solution despite their non-convexity.
In Section 6, we return to the question of generalization, focusing specifically on two-layer neural networks
in linear regimes.

5 Efficient optimization
The empirical risk minimization (ERM) problem is, in general, intractable even in simple cases. Section 2.5
gives examples of such hardness results. The classical approach to address this conundrum is to construct
convex surrogates of the non-convex ERM problem. The problem of learning a linear classifier provides an
easy-to-state—and yet subtle—example. Considering the 0-1 loss, ERM reads
n
b 01 (θ) := 1
X
minimize L 1 [yi hθ, xi i ≤ 0] . (74)
n i=1

Note however that the original problem (74) is not always intractable. If there exists θ ∈ Rp such that
L(θ)
b = 0, then finding θ amounts to solving a set of n linear inequalities. This can be done in polynomial
time. In other words, when the model is sufficiently rich to interpolate the data, an interpolator can be
constructed efficiently.
In the case of linear classifiers, tractability arises because of the specific structure of the function class
(which is linear in the parameters θ), but one might wonder whether it is instead a more general phenomenon.
The problem of finding an interpolator can be phrased as a constraint optimization problem. Write the
empirical risk as
n
1X
L(θ)
b = `(θ; yi , xi ).
n i=1
Then we are seeking θ ∈ Θ such that

`(θ; yi , xi ) = 0 for all i ≤ n . (75)

Random constraint satisfaction problems have been studied in depth over the last twenty years, although
under different distributions from those arising from neural network theory. Nevertheless, a recurring obser-
vation is that, when the number of free parameters is sufficiently large compared to the number of constraints,

32
these problems (which are NP-hard in the worst case) become tractable; see, for example, [FS96, AM97] and
[CO10].
These remarks motivate a fascinating working hypothesis: modern neural networks are tractable because
they are overparametrized.
Unfortunately, a satisfactory theory of this phenomenon is still lacking, with an important exception: the
linear regime. This is a training regime in which the network can be approximated by a linear model, with a
random featurization map associated with the training initialization. We discuss these results in Section 5.1.
While the linear theory can explain a number of phenomena observed in practical neural networks, it
also misses some important properties. We will discuss these points, and results beyond the linear regime,
in Section 5.2.

5.1 The linear regime


Consider a neural network with parameters θ ∈ Rp : for an input x ∈ Rd the network outputs f (x; θ) ∈ R.
We consider training using the square loss
n
1 X 2 1
y − fn (θ) 2 .

L(θ)
b := yi − f (xi ; θ) = 2
(76)
2n i=1 2n

Here y = (y1 , . . . , yn ) and fn : Rp → Rn maps the parameter vector θ to the evaluation of f at the n
data points, fn : θ 7→ (f (x1 ; θ), . . . , f (xn ; θ)). We minimize this empirical risk using gradient flow, with
initialization θ 0 :
dθ t 1
= Dfn (θ t )T (y − fn (θ t )) . (77)
dt n
Here Dfn (θ) ∈ Rn×p is the Jacobian matrix of the map fn . Our focus on the square loss and continuous
time is for simplicity of exposition. Results of the type presented below have been proved for more general
loss functions and for discrete-time and stochastic gradient methods.
As first argued in [JGH18], in a highly overparametrized regime it can happen that θ changes only
slightly with respect to the initialization θ 0 . This suggests comparing the original gradient flow with the
one obtained by linearizing the right-hand side of (77) around the initialization θ 0 :

dθ t 1
= Dfn (θ 0 )T y − fn (θ 0 ) − Dfn (θ 0 )(θ t − θ 0 ) .

(78)
dt n
More precisely, this is the gradient flow for the risk function

b lin (θ) := 1 ky − fn (θ 0 ) − Dfn (θ 0 )(θ − θ 0 )k2 ,


L (79)
2
2n
which is obtained by replacing fn (θ) with its first-order Taylor expansion at θ 0 . Of course, L b lin (θ) is
quadratic in θ. In particular, if the Jacobian Dfn (θ 0 ) has full row rank, the set of global minimizers
ERM0 := {θ : L b lin (θ) = 0} forms an affine space of dimension p − n. In this case, gradient flow converges
to θ ∞ ∈ ERM0 , which—as discussed in Section 3—minimizes the `2 distance from the initialization:
n o
θ ∞ := argmin kθ − θ 0 k2 : Dfn (θ 0 )(θ − θ 0 ) = y − fn (θ 0 ) . (80)

The linear (or ‘lazy’ ) regime is a training regime in which θ t is well approximated by θ t at all times. Of
course if fn (θ) is an affine function of θ, that is, if Dfn (θ) is constant, then we have θ t = θ t for all times t.
It is therefore natural to quantify deviations from linearity by defining the Lipschitz constant
kDfn (θ 1 ) − Dfn (θ 2 )k
Lip(Dfn ) := sup . (81)
θ 1 6=θ 2 kθ 1 − θ 2 k2

33
(For a matrix A ∈ Rn×p , we define kAk := supx6=0 kAxk2 /kxk2 .) It is also useful to define a population
version of the last quantity. For this, we assume as usual that samples are i.i.d. draws (xi )i≤n ∼iid P, and
with a slight abuse of notation, we view f : θ 7→ f (θ) as a map from Rp to L2 (P) := L2 (Rd ; P). We let Df (θ)
denote the differential of this map at θ, which is a linear operator, Df (θ) : Rp → L2 (P). The corresponding
operator norm and Lipschitz constant are given by

kDf (θ)vkL2 (P)


kDf (θ)k := sup , (82)
v∈Rp \{0} kvk2
kDf (θ 1 ) − Df (θ 2 )k
Lip(Df ) := sup . (83)
θ 1 6=θ 2 kθ 1 − θ 2 k2

The next theorem establishes sufficient conditions for θ t to remain in the linear regime in terms of the
singular values and Lipschitz constant of the Jacobian. Statements of this type were proved in several papers,
starting with [DZPS19]; see, for example, [AZLS19, DLL+ 19, ZCZG20, OS20] and [LZB20]. We follow the
abstract point of view in [OS19] and [COB19].

Theorem 5.1. Assume


1 2
Lip(Dfn ) ky − fn (θ 0 )k2 < σ (Dfn (θ 0 )) . (84)
4 min
Further define
σmax := σmax (Dfn (θ 0 )), σmin := σmin (Dfn (θ 0 )).
Then the following hold for all t > 0:
2
1. The empirical risk decreases exponentially fast to 0, with rate λ0 = σmin /(2n):

L(θ b 0 ) e−λ0 t .
b t ) ≤ L(θ (85)

2. The parameters stay close to the initialization and are closely tracked by those of the linearized flow.
Specifically, letting Ln := Lip(Dfn ),
2
kθ t − θ 0 k2 ≤ ky − fn (θ 0 )k2 , (86)
σmin
n 32σ 16Ln o
max
kθ t − θ t k2 ≤ 2 ky − fn (θ 0 )k2 + 3 ky − fn (θ 0 )k22
σmin σmin
2
180Ln σmax
∧ 5 ky − fn (θ 0 )k22 . (87)
σmin

3. The models constructed by gradient flow and by the linearized flow are similar on test data. Specifically,
writing f lin (θ) = f (θ 0 ) + Df (θ 0 )(θ − θ 0 ), we have

kf (θ t ) − f lin (θ t )kL2 (P)


n 1 Ln σ 2 o
≤ 4 Lip(Df ) 2 + 180kDf (θ 0 )k 5 max ky − fn (θ 0 )k22 . (88)
σmin σmin

The bounds in (85) and (86) follow from the main result of [OS19]. The coupling bounds in (87) and
(88) are proved in the Supplementary Material.
A key role in this theorem is played by the singular values of the Jacobian at initialization, Dfn (θ 0 ).
These can also be encoded in the kernel matrix K m,0 := Dfn (θ 0 )Dfn (θ 0 )T ∈ Rn×n . The importance of

34
this matrix can be easily understood by writing the evolution of the predicted values fnlin (θ t ) := fn (θ 0 ) +
Dfn (θ 0 )(θ t − θ 0 ). Equation (78) implies
dfnlin (θ t ) 1 
= K m,0 y − fnlin (θ t ) . (89)
dt n
Equivalently, the residuals r t := y − fnlin (θ t ) are driven to zero according to (d/dt)r t = −K m,0 r t /n.
Applying Theorem 5.1 requires the evaluation of the minimum and maximum singular values of the
Jacobian, as well as its Lipschitz constant. As an example, we consider the case of two-layer neural networks:
m
α X
f (x; θ) := √ bj σ(hwj , xi), θ = (w1 , . . . , wm ) . (90)
m j=1

To simplify our task, we assume the second layer weights b = (b1 , . . . , bm ) ∈ {+1, −1}m to be fixed with an
equal number of +1s and −1s. Without loss of generality we can assume that b1 = · · · = bm/2 = +1 and
bm/2+1 = · · · = bm = −1. We train the weights w1 , . . . , wm via gradient flow. The number of parameters
is p = md. The scaling factor α allows tuning between different regimes. We consider two initializations,
(1) (2)
denoted by θ 0 and θ 0 :
(1)
θ0 : (wi )i≤m ∼i.i.d. Unif(Sd−1 ); (91)
(2)
θ0 : (wi )i≤m/2 ∼i.i.d. Unif(Sd−1 ), wm/2+i = wi , i ≤ m/2, (92)

where Sd−1 denotes the unit sphere in d dimensions. The important difference between these initializations
(1) (2)
is that (by the central limit theorem) |f (x; θ 0 )| = Θ(α), while f (x; θ 0 ) = 0.
n×md
It is easy to compute the Jacobian Dfn (x; θ) ∈ R :
α
[Dfn (x; θ)]i,(j,a) = √ bj σ 0 (hwj , xi i) xia , i ∈ [n], (j, a) ∈ [m] × [d] . (93)
m
Assumption 5.2. Let σ : R → R be a fixed activation function which we assume differentiable with bounded
first and second order derivatives. Let X
σ= µ` (σ)h`
`≥0

denote its decomposition into orthonormal Hermite polynomials. Assume µ` (σ) 6= 0 for all ` ≤ `0 for some
constant `0 .
Lemma 5.3. Under Assumption 5.2, further assume {(xi , yi )}i≤n to be i.i.d. with xi ∼i.i.d. N(0, Id ), and yi
B 2 -sub-Gaussian. Then there exist constants Ci , depending uniquely on σ, such that the following hold with
probability at least 1 − 2 exp{−n/C0 }, provided md ≥ C0 n log n and n ≤ d`0 (whenever not specified, these
(1) (2)
hold for both initializations θ 0 ∈ {θ 0 , θ 0 }):
(1) √
ky − fn (θ 0 )k2 ≤ C1 B + α) n (94)
(2) √
ky − fn (θ 0 )k2 ≤ C1 B n , (95)

σmin (Dfn (θ 0 )) ≥ C2 α d , (96)
√ √ 
σmax (Dfn (θ 0 )) ≤ C3 α n + d , (97)
√ 
r
d √
Lip(Dfn ) ≤ C4 α n+ d . (98)
m
Further
kDf (θ 0 )k ≤ C10 α , (99)
r
d
Lip(Df ) ≤ C40 α . (100)
m

35
Equations (94), (95) are straightforward [OS19]. The remaining inequalities are proved in the Supple-
mentary Material. Using these estimates in Theorem 5.1, we get the following approximation theorem for
two-layer neural nets.
Theorem 5.4. Consider the two layer neural network of (90) under the assumptions of Lemma 5.3. Further
(1) (2)
let α := α/(1 + α) for initialization θ 0 = θ 0 and α := α for θ 0 = θ 0 . Then there exist constants Ci ,
`0
depending uniquely on σ, such that if md ≥ C0 n log n, d ≤ n ≤ d and
r
n2
α ≥ C0 , (101)
md
then, with probability at least 1 − 2 exp{−n/C0 }, the following hold for all t ≥ 0.
1. Gradient flow converges exponentially fast to a global minimizer. Specifically, letting λ∗ = C1 α2 d/n,
we have

L(θ b 0 ) e−λ∗ t .
b t ) ≤ L(θ (102)

2. The model constructed by gradient flow and linearized flow are similar on test data, namely
( r r )
α n2 1 n5
kf (θ t ) − flin (θ t )kL2 (P) ≤ C1 + . (103)
α2 md α2 md4

It is instructive to consider Theorem 5.4 for two different choices of α (a third one will be considered in
Section 5.2).
(2)
For α = Θ(1), we have α = Θ(1) and therefore the two initializations {θ (1) , θ 0 } behave similarly. In
2
particular, condition (101) requires md  n : the number of network parameters must be quadratic in the
sample size. This is significantly stronger than the simple condition that the network is overparametrized,
namely md  n. Under the condition md  n2 we have exponential convergence to vanishing training
error, and the difference between the neural network and its linearization is bounded as in (103). This
bound vanishes for m  n5 /d4 . While we do not expect this condition to be tight, it implies that, under
the choice α = Θ(1), sufficiently wide networks behave as linearly parametrized models.
(1)
For α → ∞, we have α → 1 for initialization θ 0 and therefore Theorem 5.4 yields the same bounds
(2)
as in the previous paragraph for this initialization. However, for the initialization θ 0 = θ 0 (which is
(2)
constructed so that f (θ 0 ) = 0) we have α = α and condition (101) is always verified as α → ∞. Therefore
the conclusions of Theorem 5.4 apply under nearly minimal overparametrization, namely if md  n log n.
In that case, the linear model is an arbitrarily good approximation of the neural net as α grows: kf (θ t ) −
flin (θ t )kL2 (P) = O(1/α). In other words, an overparametrized neural network can be trained in the linearized
regime by choosing suitable initializations and suitable scaling of the parameters.
Recall that, as t → ∞, θ t converges to the min-norm interpolant θ ∞ ; see (80). Therefore, as long as
condition (101) holds and the right-hand side of (103) is negligible, the generalization properties of the neural
network are well approximated by those of min-norm interpolation in a linear model with featurization map
x 7→ Df (x; θ 0 ). We will study the latter in Section 6.
In the next subsection we will see that the linear theory outlined here fails to capture different training
schemes in which the network weights genuinely change.

5.2 Beyond the linear regime?


For a given dimension d and sample size n, we can distinguish two ways to violate the conditions for the linear
regime, as stated for instance in Theorem 5.4. First, we can reduce the network size m. While Theorem 5.4
does not specify the minimum m under which the conclusions of the theorem cease to hold, it is clear that
md ≥ n is necessary in order for the training error to vanish as in (102).

36
However, even if the model is overparametrized,
√ the same condition is violated if α is sufficiently small.
In particular, the limit m → ∞ with α = α0 / m has attracted considerable attention and is known as the
mean field limit. In order to motivate the mean field analysis, we can suggestively rewrite (90) as
Z
f (x; θ) := α0 b σ(hw, xi) ρb(dw, db) , (104)

Pm
where ρb := m−1 j=1 δwj ,bj is the empirical distribution of neuron weights. If the weights are drawn i.i.d.
from a common distribution (wj , bj ) ∼ ρ, we can asymptotically replace ρb with ρ in the above expression,
by the law of large numbers.
The gradient flow (77) defines an evolution over the space of neuron weights, and hence an evolution in the
space of empirical distributions ρb. It is natural to ask whether this evolution admits a simple characterization.
This question was first addressed by [NS17, MMN18, RVE18, SS20] and [CB18].

Theorem 5.5. Initialize the weights so that {(wj , bj )}j≤m ∼i.i.d. ρ0 with ρ0 a probability measure on
Rd+1 . Further, assume the activation function u 7→ σ(u) to be differentiable with σ 0 bounded and Lipschitz
continuous, and assume |bj | ≤ C almost surely under the initialization ρ0 , for some constant C. Then, for
any fixed T ≥ 0, the following limit holds in L2 (P), uniformly over t ∈ [0, T ]:
Z
lim f (θ mt ) = F (ρt ) := α0 b σ(hw, · i) ρt (dw, db) , (105)
m→∞

where ρt is a probability measure on Rd+1 that solves the following partial differential equation (to be inter-
preted in the weak sense):

∂t ρt (w, b) = α0 ∇(ρt (w, b)∇Ψ(w, b; ρt )) , (106)


 
Ψ(w, b; ρ) := E b bσ(hw, xi) F (x; ρt ) − y . (107)

Here the gradient ∇ is with respect to (w, b) (gradient in d + 1 dimensions) if both first- and second-layer
weights are trained, and only with respect to w (gradient in d dimensions) if only first-layer weights are
trained.
This statement can be obtained by checking the conditions of [CB18, Theorem 2.6]. A quantitative
version can be obtained for bounded σ using Theorem 1 of [MMM19].
A few remarks are in order. First, the limit in (105) requires time to be accelerated by a factor m. This
is to compensate for the fact that the function value is scaled by a factor 1/m. Second, while we stated
this theorem as an asymptotic result, for large m, the evolution described by the PDE (106) holds at any
finite m for the empirical measure ρbt . In that case, the gradient of ρt is not well defined, and it is important
to interpret this equation in the weak sense [AGS08, San15]. The advantage of working with the average
measure ρt instead of the empirical one ρbt is that the former is deterministic and has a positive density (this
has important connections to global convergence). Third, quantitative versions of this theorem were proved
in [MMN18, MMM19], and generalizations to multi-layer networks in [NP20].
Mean-field theory can be used to prove global convergence results. Before discussing these results, let us
emphasize that —in this regime— the weights move in a non-trivial way during training, despite the fact
that the network is infinitely wide. For the sake of simplicity, we will focus on the case already treated
in the previous section in which the weights bj ∈ {+1, −1} are initialized with signs in equal proportions,
and are not changed during training. Let us first consider the evolution of the predicted values Fn (ρt ) :=
(F (x1 ; ρt ), . . . , F (xn ; ρt )). Manipulating (106), we get

d 1 
Fn (ρt ) = − K t Fn (ρt ) − y , K t = (Kt (xi , xj ))i,j≤n (108)
dt Zn
Kt (x1 , x2 ) := hx1 , x2 iσ 0 (hw, x1 i)σ 0 (hw, x2 i) ρt (db, dw) , (109)

37
In the short-time limit we recover the linearized evolution of (89) [MMM19], but the kernel Kt is now
changing with training (with a factor m acceleration in time).
It also follows from the same characterization of Theorem 5.5 that the weight wj of a neuron with weight
b x σ 0 (hw, xi(F (x; ρt ) − y)}. This implies
(wj , bj ) = (w, b) moves at a speed E{b

1 W t+s − W t k2F = v2 (ρt ) s2 + o(s2 ) ,


lim (110)
m→∞ m
1
v2 (ρt ) := 2 hy − Fn (ρt ), K t (y − Fn (ρt ))i . (111)
n
This expression implies that the first-layer weights change significantly more than in the linear regime studied
in Section 5.1. As an example, consider the setting of Lemma 5.3, namely data (xi )i≤n ∼i.i.d. N(0, Id ), an
activation function satisfying Assumption 5.2 and dimension parameters such that md ≥ Cn log n, n ≤ d`0 .
We further initialize ρ0 = Unif(Sd−1 ) ⊗ Unif({+1, −1}) (that is, the vectors wj are uniform on the unit
sphere and the weights bj are uniform in {+1, −1}). Under this initialization kW 0 k2F = m and hence (110)
at t = 0 can be interpreted as describing the initial relative change of the first-layer weights.
Theorem 5.1 (see (86)) and Lemma 5.3 (see (94)–(96)) imply that, with high probability,
r
1 1 n
sup √ kW t − W 0 kF ≤ C , (112)
t≥0 m α md

(1) (2)
where α = α/(1 + α) for initialization θ 0 and α = α for p initialization θ 0 . In the mean field regime

α  α  1/ m and the right hand side above is of order n/d, and hence it does not vanish. This is not
due to a weakness of the analysis. By (110), we can choose ε a small enough constant so that
1 1 1
lim sup √ kW t − W 0 kF ≥ lim √ kW ε − W 0 kF ≥ v2 (ρ0 )1/2 ε . (113)
m→∞ t≥0 m m→∞ m 2

This is bounded away from 0 as long as v2 (ρ0 ) is non-vanishing. In order to see this, note that λmin (K 0 ) ≥ c0 d
with high probability for c0 a constant (note that K 0 is a kernel inner product random matrix,R and hence this
claim follows from the general results of [MMM21]). Noting that Fn (ρ0 ) = 0 (because bρ0 (db, dw) = 0),
this implies, with high probability,

1 c0 d c0 d
v(ρ0 ) = 2
hy, K 0 yi ≥ 2 kyk22 ≥ 0 . (114)
n n n
We expect this lower bound to be tight, as can be seen by considering the pure noise case y ∼ N(0, τ 2 In ),
which leads to v(ρ0 ) = τ 2 tr(K 0 )/n2 (1 +√
on (1))  d/n.
To summarize, (112) (setting α  1/ m) and (113) conclude that, for d ≤ n ≤ d`0 ,
r r
d 1 n
c1 ≤ lim sup √ kW t − W 0 kF ≤ c2 , (115)
n m→∞ t≥0 m d

hence the limit on the left-hand side of (113) is indeed non-vanishing as m → ∞ at n, d fixed. In other
words, the fact that the upper bound in (112) is non-vanishing is not an artifact of the bounding technique,
but a consequence of the change of training regime. We also note a gap between the upper and lower bounds
in (115) when n  d: a better understanding of this quantity is an interesting open problem. In conclusion,
both a linear and a nonlinear regime can be obtained in the infinite-width limit of two-layer neural networks,
for different scalings of the normalization factor α.
As mentioned above, the mean field limit can be used to prove global convergence results, both for
two-layer [MMN18, CB18] and for multilayer networks [NP20]. Rather than stating these (rather technical)
results formally, it is instructive to discuss the nature of fixed points of the evolution (106): this will also
indicate the key role played by the support of the distribution ρt .

38
Lemma 5.6. Assume t 7→ σ(t) to be differentiable with bounded derivative. Let L(ρ) b = E{[y
b − F (x; ρ)]2 } be
the empirical risk of an infinite-width network with neuron’s distribution ρ, and define ψ(w; ρ) := E{σ(hw,
b xi)[y−
F (x; ρ)]}.
b if and only if ψ(w; ρ∗ ) = 0 for all w ∈ Rd .
(a) ρ∗ is a global minimizer of L

(b) ρ∗ is a fixed point of the evolution (106) if and only if, for all (b, w) ∈ supp(ρ∗ ), we have ψ(w; ρ∗ ) = 0
and b∇w ψ(w; ρ∗ ) = 0.
The same statement holds if the empirical averages above are replaced by population averages (that is, the
empirical risk L(ρ)
b is replaced by its population version Ln (ρ) = E{[y − F (x; ρ)]2 }).
This statement clarifies that fixed points of the gradient flow are only a ‘small’ superset of global mini-
mizers, as m → ∞. Consider for instance the case of an analytic activation function t 7→ σ(t). Let ρ∗ be a
stationary point and assume that its support contains a sequence of distinct points {(bi , wi )}i≥1 such that
{wi }i≥1 has an accumulation point. Then, by condition (b), ψ(w; ρ∗ ) = 0 identically and therefore ρ∗ is
a global minimum. In other words, the only local minima correspond to ρ∗ supported on a set of isolated
points. Global convergence proofs aim at ruling out this case.

5.3 Other approaches


The mean-field limit is only one of several analytical approaches that have been developed to understand
training beyond the linear regime. A full survey of these directions goes beyond the scope of this review.
Here we limit ourselves to highlighting a few of them that have a direct connection to the analysis in the
previous section.
A natural idea is to view the linearized evolution as the first order in a Taylor expansion, and to construct
higher order approximations. This can be achieved by writing an ordinary differential equation for the
evolution of the kernel K t (see (109) for the infinite-width limit). This takes the form [HY20]

d 1 (3)
K t = − K t · (Fn (ρt ) − y) , (116)
dt n
(3)
where K t ∈ (Rn )⊗3 is a certain higher order kernel (an order-3 tensor), which is contracted along one
(3)
direction with (Fn (ρt ) − y) ∈ Rn . The linearized approximation amounts to replacing K t with 0. A
(3) (3)
better approximation could be to replace K t with its value at initialization K 0 . This construction can
be repeated, leading to a hierarchy of increasingly complex (and accurate) approximations.
Other approaches towards constructing a Taylor expansion around the linearized evolutions were pro-
posed, among others, by [DGA20] and [HN20].
Note that the linearized approximation relies on the assumption that the Jacobian Dfn (θ 0 ) is non-
vanishing and well conditioned. [BL20a] propose specific neural network parametrizations in which the
Jacobian at initialization vanishes, and the first non-trivial term in the Taylor expansion is quadratic. Under
such initializations the gradient flow dynamics is ‘purely nonlinear’.

6 Generalization in the linear regime


As discussed in Sections 2 and 4, approaches that control the test error via uniform convergence fail for
overparametrized interpolating models. So far, the most complete generalization results for such models
have been obtained in the linear regime, namely under the assumption that we can approximate f (θ) by
its first order Taylor approximation flin (θ) = f (θ 0 ) + Df (θ)(θ − θ 0 ). While Theorem 5.1 provides a set
of sufficient conditions for this approximation to be accurate, in this section we leave aside the question of
whether or when this is indeed the case, and review what we know about the generalization properties of these
linearized models. We begin in Section 6.1 by discussing the inductive bias induced by gradient descent on

39
wide two-layer networks. Section 6.2 describes a general setup. Section 6.3 reviews random features models:
two-layer neural networks in which the first layer is not trained and entirely random. While these are simpler
than neural networks in the linear regime, their generalization behavior is in many ways similar. Finally, in
Section 6.4 we review progress on the generalization error of linearized two-layer networks.

6.1 The implicit regularization of gradient-based training


As emphasized in previous sections, in an overparametrized setting, convergence to global minima is not suf-
ficient to characterize the generalization properties of neural networks. It is equally important to understand
which global minima are selected by the training algorithm, in particular by gradient-based training. As
shown in Section 3, in linear models gradient descent converges to the minimum `2 -norm interpolator. Under
the assumption that training takes place in the linear regime (see Section 5.1), we can apply this observation
to neural networks. Namely, the neural network trained by gradient descent will be well approximated by
the model6 flin (b
a) = f (θ 0 ) + Df (θ 0 )b b minimizes kak2 among empirical risk minimizers
a where a
n o
ab := argmin kak2 : yi = flin (xi ; a) for all i ≤ n . (117)
a∈Rp

For simplicity, we will set f (x; θ 0 ) = 0. This can be achieved either by properly constructing the initialization
(2)
θ 0 (as in the initialization θ 0 in Section 5.1) or by redefining the response vector y 0 = y − fn (θ 0 ). If
f (x; θ 0 ) = 0, the interpolation constraint yi = flin (xi ; a) for all i ≤ n can be written as Dfn (θ 0 )a = y.
Consider the case of two-layer neural networks in which only first-layer weights are trained. Recalling
the form of the Jacobian (93), we can rewrite (117) as
n m
X o
b := argmin kak2 : yi =
a haj , xi iσ 0 (hwj , xi i) , (118)
a∈Rmd j=1

where we write a = (a1 , . . . , am ), ai ∈ Rd . In this section we will study the generalization properties of
this neural tangent (NT) model and some of its close relatives. Before formally defining our setup, it is
instructive to rewrite the norm that we are minimizing in function space:
n 1 m
1 X o
kf kNT,m := inf √ kak2 : f (x) = haj , xi iσ 0 (hwj , xi) a.e. . (119)
m m j=1

This is an RKHS norm defining a finite-dimensional subspace of L2 (Rd , P). We can also think of it as a finite
approximation to the norm
n Z o
kf kNT := inf kakL2 (ρ0 ) : f (x) = ha(w), xi iσ 0 (hw, xi) ρ0 (dw) . (120)

Here a : Rd → Rd is a measurable function with


Z
kak2L2 (ρ0 ) := ka(w)k2 ρ0 (dw) < ∞,

and we are assuming that the weights wj in (119) are initialized as

(wj )j≤m ∼i.i.d. ρ0 .

This is also an RKHS norm whose kernel KNT (x1 , x2 ) will be described below; see (129).
Let us emphasize that moving out of the linear regime leads to different—and possibly more interesting—
inductive biases than those described in (119) or (120). As an example, [CB20] analyze the mean field limit
6 With a slight abuse of notation, in this section we parametrize the linearized model by the shift with respect to the

initialization θ 0 .

40
of two-layer networks, trained with logistic loss, for activation functions that have Lipschitz gradient and are
positively 2-homogeneous. For instance, the square ReLU σ(x) = (x+ )2 with fixed second-layer coefficients
fits this framework. The usual ReLU with trained second-layer coefficients bj σ(hwj , xi) = bj (hwj , xi)+ is
2-homogeneous but not differentiable. In this setting, and under a convergence assumption, they show that
gradient flow minimizes the following norm among interpolators:
n Z o
kf kσ := inf kνkTV : f (x) = σ(hw, xi) ν(dw) a.e. . (121)

Here, minimization is over the finite signed measure ν with Hahn decomposition ν = ν+ − ν− , and kνkTV :=
ν+ (Rd )+ν− (Rd ) is the associated total variation. The norm kf kσ is a special example of the variation norms
introduced in [Kur97] and further studied in [KS01, KS02].
This norm differs in two ways from the RKHS norm of (120). Each is defined in terms of a different
integral operator, Z
a 7→ ha(w), xiσ 0 (hw, xi i) ρ0 (dw)

for (120) and Z


ν 7→ σ(hw, xi) ν(dw)

for (121). However, more importantly, the norms are very different: in (120) it is a Euclidean norm while
in (121) it is a total variation norm. Intuitively, the total variation norm kνkTV promotes ‘sparse’ measures
ν, and hence the functional norm kf kσ promotes functions that depend primarily on a small number of
directions in Rd [Bac17].

6.2 Ridge regression in the linear regime


We generalize the min-norm procedure of (117) to consider the ridge regression estimator:
n
n1 X o
2
b (λ) := argmin
a yi − flin (xi ; a) + λkak22 , (122)
a∈Rp n i=1
flin (xi ; a) := ha, Df (xi ; θ 0 )i . (123)

The min-norm estimator can be recovered by taking the limit of vanishing regularization limλ→0 a b (λ) = ab (0+ )
(with a slight abuse of notation, we will identify λ = 0 with this limit). Apart from being intrinsically
interesting, the behavior of ab (λ) for λ > 0 is a good approximation of the behavior of the estimator produced
by gradient flow with early stopping [AKT19]. More precisely, letting (b aGF (t))t≥0 denote the path of gradient
b GF (0) = 0, there exists a parametrization t 7→ λ(t), such that the test error at a
flow initialized at a b GF (t) is
well approximated by the test error at a b (λ(t)).
Note that the function class {flin (xi ; a) := ha, Df (xi ; θ 0 )i : a ∈ Rp } is a linear space, which is linearly
parametrized by a. We consider two specific examples which are obtained by linearizing two-layer neural
networks (see (90)):
n m
X o
m
FRF := flin (x; a) = ai σ(hwi , xi) : ai ∈ R , (124)
i=1
n Xm o
m
FNT := flin (x; a) = hai , xiσ 0 (hwi , xi) : ai ∈ Rd . (125)
i=1

m
Namely, FRF (RF stands for ‘random features’) is the class of functions obtained by linearizing a two-layer
m
network with respect to second-layer weights and keeping the first layer fixed, and FNT (NT stands for
‘neural tangent’) is the class obtained by linearizing a two-layer network with respect to the first layer and
keeping the second fixed. The first example was introduced by [BBV06] and [RR07] and can be viewed as

41
a linearization of the two-layer neural networks in which only second-layer weights are trained. Of course,
since the network is linear in the second-layer weights, it coincides with its linearization. The second example
is the linearization of a neural network in which only the first-layer weights are trained. In both cases, we
draw (wi )i≤m ∼i.i.d. Unif(Sd−1 ) (the Gaussian initialization wi ∼ N(0, Id /d) behaves very similarly).
Ridge regression (122) within either model FRF or FNT can be viewed as kernel ridge regression (KRR)
with respect to the kernels
m
1 X
KRF,m (x1 , x2 ) := σ(hwi , x1 i)σ(hwi , x2 i) , (126)
m i=1
m
1 X
KNT,m (x1 , x2 ) := hx1 , x2 iσ 0 (hwi , x1 i)σ 0 (hwi , x2 i) . (127)
m i=1

These kernels are random (because the weights wi are) and have finite rank, namely rank at most p, where
p = m in the first case and p = md in the second. The last property is equivalent to the fact that the RKHS
is at most p-dimensional. As the number of neurons diverge, these kernels converge to their expectations
KRF (x1 , x2 ) and KNT (x1 , x2 ). Since the distribution of wi is invariant under rotations in Rd , so are these
√of kx1 k2 , kx2 k2 and
kernels. The kernels KRF (x1 , x2 ) and KNT (x1 , x2 ) can therefore be written as functions
hx1 , x2 i. In particular, if we assume that data are normalized, say kx1 k2 = kx2 k2 = d, then we have the
particularly simple form

KRF (x1 , x2 ) = HRF,d (hx1 , x2 i/d) , (128)


KNT (x1 , x2 ) = d HNT,d (hx1 , x2 i/d) , (129)

where
√ √
HRF,d (q) := Ew {σ( dhw, e1 i)σ( dhw, qe1 + qe2 i)} , (130)
√ √
HNT,d (q) := qEw {σ 0 ( dhw, e1 i)σ 0 ( dhw, qe1 + qe2 i)} , (131)
p
with q := 1 − q 2 .
The convergence KRF,m → KRF , KNT,m → KNT,m takes place under suitable assumptions, pointwise
[RR07]. However, we would like to understand the qualitative behavior of the generalization error in the
above linearized models.
(i) Does the procedure (122) share qualitative behavior with KRR, as discussed in Section 4? In particular,
can min-norm interpolation be (nearly) optimal in the RF or NT models as well?
(ii) How large should m be for the generalization properties of RF or NT ridge regression to match those
of the associated kernel?
(iii) What discrepancies between KRR and RF or NT regression can we observe when m is not sufficiently
large?
(iv) Is there any advantage of one of the three methods (KRR, RF, NT) over the others?
Throughout this section we assume an isotropic model for the distribution of the covariates xi , namely
we assume {(xi , yi )}i≤n to be i.i.d., with

yi = f ∗ (xi ) + εi , xi ∼ Unif(Sd−1 ( d)) , (132)

where f ∗ ∈ L2 (Sd−1 ) is a square-integrable function on the sphere and εi is noise independent of xi , with
E{εi } = 0, E{ε2i } = τ 2 . We will also consider a modification of this model in which xi ∼ N(0, Id ); the two
settings are very close to each other in high dimension. Let us emphasize that we do not make any regularity
assumption about the target function beyond square integrability, which is the bare minimum for the risk

42
to be well defined. On the other hand, the covariates have a simple isotropic distribution and the noise has
variance independent of xi (it is homoscedastic).
While homoscedasticity is not hard to relax to an upper bound on the noise variance, it is useful to
comment on the isotropicity assumption. The main content of this assumption is that the ambient dimension
d of the covariate vectors does coincide with the intrinsic dimension of the data. If, for instance, the xi lie
on a d0 -dimensional subspace in Rd , d0  d, then it is intuitively clear that d would have to be replaced by
d0 below. Indeed this is a special case of a generalization studied in [GMMM20b]. An even more general
setting is considered in [MMM21], where xi belongs to an abstract space. The key assumption there is that
leading eigenfunctions of the associated kernel are delocalized.
We evaluate the quality of method (122) using the square loss

L(λ) := Ex (f ∗ (x) − flin (x; a


b (λ))2 .

(133)

The expectation is with respect to the test point x ∼ Unif(Sd−1 ( d)); note that the risk is random because
b (λ) depends on the training data. However, in all the results below, it concentrates around a non-random
a
value. We add subscripts, and write LRF (λ) or LNT (λ) to refer to the two classes of models above.

6.3 Random features model


We begin by considering the random features model FRF . A number of authors have established upper bounds
on its minimax generalization error for suitably chosen positive values of the regularization [RR17, RR09].
Besides the connection to neural networks, FRF can be viewed as a randomized approximation for the RKHS
associated with KRF . A closely related approach in this context is provided by randomized subset selection,
also known as Nyström’s method [WS01, Bac13, EAM15, RCR15].
The classical random features model FRF is mathematically easier to analyze than the neural tangent
model FNT , and a precise picture can be established that covers the interpolation limit. Several elements of
this picture have been proved to generalize to the NT model as well, as discussed in the next subsection.
We focus on the high-dimensional regime, m, n, d → ∞; as discussed in Section 4, interpolation methods
have appealing properties in high dimension. Complementary asymptotic descriptions are obtained depend-
ing on how m, n, d diverge. In Section 6.3.1 we discuss the behavior at a coarser scale, namely when m and
n scale polynomially in d: this type of analysis provides a simple quantitative answer to the question of how
large m should be to approach the m = ∞ limit. Next, in Section 6.3.2, we consider the proportional regime
m  n  d. This allows us to explore more precisely what happens in the transition from underparametrized
to overparametrized.

6.3.1 Polynomial scaling


The following characterization was proved in [MMM21] (earlier work by [GMMM20a] established this result
for the two limiting cases m = ∞ and n = ∞). In what follows, we let L2 (γ) denote the space of square
2
integrable functions on R, with respect to the standard Gaussian measure γ(dx) = (2π)−1/2 e−x /2 dx, and
we write h · , · iL2 (γ) , k · kL2 (γ) for the associated scalar product and norm.
Theorem 6.1. Fix an integer ` > 0. Let the activation function σ : R → R be independent of d and such
that: (i) |σ(x)| ≤ c0 exp(|x|c1 ) for some constants c0 > 0 and c1 < 1, and (ii) hσ, qiL2 (γ) 6= 0 for any non-
vanishing polynomial q, with deg(q) ≤ `. Assume max((n/m), (m/n)) ≥ dδ and d`+δ ≤ min(m, n) ≤ d`+1−δ
for some constant δ > 0. Then for any λ = Od ((m/n) ∨ 1), and all η > 0,

LRF (λ) = kP>` f ∗ k2L2 + od (1) kf ∗ k2L2 + kP>` f ∗ k2L2+η + τ 2 .



(134)

In words, as long as the number of parameters m and the number of samples n are well separated, the
test error is determined by the minimum of m and n:

43
• For m  n, the approximation error dominates. If d`  m  d`+1 , the model fits the projection
of f onto degree-` polynomials perfectly but does not fit the higher degree components at all: fbλ ≈
P≤` f . This is consistent with a parameter-counting heuristic: degree-` polynomials form a subspace
of dimension Θ(d` ) and in order to approximate them we need a network with Ω(d` ) parameters.
Surprisingly, this transition is sharp.
• For n  m, the statistical error dominates. If d`  n  d`+1 , fbλ ≈ P≤` f . This is again consistent
with a parameter-counting heuristic: to learn degree-` polynomials we need roughly as many samples
as parameters.
• Both of the above are achieved for any sufficiently small value of the regularization parameter λ. In
particular, they apply to min-norm interpolation (corresponding to the case λ = 0+ ).
From a practical perspective, if the sample size n is given, we might be interested in choosing the number of
neurons m. The above result indicates that the test error roughly decreases until the overparametrization
threshold m ≈ n, and that there is limited improvement from increasing the network size beyond m ≥ ndδ .
At this point, RF ridge regression achieves the same error as the corresponding kernel method. Indeed
the statement of Theorem 6.1 holds for the case of KRR as well, by identifying it with the limit m = ∞
[GMMM20a].
Note that the infinite width (kernel) limit m = ∞ corresponds to the setting already investigated in
Theorem 4.10. Indeed, the staircase phenomenon in the m = ∞ case of Theorem 6.1 corresponds to
the multiple descent behavior seen in Theorem 4.10. The two results do not imply each other because
Theorem 4.10 assumes f ∗ to have bounded RKHS norm; Theorem 6.1 does not make this assumption, but
is not as sharp for functions with bounded RKHS norm.
The significance of polynomials in Theorem 6.1 is related to the fact that the kernel KRF is invariant
under rotations (see (128)). As a consequence, the eigenfunctions of KRF are √ spherical harmonics, that is,
restrictions of homogeneous harmonic polynomials in Rd to the sphere Sd−1 ( d), with eigenvalues given by
their degrees. [MMM21] have obtained analogous results for more general probability spaces (X , P) for the
covariates, and more general random features models. The role of low-degree polynomials is played by the
top eigenfunctions of the associated kernel.
The mathematical phenomenon underlying Theorem 6.1 can be understood by considering the feature
matrix Φ ∈ Rn×m :
 
σ(hx1 , w1 i) σ(hx1 , w2 i) · · · σ(hx1 , wm i)
 σ(hx2 , w1 i) σ(hx2 , w2 i) · · · σ(hx2 , wm i) 
Φ :=  .. .. .. . (135)
 
 . . . 
σ(hxn , w1 i) σ(hxn , w2 i) · · · σ(hxn , wm i)

The ith row of this matrix is the feature vector associated with the ith sample. We can decompose Φ accord-

ing to the eigenvalue decomposition of σ, seen as an integral operator from L2 (Sd−1 (1)) to L2 (Sd−1 ( d)):
Z
a(w) 7→ σ(hx, wi) a(w) τd (dw)

(where τd is the uniform measure on Sd−1 (1)). This takes the form

X
σ(hx, wi) = sk ψk (x)φk (w) , (136)
k=0

where (ψj )j≥1 and (φj )j≥1 are two orthonormal systems in L2 (Sd−1 ( d)) and L2 (Sd−1 (1)) respectively, and
√ s0 ≥ s1 ≥ · · · ≥ 0. (In the present example, σ can be regarded as a self-adjoint
the sj are singular values
operator on L2 (Sd−1 ( d)) after rescaling the wj , and hence the φj and ψj can be taken to coincide up to a
rescaling, but this is not crucial.)

44
The eigenvectors are grouped into eigenspaces V` indexed by ` ∈ Z≥0 , where V` consists of the degree-`
polynomials, and
d − 2 + 2` d − 3 + `
 
dim(V` ) =: B(d, `) = , B(d, `)  d` /`!.
d−2 `
We write s(`) for the eigenvalue associated with eigenspace V` : it turns out that s(`)  d−`/2 , for a generic σ;
(s(`) )2 B(d, `) ≤ C since σ is square integrable. Let ψ k = (ψk (x1 ), . . . , ψk (xn ))T be the evaluation of the kth
T
left eigenfunction at the n data points, and let φk = (φk (w1 ), . . . , φP k (w m )) be the evaluation of the kth
right eigenfunction at the m neuron parameters. Further, let k(`) := `0 ≤` B(d, `0 ). Following our approach
in Section 4, we decompose Φ into a ‘low-frequency’ and a ‘high-frequency’ component,

Φ = Φ≤` + Φ>` , (137)


k(`)
sj ψ j φ> >
X
Φ≤` = j = ψ ≤` S ≤` φ≤` , (138)
j=0

where S ≤` = diag(s1 , . . . , sk(`) ), ψ ≤` ∈ Rn×k(`) is the matrix whose jth column is ψ j , and φ≤` ∈ Rm×k(`)
is the matrix whose jth column is φj .
Consider, to be definite, the overparametrized case m ≥ n1+δ , and assume d`+δ ≤ n. Then we can
think of φj , ψ j , j ≤ k(`) as densely sampled eigenfunctions. This intuition is accurate in the sense that
ψT T
≤` ψ ≤` ≈ nIk(`) and φ≤` φ≤` ≈ mIk(`) [MMM21]. Further, if n ≤ d
`+1−δ
, the ‘high-frequency’ part of the
decomposition (137) behaves similarly to noise along directions orthogonal to the previous ones. Namely, (i)
Φ>` φ≤` ≈ 0, ψ T ≤` Φ>` ≈ 0, and (ii) its singular values (except those along the low-frequency components)
concentrate: for any δ 0 > 0,
1/2 0 1/2 0
κ` n−δ ≤ σn−k(`) (Φ>` )/m1/2 ≤ σ1 (Φ>` )/m1/2 ≤ κ` nδ ,

where κ` := j≥k(`)+1 s2j .


P

In summary, regression with respect to the random features σ(hwj , · i) turns out to be essentially equiv-
alent to kernel ridge regression with respect to a polynomial kernel of degree `, where ` depends on the
smaller of the sample size and the network size. Higher degree parts in the activation function effectively
behave as noise in the regressors. We will next see that this picture can become even more precise in the
proportional regime m  n.

6.3.2 Proportional scaling


Theorem 6.1 requires that m and n are well separated. When m, n are close to each other, the feature matrix
(135) is nearly square and we might expect its condition number to be large. When this is the case, the
variance component of the risk can also be large.
Theorem 6.1 also requires the smaller of m and n to be well separated from d` , with ` any integer. For
d  m  d`+1 the model has enough degrees of freedom to represent (at least in principle) all polynomials
`

of degree at most ` and not enough to represent even a vanishing fraction of all polynomials of degree ` + 1.
Hence it behaves in a particularly simple way. On the other hand, when m is comparable to d` , the model
can partially represent degree-` polynomials, and its behavior will be more complex. Similar considerations
apply to the sample size n.
What happens when m is comparable to n, and both are comparable to an integer power of d? Figure
1 reports simulations within the data model introduced above. We performed ridge regression as per (122),
with a small value of the regularization parameter, λ = 10−3 (m/d). We report test error and train error for
several network widths m, plotting them as a function of the overparametrization ratio m/n.
We observe that the train error decreases with the overparametrization ratio, and becomes very small
for m/n ≥ 1: it is not exactly 0 because we are using λ > 0, but for m/n > 1 it vanishes as λ → 0. On
the other hand, the test error displays a peak at the interpolation threshold m/n = 1. For λ = 0+ the

45
2.5
Predicted test error
Predicted train error
Predicted µ2∗ kâk22
2.0 Empirical test error
Empirical train error
Empirical µ2∗ kâk22
Train/Test Error

1.5

1.0

0.5

0.0
0 1 2 3 4 5
m/n

Figure 1: Train and test error of a random features model (two-layer neural net with random first layer) as
a function of the overparametrization ratio m/n. Here d = 100, n = 400, τ 2 = 0.5, and the target function
is f ∗ = hβ 0 , xi, kβ 0 k2 = 1. The model is fitted using ridge regression with a small regularization parameter
λ = 10−3 (m/d). Circles report the results of numerical simulations (averaged over 20 realizations), while
lines are theoretical predictions for the m, n, d → ∞ asymptotics.

error actually diverges at this threshold. It then decreases and converges rapidly to an asymptotic value as
m/n  1. If both n/d  1, and m/n  1, the asymptotic value of the test error is given by kP>1 f ∗ kL2 : the
model is fitting the degree-one polynomial component of the target function perfectly and behaves trivially
on higher degree components. This matches the picture obtained under polynomial scalings, in Theorem
6.1, and actually indicates that a far smaller separation between m and n is required than assumed in that
theorem. Namely, m/n  1 instead of m/n ≥ dδ appears to be sufficient for the risk to be dominated by
the statistical error.
The peculiar behavior illustrated in Figure 1 was first observed empirically in neural networks and
then shown to be ubiquitous for numerous over-parametrized models [GSd+ 19, SGd+ 19, BHMM19]. It is
commonly referred to as the ‘double descent phenomenon’, after [BHMM19].
Figure 1 displays curves that are exact asymptotic predictions in the limit m, n, d → ∞, with m/d → ψw ,
n/d → ψs . Explicit formulas for these asymptotics were originally established in [MM19] using an approach
from random matrix theory, which we will briefly outline. The first step is to write the risk as an explicit
function of the matrices X ∈ Rn×d√(the matrix whose ith row√is the sample xi ), Θ ∈ Rm×d (the matrix
whose jth row is the sample θ j = dwj ), and Φ = σ(XΘT / d) (the feature matrix in (135)). After a
straightforward calculation, one obtains
2 T
LRF (λ) =Ex [f ∗ (x)2 ] − y Φ(ΦT Φ/n + λIm )−1 V (139)
n
1 T
+ y Φ(ΦT Φ/n + λIm )−1 U (ΦT Φ/n + λIm )−1 ΦT y ,
n2

46
where V ∈ Rm , U ∈ Rm×m are matrices with entries

Vi := Ex {σ(hθ i , xi/ d) f ∗ (x)} , (140)
√ √
Uij := Ex {σ(hθ i , xi/ d) σ(hθ j , xi/ d)} . (141)

Note that the matrix U takes the form of an empirical kernel matrix, although expectation is taken over
the covariates x and the kernel is evaluated at the neuron parameters (θ i )i≤m . Namely, we have Uij =
HRF,d (hθ i , θ j i/d), where the kernel HRF,d is defined exactly7 as in (130). Estimates similar to those of
Section 4 apply here (see also [EK10]): since m  d we can approximate the kernel HRF,d by a linear kernel
in operator norm. Namely, if we decompose σ(x) = µ0 + µ1 x + σ⊥ (x), where E{σ⊥ (G)} = E{Gσ⊥ (G)} = 0,
and E{σ⊥ (G)2 } = µ2∗ , we have

U = µ20 11T + µ21 ΘΘT + µ∗ Im + ∆ , (142)

where ∆ is an error term that vanishes asymptotically in operator norm. Analogously, V can be approxi-
mated as V ≈ a1 + Θb for suitable coefficients a ∈ R, b ∈ Rd .
Substituting these approximations for U and V in (139) yields an expression of the risk in terms of
the three (correlated) random matrices X, Θ, Φ. Standard random matrix theory does not apply directly
to compute the asymptotics of this expression. The main difficulty is that the matrix Φ does not have
independent or nearly independent entries. It is instead obtained by applying a nonlinear function to a
product of matrices with (nearly) independent entries; see (135). The name ‘nonlinear random matrix
theory’ has been coined to refer to this setting [PW17]. Techniques from random matrix theory have been
adapted to this new class of random matrices. In particular, the leave-one-out method can be used to derive
a recursion for the resolvent, as first shown for this type of matrices in [CS13], and the moments method
was first used in [FM19] (both of these papers consider symmetric random matrices, but these techniques
extend to the asymmetric case). Further results on kernel random matrices can be found in [DV13, LLC18]
and [PW18].
Using these approaches, the exact asymptotics of LRF (λ) was determined in the proportional asymptotics
m, n, d → ∞ with m/d → ψw ( ψw represents the number of neurons per dimension), n/d → ψs (ψs represents
the number of samples per dimension). The target function f ∗ is assumed to be square integrable and such
that P>1 f ∗ is a Gaussian isotropic function.8 In this setting, the risk takes the form

LRF (λ) =kP1 f ∗ k2L2 B(ζ, ψw , ψs , λ/µ2∗ ) (143)


+ (τ 2 + kP>1 f ∗ k2L2 )V (ζ, ψw , ψs , λ/µ2∗ ) + kP>1 f ∗ k2L2 + od (1) ,

where ζ := |µ1 |/µ∗ . The functions B, V ≥ 0 are explicit and correspond to an effective bias term and an
effective variance term. Note the additive term kP>1 f ∗ k2L2 : in agreement with Theorem 6.1, the nonlinear
component of f ∗ cannot be learnt at all (recall that m, n = O(d) here). Further kP>1 f ∗ k2L2 is added to the
noise strength in the ‘variance’ term: high degree components of f ∗ are equivalent to white noise at small
sample/network size.
The expressions for B, V can be used to plot curves such as those in Figure 1: we refer to [MMM21] for ex-
plicit formulas. As an interesting conceptual consequence, these results establish a universality phenomenon:
the risk under the random features model is asymptotically the same as the risk of a mathematically sim-
pler model. This simpler model can be analyzed by a direct application of standard random matrix theory
[HMRT20].
We refer to the simpler equivalent model as the ‘noisy features model.’ In order to motivate it, recall the
decomposition σ(x) = µ0 +µ1 x+σ⊥ (x) (with the three components being orthogonal in L2 (γ)). Accordingly,
7 The two kernels coincide because we are using the same distribution for x and θ : while this symmetry simplifies some
i j
calculations, it is not really crucial.
8 Concretely, for each ` ≥ 2, let f = (f ∗
` k,` )k≤B(d,`) be the coefficients of f in a basis of degree-` spherical harmonics. Then
f ` ∼ N(0, F`2 IB(d,` ) independently across `.

47
we decompose the feature matrix as

Φ = Φ≤1 + Φ>1
µ1
= µ0 11T + √ ΘX T + µ∗ Z̃ ,
d

where Z̃ij = σ⊥ (hxi , θ j i/ d)/µ∗ . Note that the entries of Z̃ have zero mean and√are asymptotically
uncorrelated. Further they are asymptotically uncorrelated with the entries9 of ΘX T / d.
As we have seen in Section 6.3.1, the matrix Z̃ behaves in many ways as a matrix with independent
entries, independent of Θ, X. In particular, if max(m, n)  d2 and either m  n or m  n, its eigenvalues
concentrate around a deterministic value (see discussion below (137)).
The noisy features model is obtained by replacing Z̃ with a matrix Z, with independent entries, inde-
pendent of Θ, X. Accordingly, we replace the target function with a linear function with additional noise.
In summary:
µ1
ΦNF = µ0 11T + √ ΘX T + µ∗ Z, (Zij )i≤n,j≤m ∼ N(0, 1) , (144)
d
y = b0 1 + Xβ + τ+ g̃, (g̃i )i≤n ∼ N(0, 1) . (145)

Here the random variables (g̃i )i≤n , (Zij )i≤n,j≤m are mutually independent, and independent of all the others,
and the parameters b0 , β, τ+ are fixed by the conditions P≤1 f ∗ (x) = b0 + hβ, xi and τ+
2
= τ 2 + kP>1 f ∗ k2L2 .
The next statement establishes asymptotic equivalence of the noisy and random features model.
Theorem 6.2. Under the data distribution introduced above, let LRF (λ) denote the risk of ridge regression
in the random features model with regularization λ, and let LNF (λ) be the risk in the noisy features model.
Then we have, in n, m, d → ∞ with m/d → ψw , n/d → ψs ,

LRF (λ) = LNF (λ) · 1 + on (1) . (146)

Knowing the exact asymptotics of the risk allows us to identify phenomena that otherwise would be out
of reach. A particularly interesting one is the optimality of interpolation at high signal-to-noise ratio.
Corollary 6.3. Define the signal-to-noise ratio of the random features model as SNRd := kP1 f ∗ k2L2 /(kP>1 f ∗ k2L2 +
τ 2 ), and let LRF (λ) be the risk of ridge regression with regularization λ. Then there exists a critical value
SNR∗ > 0 such that the following hold.
(i) If limd→∞ SNRd = SNR∞ > SNR∗ , then the optimal regularization parameter is λ = 0+ , in the sense
that LRF,∞ (λ) := limd→∞ LRF (λ) is monotone increasing for λ ∈ (0, ∞).
(ii) If limd→∞ SNRd = SNR∞ < SNR∗ , then the optimal regularization parameter is λ > 0, in the sense
that LRF,∞ (λ) := limd→∞ LRF (λ) is monotone decreasing for λ ∈ (0, λ0 ) with λ0 > 0.
In other words, above a certain threshold in SNR, (near) interpolation is required in order to achieve
optimal risk, not just optimal rates.
The universality phenomenon of Theorem 6.2 first emerged in random matrix theory studies of (symmet-
ric) kernel inner product random matrices. In that case, the spectrum of such a random matrix was shown in
[CS13] to behave asymptotically as the one of the sum of independent Wishart and Wigner matrices, which
correspond respectively to the linear and nonlinear parts of the kernel (see also [FM19] where this remark is
made more explicit). In the context of random features ridge regression, this type of universality was first
pointed out in [HMRT20], which proved a special case of Theorem 6.2. In [GMKZ19] and [GRM+ 20], a
universality conjecture was put forward on the basis of statistical physics arguments and proved to hold in
online learning schemes (that is, if each sample is visited only once).
9 Uncorrelatedness

holds only asymptotically, because the distribution of hxi , θ j i/ d is not exactly Gaussian, but only
2
asymptotically so, while the decomposition σ(x) = σ0 + σ1 x + σ⊥ (x) is taken in L (γ).

48
Universality is conjectured to hold in significantly broader settings than ridge-regularized least-squares.
This is interesting because analysing the noisy feature models is often significantly easier than the original
random features model. For instance [MRSY19] studied max margin classification under the universality hy-
pothesis, and derived an asymptotic characterization of the test error using Gaussian comparison inequalities.
Related results were obtained by [TPT20] and [KT20], among others.
Finally, a direct proof of universality for general strongly convex smooth losses was recently proposed in
[HL20] using the Lindeberg interpolation method.

6.4 Neural tangent model


The neural tangent model FNT —recall (125)— has not (yet) been studied in as much detail as the random
features model. The fundamental difficulty is related to the fact that the features matrix Φ ∈ Rn×md no
longer has independent columns:
 0
σ (hx1 , w1 i)xT σ 0 (hx1 , w2 i)xT σ(hx1 , wm i)xT

1 1 ··· 1
0
 σ (hx2 , w1 i)x2 σ(hx2 , w2 i)xT
T
2 ··· σ(hx2 , wm i)xT
2

Φ :=  .. .. .. . (147)

 . . . 
σ 0 (hxn , w1 i)xT
n σ(hxn , w2 i)xT
n ··· σ(hxn , wm i)xT
n

Nevertheless, several results are available and point to a common conclusion: the generalization properties
of NT are very similar to those of RF, provided we keep the number of parameters constant, which amounts
to reducing the number of neurons according to mNT d = pNT = pRF = mRF .
Before discussing rigorous results pointing in this direction, it is important to emphasize that, even if the
two models are statistically equivalent, they can differ from other points of view. In particular, at prediction
time both models have complexity O(md). Indeed, in the case of RF the most complex operation is the
matrix vector multiplication x 7→ W x, while for NT two such multiplications are needed x 7→ W x and
x 7→ Ax (here A ∈ Rm×d is the matrix with rows (ai )i≤m . If we keep the same number of parameters (which
we can regard as a proxy for expressivity of the model), we obtain complexity O(pd) for RF and O(p) for
NT. Similar considerations apply at training time. In other words, if we are constrained by computational
complexity, in high dimension NT allows significantly better expressivity.
A first element confirming this picture is provided by the following result, which partially generalizes
Theorem 6.1. In order to state this theorem, we introduce a useful notation. Given a function f : R → R,
such that E{f (G)2 } < ∞, we let µk (f ) := E{Hek (G)f (G)} denote the kth coefficient of f in the basis of
Hermite polynomials.
Theorem 6.4. Fix an integer ` > 0. Let the activation function σ : R → R be weakly differentiable,
independent of d, and such that: (i) |σ 0 (x)| ≤ c0 exp(c1 x2 /4) for some constants c0 > 0, and c1 < 1,
(ii) there exist k1 , k2 ≥ 2` + 7 such that µk1 (σ 0 ), µk2 (σ 0 ) 6= 0, and µk1 (x2 σ 0 )/µk1 (σ 0 ) 6= µk1 (x2 σ 0 )/µk1 (σ 0 ),
and (iii) µk (σ) 6= 0 for all k ≤ ` + 1. Then the following holds.
Assume either n = ∞ (in which case we are considering pure approximation error) or m = ∞ (that is,
the test error of kernel ridge regression) and d`+δ ≤ min(md; n) ≤ d`+1−δ for some constant δ > 0. Then,
for any λ = od (1) and all η > 0,

LNT (λ) = kP>` f ∗ k2L2 + od (1) kf ∗ k2L2 + τ 2 .



(148)

In this statement we abused notation in letting m = ∞ denote the case of KRR, and letting n = ∞ refer
to the approximation error:

lim LNT (λ) = inf E [f ∗ (x) − fb(x)]2 .



(149)
n→∞ m
fb∈FNT

Note that here the NT kernel is a rotationally invariant kernel on Sd−1 ( d) and hence takes the same form
as the RF kernel, namely KNT (x1 , x2 ) = d HNT,d (hx1 , x2 i/d) (see (128)). Hence the m = ∞ case of the last
theorem is not new: it can be regarded as a special case of Theorem 6.1.

49
On the other hand, the n = ∞ portion of the last theorem is new. In words, if d`+δ ≤ md ≤ d`+1−δ ,
m
then FNT can approximate degree-` polynomials to an arbitrarily good relative accuracy, but is roughly
orthogonal to polynomials of higher degree (more precisely, to polynomials that have vanishing projection
onto degree-` ones). Apart from the technical assumptions, this result is identical to the n = ∞ case of
Theorem 6.1, with the caveat that, as mentioned above, the two models should be compared by keeping the
number of parameters (not the number of neurons) constant.
How do NT models behave when both m and n are finite? By analogy with the RF model, we would
expect that the model undergoes an ‘interpolation’ phase transition at md ≈ n: the test error is bounded
away from 0 for md . n and can instead vanish for md & n. Note that finding an interpolating function
m
f ∈ FNT amounts to solving the system of linear equations Φa = y, and hence a solution exists for generic y
if and only if rank(Φ) = n. Lemma 5.3 implies10 that this is indeed the case for md ≥ C0 n log n and n ≤ d`0
for some constant `0 (see (96)).
In order to study the test error, it is not sufficient to lower-bound the minimum singular value of Φ, but
we need to understand the structure of this matrix: results in this direction were obtained in [MZ20], for
m ≤ C0 d, for some constant C0 . Following the same strategy of previous sections, we decompose
Φ = Φ0 + Φ≥1 , (150)
 T
x1 xT T

1 ··· x1
xT T
··· xT
 2 x2 2

Φ0 = µ1  . .. ..  , (151)
 .. . . 
xT
n xT
n ··· xT
n

where µ1 := E{σ 0 (G)] for G ∼ N(0, 1). The empirical kernel matrix K = ΦΦT /m then reads
1 1 1 1
K= Φ0 ΦT
0 + Φ0 ΦT
≥1 + Φ≥1 ΦT0 + Φ≥1 ΦT
≥1 (152)
m m m m
1
= µ21 XX T + Φ≥1 P ⊥ ΦT ≥1 + ∆ . (153)
m
Here P ∈ Rmd×md is a block-diagonal projector, with m blocks of dimension d × d, with `th block given by

P ` := w` wT
`, P = Imd − P and ∆ := (Φ0 ΦT T T
≥1 + Φ≥1 Φ0 + Φ≥1 P Φ≥1 )/m.

For the diagonal entries we have (assuming for simplicity xi ∼ N(0, Id )),
n1  o
Φ≥1 P ⊥ ΦT 0 2

E ≥1 ii = E hxi , (Id − P ` )xi i(σ (hw ` , xi i) − µ1 )
m
= E hxi , (Id − P ` )xi i E σ 0 (hw` , xi i) − µ1 )2
 

= (d − 1)E{(σ 0 (G) − Eσ 0 (G))2 } =: (d − 1)v(σ),


where the second equality follows because (Id − P ` )xi and hw` , xi i are independent for xi ∼ N(0, Id ), and
the last expectation is with respect to G ∼ N(0, 1). As proved in [MZ20] the matrix Φ≥1 P ⊥ ΦT ≥1 is well
approximated by this diagonal expectation. Namely, under the model above, there exists a constant C such
that, with high probability:
1 r n(log d)C
⊥ T
Φ≥1 P Φ≥1 − v(σ) In ≤ . (154)

md md

Equations (153) and (154) suggest that for m = O(d), ridge regression in the NT model can be ap-
proximated by ridge regression in the raw covariates, as long as the regularization parameter is suitably
modified. The next theorem confirms this intuition [MZ20]. We define ridge regression with respect to the
raw covariates as per
n1 o
y − Xβ 2 + γkβk22 .

β(γ)
b := argmin 2
(155)
d
10 To be precise, Lemma 5.3 assumes the covariate vectors xi ∼ N(0, Id ).

50
Theorem 6.5. Assume d1/C0 ≤ m ≤ C0 d, n ≥ d/C0 and md  n. Then with high probability there exists
an interpolator. Further assume xi ∼ N(0, Id ) and f ∗ (x) = hβ ∗ , xi. Let

Llin (γ) := E{(f ∗ (x) − hβ(γ),


b xi)2 }

denote the risk of ridge regression with respect to the raw features.
Set λ = λ0 (md/n) for some λ0 ≥ 0. Then there exists a constant C > 0 such that, with high probability,
r !
n(log d)C
LNT (λ) = Llin (γeff (λ0 , σ)) + O , (156)
md

where γeff (λ0 , σ) := (λ0 + v(σ))/E{σ 0 (G)}2 .


Notice that the shift in regularization parameter matches the heuristics given above (the scaling in
λ = λ0 (md/n) is introduced to match the typical scale of Φ).

7 Conclusions and future directions


Classical statistical learning theory establishes guarantees on the performance of a statistical estimator fb,
by bounding the generalization error L(fb) − L( b fb). This is often thought of as a small quantity compared
to the training error L(f ) − L(f )  L(f ). Regularization methods are designed precisely with the aim of
b b b b b
keeping the generalization error L(fb) − L( b fb) small.
The effort to understand deep learning has recently led to the discovery of a different learning scenario,
in which the test error L(fb) is optimal or nearly optimal, despite being much larger than the training error.
Indeed in deep learning the training error often vanishes or is extremely small. The model is so rich that it
overfits the data, that is, L( b fb)  inf f L(f ). When pushed, gradient-based training leads to interpolation or
near-interpolation L( b fb) ≈ 0 [ZBH+ 17]. We regard this as a particularly illuminating limit case.
This behavior is especially puzzling from a statistical point of view, that is, if we view data (xi , yi ) as
inherently noisy. In this case yi − f ∗ (xi ) is of the order of the noise level and therefore, for a model that
interpolates, fb(xi ) − f ∗ (xi ) is also large. Despite this, near-optimal test error means that fb(xtest ) − f ∗ (xtest )
must be small at ‘most’ test points xtest ∼ P.
As pointed out in Section 2, interpolation poses less of a conceptual problem if data are noiseless. Indeed,
unlike the noisy case, we can exhibit at least one interpolating solution that has vanishing test error, for any
sample size: the true function f ∗ . Stronger results can also be established in the noiseless case: [Fel20] proved
that interpolation is necessary to achieve optimal error rates when the data distribution is heavy-tailed in a
suitable sense.
In this review we have focused on understanding when and why interpolation can be optimal or nearly
optimal even with noisy data. Rigorous work has largely focused on models that are linear in a certain
feature space, with the featurization map being independent of the data. Examples are RKHSs, the features
produced by random network layers, or the neural tangent features defined by the Jacobian of the network at
initialization. Mathematical work has established that interpolation can indeed be optimal and has described
the underlying mechanism in a number of settings. While the scope of this analysis might appear to be limited
(neural networks are notoriously nonlinear in their parameters), it is relevant to deep learning in two ways.
First, in a direct way: as explained in Section 5, there are training regimes in which an overparametrized
neural network is well approximated by a linear model that corresponds to the first-order Taylor expansion
of the network around its initialization (the ‘neural tangent’ model). Second, in an indirect way: insights
and hypotheses arising from the analysis of linear models can provide useful guidance for studying more
complex settings.
Based on the work presented in this review, we can distill a few insights worthy of exploration in broader
contexts.

51
Simple-plus-spiky decomposition. The function learnt in the overfitting (interpolating) regime takes
the form

fb(x) = fb0 (x) + ∆(x) . (157)

Here fb0 is simple in a suitable sense (for instance, it is smooth) and hence is far from interpolating the data,
while ∆ is spiky: it has large complexity and allows interpolation of the data, but it is small, in the sense
that it has negligible effect on the test error, i.e. L(fb0 + ∆) ≈ L(fb0 ).
In the case of linear models, the decomposition (157) corresponds to a decomposition of fb into two
orthogonal subspaces that do not depend on the data. Namely, fb0 is the projection of fb onto the top
eigenvectors of the associated kernel and ∆ is its orthogonal complement. In nonlinear models, the two
components need not be orthogonal and the associated subspaces are likely to be data-dependent.
Understanding whether such a decomposition is possible, and what is its nature is a wide-open problem,
which could be investigated both empirically and mathematically. A related question is whether the de-
composition (157) is related to the widely observed ‘compressibility’ of neural network models. This is the
observation that the test error of deep learning models does not change significantly if —after training— the
model is simplified by a suitable compression operation [HMD15].
Implicit regularization. Not all interpolating models generalize equally well. This is easily seen in the
case of linear models, where the set of interpolating models forms an affine space of dimension p − n (where p
is the number of parameters). Among these, we can find models of arbitrarily large norm that are arbitrarily
far from the target regression function. Gradient-based training selects a specific model in this subspace,
which is the closest in ℓ2 norm to the initialization.
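The following sketch illustrates this selection effect in the simplest setting; the sizes, step size and iteration count are arbitrary choices, and gradient descent is started at the origin so that its limit can be compared with the pseudoinverse (minimum-ℓ2-norm) solution:

    # Hedged sketch: gradient descent on an overparametrized least-squares problem, started at zero,
    # converges to the minimum-l2-norm interpolant (the pseudoinverse solution).
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 500
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    theta = np.zeros(p)                       # initialization at the origin
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # safe step size: 1 / largest eigenvalue of X^T X
    for _ in range(2000):
        theta -= step * X.T @ (X @ theta - y)

    theta_mn = np.linalg.pinv(X) @ y          # minimum-norm interpolant
    print("train residual          :", np.linalg.norm(X @ theta - y))
    print("distance to min-norm sol:", np.linalg.norm(theta - theta_mn))
    print("norm(GD) / norm(min-norm):", np.linalg.norm(theta) / np.linalg.norm(theta_mn))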
The mechanism by which the training algorithm selects a specific empirical risk minimizer is understood
in only a handful of cases: we refer to Section 3 for pointers to this literature. It would be important
to understand how the model nonlinearity interacts with gradient flow dynamics. This in turn impacts the
decomposition (157), namely which part of the function fb is to be considered ‘simple’ and which one is ‘spiky’.
Finally, the examples of kernel machines, random features and neural tangent models show that—in certain
regimes—the simple component fb0 is also regularized in a non-trivial way, a phenomenon that we called
self-induced regularization. Understanding these mechanisms in a more general setting is an outstanding
challenge.
Role of dimension. As pointed out in Section 4, interpolation is sub-optimal in a fixed dimension in the
presence of noise, for certain kernel methods [RZ19]. The underlying mechanism is as described above: for
an interpolating model, fb(xi) − f∗(xi) is of the order of the noise level. If fb and f∗ are sufficiently regular
(for instance, uniformly continuous, both in x and in n), then fb(xtest) − f∗(xtest) is expected to be of the same
order when xtest is close to the training set. This happens with constant probability in fixed dimension.
However, this probability decays rapidly with the dimension.
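The following small experiment, with an arbitrary choice of distribution (uniform on [0, 1]^d), sample size and radius, estimates how quickly the probability of a test point landing near the training set decays as the dimension grows:

    # Hedged sketch: the chance that a test point lands near the training set decays quickly with
    # the dimension.  The distribution, n, n_test and the threshold radius are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(2)
    n, n_test, radius = 200, 1000, 0.3
    for d in [1, 2, 5, 10, 20]:
        X = rng.random((n, d))
        Xt = rng.random((n_test, d))
        # distance from each test point to its nearest training point
        dists = np.min(np.linalg.norm(Xt[:, None, :] - X[None, :, :], axis=-1), axis=1)
        print(f"d={d:2d}  P(nearest training point within {radius}) ~ {np.mean(dists < radius):.3f}")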
Typical data in deep learning applications are high-dimensional (images, text, and so on). On the other
hand, it is reasonable to believe that deep learning methods are not affected by the ambient dimension
(the number of pixels in an image), but rather by an effective or intrinsic dimension. This is the case for
random feature models [GMMM20b]. This raises the question of how deep learning methods escape the
intrinsic limitations of interpolators in low dimension. Is it because they construct a (near) interpolant fb
that is highly irregular (not uniformly continuous)? Or perhaps because the effective dimension is at least
moderately large? (After all, the lower bounds mentioned above decrease rapidly with dimension.) What is
the proper mathematical definition of effective dimension?
Adaptive model complexity. As mentioned above, in the case of linear models, the terms fb0 and ∆ in
the decomposition (157) correspond to the projections of fb onto Vk and Vk⊥ . Here Vk is the space spanned
by the top k eigenfunctions of the kernel associated with the linear regression problem. Note that this is
also the case for the random features and neural tangent models of Section 6; in that case the relevant kernel
is the expectation of the finite-network kernel Df(θ0)^T Df(θ0) with respect to the choice of random weights
at initialization.
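The next sketch computes a Monte Carlo estimate of this expected tangent kernel (in its n × n, data-space form) for a small two-layer ReLU network; the architecture, widths, scalings and sample sizes are assumptions made only for illustration:

    # Hedged sketch: Monte Carlo estimate of E[ Df(theta_0) Df(theta_0)^T ] for a two-layer ReLU
    # network at random initialization, and its eigenvalue decay.
    import numpy as np

    rng = np.random.default_rng(3)
    n, d, width, n_init = 60, 10, 200, 20
    X = rng.standard_normal((n, d)) / np.sqrt(d)      # n inputs of (roughly) unit norm

    def tangent_kernel(X, W, a):
        # f(x) = (1/sqrt(width)) sum_j a_j relu(<w_j, x>); Jacobian taken w.r.t. (a, W)
        pre = X @ W.T                                  # (n, width) pre-activations
        act = np.maximum(pre, 0.0)
        ind = (pre > 0).astype(float)
        J_a = act / np.sqrt(W.shape[0])                # derivatives w.r.t. a_j
        J_W = (ind * a / np.sqrt(W.shape[0]))[:, :, None] * X[:, None, :]  # w.r.t. w_j
        J = np.concatenate([J_a, J_W.reshape(n, -1)], axis=1)
        return J @ J.T                                 # n x n tangent kernel

    K = np.zeros((n, n))
    for _ in range(n_init):                            # average over random initializations
        W = rng.standard_normal((width, d))
        a = rng.standard_normal(width)
        K += tangent_kernel(X, W, a) / n_init

    print("top 10 eigenvalues of the averaged tangent kernel:")
    print(np.round(np.linalg.eigvalsh(K)[::-1][:10], 4))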

A crucial element of this behavior is the dependence of k (the dimension of the eigenspace Vk ) on various
features of the problem at hand: indeed k governs the complexity of the ‘simple’ part of the model fb0 , which
is the one actually relevant for prediction. As discussed in Section 4, in kernel methods k increases with the
sample size n: as more data are used, the model fb0 becomes more complex. In random features and neural
tangent models (see Section 6), k depends on the minimum of n and the number of network parameters
(which is proportional to the width for two-layer networks). The model complexity increases with sample
size, but saturates when it reaches the number of network parameters.
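One crude way to visualize this behavior is to track a degrees-of-freedom proxy for a random-features kernel as the sample size grows; the specific proxy tr[K(K + λI)^{-1}], the ridge level and all sizes below are choices made only for this sketch:

    # Hedged sketch: an effective-complexity proxy for a random-features model as the sample size
    # grows; it increases with n and saturates at the number of random features N.
    import numpy as np

    rng = np.random.default_rng(4)
    d, N, lam = 20, 150, 1e-2                          # input dim., number of features, ridge level
    W = rng.standard_normal((N, d)) / np.sqrt(d)       # fixed random first layer

    def effective_complexity(n):
        X = rng.standard_normal((n, d))
        F = np.maximum(X @ W.T, 0.0) / np.sqrt(N)      # random ReLU features
        K = F @ F.T                                    # n x n feature kernel
        ev = np.linalg.eigvalsh(K)
        return np.sum(ev / (ev + lam))                 # degrees of freedom tr[K (K + lam I)^(-1)]

    for n in [25, 50, 100, 200, 400, 800]:
        print(f"n={n:4d}  effective complexity ~ {effective_complexity(n):6.1f}  (cannot exceed N={N})")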
This suggests a general hypothesis that would be interesting to investigate beyond linear models. Namely,
if a decomposition of the type (157) is possible, then the complexity of the simple part fb0 increases with the
sample size and the network size.
Computational role of overparametrization. We largely focused on the surprising discovery that
overparametrization and interpolation do not necessarily hurt generalization, even in the presence of noise.
However, we should emphasize once more that the real motivation for working with overparametrized models
is not statistical but computational. The empirical risk minimization problem for neural networks is com-
putationally hard, and in general we cannot hope to be able to find a global minimizer using gradient-based
algorithms. However, empirical evidence indicates that global optimization becomes tractable when the
model is sufficiently overparametrized.
The linearized and mean field theories of Section 5 provide general arguments to confirm this empirical
finding. However, we are far from understanding precisely what amount of overparametrization is necessary,
even in simple neural network models.

Acknowledgements
PB, AM and AR acknowledge support from the NSF through award DMS-2031883 and from the Simons
Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning.
For insightful discussions on these topics, the authors also thank the other members of that Collaboration and
many other collaborators and colleagues, including Emmanuel Abbe, Misha Belkin, Niladri Chatterji, Amit
Daniely, Tengyuan Liang, Philip Long, Gábor Lugosi, Song Mei, Theodor Misiakiewicz, Hossein Mobahi,
Elchanan Mossel, Phan-Minh Nguyen, Nati Srebro, Nike Sun, Alexander Tsigler, Roman Vershynin, and
Bin Yu. We thank Tengyuan Liang and Song Mei for insightful comments on the draft. PB acknowledges
support from the NSF through grant DMS-2023505. AM acknowledges support from the ONR through
grant N00014-18-1-2729. AR acknowledges support from the NSF through grant DMS-1953181, and support
from the MIT-IBM Watson AI Lab and the NSF AI Institute for Artificial Intelligence and Fundamental
Interactions.

References
[AB99] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 1999.
[ABR64] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential
function method in pattern recognition. Avtomat. i Telemeh, 25(6):917–936, 1964.
[AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in
the Space of Probability Measures. Springer Science & Business Media, 2008.
[AKT19] Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. A continuous-time view of early stopping for
least squares regression. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings
of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages
1370–1378. PMLR, 2019.

[AM97] Dimitris Achlioptas and Michael Molloy. The analysis of a list-coloring algorithm on a random
graph. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages
204–212. IEEE, 1997.
[AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-
parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of
the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 242–252. PMLR, 2019.
[Bac13] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Shai Shalev-Shwartz
and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory,
volume 30 of Proceedings of Machine Learning Research, pages 185–209. PMLR, 2013.

[Bac17] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal
of Machine Learning Research, 18(1):629–681, 2017.
[Bar98] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the
size of the weights is more important than the size of the network. IEEE Transactions on
Information Theory, 44(2):525–536, 1998.

[Bar08] Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection.
Econometric Theory, 24(2):545–552, April 2008.
[BBD02] P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems.
Theoretical Computer Science, 284(1):53–66, 2002.

[BBL02] P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine
Learning, 48:85–113, 2002.
[BBM05] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities.
Annals of Statistics, 33(4):1497–1537, 2005.

[BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels,
margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.
[BD07] Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics.
Pearson Prentice Hall, 2007.
[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learn-
ing Research, 2:499–526, 2002.
[BEHW89] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and
the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[BFT17] Peter L. Bartlett, Dylan Foster, and Matus Telgarsky. Spectrally-normalized margin bounds
for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30,
pages 6240–6249. Curran Associates, Inc., 2017.
[BH89] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Compu-
tation, 1(1):151–160, 1989.

[BHLM19] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-
dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Ma-
chine Learning Research, 20(63):1–17, 2019.

[BHM18] Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds
for classification and regression rules that interpolate. In S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Pro-
cessing Systems, volume 31, pages 2300–2311. Curran Associates, Inc., 2018.
[BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-
learning practice and the classical bias–variance trade-off. Proceedings of the National Academy
of Sciences, 116(32):15849–15854, 2019.
[BJM06] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk
bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[BL99] P. L. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from
their means. Statistics and Probability Letters, 44(1):55–62, 1999.
[BL20a] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation
of wide neural networks. In International Conference on Learning Representations, 2020.
arXiv:1910.01619.
[BL20b] Peter L. Bartlett and Philip M. Long. Failures of model-dependent generalization bounds for
least-norm interpolation. arXiv preprint arXiv:2010.08479, 2020.
[BLLT20] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting
in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070,
2020.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: a
Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[BM02] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and
structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[BMM98] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise
polynomial networks. Neural Computation, 10(8):2159–2173, 1998.
[BMM18] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to
understand kernel learning. In International Conference on Machine Learning, pages 541–549,
2018.
[BR92] Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural
Networks, 5(1):117–127, 1992.
[Bre98] Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
[BRT19] Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation con-
tradict statistical optimality? In The 22nd International Conference on Artificial Intelligence
and Statistics, pages 1611–1619. PMLR, 2019.
[CB18] Lénaı̈c Chizat and Francis Bach. On the global convergence of gradient descent for over-
parameterized models using optimal transport. In S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Pro-
cessing Systems, volume 31, pages 3036–3046. Curran Associates, Inc., 2018.
[CB20] Lénaı̈c Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural
networks trained with the logistic loss. In Jacob Abernethy and Shivani Agarwal, editors,
Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of
Machine Learning Research, pages 1305–1338. PMLR, 2020. arXiv:2002.04486.

[CDV07] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares
algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[CH67] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions
on Information Theory, 13(1):21–27, 1967.

[CLG01] Rich Caruana, Steve Lawrence, and C. Giles. Overfitting in neural nets: Backpropagation,
conjugate gradient, and early stopping. In T. Leen, T. Dietterich, and V. Tresp, editors,
Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.
[CO10] Amin Coja-Oghlan. A better algorithm for random k-SAT. SIAM Journal on Computing,
39(7):2823–2864, 2010.

[COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable pro-
gramming. In Advances in Neural Information Processing Systems, pages 2937–2947, 2019.
[CS13] Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices.
Random Matrices: Theory and Applications, 2(04):1350010, 2013.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–
297, 1995.
[CX21] Lin Chen and Sheng Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS.
In International Conference on Learning Representations, 2021. arXiv:2009.10683.
[DC95] Harris Drucker and Corinna Cortes. Boosting decision trees. In Proceedings of the 8th In-
ternational Conference on Neural Information Processing Systems, page 479–485, Cambridge,
MA, USA, 1995. MIT Press.
[DFKU13] Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk com-
parison of ordinary least squares vs ridge regression. Journal of Machine Learning Research,
14(10):1505–1511, 2013.

[DGA20] Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from Feynman diagrams. In
International Conference on Learning Representations, 2020. arXiv:1909.11304.
[DGK98] Luc Devroye, Laszlo Györfi, and Adam Krzyżak. The Hilbert kernel regression estimate.
Journal of Multivariate Analysis, 65(2):209–227, 1998.

[DLL+ 19] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds
global minima of deep neural networks. In International Conference on Machine Learning,
pages 1675–1685, 2019.
[DSS95] Bhaskar DasGupta, Hava T. Siegelmann, and Eduardo D. Sontag. On the complexity of
training neural networks with continuous activation functions. IEEE Transactions on Neural
Networks, 6(6):1490–1504, 1995.
[DV13] Yen Do and Van Vu. The spectrum of random kernel matrices: universality results for rough
and varying kernels. Random Matrices: Theory and Applications, 2(03):1350005, 2013.
[DW79] Luc Devroye and Terry Wagner. Distribution-free inequalities for the deleted and holdout
error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979.
[DZPS19] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably
optimizes over-parameterized neural networks. In International Conference on Learning Rep-
resentations, 2019. arXiv:1810.02054.

[EAM15] Ahmed El Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with
statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783,
2015.
[EHKV89] A. Ehrenfeucht, David Haussler, Michael J. Kearns, and Leslie G. Valiant. A general lower
bound on the number of examples needed for learning. Information and Computation, 82:247–
261, 1989.
[EK10] Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics,
38(1):1–50, 2010.
[Fel20] Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages
954–959, 2020.
[FHT01] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning.
Springer, 2001.
[FM19] Zhou Fan and Andrea Montanari. The spectral norm of random inner-product kernel matrices.
Probability Theory and Related Fields, 173(1-2):27–85, 2019.
[Fri01] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Ann.
Statist., 29(5):1189–1232, 10 2001.
[FS96] Alan Frieze and Stephen Suen. Analysis of two simple heuristics on a random instance of
k-SAT. Journal of Algorithms, 20(2):312–355, 1996.
[FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139,
1997.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory
of NP-Completeness. W. H. Freeman and Company, 1979.
[GLSS18a] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias
in terms of optimization geometry. volume 80 of Proceedings of Machine Learning Research,
pages 1832–1841, Stockholmsmässan, Stockholm Sweden, 2018. PMLR.
[GLSS18b] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient de-
scent on linear convolutional networks. In Advances in Neural Information Processing Systems,
pages 9461–9471, 2018.
[GMKZ19] Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. Modelling the influ-
ence of data structure on learning in neural networks. arXiv:1909.11500, 2019.
[GMMM20a] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-
layers neural networks in high dimension. arXiv:1904.12191. Annals of Statistics (To appear).,
2020.

[GMMM20b] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural
networks outperform kernel methods? In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan,
and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages
14820–14830. Curran Associates, Inc., 2020.

[GRM+ 20] Sebastian Goldt, Galen Reeves, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. The
gaussian equivalence of generative models for learning with two-layer neural networks. arXiv
preprint arXiv:2006.14709, 2020.
[GRS18] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity
of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors,
Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine
Learning Research, pages 297–299. PMLR, 06–09 Jul 2018.
[GSd+ 19] Mario Geiger, Stefano Spigler, Stéphane d’Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio
Biroli, and Matthieu Wyart. Jamming transition as a paradigm to understand the loss land-
scape of deep neural networks. Physical Review E, 100(1):012115, 2019.

[GWB+ 17] Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan
Srebro. Implicit regularization in matrix factorization. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 6152–6160, 2017.
[GYK+ 20] Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Basri Ronen.
On the similarity between the Laplace and neural tangent kernels. In H. Larochelle, M. Ran-
zato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, pages 1451–1461. Curran Associates, Inc., 2020. arXiv:2007.01580.
[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other
learning applications. Information and Computation, 100(1):78–150, 1992.

[HL20] Hong Hu and Yue M Lu. Universality laws for high-dimensional learning with random features.
arXiv:2009.07669, 2020.
[HMD15] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural
networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149, 2015.
[HMRT20] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-
dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560v5, 2020.
[HN20] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel.
In International Conference on Learning Representations, 2020. arXiv:1909.05989.
[HY20] Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent
hierarchy. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International
Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research,
pages 4542–4551. PMLR, 13–18 Jul 2020.
[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence
and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 31, pages 8571–8580. Curran Associates, Inc., 2018.
[Joh19] Iain M. Johnstone. Gaussian Estimation: Sequence and Wavelet Models. 2019. Manuscript,
available at http://statweb.stanford.edu/~imj/.
[JP78] David S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical Computer
Science, 6:93–107, 1978.
[JT18] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv
preprint arXiv:1803.07300, 2018.

[JT19] Ziwei Ji and Matus Telgarsky. A refined primal-dual analysis of the implicit bias. arXiv
preprint arXiv:1906.04540, 2019.
[Jud90] J. S. Judd. Neural Network Design and the Complexity of Learning. MIT Press, 1990.
[KL17] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for
sample covariance operators. Bernoulli, 23(1):110–133, 2017.
[KM97] Marek Karpinski and Angus J. Macintyre. Polynomial bounds for VC dimension of sigmoidal
and general Pfaffian neural networks. Journal of Computer and System Sciences, 54:169–176,
1997.
[KM15] Vladimir Koltchinskii and Shahar Mendelson. Bounding the smallest singular value of
a random matrix without concentration. International Mathematics Research Notices,
2015(23):12991–13008, 2015.
[Kol01] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions
on Information Theory, 47(5):1902–1914, July 2001.

[Kol06] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk mini-
mization. Annals of Statistics, 34:2593–2656, 2006.
[KP00] V. I. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function
learning. In Evarist Giné, David M. Mason, and Jon A. Wellner, editors, High Dimensional
Probability II, volume 47, pages 443–459. Birkhäuser, 2000.

[KS01] Vera Kurková and Marcello Sanguineti. Bounds on rates of variable-basis and neural-network
approximation. IEEE Transactions on Information Theory, 47(6):2659–2665, 2001.
[KS02] Vera Kurková and Marcello Sanguineti. Comparison of worst case errors in linear and neural
network approximation. IEEE Transactions on Information Theory, 48(1):264–275, 2002.

[KT20] Ganesh Ramachandra Kini and Christos Thrampoulidis. Analytic study of double descent in
binary classification: The impact of loss. In IEEE International Symposium on Information
Theory, ISIT 2020, Los Angeles, CA, USA, June 21-26, 2020, pages 2527–2532. IEEE, 2020.
arXiv:2001.11572.
[Kur97] Věra Kurková. Dimension-independent rates of approximation by neural networks. In Com-
puter Intensive Methods in Control and Signal Processing, pages 261–270. Springer, 1997.
[KY17] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability Theory
and Related Fields, 169(1-2):257–352, 2017.
[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[LBW96] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks
with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118–2132, 1996.
[Led01] Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathe-
matical Society, 2001.
[LGT97] Steve Lawrence, C. Lee Giles, and Ah Chung Tsoi. Lessons in neural network training: Over-
fitting may be harder than expected. In Proceedings of the Fourteenth National Conference
on Artificial Intelligence, AAAI-97, pages 540–545. AAAI Press, 1997.
[Lia20] Tengyuan Liang, 2020. Personal communication.

[Lin04] Y. Lin. A note on margin-based loss functions in classification. Statistics and Probability
Letters, 68:73–82, 2004.
[LLC18] Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural
networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.

[LMZ18] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-
parameterized matrix sensing and neural networks with quadratic activations. In Conference
On Learning Theory, pages 2–47. PMLR, 2018.
[LR20] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can
generalize. Annals of Statistics, 48(3):1329–1347, 2020.

[LRS15] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Lo-
calization through offset Rademacher complexity. In Peter Grünwald, Elad Hazan, and Satyen
Kale, editors, Proceedings of the 28th Conference on Learning Theory, volume 40 of Proceedings
of Machine Learning Research, pages 1260–1285, Paris, France, 03–06 Jul 2015. PMLR.
[LRZ20] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-
norm interpolants and restricted lower isometry of kernels. In Jacob Abernethy and Shivani
Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of
Proceedings of Machine Learning Research, pages 2683–2711. PMLR, 2020. arXiv:1908.10292.
[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes.
Springer, 1991.

[LV04] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods.
Annals of Statistics, 32:30–55, 2004.
[LZB20] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models:
when and why the tangent kernel is constant. In H. Larochelle, M. Ranzato, R. Hadsell, M. F.
Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33,
pages 15954–15964. Curran Associates, Inc., 2020.
[Men02] Shahar Mendelson. Improving the sample complexity using global data. IEEE Transactions
on Information Theory, 48:1977–1991, 2002.
[Men20] Shahar Mendelson. Extending the scope of the small-ball method. Studia Mathematica, pages
1–21, 2020.
[MM19] Song Mei and Andrea Montanari. The generalization error of random features regression: Pre-
cise asymptotics and double descent curve. Communications in Pure and Applied Mathematics
(To appear), 2019. arXiv:1908.05355.

[MMM19] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers
neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory,
pages 2388–2464, 2019.
[MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random fea-
tures and kernel methods: hypercontractivity and kernel matrix concentration. arXiv preprint
arXiv:2101.10588, 2021.

[MMN18] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of
two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–
E7671, 2018.

[MP90] Gale Martin and James Pittman. Recognizing hand-printed letters and digits. In D. Touretzky,
editor, Advances in Neural Information Processing Systems, volume 2. Morgan-Kaufmann,
1990.
[MRSY19] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of
max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime.
arXiv preprint arXiv:1911.01544, 2019.
[MZ20] Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks:
Memorization and generalization under lazy training. arXiv:2007.12826, 2020.
[Nad64] Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications,
9(1):141–142, 1964.
[NK19] Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain
generalization in deep learning. In NeurIPS, pages 11611–11622, 2019.
[NLG+ 19] Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese,
Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In
The 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428.
PMLR, 2019.
[NP20] Phan-Minh Nguyen and Huy Tuan Pham. A rigorous framework for the mean field limit of
multilayer neural networks. arXiv:2001.11443, 2020.
[NS17] Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles.
arXiv:1712.05438, 2017.
[NTS15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in
neural networks. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of the
28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research,
pages 1376–1401. PMLR, 2015.
[NTSS17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry
of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071,
2017.
[OS19] Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient
descent takes the shortest path? In International Conference on Machine Learning, pages
4951–4960. PMLR, 2019.
[OS20] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global
convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas
in Information Theory, 1(1):84–105, 2020.
[Pol90] David Pollard. Empirical Processes: Theory and Applications, volume 2. Institute of Mathe-
matical Statistics, 1990.
[Pol95] David Pollard. Uniform ratio limit theorems for empirical processes. Scandinavian Journal of
Statistics, 22:271–278, 1995.
[PW17] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In
Advances in Neural Information Processing Systems, pages 2637–2646, 2017.
[PW18] Jeffrey Pennington and Pratik Worah. The spectrum of the Fisher information matrix of
a single-hidden-layer neural network. Advances in Neural Information Processing Systems,
31:5410–5419, 2018.

[Qui96] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National
Conference on Artificial Intelligence, pages 725–730, 1996.
[RCR15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström com-
putational regularization. In Advances in Neural Information Processing Systems, volume 28,
pages 1657–1665, 2015.

[RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances
in Neural Information Processing Systems, volume 20, pages 1177–1184, 2007.
[RR09] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimiza-
tion with randomization in learning. In Advances in Neural Information Processing Systems,
volume 22, pages 1313–1320, 2009.
[RR17] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random
features. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob
Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information
Processing Systems 30, pages 3215–3225, 2017.

[RST17] Alexander Rakhlin, Karthik Sridharan, and Alexandre B Tsybakov. Empirical entropy, mini-
max regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
[RV06] M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex
bodies. Annals of Mathematics, 164(2):603–648, 2006.

[RVE18] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time
convergence and asymptotic error scaling of neural networks. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 31. Curran Associates, Inc., 2018. arXiv:1805.00915.
[RZ19] Alexander Rakhlin and Xiyu Zhai. Consistency of interpolation with Laplace kernels is a
high-dimensional phenomenon. In Conference on Learning Theory, pages 2595–2623, 2019.
[San15] Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations,
PDEs, and Modeling, volume 87. Birkhäuser, 2015.
[SFBL98] Robert E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–
1686, 1998.
[SGd+ 19] Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu
Wyart. A jamming transition from under-to over-parametrization affects generalization in
deep learning. Journal of Physics A: Mathematical and Theoretical, 52(47):474001, 2019.

[SHN+ 18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The
implicit bias of gradient descent on separable data. The Journal of Machine Learning Research,
19(1):2822–2878, 2018.
[SS20] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A
law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.

[SST10] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with a
smooth loss. arXiv:1009.3896, 2010.
[Tal94] M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability,
22:28–76, 1994.

[TB20] Alexander Tsigler and Peter L Bartlett. Benign overfitting in ridge regression. arXiv preprint
arXiv:2009.14286, 2020.
[Tel13] Matus Telgarsky. Margins, shrinkage, and boosting. In International Conference on Machine
Learning, pages 307–315, 2013.

[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society (Series B), 58:267–288, 1996.
[TPT20] Hossein Taheri, Ramtin Pedarsani, and Christos Thrampoulidis. Fundamental limits of ridge-
regularized empirical risk minimization in high dimensions. arXiv:2006.08917, 2020.
[Tsy08] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Busi-
ness Media, 2008.
[VC71] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.
[VC74] V. N. Vapnik and A. Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.

[vdG90] Sara van de Geer. Estimating a regression function. Annals of Statistics, 18:907–924, 1990.
[Ver18] Roman Vershynin. High-Dimensional Probability. An Introduction with Applications in Data
Science. Cambridge University Press, 2018.
[Vu98] Van H. Vu. On the infeasibility of training neural networks with small squared errors. In
Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Infor-
mation Processing Systems 10, pages 371–377. MIT Press, 1998.
[Was13] Larry Wasserman. All of Statistics: a Concise Course in Statistical Inference. Springer Science
& Business Media, 2013.
[Wat64] Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics,
Series A, pages 359–372, 1964.
[WOBM17] Abraham J Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success
of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning
Research, 18(1):1558–1590, 2017.

[WS01] Christopher KI Williams and Matthias Seeger. Using the Nyström method to speed up kernel
machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
[ZBH+ 17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under-
standing deep learning requires rethinking generalization. In 5th International Conference on
Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings. OpenReview.net, 2017. arXiv:1611.03530.
[ZCZG20] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-
parameterized deep ReLU networks. Machine Learning, 109(3):467–492, 2020.
[Zha04] Tong Zhang. Statistical behavior and consistency of classification methods based on convex
risk minimization. Annals of Statistics, 32:56–85, 2004.

[ZY05] Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The
Annals of Statistics, 33(4):1538–1579, 2005.

A Kernels on Rd with d  n
A.1 Bound on the variance of the minimum-norm interpolant
Lemma A.1. For any $X \in \mathbb{R}^{n\times d}$ and any positive semidefinite $\Sigma \in \mathbb{R}^{d\times d}$, for $n \lesssim d$ and any $k < d$,
\[
\operatorname{tr}\big( (XX^T + d\gamma I_n)^{-2} X\Sigma X^T \big) \lesssim \frac{1}{\gamma}\Big( \frac{\lambda_1 k}{n} + \lambda_{k+1} \Big), \tag{158}
\]
where $\lambda_1 \ge \dots \ge \lambda_d$ are the eigenvalues of $\Sigma$.

Proof. This deterministic argument is due to T. Liang [Lia20]. We write $\Sigma = \Sigma_{\le k} + \Sigma_{>k}$, with $\Sigma_{\le k} = \sum_{i\le k} \lambda_i u_i u_i^T$. Then by the argument in [LR20, Remark 5.1],
\[
\operatorname{tr}\big( (XX^T + d\gamma I_n)^{-2} X\Sigma_{>k} X^T \big) \le \lambda_{k+1} \sum_{i=1}^n \frac{\hat\lambda_i}{(d\gamma + \hat\lambda_i)^2} \le \lambda_{k+1}\, \frac{n}{4d\gamma} \lesssim \frac{\lambda_{k+1}}{\gamma}, \tag{159}
\]
where $\hat\lambda_i$ are the eigenvalues of $XX^T$. Here we use the fact that $\frac{t}{(r+t)^2} \le \frac{1}{4r}$ for all $t, r > 0$. On the other hand,
\[
\operatorname{tr}\big( (XX^T + d\gamma I_n)^{-2} X\Sigma_{\le k} X^T \big) \le \sum_{i\le k} \lambda_i \big\| (d\gamma I_n + XX^T)^{-1} X u_i \big\|^2. \tag{160}
\]
Now, using an argument similar to that in [BLLT20], we define $A_{-i} = d\gamma I_n + X(I_n - u_i u_i^T)X^T$, $v = Xu_i$, and write
\[
\big\| (d\gamma I_n + XX^T)^{-1} X u_i \big\|^2 = \big\| (A_{-i} + vv^T)^{-1} v \big\|^2 = \frac{v^T A_{-i}^{-2} v}{(1 + v^T A_{-i}^{-1} v)^2} \tag{161}
\]
by the Sherman–Morrison formula. The last quantity is upper bounded by
\[
\frac{1}{d\gamma}\, \frac{v^T A_{-i}^{-1} v}{(1 + v^T A_{-i}^{-1} v)^2} \le \frac{1}{4\gamma d}. \tag{162}
\]
Substituting in (160), we obtain an upper bound of
\[
\frac{1}{4\gamma d} \sum_{i\le k} \lambda_i \lesssim \frac{\lambda_1 k}{\gamma n},
\]
assuming $n \lesssim d$.
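The following sketch evaluates both sides of (158) numerically for a Gaussian design with a diagonal power-law covariance; the spectrum, γ and k are arbitrary choices, and the comparison ignores the absolute constant hidden in the ≲ sign:

    # Hedged numerical check of the bound (158) for a random design; spectrum, gamma and k are
    # arbitrary, and the right-hand side is an upper bound only up to an absolute constant.
    import numpy as np

    rng = np.random.default_rng(5)
    n, d, gamma, k = 100, 400, 0.1, 10
    lam = 1.0 / np.arange(1, d + 1) ** 2               # eigenvalues of Sigma (descending)
    Sigma = np.diag(lam)
    X = rng.standard_normal((n, d)) * np.sqrt(lam)     # rows ~ N(0, Sigma)

    A = X @ X.T + d * gamma * np.eye(n)
    lhs = np.trace(np.linalg.solve(A, np.linalg.solve(A, X @ Sigma @ X.T)))
    rhs = (lam[0] * k / n + lam[k]) / gamma            # (1/gamma) * (lambda_1 k / n + lambda_{k+1})
    print("lhs =", lhs)
    print("rhs =", rhs, " (upper bound up to an absolute constant)")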

A.2 Exact characterization in the proportional asymptotics


We will denote by $K = (h(\langle x_i, x_j\rangle/d))_{i,j\le n}$ the kernel matrix. We will also denote by $K_1$ the linearized kernel
\[
K_1 = \beta\, \frac{XX^T}{d} + \beta\gamma I_n + \alpha \mathbf{1}\mathbf{1}^T, \tag{163}
\]
\[
\alpha := h(0) + h''(0)\, \frac{\operatorname{tr}(\Sigma^2)}{2d^2}, \qquad \beta := h'(0), \tag{164}
\]
\[
\gamma := \frac{1}{h'(0)}\Big( h(\operatorname{tr}(\Sigma)/d) - h(0) - h'(0)\operatorname{tr}(\Sigma/d) \Big). \tag{165}
\]

Assumption 4.12. We assume that the coordinates of $z = \Sigma^{-1/2} x$ are independent, with zero mean and unit variance, so that $\Sigma = \mathbb{E}\{xx^T\}$. Further assume there are constants $0 < \eta, M < \infty$ such that the following hold.
(a) For all $i \le d$, $\mathbb{E}[|z_i|^{8+\eta}] \le M$.
(b) $\|\Sigma\| \le M$ and $d^{-1}\sum_{i=1}^d \lambda_i^{-1} \le M$, where $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $\Sigma$.
Theorem 4.13. Let $0 < M, \eta < \infty$ be fixed constants and suppose that Assumption 4.12 holds with $M^{-1} \le d/n \le M$. Further assume that $h$ is continuous on $\mathbb{R}$ and smooth in a neighborhood of $0$ with $h(0), h'(0) > 0$, that $\|f^*\|_{L^{4+\eta}(\mathbb{P})} \le M$ and that the $z_i$'s are $M$-sub-Gaussian. Let $y_i = f^*(x_i) + \xi_i$, $\mathbb{E}(\xi_i^2) = \sigma_\xi^2$, and $\beta_0 := \Sigma^{-1}\mathbb{E}[x f^*(x)]$. Let $\lambda_* > 0$ be the unique positive solution of
\[
n\Big( 1 - \frac{\gamma}{\lambda_*} \Big) = \operatorname{tr}\big( \Sigma(\Sigma + \lambda_* I)^{-1} \big). \tag{166}
\]
Define $B(\Sigma, \beta_0)$ and $V(\Sigma)$ by
\[
V(\Sigma) := \frac{\operatorname{tr}\big( \Sigma^2(\Sigma + \lambda_* I)^{-2} \big)}{n - \operatorname{tr}\big( \Sigma^2(\Sigma + \lambda_* I)^{-2} \big)}, \tag{167}
\]
\[
B(\Sigma, \beta_0) := \frac{\lambda_*^2\, \langle \beta_0, (\Sigma + \lambda_* I)^{-2}\Sigma\, \beta_0\rangle}{1 - n^{-1}\operatorname{tr}\big( \Sigma^2(\Sigma + \lambda_* I)^{-2} \big)}. \tag{168}
\]
Finally, let $\widehat{\mathsf{bias}}^2$ and $\widehat{\mathsf{var}}$ denote the squared bias and variance for the minimum-norm interpolant. Then there exist $C, c_0 > 0$ (depending also on the constants in Assumption 4.12) such that the following holds with probability at least $1 - Cn^{-1/4}$ (here $P_{>1}$ denotes the projector orthogonal to affine functions in $L^2(\mathbb{P})$):
\[
\Big|\, \widehat{\mathsf{bias}}^2 - B(\Sigma, \beta_0) - \|P_{>1} f^*\|_{L^2}^2\,(1 + V(\Sigma)) \,\Big| \le C n^{-c_0}, \tag{169}
\]
\[
\Big|\, \widehat{\mathsf{var}} - \sigma_\xi^2\, V(\Sigma) \,\Big| \le C n^{-c_0}. \tag{170}
\]
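For concreteness, the following sketch solves the fixed-point equation (166) by bisection and evaluates the formulas (167)–(168); the choices of the spectrum of Σ, of γ, of β0 and of the aspect ratio d/n are made only for illustration:

    # Hedged sketch: solve (166) for lambda_* by bisection and evaluate V(Sigma), B(Sigma, beta_0).
    import numpy as np

    n, d, gamma = 300, 600, 0.5
    lam = 1.0 / np.arange(1, d + 1)                    # eigenvalues of Sigma (diagonal model)
    beta0 = np.zeros(d); beta0[:10] = 1.0

    def fixed_point_residual(ls):
        return n * (1.0 - gamma / ls) - np.sum(lam / (lam + ls))

    lo, hi = 1e-8, 1e8                                 # the residual is increasing in lambda_*
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        if fixed_point_residual(mid) < 0: lo = mid
        else: hi = mid
    lam_star = np.sqrt(lo * hi)

    t2 = np.sum(lam**2 / (lam + lam_star)**2)          # tr[Sigma^2 (Sigma + lambda_* I)^(-2)]
    V = t2 / (n - t2)
    B = lam_star**2 * np.sum(beta0**2 * lam / (lam + lam_star)**2) / (1.0 - t2 / n)
    print("lambda_* =", lam_star, " V(Sigma) =", V, " B(Sigma, beta0) =", B)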

Remark A.1. The result for the variance will be proved under weaker assumptions and in a stronger form
than stated. In particular, it does not require any assumption on the target function $f^*$, and it holds with
smaller error terms than stated.
Remark A.2. Notice that, by positive definiteness of the kernel, we have $h'(0), h''(0) \ge 0$. Hence the
condition that these are strictly positive is essentially a non-degeneracy requirement.
We note for future reference that the target function $f^*$ is decomposed as
\[
f^*(x) = b_0 + \langle \beta_0, x\rangle + P_{>1} f^*(x), \tag{171}
\]
where $b_0 := \mathbb{E}\{f^*(x)\}$ and $\beta_0 := \Sigma^{-1}\mathbb{E}[x f^*(x)]$ as defined above, and $\mathbb{E}\{P_{>1} f^*(x)\} = 0$, $\mathbb{E}\{x\, P_{>1} f^*(x)\} = 0$.

A.2.1 Preliminaries
Throughout the proof, we will use $C$ for constants that depend only on the constants in Assumption
4.12 and Theorem 4.13. We also write that an inequality holds with very high probability if, for any $A > 0$,
we can choose the constant $C$ in the inequality such that the inequality holds with probability at least $1 - n^{-A}$ for
all $n$ large enough.
We will repeatedly use the following bound, see e.g. [EK10].
Lemma A.2. Under the assumptions of Theorem 4.13, we have, with very high probability,
\[
K = K_1 + \Delta, \qquad \|\Delta\| \le n^{-c_0}. \tag{172}
\]
In particular, as long as $h$ is nonlinear, we have $K \succeq c_* I_n$, $c_* = \beta\gamma > 0$, with probability at least $1 - Cn^{-D}$.
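A quick numerical illustration of this approximation, for an arbitrary smooth nonlinear h (here h(t) = e^t), an arbitrary diagonal covariance and moderate sizes, builds K and the linearization K1 from (163)–(165) and compares them:

    # Hedged sketch illustrating Lemma A.2: for a smooth nonlinear h, the kernel matrix K is close,
    # in operator norm, to its linearization K1.  h, the covariance and the sizes are arbitrary.
    import numpy as np

    rng = np.random.default_rng(6)
    n, d = 300, 450
    lam = 0.5 + rng.random(d)                          # eigenvalues of Sigma, bounded away from 0
    X = rng.standard_normal((n, d)) * np.sqrt(lam)     # rows ~ N(0, diag(lam))

    h, h0, h1, h2 = np.exp, 1.0, 1.0, 1.0              # h(0), h'(0), h''(0) for h = exp
    K = h(X @ X.T / d)

    alpha = h0 + h2 * np.sum(lam**2) / (2 * d**2)      # eq. (164)
    beta = h1
    gamma = (h(np.sum(lam) / d) - h0 - h1 * np.sum(lam) / d) / h1   # eq. (165)
    ones = np.ones((n, 1))
    K1 = beta * (X @ X.T) / d + beta * gamma * np.eye(n) + alpha * (ones @ ones.T)  # eq. (163)

    print("||K - K1||_op      :", np.linalg.norm(K - K1, 2))
    print("typical entry of K :", np.median(np.abs(K)))
    print("||K||_op           :", np.linalg.norm(K, 2))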

Define the matrix $M \in \mathbb{R}^{n\times n}$ and the vector $v \in \mathbb{R}^n$ by
\[
M_{ij} := \mathbb{E}_x\Big\{ h\Big(\frac{1}{d}\langle x_i, x\rangle\Big)\, h\Big(\frac{1}{d}\langle x_j, x\rangle\Big) \Big\}, \tag{173}
\]
\[
v_i := \mathbb{E}_x\Big\{ h\Big(\frac{1}{d}\langle x_i, x\rangle\Big)\, f^*(x) \Big\}. \tag{174}
\]
Our first lemma provides useful approximations of these quantities.
Lemma A.3. Define (here expectations are over $G \sim \mathsf{N}(0,1)$):
\[
v_0 := a_0 b_0 + \frac{1}{d}\, h'(0)\, X\Sigma\beta_0, \tag{175}
\]
\[
a_{i,0} := \mathbb{E}\Big\{ h\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big) \Big\}, \qquad Q_{ij} := \frac{1}{d}\langle x_i, \Sigma x_j\rangle, \tag{176}
\]
and
\[
M_0 := a\, a^T + B, \qquad B := \frac{1}{d}\, D Q D, \tag{177}
\]
\[
a_i := a_{i,0} + a_{i,1}, \qquad a_{i,1} = \frac{1}{6}\, h^{(3)}(0)\, \Big(\frac{Q_{ii}}{d}\Big)^{3/2} \sum_{j=1}^d \frac{(\Sigma^{1/2} x_i)_j^3}{\|\Sigma^{1/2} x_i\|_2^3}\, \mathbb{E}(z_j^3), \tag{178}
\]
\[
D := \operatorname{diag}(D_1, \dots, D_n), \qquad D_i := \mathbb{E}\Big\{ h'\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big) \Big\}. \tag{179}
\]
Then the following hold with very high probability (in other words, for any $A > 0$ there exists $C$ such that the following hold with probability at least $1 - n^{-A}$ for all $n$ large enough):
\[
\max_{i\le n} \big| v_i - v_{0,i} \big| \le C\, \frac{\log d}{d^{3/2}}, \tag{180}
\]
\[
\max_{i\ne j\le n} \big| M_{ij} - M_{0,ij} \big| \le C\, \frac{\log d}{d^{5/2}}, \tag{181}
\]
\[
\max_{i\le n} \big| M_{ii} - M_{0,ii} \big| \le C\, \frac{\log d}{d^{2}}. \tag{182}
\]
In particular, this implies $\|v - v_0\|_2 \le C d^{-1}\log d$ and $\|M - M_0\|_F \le C d^{-3/2}\log d$.
Proof. Throughout the proof we will work on the intersection $E_1 \cap E_2$ of the following events, which hold with very high probability by standard concentration arguments. These events are defined by
\[
E_1 := \Big\{ C^{-1} \le \tfrac{1}{\sqrt d}\|\Sigma z_i\|_2 \le C;\ \ \tfrac{1}{\sqrt d}\|\Sigma z_i\|_\infty \le C\sqrt{\tfrac{\log d}{d}}\ \ \forall i \le n \Big\} \tag{183}
\]
\[
= \Big\{ C^{-2} \le \tfrac{1}{d}\langle x_i, \Sigma x_i\rangle \le C^2;\ \ \tfrac{1}{\sqrt d}\|\Sigma^{1/2} x_i\|_\infty \le C\sqrt{\tfrac{\log d}{d}}\ \ \forall i \le n \Big\}, \tag{184}
\]
and
\[
E_2 := \Big\{ \tfrac{1}{d}\Big|\sum_{\ell=1}^d (\Sigma z_i)_\ell (\Sigma z_j)_\ell^2\Big| \le \tfrac{\log d}{d^{1/2}};\ \ \tfrac{1}{d}|\langle z_i, \Sigma z_j\rangle| \le C\sqrt{\tfrac{\log d}{d}};\ \ \tfrac{1}{d}|\langle z_i, \Sigma^2 z_j\rangle| \le C\sqrt{\tfrac{\log d}{d}}\ \ \forall i\ne j \le n \Big\} \tag{185}
\]
\[
= \Big\{ \tfrac{1}{d}\Big|\sum_{\ell=1}^d (\Sigma^{1/2} x_i)_\ell (\Sigma^{1/2} x_j)_\ell^2\Big| \le \tfrac{\log d}{d^{1/2}};\ \ \tfrac{1}{d}|\langle x_i, x_j\rangle| \le C\sqrt{\tfrac{\log d}{d}};\ \ \tfrac{1}{d}|\langle x_i, \Sigma x_j\rangle| \le C\sqrt{\tfrac{\log d}{d}}\ \ \forall i\ne j \le n \Big\}.
\]

Recall that, by assumption, $h$ is smooth on an interval $[-t_0, t_0]$, $t_0 > 0$. On the event $E_2$, we have $\langle x_i, x_j\rangle/d \in [-t_0, t_0]$ for all $i \ne j$. If $h$ is not smooth everywhere, we can always modify it outside $[-t_0/2, t_0/2]$ to obtain a kernel $\tilde h$ that is smooth everywhere. Since $x$ is sub-Gaussian, as long as $\|x_i\|/\sqrt d \le C$ for all $i \le n$ (this happens on $E_1$) we have (for $x \sim \mathbb{P}$) $\langle x_i, x\rangle/d \in [-t_0/2, t_0/2]$ with probability at least $1 - e^{-d/C}$. Further using the fact that $f$ is bounded in Eqs. (173), (174), we get
\[
M_{ij} := \mathbb{E}_x\Big\{ \tilde h\Big(\frac{1}{d}\langle x_i, x\rangle\Big)\, \tilde h\Big(\frac{1}{d}\langle x_j, x\rangle\Big) \Big\} + O(e^{-d/C}), \tag{186}
\]
\[
v_i := \mathbb{E}_x\Big\{ \tilde h\Big(\frac{1}{d}\langle x_i, x\rangle\Big)\, f^*(x) \Big\} + O(e^{-d/C}), \tag{187}
\]
where the term O(e−d/C ) is uniform over i, j ≤ n. Analogously, in the definition of v 0 , M 0 (more precisely,
in defining a0 , D), we can replace h by h̃ at the price of an O(e−d/C ) error. Since these error terms are
negligible as compared to the ones in the statement, we shall hereafter neglect them and set h̃ = h (which
corresponds to defining arbitrarily the derivatives of h outside a neighborhood of 0).
We denote by $h_{i,k}$ the $k$-th coefficient of $h((Q_{ii}/d)^{1/2} x)$ in the basis of Hermite polynomials. Namely:
\[
h_{i,k} = \mathbb{E}\Big\{ h\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big)\, \mathrm{He}_k(G) \Big\} = \Big(\frac{Q_{ii}}{d}\Big)^{k/2} \mathbb{E}\Big\{ h^{(k)}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big) \Big\}. \tag{188}
\]
Here h(k) denotes the k-th derivative of h (recall that by the argument above we can assume, without loss
of generality, that h is k-times differentiable).
We write $h_{i,>k}$ for the remainder after the first $k$ terms of the Hermite expansion have been removed:
\[
h_{i,>k}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, x\Big) := h\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, x\Big) - \sum_{\ell=0}^k \frac{1}{\ell!}\, h_{i,\ell}\, \mathrm{He}_\ell(x) \tag{189}
\]
\[
= h\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, x\Big) - \sum_{\ell=0}^k \frac{1}{\ell!}\Big(\frac{Q_{ii}}{d}\Big)^{\ell/2} \mathbb{E}\Big\{ h^{(\ell)}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big) \Big\}\, \mathrm{He}_\ell(x).
\]

Finally, we denote by $\hat h_{>k}(x)$ the remainder after the first $k$ terms in the Taylor expansion have been subtracted:
\[
\hat h_{>k}(x) := h(x) - \sum_{\ell=0}^k \frac{1}{\ell!}\, h^{(\ell)}(0)\, x^\ell. \tag{190}
\]

Of course $h - \hat h_{>k}$ is a polynomial of degree $k$, and therefore its projection orthogonal to the first $k$ Hermite polynomials vanishes, whence
\[
h_{i,>k}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, x\Big) = \hat h_{>k}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, x\Big) - \sum_{\ell=0}^k \frac{1}{\ell!}\Big(\frac{Q_{ii}}{d}\Big)^{\ell/2} \mathbb{E}\Big\{ \hat h_{>k}^{(\ell)}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big) \Big\}\, \mathrm{He}_\ell(x). \tag{191}
\]
Note that, by smoothness of $h$, we have $|\hat h_{>k}^{(\ell)}(t)| \le C\min(|t|^{k+1-\ell}, 1)$, and therefore
\[
\Big| \frac{1}{\ell!}\Big(\frac{Q_{ii}}{d}\Big)^{\ell/2} \mathbb{E}\Big\{ \hat h_{>k}^{(\ell)}\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big) \Big\} \Big| \le C d^{-(k+1)/2}. \tag{192}
\]



We also have that $|\hat h_{>k}(t)| \le C\min(1, |t|^{k+1})$. Define $v_i = \Sigma^{1/2} x_i/\sqrt d$, $\|v_i\|_2^2 = Q_{ii}$. For any fixed $m \ge 2$, by Eq. (191) and the triangle inequality,
\[
\mathbb{E}_z\Big\{ \Big| h_{i,>k}\Big(\tfrac{1}{\sqrt d}\langle v_i, z\rangle\Big) \Big|^m \Big\}^{1/m}
\overset{(a)}{\le} \mathbb{E}\Big\{ \Big| \hat h_{>k}\Big(\tfrac{1}{\sqrt d}\langle v_i, z\rangle\Big) \Big|^m \Big\}^{1/m}
+ C d^{-(k+1)/2} \sum_{\ell=0}^k \mathbb{E}\Big\{ \Big| \mathrm{He}_\ell\Big(\tfrac{\langle v_i, z\rangle}{\|v_i\|_2}\Big) \Big|^m \Big\}^{1/m}
\]
\[
\le C\Big(\frac{Q_{ii}}{d}\Big)^{(k+1)/2} + C d^{-(k+1)/2} \le C d^{-(k+1)/2}, \tag{193}
\]

where the inequality (a) follows since hv i , zi is C-sub-Gaussian. Note that Eqs. (189), (193) can also be
rewritten as
\[
h\Big(\frac{1}{d}\langle x_i, x\rangle\Big) = \sum_{\ell=0}^k \frac{1}{\ell!}\, h_{i,\ell}\, \mathrm{He}_\ell\Big(\frac{1}{\sqrt{dQ_{ii}}}\langle x_i, x\rangle\Big) + h_{i,>k}\Big(\frac{1}{d}\langle x_i, x\rangle\Big), \tag{194}
\]
\[
\mathbb{E}\Big\{ \Big| h_{i,>k}\Big(\frac{1}{d}\langle x_i, x\rangle\Big) \Big|^m \Big\}^{1/m} \le C d^{-(k+1)/2}. \tag{195}
\]
We next prove Eq. (180). Using Eq. (194) with $k = 2$ and recalling $\mathrm{He}_0(x) = 1$, $\mathrm{He}_1(x) = x$, $\mathrm{He}_2(x) = x^2 - 1$, we get
\[
v_i = \mathbb{E}_x\Big\{ h\Big(\frac{1}{d}\langle x_i, x\rangle\Big) f^*(x) \Big\}
= h_{i,0}\, \mathbb{E}_x\{f^*(x)\} + \frac{h_{i,1}}{\sqrt{dQ_{ii}}}\, \big\langle x_i, \mathbb{E}_x\{x f^*(x)\}\big\rangle
+ \frac{h_{i,2}}{2dQ_{ii}}\, \mathbb{E}_x\big\{ f^*(x)\big(\langle x, x_i\rangle^2 - dQ_{ii}\big)\big\}
+ \mathbb{E}_x\Big\{ h_{i,>2}\Big(\frac{1}{d}\langle x_i, x\rangle\Big) f^*(x) \Big\}
\]
\[
= h_{i,0}\, b_0 + \frac{h_{i,1}}{\sqrt{dQ_{ii}}}\, \langle \Sigma\beta_0, x_i\rangle + \frac{h_{i,2}}{2dQ_{ii}}\, \langle x_i, F_2 x_i\rangle + \mathbb{E}_x\Big\{ h_{i,>2}\Big(\frac{1}{d}\langle x_i, x\rangle\Big) f^*(x) \Big\}.
\]
Here we defined the $d\times d$ matrix $F_2 = \mathbb{E}\{[f^*(x) - b_0]\, xx^T\}$. Recalling the definitions of $h_{i,k}$ in Eq. (188), we get $h_{i,0} = a_{i,0}$. Comparing the other terms, we obtain that the following holds with very high probability:
\[
|v_i - v_{0,i}| \le \frac{1}{d}\Big| \mathbb{E}\Big\{ h'\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big)\Big\} - h'(0)\Big| \cdot \big|\langle \Sigma\beta_0, x_i\rangle\big|
+ \frac{1}{2d^2}\Big| \mathbb{E}\Big\{ h''\Big(\sqrt{\tfrac{Q_{ii}}{d}}\, G\Big)\Big\}\Big| \cdot \big|\langle x_i, F_2 x_i\rangle\big|
+ \Big| \mathbb{E}_x\Big\{ h_{i,>2}\Big(\frac{1}{d}\langle x_i, x\rangle\Big) f^*(x)\Big\}\Big|
\]
\[
\overset{(a)}{\le} \frac{1}{d}\times\frac{C}{d}\times C\sqrt{\log d} + \frac{C}{d^2}\,\big|\langle x_i, F_2 x_i\rangle\big| + C d^{-3/2}
\le \frac{C}{d^2}\,\big|\langle x_i, F_2 x_i\rangle\big| + C d^{-3/2}.
\]
Here the inequality (a) follows since $|\mathbb{E}\{h'(Z) - h'(0)\}| \le C\,\mathbb{E}\{Z^2\}$ by smoothness of $h$ and Taylor expansion, $\max_{i\le n}|\langle \Sigma\beta_0, x_i\rangle| \le C\sqrt{\log n}$ by sub-Gaussian tail bounds, and we used Eq. (195) for the last term.
The proof of Eq. (180) is completed by showing that, with very high probability, $\max_{i\le n}|\langle x_i, F_2 x_i\rangle| \le C\|P_{>1} f^*\|_{L^2}\sqrt d\,\log d$. Without loss of generality, we assume here $\|P_{>1} f^*\|_{L^2} = 1$. In order to show this claim, note that (defining $P_{>0} f^*(x) := f^*(x) - \mathbb{E} f^*(x)$)
\[
\mathbb{E}\langle x_i, F_2 x_i\rangle = \operatorname{tr}(\Sigma F_2) \le C\,\mathbb{E}\{P_{>0} f^*(x)\,\|x\|_2^2\} \le \operatorname{Var}(\|x\|_2^2)^{1/2} \le C\sqrt d. \tag{196}
\]

Further notice that
\[
\|F_2\| = \max_{\|v\|_2 = 1} |\langle v, F_2 v\rangle| \tag{197}
\]
\[
= \max_{\|v\|_2 = 1} \big| \mathbb{E}\big\{ P_{>0} f^*(x)\,\langle v, x\rangle^2 \big\} \big| \tag{198}
\]
\[
\le \max_{\|v\|_2 = 1} \mathbb{E}\big\{ \langle v, x\rangle^4 \big\}^{1/2} \le C. \tag{199}
\]

By the above and the Hanson–Wright inequality,
\[
\mathbb{P}\big( \langle x_i, F_2 x_i\rangle \ge C\sqrt d + t \big) \le 2\exp\Big( -c\Big( \frac{t^2}{\|F_2\|_F^2} \wedge \frac{t}{\|F_2\|} \Big) \Big) \le 2\, e^{-c((t^2/d)\wedge t)}, \tag{200}
\]
and similarly for the lower tail. By taking a union bound over $i \le n$, we obtain $\max_{i\le n}|\langle x_i, F_2 x_i\rangle| \le C\sqrt d\,\log d$ as claimed, thus completing the proof of Eq. (180).
We next prove Eq. (181). We claim that this bound holds for any realization in E1 ∩ E2 . Therefore we
can fix without loss of generality i = 1, j = 2. We use Eq. (194) with k = 4. Using Cauchy-Schwarz and
Eqs. (194), (195), we get
4
X 1
M12 = h1,`1 h2,`2 M1,2 (`1 , `2 ) + ∆12 , (201)
`1 !`2 !
`1 ,`2 =0
  1   1 
M1,2 (`1 , `2 ) := Ex He`1 √ hx1 , xi He`2 √ hx2 , xi , |∆12 | ≤ Cd−5/2 . (202)
dQ11 dQ22

Note that, by Eq. (188), |hik | ≤ Cd−k/2 , and M1,2 (`1 , `2 ) is bounded on the event E1 ∩ E2 , by the sub-
Gaussianity of z. Comparing with Eqs. (177), (179), we get
4
X 1
|M12 − M0,12 | ≤ h1,`1 h2,`2 M1,2 (`1 , `2 ) − M0,12 + Cd−5/2 (203)

`1 !`2 !
`1 ,`2 =0
X
+2 h1,`1 h2,`2 M1,2 (`1 , `2 ) (204)
(`1 ,`2 )∈S
1
+ 2 h1,0 h2,3 M1,2 (0, 3) − a1,0 a2,1 + |a1,1 a2,1 | + Cd−5/2 ,

 6
S := (0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 2) , (205)

where in the inequality we used the identities h1,0 h2,0 M1,2 (0, 0) = h1,0 h2,0 = a1,0 a2,0 , and

1 r Q  r Q 
11 22
h1,1 h2,1 M1,2 (1, 1) = 2 hx1 , Σx2 iEh0 G Eh0 G = B12 .
d d d
We next bound each of the terms above separately.
We begin with the terms (`1 , `2 ) ∈ S. Since by Eq. (188), |hik | ≤ Cd−k/2 , for each of these pairs, we need

to show |M1,2 (`1 , `2 )| ≤ Cd(`1 +`2 −5)/2 log d. Consider (`1 , `2 ) = (0, k), k ∈ {1, 2, 4}. Set w = Σ1/2 x2 / dQ22 ,
Pk
kwk2 = 1, and write Hek (x) = m=0 ck,` x` . If g is a standard Gaussian vector, we have Eg Hek hw, gi) = 0
and therefore
   
M1,2 (0, k) = Ez Hek hw, zi − Eg Hek hw, gi (206)
k
X X 
= ck,` wi1 · · · wi` E(zi1 · · · zi` ) − E(gi1 · · · gi` ) . (207)
`=0 i1 ,...,i` ≤n

Note that the only non-vanishing terms in the above sum are those in which all of the indices appearing in
(i1 , . . . , i` ) appear at least twice, and at least one of the indices appears at least 3 times (because otherwise
the two expectations are equal). This immediately implies M1,2 (0, 1) = M1,2 (0, 2) = 0. Analogously, all
terms ` ≤ 2 vanish in the above sum.
As for k = 4, we have (recalling He4 (x) = x4 − 3x2 ):

X

M1,2 (0, 4) =
wi1 · · · wi4 E(zi1 · · · zi4 ) − E(gi1 · · · gi4 ) (208)
i1 ,...,i4 ≤n
X C log d
wi4 E(zi4 ) − 3 ≤ Ckwk2∞ kwk22 ≤

≤ , (209)
d
i≤n
p
where the last inequality follows since kwk2 = 1 by construction and kwk∞ ≤ C (log d)/d on E1 ∩ E2 .


Next consider (`1 , `2 ) = (1, 2). Setting wi = Σ1/2 xi / dQii , i ∈ {1, 2}, we get
     
M1,2 (1, 2) = Ez He1 hw1 , zi He2 hw2 , zi − Eg He1 hw1 , gi He2 hw2 , gi (210)
n  2 o n  2 o
= Ez hw1 , zi hw2 , zi − Eg hw1 , gi hw2 , gi (211)
X 
= w1,i1 w2,i2 w2,i3 E(zi1 zi2 zi2 ) − E(gi1 gi2 gi3 ) (212)
i1 ,i2 ,i3 ≤n
X n
= w1,i w2,i w2,i E(zi3 ) . (213)
i=1

Therefore, on E1 ∩ E2 ,
X n C log d
2

M1,2 (1, 2) ≤ C w1,i w2,i ≤ . (214)
i=1
d

Next consider (`1 , `2 ) = (1, 3). Proceeding as above (and noting that the degree-one term in He3 does
not contribute), we get
     
M1,2 (1, 3) = Ez He1 hw1 , zi He3 hw2 , zi − Eg He1 hw1 , gi He3 hw2 , gi (215)
n  3 o n  3 o
= Ez hw1 , zi hw2 , zi − Eg hw1 , gi hw2 , gi (216)
X 
= w1,i1 w2,i2 w2,i3 w2,i4 E(zi1 zi2 zi2 zi4 ) − E(gi1 gi2 gi3 gi4 ) (217)
i1 ,...,i4 ≤d
d
X
3
= w1,i w2,i (E(zi4 ) − 3) . (218)
i=1

Therefore, on E1 ∩ E2
d
X
3
C log d
≤ Ckw1 k∞ kw2 k∞ kw2 k22 ≤

M1,2 (1, 3) ≤ C w1,i w2,i . (219)
i=1
d

Finally, for (`1 , `2 ) = (2, 2), proceeding as above we get



d
X
2 2
C log d
M1,2 (2, 2) = w1,i w2,i (E(zi4 ) − 3) ≤ Ckw1 k2∞ kw2 k22 ≤ . (220)


i=1
d

Next consider the term |h1,0 h2,3 M1,2 (0, 3)/6 − a1,0 a2,1 | in Eq. (204). Using the fact that h1,0 = a1,0 is
bounded, we get
1
h1,0 h2,3 M1,2 (0, 3) − a1,0 a2,1 ≤ C h2,3 M1,2 (0, 3) − 6a2,1 . (221)

6

Recalling He3 (x) = x3 − 3x, and letting w = Σ1/2 x2 /kΣ1/2 x2 k2 :


X 
M1,2 (0, 3) = wi1 wi2 wi3 E(zi1 zi2 zi3 ) − E(gi1 gi2 gi3 ) (222)
i1 ,...,i3 ≤d
X
= wi3 E(zi3 ) . (223)
i≤d

p
In particular, on the event E1 ∩ E2 , |M1,2 (0, 3)| ≤ C (log d)/d. Comparing the definitions of a2,1 and h2,3 ,
we get
 3/2 n r Q
Q22 (3) ii
o
G − h(3) (0)

h1,0 h2,3 M1,2 (0, 3) − a1,0 a2,1 ≤ C|M1,2 (0, 3)| × E h

(224)
d d
r
log d 1 1 C(log d)1/2
≤C × 3/2 × 1/2 ≤ . (225)
d d d d5/2

Finally, consider term |a1,1 a2,1 | in Eq. (204). By the above estimates, we get |a2,1 | ≤ Cd−2 (log d)1/2 , and
hence this term is negligible as well. This completes the proof of Eq. (181).
Equation (182) follows by a similar argument, which we omit.

A.2.2 An estimate on the entries of the resolvent


Lemma A.4. Let $Z = (z_{ij})_{i\le n, j\le d}$ be a random matrix with i.i.d. rows $z_1, \dots, z_n \in \mathbb{R}^d$ that are zero mean and $C$-sub-Gaussian. Further assume $C^{-1} \le n/d \le C$. Let $S \in \mathbb{R}^{d\times d}$ be a symmetric matrix such that $0 \preceq S \preceq C I_d$ for some finite constant $C > 1$. Finally, let $g : \mathbb{R}^d \to \mathbb{R}$ be a measurable function such that $\mathbb{E}\{g(z_1)\} = \mathbb{E}\{z_1 g(z_1)\} = 0$ and $\mathbb{E}\{g(z_1)^2\} = 1$.
Then, for any $\lambda > 0$ there exists a finite constant $C$ such that, for any $i \ne j$,
\[
\Big| \mathbb{E}\Big\{ \big(ZSZ^T/d + \lambda I_n\big)^{-1}_{i,j}\, g(z_i)\, g(z_j) \Big\} \Big| \le C d^{-3/2}. \tag{226}
\]

Proof. Without loss of generality, we can consider i = 1, j = 2. Further, we let Z 0 ∈ R(n−2)×d be the matrix
comprising the last n − 2 rows of Z, and U ∈ Rd×2 be the matrix with columns U e1 = z 1 , U e2 = z 2 . We
finally define the matrices R0 ∈ Rd×d and Y = (Yij )i,j≤2 :
−1
R0 := λS 1/2 S 1/2 Z 0T Z 0 S 1/2 /d + λId S 1/2 , (227)
−1
Y := ZSZ T /d + λIn . (228)

Then, by a simple linear algebra calculation, we have


 −1
Y = U T R0 U /d + λI2 , (229)
hz 1 , R0 z 2 i/d
Y12 = − . (230)
(λ + hz 1 , R0 z 2 i/d)(λ + hz 1 , R0 z 2 i/d) − hz 1 , R0 z 2 i2 /d2

Note that since R0  0, we have hz 1 , R0 z 2 i2 ≤ hz 1 , R0 z 1 ihz 2 , R0 z 2 i, and therefore


(1) (2)
Y12 = Y12 + Y12 , (231)
(1) hz 1 , R0 z 2 i/d
Y12 := − , (232)
(λ + hz 1 , R0 z 1 i/d)(λ + hz 2 , R0 z 2 i/d)
(2) 1
|Y12 | ≤ 4 3 |hz 1 , R0 z 2 i|3 . (233)
λ d
Denote by E+ expectation with respect to z 1 , z 2 (conditional on (z i )2<i≤n ). We have

E+ {Y12 g(z 1 ) g(z 2 )} ≤ E+ {Y (1) g(z 1 ) g(z 2 )} + E+ {(Y (2) )2 }1/2 E+ {g(z 1 )2 g(z 2 )2 }1/2

12 12
(1) (2)
≤ E+ {Y12 g(z 1 ) g(z 2 )} + E+ {(Y12 )2 }1/2

(1)
≤ E+ {Y12 g(z 1 ) g(z 2 )} + C d−3/2 .

Here the last step follows by the Hanson-Wright inequality. We therefore only have to bound the first term.
Defining qj := λ + hz j , R0 z j i/d, q j = E+ qj , gj = g(z j ), j ∈ {1, 2},

E+ {Y (1) g1 g2 } ≤ E+ q −2 hz 1 , R0 z 2 i g1 g2
n o
12
d
n
−1
 hz 1 , R0 z 2 i o
+ 2 E+ q1 − q −1 q −2 g1 g2

d
n   hz , R z i o
1 0 2
+ E q1−1 − q −1 q2−1 − q −1 g1 g2

d
(a) n   hz , R z i o
1 0 2
≤ E+ q1−1 − q −1 q2−1 − q −1 g1 g2

d
1 n hz , R z i o
1 0 2
≤ 4 E |q1 − q||q2 − q| |g1 g2 | .
λ d
Here (a) follows from the orthogonality of g(z) to linear functions.
We then conclude
(a)
E+ {Y (1) g1 g2 } ≤ CE+ |q1 − q|8 1/4 E+ (hz 1 , R0 z 2 i/d)4 1/4
 
12
1/4  1/4
≤ CE+ |hz 1 , R0 z 1 i/d − E+ hz 1 , R0 z 1 i/d|8 E+ (hz 1 , R0 z 2 i/d)4


(b)
≤ C(d−1/2 )2 × Cd−1/2 ≤ Cd−3/2 .

Here (a) follows from Hölder’s inequality and (b) from the Hanson-Wright inequality using the fact that
kR0 k is bounded. The proof is completed by taking expectation over (z i )2<i≤n .
Lemma A.5. Under the definitions and assumptions of Lemma A.4, let $Y_{ij} := (ZSZ^T/d + \lambda I_n)^{-1}_{i,j}$. Then, for any tuple of four distinct indices $i, j, k, l$, we have
\[
\Big| \mathbb{E}\big\{ Y_{ij} Y_{kl}\, g(z_i) g(z_j) g(z_k) g(z_l) \big\} \Big| \le C d^{-5/2}. \tag{234}
\]

Proof. The proof is analogous to the one of Lemma A.4. Without loss of generality, we set (i, j, k, l) =
(1, 2, 3, 4), denote by Z 0 ∈ R(n−4)×d the matrix with rows (z ` )`≥5 , and define the d × d matrix
−1 1/2
R0 := λS 1/2 S 1/2 Z 0T Z 0 S 1/2 /d + λIn−2 S . (235)

We then have that Y = (Yij )i,j≤4 is given by

Y = (diag(q) + A)−1 , (236)


qi := q + Qi , q := λ + tr(R0 )/d , Qi = (hz i , R0 z i i − Ehz i , R0 z i i)/d , (237)
(
hz i , R0 z j i/d if i 6= j,
Aij := (238)
0 if i = j.

In what follows we denote by E+ expectation with respect to (z i )i≤4 , with Z 0 fixed. Note that, by the
Hanson-Wright inequality, E+ {|Aij |k }1/k ≤ ck d−1/2 , E+ {|Qi |k }1/k ≤ ck d−1/2 for each k ≥ 1. We next
compute the Taylor expansion of Y12 and Y3,4 in powers of A to get
(1) (2) (3) (4)
Y12 = Y12 + Y12 + Y12 + Y12 , (239)
(1)
Y12 := −q1−1 A12 q2 , (240)
(2)
Y12 := q1−1 A13 q3−1 A32 q2−1 + q1−1 A14 q4−1 A41 q2−1 , (241)
(3)
X
Y12 := − q1−1 A1i1 qi−1
1
Ai1 i2 qi−1
2
Ai2 2 q2−1 , (242)
i1 6=i2 ,i1 6=1i2 6=2

(`)
and similarly for Y34 . It is easy to show that E+ {|Yab |k }1/k ≤ ck d−`/2 , for all k ≥ 1. Therefore, using
E{g(z i )2 } ≤ C and Cauchy-Schwarz inequality, and writing gi = g(z i ):
E{Y (`1 ) Y (`2 ) g1 g2 g3 g4 } + Cd−5/2 .
X
E{Y12 Y34 g1 g2 g3 g4 } = (243)
12 34
`1 +`2 ≤4

The proof is completed by bounding each of the terms above, which we now do. By symmetry it is sufficient
to consider `1 ≤ `2 and therefore we are left with the 4 pairs (`1 , `2 ) ∈ {(1, 1), (1, 2), (1, 3), (2, 2)}.
Term (`1 , `2 ) = (1, 1). By the same argument as in the proof of Lemma A.4, we have |E{Aij qi−1 qj−1 gi gj }| ≤
Cd−3/2 and therefore
E+ {Y (1) Y (1) g1 g2 g3 g4 } = E+ {A12 q −1 q −1 g1 g2 } · E{A34 q −1 q −1 g3 g4 } ≤ Cd−3 .

12 34 1 2 3 4 (244)
(2)
Term (`1 , `2 ) = (1, 2). Note that each of the two terms in the definition of Y34 contributes a summand
with the same structure. Hence we can consider just the one resulting in the largest expectation, say
q3−1 A31 q1−1 A14 q4−1
E+ {Y (1) Y (2) g1 g2 g3 g4 } = 2 E+ {q −1 A12 q −1 q −1 A31 q −1 A14 q −1 g1 g2 g3 g4 }

12 34 1 2 3 1 4
(a)
= 2 E+ {q1−2 A12 (q2−1 − q −1 )(q3−1 − q −1 )A31 A14 (q4−1 − q −1 )g1 g2 g3 g4 }

(b)
≤ CE+ {|A12 |p }1/p E{|A13 |p }1/p E{|A13 |p }1/p E+ {|q2−1 − q −1 |p }1/p E{|q3−1 − q −1 |p }1/p
· E{|q4−1 − q −1 |p }1/p kgk4L2
(c)
≤ Cd−3 .
Here (a) holds because gi is orthogonal to z i for i ∈ {2, 3, 4} and hence the terms q −1 have vanishing
contribution; (b) by Hölder for p = 12, and using the fact that qi−1 is bounded; (c) by the above bounds on
the moments of Aij , Qi , plus |qi−1 − q −1 | ≤ C|Qi |.
Term (`1 , `2 ) = (1, 3). Taking into account symmetries, there are only two distinct terms to consider in the
(3)
sum defining Y34 , which we can identify with the following ones:
E+ {Y (1) Y (3) g1 g2 g3 g4 } ≤ C E+ q −1 A12 q −1 q −1 A31 q −1 A12 q −1 A24 q −1 g1 g2 g3 g4

12 34 1 2 3 1 2 4
+ C E+ q −1 A12 q −1 q −1 A31 q −1 A13 q −1 A34 q −1 g1 g2 g3 g4 =: C · T1 + C · T2 .

1 2 3 1 3 4

Notice that in the first term z 3 only appears in q3 , A31 , and g3 , and similarly z 4 only appears in q4 , A24 ,
and g4 . Hence
T1 = E+ q1−1 A12 q2−1 (q3−1 − q −1 )A31 q1−1 A12 q2−1 A24 (q4−1 − q −1 )g1 g2 g3 g4 ≤ Cd−3 ,


where the last inequality follows again by Hölder. Analogously, for the second term we have
T2 = E+ q1−1 A12 (q2−1 − q −1 )q3−1 A31 q1−1 A32 q3−1 A24 (q4−1 − q −1 )g1 g2 g3 g4 ≤ Cd−3 ,


This proves the desired bound for (`1 , `2 ) = (1, 3).


(2)
Term (`1 , `2 ) = (2, 2). There are four terms that arise from the sum in the definition of Yij . By symmetry,
these are equivalent by pairs
E+ {Y (2) Y (2) g1 g2 g3 g4 } ≤ 2 E+ q −1 A13 q −1 A32 q −1 q −1 A31 q −1 A14 q −1 g1 g2 g3 g4

12 34 1 3 2 3 1 4
+ 2 E+ q1−1 A13 q3−1 A32 q2−1 q3−1 A32 q2−1 A24 q4−1 g1 g2 g3 g4


≤ 2 E+ q1−1 A13 q3−1 A32 (q2−1 − q −1 )q3−1 A31 q1−1 A14 (q4−1 − q −1 )g1 g2 g3 g4


+ 2 E+ (q −1 − q −1 )A13 q −1 A32 q −1 q −1 A32 q −1 A24 (q −1 − q −1 )g1 g2 g3 g4



1 3 2 3 2 4
−3
≤ Cd .
This completes the proof of this lemma.

73
Lemma A.6. Under the definitions and assumptions of Lemma A.5, further assume E{|g(z)|2+η } ≤ C for
some constants 0 < C, η < ∞. for any triple of four distinct indices i, j, k, we have
E{Yij Yjk g(z i )g(z j )2 g(z k )} ≤ Cd−3/2 ,

(245)
2 2 2 −1

E{Yij g(z i ) g(z l ) } ≤ Cd . (246)

Proof. This proof is very similar to the one of Lemma A.5, and we will follow the same notation introduced
there.
(`)
Consider Eq. (245). Without loss of generality, we take (i, j, k) = (1, 2, 3). Since E{|Yij |k } ≤ ck d−`/2 ,
we have
E+ {Y12 Y23 g1 g22 g3 } ≤ E+ {Y (1) Y (1) g1 g22 g3 } + Cd−3/2 .

12 23 (247)

Further
E+ {Y (1) Y (1) g1 g22 g3 } = E+ {q −1 A12 q −2 A23 q −1 g1 g22 g3 }

12 23 1 2 3
= E+ {(q1−1 − q −1 )A12 q2−2 A23 (q3−1 − q −1 )g1 g22 g3 }

≤ Cd−2 ,

where the last bound follows from Hölder inequality.


Finally, Eq. (246) follows immediately by Hölder inequality since E{|Yij |k }1/k ≤ Ck d−1/2 for all k.
Theorem A.7. Let Z = (zij )i≤n,j≤d be a random matrix with iid rows z 1 , . . . , z n ∈ Rd , with zero mean
C-sub-Gaussian. Let S ∈ Rd×d be a symmetric matrix such that 0  S  CId for some finite constant
C > 1. Finally, let g : Rd → R be a measurable function such that E{g(z 1 )} = E{z 1 g(z 1 )} = 0, and
E{|g(z 1 )|4+η } ≤ C.
Then, for any λ > 0, with probability at least 1 − Cd−1/4 , we have


1 X T
 −1 −1/8
ZSZ /d + λI n i,j g(z i )g(z j ≤Cd
) . (248)

d
i<j≤n

Proof. Denote by X the sum on the left-hand side of Eq. (248), and define Yij := (ZSZ T /d + λIn )−1 i,j ,
gi = g(z i ). Further, let Im := {(i, j, k, l) : i < j ≤ n, k < l ≤ n, |{i, j} ∩ {k, j}| = m}, m ∈ {0, 1}. Then we
have
1 XX
E{X 2 } = 2 E{Yij Ykl gi gj gk gl }
d i<j
k<l
1 X 1 X 1 X
≤ E{Yij Ykl gi gj gk gl } + E{Yij Ykl gi gj gk gl } + + E{Yij2 gi2 gj2 }
d2 d2 d2 i<j
(i,j,k,l)∈I0 (i,j,k,l)∈I1

≤ Cd E{Y12 Y34 g1 g2 g3 g4 } + Cd E{Y12 Y23 g1 g22 g3 } + C E{Y12


2 2 2 2

g1 g2 }
≤ Cd−1/2 .

The proof is completed by Chebyshev inequality.

A.2.3 Proof of Theorem 4.13: Variance term


Throughout this section we will refer to the events E1 , E2 defined in Eqs. (183), (185). The variance is given
by

d = σξ2 Ex K(x, X) T K(X, X)−2 K(x, X) .



var (249)

The following lemma allows us to take the expectation with respect to x.

74
Lemma A.8. Under the assumptions of Theorem 4.13, define M 0 ∈ Rn×n as in the statement of Lemma
A.3. Then, with very high probability, we have

1 C log d
−2
2 var − hM 0 , K i ≤ . (250)
d
σξ d

Proof. First notice that, defining M as in Eq. (173), we have


1
d = hM , K −2 i .
var (251)
σξ2

We then have, with very high probability,



1 −2
−2

var
σ2 d − hM 0 , K i ≤ hM − M 0 , K i
(252)

≤ kM − M 0 kF nkK −2 k (253)
(a) C log d √
≤ × d × kK −1 k2 (254)
d3/2
(b) C log d
≤ , (255)
d
where (a) follows from Lemma A.3 and (b) from Lemma A.2.
In the following we define B 0 ∈ Rn×n via
h0 (0)
B 0 := XΣX T . (256)
d2
The next lemma shows that B 0 is a good approximation for B, defined in Eq. (177).
Lemma A.9. Let B be defined as per Eq. (177). With very high probability, we have kB − B 0 k ≤ Cd−3/2
and kB − B 0 k∗ ≤ Cd−1/2 .
Proof. Notice that B = DXΣX T D/d2 and, on E1 ∩ E2 ,
r Q  C
D − h0 (0)Ik = max Eh0 ii
G − h0 (0) ≤ √ .

(257)

i≤n d d
We then have
C 1
C C
2 XΣX T ≤ 5/2 kXk2 ≤ 3/2 .

B − B 0 ≤ √
(258)
d d d d

This immediately implies kB − B 0 k∗ ≤ nkB − B 0 k ≤ C/ d.
Lemma A.10. Under the assumptions of Theorem 4.13, let B be defined as per Eq. (177)and B 0 as per
Eq. (256). Also, recall the definition of K 1 in Eq. (163). Then, with very high probability, we have
hB, K −2 i − hB 0 , K −2 i ≤ C n−c0 .

1 (259)

Proof. Throughout this proof, we work under events E1 ∩ E2 defined in the proof of Lemma A.3. Recall that
maxi≤n |Di | is bounded (see, e.g., Eq. (257)), whence
(
C C/d if i = j,
|Bij | ≤ 2 hxi , Σxj i ≤

1/2 3/2
(260)
d C(log d) /d 6 j,
if i =

75
p
whence kBkF ≤ C (log d)/d. Using Lemma A.2, we have
hB, K −2 i − hB, K −2 i ≤ kBkF n1/2 kK −2 − K −2 k

1 1

≤ C (log d)/d × n [λmin (K) ∧ λmin (K 1 )]−3 kK − K 1 k


p 1/2
(261)
≤ C log dkK − K 1 k ≤ C n−c0 .
p

Using again Lemma A.2 together with Lemma A.9, we obtain that the following holds with very high
probability:

hB, K −2 i − hB , K −2
i ≤ λ (K ) −2
− B

1 0 1 min 1 B 0

C
≤ 1/2 .
d
The desired claim follows from this display alongside Eq. (261).
Lemma A.11. Under the assumptions of Theorem 4.13, let a be defined as in Lemma A.3. Then, with very
high probability we have
C
0 ≤ ha, K −2 ai ≤ . (262)
n
Proof. Notice that the lower bound is trivial since K is positive semidefinite. We will write
K = α 11 T + K ∗ , (263)
a = h(0)1 + ã . (264)
By standard bounds on the norm of matrices with i.i.d. rows (and using kΣk ≤ C), we have 0  XX T /d 
C I, with probability at least 1 − C exp(−n/C). Therefore, by Lemma A.2, and since βγ > 0 is bounded
away from zero by assumption, with very high probability we have C −1 I  K ∗  CI, for a suitable constant
C. Note that ã = (a0 − h(0)1) + a1 . Under event E1 ∩ E2 , the following holds by smoothness of h:
( r )
 Q  C
ii
ka0 − h(0)1k∞ = max E h G − h(0) ≤ . (265)

i≤d d d

On the other hand, recalling the definition of a1 in Eq. (178), we have, always on E1 ∩ E2 ,
1 3/2 kΣ1/2 xi k3∞
ka1 k∞ ≤ C max Qii × d × max (266)
d3/2 i≤d i≤n kΣ1/2 xi k32
1  log d 3/2 (log d)3/2
≤C ×d× ≤C . (267)
d3/2 d d2

Therefore we conclude that kãk∞ ≤ C/d, whence kãk2 ≤ C/ d.
We therefore obtain, again using Lemma A.2,
C
ha, K −2 ai − h(0)2 h1, K −2 1i − 2h(0)h1, K −2 ãi = hã, K −2 ãi ≤ λmin (K)−2 kãk22 ≤ . (268)

d
We are therefore left with the task of controlling the two terms h1, K −2 1i and hã, K −2 1i. We will assume
h(0) 6= 0 because otherwise there is nothing to control. Since h is a positive semidefinite kernel, this also
implies h(0) > 0 and α ≥ h(0) > 0. By an application of the Sherman-Morrison formula, we get
h1, K −2 1i = h1, (K ∗ + α11 T )−2 1i (269)
h1, K −2
∗ 1i
= (270)
(1 + αh1, K −1
∗ 1i)
2

1 h1, K −2
∗ 1i C 1 C
≤ ≤ 2 ≤ , (271)
α2 h1, K −1
∗ 1i2 α k1k2 d

76
where we used the above remark C −1 I  K ∗  CI.
Using again Sherman-Morrison formula,

hã, K −2
∗ 1i αh1, K −2 −1
∗ 1ihã, K ∗ 1i
h1, K −2 ãi = − , (272)
1 + αh1, K −1∗ 1i (1 + αh1, K −1
∗ 1i)
2
3
h1, K −2 ãi ≤ C kãk2 k1k2 + k1k2 kãk2

2 (273)
αk1k2 αk1k42
C
≤ . (274)
d
Using the last two displays in Eq. (268) yields the desired claim.
Proof of Theorem 4.13: Variance term. By virtue of Lemmas A.8, A.10, A.11, we have
1
d = hB 0 , K −2
var 1 i + Err(n) (275)
σξ2
= hB 0 , (K 0 + α11 T )−2 i + Err(n) . (276)

Here and below we denote by Err(n) an error term bounded as |Err(n)| ≤ Cn−c0 with very high probability,
and we defined
XX T
K 0 := β + βγIn . (277)
d
By an application of the Sherman-Morrison formula, and recalling that βγ > 0 is bounded away from zero,
we get
1 2α
d =tr(B 0 K −2
var 0 )− tr(B 0 K −2 T −1
0 11 K 0 ) (278)
σξ2 1 + αA1
α 2 A2
+ tr(B 0 K −1 T −1
0 11 K 0 ) + Err(n) , (279)
(1 + αA1 )2

where A` := h1, K −`
0 1i, ` ∈ {1, 2}. By standard bounds on the norm of matrices with i.i.d. rows (and
using kΣk ≤ C), we have 0  XX T /d  C I. Therefore C −1 I  K 0  CI, for a suitable constant C,
with very high probability. This implies d/C ≤ A` ≤ Cd for ` ∈ {1, 2} and some constant C > 0. Further
kB 0 k ≤ CkXk2 /d2 ≤ C/d. Therefore, (since α > 0):

1 C C
2 var − tr(B 0 K 0 ) ≤ h1, K −1
−2 −2
0 B 0 K 0 1i + h1, K −1 −1
0 B 0 K 0 1i + Err(n) (280)
d
σξ d d
C
≤ + Err(n) . (281)
d
We are therefore left with the task of evaluating the asymptotics of

tr(B 0 K −2 2
T T

0 ) = tr XΣX (XX + γdIn ) . (282)

However, this is just the variance of ridge regression with respect to the simple features X, with ridge
regularization proportional to γ. We apply the results of [HMRT20] to obtain the claim.

A.2.4 Proof of Theorem 4.13: Bias term


We recall the decomposition

f ∗ (x) = b0 + hβ 0 , xi + fNL

(x) =: fL∗ (x) + fNL

(x) , (283)

77
∗ ∗
where b0 , β 0 are defined by the orthogonality conditions E{fNL (x)} = E{xfNL (x)} = 0. This yields b0 =
∗ −1 ∗ ∗ ∗ ∗
E{f (x)} and β 0 = Σ E{f (x)x}. We denote by f = (f (x1 ), . . . , f (xn ))T the vector of noiseless
responses, which we correspondingly decompose as f ∗ = f ∗L + f ∗NL . Recalling the definition of M , v in
Eqs. (173), (174), the bias reads
2
d = hf ∗ , K −1 M K −1 f ∗ i − 2hv, K −1 f ∗ i + kf ∗ k2L2 .
bias (284)

We begin with an elementary lemma on the norm of f ∗ .


Lemma A.12. Assume E{f ∗ (x)4 } ≤ C0 for a constant C0 (in particular, this is the case if E{|f ∗ (x)|4+η } ≤
C0 ). Then, there exists a constant C depending uniquely on C0 such that the following hold:

(a) |b0 | ≤ C, kΣ1/2 β 0 k2 ≤ C, E{fNL



(x)2 } ≤ C.
(b) With probability at least 1 − Cn−1/4 , we have |kf ∗ k22 /n − kf ∗ k2L2 | ≤ n−3/8 .

(c) With probability at least 1 − Cn−1/4 , we have |kf ∗NL k22 /n − kfNL
∗ 2
kL2 | ≤ n−3/8 .
Proof. By Jensen’s inequality we have E{f ∗ (x)2 } ≤ C. By orthogonality of fNL ∗
to linear and constant
∗ ∗ 1/2 ∗
functions, we also have E{f (x) } = b0 + E{hβ 0 , xi } + E{fNL (x) } = b0 + kΣ β 0 k22 + E{fNL
2 2 2 2 2
(x)2 }, which
proves claim (a).
To prove (b), simply call Z = kf ∗ k22 /n − kf ∗ k2L2 , and note that E{Z 2 } = (E{f ∗ (x)4 } − E{f ∗ (x)2 }2 )/n ≤
C/n. The claim follows by Chebyshev inequality.

Finally, (c) follows by the same argument as for claim (b), once we bound kfNL kL4 . In order to show this,
∗ ∗
notice that, by triangle inequality, kfNL kL4 ≤ kf kL4 + kf0 kL4 + kf1 kL4 , where f0 (x) = b0 , f1 (x) = hβ 0 , xi.
Since x = Σz, with z C-sub-Gaussian, kfNL ∗
kL4 ≤ kf ∗ kL4 + b0 + CkΣ1/2 β 0 k2 ≤ C.
Lemma A.13. Under the assumptions of Theorem 4.13, let M 0 , v 0 be defined as in the statement of Lemma
A.3. Then, with probability at least 1 − Cn−1/4 , we have
2 2 C log d
bias d 0 ≤ √
d − bias , (285)
d
2
d 0 := hf ∗ , K −1 M 0 K −1 f ∗ i − 2hv 0 , K −1 f ∗ i + kf ∗ k2L2
bias (286)

Proof. We have
2 2
d 0 ≤ hf ∗ , K −1 (M − M 0 )K −1 f ∗ i + 2 hv − v 0 , K −1 f ∗ i

d − bias
bias (287)
≤ kM − M 0 kF kK f ∗ k22 + 2kv − v 0 k2 kK −1 f ∗ k2
−1
(288)
≤ kM − M 0 kF kK −1 k2 kf ∗ k22 + 2kv − v 0 k2 kK −1 kkf ∗ k2 (289)
log d log d √ C log d
≤ C 3/2 × n + C × n≤ √ . (290)
d d d
Here, in the last line, we used Lemmas A.2, A.3 and the fact that kf ∗ k2 ≤ Cn by Lemma A.12.
2
In view of the last lemma, it is sufficient to work with bias
d 0 . We decompose it as
2 2 2
∗ 2 2
d 0 = bias
bias d L + bias d mix + kfNL
d NL + bias kL2 , (291)
2
d L := hf ∗L , K −1 M 0 K −1 f ∗L i − 2hv 0 , K −1 f ∗L i + kfL∗ k2L2 ,
bias (292)
2 ∗ −1 −1 ∗
d NL := hf NL , K
bias M 0K f NL i , (293)
2
d mix := 2hf ∗L , K −1 M 0 K −1 f ∗NL i − 2hv 0 , K −1 f ∗NL i .
bias (294)

We next show that the contribution of the constant term in fL∗ (x) and M 0 is negligible.

78
Lemma A.14. Under the assumptions of Theorem 4.13, let M 0 , B, v 0 be defined as in the statement of
Lemma A.3. Further define
2h0 (0)
RL := hXβ 0 , K −1 BK −1 Xβ 0 i − hXΣβ 0 , K −1 Xβ 0 i + hβ 0 , Σβ 0 i , (295)
d
RNL := hf ∗NL , K −1 BK −1 f ∗NL i , (296)
0
2h (0)
Rmix := 2hXβ 0 , K −1 BK −1 f ∗NL i − hXΣβ 0 , K −1 f ∗NL i . (297)
d
Then, with very high probability we have
2 C
d L − RL ≤ ,
bias (298)
n
2 C
d NL − RNL ≤ ,
bias (299)
n
2 C
d mix − Rmix ≤ .
bias (300)
n
Proof. The proof of this lemma is very similar to the one of Lemma A.11, and we omit it.
Lemma A.15. Under the assumptions of Theorem 4.13, let B(Σ, β 0 ) be defined as in Eq. (168), and RL
be defined as in the statement of Lemma A.14. Let a ∈ (0, 1/2). Then we have, with very high probability
RL − B(Σ, β 0 ) ≤ C n−a .

(301)

Proof. Recall the definition of K 1 in Eq. (163). and define R̃L as RL (cf. Eq. (295)) except with B replaced
by B 0 defined in Eq. (256), and K replaced by K 1 defined in Eq. (163). Namely:
2h0 (0)
R̃L := hXβ 0 , K −1 −1
1 B 0 K 1 Xβ 0 i − hXΣβ 0 , K −1
1 Xβ 0 i + hβ 0 , Σβ 0 i . (302)
d

Letting u = Xβ 0 = ZΣ1/2 β 0 , note that kuk2 ≤ kZkkΣ1/2 β 0 k2 ≤ C n with very high probability (using
Lemma A.12). We then have
RL − R̃L ≤ hu, K −1 BK −1 ui − hu, K −1 B 0 K −1 ui

C
+ hu, K −1 B 0 K −1 ui − hu, K −1 −1 hXΣβ 0 , K −1 ui − hXΣβ 0 , K −1 ui

1 B 0 K 1 ui +

1
d
=: E1 + E2 + E3 .
We bound each of the three terms with very high probability:
C C
E1 ≤ kB − B 0 k · kK −1 k2 · kuk22 ≤ 3/2
× C × Cn ≤ 1/2 , (303)
d n
E2 ≤ kB 0 K −1 uk2 + kB 0 K −1
1 uk2 kuk 2 kK −1
− K −1
1 k
−1 −1 2 −1 −1

≤ kB 0 k kK k + kK 1 k kuk2 kK − K 1 k (304)
C
≤ × C × Cn × n−c0 ≤ C n−c0 ,
d
C
E3 ≤ kXkkΣβ 0 k2 kuk2 kK −1 − K −1 1 k (305)
d
C √ √
≤ × C n × C × C n × Cn−c0 ≤ Cn−c0 .
d
Here in Eq. (303) we used Lemma A.2 and Lemma
√ A.9; in Eq. (304) Lemma A.2 and the fact that kB 0 k ≤
C/d; in Eq. (305), Lemma A.2 and kXk ≤ C d. Hence we conclude that
RL − R̃L ≤ Cn−c0 .

(306)

79
≈ T
Finally define RL as R̃L , with K 1 replaced by K 0 = β XX
d + βγIn .

R̃L − RL ≤ hu, (K −1 + K −1 )B 0 (K −1 − K −1 )ui + C hXΣβ 0 , (K −1 − K −1 )ui



1 0 1 0 1 0
d
=: G1 + G2 .

By the Sherman-Morrison formula, for any two vectors w1 , w2 ∈ Rn , we have


−1 −1

hw1 , (K −1 − K −1 )w2 i = α h1, K 0 w1 ih1, K 0 w2 i

1 0 (307)
1 + αh1, K −1
0 1i
C
h1, K −1 −1

≤ 0 w 1 i · h1, K 0 w 2 i .
(308)
d
Further notice that

|hu, K −1 −1
T T
p
0 , 1i| = |hβ 0 , X (βXX /d + βγIn ) 1i| ≤ C d log d ,

where the last inequality holds with very high probability by [KY17, Theorem 3.16] (cf. also Lemma 4.4 in
the same paper). We therefore have

C
hu, (K −1 −1 −1 −1

G1 ≤ 1 + K 0 )B 0 K 0 1i · hu, K 0 1i (309)
d
C
≤ kB 0 kkuk2 k1k2 hu, K −1

0 1i
(310)
d
1 √ √
r
C p log d
≤ × × d × d × d log d ≤ C . (311)
d d d
Analogously
C
hXΣβ 0 , K −1 −1

G2 ≤ 2 0 1i · hu, K 0 1i
(312)
d
C
≤ 2 kXkkΣβ 0 k2 kK −1 −1

0 kk1k2 hu, K 0 1i (313)

d

r
C √ p log d
≤ 2 × C d × C × C × C n × C d log d ≤ C . (314)
d d
Summarizing
r

R̃L − RL ≤ C log d
. (315)
d

We are left with the task of estimating RL which we rewrite explicitly as
≈ 2
RL = γ 2 Σ1/2 (XX T + γIn )−1 β 0 2 .

(316)

We recognize in this the bias of ridge regression with respect to the linear features xi , when the responses
are also linear hβ 0 , xi i. Using the results of [HMRT20], we obtain that, for any a ∈ (0, 1/2), the following
holds with very high probability.

RL − B(Σ, β 0 ) ≤ C n−c0 .

(317)

The proof is completed by using Eqs. (306), (315), (317).


We next consider the nonlinear term RNL , cf. Eq. (296).

80
Lemma A.16. Under the assumptions of Theorem 4.13, let V (Σ) be defined as in Eq. (167), and RNL be
defined as in the statement of Lemma A.14. Then there exists c0 > 0 such that, with probability at least
1 − Cn−1/4 ,
RNL − V (Σ)kP>1 f ∗ k2 2 ≤ C n−c0 .

L (318)

Proof. Define

RNL := hf ∗NL , K −1 −1 ∗
0 B 0 K 0 f NL i (319)
1
= 2 hf ∗NL , (XX T /d + γIn )−1 XΣX T (XX T /d + γIn )−1 f ∗NL i (320)
d
1
= 2 hf ∗NL , (ZΣZ T /d + γIn )−1 ZΣ2 Z T (ZΣZ T /d + γIn )−1 f ∗NL i . (321)
d
By the same argument as in the proof of Lemma A.15, we have, with very high probability,
r
RNL − RNL ≤ C log d .

(322)
d
We next use the following identity, which holds for any two symmetric matrices A, M , and any t 6= 0,
1  −1
A−1 M A−1 = A − (A + tM )−1 + tA−1 M A−1 M (A + tM )−1 .

(323)
t
Therefore, for any matrix U and any t > 0, we have

hA M A−1 , U i ≤ 1 hA−1 , U i + 1 h(A + tM )−1 , U i + tkA−1 k2 kM k2 k(A + tM )−1 kkU k∗ .


−1
(324)
t t
We apply this inequality to A = ZΣZ T /d + γIn , M = ZΣ2 Z T /d and Uij = fNL ∗ ∗
(xi )fNL (xi )1i6=j . Note that
−1 ∗ 2
−1
kA k, kM k, k(A + tM ) k ≤ C. Further kU k∗ ≤ 2kf NL k2 ≤ Cn with probability at least 1 − Cn−1/4
by Lemma A.12. Finally for any t ∈ (0, 1), by Theorem A.7, the following hold with probability at least
1 − Cd−1/4 :
1 −1 1
hA , U i ≤ C d−1/8 , h(A + tM )−1 , U i ≤ C d−1/8 .

(325)
d d
Therefore, applying Eq. (324) we obtain
1 −1 1
hA M A−1 , U i ≤ C d−1/8 + Ct ≤ Cd−1/16 , (326)
d t
where in the last step we selected t = d−1/16 . Recalling the definitions of A, M , U , we have proved:
n

≈ 1 X −1 −1

RNL − 2 [A M A ]ii fNL (xi ) ≤ Cd−1/16 .
∗ 2
(327)

d i=1

We are therefore left with the task of controlling the diagonal terms. Using the results of [KY17], we get

−1 −1 1 −1 −1
max [A M A ]ii − tr(A M A ) ≤ Cn−1/8 .

(328)
i≤n n

Further |kf ∗NL k22 /n − kfNL


∗ 2
kL2 | ≤ Cn−1/2 with probability at least 1 − Cn−1/4 by Lemma A.12. Therefore,
with probability at least 1 − Cd−1/4 ,

∗ 2
RNL − VRR kfNL kL2 ≤ Cd−1/16 , (329)

1 2
VRR := 2 Σ1/2 X T (XX T /d + γIn )−1 F . (330)
d

81
We finally recognize that the term VRR is just the variance of ridge regression with respect to the linear
features xi , and using [HMRT20], we obtain

∗ 2
RNL − V (Σ)kfNL kL2 ≤ Cd−1/16 . (331)

The proof of the lemma is concluded by using the last equation together with Eq. (322).
Lemma A.17. Under the assumptions of Theorem 4.13, Rmix be defined as in the statement of Lemma A.14.
Then we have, with probability at least 1 − Cd−1/4 ,
Rmix ≤ C n−1/16 .

(332)
Proof. The proof of this lemma is analogous to the one of Lemma A.16 and we omit it.
We are now in a position to prove Theorem 4.13.
Proof of Theorem 4.13: Bias term. Using Lemma A.13, Eq. (291) and Lemma A.14, we obtain that, with
very high probability,
r
2 ∗ 2
log n
d − (RL + RNL + Rmix + kfNL
bias kL2 ) ≤ C . (333)
n
Hence the proof is completed by using Lemmas A.15, A.16, A.17.

A.2.5 Consequences: Proof of Corollary 4.14


We denote by λ1 ≥ · · · ≥ λd the eigenvalues of Σ in decreasing order.
First note that the left hand side of Eq. (166) is strictly increasing in λ∗ , while the right hand side is
strictly decreasing. By considering the limits as λ∗ → 0 and λ∗ → ∞, it is easy to see that this equation
admits indeed a unique solution. 
Next denoting by F (x) := tr Σ(Σ + xI)−1 the function appearing on the right hand side of Eq. (166),
we have, for x ≥ c∗ λk+1 ,
d d
X λi X λi
F (x) = ≥ (334)
i=1
x + λi x + λi
i=k+1
d
c∗ X
≥ λi =: F (x) . (335)
(1 + c∗ )x
i=k+1

Let λ∗ be the unique non-negative solution of n(1 − (γ/λ∗ )) = F (λ∗ ). Then, the above inequality implies
that whenever λ∗ ≥ c∗ λk+1 we have λ∗ ≥ λ∗ . Solving explicitly for λ∗ , we get
d
(1 + c∗ )γ rk (Σ) c∗ 1 X
+ ≥ (1 + c∗ ) ⇒ λ∗ ≥ γ + λi . (336)
c∗ λk+1 n 1 + c∗ n
i=k+1

Next, we upper bound


d
 X λ2i
tr Σ2 (Σ + λ∗ I)−2 = (337)
i=1
(λi + λ∗ )2
d
1 X 2
≤k+ λi (338)
λ2∗
i=k+1
Pd 2
i=k+1 λi
≤ k + (1 + c−1 2 2
∗ ) n Pd . (339)
(nγ/c∗ + i=k+1 λi )2

82
If we assume that the right-hand side is less than 1/2, using Theorem 4.13, we obtain that, with high
probability,
Pd 2
1 −1 i=k+1 λi
var ≤ k + (1 + c∗ )2
n 2
+ n−c0 . (340)
σξ2
d d
(nγ/c∗ + i=k+1 λi )2
P

Next, considering again Eq. (166) and upper bounding the right-hand side, we get
d
 γ  1 X
n 1− ≤k+ λi . (341)
λ∗ λ∗
i=k+1

Hence, using the assumption that the right hand side of Eq. (339) is upper bounded by 1/2, which implies
k ≤ n/2, we get
d
2 X
λ∗ ≤ 2γ + λi . (342)
n
i=k+1

Next consider the formula for the bias term, Eq. (168). Denoting by (β0,i )i≤p the coordinates of β 0 in the
basis of the eigenvectors of Σ, we get
d
X 2
λ2∗ λi β0,i
λ2∗ hβ 0 , (Σ + λ∗ I)−2 Σβ 0 i = (343)
i=1
(λi + λ∗ )2
k
X d
X
≤ λ2∗ λ−1 2
i β0,i +
2
λi β0,i (344)
i=1 i=1
d 2
 1 X
≤4 γ+ λi kβ 0,≤k k2Σ−1 + kβ 0,>k k2Σ . (345)
n
i=k+1

Together with Theorem 4.13, this implies the desired bound on the bias.

B Optimization in the linear regime


Theorem 5.1. Assume
1 2
Lip(Dfn ) ky − fn (θ 0 )k2 < σ (Dfn (θ 0 )) . (346)
4 min
Further define
σmax := σmax (Dfn (θ 0 )), σmin := σmin (Dfn (θ 0 )).
Then the following hold for all t > 0:
2
1. The empirical risk decreases exponentially fast to 0, with rate λ0 = σmin /(2n):

L(θ b 0 ) e−λ0 t .
b t ) ≤ L(θ (347)

2. The parameters stay close to the initialization and are closely tracked by those of the linearized flow.
Specifically, letting Ln := Lip(Dfn ),
2
kθ t − θ 0 k2 ≤ ky − fn (θ 0 )k2 , (348)
σmin
n 32σ 16Ln o
max 2
kθ t − θ t k2 ≤ 2 ky − f n (θ 0 )k2 + 3 ky − fn (θ 0 )k 2
σmin σmin
2
180Ln σmax
∧ 5 ky − fn (θ 0 )k22 . (349)
σmin

83
3. The models constructed by gradient flow and by the linearized flow are similar on test data. Specifically,
writing f lin (θ) = f (θ 0 ) + Df (θ 0 )(θ − θ 0 ), we have

kf (θ t ) − f lin (θ t )kL2 (P)


n 1 Ln σ 2 o
≤ 4 Lip(Df ) 2 + 180kDf (θ 0 )k 5 max ky − fn (θ 0 )k22 . (350)
σmin σmin

Proof. Throughout the proof we let Ln := Lip(Dfn ), and we use ȧt to denote the derivative of quantity at
with respect to time.
Let y t = fn (θ t ). By the gradient flow equation,
1
ẏ t = Dfn (θ t ) θ̇ t = − Dfn (θ t )Dfn (θ t )T (y t − y) . (351)
n
Defining the empirical kernel at time t, K t := Dfn (θ t )Dfn (θ t )T , we thus have
1
ẏ t = − K t (y t − y) , (352)
n
d 2
ky − yk22 = − hy t − y, K t (y t − y)i . (353)
dt t n
Letting r∗ := σmin /(2Ln ) and t∗ := inf{t : kθ t − θ 0 k2 > r∗ }, we have λmin (K t ) ≥ (σmin /2)2 for all t ≤ t∗ ,
whence

t ≤ t∗ ⇒ ky t − yk22 ≤ ky 0 − yk22 e−λ0 t , (354)


2
with λ0 = σmin /(2n).
Note that, for any t ≤ t∗ , σmin (Dfn (θ t )) ≥ σmin /2. Therefore, by the gradient flow equations, for any
t ≤ t∗ ,
1 Dfn (θ t )T (y t − y) ,

kθ̇ t k2 = 2
(355)
n
2
d 1 kDfn (θ t )T (y t − y) 2
ky − yk2 = − · (356)
dt t n ky y − yk2
σmin
kDfn (θ t )T (y t − y) 2 .

≤− (357)
2n
Therefore, by Cauchy-Schwartz,
d  σmin  d σmin
ky t − yk2 + kθ t − θ 0 k2 ≤ ky t − yk2 + kθ̇ t k2 ≤ 0 . (358)
dt 2 dt 2
This implies, for all t ≤ t∗ ,
2
kθ t − θ 0 k2 ≤ ky − y 0 k2 . (359)
σmin
Assume by contradiction t∗ < ∞. The last equation together with the assumption (346) implies kθ t∗ −θ 0 k2 <
r∗ , which contradicts the definition of t∗ . We conclude that t∗ = ∞, and Eq. (347) follows from Eq. (354).
Equation (348) follows from Eq. (359).
In order to prove Eq. (349), let y t := fn (θ 0 ) + Dfn (θ 0 )(θ t − θ 0 ). Note that this satisfies an equation
similar to (352), namely
1
ẏ t = − K 0 (y t − y) . (360)
n

84
Define the difference r t := y t − y t . We then have ṙ t = −(K t /n)r t − ((K t − K 0 )/n)(y t − y), whence
d 2 2
kr t k22 = − hr t , K t r t i − hr t , (K t − K 0 )(y t − y)i (361)
dt n n
2 2
≤ − λmin (K t )kr t k22 + kr t k2 K t − K 0 ky t − yk2 .

(362)
n n
Using 2λmin (K t )/n ≥ λ0 and ky t − y t k2 ≤ ky 0 − yk2 e−λ0 t/2 , we get
d λ0 1
kr t k2 = − kr t k2 + K t − K 0 ky 0 − yk2 e−λ0 t/2 .

(363)
dt 2 n
Note that
K t − K 0 = Dfn (θ t )Dfn (θ t )T − Dfn (θ 0 )Dfn (θ 0 )T

(364)
2
≤ 2 Dfn (θ 0 ) Dfn (θ t ) − Dfn (θ 0 ) + Dfn (θ t ) − Dfn (θ 0 ) (365)
≤ 2σmax Ln kθ t − θ 0 k + L2n kθ t
− θ0 k 2
(366)
5
≤ σmax Ln kθ t − θ 0 k . (367)
2
(In the last inequality, we used the fact that Ln kθ t − θ 0 k ≤ σmin /2 by definition of r∗ .) Applying Grönwall’s
inequality, and using r 0 = 0, we obtain
Z t
1
kr t k2 ≤ e−λ0 t/2 ky 0 − yk2

K s − K 0 ds (368)
0 n
1
≤ e−λ0 t/2 tky 0 − yk2 sup K s − K 0

(369)
s∈[0,t] n
2 1
≤ e−λ0 t/4 ky 0 − yk2 sup K s − K 0

(370)
λ0 s≥0 n
(a) 2 5
≤ e−λ0 t/4 ky − yk2 Ln σmax sup kθ s − θ 0 k2 (371)
λ0 0 2n s≥0
(b) 2 5 2
≤ e−λ0 t/4 ky − yk2 Ln σmax · ky − yk2 (372)
λ0 0 2n σmin 0
σmax
≤ 20 e−λ0 t/4 3 Ln ky − y 0 k22 . (373)
σmin
Here in (a) we used Eq. (367) and in (b) Eq. (359). Further using kr t k2 ≤ ky t − yk2 + ky t − yk2 ≤
2ky 0 − yk exp(−λ0 t/2), we get
n 10σmax o
ky t − y t k2 ≤ 2e−λ0 t/4 ky − y 0 k2 1 ∧ 3 Ln ky − y 0 k2 . (374)
σmin
Recall the gradient flow equations for θ t and θ t :
1
θ̇ t = Dfn (θ t )T (y − y t ) , (375)
n
1
θ˙ t = Dfn (θ 0 )T (y − y t ) . (376)
n
Taking the difference of these equations, we get
d 1 1
kθ t − θ t k2 ≤ Dfn (θ t ) − Dfn (θ 0 ) ky t − yk2 + Dfn (θ 0 ) ky t − y t k2 (377)
dt n n
Ln σmax
≤ kθ t − θ 0 k2 ky t − yk2 + ky t − y t k2 (378)
n n
(a) L 2 σmax n 10σmax o
n
≤ · ky − y 0 k22 e−λ0 t/2 + · 2e−λ0 t/4 ky − y 0 k2 1 ∧ 3 Ln ky − y 0 k2 (379)
n σmin n σmin

85
where in (a) we used Eqs. (348), (354) and (374). Integrating the last expression (thanks to θ 0 = θ 0 ), we
get
2
8Ln n 16σ
max 160σmax o
kθ t − θ t k2 ≤ 3 ky − y 0 k22 + 2 ky − y 0 k2 ∧ 5 Ln ky − y 0 k22 . (380)
σmin σmin σmin
Simplifying, we get Eq. (349).
Finally, to prove Eq. (350), write
kf (θ t ) − flin (θ t )kL2 ≤ kf (θ t ) − flin (θ t )kL2 + kflin (θ t ) − flin (θ t )kL2 . (381)
| {z } | {z }
E1 E2
Rtd
By writing f (θ t ) − flin (θ t ) = 0 ds
− flin (θ s )]ds, we get
[f (θ s )
Z t

E1 = [Df (θ s ) − Df (θ 0 )]θ̇ s ds

2 (382)
0 L
Z t
≤ Lip(Df ) sup kθ s − θ 0 k2 kθ̇ s k2 ds (383)
s≥0 0
4ky − y 0 k22
≤ Lip(Df ) · 2 . (384)
σmin
In the last step we used Eq. (348) and noted that the same argument to prove the latter indeed also bounds
Rt
the integral 0 kθ̇ s k2 ds (see Eq. (358)).
Finally, to bound term E2 , note that flin (θ t ) − flin (θ t ) = Df (θ 0 )(θ t − θ t ), and using Eq. (349), we get
2
Ln σmax
E2 ≤ 180kDf (θ 0 )k 5 ky − y 0 k22 . (385)
σmin
Equation (350) follows by putting together the above bounds for E1 and E2 .
We next pass to the case of two-layers networks:
m
α X
f (x; θ) := √ bj σ(hwj , xi), θ = (w1 , . . . , wm ) . (386)
m j=1

Lemma 5.3. Under Assumption 5.2, further assume {(yi , xi )}i≤n to be i.i.d. with xi ∼iid N(0, Id ), and yi
B 2 -sub-Gaussian. Then there exist constants Ci , depending uniquely on σ, such that the following hold with
probability at least 1 − 2 exp{−n/C0 }, provided md ≥ C0 n log n, n ≤ d`0 (whenever not specified, these hold
(1) (2)
for both θ 0 ∈ {θ 0 , θ 0 }):
(1) √
ky − fn (θ 0 )k2 ≤ C1 B + α) n (387)
(2) √
ky − fn (θ 0 )k2 ≤ C1 B n , (388)

σmin (Dfn (θ 0 )) ≥ C2 α d , (389)
√ √ 
σmax (Dfn (θ 0 )) ≤ C3 α n + d , (390)
√ 
r
d √
Lip(Dfn ) ≤ C4 α n+ d . (391)
m
Further
kDf (θ 0 )k ≤ C10 α , (392)
r
d
Lip(Df ) ≤ C40 α . (393)
m

86

Proof. Since the yi are B 2 sub-Gaussian, we have kyk2 ≤ C1 B n with the stated probability. Equation (388)
(2)
follows since by construction fn (θ 0 ) = 0.
(1) √
For Eq. (387) we claim that kfn (θ 0 )k2 ≤ C1 α n with the claimed probability. To show this, it is
(1)
sufficient of course to consider α = 1. Let F (X, W ) := kfn (θ 0 )k2 , where X ∈ Rn×d contains as rows the
(1)
vectors xi , and W the vectors wi . We also write θ 0 = θ 0 for simplicity. We have
E{F (X, W )}2 ≤ E{kfn (θ 0 )k22 } = nE{f (x1 ; θ 0 )2 } (394)
= nVar{σ(hw1 , x1 i)} ≤ Cn . (395)
Next, proceeding as in the proof of [OS20, Lemma 7] (letting b = (bj )j≤m )

F (X, W 1 ) − F (X, W 2 ) ≤ √1 σ(XW T T



1 )b − σ(XW 2 )b 2
m
≤ σ(XW T T

1 ) − σ(XW 2 ) F

≤ C XW T T

1 − XW 2 F

≤ CkXk W T T

1 − W2 F .
√ √
We have kXk√≤ 2( √n + d) with the probability at least 1 − 2 exp{−(n ∨ d)/C) [Ver18]. On this event,
F (X, ·√
) is 2( n + d)-Lipschitz with respect to W . Recall that the uniform measure on the sphere of
radius d satisfies a log-Sobolev inequality with Θ(1) constant, [Led01, Chapter 5], that the log-Sobolev
constant for a product measure is the same as the worst constant of each of the terms. We then have
2
P F (X, W ) ≥ EF (X, W ) + t) ≤ e−dt /C(n+d)
+ 2e−(n∨d)/C . (396)

Taking t = C1 n for a sufficiently large constant C1 implies that the right-hand side is at most 2 exp(−(n ∨
d)/C), which proves the claim.
Notice that all the following inequalities are homogeneous in α > 0. Hence, we will assume—without loss
of generality—that α = 1. Equation (389) follows from [OS20, Lemma 4]. Indeed this lemma implies
C(n + d) log n p
m≥ ⇒ σmin (Dfn (θ 0 )) ≥ c0 dλmin (K) , (397)
dλmin (K)
where K is the empirical NT kernel
1  
K= E Dfn (θ 0 )Dfn (θ 0 ) = KNT (xi , xj ) i,j≤n . (398)
d
Under Assumption 5.2 (in particular σ 0 having non-vanishing Hermite coefficients µ` (σ) for all ` ≤ `0 ), and
n ≤ d`0 , we have λmin (K) ≥ c0 with the stated probability, see for instance [EK10]. This implies the claim.
For Eq. (390), note that, for any vector v ∈ Rn , kvk2 = 1 we have
m
Dfn (θ 0 )T v 2 = 1
X X
vi σ 0 (hw` , xi i)vj σ 0 (hw` , xj i)hxi , xj i

2
(399)
m
i,j≤n `=1

= hM , X, X T i , (400)
m
1 X
Mij := vi σ 0 (hw` , xi i)vj σ 0 (hw` , xj i) . (401)
m
`=1

Since M  0, we have
Dfn (θ 0 )T v 2 ≤ tr(M )kXk2

2
(402)
m n
1 XX 2 0
= vi σ (hw` , xi i)2 · kXk2 (403)
m i=1
`=1
≤ B 2 kvk22 kXk2 . (404)

87
Hence σmax (Dfn (θ 0 )) ≤ BkXk and the claim follows from standard estimates of operator norms of random
matrices with independent entries.
Equation (391) follows from [OS20, Lemma 5], which √ yields (after adapting to the different normalization
of the xi , and using the fact that maxi≤n kxi k2 ≤ C d with probability at least 1 − 2 exp(−d/C)):
r
Dfn (θ 1 ) − Dfn (θ 2 ) ≤ C d kXkkθ 1 − θ 2 k2 .

m
m×d
(Here kθ 1 − θ 2 k2 = kW 1 − W 2 kF , where W i ∈√ R √ is the matrix whose rows are the weight vectors.)
The claim follows once more by using kXk ≤ 2( n + d) with probability at least 1 − 2 exp{−(n ∨ d)/C).
In order to prove Eq. (392), note that, for h ∈ L2 (Rd , P),
Df (θ 0 )∗ h = E{Qh (x1 , x2 ) P (x1 , x2 )} ,

2
(405)
m
1 X 0
Qh (x1 , x2 ) := σ (hw` , x1 i)h(x1 )σ 0 (hw` , x2 i)h(x2 ) , (406)
m
`=1
P (x1 , x2 ) := hx1 , x2 i . (407)
Here expectation is with respect to independent random vectors x1 , x2 ∼ P. Denote by Qh and P the
integral operators in L2 (Rd , P) with kernels Qh and P . It is easy to see that P is the projector onto the
subspace of linear functions, and Qh is positive semidefinite. Therefore
m
Df (θ 0 )∗ h ≤ tr(Qh ) = 1
X
E σ 0 (hw` , x1 i)2 h(x1 )2

2
(408)
m
`=1
≤ B 2 khk2L2 . (409)
This implies kDf (θ 0 )k ≤ B.
In order to prove Eq. (393), define ∆` (x) := σ 0 (hw1,` , xi) − σ 0 (hw2,` , xi). Let h ∈ L2 (Rd , P) and note
that
m
Df (θ 0 )∗ h 2 = 1
X  2

2
xh(x)∆ ` (x) (410)
m
E
2
`=1
m
1 X  2
≤ E kxk |h(x)∆` (x)| (411)
m
`=1
m
1 X
E kxk2 ∆` (x)2 khkL2 .

≤ (412)
m
`=1

Note that |∆` (x)| ≤ B |hw1,` − w2,` , xi|. Using this and the last expression above, we get
m
2 X
Df (θ 0 ) 2 ≤ B E kxk2 hx, w1,` − w2,` i2

(413)
m
`=1
m
B2 X B2
≤ (d + 2) kw1,` − w2,` k22 = (d + 2)kW 1 − W 2 k2F , (414)
m m
`=1

where the second inequality follows from the Gaussian identity E{kxk2 xxT } = (d + 2)Id . This proves
Eq. (393).
Theorem 5.4. Consider the two layer neural network of (386) under the assumptions of Lemma 5.3.
(1) (2)
Further let α := α/(1 + α) for initialization θ 0 = θ 0 and α := α for θ 0 = θ 0 . Then there exist constants
Ci , depending uniquely on σ, such that if md ≥ C0 n log n, d ≤ n ≤ d`0 and
r
n2
α ≥ C0 , (415)
md

88
then, with probability at least 1 − 2 exp{−n/C0 }, the following hold for all t ≥ 0.
1. Gradient flow converges exponentially fast to a global minimizer. Specifically, letting λ∗ = C1 α2 d/n,
we have

L(θ b 0 ) e−λ∗ t .
b t ) ≤ L(θ (416)

2. The model constructed by gradient flow and linearized flow are similar on test data, namely
( r r )
α n2 1 n5
kf (θ t ) − flin (θ t )kL2 (P) ≤ C1 + . (417)
α2 md α2 md4

Proof. Throughout the proof, we use C to denote constants depending only on σ, that might change from
line to line. Using Lemma 5.3, the condition (346) reads

√ 2
r
dn α √
α · n≤C α d . (418)
m α
which is equivalent to Eq. (415). We can therefore apply Theorem 5.1.
Equation (416) follows from Theorem 5.1, point 1, using the lower bound on σmin given in Eq. (389).
Equation (417) follows from Theorem 5.1, point 3, using the estimates in Lemma 5.3.

89

You might also like