Understanding Generalization in Deep Learning via Tensor Methods

Jingling Li¹,³, Yanchao Sun¹, Jiahao Su⁴, Taiji Suzuki²,³, Furong Huang¹
¹ Department of Computer Science, University of Maryland, College Park
² Graduate School of Information Science and Technology, The University of Tokyo
³ Center for Advanced Intelligence Project, RIKEN
⁴ Department of Electrical and Computer Engineering, University of Maryland, College Park

arXiv:2001.05070v2 [cs.LG] 10 May 2020

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Abstract

Deep neural networks generalize well on unseen data though the number of parameters often far exceeds the number of training examples. Recently proposed complexity measures have provided insights into understanding the generalizability in neural networks from perspectives of PAC-Bayes, robustness, overparametrization, compression and so on. In this work, we advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective. Using tensor analysis, we propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks; thus, in practice, our generalization bound outperforms the previous compression-based ones, especially for neural networks using tensors as their weight kernels (e.g. CNNs). Moreover, these intuitive measurements provide further insights into designing neural network architectures with properties favorable for better/guaranteed generalizability. Our experimental results demonstrate that through the proposed measurable properties, our generalization error bound matches the trend of the test error well. Our theoretical analysis further provides justifications for the empirical success and limitations of some widely-used tensor-based compression approaches. We also discover the improvements to the compressibility and robustness of current neural networks when incorporating tensor operations via our proposed layer-wise structure.

1 Introduction

Deep neural networks have recently made major breakthroughs in solving many difficult learning problems, especially in image classification (Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Zagoruyko and Komodakis, 2016) and object recognition (Krizhevsky et al., 2012; Sermanet et al., 2013; Simonyan and Zisserman, 2014; Zeiler and Fergus, 2014). The success of deep neural networks depends on their high expressive power and their ability to generalize. The high expressive power has been demonstrated empirically (He et al., 2016; Zagoruyko and Komodakis, 2016) and theoretically (Hornik et al., 1989; Mhaskar and Poggio, 2016). Yet, fundamental questions on why deep neural networks generalize and what enables their generalizability remain unsettled.

A recent work by Arora et al. (2018) characterizes the generalizability of a neural network from a compression perspective: the capacity of the network is characterized through its compressed version. The compression algorithm in Arora et al. (2018) is based on random projection: each weight matrix of the compressed network is represented by a linear combination of basis matrices with entries sampled i.i.d. from $\pm 1$. The effective number of parameters in the weight matrix is the number of coefficients in this linear combination, which are obtained via projection, i.e., the inner product between the original weight matrix and these basis matrices. Though the idea of using compression in deriving the generalization bounds is novel, the compression scheme in Arora et al. (2018) could be made more practical since (1) the cost of the forward pass in the compressed network remains the same as the cost in the original one, even though the effective number of parameters needed to represent the original weight matrices decreases; (2) storing these random projection matrices could require more space than storing the original set of parameters. We propose a new theoretical analysis based on a more practical, well-developed, and principled compression scheme using tensor methods. Besides, we use tensor analysis to derive a much tighter bound for the layer-wise error propagation by exploiting additional structures in the weight tensors of neural networks, which as a result significantly tightens the generalization error bound in Arora et al. (2018).

Our approach aims to characterize the network's compressibility by measuring the low-rankness of the weight kernels. Existing compression methods (Jaderberg et al., 2014; Denton et al., 2014; Lebedev et al., 2014; Kim et al., 2015; Garipov et al., 2016; Wang et al., 2018; Su et al., 2018) implement low-rank approximations by performing matrix/tensor decomposition on weight matrices/kernels of well-trained models. However, the layers of SOTA networks, such as VGG (Simonyan and Zisserman, 2014) and WRN (Zagoruyko and Komodakis, 2016), are not necessarily low-rank: we apply CP-tensor decompositions (Kolda and Bader, 2009; Anandkumar et al., 2014b; Huang et al., 2015; Li and Huang, 2018) to the weight tensors of well-trained VGG-16 and WRN-28-10, and the amplitudes of the components from the CP decomposition (a.k.a. the CP spectrum) are shown by the brown curves in Figure 1, which indicate that the layers of these pre-trained networks are not low-rank. Therefore a straightforward compression of the network cannot be easily achieved, and computationally expensive fine-tuning is often needed.

[Figure 1 panels: (a) VGG16 (layer 13), curves VGG and CP-VGG; (b) WRN-28-10 (layer 28), curves WRN and CP-WRN. x-axis: index of the components; y-axis: normalized amplitude.]
Figure 1: CP spectrum comparison (CP-VGG and CP-WRN are neural networks with CP layers).

To overcome this limitation, we propose a layer-wise structure design, CP Layer (CPL), by incorporating the variants of CP decompositions in (Jaderberg et al., 2014; Kossaifi et al., 2017; Howard et al., 2017). CPL re-parametrizes the weight tensors such that a Polyadic form (CP form) (Kolda and Bader, 2009) can be easily learned in an end-to-end fashion.

We demonstrate that empirically, CPL allows the network to learn a low-rank structure more easily, and thus helps with compression. For example, from the pink curves in Figure 1, we see that neural networks with CPL have a spiky CP spectrum, which is an indication of low-rankness. We rigorously prove that this low-rankness in return leads to a tighter generalization bound. Moreover, we are the first to provide theoretical guarantees for the usage of CP decomposition in deep neural networks in terms of compressibility and generalizability.

Definition 1.1 (Proposed Architecture Layer). A CP Layer (CPL) with width $R$ consists of $R$ sets of parameters $\{\lambda^{(r)}, \{v_j^{(r)}\}_{j=1}^{N}\}_{r=1}^{R}$, where $v_j^{(r)}$ is a vector in $\mathbb{R}^{d_j}$ with unit norm. The weight kernel of this CPL is an $N$-order tensor defined as $\mathcal{K} := \sum_{r=1}^{R} \lambda^{(r)} v_1^{(r)} \otimes \cdots \otimes v_N^{(r)}$, where $\otimes$ denotes the vector outer product (tensor product) defined in Appendix B.9 [1]. Note that $\mathcal{K} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$.

Remark. CPL allows for flexible choices of the structures since the number of components $R$ is a tunable hyper-parameter that controls the number of parameters in CPL. The CP spectrum of this layer is denoted by $\{\lambda^{(r)}\}_{r=1}^{R}$, in descending order. The size of the weight kernel is $d_0 \times d_1 \times \cdots \times d_N$, while the number of parameters in CPL is $(d_0 + d_1 + \cdots + d_N + 1) \times R$.

In contrast with existing works which apply CP decomposition to each layer of a reference network, no CP decomposition is needed since the components are explicitly stored as model parameters so that they can be learned from scratch via back-propagation. Moreover, compression in CP layers is natural: simply pick the top $\hat{R}$ components to retain and prune out the rest. Thus, the compression procedure using CPL does not require any costly fine-tuning, while existing works on tensor-based compression may use hundreds of epochs for fine-tuning.

We further propose a series of simple, intuitive, data-dependent and easily-measurable properties to measure the low-rankness in current neural networks. These properties not only guide the selection of the number of components to generate a good compression, but also tighten the bound of the layer-wise error propagation via tensor analysis. The proposed properties
• characterize the compressibility of the neural network, i.e., how much the original network can be compressed without compromising the performance on a training dataset beyond a certain range.
• characterize the generalizability of the compressed network, i.e., tell if a neural network is trained using normal data or corrupted data.

In our theoretical analysis, we derive generalization error bounds for neural networks with CP layers, which take both the input distribution and the compressibility of the network into account. We present a rigorous proof showing the connection of our proposed proper-

[1] The $(i_1, i_2, \ldots, i_N)$-th element of the weight kernel is $\sum_{r=1}^{R} \lambda^{(r)} v_1^{(r)}(i_1) \times \cdots \times v_N^{(r)}(i_N)$.
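A minimal NumPy sketch of the weight kernel in Definition 1.1 and of the element-wise formula in its footnote. The function names `cp_kernel` and `num_cpl_params`, the storage layout (one array of shape `(R, d_j)` per mode), and the toy sizes are assumptions made for illustration, not the authors' implementation. The parameter count follows the remark's formula, with `dims` listing every mode of the kernel.

```python
import numpy as np

def cp_kernel(lams, factors):
    """Assemble K = sum_r lams[r] * v_1^(r) (x) ... (x) v_N^(r).

    lams:    array of shape (R,), the CP spectrum.
    factors: list of N arrays; factors[j][r] is the unit vector v_{j+1}^(r).
    """
    dims = [f.shape[1] for f in factors]
    K = np.zeros(dims)
    for r in range(lams.shape[0]):
        component = factors[0][r]
        for f in factors[1:]:
            # Outer product adds one tensor mode per factor.
            component = np.multiply.outer(component, f[r])
        K += lams[r] * component
    return K

def num_cpl_params(dims, R):
    """Each component stores one amplitude plus one vector per mode."""
    return (sum(dims) + 1) * R

# Toy example (hypothetical sizes): a 3-order kernel with R = 2 components.
rng = np.random.default_rng(0)
dims, R = (3, 4, 5), 2
lams = np.array([2.0, 0.5])
factors = []
for d in dims:
    V = rng.standard_normal((R, d))
    factors.append(V / np.linalg.norm(V, axis=1, keepdims=True))  # unit norm

K = cp_kernel(lams, factors)
print(K.shape, num_cpl_params(dims, R))
```

Storing the factors instead of the dense kernel is what makes the later pruning step a simple slice over the component index.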
Comparison with Arora et al. (2018). Besides practical improvements of the generalization error, our work improves the results obtained by Arora et al. (2018): 1) we provide a tightened layer-wise analysis using tensor methods to directly bound the operator norm of the weight kernel (e.g. Lemma C.5 and Lemma C.8). The interlayer properties introduced by Arora et al. (2018) are orthogonal to our proposed layer-wise properties and the two can be combined; 2) in practice, our bound outperforms that of Arora et al. (2018) in terms of the achieved degree of compression (detailed discussions in Section 5.2 and Section A.2); 3) for fully connected (FC) neural networks, our proposed reshaping factor (definition E.2) further tightens the generalization bound as long as the inputs to the FC layers have some low-rank structures; 4) we extend our theoretical analysis to neural networks with skip connections, while the theoretical analysis in Arora et al. (2018) only applies to FC networks and CNNs.

Comparison with existing CP decomposition for network compression. While CP decomposition has been commonly used in neural network compression (Denton et al., 2014; Lebedev et al., 2014; Kossaifi et al., 2017), our proposed compression method is very different from theirs. First, the tensor contraction layer (Kossaifi et al., 2017) is a special case of our CPL for FC layers when we set the number of components to be 1. Second, the number of components in our proposed CPL can be arbitrarily large (as it is a tunable hyper-parameter), while the number of components of layers in (Denton et al., 2014; Lebedev et al., 2014; Kossaifi et al., 2017) is determined by the compression ratio. Third, no tensor decomposition is needed for evaluating the generalizability of, or for compressing, neural networks with CP layers, as the components from the CP decomposition are already stored as model parameters. Moreover, as the smaller components in CPL are pruned during the compression, the performance of the compressed neural net is often preserved and thus no expensive fine-tuning is required (see Table 5). The depthwise-separable convolution used in MobileNet (Howard et al., 2017) is a specific implementation of CPL; thus, our theoretical analysis can provide generalization guarantees for the MobileNet architecture.

3 Notations and Preliminaries

In this paper, we use $S$ to denote the set of training samples drawn from a distribution $\mathcal{D}$ with $|S| = m$. Let $n$ denote the number of layers in a given neural network, and let superscripts of the form $(k)$ denote properties related to the $k$-th layer. We put "CP" in front of a network's name to denote such a network with CP layers (e.g. CP-VGG denotes a VGG with CP layers). For any positive integer $n$, let $[n] := \{1, 2, \ldots, n\}$. Let $|a|$ denote the absolute value of a scalar $a$. Given a vector $a \in \mathbb{R}^{d}$, a matrix $A \in \mathbb{R}^{d \times k}$, and a tensor $\mathcal{A} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$, their norms are defined as follows: (1) Vector norm: $\|a\|$ denotes the $\ell_2$ norm. (2) Matrix norms: Let $\|A\|_*$ denote its nuclear norm, $\|A\|_F$ denote its Frobenius norm, and $\|A\|$ denote its operator norm (spectral norm), where $\sigma_i(A)$ denotes the $i$-th largest singular value of $A$. (3) Tensor norms: Let $\|\mathcal{A}\| = \max_{x \in \mathbb{R}^{d_1},\, y \in \mathbb{R}^{d_2},\, z \in \mathbb{R}^{d_3}} \frac{|\mathcal{A}(x, y, z)|}{\|x\| \|y\| \|z\|}$ denote its operator norm, and $\|\mathcal{A}\|_F$ its Frobenius norm. Moreover, we use $\otimes$ to denote the outer product operator, and $*$ to denote the convolution operator. We use $F_m$ to denote the $m$-dimensional discrete Fourier transform, and use tilde symbols to denote tensors after DFT (e.g. $\tilde{\mathcal{T}} = F_m(\mathcal{T})$). A Polyadic decomposition (CP decomposition) (Kruskal, 1989; Kolda and Bader, 2009) of an $N$-order tensor $\mathcal{K} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N}$ is a linear combination of rank-one tensors that is equal to $\mathcal{K}$: $\mathcal{K} = \sum_{r=1}^{R} \lambda^{(r)} v_1^{(r)} \otimes \cdots \otimes v_N^{(r)}$, where $\forall r \in [R], \forall j \in [N]$, $\|v_j^{(r)}\| = 1$. Margin loss (Arora et al., 2018): we use $L_\gamma(\mathbb{M})$ and $\hat{L}_\gamma(\mathbb{M})$ to denote the expected and empirical margin loss of a neural network $\mathbb{M}$ with respect to a margin $\gamma \geq 0$. The expected margin loss of a neural network $\mathbb{M}$ is defined as $L_\gamma(\mathbb{M}) := \mathbb{P}_{(x,y) \in \mathcal{D}}\left[\mathbb{M}(x)[y] \leq \gamma + \max_{i \neq y} \mathbb{M}(x)[i]\right]$.

4 CNNs with CPL: Compressibility and Generalization

In this section, we derive the generalization bound for a convolutional neural network (denoted as $\mathbb{M}$) using tensor methods and standard Fourier analysis. The complete proof is in Appendix Section D. For simplicity, we assume that there is no pooling layer (e.g. max pooling) in $\mathbb{M}$, since adding pooling layers will only lead to a smaller generalization bound (the perturbation error in our analysis decreases with the presence of pooling layers). The derived generalization bound can be directly extended to various neural network architectures (e.g. neural networks with pooling layers, and neural networks with batch normalization). The generalization bounds for fully connected neural networks and neural networks with skip connections are presented in Appendix Sections E.4 and F.3 respectively.

4.1 Compression of a CNN with CPL

We first illustrate how to compress any given CNN $\mathbb{M}$ by presenting a compression algorithm (Algorithm 1).
We will see that this compression algorithm guarantees a good estimation of the generalization bound for the compressed network $\hat{\mathbb{M}}$.

The original CNN $\mathbb{M}$ has $n$ layers with ReLU activation. Its $k$-th layer weight tensor $\mathcal{M}^{(k)}$ is a 4th-order tensor of size (number of input channels $s^{(k)}$) × (number of output channels $o^{(k)}$) × (kernel height $k_x^{(k)}$) × (kernel width $k_y^{(k)}$). Let the 3rd-order tensor $\mathcal{X}^{(k)} \in \mathbb{R}^{H^{(k)} \times W^{(k)} \times s^{(k)}}$ denote the input to the $k$-th layer, and $\mathcal{Y}^{(k)} \in \mathbb{R}^{H^{(k)} \times W^{(k)} \times o^{(k)}}$ denote the output of the $k$-th layer before activation. Therefore $\mathcal{X}^{(k)} = \mathrm{ReLU}(\mathcal{Y}^{(k-1)})$. We use $i$ to denote the index of input channels, and $j$ to denote the index of output channels. We further use $f$ and $g$ to denote the indices of width and height in the frequency domain.

Proposition 4.1 (Polyadic Form of original CNN $\mathbb{M}$). For each layer $k$, the weight tensor $\mathcal{M}^{(k)}$ has a Polyadic form with number of components $R^{(k)} \leq \min\{s^{(k)} o^{(k)},\, s^{(k)} k_x^{(k)} k_y^{(k)},\, o^{(k)} k_x^{(k)} k_y^{(k)}\}$ (Kolda and Bader, 2009):

$$\mathcal{M}^{(k)} = \sum_{r=1}^{R^{(k)}} \lambda_r^{(k)}\, a_r^{(k)} \otimes b_r^{(k)} \otimes C_r^{(k)},$$

where the CP spectrum is in descending order, i.e., $\lambda_1^{(k)} \geq \lambda_2^{(k)} \geq \cdots \geq \lambda_{R^{(k)}}^{(k)}$. All $a_r^{(k)}$, $b_r^{(k)}$ are unit vectors in $\mathbb{R}^{s^{(k)}}$ and $\mathbb{R}^{o^{(k)}}$ respectively, and $C_r^{(k)}$ is a matrix in $\mathbb{R}^{k_x^{(k)} \times k_y^{(k)}}$ with $\|C_r^{(k)}\|_F = 1$. The $R^{(k)}$ required for the Polyadic form is called the tensor rank.

Transform original CNN to a CNN with CP layers. By Proposition 4.1, each weight tensor $\mathcal{M}^{(k)}$ in $\mathbb{M}$ can be represented in a Polyadic form (CP form) and thus is transformed to a CPL. The total number of parameters in CPL is $R^{(k)} \times (s^{(k)} + o^{(k)} + k_x^{(k)} k_y^{(k)} + 1)$. Thus, a smaller $R^{(k)}$ leads to fewer effective parameters and indicates more compression.

Compress Original CNN $\mathbb{M}$ to $\hat{\mathbb{M}}$. We illustrate the compression procedure in Algorithm 1. Feeding a CNN $\mathbb{M}$ to the compression algorithm, we obtain a compressed CNN $\hat{\mathbb{M}}$, where for each layer $k$, the weight tensor in $\hat{\mathbb{M}}$ is $\hat{\mathcal{M}}^{(k)} = \sum_{r=1}^{\hat{R}^{(k)}} \lambda_r^{(k)}\, a_r^{(k)} \otimes b_r^{(k)} \otimes C_r^{(k)}$ for some $\hat{R}^{(k)} \leq R^{(k)}$. Similarly, we use $\hat{\mathcal{X}}^{(k)}$ to denote the input tensor of the $k$-th layer in $\hat{\mathbb{M}}$ and $\hat{\mathcal{Y}}^{(k)}$ to denote the output tensor of the $k$-th layer in $\hat{\mathbb{M}}$ before activation. Therefore $\hat{\mathcal{X}}^{(k)} = \mathrm{ReLU}(\hat{\mathcal{Y}}^{(k-1)})$. Notice that $\hat{\mathcal{X}}^{(k)}, \hat{\mathcal{Y}}^{(k)}$ are of the same shapes as $\mathcal{X}^{(k)}, \mathcal{Y}^{(k)}$ respectively, and $\mathcal{X}^{(1)} = \hat{\mathcal{X}}^{(1)}$ since the input data to both networks $\mathbb{M}$ and $\hat{\mathbb{M}}$ is the same.

The compression Algorithm 1 is designed to compress any CNN, and therefore requires applying explicit CP decompositions to the weight tensors of traditional CNNs (step 3 in Algorithm 1). However, for a CNN with CP layers, these CP components are already stored as weight parameters in our CPL structure, and thus are known to the compression algorithm in advance. Therefore, no tensor decomposition is needed when compressing CNNs with CPL, as we can prune out the components with smaller amplitudes directly.

Algorithm 1 Compression of Convolutional Neural Networks

FBRC (in Appendix G) calculates a set of numbers of components $\{\hat{R}^{(k)}\}_{k=1}^{n}$ for the compressed network such that $\|\mathbb{M}(\mathcal{X}) - \hat{\mathbb{M}}(\mathcal{X})\|_F \leq \epsilon \|\mathbb{M}(\mathcal{X})\|_F$ holds for any given $\epsilon$ and for any input $\mathcal{X}$ in the training dataset $S$.
CNN-Project (in Appendix G) takes a given set of numbers of components $\{\hat{R}^{(k)}\}_{k=1}^{n}$ and returns a compressed network $\hat{\mathbb{M}}$ by pruning out the smaller components in the CP spectrum of the weight tensors of $\mathbb{M}$.
More intuitions of the sub-procedures FBRC and CNN-Project are described in Section 4.2 and Appendix G.

Input: A CNN $\mathbb{M}$ of $n$ layers and a margin $\gamma$
Output: A compressed $\hat{\mathbb{M}}$ whose expected error $L_0(\hat{\mathbb{M}}) \leq \hat{L}_\gamma(\mathbb{M}) + \tilde{O}\left(\sqrt{\frac{\sum_{k=1}^{n} \hat{R}^{(k)} (s^{(k)} + o^{(k)} + k_x^{(k)} k_y^{(k)} + 1)}{m}}\right)$
1: Calculate all layer cushions $\{\zeta^{(k)}\}_{k=1}^{n}$ based on definition 4.4
2: Pick $R^{(k)} = \min\{s^{(k)} o^{(k)},\, s^{(k)} k_x^{(k)} k_y^{(k)},\, o^{(k)} k_x^{(k)} k_y^{(k)}\}$ for each layer $k$
3: If $\mathbb{M}$ does not have CPL, apply a CP-decomposition to the weight tensor of each layer $k$
4: Set the perturbation parameter $\epsilon := \frac{\gamma}{2 \max_{\mathcal{X}} \|\mathbb{M}(\mathcal{X})\|_F}$
5: Compute the number of components needed for each layer of the compressed network: $\{\hat{R}^{(k)}\}_{k=1}^{n} \leftarrow \text{FBRC}\left(\{\mathcal{M}^{(k)}\}_{k=1}^{n}, \{R^{(k)}\}_{k=1}^{n}, \{\zeta^{(k)}\}_{k=1}^{n}, \epsilon\right)$
6: $\hat{\mathbb{M}} \leftarrow \text{CNN-Project}\left(\mathbb{M}, \{\hat{R}^{(k)}\}_{k=1}^{n}\right)$
7: Return the compressed convolutional neural network $\hat{\mathbb{M}}$

4.2 Characterizing Compressibility of CNN with CPL: Network Properties

In this section, we propose the following layer-wise properties that can be evaluated based on the training data $S$: tensorization factor (TF), tensor noise bound (TNB), and layer cushion (LC) (Arora et al., 2018). These proposed properties are very effective at characterizing the compressibility of a neural network. As Algorithm 1's sub-procedure FBRC selects a set of numbers of components $\{\hat{R}^{(k)}\}_{k=1}^{n}$ to obtain a compressed network $\hat{\mathbb{M}}$ whose output is similar to that of the original network (i.e., $\|\mathbb{M}(\mathcal{X}) - \hat{\mathbb{M}}(\mathcal{X})\|_F \leq \epsilon \|\mathbb{M}(\mathcal{X})\|_F$ for any input $\mathcal{X} \in S$), our proposed properties will assist the selection of $\{\hat{R}^{(k)}\}_{k=1}^{n}$ to guarantee that Algorithm 1 returns a "good" compressed network.
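The pruning step described above (keep the $\hat{R}^{(k)}$ components with the largest amplitudes, discard the rest) can be sketched in NumPy for a single convolutional CP layer. This is an illustrative sketch under stated assumptions and is not the CNN-Project procedure from Appendix G: the helper names `conv_cp_kernel`, `prune_components` and `cpl_param_count`, the array layout, and the toy spectrum are all hypothetical.

```python
import numpy as np

def conv_cp_kernel(lam, a, b, C):
    """Dense 4th-order kernel sum_r lam[r] * a[r] (x) b[r] (x) C[r].

    lam: (R,), a: (R, s), b: (R, o), C: (R, kx, ky).
    Returns an array of shape (s, o, kx, ky).
    """
    return np.einsum("r,ri,rj,rxy->ijxy", lam, a, b, C)

def prune_components(lam, a, b, C, r_hat):
    """Keep the r_hat components with the largest amplitudes |lam[r]|."""
    keep = np.argsort(-np.abs(lam))[:r_hat]
    return lam[keep], a[keep], b[keep], C[keep]

def cpl_param_count(R, s, o, kx, ky):
    """R * (s + o + kx*ky + 1), the count stated for a convolutional CPL."""
    return R * (s + o + kx * ky + 1)

# Toy layer (hypothetical sizes) with a fast-decaying, "spiky" spectrum.
rng = np.random.default_rng(0)
R, s, o, kx, ky = 6, 3, 4, 3, 3
lam = np.array([5.0, 3.0, 1.0, 0.1, 0.05, 0.01])
a = rng.standard_normal((R, s)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.standard_normal((R, o)); b /= np.linalg.norm(b, axis=1, keepdims=True)
C = rng.standard_normal((R, kx, ky))
C /= np.linalg.norm(C, axis=(1, 2), keepdims=True)  # unit Frobenius norm

K = conv_cp_kernel(lam, a, b, C)
lam_hat, a_hat, b_hat, C_hat = prune_components(lam, a, b, C, r_hat=3)
K_hat = conv_cp_kernel(lam_hat, a_hat, b_hat, C_hat)

# Each component has unit Frobenius norm, so by the triangle inequality the
# pruning error is at most the total amplitude of the dropped components.
dropped = np.abs(lam).sum() - np.abs(lam_hat).sum()
print(np.linalg.norm(K - K_hat), "<=", dropped)
print(cpl_param_count(3, s, o, kx, ky), "of", cpl_param_count(R, s, o, kx, ky))
```

Because every rank-one component is normalized, the dropped amplitudes bound the kernel perturbation directly, which is why a fast-decaying spectrum makes pruning cheap in this toy setting.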
Definition 4.2. [tensorization factor $t_j^{(k)}$] The tensorization factors $\{t_j^{(k)}\}_{j=1}^{R^{(k)}}$ of the $k$-th layer are defined as

$$t_j^{(k)} := \max_{f,g} \sum_{r=1}^{j} \lambda_r^{(k)} \tilde{C}_r^{(f,g)} \tag{1}$$

where $\lambda_r^{(k)}$ is the $r$-th largest value in the CP spectrum of the weight tensor $\mathcal{M}^{(k)}$ and $\tilde{C}_r^{(f,g)}$ denotes the amplitude at the frequency $(f, g)$.

Remark. The tensorization factor characterizes both the generalizability and the expressive power of a given network. For a fixed $j$, a smaller tensorization factor indicates the original network is more compressible and thus has a smaller generalization bound. However, a smaller tensorization factor may also indicate that the given network does not possess enough expressive power. Thus, during the compression of a neural network with good generalizability, we need to find a "good" $j$ that generates a tensorization factor demonstrating the balance between a small generalization gap and high expressive power.

Definition 4.3. [tensor noise bound $\xi_j^{(k)}$] The tensor noise bound $\{\xi_j^{(k)}\}_{j=1}^{R^{(k)}}$ of the $k$-th layer measures the amplitudes of the remaining components after pruning the ones with amplitudes smaller than $\lambda_j^{(k)}$:

$$\xi_j^{(k)} := \max_{f,g} \sum_{r=j+1}^{R^{(k)}} \lambda_r^{(k)} \tilde{C}_r^{(f,g)} \tag{2}$$

Remark. For a fixed $j$, a smaller tensor noise bound indicates the original neural network's weight tensor is more low-rank and thus more compressible.

Definition 4.4. [layer cushion $\zeta^{(k)}$] As introduced in Arora et al. (2018), the layer cushion of the $k$-th layer is defined to be the largest value $\zeta^{(k)}$ such that for any $\mathcal{X}^{(k)} \in S$,

$$\zeta^{(k)} \frac{\|\mathcal{M}^{(k)}\|_F}{\sqrt{H^{(k)} W^{(k)}}} \|\mathcal{X}^{(k)}\|_F \leq \|\mathcal{M}^{(k+1)}\|_F \tag{3}$$

Following Arora et al. (2018), layer cushion considers how much the output tensor $\|\mathcal{M}^{(k+1)}\|_F$ grows w.r.t. the weight tensor $\|\mathcal{M}^{(k)}\|_F$ and the input $\|\mathcal{X}^{(k)}\|_F$.

Remark. As introduced in Arora et al. (2018), the layer cushion considers how much smaller the output $\|\mathcal{X}^{(k+1)}\|_F$ of the $k$-th layer (after activation) is compared with the product between the weight tensor $\|\mathcal{M}^{(k)}\|_F$ and the input $\|\mathcal{X}^{(k)}\|_F$. Note that our layer cushion can be larger than 1 if models use batchnorm, and larger layer cushions will render smaller generalization bounds, as also shown in (Arora et al., 2018).

Our proposed properties, orthogonal to the interlayer properties introduced in (Arora et al., 2018), provide better measurements of the compressibility in each individual convolutional layer via the use of tensor analysis and Fourier analysis, and thus lead to a tighter bound of the layer-wise error propagation.

4.3 Generalization Guarantee of CNNs

Based on Algorithm 1 and our proposed properties in section 4.2, we obtain a generalization bound for the compressed convolutional neural network $\hat{\mathbb{M}}$ and, in section 5, we will evaluate this bound explicitly.

Theorem 4.5 (Main Theorem). For any convolutional neural network $\mathbb{M}$ with $n$ layers, Algorithm 1 generates a compressed CNN $\hat{\mathbb{M}}$ such that with high probability, the expected error $L_0(\hat{\mathbb{M}})$ is bounded by the empirical margin loss $\hat{L}_\gamma(\mathbb{M})$ (for any margin $\gamma \geq 0$) and a complexity term defined as follows

$$L_0(\hat{\mathbb{M}}) \leq \hat{L}_\gamma(\mathbb{M}) + \tilde{O}\left(\sqrt{\frac{\sum_{k=1}^{n} \hat{R}^{(k)} (s^{(k)} + o^{(k)} + k_x^{(k)} k_y^{(k)} + 1)}{m}}\right) \tag{4}$$

given that for every layer $k$, the number of components $\hat{R}^{(k)}$ in the compressed network satisfies

$$\hat{R}^{(k)} = \min\left\{ j \in [R^{(k)}] \;\middle|\; \xi_j^{(k)} \prod_{i=k+1}^{n} t_j^{(i)} \leq C \right\} \tag{5}$$

with $C = \frac{\gamma}{2n \max_{\mathcal{X} \in S} \|\mathbb{M}(\mathcal{X})\|_F} \prod_{i=k}^{n} \zeta^{(i)} \|\mathcal{M}^{(i)}\|_F$, where $t_j^{(k)}$, $\xi_j^{(k)}$ and $\zeta^{(k)}$ are data-dependent measurable properties: the tensorization factor, tensor noise bound, and layer cushion of the $k$-th layer in definitions 4.2, 4.3 and 4.4 respectively.

Remark. How well the compressed neural network approximates the original network is related to the choice of $\hat{R}^{(k)}$. Inside equation (5), $C$ is some value independent of the choice of $j$ in the inequality. Therefore, the number of components for the $k$-th layer in the compressed network, $\hat{R}^{(k)}$, is the smallest $j \in [R^{(k)}]$ such that the inequality $\xi_j^{(k)} \prod_{i=k+1}^{n} t_j^{(i)} \leq C$ holds. Hence, smaller tensorization factors and tensor noise bounds will make the LHS smaller, and larger layer cushions will make the RHS, $C$, larger. As a result, if the above inequality for each layer can be satisfied by a smaller $j$, the obtained generalization bound will be tighter as we can obtain a smaller $\hat{R}^{(k)}$.

Analysis of generalization bounds in Theorem 4.5: This proposed generalization error bound increases with the number of components in the CP layers of the compressed neural network. Therefore, when the original neural network is highly compressible or very low-rank, the number of components
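The selection rule in equation (5) and the complexity term in equation (4) can be illustrated with a short NumPy sketch. It assumes the tensor noise bounds, tensorization factors, and the per-layer constants $C$ have already been measured and are passed in as arrays. The function names, the simplification that every layer has the same number of components $R$, the fallback when no $j$ qualifies, and all numbers in the toy example are assumptions made for illustration. This is not the FBRC procedure from Appendix G.

```python
import math
import numpy as np

def select_num_components(xi, t, C):
    """Smallest j per layer with xi_j^(k) * prod_{i>k} t_j^(i) <= C^(k).

    xi, t: arrays of shape (n, R); row k-1 holds layer k's values for
           j = 1..R (assumes every layer has the same R).
    C:     array of shape (n,), the right-hand side of (5) for each layer.
    """
    n, R = xi.shape
    r_hat = []
    for k in range(n):
        # Product over the later layers; an empty product is all ones.
        later = np.prod(t[k + 1:], axis=0)
        ok = np.nonzero(xi[k] * later <= C[k])[0]
        # Assumed fallback: keep all R components if no j qualifies.
        r_hat.append(int(ok[0]) + 1 if ok.size else R)
    return r_hat

def complexity_term(r_hat, s, o, kx, ky, m):
    """sqrt(sum_k R_hat^(k) (s + o + kx*ky + 1) / m).

    The O-tilde in (4) hides constants and log factors, so this value
    only tracks how the bound scales; it is not the bound itself.
    """
    total = sum(r * (si + oi + kxi * kyi + 1)
                for r, si, oi, kxi, kyi in zip(r_hat, s, o, kx, ky))
    return math.sqrt(total / m)

# Toy measurements for a 2-layer network with R = 3 (hypothetical numbers).
xi = np.array([[0.9, 0.2, 0.0],
               [0.5, 0.1, 0.0]])
t = np.array([[1.0, 1.2, 1.3],
              [2.0, 2.5, 2.6]])
C = np.array([0.6, 0.5])

r_hat = select_num_components(xi, t, C)
term = complexity_term(r_hat, s=[3, 4], o=[4, 8], kx=[3, 3], ky=[3, 3], m=1000)
print(r_hat, term)
```

In this toy setting, layer 1 needs two components because the later layer's tensorization factors amplify its noise bound, while layer 2 can keep a single component.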
Table 1: Comparison of the training and test accuracies between neural networks (NNs) with CPL (CP-VGG-16, CP-WRN-28-10) and traditional NNs (VGG-16, WRN-28-10) on the CIFAR10 and CIFAR100 datasets.

Dataset    Acc.       VGG-16 with CPL   VGG-16 without CPL   WRN-28-10 with CPL   WRN-28-10 without CPL
CIFAR10    Training   100%              100%                 100%                 100%
CIFAR10    Test       93.68%            92.64%†              95.09%               95.83%∗
CIFAR100   Training   100%              100%                 100%                 100%
CIFAR100   Test       71.8%             70.84%‡              76.36%               79.5%∗

Table 2: Test accuracy on CIFAR10 with various label corruption rates (CR).

is orders of magnitude tighter than other capacity measures, such as $\ell_{1,\infty}$ (Bartlett and Mendelson, 2002), Frobenius (Neyshabur et al., 2015b), spec $\ell_{1,2}$ (Bartlett et al., 2017) and spec-fro (Neyshabur et al., 2017a) as shown in their Figure 4 Left. The use of a more effective and practical compression approach allows us to achieve better compression (detailed discussions are in Appendix Section A.2).

[Figure residue: y-axis "Generalization bound" (scale $\cdot 10^{7}$); legend entries "our bound" and "test error".]
Table 3: Average test accuracy on MNIST over the last ten epochs. Baseline simply denotes training a neural
network on the corrupted training set without further processing. PairFlip denotes that the label mistakes
can only happen within very similar classes and Symmetric denotes that the label mistakes may happen across
different classes uniformly (Han et al., 2018).
Task: Rate        Baseline             F-correction         MentorNet              CT                   CT + CPL
                  (Han et al., 2018)   (Han et al., 2018)   (Jiang et al., 2017)   (Han et al., 2018)
PairFlip: 45%     56.52 ± 0.55         0.24 ± 0.03          80.88 ± 4.45           87.63 ± 0.21         92.43 ± 0.01
Symmetric: 50%    66.05 ± 0.61         79.61 ± 1.96         90.05 ± 0.30           91.32 ± 0.06         94.70 ± 0.05
Symmetric: 20%    94.05 ± 0.16         98.80 ± 0.12         96.70 ± 0.22           97.25 ± 0.03         97.91 ± 0.01
using "good" data or corrupted data by evaluating our proposed properties.

[Figure 3 panels: (a) Rank $r^{(k)}$, (b) TF $t_j^{(k)}$, (c) TNB $\xi_j^{(k)}$, (d) LC $\zeta^{(k)}$. x-axis: layers; curves: well trained, corrupted.]
Figure 3: Comparison of our proposed properties across layers between well-trained and corrupted CP-VGG-16. The statistics are obtained from 200 models trained under the same optimization settings.

We further apply Algorithm 1 to these well-trained and corrupted models to investigate the consistency between the compression performance of Algorithm 1 and our theoretical results: on average, Algorithm 1 achieves a 31.83% compression rate on the well-trained models, but only an 89.7% compression rate on the corrupted models (a lower compression rate is better as it implies a smaller generalization error bound). Clearly, the low-rank structures in well-trained models allow them to be compressed much further, consistent with our theoretical analysis of Algorithm 1.

5.2 Generalization Improvement on Real Data Experiments

ratios.

Our CPL, combined with co-teaching (CT) (Han et al., 2018) (the SOTA method for defeating label noise), further improves its performance as shown in Table 3, where we also compare our method CT+CPL against other label-noise methods (Han et al., 2018). Besides, in Figure 5, our method CT+CPL consistently outperforms the SOTA method (CT) with various choices of the number of components.

5.3 CPL Is Natural for Compression

Applying CPL for neural network compression is extensively studied in Su et al. (2018); therefore we focus on explaining why CPL is natural for compression and on analyzing the compressibility of CPLs.

Low Rankness in Neural Networks with CPL vs Traditional Neural Networks. The low rankness of a CP-VGG and a traditional VGG is demonstrated by Figure 4, where we display the ratios of the number of components with amplitudes above a given threshold 0.2. We clearly see that VGG with CPL exhibits low rankness consistently for all layers while the traditional VGG is not low-rank. Notice that the CP spectrum in each CPL is normalized by dividing by the largest amplitude, and the CP components of the traditional VGG are obtained via explicit CP decompositions with reconstruction error set to 1e-3.

[Figure 4 residue: two panels, each with y-axis "% of components ≥ 0.2".]
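The low-rankness measure described above (normalize a layer's CP spectrum by its largest amplitude, then report the fraction of components at or above the threshold 0.2) can be sketched as follows. The function name and both toy spectra are hypothetical and chosen only to contrast a spiky spectrum with a flat one; they are not measurements from the paper.

```python
import numpy as np

def fraction_above_threshold(cp_spectrum, threshold=0.2):
    """Fraction of components whose normalized amplitude is >= threshold.

    The spectrum is normalized by dividing by its largest amplitude.
    """
    amps = np.abs(np.asarray(cp_spectrum, dtype=float))
    normalized = amps / amps.max()
    return float(np.mean(normalized >= threshold))

# Hypothetical spectra: a spiky one and a nearly flat one.
spiky = [10.0, 4.0, 1.0, 0.5, 0.2, 0.1, 0.1, 0.05]
flat = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

print(fraction_above_threshold(spiky))  # 0.25
print(fraction_above_threshold(flat))   # 1.0
```

A small fraction means most of the spectrum's mass sits in a few components, which is the sense of "low rankness" used in this comparison.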
(2015); Garipov et al. (2016); Wang et al. (2018); Su et al. (2018). However, the compression we perform does not require any fine tuning since it directly prunes out the components with amplitudes below some given threshold. In experiments, we compress a CP-WRN-28-10, which has the same number of parameters as WRN-28-10, by 8× with only 0.56% performance drop on CIFAR10 image classification. The full compression results for CP-WRN-28-10 under different cutting-off thresholds are shown in Table 5, where components whose amplitudes are under the cutting-off threshold are pruned.

6 Conclusion and Discussion

In this work, we derive a practical compression-based generalization bound via the proposed layerwise structure CP layers, and demonstrate the effectiveness of using tensor methods in theoretical analyses of deep neural networks. With a series of benchmark experiments, we show the practical usage of our generalization bound and the effectiveness of our proposed structure CPL in terms of compression and generalization. A possible future direction is studying the effectiveness of other tensor decomposition methods such as Tucker or Tensor Train.

Acknowledgement

This research was supported by startup fund from Department of Computer Science of University of Maryland, National Science Foundation IIS-1850220 CRII Award 030742-00001, DOD-DARPA-Defense Advanced Research Projects Agency Guaranteeing AI Robustness against Deception (GARD), Laboratory for Physical Sciences at University of Maryland. This research was also supported in part by JSPS Kakenhi (26280009, 15H05707 and 18H03201), Japan Digital Design and JST-CREST. Huang was also supported by Adobe, Capital One and JP Morgan faculty fellowships. We thank Ziyin Liu for supporting this research with great advice and efforts. We thank Jin-peng Liu, Kai Wang, and Dongruo Zhou for helpful discussions and comments. We thank Jingxiao Zheng for supporting additional computing resources.

References

Anima Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014a.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014b.

Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Conference on Learning Theory (COLT), June 2015.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. 2018.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Peter L Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems, pages 190–196, 1999.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 297–299. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/golowich18a.html.
Understanding Generalization in Deep Learning via Tensor Methods
Alon Gonen and Shai Shalev-Shwartz. Fast rates for empirical risk minimization of strict saddle problems. In COLT, 2017.
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8536–8546, 2018.
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1225–1234. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045520.
Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension bounds for piecewise linear neural networks. In Conference on Learning Theory, pages 1064–1068, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Furong Huang, UN Niranjan, Mohammad Umar Hakeem, and Animashree Anandkumar. Online tensor methods for learning latent variable models. Journal of Machine Learning Research, 16:2797–2835, 2015.
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
Daniel Jakubovitz, Raja Giryes, and Miguel RD Rodrigues. Generalization error in deep learning. arXiv preprint arXiv:1808.01174, 2018.
Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.
Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
Jean Kossaifi, Aran Khanna, Zachary Lipton, Tommaso Furlanello, and Anima Anandkumar. Tensor contraction layers for parsimonious deep nets. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1940–1946. IEEE, 2017.
Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. Tensorly: Tensor learning in python. The Journal of Machine Learning Research, 20(1):925–930, 2019.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
Joseph B Kruskal. Rank, decomposition, and uniqueness for 3-way and n-way arrays. Multiway data analysis, pages 7–18, 1989.
Ilja Kuzborskij and Christoph H. Lampert. Data-dependent stability of stochastic gradient descent. In ICML, 2018.
John Langford and Rich Caruana. (not) bounding the true error. In Advances in Neural Information Processing Systems, pages 809–816, 2002.
Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
Jialin Li and Furong Huang. Guaranteed simultaneous asymmetric tensor decomposition via orthogonalized alternating least squares. arXiv preprint arXiv:1805.10348, 2018.
David A McAllester. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170. ACM, 1999a.
David A McAllester. Some pac-bayesian theorems. Machine Learning, 37(3):355–363, 1999b.
Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015a.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015b.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017a.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017b.
Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. arXiv preprint arXiv:1805.10408, 2018.
Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Generalization error of invariant classifiers. arXiv preprint arXiv:1610.04574, 2016.
Jiahao Su, Jingling Li, Bobby Bhattacharjee, and Furong Huang. Tensorial neural networks: Generalization of neural networks and application to model compression. https://arxiv.org/pdf/1805.10352.pdf, 2018.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
Wenqi Wang, Yifan Sun, Brian Eriksson, Wenlin Wang, and Vaneet Aggarwal. Wide compression: Tensor ring nets. learning, 14(15):13–31, 2018.
Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.
Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018.
Understanding Generalization in Deep Learning via Tensor Methods
Supplementary Material
Supplementary material for the paper: “Understanding Generalization in Deep Learning via Tensor Methods”.
This appendix is organized as follows:
• Appendix A: Experimental details and additional results
• Appendix B: Technical definitions and propositions
• Appendix C: Main technical contributions
• Appendix D, E, and F: Generalization bounds on three types of neural networks: convolutional neural
networks, fully-connected neural networks, and neural networks with residual connections
• Appendix G: Additional algorithms and algorithmic details
We train these four models (VGG-16, CP-VGG-16, WRN-28-10 and CP-WRN-28-10) using standard optimization settings with no dropout and the default initializations provided by PyTorch (Paszke et al., 2017). We use an SGD optimizer with momentum=0.9, weight decay=5e-4, and initial learning rate=0.05 to start the training process. For VGG-16 and CP-VGG-16, the learning rate is divided by 2 every 30 epochs; for WRN-28-10 and CP-WRN-28-10, it is divided by 5 at the 60th, 120th and 160th epochs. We train each VGG-16 and CP-VGG-16 for 300 epochs, and each WRN-28-10 and CP-WRN-28-10 for 200 epochs.
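The two learning-rate schedules described above can be sketched in plain Python. This is only an illustration under stated assumptions: epochs are counted from 0, each drop takes effect at the start of the named epoch, and the function names `vgg_lr` and `wrn_lr` are hypothetical rather than taken from the authors' code.

```python
def vgg_lr(epoch, base_lr=0.05):
    """Learning rate for VGG-16 / CP-VGG-16: halved every 30 epochs (assumed 0-indexed)."""
    return base_lr / (2 ** (epoch // 30))


def wrn_lr(epoch, base_lr=0.05, milestones=(60, 120, 160)):
    """Learning rate for WRN-28-10 / CP-WRN-28-10: divided by 5 at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (5 ** drops)


# Print the schedule at a few epochs to see where the drops happen.
for e in (0, 30, 60, 120, 160):
    print(e, vgg_lr(e), wrn_lr(e))
```

In a PyTorch training loop the same behaviour would typically be obtained with a step-based scheduler; the sketch only makes the arithmetic of the schedule explicit.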
The generalization bound we calculated for a well-trained CP-VGG-16 (with the same number of parameters as VGG-16) on the CIFAR10 dataset is around 12 (thus, of order $10^1$) according to the transformation $f(x) = x/20 - 0.5$ applied in Figure 2b. Our evaluated bound is much better than naive counting of the number of parameters. We may not be able to directly compare our calculated bound with that in (Arora et al., 2018), which is roughly of order $10^5$, because (Arora et al., 2018) uses a VGG-19 to evaluate their generalization bound while our evaluation is done using a CP-VGG-16. We therefore present in Table 4 the effective number of parameters identified by our proposed bound. Compared with the effective number of parameters in (Arora et al., 2018) (Table 1 of (Arora et al., 2018)), we can see that (1) our effective number of parameters is upper bounded by the total number of parameters in the original network (thus, the compression ratio is bounded by 1), while the effective number identified by (Arora et al., 2018) could be several times larger than the original number of parameters (e.g. based on Table 1 of (Arora et al., 2018), their effective numbers of parameters in layers 4 and 6 are more than 4 times the original number of parameters); (2) the effective number of parameters in (Arora et al., 2018) ignores the dependence on depth, log factors and constants, while our effective number of parameters in Table 4 is exactly the actual number of parameters in the compressed network without these dependences.
The compression results in Table 5 are obtained directly without any fine tuning.
We provide additional experimental details on the improved generalization ability achieved by CPL under the label noise setting. Our CPL combined with co-teaching (CT) (Han et al., 2018) outperforms the state-of-the-art method. Co-teaching (Han et al., 2018) is a training procedure for defeating label noise: it avoids overfitting to noisy labels by selecting clean samples out of the noisy ones and using them to update the network. Given the experimental results that neural networks with CPL tend to overfit less to noisy labels (Table 3), we combine Co-teaching with CPL to train networks on three different types of corrupted data (Table 3). The hyperparameters we use in these experiments are the same as the ones in Co-teaching (Han et al., 2018).
Table 4: Effective number of parameters identified by our proposed bound in Theorem 4.5.
Table 5: The compression results of a 28-layer Wide-ResNet equipped with CPL (CP-WRN-28-10) on CIFAR10
dataset. The compression is done via normalizing the CP spectrum and then deleting the components in CPL
which have amplitudes smaller than the given cut-off-threshold.
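The pruning rule in the caption of Table 5 (normalize the CP spectrum, then delete components whose amplitude falls below a cut-off threshold) can be illustrated with a small NumPy sketch. The function name `prune_cp_components`, the convention that every factor array has the rank `R` as its leading dimension, and the normalization by the largest amplitude are assumptions made for this illustration, not details confirmed by the paper.

```python
import numpy as np


def prune_cp_components(lam, factors, threshold):
    """Keep CP components whose normalized amplitude is at least `threshold`.

    lam      : array of shape (R,), the CP spectrum.
    factors  : list of arrays, each with leading dimension R.
    Returns the kept amplitudes, the kept factor slices, and the fraction kept.
    """
    lam = np.asarray(lam, dtype=float)
    normalized = np.abs(lam) / np.abs(lam).max()  # assumed: normalize by the largest amplitude
    keep = normalized >= threshold                 # components below the cut-off are deleted
    return lam[keep], [np.asarray(f)[keep] for f in factors], float(keep.mean())


lam = np.array([4.0, 2.0, 0.4, 0.04])
U = np.arange(8.0).reshape(4, 2)  # toy factor with R = 4 rows
kept_lam, kept_factors, ratio = prune_cp_components(lam, [U], threshold=0.05)
print(kept_lam, ratio)
```

The returned fraction is only the share of retained components; the compression ratio reported in the paper may be computed differently.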
Remark. The results displayed in Figure 5 and Figure 6 are based on our implementation of the CT method in
order to achieve a fair comparison, while the results displayed in Table 3 are based on the reported accuracies
by (Han et al., 2018) as we would like to compare our CT+CPL with other different label-noise methods as
well.
Figure 7 displays the CP spectra of a well-trained, a corrupted, and a randomly initialized CP-VGG-16 (at the 13th convolutional layer). For the unnormalized CP spectra of the three models in Figure 7(a), we can see that the largest amplitude in the CP spectrum of the corrupted CP-VGG-16 is much smaller than that of the well-trained and random models. Yet, a smaller leading value in the CP spectrum does not necessarily mean that the corrupted model is more low-rank. As shown in Figure 7(b), after normalizing the CP spectrum of each model by its largest amplitude, the well-trained CP-VGG-16 still has a more low-rank CP spectrum (the blue curve) than
3 PairFlip denotes that the label mistakes can only happen within very similar classes (Han et al., 2018).
4 Symmetric denotes that the label mistakes may happen across different classes uniformly (Han et al., 2018).
[Figure (caption missing in the extracted text): test accuracy (%) versus compression ratio for CT and CT+CPL, two panels.]
[Figure 6 plots: accuracy (%) versus epoch for CT and CT+CPL, two panels.]
Figure 6: Convergence plots of test accuracy vs. number of epochs on MNIST data
[Figure 7 plots: unnormalized amplitude and normalized amplitude versus index of the components.]
(a) CP spectrum with unnormalized amplitudes (b) CP spectrum with normalized amplitudes
Figure 7: Comparison of the CP spectra of a well-trained, a corrupted, and a randomly initialized CP-VGG-16 at the 13th convolutional layer
that of corrupted or random models. Notice that the random model is the least low-rank since its weight tensors are the closest to random noise and are thus hard to compress.
We also compare our proposed properties among the three different sets of CP-VGG-16: well-trained, corrupted, and randomly initialized. As shown in Figure 8, since random models are the least compressible (their weight tensors are closest to random noise), properties that focus more on the compressibility of the model are larger on random models (e.g. tensor noise bound), which leads to larger generalization bounds. Meanwhile, properties that focus more on measuring the information loss after compression as well as the expressive power of the models (e.g. Fourier factors) are smaller for random models. The reason why well-trained models have
[Figure 8 plots: per-layer values for well-trained, corrupted, and random models. Recoverable panel titles: rank across layers, tensorization factor across layers, tensor noise bound across layers, norm across layers, right hand side across layers.]
(a) Rank $R$ (b) TF $t_j^{(k)}$ (c) TNB $\xi_j^{(k)}$
Figure 8: Comparison of proposed properties among well-trained, corrupted and randomly-initialized CP-VGG-16 models
the largest tensorization factor can be seen in Figure 7: the corrupted model usually has a very small leading value in the CP spectrum of its later layers; yet, as explained before, this does not necessarily indicate that corrupted models are more compressible or more low-rank. Why the CP spectra of corrupted models tend to have a small leading value is still an interesting question, and we defer it to future work.
Optimization settings for obtaining the well-trained, corrupted, and randomly initialized models of CP-VGG-16. We obtain well-trained CP-VGG-16 using the same hyperparameter settings as mentioned in Appendix Section A.1. For corrupted CP-VGG-16, we train the model under 50% label noise, using the same set of hyperparameters as for the well-trained models. For CP-VGG-16 with random initialization, we train the models for less than 1 epoch. For each set of these models, we obtain 200 instances using different random seeds.
In this section, we will briefly review three key concepts underlying all analysis in this work: the multidimensional discrete Fourier transform (MDFT), CP decomposition, and the 2D-convolutional layer in neural networks.
Definition B.1. (Multidimensional discrete Fourier transform, MDFT) An $m$-dimensional MDFT $\mathcal{F}_m$ defines a mapping from an $m$-order tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_m}$ to another complex $m$-order tensor $\tilde{\mathcal{X}} \in \mathbb{C}^{N_1 \times \cdots \times N_m}$ such that
$$\tilde{\mathcal{X}}_{f_1,\dots,f_m} = \left(\prod_{l=1}^{m} N_l\right)^{-\frac{1}{2}} \sum_{n_1=1}^{N_1} \cdots \sum_{n_m=1}^{N_m} \mathcal{X}_{n_1,\dots,n_m} \left(\prod_{l=1}^{m} \omega_{N_l}^{f_l n_l}\right) \tag{6}$$
where $\omega_{N_l} = \exp(-j2\pi/N_l)$ and $\left(\prod_{l=1}^{m} N_l\right)^{-\frac{1}{2}}$ is the normalization factor that makes $\mathcal{F}_m$ unitary. Throughout the paper, we will use symbols with tilde (e.g. $\tilde{\mathcal{X}}$) to denote tensors after MDFT.
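MDFT can also be applied on a subset of the dimensions $I \subseteq [m]$, and in this case we denote the mapping as $\mathcal{F}_m^{I}$.
$$\tilde{\mathcal{X}}_{i_1,\dots,i_m} = \left(\prod_{l \in I} N_l\right)^{-\frac{1}{2}} \sum_{n_l :\, l \in I} \mathcal{X}_{n_1,\dots,n_m} \left(\prod_{l \in I} \omega_{N_l}^{f_l n_l}\right) \tag{7}$$
Similarly, $\mathcal{F}_m^{I}$ is identical to a composition of $|I|$ unidimensional DFTs over corresponding dimensions.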
Fact B.3. (MDFT is unitary) For an MDFT $\mathcal{F}$, its adjoint $\mathcal{F}^{*}$ is equal to its inverse $\mathcal{F}^{-1}$, i.e. $\mathcal{F}^{*} = \mathcal{F}^{-1}$. An immediate corollary of this property is that the operator norm is invariant to MDFT: given an operator $A$, the operator norm of $A$ is equal to that of $\mathcal{F}^{*} A \mathcal{F}$, i.e. $\|A\| = \|\mathcal{F}^{*} A \mathcal{F}\|$.
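As a quick numerical illustration of Definition B.1 and Fact B.3, the sketch below uses NumPy's FFT with `norm="ortho"` as a stand-in for the unitary MDFT. This is an assumption for illustration: NumPy indexes from 0 whereas Eq. (6) sums from 1, so the two transforms differ by a phase factor, which does not affect norms. The helper name `unitary_mdft` is hypothetical.

```python
import numpy as np


def unitary_mdft(x, axes=None):
    """Unitary multidimensional DFT over `axes` (all axes if None)."""
    return np.fft.fftn(x, axes=axes, norm="ortho")


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))

# Unitarity: the Frobenius norm is preserved by the transform.
norm_before = np.linalg.norm(X)
norm_after = np.linalg.norm(unitary_mdft(X))

# An MDFT over a subset of dimensions equals a composition of 1D DFTs.
subset = unitary_mdft(X, axes=(0, 1))
composed = np.fft.fft(np.fft.fft(X, axis=0, norm="ortho"), axis=1, norm="ortho")

print(np.isclose(norm_before, norm_after), np.allclose(subset, composed))
```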
B.2 CP decomposition
Definition B.4. (CP decomposition) Given an $m$-order tensor $\mathcal{T} \in \mathbb{R}^{N_1 \times \cdots \times N_m}$, a CP decomposition factorizes $\mathcal{T}$ into $m$ core factors $\{K^l\}_{l=1}^{m}$ with $K^l \in \mathbb{R}^{R \times N_l}$ (with its $r$th row as $k_r^l \in \mathbb{R}^{N_l}$) such that
$$\mathcal{T} = \sum_{r=1}^{R} \lambda_r \, k_r^1 \otimes \cdots \otimes k_r^m \tag{9a}$$
$$\mathcal{T}_{n_1,\dots,n_m} = \sum_{r=1}^{R} \lambda_r \, K^1_{r,n_1} \cdots K^m_{r,n_m} \tag{9b}$$
where each $k_r^l$ has unit $\ell_2$ norm, i.e. $\|k_r^l\|_2 = 1$, $\forall r \in [R], l \in [m]$. Without loss of generality, we assume the CP eigenvalues are positive and sorted in decreasing order, i.e. $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_R > 0$. If the vectors $k_r^l$ in $K^l$ are orthogonal, i.e. $\langle k_r^l, k_{r'}^l \rangle = 0$ for $r \ne r'$, the factorization is further named orthogonal CP decomposition.
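To make Eq. (9a) and Eq. (9b) concrete, the following NumPy sketch assembles a tensor from its CP factors by summing weighted outer products and compares it with the entrywise formula. The function name `cp_to_tensor` and the storage convention (each factor stored as an array of shape `(R, N_l)` whose `r`-th row is the vector for component `r`) are assumptions made for this illustration.

```python
import numpy as np


def cp_to_tensor(lam, factors):
    """Sum_r lam[r] * k_r^1 (x) ... (x) k_r^m, with factors[l][r] playing the role of k_r^l."""
    T = 0.0
    for r, weight in enumerate(lam):
        component = np.asarray(weight)
        for K in factors:
            component = np.multiply.outer(component, K[r])  # append one tensor mode
        T = T + component
    return T


rng = np.random.default_rng(0)
R, dims = 3, (4, 5, 6)
lam = np.array([3.0, 2.0, 1.0])            # positive and sorted in decreasing order
factors = []
for n in dims:
    K = rng.standard_normal((R, n))
    K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit l2 norm per component vector
    factors.append(K)

T = cp_to_tensor(lam, factors)
T_entrywise = np.einsum("r,ri,rj,rk->ijk", lam, *factors)  # Eq. (9b) for m = 3
print(T.shape, np.allclose(T, T_entrywise))
```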
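Lemma B.5. (MDFT of CP decomposition) If an $m$-order tensor $\mathcal{T} \in \mathbb{R}^{N_1 \times \cdots \times N_m}$ takes a CP decomposition as in Eq. (9a), its (all-dimensional) MDFT $\tilde{\mathcal{T}} = \mathcal{F}_m(\mathcal{T}) \in \mathbb{C}^{N_1 \times \cdots \times N_m}$ also takes a CP format as
$$\tilde{\mathcal{T}} = \sum_{r=1}^{R} \lambda_r \, \tilde{k}_r^1 \otimes \cdots \otimes \tilde{k}_r^m \tag{10}$$
$$\tilde{\mathcal{T}}_{f_1,\dots,f_m} = \sum_{r=1}^{R} \lambda_r \, \tilde{K}^1_{r,f_1} \cdots \tilde{K}^m_{r,f_m} \tag{11}$$
where $\tilde{K}^l = \mathcal{F}_2^{2}(K^l)$, $\forall l \in [m]$. The result can be extended to MDFT where a subset of dimensions are transformed.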
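Proof. (of Lemma B.5) According to the definition of multidimensional discrete Fourier transform, we have
$$\tilde{\mathcal{T}}_{f_1,\dots,f_m} = \left(\prod_{l=1}^{m} N_l\right)^{-\frac{1}{2}} \sum_{n_1=1}^{N_1} \cdots \sum_{n_m=1}^{N_m} \mathcal{T}_{n_1,\dots,n_m} \left(\prod_{l=1}^{m} \omega_{N_l}^{f_l n_l}\right) \tag{12}$$
$$= \left(\prod_{l=1}^{m} N_l\right)^{-\frac{1}{2}} \sum_{n_1=1}^{N_1} \cdots \sum_{n_m=1}^{N_m} \left(\sum_{r=1}^{R} \lambda_r \, K^1_{r,n_1} \cdots K^m_{r,n_m}\right) \left(\prod_{l=1}^{m} \omega_{N_l}^{f_l n_l}\right) \tag{13}$$
$$= \sum_{r=1}^{R} \lambda_r \left(N_1^{-1/2} \sum_{n_1=1}^{N_1} K^1_{r,n_1} \omega_{N_1}^{f_1 n_1}\right) \cdots \left(N_m^{-1/2} \sum_{n_m=1}^{N_m} K^m_{r,n_m} \omega_{N_m}^{f_m n_m}\right) \tag{14}$$
$$= \sum_{r=1}^{R} \lambda_r \, \tilde{K}^1_{r,f_1} \cdots \tilde{K}^m_{r,f_m} \tag{15}$$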
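Lemma B.5 can be checked numerically for a small 3-order tensor. The sketch below is an illustration only: it assumes NumPy's `norm="ortho"` FFT as the unitary MDFT (0-indexed, so it differs from Eq. (6) by a phase that is applied consistently on both sides) and stores each factor as an `(R, N_l)` array.

```python
import numpy as np

rng = np.random.default_rng(0)
R, dims = 3, (4, 5, 6)
lam = np.array([3.0, 2.0, 1.0])
K = [rng.standard_normal((R, n)) for n in dims]

# Build T from its CP factors (Eq. (9b)) and transform all dimensions.
T = np.einsum("r,ri,rj,rk->ijk", lam, *K)
T_tilde = np.fft.fftn(T, norm="ortho")

# Transform each factor along its second dimension only, then rebuild (Eq. (11)).
K_tilde = [np.fft.fft(k, axis=1, norm="ortho") for k in K]
T_from_factors = np.einsum("r,ri,rj,rk->ijk", lam, *K_tilde)

lemma_holds = np.allclose(T_tilde, T_from_factors)
print(lemma_holds)
```

The practical consequence is that the Fourier transform of a CP-factorized weight tensor never needs the full tensor: transforming the factors is enough.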
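Definition B.6. (2D-convolutional layer) In CNNs, a 2D-convolutional layer is parametrized by a 4th-order tensor $\mathcal{M} \in \mathbb{R}^{k_x \times k_y \times T \times S}$ (with $k_x \times k_y$ kernels). It defines a mapping from a 3rd-order input tensor $\mathcal{X} \in \mathbb{R}^{H \times W \times S}$ (with $S$ channels) to another 3rd-order output tensor $\mathcal{Y} \in \mathbb{R}^{H \times W \times T}$ (with $T$ channels).
$$\mathcal{Y}_{:,:,t} = \sum_{s=1}^{S} \mathcal{M}_{:,:,t,s} * \mathcal{X}_{:,:,s} \tag{16}$$
$$\mathcal{Y}_{i,j,t} = \sum_{s=1}^{S} \sum_{p,q} \mathcal{M}_{i-p,j-q,t,s} \, \mathcal{X}_{p,q,s} \tag{17}$$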
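Proof. (of Lemma B.7) The lemma can be easily proved by applying MDFT on both sides of Eq. (17).
$$\tilde{\mathcal{Y}}_{f,g,t} = \frac{1}{\sqrt{HW}} \sum_{i,j} \mathcal{Y}_{i,j,t} \, \omega_H^{if} \omega_W^{jg} \tag{19}$$
$$= \frac{1}{\sqrt{HW}} \sum_{i,j} \left(\sum_{s=1}^{S} \sum_{p,q} \mathcal{M}_{i-p,j-q,t,s} \, \mathcal{X}_{p,q,s}\right) \omega_H^{if} \omega_W^{jg} \tag{20}$$
$$= \sqrt{HW} \sum_{s=1}^{S} \left(\frac{1}{\sqrt{HW}} \sum_{i,j} \mathcal{M}_{i-p,j-q,t,s} \, \omega_H^{(i-p)f} \omega_W^{(j-q)g}\right) \left(\frac{1}{\sqrt{HW}} \sum_{p,q} \mathcal{X}_{p,q,s} \, \omega_H^{pf} \omega_W^{qg}\right) \tag{21}$$
$$= \sqrt{HW} \sum_{s=1}^{S} \tilde{\mathcal{M}}_{f,g,t,s} \, \tilde{\mathcal{X}}_{f,g,s} \tag{22}$$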
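The identity in Eq. (22) can be verified numerically. The sketch below makes two assumptions that are not spelled out in the extracted text: the convolution in Eq. (17) is circular (indices taken modulo $H$ and $W$), and the kernel is zero-padded to size $H \times W$. The helper names `circular_conv2d` and `mdft` are hypothetical.

```python
import numpy as np


def circular_conv2d(M, X):
    """Y[i, j, t] = sum_{s, a, b} M[a, b, t, s] * X[i - a, j - b, s], indices modulo H and W."""
    H, W, T, S = M.shape
    Y = np.zeros((H, W, T))
    for a in range(H):
        for b in range(W):
            shifted = np.roll(X, shift=(a, b), axis=(0, 1))  # shifted[i, j] = X[i - a, j - b]
            Y += shifted @ M[a, b].T                         # (H, W, S) @ (S, T) -> (H, W, T)
    return Y


def mdft(A):
    """Unitary DFT over the two spatial dimensions."""
    return np.fft.fftn(A, axes=(0, 1), norm="ortho")


rng = np.random.default_rng(0)
H, W, T, S = 4, 5, 3, 2
M = rng.standard_normal((H, W, T, S))
X = rng.standard_normal((H, W, S))

lhs = mdft(circular_conv2d(M, X))
rhs = np.sqrt(H * W) * np.einsum("fgts,fgs->fgt", mdft(M), mdft(X))
theorem_holds = np.allclose(lhs, rhs)
print(theorem_holds)
```

The factor $\sqrt{HW}$ appears because all three transforms are unitary; with unnormalized DFTs the product would carry no extra factor.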
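Lemma B.8. (Operator norm of 2D-convolutional layer) Suppose we rewrite the tensors in matrix/vector form, i.e. $\tilde{\mathcal{X}}_{f,g,s} = \tilde{x}^{(f,g)}_s$, $\tilde{\mathcal{M}}_{f,g,t,s} = \tilde{M}^{(f,g)}_{t,s}$, $\tilde{\mathcal{Y}}_{f,g,t} = \tilde{y}^{(f,g)}_t$, then Eq. (18) can be written using matrix/vector products:
$$\tilde{y}^{(f,g)}_t = \sqrt{HW} \sum_{s=1}^{S} \tilde{M}^{(f,g)}_{t,s} \, \tilde{x}^{(f,g)}_s, \quad \forall f, g \tag{23}$$
The operator norm of $\mathcal{M}$, defined as $\|\mathcal{M}\| = \max_{\|\mathcal{X}\|_F = 1} \|\mathcal{Y}\|_F$, can be obtained by spectral norms of $\tilde{M}^{(f,g)}$ as:
$$\|\mathcal{M}\| = \sqrt{HW} \max_{f,g} \left\|\tilde{M}^{(f,g)}\right\|_2 \tag{24}$$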
Remarks. The bound is first given by Sedghi et al. (2018). In this work, we provide a much simpler proof
compared to the original one in Sedghi et al. (2018). In the next section, we show that the bound can be computed
without evaluating the spectral norm if the weights tensor M takes a CP format similar to Eq. (9a).
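Proof. (of Lemma B.8) From Fact B.3, we know that $\|\mathcal{M}\| = \|\tilde{\mathcal{M}}\|$, where $\|\tilde{\mathcal{M}}\| = \max_{\|\tilde{\mathcal{X}}\|_F = 1} \|\tilde{\mathcal{Y}}\|_F$. Next, we bound $\|\tilde{\mathcal{Y}}\|_F^2$ (i.e. $\sum_{f,g} \|\tilde{y}^{(f,g)}\|_2^2$) assuming $\|\tilde{\mathcal{X}}\|_F^2 = 1$ (i.e. $\sum_{f,g} \|\tilde{x}^{(f,g)}\|_2^2 = 1$).
$$\|\tilde{\mathcal{Y}}\|_F^2 = \sum_{f,g} \left\|\tilde{y}^{(f,g)}\right\|_2^2 \tag{25}$$
$$\le HW \sum_{f,g} \left\|\tilde{M}^{(f,g)}\right\|_2^2 \left\|\tilde{x}^{(f,g)}\right\|_2^2 \tag{26}$$
$$\le HW \max_{f,g} \left\|\tilde{M}^{(f,g)}\right\|_2^2 \sum_{f,g} \left\|\tilde{x}^{(f,g)}\right\|_2^2 \tag{27}$$
$$= HW \max_{f,g} \left\|\tilde{M}^{(f,g)}\right\|_2^2 \tag{28}$$
$$\|\tilde{\mathcal{Y}}\|_F \le \sqrt{HW} \max_{f,g} \left\|\tilde{M}^{(f,g)}\right\|_2 \tag{29}$$
We complete the proof by observing all inequalities can achieve equality simultaneously.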
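Eq. (24) can be sanity-checked by comparing the largest singular value of the explicit linear map $\mathcal{X} \mapsto \mathcal{Y}$ with the Fourier-domain formula. The sketch assumes circular convolution with a kernel zero-padded to $H \times W$, as in the check of Eq. (22); all function names are hypothetical, and the dense construction is only feasible for very small sizes.

```python
import numpy as np


def circular_conv2d(M, X):
    """Y[i, j, t] = sum_{s, a, b} M[a, b, t, s] * X[i - a, j - b, s], indices modulo H and W."""
    H, W, T, S = M.shape
    Y = np.zeros((H, W, T))
    for a in range(H):
        for b in range(W):
            Y += np.roll(X, shift=(a, b), axis=(0, 1)) @ M[a, b].T
    return Y


def operator_norm_dense(M):
    """Largest singular value of the explicit matrix of the map X -> Y."""
    H, W, T, S = M.shape
    columns = []
    for k in range(H * W * S):
        e = np.zeros(H * W * S)
        e[k] = 1.0  # k-th standard basis input
        columns.append(circular_conv2d(M, e.reshape(H, W, S)).ravel())
    return np.linalg.norm(np.stack(columns, axis=1), 2)


def operator_norm_fourier(M):
    """sqrt(HW) times the largest per-frequency spectral norm, as in Eq. (24)."""
    H, W, _, _ = M.shape
    M_tilde = np.fft.fftn(M, axes=(0, 1), norm="ortho")
    per_frequency = np.linalg.norm(M_tilde, ord=2, axis=(2, 3))
    return np.sqrt(H * W) * per_frequency.max()


rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4, 2, 3))
dense, fourier = operator_norm_dense(M), operator_norm_fourier(M)
print(dense, fourier)
```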
Definition B.9. (Tensor product) For vectors $a \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^p$, their tensor product $a \otimes b \otimes c$ is a 3-way tensor in $\mathbb{R}^{n \times m \times p}$, with the $(i, j, k)$th entry being $a_i b_j c_k$. Similarly, for a matrix $A \in \mathbb{R}^{n \times m}$ and a vector $c \in \mathbb{R}^p$, their tensor product $A \otimes c$ is an $n \times m \times p$ tensor with the $(i, j, k)$th entry being $A_{ij} c_k$.
Definition B.10. (Kronecker product). Let $A$ be an $n \times p$ matrix and $B$ an $m \times q$ matrix. The $mn \times pq$ matrix
$$A \otimes B = \begin{pmatrix} a_{1,1} B & a_{1,2} B & \cdots & a_{1,p} B \\ a_{2,1} B & a_{2,2} B & \cdots & a_{2,p} B \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} B & a_{n,2} B & \cdots & a_{n,p} B \end{pmatrix}$$
is called the Kronecker product of $A$ and $B$. The outer product is an instance of Kronecker products.
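A small NumPy example of Definition B.10, including the remark that the outer product is a special case (a column vector Kronecker-multiplied with a row vector). The concrete matrices are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])          # n x p = 2 x 2
B = np.array([[0, 1, 2]])       # m x q = 1 x 3
K = np.kron(A, B)               # block matrix [a_ij * B], shape (n*m, p*q)

a = np.array([1, 2, 3])
b = np.array([4, 5])
outer_as_kron = np.kron(a.reshape(-1, 1), b.reshape(1, -1))  # column (x) row

print(K)
print(np.array_equal(outer_as_kron, np.outer(a, b)))
```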
In this section, we will introduce three types of neural network layers, whose parameters are factorized in CP
format as in Eq. (9a) (with small variations). For brevity, we omit the layer superscript and denote the input,
layer parameters and output as X , M and Y, and we use Y = M (X ) to denote the relations between X , M
and Y.
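Definition C.1. (CP 2D-convolutional layer) For a given 2D-convolutional layer in Eq. (17), a CP decomposition factorizes the weights tensor $\mathcal{M} \in \mathbb{R}^{k_x \times k_y \times T \times S}$ into three core factors $\mathcal{C} \in \mathbb{R}^{R \times k_x \times k_y}$, $U \in \mathbb{R}^{R \times T}$, $V \in \mathbb{R}^{R \times S}$ and a vector of CP eigenvalues $\lambda \in \mathbb{R}^{R}$ such that
$$\mathcal{M} = \sum_{r=1}^{R} \lambda_r \, C_r \otimes u_r \otimes v_r \tag{30}$$
$$\mathcal{M}_{i,j,t,s} = \sum_{r=1}^{R} \lambda_r \, \mathcal{C}_{r,i,j} \, U_{r,t} \, V_{r,s} \tag{31}$$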
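Proof. (of Lemma C.2) From Fact B.3, the operator norm of $\mathcal{M}$ is equal to the one of its MDFT $\tilde{\mathcal{M}} = \mathcal{F}_4^{1,2}(\mathcal{M})$, i.e. $\|\mathcal{M}\| = \|\tilde{\mathcal{M}}\|$. According to Lemma B.8, it is sufficient to compute the spectral norm for each matrix $\tilde{M}^{(f,g)}$ individually. Notice that if $\mathcal{M}$ takes a CP format, each $\tilde{M}^{(f,g)}$ has a decomposed form as follows
$$\tilde{M}^{(f,g)} = \sum_{r=1}^{R} \lambda_r \, \tilde{C}_r^{(f,g)} u_r v_r^{\top} \tag{33a}$$
$$\tilde{M}^{(f,g)}_{t,s} = \sum_{r=1}^{R} \lambda_r \, \tilde{C}_r^{(f,g)} U_{r,t} V_{r,s} \tag{33b}$$
where $\tilde{\mathcal{C}} = \mathcal{F}_3^{2,3}(\mathcal{C})$ and $\tilde{C}_r^{(f,g)} = \tilde{\mathcal{C}}_{r,f,g}$. The rest of the proof follows the definition of spectral norm of $\tilde{M}^{(f,g)}$, i.e. $\|\tilde{M}^{(f,g)}\|_2 = \max_{\|a\|=1} \|\tilde{M}^{(f,g)} a\|$. Let $b = \tilde{M}^{(f,g)} a$, we can bound the $\ell_2$ norm of $b$:
$$\|b\|_2 = \left\|\tilde{M}^{(f,g)} a\right\|_2 = \left\|\sum_{r=1}^{R} \lambda_r \tilde{C}_r^{(f,g)} u_r v_r^{\top} a\right\|_2 \tag{34}$$
$$\le \sum_{r=1}^{R} \left|\lambda_r \tilde{C}_r^{(f,g)} \left(v_r^{\top} a\right)\right| \|u_r\|_2 \tag{35}$$
$$= \sum_{r=1}^{R} \left|\lambda_r \tilde{C}_r^{(f,g)} \left(v_r^{\top} a\right)\right| \tag{36}$$
$$\le \sum_{r=1}^{R} |\lambda_r| \left|\tilde{C}_r^{(f,g)}\right| \tag{37}$$
Therefore, $\|\mathcal{M}\| = \|\tilde{\mathcal{M}}\| = \sqrt{HW} \max_{f,g} \|\tilde{M}^{(f,g)}\| \le \sqrt{HW} \sum_{r=1}^{R} |\lambda_r| \max_{f,g} \left|\tilde{C}_r^{(f,g)}\right|$.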
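The final bound of this proof can be checked on a random CP-factorized kernel. The sketch is illustrative and rests on assumptions: the kernel factor `C` is taken at the full spatial size $H \times W$ (i.e. already zero-padded), the vectors $u_r$ and $v_r$ are normalized to unit $\ell_2$ norm as the proof requires, and the operator norm is computed through the Fourier-domain formula of Lemma B.8.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, T, S, R = 4, 4, 3, 5, 6

lam = np.sort(rng.uniform(0.1, 1.0, R))[::-1]       # positive, decreasing CP spectrum
C = rng.standard_normal((R, H, W))                  # spatial factor, assumed padded to H x W
U = rng.standard_normal((R, T))
V = rng.standard_normal((R, S))
U /= np.linalg.norm(U, axis=1, keepdims=True)       # unit-norm u_r
V /= np.linalg.norm(V, axis=1, keepdims=True)       # unit-norm v_r

# Full weight tensor from Eq. (31).
M = np.einsum("r,rij,rt,rs->ijts", lam, C, U, V)

# Exact operator norm via per-frequency spectral norms (Lemma B.8).
M_tilde = np.fft.fftn(M, axes=(0, 1), norm="ortho")
op_norm = np.sqrt(H * W) * np.linalg.norm(M_tilde, ord=2, axis=(2, 3)).max()

# Upper bound that only needs the spectrum and the transformed spatial factor.
C_tilde = np.fft.fftn(C, axes=(1, 2), norm="ortho")
bound = np.sqrt(H * W) * np.sum(np.abs(lam) * np.abs(C_tilde).max(axis=(1, 2)))

print(op_norm, bound)
```

The bound avoids any singular value computation, which is the point made in the remarks after Lemma B.8.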
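Definition C.3. (Higher-order fully-connected layer) The layer is parameterized by a $2m$th-order tensor $\mathcal{M} \in \mathbb{R}^{T_1 \times \cdots \times T_m \times S_1 \times \cdots \times S_m}$. It maps an $m$th-order input tensor $\mathcal{X} \in \mathbb{R}^{S_1 \times \cdots \times S_m}$ to another $m$th-order output tensor $\mathcal{Y} \in \mathbb{R}^{T_1 \times \cdots \times T_m}$ with the following equation:
$$\mathcal{Y}_{t_1,\dots,t_m} = \sum_{s_1=1}^{S_1} \cdots \sum_{s_m=1}^{S_m} \mathcal{M}_{t_1,\dots,t_m,s_1,\dots,s_m} \, \mathcal{X}_{s_1,\dots,s_m} \tag{38}$$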
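Definition C.4. (Higher-order CP fully-connected layer) Given a higher-order fully-connected layer in Eq. (38), a CP decomposition factorizes the weights tensor $\mathcal{M} \in \mathbb{R}^{T_1 \times \cdots \times T_m \times S_1 \times \cdots \times S_m}$ into $m$ core factors $\mathcal{K}^l \in \mathbb{R}^{R \times T_l \times S_l}$, $l \in [m]$.
$$\mathcal{M}_{t_1,\dots,t_m,s_1,\dots,s_m} = \sum_{r=1}^{R} \lambda_r \, \mathcal{K}^1_{r,t_1,s_1} \cdots \mathcal{K}^m_{r,t_m,s_m} \tag{39}$$
$$\|\mathcal{M}\| \le \sum_{r=1}^{R} |\lambda_r| \tag{40}$$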
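Proof. (of Lemma C.5) The proof follows directly the definition of operator norm $\|\mathcal{M}\| = \max_{\|\mathcal{X}\|_F = 1} \|\mathcal{Y}\|_F$.
$$\|\mathcal{Y}\|_F \le \sum_{r=1}^{R} |\lambda_r| \left\|K^m_r\right\|_2 \cdots \left\|K^1_r\right\|_2 \|\mathcal{X}\|_F \tag{41}$$
$$\le \sum_{r=1}^{R} |\lambda_r| \left\|K^m_r\right\|_F \cdots \left\|K^1_r\right\|_F \|\mathcal{X}\|_F \tag{42}$$
$$= \sum_{r=1}^{R} |\lambda_r| \|\mathcal{X}\|_F = \sum_{r=1}^{R} |\lambda_r| \tag{43}$$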
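The bound $\|\mathcal{M}\| \le \sum_r |\lambda_r|$ can be checked numerically for $m = 2$. The step from Eq. (42) to Eq. (43) suggests that each slice $K^l_r$ has unit Frobenius norm; the sketch assumes exactly that, since the lemma statement itself is not present in the extracted text. The layer is flattened into a $(T_1 T_2) \times (S_1 S_2)$ matrix to compute its operator norm.

```python
import numpy as np

rng = np.random.default_rng(0)
R, T1, T2, S1, S2 = 4, 2, 3, 3, 2

lam = rng.uniform(0.1, 1.0, R)
K1 = rng.standard_normal((R, T1, S1))
K2 = rng.standard_normal((R, T2, S2))
# Assumed normalization: every slice K^l_r has unit Frobenius norm.
K1 /= np.linalg.norm(K1, axis=(1, 2), keepdims=True)
K2 /= np.linalg.norm(K2, axis=(1, 2), keepdims=True)

# Weight tensor of Eq. (39) for m = 2, with index order (t1, t2, s1, s2).
M = np.einsum("r,rac,rbd->abcd", lam, K1, K2)

# Flatten outputs and inputs so the layer of Eq. (38) becomes a matrix-vector product.
M_matrix = M.reshape(T1 * T2, S1 * S2)
op_norm = np.linalg.norm(M_matrix, 2)
bound = np.abs(lam).sum()

print(op_norm, bound)
```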
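Definition C.6. (Higher-order 2D-convolutional layer) The layer is parameterized by a $(2m + 2)$th-order tensor $\mathcal{M} \in \mathbb{R}^{k \times k \times T_1 \times \cdots \times T_m \times S_1 \times \cdots \times S_m}$. It maps an $(m + 2)$th-order input tensor $\mathcal{X} \in \mathbb{R}^{H \times W \times S_1 \times \cdots \times S_m}$ to another $(m + 2)$th-order output tensor $\mathcal{Y} \in \mathbb{R}^{H \times W \times T_1 \times \cdots \times T_m}$ as:
$$\mathcal{Y}_{:,:,t_1,\dots,t_m} = \sum_{s_1=1}^{S_1} \cdots \sum_{s_m=1}^{S_m} \mathcal{M}_{:,:,t_1,\dots,t_m,s_1,\dots,s_m} * \mathcal{X}_{:,:,s_1,\dots,s_m} \tag{44a}$$
$$\mathcal{Y}_{i,j,t_1,\dots,t_m} = \sum_{s_1=1}^{S_1} \cdots \sum_{s_m=1}^{S_m} \sum_{p,q} \mathcal{M}_{i-p,j-q,t_1,\dots,t_m,s_1,\dots,s_m} \, \mathcal{X}_{p,q,s_1,\dots,s_m} \tag{44b}$$
$$\mathcal{M}_{i,j,t_1,\dots,t_m,s_1,\dots,s_m} = \sum_{r=1}^{R} \lambda_r \, \mathcal{C}_{r,i,j} \, \mathcal{K}^1_{r,t_1,s_1} \cdots \mathcal{K}^m_{r,t_m,s_m} \tag{45}$$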
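Proof. (of Lemma C.8) The proof is a combination of Lemmas C.2 and C.5. Let $\tilde{\mathcal{M}} = \mathcal{F}_{2m+2}^{1,2}(\mathcal{M})$, we have
$$\|\mathcal{M}\| = \|\tilde{\mathcal{M}}\| = \sqrt{HW} \max_{f,g} \left\|\tilde{\mathcal{M}}^{(f,g)}\right\| \tag{47}$$
$$\tilde{\mathcal{M}}^{(f,g)}_{t_1,\dots,t_m,s_1,\dots,s_m} = \sum_{r=1}^{R} \lambda_r \, \tilde{C}_r^{(f,g)} \, \mathcal{K}^1_{r,t_1,s_1} \cdots \mathcal{K}^m_{r,t_m,s_m} \tag{48}$$
The operator norm is bounded using Lemma C.5: $\|\tilde{\mathcal{M}}^{(f,g)}\| \le \sum_{r=1}^{R} |\lambda_r| \max_{f,g} \left|\tilde{C}_r^{(f,g)}\right|$.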
$$t_j^{(k)} := \sum_{r=1}^{j} \lambda_r^{(k)} \max_{f,g} \left|\tilde{C}_r^{(f,g)}\right| \tag{49}$$
where $\lambda_r^{(k)}$ is the $r$th largest value in the CP spectrum of $\mathcal{M}^{(k)}$.
Definition D.2. [tensor noise bound $\{\xi_j^{(k)}\}_{j=1}^{R^{(k)}}$] The tensor noise bound $\xi_j^{(k)}$ of the $k$th layer measures the amplitudes of the remaining components after pruning the ones with amplitudes smaller than $\lambda_j^{(k)}$:
$$\xi_j^{(k)} := \sum_{r=j+1}^{R^{(k)}} \lambda_r^{(k)} \max_{f,g} \left|\tilde{C}_r^{(f,g)}\right| \tag{50}$$
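Both quantities in Eq. (49) and Eq. (50) are partial sums over the same per-component weights $\lambda_r \max_{f,g} |\tilde{C}_r^{(f,g)}|$, so they can be computed together with a cumulative sum. The sketch below is an illustration under assumptions: the spatial factor `C` has shape `(R, H, W)` (kernel zero-padded to the feature-map size), NumPy's `norm="ortho"` FFT stands in for the unitary MDFT, and the function name is hypothetical.

```python
import numpy as np


def tensorization_and_noise(lam, C):
    """Return (t, xi) where t[j-1] = t_j and xi[j-1] = xi_j for j = 1..R."""
    lam = np.asarray(lam, dtype=float)
    order = np.argsort(-lam)                       # sort the CP spectrum in decreasing order
    lam, C = lam[order], np.asarray(C)[order]
    C_tilde = np.fft.fftn(C, axes=(1, 2), norm="ortho")
    weights = lam * np.abs(C_tilde).max(axis=(1, 2))
    t = np.cumsum(weights)                         # first j components, Eq. (49)
    xi = weights.sum() - t                         # remaining components j+1..R, Eq. (50)
    return t, xi


rng = np.random.default_rng(0)
lam = np.array([1.0, 3.0, 2.0])
C = rng.standard_normal((3, 4, 4))
t, xi = tensorization_and_noise(lam, C)
print(t, xi)
```

Keeping more components increases $t_j$ and decreases $\xi_j$, and their sum stays constant, which is the trade-off the compression bound exploits.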
Definition D.3. [layer cushion $\zeta^{(k)}$] As introduced in Arora et al. (2018), the layer cushion of the $k$th layer is defined to be the largest value $\zeta^{(k)}$ such that for any $\mathcal{X}^{(k)} \in S$,
$$\zeta^{(k)} \frac{\left\|\mathcal{M}^{(k)}\right\|_F}{\sqrt{H^{(k)} W^{(k)}}} \left\|\mathcal{X}^{(k)}\right\|_F \le \left\|\mathcal{M}^{(k+1)}\right\|_F \tag{51}$$
Following Arora et al. (2018), the layer cushion considers how much smaller the output $\left\|\mathcal{M}^{(k+1)}\right\|_F$ of the $k$th layer (after activation) is compared with the product between the weight tensor $\left\|\mathcal{M}^{(k)}\right\|_F$ and the input $\left\|\mathcal{X}^{(k)}\right\|_F$.
Note that $H^{(k)}$ and $W^{(k)}$ are constants and will not influence the results of the theorem and the lemmas. For simplicity, we use $H$ and $W$ to denote the maximum $H^{(k)}$ and $W^{(k)}$ over the $n$ layers for the following proofs where upper bounds are desired.
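Given these definitions, we can bound the difference of outputs from a given model and its compressed counterpart. The following lemma characterizes the relation between the difference and the factors $t_j^{(k)}$, $\xi_j^{(k)}$, $\zeta^{(k)}$.
Lemma D.4. (Compression bound of convolutional neural networks) Suppose a convolutional neural network $\mathbb{M}$ has $n$ layers, and each convolutional layer takes a CP format as in Eq. (31) with rank $R^{(k)}$. If an algorithm generates a compressed network $\hat{\mathbb{M}}$ such that only $\hat{R}^{(k)}$ components with largest $\lambda_r^{(k)}$'s are retained at the $k$th layer, the difference of their outputs at the $m$th layer is bounded by $\mathcal{X}^{(m+1)}$ as
$$\left\|\mathcal{X}^{(m)} - \hat{\mathcal{X}}^{(m)}\right\|_F \le \left(\sum_{k=1}^{m-1} \frac{\xi^{(k)}}{\zeta^{(k)} \left\|\mathcal{M}^{(k)}\right\|_F} \prod_{l=k+1}^{m-1} \frac{t^{(l)}}{\zeta^{(l)} \left\|\mathcal{M}^{(l)}\right\|_F}\right) \left\|\mathcal{X}^{(m)}\right\|_F \tag{52}$$
Therefore for the whole network with $n$ layers, the difference between $\mathbb{M}(\mathcal{X})$ and $\hat{\mathbb{M}}(\mathcal{X})$ is bounded by
$$\left\|\mathbb{M}(\mathcal{X}) - \hat{\mathbb{M}}(\mathcal{X})\right\|_F \le \left(\sum_{k=1}^{n} \frac{\xi^{(k)}}{\zeta^{(k)} \left\|\mathcal{M}^{(k)}\right\|_F} \prod_{l=k+1}^{n} \frac{t^{(l)}}{\zeta^{(l)} \left\|\mathcal{M}^{(l)}\right\|_F}\right) \left\|\mathbb{M}(\mathcal{X})\right\|_F \tag{53}$$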
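Proof. (of Lemma D.4) We prove this lemma by induction. For $m = 2$, the lemma holds since
$$\left\|\mathcal{X}^{(2)} - \hat{\mathcal{X}}^{(2)}\right\|_F = \left\|\mathrm{ReLU}\left(\mathcal{Y}^{(1)}\right) - \mathrm{ReLU}\left(\hat{\mathcal{Y}}^{(1)}\right)\right\|_F \tag{54}$$
$$\le \left\|\mathcal{Y}^{(1)} - \hat{\mathcal{Y}}^{(1)}\right\|_F = \left\|\left(\mathcal{M}^{(1)} - \hat{\mathcal{M}}^{(1)}\right) \mathcal{X}^{(1)}\right\|_F \tag{55}$$
$$\le \sqrt{HW} \, \xi^{(1)} \left\|\mathcal{X}^{(1)}\right\|_F \le \frac{\xi^{(1)}}{\zeta^{(1)} \left\|\mathcal{M}^{(1)}\right\|_F} \left\|\mathcal{M}^{(2)}\right\|_F \tag{56}$$
where $\mathcal{Y} = \mathcal{M}(\mathcal{X})$ denotes the computation of a convolutional layer. (1) The first inequality follows the Lipschitzness of the ReLU activations; (2) the second inequality uses Lemma C.2; and (3) the last inequality holds by the definition of $\zeta^{(1)}$. For $m + 1 > 2$, we assume the lemma already holds for $m$:
$$\left\|\mathcal{X}^{(m+1)} - \hat{\mathcal{X}}^{(m+1)}\right\|_F = \left\|\mathrm{ReLU}\left(\mathcal{Y}^{(m)}\right) - \mathrm{ReLU}\left(\hat{\mathcal{Y}}^{(m)}\right)\right\|_F \tag{57}$$
$$\le \left\|\mathcal{Y}^{(m)} - \hat{\mathcal{Y}}^{(m)}\right\|_F = \left\|\mathcal{M}^{(m)} \mathcal{X}^{(m)} - \hat{\mathcal{M}}^{(m)} \hat{\mathcal{X}}^{(m)}\right\|_F \tag{58}$$
$$= \left\|\hat{\mathcal{M}}^{(m)} \left(\mathcal{X}^{(m)} - \hat{\mathcal{X}}^{(m)}\right) + \left(\mathcal{M}^{(m)} - \hat{\mathcal{M}}^{(m)}\right) \mathcal{X}^{(m)}\right\|_F \tag{59}$$
$$\le \sqrt{HW} \left(t^{(m)} \left\|\mathcal{X}^{(m)} - \hat{\mathcal{X}}^{(m)}\right\|_F + \xi^{(m)} \left\|\mathcal{X}^{(m)}\right\|_F\right) \tag{60}$$
$$\le t^{(m)} \left(\sum_{k=1}^{m-1} \frac{\xi^{(k)}}{\zeta^{(k)} \left\|\mathcal{M}^{(k)}\right\|_F} \prod_{l=k+1}^{m-1} \frac{t^{(l)}}{\zeta^{(l)} \left\|\mathcal{M}^{(l)}\right\|_F}\right) \left\|\mathcal{X}^{(m)}\right\|_F + \frac{\xi^{(m)}}{\zeta^{(m)} \left\|\mathcal{M}^{(m)}\right\|_F} \left\|\mathcal{X}^{(m)}\right\|_F \tag{61}$$
$$\le \left(\sum_{k=1}^{m} \frac{\xi^{(k)}}{\zeta^{(k)} \left\|\mathcal{M}^{(k)}\right\|_F} \prod_{l=k+1}^{m} \frac{t^{(l)}}{\zeta^{(l)} \left\|\mathcal{M}^{(l)}\right\|_F}\right) \left\|\mathcal{M}^{(m+1)}\right\|_F \tag{62}$$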
Remark. Equation (64) differs slightly from Equation (5), as the margin $\gamma$ is replaced by a perturbation error $\epsilon$. Therefore, how well the compressed tensorial neural network can approximate the original network is related to the choice of $\hat{R}^{(k)}$. Notice that when $\hat{R}^{(k)} = R^{(k)}$, the inequality for the $k$th layer will be automatically satisfied as $\theta^{(k)} = 0$ in this case by definition.
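Lemma D.6. For any convolutional neural network $\mathbb{M}$ of $n$ layers satisfying the assumptions in section 3 and any margin $\gamma \ge 0$, $\mathbb{M}$ can be compressed to a tensorial convolutional neural network $\hat{\mathbb{M}}$ with $\sum_{k=1}^{n} \hat{R}^{(k)} \left(s^{(k)} + t^{(k)} + k_x^{(k)} \times k_y^{(k)} + 1\right)$ total parameters such that for any $\mathcal{X} \in S$, $\hat{L}_0(\hat{\mathbb{M}}) \le \hat{L}_\gamma(\mathbb{M})$. Here, for each layer $k$,
$$\hat{R}^{(k)} = \min \left\{ j \in [R^{(k)}] \;\middle|\; \xi_j^{(k)} \prod_{i=k+1}^{n} t_j^{(i)} \le \frac{\epsilon}{n} \prod_{i=k}^{n} \zeta^{(i)} \left\|\mathcal{M}^{(i)}\right\|_F \right\} \tag{68}$$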
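$$\left|\mathbb{M}(\mathcal{X})[y] - \max_{j \ne y} \mathbb{M}(\mathcal{X})[j]\right|^2 \le \left(\left|\mathbb{M}(\mathcal{X})[y]\right| + \left|\max_{j \ne y} \mathbb{M}(\mathcal{X})[j]\right|\right)^2 \le 4 \max_{\mathcal{X} \in S} \left\|\mathbb{M}(\mathcal{X})\right\|_F^2 \le \gamma^2$$
Then the output margin of $\mathbb{M}$ cannot be greater than $\gamma$ for any $\mathcal{X} \in S$. Thus $\hat{L}_\gamma(\mathbb{M}) = 1$.
If $\gamma < 2 \max_{\mathcal{X} \in S} \left\|\mathbb{M}(\mathcal{X})\right\|_F$, setting
$$\epsilon = \frac{\gamma}{2 \max_{\mathcal{X} \in S} \left\|\mathbb{M}(\mathcal{X})\right\|_F}$$
in Lemma D.5, we obtain a compressed fully-connected tensorial neural network $\hat{\mathbb{M}}$ with the desired number of parameters and
$$\left\|\mathbb{M}(\mathcal{X}) - \hat{\mathbb{M}}(\mathcal{X})\right\|_F < \frac{\gamma}{2} \;\Rightarrow\; \forall j, \; \left|\mathbb{M}(\mathcal{X})[j] - \hat{\mathbb{M}}(\mathcal{X})[j]\right| < \frac{\gamma}{2}$$
Then for any pair $(\mathcal{X}, y) \in S$, if $\mathbb{M}(\mathcal{X})[y] > \gamma + \max_{j \ne y} \mathbb{M}(\mathcal{X})[j]$, $\hat{\mathbb{M}}$ classifies $\mathcal{X}$ correctly as well because:
$$\hat{\mathbb{M}}(\mathcal{X})[y] > \mathbb{M}(\mathcal{X})[y] - \frac{\gamma}{2} > \max_{j \ne y} \mathbb{M}(\mathcal{X})[j] + \frac{\gamma}{2} > \max_{j \ne y} \hat{\mathbb{M}}(\mathcal{X})[j]$$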
Now we prove the main theorem 4.5 by bounding the covering number given any $\epsilon$.
Proof. (of Theorem 4.5) To be more specific, let us bound the covering number of the compressed network $\hat{\mathbb{M}}$ by approximating each parameter with accuracy $\mu$.
Lemma D.7. For any given constant accuracy $\mu$, the covering number of the compressed convolutional network $\hat{\mathbb{M}}$ is of order $\tilde{O}(d)$ where $d$ denotes the total number of parameters in $\hat{\mathbb{M}}$: $d := \sum_{k=1}^{n} \hat{R}^{(k)} \left(s^{(k)} + o^{(k)} + k_x^{(k)} \times k_y^{(k)} + 1\right)$.
Let $\tilde{M}$ denote the network after approximating each parameter in $\hat{M}$ with accuracy $\mu$ (and $\tilde{M}^{(k)}$ denote its weight
tensor on the $k$-th layer). Based on the given accuracy, we know that $\forall k$, $|\hat{\lambda}_r^{(k)} - \tilde{\lambda}_r^{(k)}| \le \mu$, $\|\hat{a}_r^{(k)} - \tilde{a}_r^{(k)}\| \le \sqrt{s^{(k)}}\,\mu$,
$\|\hat{b}_r^{(k)} - \tilde{b}_r^{(k)}\| \le \sqrt{o^{(k)}}\,\mu$, $\|\hat{C}_r^{(k)} - \tilde{C}_r^{(k)}\| \le \sqrt{k_x^{(k)} k_y^{(k)}}\,\mu$, where $s$, $o$, $k_x$ and $k_y$ are the number of input channels,
the number of output channels, the height of the kernel and the width of the kernel, as defined in Section 3. For
simplicity, in this proof, let us just use $X^{(k)}, Y^{(k)}, a_r^{(k)}, b_r^{(k)}, C_r^{(k)}$ to denote $\hat{X}^{(k)}, \hat{Y}^{(k)}, \hat{a}_r^{(k)}, \hat{b}_r^{(k)}, \hat{C}_r^{(k)}$, with $X^{(k)} \in \mathbb{R}^{H \times W \times s^{(k)}}$, $Y^{(k)} \in \mathbb{R}^{H \times W \times o^{(k)}}$.
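We have
$$\mathcal{F}_3^{1,2}(Y^{(k)})_{fgj} = \sqrt{HW} \sum_i \Big[ \mathcal{F}_3^{1,2}(X^{(k)})_{fgi} \sum_{r=1}^{\hat{R}^{(k)}} \lambda_r^{(k)} a_{ri}^{(k)} b_{rj}^{(k)} \mathcal{F}_2(C_r^{(k)})_{fg} \Big]$$
$$\mathcal{F}_3^{1,2}(\tilde{Y}^{(k)})_{fgj} = \sqrt{HW} \sum_i \Big[ \mathcal{F}_3^{1,2}(\tilde{X}^{(k)})_{fgi} \sum_{r=1}^{\hat{R}^{(k)}} \tilde{\lambda}_r^{(k)} \tilde{a}_{ri}^{(k)} \tilde{b}_{rj}^{(k)} \mathcal{F}_2(\tilde{C}_r^{(k)})_{fg} \Big]$$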
Understanding Generalization in Deep Learning via Tensor Methods
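where $\sqrt{HW}$ is a normalization factor defined in Lemma B.7.
Let $\epsilon^{(k)} = \|\tilde{Y}^{(k)} - Y^{(k)}\|_F$. Then for each $k$, let
$$\varphi = \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}^{(k)}} \lambda_r^{(k)} a_{ri}^{(k)} b_{rj}^{(k)} \big( \mathcal{F}_2(C_r^{(k)})_{fg} - \mathcal{F}_2(\tilde{C}_r^{(k)})_{fg} \big) \Big)^2$$
and
$$\psi = \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}^{(k)}} \big( \lambda_r^{(k)} a_{ri}^{(k)} b_{rj}^{(k)} - \tilde{\lambda}_r^{(k)} \tilde{a}_{ri}^{(k)} \tilde{b}_{rj}^{(k)} \big) \mathcal{F}_2(\tilde{C}_r^{(k)})_{fg} \Big)^2.$$
We first bound $\varphi$ and $\psi$ as follows.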
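Bound $\varphi$: All calculations are based on the $k$-th layer, so we remove the layer number $(k)$ for ease of reading. So $a = a^{(k)}$ (the same for $b$, $c$, and $R$). Then
$$\begin{aligned}
\varphi &= \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}} \lambda_r a_{ri} b_{rj} \big( \mathcal{F}_2(C_r)_{fg} - \mathcal{F}_2(\tilde{C}_r)_{fg} \big) \Big)^2 \\
&\le \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}} (\lambda_r a_{ri} b_{rj})^2 \Big) \Big( \sum_{r}^{\hat{R}} \big( \mathcal{F}_2(C_r)_{fg} - \mathcal{F}_2(\tilde{C}_r)_{fg} \big)^2 \Big) \\
&\le \Big( \sum_{r}^{\hat{R}} \sum_i \sum_j \lambda_r^2 a_{ri}^2 b_{rj}^2 \Big) \Big( \sum_{r}^{\hat{R}} \sum_{f,g} \big( \mathcal{F}_2(C_r)_{fg} - \mathcal{F}_2(\tilde{C}_r)_{fg} \big)^2 \Big) \\
&\le \sum_{r}^{\hat{R}} \lambda_r^2 \, \hat{R} k_x k_y \mu^2
\end{aligned}$$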
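Bound $\psi$: Similarly, we remove the layer number $(k)$ for ease of reading. Then we have
$$\begin{aligned}
\psi &= \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}} \big( \lambda_r a_{ri} b_{rj} - \tilde{\lambda}_r \tilde{a}_{ri} \tilde{b}_{rj} \big) \mathcal{F}_2(\tilde{C}_r)_{fg} \Big)^2 \\
&\le \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}} ( \lambda_r a_{ri} b_{rj} - \tilde{\lambda}_r \tilde{a}_{ri} \tilde{b}_{rj} )^2 \Big) \Big( \sum_{r}^{\hat{R}} \mathcal{F}_2(\tilde{C}_r)_{fg}^2 \Big) \\
&= \sum_{f,g,i,j} \Big( \sum_{r}^{\hat{R}} \big( \lambda_r ( a_{ri} b_{rj} - \tilde{a}_{ri} \tilde{b}_{rj} ) + ( \lambda_r - \tilde{\lambda}_r ) \tilde{a}_{ri} \tilde{b}_{rj} \big)^2 \Big) \Big( \sum_{r}^{\hat{R}} \mathcal{F}_2(\tilde{C}_r)_{fg}^2 \Big) \\
&\le \sum_{f,g,i,j} \Big( 2 \sum_{r}^{\hat{R}} \lambda_r^2 ( a_{ri} b_{rj} - \tilde{a}_{ri} \tilde{b}_{rj} )^2 + 2 \sum_{r}^{\hat{R}} ( \lambda_r - \tilde{\lambda}_r )^2 \tilde{a}_{ri}^2 \tilde{b}_{rj}^2 \Big) \Big( \sum_{r}^{\hat{R}} \mathcal{F}_2(\tilde{C}_r)_{fg}^2 \Big) \\
&= \sum_{f,g,i,j} \Big( 2 \sum_{r}^{\hat{R}} \lambda_r^2 \big( a_{ri} ( b_{rj} - \tilde{b}_{rj} ) + ( a_{ri} - \tilde{a}_{ri} ) \tilde{b}_{rj} \big)^2 + 2 \sum_{r}^{\hat{R}} ( \lambda_r - \tilde{\lambda}_r )^2 \tilde{a}_{ri}^2 \tilde{b}_{rj}^2 \Big) \Big( \sum_{r}^{\hat{R}} \mathcal{F}_2(\tilde{C}_r)_{fg}^2 \Big) \\
&\le \sum_{f,g,i,j} \Big( 4 \sum_{r}^{\hat{R}} \lambda_r^2 \big( a_{ri}^2 ( b_{rj} - \tilde{b}_{rj} )^2 + ( a_{ri} - \tilde{a}_{ri} )^2 \tilde{b}_{rj}^2 \big) + 2 \sum_{r}^{\hat{R}} ( \lambda_r - \tilde{\lambda}_r )^2 \tilde{a}_{ri}^2 \tilde{b}_{rj}^2 \Big) \Big( \sum_{r}^{\hat{R}} \mathcal{F}_2(\tilde{C}_r)_{fg}^2 \Big) \\
&= \Big( 4 \sum_{r}^{\hat{R}} \lambda_r^2 \Big( \sum_i a_{ri}^2 \sum_j ( b_{rj} - \tilde{b}_{rj} )^2 + \sum_i ( a_{ri} - \tilde{a}_{ri} )^2 \sum_j \tilde{b}_{rj}^2 \Big) + 2 \sum_{r}^{\hat{R}} ( \lambda_r - \tilde{\lambda}_r )^2 \sum_i \tilde{a}_{ri}^2 \sum_j \tilde{b}_{rj}^2 \Big) \Big( \sum_{r}^{\hat{R}} \sum_{f,g} \mathcal{F}_2(\tilde{C}_r)_{fg}^2 \Big) \\
&\le \Big( 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o \mu^2 + s \mu^2 ) + 2 \hat{R} \mu^2 \Big) \hat{R} \\
&= \Big( 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) \mu^2
\end{aligned}$$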
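Bound $\epsilon^{(k)} = \|\tilde{Y}^{(k)} - Y^{(k)}\|_F$: Similarly, we remove the layer number $(k)$. And we let $w_i = \mathcal{F}_3^{1,2}(X^{(k)})_{fgi}$,
$\tilde{w}_i = \mathcal{F}_3^{1,2}(\tilde{X}^{(k)})_{fgi}$, $u_i = \sum_{r}^{\hat{R}} \lambda_r^{(k)} a_{ri}^{(k)} b_{rj}^{(k)} \mathcal{F}_2(C_r^{(k)})_{fg}$ and $\tilde{u}_i = \sum_{r}^{\hat{R}} \tilde{\lambda}_r^{(k)} \tilde{a}_{ri}^{(k)} \tilde{b}_{rj}^{(k)} \mathcal{F}_2(\tilde{C}_r^{(k)})_{fg}$.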
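$$\begin{aligned}
\|\tilde{Y}^{(k)} - Y^{(k)}\|_F^2
&= \|\mathcal{F}_3^{1,2}(\tilde{Y}^{(k)}) - \mathcal{F}_3^{1,2}(Y^{(k)})\|_F^2 \\
&= \sum_{f,g,j} \Big( [\mathcal{F}_3^{1,2}(\tilde{Y}^{(k)})]_{fgj} - [\mathcal{F}_3^{1,2}(Y^{(k)})]_{fgj} \Big)^2 \\
&= HW \sum_{f,g,j} \Big( \sum_i w_i u_i - \sum_i \tilde{w}_i \tilde{u}_i \Big)^2 \\
&= HW \sum_{f,g,j} \Big( \sum_i w_i ( u_i - \tilde{u}_i ) + \sum_i ( w_i - \tilde{w}_i ) \tilde{u}_i \Big)^2 \\
&\le 2HW \sum_{f,g,j} \Big( \sum_i w_i ( u_i - \tilde{u}_i ) \Big)^2 + 2 \sum_{f,g,j} \Big( \sum_i ( w_i - \tilde{w}_i ) \tilde{u}_i \Big)^2 \\
&\le 2HW \sum_{f,g,j} \Big( \sum_i w_i^2 \Big) \sum_i ( u_i - \tilde{u}_i )^2 + 2 \sum_{f,g,j} \sum_i ( w_i - \tilde{w}_i )^2 \sum_i \tilde{u}_i^2 \\
&\le 2HW \Big( \sum_{f,g,i} w_i^2 \Big) \sum_{f,g,i,j} ( u_i - \tilde{u}_i )^2 + 2 \sum_{f,g,i} ( w_i - \tilde{w}_i )^2 \sum_{f,g,i,j} \tilde{u}_i^2 \\
&\le 2HW \Big( \sum_{f,g,i} w_i^2 \Big) ( 2\varphi + 2\psi ) + 2 \sum_{f,g,i} ( w_i - \tilde{w}_i )^2 \sum_{f,g,i,j} \tilde{u}_i^2 \\
&\le 4HW \|X^{(k)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) + 2 \sum_{f,g,i} ( w_i - \tilde{w}_i )^2 \sum_{f,g,i,j} \tilde{u}_i^2 \\
&\le 4HW \|X^{(k)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) + 2 \Big( \|X^{(k)} - \tilde{X}^{(k)}\|_F^2 \|\tilde{M}\|_F^2 \Big)
\end{aligned}$$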
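For the first layer the input is not approximated, so the last term vanishes:
$$\|\tilde{Y}^{(1)} - Y^{(1)}\|_F^2 \le 4HW \|X^{(1)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big)$$
For the remaining layers,
$$\begin{aligned}
\|\tilde{Y}^{(k)} - Y^{(k)}\|_F^2
&\le 4HW \|X^{(k)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) + 2 \Big( \|X^{(k)} - \tilde{X}^{(k)}\|_F^2 \|\tilde{M}\|_F^2 \Big) \\
&\le 4HW \|X^{(k)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) + 2 \Big( \|\mathrm{ReLU}(Y^{(k-1)}) - \mathrm{ReLU}(\tilde{Y}^{(k-1)})\|_F^2 \|\tilde{M}\|_F^2 \Big) \\
&\le 4HW \|X^{(k)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) + 2 \Big( \|Y^{(k-1)} - \tilde{Y}^{(k-1)}\|_F^2 \|\tilde{M}\|_F^2 \Big) \\
&\le 4HW \|X^{(k)}\|_F^2 \mu^2 \Big( \sum_{r}^{\hat{R}} \lambda_r^2 \hat{R} k_x k_y + 4 \sum_{r}^{\hat{R}} \lambda_r^2 ( o + s ) \hat{R} + 2 \hat{R}^2 \Big) + 2 \Big( ( \epsilon^{(k-1)} )^2 \|\tilde{M}\|_F^2 \Big)
\end{aligned}$$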
Since $\forall k \in [n]$, $\|X^{(k)}\|_F \le \dfrac{\|X^{(n+1)}\|_F}{\prod_{i=k}^{n} \zeta^{(i)} \|M^{(i)}\|_F}$, to obtain an $\epsilon$-cover of the compressed network, we can first assume
$\beta^{(k)} \ge 1$ $\forall k \in [n]$. Then $\mu$ needs to satisfy:
ǫ
µ≤ √ √
2kM̃(∗) k
q
(∗) (∗)
2 HW n
X (n+1)
F R̂(∗) ( ζ (∗) M(∗) F )n (λ(∗) )2 kx ky + 4(λ(∗) )2 (o(∗) + s(∗) ) + 2
k kF
In this section, we derive generalization bounds for fully connected (FC) neural networks (denoted as M) using
tensor methods.
Original Fully Connected Neural Network: Let M denote an n-layer fully connected network with ReLU
activations, where $A^{(k)} \in \mathbb{R}^{h^{(k)} \times h^{(k+1)}}$ denotes the weight matrix of the $k$-th layer, $x^{(k)} \in \mathbb{R}^{h^{(k)}}$ denotes the
input to the $k$-th layer, and $y^{(k)}$ denotes the output of the $k$-th layer before activation in M.
Transform original FCN to a CP-FCN: We transform the original fully connected network M to a network M with CPL. The
$k$-th layer of M, denoted by $M^{(k)} \in \mathbb{R}^{s_1^{(k)} \times s_2^{(k)} \times s_1^{(k+1)} \times s_2^{(k+1)}}$, is a 4-dimensional tensor reshaped from $A^{(k)}$, where
$s_1^{(k)} \times s_2^{(k)} = h^{(k)}$, $\forall k \in [n]$.
Input and Output of M: The original input and output vectors of M are reshaped into matrices. The input
to the $k$-th layer of M, denoted by $X^{(k)} \in \mathbb{R}^{s_1^{(k)} \times s_2^{(k)}}$, is a matrix reshaped from the input vector $x^{(k)}$ of
the $k$-th layer in the original network M. Similarly, the output of the $k$-th layer before activation in M, denoted
by $Y^{(k)} \in \mathbb{R}^{s_1^{(k)} \times s_2^{(k)}}$, is a matrix reshaped from the output vector $y^{(k)}$ of the $k$-th layer in the original network
M. For prediction purposes, we reshape the output $Y^{(n)}$ of the last layer in M back into a vector. So the final
outputs of M and M are of the same dimension.
Assumption E.1 (Polyadic Form of M). For each layer $k$, assume the weight tensor $M^{(k)}$ of M has a Polyadic
form with rank $R^{(k)} \le \min\{s_1^{(k)}, s_2^{(k)}, s_1^{(k+1)}, s_2^{(k+1)}\}$:
$$M^{(k)} = \sum_{i=1}^{R^{(k)}} \lambda_i^{(k)} \, a_i^{(k)} \otimes b_i^{(k)} \otimes c_i^{(k)} \otimes d_i^{(k)} \tag{69}$$
Compression of a FC Network with CPL: In Li and Huang (2018), a tensor decomposition algorithm (procedure 1
in Li and Huang (2018)) on tensors with asymmetric orthogonal components is guaranteed to recover the
top-$r$ components with the largest singular values. To compress M, we apply the top-$\hat{R}^{(k)}$ ($\hat{R}^{(k)} \le R^{(k)}$) CP decomposition
algorithm on each $M^{(k)}$, obtaining the components from the CP decomposition $(\hat{\lambda}_i^{(k)}, \hat{a}_i^{(k)}, \hat{b}_i^{(k)}, \hat{c}_i^{(k)}, \hat{d}_i^{(k)})$,
$i \in [\hat{R}^{(k)}]$. Therefore, we achieve a compressed network $\hat{M}$ of M, and the $k$-th layer of the compressed network
$\hat{M}$ has weight tensor as follows
$$\hat{T}^{(k)} = \sum_{i=1}^{\hat{R}^{(k)}} \hat{\lambda}_i^{(k)} \, \hat{a}_i^{(k)} \otimes \hat{b}_i^{(k)} \otimes \hat{c}_i^{(k)} \otimes \hat{d}_i^{(k)}. \tag{70}$$
As each $M^{(k)}$ has a low rank orthogonal CP decomposition by our assumption, the returned results
$\{\hat{\lambda}_i^{(k)}, \hat{a}_i^{(k)}, \hat{b}_i^{(k)}, \hat{c}_i^{(k)}, \hat{d}_i^{(k)}\}_{i=1}^{\hat{R}^{(k)}}$ from procedure 1 in Li and Huang (2018) are perfect recoveries of
$\{\lambda_i^{(k)}, a_i^{(k)}, b_i^{(k)}, c_i^{(k)}, d_i^{(k)}\}_{i=1}^{\hat{R}^{(k)}}$ according to the robustness theorem in Li and Huang (2018). Our compression
procedure is depicted in Algorithm 2.
We denote the input matrix of the k th layer in M̂ as X̂ (k) , and the output matrix before activation as Ŷ (k) . Note
that X (1) = X̂ (1) as the input data is not being modified.
Algorithm 2 is designed for general neural networks. For neural networks with CP Layer, line 3 can be done
by pruning out small components from the CP decomposition and keeping only the top-$\hat{R}^{(k)}$ components. For notational
simplicity, assume that for each layer in M, the width of the $k$-th layer is the square of some integer $s^{(k)}$. Then the
input to the $k$-th layer of M is a ReLU transformation of the output of the $(k-1)$-th layer as in equation (71). The
output of the $k$-th layer of M is illustrated in equation (72) as the weight tensor which admits a CP form as in
Now we characterize the compressibility of the fully connected network with CPL M through properties defined
in the following, namely reshaping factor, tensorization factor, layer cushion and tensor noise bound.
Definition E.2. (reshaping factor). The reshaping factor $\rho^{(k)}$ of layer $k$ is defined to be the smallest value $\rho^{(k)}$
such that for any $x \in S$,
$$\|X^{(k)}\| \le \rho^{(k)} \|X^{(k)}\|_F \tag{75}$$
The reshaping factor upper bounds the ratio between the spectral norm and Frobenius norm of the reshaped
input in the k th layer over any data example in the training dataset. Reshaping the vector examples into matrix
examples improves the compressibility of the network (i.e., renders smaller ρ(k) ) as illustrated and empirically
verified in Su et al. (2018). Note that X̂ (k) is the input to the k th layer of the compressed network M̂, and
ρ(k) ≤ 1, ∀k.
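As a rough illustration of Definition E.2, the sketch below estimates the reshaping factor as the largest spectral-to-Frobenius norm ratio over a set of reshaped inputs. It is a pure-Python sketch under stated assumptions: matrices are lists of rows, the spectral norm is estimated by power iteration, and the names `spectral_norm`, `frobenius_norm` and `reshaping_factor` are hypothetical, not from the paper.

```python
import math
import random


def frobenius_norm(X):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in X for v in row))


def spectral_norm(X, iters=100, seed=0):
    """Estimate the largest singular value by power iteration on X^T X.

    Caveat: a start vector orthogonal to the top singular direction would
    give a wrong answer; a random start makes this practically negligible.
    """
    rng = random.Random(seed)
    n_cols = len(X[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    sigma = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        u = [sum(a * b for a, b in zip(row, v)) for row in X]  # X v
        v = [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(n_cols)]  # X^T X v
        sigma = math.sqrt(math.sqrt(sum(x * x for x in v)))
        if sigma == 0.0:
            return 0.0
    return sigma


def reshaping_factor(reshaped_inputs):
    """Smallest rho with ||X|| <= rho * ||X||_F for every reshaped input X."""
    return max(spectral_norm(X) / frobenius_norm(X) for X in reshaped_inputs)


identity = [[1.0, 0.0], [0.0, 1.0]]   # energy spread over two directions
rank_one = [[1.0, 2.0], [2.0, 4.0]]   # all energy in one direction
rho_spread = reshaping_factor([identity])
rho_rank_one = reshaping_factor([rank_one])
```

A rank-one input gives a ratio of 1 (the worst case), while an input whose singular values are spread out gives a smaller ratio, which is the sense in which reshaping can help compressibility.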
Definition E.3. (tensorization factor) The tensorization factor $\{t_j^{(k)}\}_{j=1}^{R^{(k)}}$ of the $k$-th layer regarding the network
with CPL M and the original network M is defined as:
$$t_j^{(k)} = \sum_{r=1}^{j} |\lambda_r^{(k)}|, \quad \forall j. \tag{76}$$
The tensorization factor measures the amplitudes of the leading components. By Lemma C.5, the tensorization
factor is the upper bound of operator norm of the weight tensor.
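The tensorization factors of equation (76) are simply running sums of the absolute CP amplitudes. A minimal sketch, assuming the amplitudes of one layer are given as a list ordered as in the decomposition (the function name and the sample values are illustrative only):

```python
from itertools import accumulate


def tensorization_factors(lambdas):
    """Return [t_1, ..., t_R] with t_j = |lambda_1| + ... + |lambda_j|."""
    return list(accumulate(abs(lam) for lam in lambdas))


# Illustrative amplitudes of one layer.
factors = tensorization_factors([3.0, -2.0, 0.5])
```

For these sample amplitudes the result is `[3.0, 5.0, 5.5]`; the last entry is the sum of all amplitudes.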
Definition E.4. (layer cushion). Our definition of layer cushion for each layer $k$ is similar to Arora et al.
(2018). The layer cushion $\zeta^{(k)}$ of layer $k$ is defined to be the largest value $\zeta^{(k)}$ such that for any $x \in S$,
$$\zeta^{(k)} \|A^{(k)}\|_F \|x^{(k)}\| \le \|x^{(k+1)}\|.$$
The layer cushion defined in Arora et al. (2018) is slightly larger than ours since our RHS is
$\|x^{(k+1)}\| = \|\mathrm{ReLU}(A^{(k)} x^{(k)})\|$ while the RHS of the inequality in the definition of layer cushion in Arora et al.
(2018) is $\|A^{(k)} x^{(k)}\|$. The layer cushion under our settings also considers how much smaller the output $\|x^{(k+1)}\|$
is compared to the upper bound $\|A^{(k)}\|_F \|x^{(k)}\|$.
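To make Definition E.4 concrete, the sketch below computes the layer cushion of a single ReLU layer over a small set of inputs. It assumes, for illustration only, that the weight matrix is stored as a list of rows mapping an input vector to an output vector, and that inputs are nonzero; the names `relu_layer` and `layer_cushion` are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def relu_layer(A, x):
    """x^(k+1) = ReLU(A x^(k)), with A given as a list of output rows."""
    return [max(0.0, sum(a * v for a, v in zip(row, x))) for row in A]


def layer_cushion(A, inputs):
    """Largest zeta with zeta * ||A||_F * ||x|| <= ||ReLU(A x)|| for all inputs."""
    fro = math.sqrt(sum(a * a for row in A for a in row))
    return min(norm(relu_layer(A, x)) / (fro * norm(x)) for x in inputs)


A = [[1.0, 0.0], [0.0, 1.0]]
zeta_positive = layer_cushion(A, [[3.0, 4.0]])             # ReLU keeps everything
zeta_mixed = layer_cushion(A, [[3.0, 4.0], [3.0, -4.0]])   # ReLU zeroes one coordinate
```

The second call returns a smaller cushion because ReLU removes the negative coordinate of one input, which is exactly the effect the definition above accounts for.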
Definition E.5. (tensor noise bound). The tensor noise bound $\{\xi_j^{(k)}\}_{j=1}^{R^{(k)}}$ of the $k$-th layer measures the
amplitudes of the remaining components after pruning out the ones with amplitudes smaller than the $j$-th component:
$$\xi_j^{(k)} := \sum_{r=j+1}^{R^{(k)}} |\lambda_r^{(k)}| \tag{77}$$
The tensor noise bound measures the amplitudes of the CP components that are pruned out by the compression
algorithm, and the smaller it is, the more low-rank the weight matrix is. We will see that a network equipped
with CPL will be much more low-rank than standard networks.
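The tensor noise bound of equation (77) is the tail sum that complements the running sum of the tensorization factor. A minimal sketch, assuming one layer's amplitudes are given as a list (function name and values are illustrative only):

```python
def tensor_noise_bounds(lambdas):
    """Return [xi_1, ..., xi_R] with xi_j = |lambda_{j+1}| + ... + |lambda_R|."""
    bounds = []
    tail = 0.0
    # Walk from the last component backwards, accumulating the pruned mass.
    for lam in reversed(lambdas):
        bounds.append(tail)
        tail += abs(lam)
    bounds.reverse()
    return bounds


# Illustrative amplitudes of one layer.
noise = tensor_noise_bounds([3.0, -2.0, 0.5])
```

For these values the result is `[2.5, 0.5, 0.0]`: keeping all components leaves no noise, and keeping only the first leaves the mass of the other two.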
We have introduced the compression mechanism in Algorithm 2. For a fully connected network with CPL M
that is characterized by the properties such as reshaping factor, tensorization factor, layer cushion and tensor
noise bound, in section E.2, we derive the generalization error bound of a compression network with any chosen
ranks $\{\hat{R}^{(k)}\}_{k=1}^{n}$ as follows.
Theorem E.6. For any fully connected network M of $n$ layers satisfying Assumption E.1, Algorithm 2
generates a compressed network $\hat{M}$ such that with high probability over the training set, the expected error
$L_0(\hat{M})$ is bounded by
$$L_0(\hat{M}) \le \hat{L}_\gamma(M) + \tilde{O}\left( \sqrt{\frac{\sum_{k=1}^{n} \hat{R}^{(k)} (2 s^{(k)} + 2 s^{(k+1)} + 1)}{m}} \right) \tag{78}$$
for any margin $\gamma \ge 0$, and the rank of the $k$-th layer, $\hat{R}^{(k)}$, satisfies
$$\hat{R}^{(k)} = \min\left\{ j \in [R^{(k)}] \;\middle|\; n \rho^{(k)} \xi_j^{(k)} \prod_{i=k+1}^{n} t^{(i)} \le \frac{\gamma}{2 \max_{x \in S} \|M(x)\|_F} \|A^{(k)}\|_F \prod_{i=k}^{n} \zeta^{(i)} \right\}$$
where $\rho^{(k)}$, $t^{(k)}$ and $\zeta^{(k)}$ are the reshaping factor, tensorization factor and layer cushion of the $k$-th layer
in Definitions E.2, E.3, and E.4 respectively, and $\xi_j^{(k)}$ is the tensor noise bound of Definition E.5, that is, $\xi^{(k)}$ with $\hat{R}^{(k)}$ replaced
by $j$.
The generalization error of the compressed network $L_0(\hat{M})$ depends on the compressibility of M. The
compressibility of M determines the rank that the compression mechanism should select according to Theorem E.6,
which depends on the reshaping factor $\rho^{(k)}$, tensorization factor $t^{(k)}$, layer cushion $\zeta^{(k)}$ and tensor noise
bound $\xi_j^{(k)}$.
Proof sketch of Theorem E.6: To prove this theorem, we introduce the following Lemma E.7, which reveals
that the difference between the output of the original fully connected network M and that of the compressed $\hat{M}$
is bounded by $\epsilon \|M(x)\|_F$. Then we show that the covering number of the compressed network $\hat{M}$, obtained by approximating
each parameter with a certain accuracy, is $\tilde{O}(d)$ w.r.t. a given $\epsilon$. After bounding the covering number, the
rest of the proof follows from conventional learning theory.
Lemma E.7. For any fully connected network M of $n$ layers satisfying Assumption E.1, Algorithm 2 generates
a compressed Tensorial-FC $\hat{M}$ where for any $x \in S$ and any error $0 \le \epsilon \le 1$:
$$\|M(x) - \hat{M}(X)\|_F \le \epsilon \|M(x)\|_F \tag{79}$$
The compressed Tensorial-FC $\hat{M}$ consists of $\sum_{k=1}^{n} \hat{R}^{(k)} [2(s^{(k)} + s^{(k+1)}) + 1]$ parameters, where
each $\hat{R}^{(k)}$ is defined as stated in Algorithm 4: for each layer $k \in [n]$,
$$\hat{R}^{(k)} = \min\left\{ j \in [R^{(k)}] \;\middle|\; \rho^{(k)} \xi_j^{(k)} \prod_{i=k+1}^{n} t^{(i)} \le \frac{\epsilon}{n} \|A^{(k)}\|_F \prod_{i=k}^{n} \zeta^{(i)} \right\}$$
Proof. (of Lemma E.8) Based on Algorithm 2, since for each layer $k$ in the compressed network $\hat{M}$, representing
$\{\hat{\lambda}_i^{(k)}, \hat{a}_i^{(k)}, \hat{b}_i^{(k)}, \hat{c}_i^{(k)}, \hat{d}_i^{(k)}\}_{i=1}^{\hat{R}^{(k)}}$ only needs $\hat{R}^{(k)} [2(s^{(k)} + s^{(k+1)}) + 1]$ parameters, the total number of parameters
in $\hat{M}$ is $\sum_{k=1}^{n} \hat{R}^{(k)} [2(s^{(k)} + s^{(k+1)}) + 1]$.
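The parameter count above is easy to evaluate for a concrete architecture. The sketch below compares it with the dense parameter count, assuming each layer width is a perfect square $(s^{(k)})^2$ as in the simplification used earlier; the side lengths, ranks and function names are made up for illustration.

```python
def tensorial_fc_param_count(ranks, sides):
    """Sum over layers of R_hat^(k) * (2 * (s^(k) + s^(k+1)) + 1).

    ranks[k] is the retained rank of layer k+1; sides[k] is s^(k+1),
    so len(sides) == len(ranks) + 1.
    """
    return sum(r * (2 * (sides[k] + sides[k + 1]) + 1) for k, r in enumerate(ranks))


def dense_fc_param_count(sides):
    """Weights of the uncompressed network when layer k has width sides[k] ** 2."""
    return sum(sides[k] ** 2 * sides[k + 1] ** 2 for k in range(len(sides) - 1))


sides = [28, 16, 4]   # illustrative widths 784 -> 256 -> 16
ranks = [10, 5]       # illustrative retained ranks per layer
compressed_count = tensorial_fc_param_count(ranks, sides)
dense_count = dense_fc_param_count(sides)
```

With these illustrative numbers the compressed network stores 1095 values against 204800 dense weights; each rank-one component costs one amplitude plus four factor vectors of lengths $s^{(k)}, s^{(k)}, s^{(k+1)}, s^{(k+1)}$.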
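Then, as for any $x \in S$ the output $M(x)$ of the original network equals the output $M(X)$ of the network with CPL, and by construction
$M(X) = X^{(n+1)}$ and $\hat{M}(X) = \hat{X}^{(n+1)}$, we
can prove the lemma by showing that $\|X^{(n+1)} - \hat{X}^{(n+1)}\|_F$ satisfies the above inequality, and we will prove this by
induction. Notice
Induction Hypothesis: For any layer $m \ge 0$,
$$\|X^{(m)} - \hat{X}^{(m)}\|_F \le \left( \sum_{k=1}^{m-1} \frac{\rho^{(k)} \xi^{(k)}}{\zeta^{(k)} \|A^{(k)}\|_F} \cdot \prod_{i=k+1}^{m-1} \frac{t^{(i)}}{\zeta^{(i)}} \right) \|X^{(m)}\|_F$$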
Base case: when $m = 1$, the above inequality holds trivially, as $X^{(1)} = \hat{X}^{(1)}$ (we cannot modify the input) and
the RHS is always $\ge 0$.
Inductive Step: Now assume that the induction hypothesis holds for $m$, and let us look at what
happens at layer $m + 1$. As we assume perfect recovery in each layer, $\forall k$, $\{\hat{\lambda}_i^{(k)}, \hat{a}_i^{(k)}, \hat{b}_i^{(k)}, \hat{c}_i^{(k)}, \hat{d}_i^{(k)}\}_{i=1}^{\hat{R}^{(k)}} =
\{\lambda_i^{(k)}, a_i^{(k)}, b_i^{(k)}, c_i^{(k)}, d_i^{(k)}\}_{i=1}^{\hat{R}^{(k)}}$.
Let $\varphi^{(k)} := \sum_{i=\hat{R}^{(k)}+1}^{R^{(k)}} \lambda_i^{(k)} a_i^{(k)} \otimes b_i^{(k)} \otimes c_i^{(k)} \otimes d_i^{(k)}$, and note that $M^{(k)} = \hat{M}^{(k)} + \varphi^{(k)}$.
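Then we have
$$\begin{aligned}
\|X^{(m+1)} - \hat{X}^{(m+1)}\|_F
&= \|\mathrm{ReLU}(Y^{(m)}) - \mathrm{ReLU}(\hat{Y}^{(m)})\|_F \\
&\le \Big\| \sum_{i=1}^{\hat{R}^{(m)}} \lambda_i^{(m)} (a_i^{(m)})^\top X^{(m)} b_i^{(m)} \, c_i^{(m)} \otimes d_i^{(m)} + \varphi^{(m)}(X^{(m)}) - \sum_{i=1}^{\hat{R}^{(m)}} \hat{\lambda}_i^{(m)} (\hat{a}_i^{(m)})^\top \hat{X}^{(m)} \hat{b}_i^{(m)} \, \hat{c}_i^{(m)} \otimes \hat{d}_i^{(m)} \Big\|_F \\
&= \Big\| \sum_{i=1}^{\hat{R}^{(m)}} \lambda_i^{(m)} (a_i^{(m)})^\top (X^{(m)} - \hat{X}^{(m)}) b_i^{(m)} \, c_i^{(m)} \otimes d_i^{(m)} + \varphi^{(m)}(X^{(m)}) \Big\|_F
\end{aligned}$$
So
$$\|X^{(m+1)} - \hat{X}^{(m+1)}\|_F \le \Big\| \sum_{i=1}^{\hat{R}^{(m)}} \lambda_i^{(m)} (a_i^{(m)})^\top (X^{(m)} - \hat{X}^{(m)}) b_i^{(m)} \, c_i^{(m)} \otimes d_i^{(m)} \Big\|_F + \|\varphi^{(m)}(X^{(m)})\|_F$$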
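$$\begin{aligned}
\|\varphi^{(m)}(X^{(m)})\|_F
&= \sqrt{ \sum_{i=\hat{R}^{(m)}+1}^{R^{(m)}} \big[ \lambda_i^{(m)} (a_i^{(m)})^\top X^{(m)} b_i^{(m)} \big]^2 } \\
&\le \sqrt{ \sum_{i=\hat{R}^{(m)}+1}^{R^{(m)}} (\lambda_i^{(m)})^2 \, \|a_i^{(m)}\|^2 \, \|X^{(m)} b_i^{(m)}\|^2 } \\
&\le \sqrt{ \sum_{i=\hat{R}^{(m)}+1}^{R^{(m)}} (\lambda_i^{(m)})^2 \, \|X^{(m)}\|^2 \, \|b_i^{(m)}\|^2 } \\
&= \sqrt{ \sum_{i=\hat{R}^{(m)}+1}^{R^{(m)}} (\lambda_i^{(m)})^2 } \; \|X^{(m)}\| \\
&\le \xi^{(m)} \|X^{(m)}\| \\
&\le \xi^{(m)} \rho^{(m)} \|X^{(m)}\|_F \\
&\le \frac{\rho^{(m)} \xi^{(m)}}{\zeta^{(m)} \|A^{(m)}\|_F} \|X^{(m+1)}\|_F
\end{aligned}$$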
Similarly, we can bound $\Big\| \sum_{i=1}^{\hat{R}^{(m)}} \lambda_i^{(m)} (a_i^{(m)})^\top (X^{(m)} - \hat{X}^{(m)}) b_i^{(m)} \, c_i^{(m)} \otimes d_i^{(m)} \Big\|_F$ as follows:
R̂(m)
X (m) (m) ⊤ (m) (m) (m) (m) (m)
λi (ai ) (X − X̂ )bi ci ⊗ di
i=1
F
v
uR̂(m)
uX
(m) (m) (m)
=t [λi (ai )⊤ (X (m) − X̂ (m) )bi ]2
i=1
v
uR̂(m)
uX
(m)
≤t (λi )2
X (m) − X̂ (m)
i=1
v
uR̂(m)
uX
(m)
≤t (λi )2
X (m) − X̂ (m)
F
i=1
q
2
= (t(m) )2
A(m)
F
X (m) − X̂ (m)
F
m−1 (k) (k)
X ρξ t(i)
≤t (m)
(m)
A
· (
· Πm−1i=k+1 )
(m)
ζ (k)
A(k)
F ζ (i)
X
F F
k=1
m−1
X (m+1)
X ρ(k) ξ (k) t(i)
≤ ρ(m) t(m)
A(m)
(m)
F
× (
· Πm−1
i=k+1 (i) )
F ζ
A (m)
ζ (k)
A (k)
ζ
F k=1 F
m−1
X ρ(k) ξ (k) t(i)
=(
· Πm
i=k+1 (i) ) ·
(m+1)
X
ζ (k)
A(k)
F ζ
F
k=1
Where the second to the last equality is due to the fact that for any $\alpha_i, \beta \in \mathbb{R}$, $\big( \prod_{i=k+1}^{k} \alpha_i \big) \times \beta = \beta$ (an empty product equals 1).
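Proof. (of Lemma E.7) Based on the assumptions of the components from the CP decomposition for each $M^{(k)}$ in
section 3, the $\{\hat{R}^{(k)}\}_{k=1}^{n}$ returned by Algorithm 4 will satisfy:
• $\forall k$, $\hat{R}^{(k)} \le R^{(k)}$
• $\rho^{(k)} \xi^{(k)} \prod_{i=k+1}^{n} t^{(i)} \le \frac{\epsilon}{n} \|A^{(k)}\|_F \prod_{i=k}^{n} \zeta^{(i)}$
Thus,
$$\frac{\rho^{(k)} \xi^{(k)}}{\zeta^{(k)} \|A^{(k)}\|_F} \cdot \prod_{i=k+1}^{n} \frac{t^{(i)}}{\zeta^{(i)}} \le \frac{\epsilon}{n}$$
Then by Lemma E.8,
$$\|M(x) - \hat{M}(X)\|_F \le \epsilon \|M(x)\|_F$$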
Proof. (of Lemma E.9) If $\gamma \ge 2 \max_{x \in S} \|M(x)\|_F$, for any pair $(x, y) \in S$, we have
$$\Big|M(x)[y] - \max_{j \ne y} M(x)[j]\Big|^2 \le \Big(|M(x)[y]| + \big|\max_{j \ne y} M(x)[j]\big|\Big)^2 \le 4 \max_{x \in S} \|M(x)\|_F^2 \le \gamma^2$$
Then the output margin of M cannot be greater than $\gamma$ for any $x \in S$. Thus $\hat{L}_\gamma(M) = 1$.
If $\gamma < 2 \max_{x \in S} \|M(x)\|_F$, setting
$$\epsilon = \frac{\gamma}{2 \max_{x \in S} \|M(x)\|_F}$$
in Lemma E.7, we obtain a compressed fully-connected tensorial neural network $\hat{M}$ with the desired number of
parameters and
$$\|M(x) - \hat{M}(X)\|_F < \frac{\gamma}{2} \;\Rightarrow\; \forall j,\; |M(x)[j] - \hat{M}(X)[j]| < \frac{\gamma}{2}$$
Then for any pair $(x, y) \in S$, if $M(x)[y] > \gamma + \max_{j \ne y} M(x)[j]$, $\hat{M}$ classifies $x$ correctly as well because:
$$\hat{M}(X)[y] > M(x)[y] - \frac{\gamma}{2} > \max_{j \ne y} M(x)[j] + \frac{\gamma}{2} > \max_{j \ne y} \hat{M}(X)[j]$$
Now we prove the main theorem E.6 by bounding the covering number given any ǫ.
Proof. (of Theorem E.6) To be more specific, let us bound the covering number of the compressed network M̂
by approximating each parameter with accuracy µ.
Covering Number Analysis for Fully Connected Neural Network Let $\tilde{T}$ denote the network after approximating
each parameter in $\hat{M}$ with accuracy $\mu$ (and $\tilde{T}^{(k)}$ denote its weight tensor on the $k$-th layer). Based on
the given accuracy, we know that $\forall k$, $|\hat{\lambda}_i^{(k)} - \tilde{\lambda}_i^{(k)}| \le \mu$ and $\|\hat{a}_i^{(k)} - \tilde{a}_i^{(k)}\| \le \sqrt{s^{(k)}}\,\mu$ (similar inequalities also hold
for $\hat{b}_i^{(k)}, \hat{c}_i^{(k)}, \hat{d}_i^{(k)}$). For simplicity, in this proof, let us just use $a_i^{(k)}, b_i^{(k)}, c_i^{(k)}, d_i^{(k)}$ to denote $\hat{a}_i^{(k)}, \hat{b}_i^{(k)}, \hat{c}_i^{(k)}, \hat{d}_i^{(k)}$.
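Notice that
$$Y^{(k)} = \sum_{i=1}^{r^{(k)}} \lambda_i^{(k)} (a_i^{(k)})^\top X^{(k)} b_i^{(k)} \; c_i^{(k)} \otimes d_i^{(k)}$$
$$\tilde{Y}^{(k)} = \sum_{i=1}^{r^{(k)}} \tilde{\lambda}_i^{(k)} (\tilde{a}_i^{(k)})^\top \tilde{X}^{(k)} \tilde{b}_i^{(k)} \; \tilde{c}_i^{(k)} \otimes \tilde{d}_i^{(k)}$$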
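Let $\epsilon^{(k)} = \|\tilde{Y}^{(k)} - Y^{(k)}\|_F$. Then for each $k$, let us first bound $|(a_i^{(k)})^\top X^{(k)} b_i^{(k)} - (\tilde{a}_i^{(k)})^\top \tilde{X}^{(k)} \tilde{b}_i^{(k)}|$ and
$\|c_i^{(k)} \otimes d_i^{(k)} - \tilde{c}_i^{(k)} \otimes \tilde{d}_i^{(k)}\|_F$ separately.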
The second to the last inequality is because singular values are invariant to matrix transpose.
When $k \ge 1$, similarly, let $a = a_i^{(k)}$, $\tilde{a} = \tilde{a}_i^{(k)}$ (define $b$ in a similar way), $X = X^{(k)}$, and $\tilde{X} = \tilde{X}^{(k)}$. Let
$Y = Y^{(k-1)}$, and $\tilde{Y} = \tilde{Y}^{(k-1)}$ (basically the output from the $(k-1)$-th layer before activation). Then
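$$\begin{aligned}
\|c_i^{(k)} \otimes d_i^{(k)} - \tilde{c}_i^{(k)} \otimes \tilde{d}_i^{(k)}\|_F^2
&= \|c d^\top - \tilde{c} \tilde{d}^\top\|_F^2 \\
&= \mathrm{Tr}\big( (c d^\top - \tilde{c} \tilde{d}^\top)^\top (c d^\top - \tilde{c} \tilde{d}^\top) \big) \\
&= \mathrm{Tr}\big( (d c^\top - \tilde{d} \tilde{c}^\top)(c d^\top - \tilde{c} \tilde{d}^\top) \big) \\
&= \mathrm{Tr}\big( d c^\top c d^\top - \tilde{d} \tilde{c}^\top c d^\top - d c^\top \tilde{c} \tilde{d}^\top + \tilde{d} \tilde{c}^\top \tilde{c} \tilde{d}^\top \big) \\
&= \mathrm{Tr}(d c^\top c d^\top) - \mathrm{Tr}(\tilde{d} \tilde{c}^\top c d^\top) - \mathrm{Tr}(d c^\top \tilde{c} \tilde{d}^\top) + \mathrm{Tr}(\tilde{d} \tilde{c}^\top \tilde{c} \tilde{d}^\top) \\
&= \mathrm{Tr}(c^\top c d^\top d) - \mathrm{Tr}(c^\top \tilde{c} \tilde{d}^\top d) + \mathrm{Tr}(\tilde{c}^\top \tilde{c} \tilde{d}^\top \tilde{d}) - \mathrm{Tr}(\tilde{c}^\top c d^\top \tilde{d}) \\
&= \mathrm{Tr}\big( c^\top (c d^\top - \tilde{c} \tilde{d}^\top) d + \tilde{c}^\top (\tilde{c} \tilde{d}^\top - c d^\top) \tilde{d} \big) \\
&= c^\top (c d^\top - \tilde{c} \tilde{d}^\top) d + \tilde{c}^\top (\tilde{c} \tilde{d}^\top - c d^\top) \tilde{d} \\
&\le \|c\| \, \|c d^\top - \tilde{c} \tilde{d}^\top\| \, \|d\| + \|\tilde{c}\| \, \|\tilde{c} \tilde{d}^\top - c d^\top\| \, \|\tilde{d}\| \\
&\le 2 \|c d^\top - \tilde{c} \tilde{d}^\top\|, \quad \text{as the norms of } c, d, \tilde{c}, \tilde{d} \text{ are} \le 1 \\
&= 2 \|c d^\top - c \tilde{d}^\top + c \tilde{d}^\top - \tilde{c} \tilde{d}^\top\| \\
&= 2 \|c (d^\top - \tilde{d}^\top) + (c - \tilde{c}) \tilde{d}^\top\| \\
&\le 2 \big( \|c (d^\top - \tilde{d}^\top)\| + \|(c - \tilde{c}) \tilde{d}^\top\| \big) \\
&\le 2 \big( \|c\| \, \|d - \tilde{d}\| + \|\tilde{d}\| \, \|c - \tilde{c}\| \big), \quad \text{as they are rank 1 matrices} \\
&\le 4 \sqrt{s^{(k+1)}} \, \mu
\end{aligned}$$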
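Bound $\epsilon^{(k)} = \|\tilde{Y}^{(k)} - Y^{(k)}\|_F$: Similarly, for simplicity, let $w_i = \lambda_i^{(k)} (a_i^{(k)})^\top X^{(k)} b_i^{(k)}$, $\tilde{w}_i = \tilde{\lambda}_i^{(k)} (\tilde{a}_i^{(k)})^\top \tilde{X}^{(k)} \tilde{b}_i^{(k)}$,
$U_i = c_i^{(k)} \otimes d_i^{(k)}$, and $\tilde{U}_i = \tilde{c}_i^{(k)} \otimes \tilde{d}_i^{(k)}$.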
Since $\|\tilde{Y}^{(k)} - Y^{(k)}\|_F = \Big\| \sum_{i=1}^{r^{(k)}} w_i U_i - \sum_{i=1}^{r^{(k)}} \tilde{w}_i \tilde{U}_i \Big\|_F$,
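$$\begin{aligned}
\Big\| \sum_{i=1}^{r^{(k)}} w_i U_i - \sum_{i=1}^{r^{(k)}} \tilde{w}_i \tilde{U}_i \Big\|_F
&= \Big\| \sum_{i=1}^{r^{(k)}} ( w_i U_i - \tilde{w}_i \tilde{U}_i ) \Big\|_F \\
&\le \sum_{i=1}^{r^{(k)}} \| w_i U_i - \tilde{w}_i \tilde{U}_i \|_F \\
&= \sum_{i=1}^{r^{(k)}} \| w_i U_i - w_i \tilde{U}_i + w_i \tilde{U}_i - \tilde{w}_i \tilde{U}_i \|_F \\
&\le \sum_{i=1}^{r^{(k)}} \big( \| w_i U_i - w_i \tilde{U}_i \|_F + \| w_i \tilde{U}_i - \tilde{w}_i \tilde{U}_i \|_F \big) \\
&= \sum_{i=1}^{r^{(k)}} \big( \| w_i ( U_i - \tilde{U}_i ) \|_F + \| ( w_i - \tilde{w}_i ) \tilde{U}_i \|_F \big) \\
&= \sum_{i=1}^{r^{(k)}} \big( |w_i| \, \| U_i - \tilde{U}_i \|_F + | w_i - \tilde{w}_i | \, \| \tilde{U}_i \|_F \big)
\end{aligned} \tag{80}$$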
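$$\begin{aligned}
&\le \sum_{i=1}^{r^{(k)}} |w_i| \times \sqrt{4 \sqrt{s^{(k+1)}} \mu} + \Big( \mu \|X^{(k)}\| \big( 1 + 2 |\lambda_i^{(k)}| \sqrt{s^{(k)}} \big) + |\lambda_i^{(k)}| \epsilon^{(k-1)} \Big) \times \|\tilde{U}_i\|_F \\
&\le \sum_{i=1}^{r^{(k)}} |\lambda_i^{(k)}| \, \|X^{(k)}\| \times \sqrt{4 \sqrt{s^{(k+1)}} \mu} + \Big( \mu \|X^{(k)}\| \big( 1 + 2 |\lambda_i^{(k)}| \sqrt{s^{(k)}} \big) + |\lambda_i^{(k)}| \epsilon^{(k-1)} \Big) \times \|\tilde{c}_i^{(k)} \otimes \tilde{d}_i^{(k)}\|_F \\
&= \sum_{i=1}^{r^{(k)}} 2 |\lambda_i^{(k)}| \, \|X^{(k)}\| \times \sqrt{\sqrt{s^{(k+1)}} \mu} + \mu \|X^{(k)}\| \big( 1 + 2 |\lambda_i^{(k)}| \sqrt{s^{(k)}} \big) + |\lambda_i^{(k)}| \epsilon^{(k-1)} \\
&\le \sum_{i=1}^{r^{(k)}} \mu \|X^{(k)}\| \Big( 1 + 2 |\lambda_i^{(k)}| \big( \sqrt{s^{(k)}} + \sqrt{s^{(k+1)}} \big) \Big) + |\lambda_i^{(k)}| \epsilon^{(k-1)}, \quad \text{assume } \sqrt{\sqrt{s^{(k+1)}} \mu} \le \sqrt{s^{(k+1)}} \mu \\
&\le r^{(k)} \times \Big\{ \mu \|X^{(k)}\| \Big( 1 + 2 |\lambda_{\max}^{(k)}| \big( \sqrt{s^{(k)}} + \sqrt{s^{(k+1)}} \big) \Big) + |\lambda_{\max}^{(k)}| \epsilon^{(k-1)} \Big\} \\
&\le \mu r^{(k)} \Big[ 1 + 2 |\lambda_{\max}^{(k)}| \big( \sqrt{s^{(k)}} + \sqrt{s^{(k+1)}} \big) \Big] \|X^{(k)}\| + r^{(k)} |\lambda_{\max}^{(k)}| \epsilon^{(k-1)}
\end{aligned}$$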
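Let $\alpha^{(k)} := \mu r^{(k)} \big[ 1 + 2 |\lambda_{\max}^{(k)}| ( \sqrt{s^{(k)}} + \sqrt{s^{(k+1)}} ) \big] \|X^{(k)}\|$, and $\beta^{(k)} = r^{(k)} |\lambda_{\max}^{(k)}|$; then by the recurrence relationship
in (80), the difference between the final outputs of the two networks is bounded by:
$$\begin{aligned}
\|\hat{M}(X) - \tilde{M}(X)\|_F
&= \|\mathrm{ReLU}(\tilde{Y}^{(n)}) - \mathrm{ReLU}(Y^{(n)})\|_F \quad \big( = \|\tilde{X}^{(n+1)} - X^{(n+1)}\|_F \big) \\
&\le \|\tilde{Y}^{(n)} - Y^{(n)}\|_F \\
&\le \sum_{k=1}^{n} \alpha^{(k)} \prod_{i=k+1}^{n} \beta^{(i)}
\end{aligned}$$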
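The last step unrolls a per-layer recurrence of the form $\epsilon^{(k)} \le \alpha^{(k)} + \beta^{(k)} \epsilon^{(k-1)}$ with $\epsilon^{(0)} = 0$ (the input is not perturbed). The sketch below, with made-up values for $\alpha^{(k)}$ and $\beta^{(k)}$ and hypothetical function names, checks that iterating the recurrence with equality gives the same number as the closed-form sum.

```python
import math


def unroll_iteratively(alphas, betas):
    """Iterate eps_k = alpha_k + beta_k * eps_{k-1}, starting from eps_0 = 0."""
    eps = 0.0
    for alpha, beta in zip(alphas, betas):
        eps = alpha + beta * eps
    return eps


def closed_form(alphas, betas):
    """sum_k alpha_k * prod_{i > k} beta_i (an empty product equals 1)."""
    n = len(alphas)
    return sum(alphas[k] * math.prod(betas[k + 1:]) for k in range(n))


alphas = [0.1, 0.2, 0.3]   # illustrative per-layer additive errors
betas = [2.0, 3.0, 4.0]    # illustrative per-layer amplification factors
iterated = unroll_iteratively(alphas, betas)
summed = closed_form(alphas, betas)
```

Note that $\beta^{(1)}$ never appears in the closed form, because there is no error entering the first layer for it to amplify.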
(i)
Since ∀k ∈ [n],
X (k)
≤ Πni=k ζ (i) ρA(i)
X (n+1)
F , to obtain an ǫ-cover of the compressed network, we can
k kF
When $\mu$ is fixed, the number of networks in our cover is at most $(1/\mu)^d$, where $d$ denotes the number of
parameters in the original network. Hence, the covering number w.r.t. a given $\epsilon$ is $\tilde{O}(nd)$ ($n$ is the number of
layers in the given neural network). For practical neural networks, the number of layers $n$ is usually much
less than $O(\log(d))$, so the covering number we obtain w.r.t. a given $\epsilon$ is just $\tilde{O}(d)$ for practical neural
networks.
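The count $(1/\mu)^d$ is astronomically large for any realistic $d$, so it is natural to work with its logarithm, $d \log(1/\mu)$, which is linear in $d$. The sketch below evaluates this quantity; reading the covering number on a log scale is an assumption about the convention used here, and the function name and numbers are illustrative only.

```python
import math


def log_cover_size(num_params, mu):
    """Natural log of (1 / mu) ** num_params, computed without overflow."""
    return num_params * math.log(1.0 / mu)


# Illustrative values: 1095 parameters, each discretized to accuracy 0.01.
log_size = log_cover_size(1095, 0.01)
```

Doubling the number of parameters doubles this quantity, while refining the accuracy $\mu$ only adds a logarithmic factor.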
For neural nets with skip connections, the current theoretical analyses consider convolutional neural networks
with one skip connection used on each layer, since our theoretical results can easily extend to general neural nets
with skip connections. Therefore, we use the same notations for neural nets with skip connections as those
we defined for convolutional neural networks.
Forward pass functions Under the above assumptions, the only difference that we need to take into account
between our analysis of CNN with skip connections and our analysis of standard CNN is the forward pass
functions. In neural networks with skip connections, we have
$$X^{(k)} = \mathrm{ReLU}(Y^{(k-1)}), \qquad Y^{(k)} = M^{(k)} X^{(k)} + X^{(k)}$$
and
$$\hat{X}^{(k)} = \mathrm{ReLU}(\hat{Y}^{(k-1)}), \qquad \hat{Y}^{(k)} = \hat{M}^{(k)} \hat{X}^{(k)} + \hat{X}^{(k)}$$
where $M^{(k)} X^{(k)}$ and $\hat{M}^{(k)} \hat{X}^{(k)}$ compute the outputs of the $k$-th convolutional layer.
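The forward pass with skip connections can be sketched as follows. For brevity this sketch replaces each convolution by a square dense matrix acting on a vector, which is an assumption made purely for illustration (the paper's layers are convolutional); the function names are hypothetical.

```python
def matvec(W, x):
    """Dense stand-in for the layer operation M^(k) X^(k)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def forward_with_skips(weights, x1):
    """Apply Y^(k) = M^(k) X^(k) + X^(k), then X^(k+1) = ReLU(Y^(k)), layer by layer."""
    x = list(x1)
    for W in weights:
        y = [m + xi for m, xi in zip(matvec(W, x), x)]  # skip connection adds the input back
        x = relu(y)
    return x  # X^(n+1), the network output


weights = [[[1.0, 0.0], [0.0, -2.0]]]   # one illustrative layer
output = forward_with_skips(weights, [1.0, 1.0])
```

In this example the layer maps `[1.0, 1.0]` to `[1.0, -2.0]`, the skip connection adds the input to give `[2.0, -1.0]`, and ReLU yields `[2.0, 0.0]`. The added identity path is the reason the tensorization factor appears as $t + 1$ in the bounds for this setting.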
Similarly, we use tensorization factor, tensor noise bound and layer cushion as in convolutional neural network
defined in 4.2, 4.3 and 4.4. But note that the input X (k) in the definition of layer cushion is the input of k th
layer after skip connection.
Theorem F.1. For any convolutional neural network $M$ of $n$ layers with skip connection satisfying the assumptions
in section 3 and any margin $\gamma \ge 0$, Algorithm 1 generates a compressed network $\hat{M}$ such that with high
probability over the training set, the expected error $L_0(\hat{M})$ is bounded by
$$\hat{L}_\gamma(M) + \tilde{O}\left( \sqrt{\frac{\sum_{k=1}^{n} \hat{R}^{(k)} (s^{(k)} + o^{(k)} + k_x^{(k)} \times k_y^{(k)} + 1)}{m}} \right) \tag{81}$$
where
$$\hat{R}^{(k)} = \min\left\{ j \in [R^{(k)}] \;\middle|\; \xi_j^{(k)} \prod_{i=k+1}^{n} (t_j^{(i)} + 1) \le \frac{\gamma}{2n \max_{X \in S} \|M(X)\|_F} \prod_{i=k}^{n} \zeta^{(i)} \|M^{(i)}\|_F \right\} \tag{82}$$
Lemma F.2. For any convolutional neural network $M$ of $n$ layers with skip connection satisfying the assumptions
in section 3 and any error $0 \le \epsilon \le 1$, Algorithm 1 generates a compressed tensorial neural network $\hat{M}$ such that
for any $X \in S$:
$$\|M(X) - \hat{M}(X)\|_F \le \epsilon \|M(X)\|_F \tag{83}$$
The compressed convolutional neural network $\hat{M}$ has $\sum_{k=1}^{n} \hat{R}^{(k)} (s^{(k)} + o^{(k)} + k_x^{(k)} \times k_y^{(k)} + 1)$ total parameters,
where each $\hat{R}^{(k)}$ satisfies:
$$\hat{R}^{(k)} = \min\left\{ j \in [R^{(k)}] \;\middle|\; \xi_j^{(k)} \prod_{i=k+1}^{n} (t_j^{(i)} + 1) \le \frac{\epsilon}{n} \prod_{i=k}^{n} \zeta^{(i)} \|M^{(i)}\|_F \right\} \tag{84}$$
$$\|M(X) - \hat{M}(X)\|_F \le \left( \sum_{k=1}^{n} \frac{\xi^{(k)}}{\zeta^{(k)} \|M^{(k)}\|_F} \prod_{l=k+1}^{n} \frac{t^{(l)} + 1}{\zeta^{(l)} \|M^{(l)}\|_F} \right) \|M(X)\|_F$$
where $\xi$, $\zeta$, and $t$ are the tensor noise bound, layer cushion and tensorization factor defined in 4.3, 4.4 and 4.2 respectively.
We know by construction that $M(X) = X^{(n+1)}$ and $\hat{M}(X) = \hat{X}^{(n+1)}$, so we can just show that $\|X^{(n+1)} - \hat{X}^{(n+1)}\|_F$ satisfies
the above inequality, and we will prove this by induction. Notice
$$\|X^{(m)} - \hat{X}^{(m)}\|_F \le \left( \sum_{k=1}^{m} \frac{\xi^{(k)}}{\zeta^{(k)} \|M^{(k)}\|_F} \prod_{l=k+1}^{m} \frac{t^{(l)} + 1}{\zeta^{(l)} \|M^{(l)}\|_F} \right) \|X^{(m)}\|_F$$
Base case: when $m = 1$, the above inequality holds trivially, as $X^{(1)} = \hat{X}^{(1)}$ (we cannot modify the input) and
the RHS is always $\ge 0$.
Inductive Step: Now assume that the induction hypothesis holds for $m$; then at layer $m + 1$ we
have
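$$\begin{aligned}
\|X^{(m+1)} - \hat{X}^{(m+1)}\|_F
&= \|\mathrm{ReLU}(Y^{(m)}) - \mathrm{ReLU}(\hat{Y}^{(m)})\|_F \\
&\le \|Y^{(m)} - \hat{Y}^{(m)}\|_F \\
&\le \big\| M^{(m)} X^{(m)} + X^{(m)} - \big( \hat{M}^{(m)} \hat{X}^{(m)} + \hat{X}^{(m)} \big) \big\|_F \\
&\le \| M^{(m)} X^{(m)} - \hat{M}^{(m)} \hat{X}^{(m)} \|_F + \| X^{(m)} - \hat{X}^{(m)} \|_F \\
&\le \big\| \hat{M}^{(m)} ( X^{(m)} - \hat{X}^{(m)} ) + ( M^{(m)} - \hat{M}^{(m)} ) X^{(m)} \big\|_F + \| X^{(m)} - \hat{X}^{(m)} \|_F \\
&\le \big( \sqrt{HW} \, t^{(m)} + 1 \big) \| X^{(m)} - \hat{X}^{(m)} \|_F + \sqrt{HW} \, \xi^{(m)} \| X^{(m)} \|_F \\
&\le \big( \sqrt{HW} \, t^{(m)} + 1 \big) \left( \sum_{k=1}^{m-1} \frac{\xi^{(k)}}{\zeta^{(k)} \|M^{(k)}\|_F} \prod_{l=k+1}^{m-1} \frac{t^{(l)} + 1}{\zeta^{(l)} \|M^{(l)}\|_F} \right) \| X^{(m)} \|_F + \sqrt{HW} \, \xi^{(m)} \| X^{(m)} \|_F \\
&\le \left( \sum_{k=1}^{m-1} \frac{\xi^{(k)}}{\zeta^{(k)} \|M^{(k)}\|_F} \prod_{l=k+1}^{m-1} \frac{t^{(l)} + 1}{\zeta^{(l)} \|M^{(l)}\|_F} \right) \frac{t^{(m)} + 1}{\zeta^{(m)} \|M^{(m)}\|_F} \| X^{(m+1)} \|_F + \frac{\xi^{(m)}}{\zeta^{(m)} \|M^{(m)}\|_F} \| X^{(m+1)} \|_F \\
&\le \left( \sum_{k=1}^{m} \frac{\xi^{(k)}}{\zeta^{(k)} \|M^{(k)}\|_F} \prod_{l=k+1}^{m} \frac{t^{(l)} + 1}{\zeta^{(l)} \|M^{(l)}\|_F} \right) \| X^{(m+1)} \|_F
\end{aligned}$$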
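each layer $k$,
$$\hat{R}^{(k)} = \min\left\{ j \in [R^{(k)}] \;\middle|\; \xi_j^{(k)} \prod_{i=k+1}^{n} (t_j^{(i)} + 1) \le \frac{\epsilon}{n} \prod_{i=k}^{n} \zeta^{(i)} \|M^{(i)}\|_F \right\}$$
The proof of Lemma F.4 is the same as that of Lemma E.9. And by setting $\epsilon = \frac{\gamma}{2 \max_{X \in S} \|M(X)\|_F}$, we get the desired expression
of $\hat{R}^{(k)}$ in the main theorem.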
Proof. (of Theorem F.1) Similarly, let us bound the covering number of the compressed network $\hat{M}$ by approximating
each parameter with accuracy $\mu$.
Covering Number Analysis for Convolutional Neural Network Let $\tilde{M}$ denote the network after approximating
each parameter in $\hat{M}$ with accuracy $\mu$. We use the same assumptions and notations as in the proof of
Theorem 4.5. And we still use $X^{(k)}, Y^{(k)}, M^{(k)}$ to denote $\hat{X}^{(k)}, \hat{Y}^{(k)}, \hat{M}^{(k)}$.
Bound $\tau^{(k)} = \|\tilde{Y}^{(k)} - \hat{Y}^{(k)}\|_F$:
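$$\begin{aligned}
\|\tilde{Y}^{(k)} - Y^{(k)}\|_F
&= \big\| \tilde{M}^{(k)}(\tilde{X}^{(k)}) + \tilde{X}^{(k)} - \big( M^{(k)}(X^{(k)}) + X^{(k)} \big) \big\|_F \\
&\le \| \tilde{M}^{(k)}(\tilde{X}^{(k)}) - M^{(k)}(X^{(k)}) \|_F + \| \tilde{X}^{(k)} - X^{(k)} \|_F \\
&= \| \tilde{M}^{(k)}(\tilde{X}^{(k)}) - M^{(k)}(X^{(k)}) \|_F + \| \mathrm{ReLU}(\tilde{Y}^{(k-1)}) - \mathrm{ReLU}(Y^{(k-1)}) \|_F \\
&\le \| \tilde{M}^{(k)}(\tilde{X}^{(k)}) - M^{(k)}(X^{(k)}) \|_F + \| \tilde{Y}^{(k-1)} - Y^{(k-1)} \|_F \\
&= \| \tilde{M}^{(k)}(\tilde{X}^{(k)}) - M^{(k)}(X^{(k)}) \|_F + \tau^{(k-1)}
\end{aligned}$$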
Based on the proof of Theorem 4.5 (in Appendix D), we can easily get
(k)
Ỹ − Y (k)
F
X k
n X k
Y
= α(i) β (t)
k=1 i=1 t=i+1
ǫ
µ≤ √ √
2kM̃(∗) k
q
(∗) (∗)
2 HW n2
X (n+1)
F R̂(∗) (λ(∗) )2 kx ky + 4(λ(∗) )2 (o(∗) + s(∗) ) + 2( ζ (∗) M(∗) F )n
k kF
Details of Step 3 in Algorithm 1. We use the alternating least squares (ALS) in the implementation of step
3, which is the ‘parafac’ method of the tensorly library (Kossaifi et al., 2019), to obtain the CP decomposition.
Though CP decomposition is in general NP-hard, the ALS method usually converges for most tensors, with
polynomial convergence rate w.r.t. the given precision of the allowed reconstruction error (Anandkumar et al.,
2015, 2014a,b). In addition, step 3 obtains a CP parametrization of the weight tensor rather than recovering the
true components of the weight tensor’s CP decomposition. The rank in the CP decomposition is selected in step
2 and is an upper bound of the true rank of the tensor (Proposition 4.1). Thus, with the chosen rank, we can
obtain a CP decomposition with a very low reconstruction error. In practice, for our cases, the CP decomposition
method (ALS) used in step 3 always converges within a few iterations, with reasonable run time.
1: For each layer $k$, calculate the following properties: layer cushion $\zeta^{(k)}$, weight norm $\|M^{(k)}\|_F$, then calculate
   the RHS $\frac{\epsilon}{n} \prod_{i=k}^{n} \zeta^{(i)} \|M^{(i)}\|_F$ for each $k$
2: Find the smallest $\hat{R}^{(n)}$ such that the tensor noise bound for the last layer $\xi^{(n)}$ satisfies $\xi^{(n)} \le \frac{\epsilon}{n} \zeta^{(n)} \|M^{(n)}\|_F$
3: for $k = n - 1$ to $1$ do
4:   if M does not have skip connections then
5:     Calculate the multiplication of tensorization factor for layers upper than $k$, i.e., $\prod_{i=k+1}^{n} t^{(i)}_{\hat{R}^{(i)}}$, based on
       the choices of $\hat{R}^{(i)}$ for $k < i \le n$
6:     Find the smallest $\hat{R}^{(k)}$ by calculating the largest possible $\xi^{(k)}$ such that Equation 85 holds.
7:   else
8:     Calculate the multiplication of tensorization factor for layers upper than $k$, i.e., $\prod_{i=k+1}^{n} (t^{(i)}_{\hat{R}^{(i)}} + 1)$, based
       on the choices of $\hat{R}^{(i)}$ for $k < i \le n$
9:     Find the smallest $\hat{R}^{(k)}$ by calculating the largest possible $\xi^{(k)}$ such that Equation 86 holds.
10: Return $\{\hat{R}^{(k)}\}_{k=1}^{n}$
Remark. The FBRC algorithm finds a set of ranks that satisfies inequality 85 (CNNs) or 86 (NNs with skip
connections) within polynomial time because of the following guarantees. The total number of possible sets of
ranks (say T ), which the FBRC algorithm will at most search through, is equal to the product of the ranks of
all layers. The rank of each layer is upper bounded by Proposition 4.1 and thus T is polynomial w.r.t. the
shape of the original weight tensors and the number of layers. Moreover, the search will definitely succeed as the
inequalities 85 and 86 automatically hold when R̂(k) = R(k) .
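The backward rank search described in the steps and remark above can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: it assumes the condition being tested has the form $\xi_j^{(k)} \prod_{i>k} t^{(i)}_{\hat{R}^{(i)}} \le \frac{\epsilon}{n} \prod_{i \ge k} \zeta^{(i)} \|M^{(i)}\|_F$ (with $t + 1$ in place of $t$ for skip connections), that each layer's amplitudes are listed in the order in which components are kept, and that all inputs are synthetic. The name `select_ranks` is hypothetical.

```python
import math


def select_ranks(lambdas, cushions, weight_norms, eps, skip=False):
    """Pick, from the last layer to the first, the smallest rank meeting the bound.

    lambdas[k]      -- CP amplitudes of layer k, in the order components are kept
    cushions[k]     -- layer cushion zeta of layer k
    weight_norms[k] -- Frobenius norm of the weight tensor of layer k
    """
    n = len(lambdas)
    ranks = [0] * n
    t_product = 1.0  # product of t (or t + 1) over the layers above the current one
    for k in range(n - 1, -1, -1):
        rhs = eps / n * math.prod(c * w for c, w in zip(cushions[k:], weight_norms[k:]))
        mags = [abs(lam) for lam in lambdas[k]]
        for j in range(1, len(mags) + 1):
            xi_j = sum(mags[j:])  # tensor noise bound when keeping j components
            if xi_j * t_product <= rhs:
                ranks[k] = j
                break
        # xi_R = 0, so j = R always succeeds whenever rhs >= 0.
        t_k = sum(mags[:ranks[k]])  # tensorization factor at the chosen rank
        t_product *= (t_k + 1.0) if skip else t_k
    return ranks


# Synthetic two-layer example.
chosen = select_ranks(
    lambdas=[[4.0, 2.0, 1.0], [3.0, 1.0, 0.5]],
    cushions=[1.0, 1.0],
    weight_norms=[4.0, 4.0],
    eps=0.9,
)
```

On this synthetic input the search keeps one component in the last layer and two in the first, i.e. `[2, 1]`. Because the choice for layer $k$ depends only on the ranks already fixed for the layers above it, a single backward sweep suffices in this sketch.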
$$\rho^{(k)} \xi^{(k)}_{\hat{R}^{(k)}} \prod_{i=k+1}^{n} t^{(i)}_{\hat{R}^{(i)}} \le \frac{\epsilon}{n} \|A^{(k)}\|_F \prod_{i=k}^{n} \zeta^{(i)} \tag{87}$$
1: For each layer $k$, calculate the following properties: reshaping factor $\rho^{(k)}$, layer cushion $\zeta^{(k)}$, weight norm
   $\|A^{(k)}\|_F$, then calculate the RHS $\frac{\epsilon}{n} \|A^{(k)}\|_F \prod_{i=k}^{n} \zeta^{(i)}$ for each $k$
2: Find the smallest $\hat{R}^{(n)}$ such that the tensor noise bound for the last layer $\xi^{(n)}$ satisfies $\rho^{(n)} \xi^{(n)} \le \frac{\epsilon}{n} \zeta^{(n)} \|A^{(n)}\|_F$
3: for $k = n - 1$ to $1$ do
4:   Calculate the multiplication of tensorization factor for layers upper than $k$, i.e., $\prod_{i=k+1}^{n} t^{(i)}_{\hat{R}^{(i)}}$, based on the
     choices of $\hat{R}^{(i)}$ for $k < i \le n$
5:   Find the smallest $\hat{R}^{(k)}$ by calculating the largest possible $\xi^{(k)}$ such that Equation 87 holds.
6: Return $\{\hat{R}^{(k)}\}_{k=1}^{n}$
Algorithm 5 CNN-Project
Input: A convolutional neural network $M$ of $n$ layers where its weight tensor $M^{(k)}$ of the $k$-th layer is parametrized
by $\{\lambda_r^{(k)}, a_r^{(k)}, b_r^{(k)}, c_r^{(k)}\}_{r=1}^{R^{(k)}}$, and a list of ranks $\{\hat{R}^{(k)}\}_{k=1}^{n}$.
Output: Returns a compressed network $\hat{M}$ of $M$ where for each layer $k$, $\hat{M}^{(k)}$ is constructed by the top $\hat{R}^{(k)}$
4: Return $\hat{M}$
Algorithm 6 TNN-Project
Input: A fully connected neural network M of $n$ layers where its weight tensor $M^{(k)}$ of the $k$-th layer is
parametrized by $\{\lambda_r^{(k)}, a_r^{(k)}, b_r^{(k)}, c_r^{(k)}, d_r^{(k)}\}_{r=1}^{R^{(k)}}$, and a list of ranks $\{\hat{R}^{(k)}\}_{k=1}^{n}$.
Output: Returns a compressed network $\hat{M}$ of M where for each layer $k$, $\hat{T}^{(k)}$ is constructed by the top $\hat{R}^{(k)}$