ReduNet: A White-Box Deep Network From The Principle of Maximizing Rate Reduction
Haozhi Qi † hqi@berkeley.edu
John Wright ‡ johnwright@ee.columbia.edu
Yi Ma † yima@eecs.berkeley.edu
†
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley, CA 94720-1776, USA
‡
Department of Electrical Engineering
Department of Applied Physics and Applied Mathematics
Columbia University, New York, NY, 10027, USA
Abstract
This work attempts to provide a plausible theoretical framework that aims to interpret
modern deep (convolutional) networks from the principles of data compression and discrimi-
native representation. We argue that for high-dimensional multi-class data, the optimal
linear discriminative representation maximizes the coding rate difference between the whole
dataset and the average of all the subsets. We show that the basic iterative gradient ascent
scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep
network, named ReduNet, which shares common characteristics of modern deep networks.
The deep layered architectures, linear and nonlinear operators, and even parameters of
the network are all explicitly constructed layer-by-layer via forward propagation, although
they are amenable to fine-tuning via back propagation. All components of the so-obtained
“white-box” network have precise optimization, statistical, and geometric interpretations.
Moreover, all linear operators of the so-derived network naturally become multi-channel
convolutions when we enforce classification to be rigorously shift-invariant. The derivation in
the invariant setting suggests a trade-off between sparsity and invariance, and also indicates
that such a deep convolution network is significantly more efficient to construct and learn
in the spectral domain. Our preliminary simulations and experiments clearly verify the
effectiveness of both the rate reduction objective and the associated ReduNet. All code and
data are available at https://github.com/Ma-Lab-Berkeley.
Keywords: rate reduction, white-box deep network, linear discriminative representation,
multi-channel convolution, sparsity and invariance trade-off
©2021 Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
The practice of deep networks as a “black box.” Nevertheless, the practice of deep
networks has been shrouded with mystery from the very beginning. This is largely caused
by the fact that deep network architectures and their components are often designed based
on years of trial and error, then trained from random initialization via back propagation
(BP) (Rumelhart et al., 1986), and then deployed as a “black box”. Many of the popular
techniques and recipes for designing and training deep networks (to be surveyed in the related
work below) were developed through heuristic and empirical means, as opposed to rigorous
mathematical principles, modeling and analysis. Due to the lack of clear mathematical principles
that can guide the design and optimization, numerous techniques, tricks, and even hacks
need to be added to the above design and training process to make deep learning “work
at all” or “work better” on real world data and tasks. Practitioners constantly face a series
of challenges for any new data and tasks: What architecture or particular components should
they use for the network? How wide or deep should the network be? Which parts of the
networks need to be trained and which can be determined in advance? Last but not least,
after the network has been trained and tested: how to interpret the functions of the
operators; what are the roles and relationships among the multiple (convolution) channels;
how can they be further improved or adapted to new tasks? Besides empirical evaluation, it
is challenging to provide rigorous guarantees for certain properties of the so-obtained networks,
such as invariance and robustness. Recent studies have shown that popular networks in fact
are not invariant to common deformations (Azulay and Weiss, 2018; Engstrom et al., 2017);
they are prone to overfit noisy or even arbitrary labels (Zhang et al., 2017a); and they remain
vulnerable to adversarial attacks (Szegedy et al., 2013); or they suffer catastrophic forgetting
when incrementally trained to learn new classes/tasks (McCloskey and Cohen, 1989; Delange
et al., 2021; Wu et al., 2021). The lack of understanding about the roles of learned operators
in the networks often provokes debates among practitioners about which architecture is better
than others, for example MLPs versus CNNs versus Transformers (Tay et al., 2021). This
situation seems in desperate need of improvements if one wishes deep learning to be a science
rather than an “alchemy.” Therefore, it naturally raises a fundamental question that we
aim to address in this work: how to develop a principled mathematical framework for better
understanding and design of deep networks?
and Tegmark, 2017; Hanin, 2017), generalization bounds (Bartlett et al., 2017; Golowich
et al., 2017) etc. Nevertheless, these efforts are typically based on simplified models (e.g.
networks with only a handful of layers (Zhong et al., 2017; Soltanolkotabi et al., 2018; Zhang
et al., 2019c; Mei and Montanari, 2019)); or results are conservative (e.g. generalization
bounds (Bartlett et al., 2017; Golowich et al., 2017)); or rely on simplified assumptions
(e.g. ultra-wide deep networks (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019;
Buchanan et al., 2020)); or only explain or justify one of the components or characteristics
of the network (e.g. dropout (Cavazza et al., 2018; Mianjy et al., 2018; Wei et al., 2020)).
On the other hand, the predominant methodology in practice still remains trial and error.
In fact, this is often pushed to the extreme by searching for effective network structures and
training strategies through extensive random search techniques, such as Neural Architecture
Search (Zoph and Le, 2016; Baker et al., 2017), AutoML (Hutter et al., 2019), and Learning
to Learn (Andrychowicz et al., 2016).
A new theoretical framework based on data compression and representation.
Our approach deviates from the above efforts in a very significant way. Most of the existing
theoretical work views the deep networks themselves as the object of study, trying
to understand why deep networks work by examining their capabilities in fitting certain
input-output relationships (for given class labels or function values). In this work,
however, we advocate shifting the attention of study back to the data and trying to understand
what deep networks should do. We start our investigation with a fundamental question:
What exactly do we try to learn from and about the data? With the objective clarified,
maybe deep networks, with all their characteristics, are simply necessary means to achieve
such an objective. More specifically, in this paper, we develop a new theoretical framework
for understanding deep networks around the following two questions:
1. Objective of Representation Learning: What intrinsic structures of the data should we
learn, and how should we represent such structures? What is a principled objective
function for learning a good representation of such structures, instead of choosing
heuristically or arbitrarily?
2. Architecture of Deep Networks: Can we justify the structures of modern deep
networks from such a principle? In particular, can the networks’ layered architecture
and operators (linear or nonlinear) all be derived from this objective, rather than
designed heuristically and evaluated empirically?
This paper will provide largely positive and constructive answers to the above questions.
We will argue that, at least in the classification setting, a principled objective for a deep
network is to learn a low-dimensional linear discriminative representation of the data (Section
1.3). The optimality of such a representation can be evaluated by a principled measure from
(lossy) data compression, known as rate reduction (Section 2). Appropriately structured deep
networks can be naturally interpreted as optimization schemes for maximizing this measure
(Sections 3 and 4). Not only does this framework offer new perspectives to understand and
interpret modern deep networks, it also provides new insights that can potentially change
and improve the practice of deep networks. For instance, the resulting networks will be
entirely a “white box” and back propagation from random initialization is no longer the
only choice for training the networks (as we will verify through extensive experiments in
Section 5).
For the rest of this introduction, we will first provide a brief survey of existing work
related to the above two questions. Then we will give an overview of our approach
and contributions before delving into the technical details in the following sections.
Despite its effectiveness and enormous popularity, there are two serious limitations with this
approach: 1) It aims only to predict the labels y even if they might be mislabeled. Empirical
studies show that deep networks, used as a “black box,” can even fit random labels (Zhang
et al., 2017b). 2) With such an end-to-end data fitting, despite plenty of empirical efforts in
trying to interpret the so-learned features (Zeiler and Fergus, 2014), it is not clear to what
extent the intermediate features learned by the network capture the intrinsic structures of the
data that make meaningful classification possible in the first place. Recent work of Papyan
et al. (2020); Fang et al. (2021); Zhu et al. (2021) shows that the representations learned
via the cross-entropy loss (1) exhibit a neural collapsing phenomenon,2 where within-class
variability and structural information are getting suppressed and ignored, as we will also see
in the experiments. The precise geometric and statistical properties of the learned features are
also often obscured, which leads to the lack of interpretability and subsequent performance
1. Classification is where deep learning demonstrated the initial success that has catalyzed the explosive
interest in deep networks (Krizhevsky et al., 2012). Although our study in this paper focuses on
classification, we believe the ideas and principles can be naturally generalized to other settings such as
regression.
2. Essentially, once an over-parameterized network fits the training data, regularization (such as weight
decay) would collapse weight components or features that are not the most relevant for fitting the class
labels. Besides the most salient feature, informative and discriminative features that also help define a
class can be suppressed.
The information bottleneck (IB) formulation (Tishby and Zaslavsky, 2015) further hypoth-
esizes that the role of the network is to learn z as the minimal sufficient statistics for
predicting y. Formally, it seeks to maximize the mutual information I(z, y) (Cover and
Thomas, 2006) between z and y while minimizing I(x, z) between x and z:
$$\max_{\theta\in\Theta} \;\mathrm{IB}(x, y, z(\theta)) \;\doteq\; I(z(\theta), y) - \beta\, I(x, z(\theta)), \quad \beta > 0. \qquad (3)$$
Provided one can overcome some caveats associated with this framework (Kolchinsky et al.,
2018), such as how to accurately evaluate mutual information with finitely many samples of
degenerate distributions, it can be helpful in explaining certain behaviors
of deep networks. For example, recent work (Papyan et al., 2020) indeed shows that the
representations learned via the cross-entropy loss exhibit a neural collapse phenomenon.
That is, features of each class are mapped to a one-dimensional vector whereas all other
information of the class is suppressed. More discussion on neural collapse will be given
in Section 2.3. But by being task-dependent (depending on the label y) and seeking a
minimal set of the most informative features for the task at hand (for predicting the label y
only), the so-learned network may sacrifice robustness in case the labels are corrupted, or
transferability when we want to use the features for different tasks. To address this, our
framework uses the label y as only side information to assist learning discriminative yet
diverse (not minimal) representations; these representations optimize a different intrinsic
objective based on the principle of rate reduction.4
Reconciling contractive and contrastive learning. Complementary to the above
supervised discriminative approach, auto-encoding (Baldi and Hornik, 1989; Kramer, 1991;
3. For example, the digits in MNIST approximately live on a manifold with intrinsic dimension no larger
than 15 (Hein and Audibert, 2005), the images in CIFAR-10 live closely on a 35-dimensional manifold
(Spigler et al., 2019), and the images in ImageNet have intrinsic dimension of ∼ 40 (Pope et al., 2021).
4. As we will see in the experiments in Section 5, indeed this makes learned features much more robust to
mislabeled data.
When the data contain complicated multi-modal low-dimensional structures, naive heuristics
or inaccurate metrics may fail to capture all internal subclass structures or to explicitly
discriminate among them for classification or clustering purposes. For example, one con-
sequence of this is the phenomenon of mode collapsing in learning generative models for
data that have mixed multi-modal structures (Li et al., 2020). To address this, we propose a
principled rate reduction measure (on z) that promotes both the within-class compactness
and between-class discrimination of the features for data with mixed structures.
If the above contractive learning seeks to reduce the dimension of the learned representa-
tion, contrastive learning (Hadsell et al., 2006; Oord et al., 2018; He et al., 2019) seems to do
just the opposite. For data that belong to k different classes, a randomly chosen pair (xi , xj )
belongs with high probability to different classes if k is large.5 Hence it is desirable that
the representation z^i = f(x^i, θ) of a sample x^i be highly incoherent with those z^j
of other samples x^j, while coherent with the feature of its transformed version τ(x^i), denoted
as z(τ(x^i)), for τ in a certain augmentation set T under consideration.
heuristically that, to promote discriminativeness of the learned representation, one may seek
to minimize the so-called contrastive loss:
$$\min_{\theta} \; -\log\frac{\exp\big(\langle z^{i}, z(\tau(x^{i}))\rangle\big)}{\sum_{j\neq i} \exp\big(\langle z^{i}, z^{j}\rangle\big)}, \qquad (6)$$
which is small whenever the inner product ⟨z^i, z(τ(x^i))⟩ is large and ⟨z^i, z^j⟩ is small for i ≠ j.
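As a concrete illustration, below is a minimal NumPy sketch of this contrastive loss for a single anchor sample; the array shapes and names (`Z`, `z_aug`, `i`) are our own illustrative choices, not taken from any particular method discussed here.

```python
import numpy as np

def contrastive_loss(Z, z_aug, i):
    """Loss (6) for anchor i: -log exp(<z^i, z(tau(x^i))>) / sum_{j != i} exp(<z^i, z^j>)."""
    pos = np.exp(Z[i] @ z_aug)                      # similarity to the augmented view
    mask = np.arange(Z.shape[0]) != i
    neg = np.exp(Z[mask] @ Z[i]).sum()              # similarities to all other samples
    return -np.log(pos / neg)

# toy usage with random unit-norm features (m samples of dimension n, stored as rows)
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
z_aug = Z[0] + 0.1 * rng.standard_normal(16)
z_aug /= np.linalg.norm(z_aug)
print(contrastive_loss(Z, z_aug, i=0))
```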
As we may see from the practice of both contractive learning and contrastive learning,
for a good representation of the given data, people have striven to achieve a certain trade-off
between the compactness and discriminativeness of the representation. Contractive learning
aims to compress the features of the entire ensemble, whereas contrastive learning aims to
expand the features of any pair of samples. Hence it is not entirely clear why either of these
two seemingly opposite heuristics seems to help learn good features. Could it be the case
that both mechanisms are needed but each acts on a different part of the data? As we will
5. For example, when k ≥ 100, a random pair belongs to different classes with probability at least 99%.
see, the rate reduction principle precisely reconciles the tension between these two seemingly
contradictory objectives by explicitly specifying that similar features within each class should be
compressed (contracted) whereas the set of all features across multiple classes should be expanded (contrasted).
Almost all of the above networks inherit architectures and initial parameters from sparse
coding algorithms, but still rely on back propagation (Rumelhart et al., 1986) to tune these
parameters. There have been efforts that try to construct the network in a purely forward
fashion, without any back propagation. For example, to ensure translational invariance
for a wide range of signals, Bruna and Mallat (2013); Wiatowski and Bölcskei (2018) have
proposed to use wavelets to construct convolution networks, known as ScatteringNets. As
a ScatteringNet is oblivious to the given data and to feature selection for classification, the
required number of convolution kernels grows exponentially with the depth. Zarka et al. (2020,
2021) have later proposed hybrid deep networks based on the scattering transform and
dictionary learning to alleviate this scalability issue. Alternatively, Chan et al. (2015) has proposed to
construct much more compact (arguably the simplest) networks using principal components
of the input data as the convolution kernels, known as PCANets. However, in both cases of
ScatteringNets and PCANets the forward-constructed networks seek a representation of the
data that is not directly related to a specific (classification) task. To resolve limitations of
both the ScatteringNet and the PCANet, this work shows how to construct a data-dependent
deep convolution network in a forward fashion that leads to a discriminative representation
directly beneficial to the classification task. More discussion about the relationships between
our construction and these networks will be given in Section 4.5 and Appendix E.5.
Whether the given data X of a mixed distribution D = {Dj }kj=1 can be effectively classified
depends on how separable (or discriminative) the component distributions Dj are (or can be
made). One popular working assumption is that the distribution of each class has relatively
low-dimensional intrinsic structures. There are several reasons why this assumption is
plausible: 1). High dimensional data are highly redundant; 2). Data that belong to the
same class should be similar and correlated to each other; 3). Typically we only care
about equivalent structures of x that are invariant to certain classes of deformation and
augmentations. Hence we may assume the distribution Dj of each class has a support on a
low-dimensional submanifold, say Mj with dimension dj ≪ D, and the distribution D of x
is supported on the mixture of those submanifolds, M = ∪kj=1 Mj , in the high-dimensional
ambient space RD , as illustrated in Figure 1 left.
With the manifold assumption in mind, we want to learn a mapping z = f (x, θ) that
maps each of the submanifolds Mj ⊂ RD to a linear subspace S j ⊂ Rn (see Figure 1 middle).
To do so, we require our learned representation to have the following properties, called a
linear discriminative representation (LDR):
[Figure 1: illustration of a linear discriminative representation. The feature map z = f(x, θ) sends each low-dimensional submanifold M_j of the mixture M = M_1 ∪ M_2 ∪ ⋯ ⊂ R^D (left) to a linear subspace S_j ⊂ R^n (middle); a sample x^i ∈ M_j is mapped to a feature z^i ∈ S_j.]
The third item is desired because we want the learned features to reveal all possible causes
why this class is different from all other classes7 (see Section 2 for more detailed justification).
Notice that the first two items align well with the spirit of the classic linear discriminant
analysis (LDA) (Hastie et al., 2009). Here, however, although the intrinsic structures of each
class/cluster may be low-dimensional, they are by no means simply linear (or Gaussian)
in their original representation x, and they need to be made linear through a nonlinear
transform z = f (x). Unlike LDA (or similarly SVM8 ), here we do not directly seek a
discriminant (linear) classifier. Instead, we use the nonlinear transform to seek a linear
discriminative representation9 (LDR) for the data such that the subspaces that represent all
the classes are maximally incoherent. To some extent, the resulting multiple subspaces {S j }
6. Linearity is a desirable property for many reasons. For example, in engineering, it makes interpolat-
ing/extrapolating data easy via superposition which is very useful for generative purposes. There is also
scientific evidence that the brain represents objects such as faces as a linear subspace (Chang and Tsao,
2017).
7. For instance, to tell “Apple” from “Orange”, not only do we care about the color, but also the shape and
the leaves. Interestingly, some images of computers are labeled as “Apple” too.
8. For instance, Vinyals et al. (2012) has proposed to use SVMs in a recursive fashion to build (deep)
nonlinear classifiers for complex data.
9. In this work, to avoid confusion, we always use the word “discriminative” to describe a representation,
and “discriminant” a classifier. To some extent, one may view classification and representation as dual
to each other. Discriminant methods (LDA or SVM) are typically more natural for two-class settings
(despite many extensions to multiple classes), whereas discriminative representations are natural for
multi-class data and help reveal the data intrinsic structures more directly.
can be viewed as discriminative generalized principal components (Vidal et al., 2016) or, if
orthogonal, independent components (Hyvärinen and Oja, 2000) of the resulting features z
for the original data x. As we will see in Section 3, deep networks precisely play the role of
modeling this nonlinear transform from the data to an LDR!
For many clustering or classification tasks (such as object detection in images), we
consider two samples as equivalent if they differ by certain classes of domain deformations
or augmentations T = {τ }. Hence, we are only interested in low-dimensional structures
that are invariant to such deformations (i.e., x ∈ M iff τ (x) ∈ M for all τ ∈ T ), which are
known to have sophisticated geometric and topological structures (Wakin et al., 2005) and
can be difficult to learn precisely in practice even with rigorously designed CNNs (Cohen
and Welling, 2016a; Cohen et al., 2019). In our framework, this is formulated in a very
natural way: all equivariant instances are to be embedded into the same subspace, so that
the subspace itself is invariant to the transformations under consideration (see Section 4).
There are previous attempts to directly enforce subspace structures on features learned
by a deep network for supervised (Lezama et al., 2018) or unsupervised learning (Ji et al.,
2017; Zhang et al., 2018; Peng et al., 2017; Zhou et al., 2018; Zhang et al., 2019b,a; Lezama
et al., 2018). However, the self-expressive property of subspaces exploited by these works
does not enforce all the desired properties listed above, as shown by Haeffele et al. (2021).
Recently Lezama et al. (2018) has explored a nuclear norm based geometric loss to enforce
orthogonality between classes, but does not promote diversity in the learned representations,
as we will soon see. Figure 1 right illustrates a representation learned by our method on the
CIFAR10 dataset. More details can be found in the experimental Section 5.
In this work, to learn a discriminative linear representation for intrinsic low-dimensional
structures from high-dimensional data, we propose an information-theoretic measure that
maximizes the coding rate difference between the whole dataset and the sum of each
individual class, known as rate reduction. This new objective provides a more unifying view
of the above objectives such as cross-entropy, information bottleneck, contractive and contrastive
learning. We can rigorously show that when the intrinsic dimensions of the submanifolds are
known and this objective is optimized, the resulting representation indeed has the desired
properties listed above.
10. After all, at least in theory, Barron (1991) already proved back in early 90’s that a single-layer neural
network can efficiently approximate a very general class of functions or mappings.
11. Convolution kernels/dictionaries such as wavelets have been widely practiced to model and process 2D
signals such as images. Convolution kernels used in modern CNNs are nevertheless multi-channel, often
involving hundreds of channels altogether at each layer. Most theoretical modeling and analysis for
images have been for the 2D kernel case. The precise reason and role for multi-channel convolutions have
been elusive and equivocal. This work aims to provide a rigorous explanation.
not depend exclusively on class labels so that it can work in all supervised, self-supervised,
semi-supervised, and unsupervised settings.
Measure of compactness from information theory. In information theory (Cover
and Thomas, 2006), the notion of entropy H(z) is designed to be such a measure. However,
entropy is not well-defined for continuous random variables with degenerate distributions.
This is unfortunately the case for data with relatively low intrinsic dimension. The same
difficulty resides with evaluating mutual information I(x, z) for degenerate distributions.
To alleviate this difficulty, another related concept in information theory, more specifically
in lossy data compression, that measures the “compactness” of a random distribution is
the so-called rate distortion (Cover and Thomas, 2006): Given a random variable z and a
prescribed precision ε > 0, the rate distortion R(z, ε) is the minimal number of binary bits
needed to encode z such that the expected decoding error is less than ε, i.e., the decoded ẑ
satisfies E[‖z − ẑ‖²] ≤ ε. This quantity has been shown to be useful in explaining feature
selection in deep networks (MacDonald et al., 2019). However, the rate distortion of an
arbitrary high-dimensional distribution is intractable, if not impossible, to compute, except
for simple distributions such as discrete and Gaussian.12 Nevertheless, as we have discussed
in Section 1.3, our goal here is to learn a final representation of the data as linear subspaces.
Hence we only need a measure of compactness/goodness for this class of distributions.
Fortunately, as we will explain below, the rate distortions for this class of distributions can
be accurately and easily computed, actually in closed form!
Rate distortion for finite samples on a subspace. Another practical difficulty in
evaluating the rate distortion is that we normally do not know the distribution of z.
Instead, we have a finite number of samples as learned representations {z i = f (xi , θ) ∈
Rn , i = 1, . . . , m}, for the given data samples X = [x1 , . . . , xm ]. Fortunately, Ma et al.
(2007) provides a precise estimate on the number of binary bits needed to encode finite
samples from a subspace-like distribution. In order to encode the learned representation
Z = [z^1, . . . , z^m] up to a precision, say ε, the total number of bits needed is given by the
following expression: L(Z, ε) ≐ ((m + n)/2) log det(I + (n/(mε²)) Z Z*).13 This formula can be derived
either by packing ε-balls into the space spanned by Z as a Gaussian source or by computing
the number of bits needed to quantize the SVD of Z subject to the precision, see Ma et al.
(2007) for proofs. Therefore, the compactness of learned features as a whole can be measured
in terms of the average coding length per sample (as the sample size m is large), a.k.a. the
coding rate subject to the distortion ε:
$$R(Z, \epsilon) \;\doteq\; \frac{1}{2} \log\det\Big(I + \frac{n}{m\epsilon^{2}}\, Z Z^{*}\Big). \qquad (7)$$
See Figure 2 for an illustration.
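The coding rate (7) is straightforward to evaluate numerically. The sketch below follows the formula as stated, with features stored as the columns of `Z` (variable names are ours); on toy data it confirms that features concentrated near a low-dimensional subspace have a much smaller rate than isotropic ones.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z, eps) = 1/2 * logdet(I + n/(m*eps^2) * Z Z^*), with Z of shape (n, m)."""
    n, m = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (m * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

rng = np.random.default_rng(0)
Z_low = rng.standard_normal((32, 2)) @ rng.standard_normal((2, 500))   # rank ~2 features
Z_iso = rng.standard_normal((32, 500))                                 # isotropic features
Z_low /= np.linalg.norm(Z_low, axis=0)                                 # normalize to the sphere
Z_iso /= np.linalg.norm(Z_iso, axis=0)
print(coding_rate(Z_low), coding_rate(Z_iso))   # the low-dimensional set needs fewer bits
```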
Rate distortion of samples on a mixture of subspaces. In general, the features Z
of multi-class data may belong to multiple low-dimensional subspaces. To evaluate the rate
distortion of such mixed data more accurately, we may partition the data Z into multiple
12. The same difficulties lie with the information bottleneck framework (Tishby and Zaslavsky, 2015) where
one needs to evaluate (difference of) mutual information for degenerate distributions in a high-dimensional
space (3).
13. We use superscript ∗ to indicate (conjugate) transpose of a vector or a matrix
Figure 2: Lossy coding scheme: Given a precision ε, we pack the space/volume spanned by the data
Z with small balls of diameter 2ε. The number of balls needed to pack the space gives
the number of bits needed to record the location of each data point z^i, up to the given
precision E[‖z − ẑ‖²] ≤ ε.
subsets Z = Z^1 ∪ Z^2 ∪ · · · ∪ Z^k according to their class memberships, encoded by a set of
diagonal matrices Π = {Π^j}_{j=1}^k.14 The average number of bits per sample (the coding rate)
subject to the distortion ε is then:
$$R_{c}(Z, \epsilon \mid \Pi) \;\doteq\; \sum_{j=1}^{k} \frac{\mathrm{tr}(\Pi^{j})}{2m} \log\det\Big(I + \frac{n}{\mathrm{tr}(\Pi^{j})\,\epsilon^{2}}\, Z\,\Pi^{j} Z^{*}\Big). \qquad (8)$$
When Z is given, R_c(Z, ε | Π) is a concave function of Π. The function log det(·) in the
above expressions has long been known as an effective heuristic for rank minimization
problems, with guaranteed convergence to a local minimum (Fazel et al., 2003). As it nicely
characterizes the rate distortion of Gaussian or subspace-like distributions, log det(·) can be
very effective in clustering or classification of mixed data (Ma et al., 2007; Wright et al.,
2008; Kang et al., 2015).
14. By a little abuse of notation, we here use Z to denote the set of all samples from all the k classes. For
convenience, we often represent Z as a matrix whose columns are the samples.
span a space (or subspace) of a very small volume and the coding rate should be as small as
possible. Shortly put, learned features should follow the basic rule that similarity contracts
and dissimilarity contrasts.
To be more precise, a good (linear) discriminative representation Z of X is one such
that, given a partition Π of Z, achieves a large difference between the coding rate for the
whole and that for all the subsets:
$$\Delta R(Z, \Pi, \epsilon) \;\doteq\; R(Z, \epsilon) - R_{c}(Z, \epsilon \mid \Pi). \qquad (9)$$
If we choose our feature mapping to be z = f (x, θ) (say modeled by a deep neural network),
the overall process of the feature representation and the resulting rate reduction w.r.t. certain
partition Π can be illustrated by the following diagram:
$$X \;\xrightarrow{\;f(x,\theta)\;}\; Z(\theta) \;\xrightarrow{\;\Pi,\,\epsilon\;}\; \Delta R(Z(\theta), \Pi, \epsilon). \qquad (10)$$
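Putting (7)-(9) together, the rate reduction can be computed from a feature matrix and a label vector by forming the diagonal membership matrices Π^j. The NumPy sketch below is only an illustration of the formulas (with our own names and a toy check), not the released implementation.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    n, m = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (m * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def segmented_rate(Z, labels, eps=0.5):
    """R_c(Z, eps | Pi) from (8): trace-weighted sum of per-class log-dets."""
    n, m = Z.shape
    rc = 0.0
    for j in np.unique(labels):
        Pi_j = np.diag((labels == j).astype(float))      # diagonal membership matrix
        tr = np.trace(Pi_j)
        _, logdet = np.linalg.slogdet(np.eye(n) + (n / (tr * eps**2)) * Z @ Pi_j @ Z.T)
        rc += (tr / (2 * m)) * logdet
    return rc

def rate_reduction(Z, labels, eps=0.5):
    """Delta R(Z, Pi, eps) = R(Z, eps) - R_c(Z, eps | Pi), as in (9)."""
    return coding_rate(Z, eps) - segmented_rate(Z, labels, eps)

# toy check: two classes on orthogonal subspaces vs. both on the same subspace
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
U, _ = np.linalg.qr(rng.standard_normal((16, 4)))
Z_orth = np.hstack([U[:, :2] @ rng.standard_normal((2, 100)),
                    U[:, 2:] @ rng.standard_normal((2, 100))])
Z_same = U[:, :2] @ rng.standard_normal((2, 200))
for Z in (Z_orth, Z_same):
    Z = Z / np.linalg.norm(Z, axis=0)
    print(rate_reduction(Z, labels))     # orthogonal classes give the larger Delta R
```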
The role of normalization. Note that ∆R is monotonic in the scale of the features Z.
So to make the amount of reduction comparable between different representations, we need
to normalize the scale of the learned features, either by imposing that the Frobenius norm of each
class Z^j scales with the number of features in Z^j ∈ R^{n×m_j}: ‖Z^j‖_F² = m_j, or by normalizing
each feature to be on the unit sphere: z^i ∈ S^{n−1}. This can be compared to the use of “batch
normalization” in the practice of training deep neural networks (Ioffe and Szegedy, 2015).15
Besides normalizing the scale, normalization could also act as a precondition mechanism that
helps accelerate gradient descent (Liu et al., 2021).16 This interpretation of normalization
becomes even more pertinent when we realize deep networks as an iterative scheme to
optimize the rate reduction objective in the next two sections. In this work, to simplify the
analysis and derivation, we adopt the simplest possible normalization schemes, by simply
enforcing each sample on a sphere or the Frobenius norm of each subset being a constant.17
Once the representations can be compared fairly, our goal becomes to learn a set of
features Z(θ) = f (X, θ) and their partition Π (if not given in advance) such that they
maximize the reduction between the coding rate of all features and that of the sum of
features w.r.t. their classes:
$$\max_{\theta,\,\Pi} \;\Delta R\big(Z(\theta), \Pi, \epsilon\big) \;=\; R(Z(\theta), \epsilon) - R_{c}(Z(\theta), \epsilon \mid \Pi), \quad \text{s.t.}\;\; \|Z^{j}(\theta)\|_{F}^{2} = m_{j}, \;\; \Pi \in \Omega. \qquad (11)$$
We refer to this as the principle of maximal coding rate reduction (MCR2 ), an embodiment
of Aristotle’s famous quote: “the whole is greater than the sum of the parts.” Note that for
the clustering purpose alone, one may only care about the sign of ∆R for deciding whether
to partition the data or not, which leads to the greedy algorithm in (Ma et al., 2007). More
15. Notice that normalizing the scale of the learned representations helps ensure that the mapping of each
layer of the network is approximately isometric. As it has been shown in the work of Qi et al. (2020),
ensuring the isometric property alone is adequate to ensure good performance of deep networks, even
without the batch normalization.
16. Similar to the role that preconditioning plays in the classic conjugate gradient descent method (Shewchuk,
1994; Nocedal and Wright, 2006a).
17. In practice, to strive for better performance on specific data and tasks, many other normalization schemes
can be considered such as layer normalization (Ba et al., 2016), instance normalization (Ulyanov et al.,
2016), group normalization (Wu and He, 2018), spectral normalization (Miyato et al., 2018).
Figure 3: Comparison of two learned representations Z and Z′ via reduced rates: R is the number
of ε-balls packed in the joint distribution and Rc is the sum of the numbers for all the
subspaces (the green balls). ∆R is their difference (the number of blue balls). The MCR2
principle prefers Z (the left one).
specifically, in the context of clustering finite samples, one needs to use the more precise
measure of the coding length mentioned earlier, see (Ma et al., 2007) for more details. Here
to seek or learn the most discriminative representation, we further desire that the whole
is maximally greater than the sum of the parts. This principle is illustrated with a simple
example in Figure 3.
Relationship to information gain. The maximal coding rate reduction can be viewed
as a generalization of information gain (IG), which aims to maximize the reduction of entropy
of a random variable, say z, with respect to an observed attribute, say π: max_π IG(z, π) ≐
H(z) − H(z | π), i.e., the mutual information between z and π (Cover and Thomas,
2006). Maximal information gain has been widely used in areas such as decision trees
(Quinlan, 1986). However, MCR2 is used differently in several ways: 1) In a typical setting
of MCR2, the data class labels are given, i.e., Π is known; MCR2 then focuses on learning
representations z(θ) rather than fitting labels. 2) In traditional settings of IG, the number
of attributes in z cannot be too large and their values are discrete (typically binary). Here the
“attributes” Π represent the probability of a multi-class partition for all samples and their
values can even be continuous. 3) As mentioned before, entropy H(z) or mutual information
I(z, π) (Hjelm et al., 2018) is not well-defined for degenerate continuous distributions or
is hard to compute for high-dimensional distributions, whereas here the rate distortion R(z, ε)
is well-defined and can be accurately and efficiently computed, at least for (mixed) subspaces.
Z is multiple subspaces (or Gaussians), the rates R and Rc in (11) are given by (7) and
(8), respectively. At any optimal representation, denoted as Z⋆ = Z⋆^1 ∪ · · · ∪ Z⋆^k ⊂ R^n, it
should achieve the maximal rate reduction. One can show that Z⋆ has the following desired
properties (see Appendix A for a formal statement and detailed proofs).
• Between-class Discriminative: As long as the ambient space is adequately large (n ≥ Σ_{j=1}^{k} d_j), the subspaces are all orthogonal to each other, i.e. (Z⋆^i)^* Z⋆^j = 0 for i ≠ j.
In other words, the MCR2 principle promotes embedding of data into multiple independent
subspaces,19 with features distributed isotropically in each subspace (except for possibly
one dimension). In addition, among all such discriminative representations, it prefers the
one with the highest dimensions in the ambient space (see Section 5.1 and Appendix D for
experimental verification). This is substantially different from objectives such as the cross
entropy (1) and information bottleneck (3). The optimal representation associated with
MCR2 is indeed an LDR according to the definition given in Section 1.3.
Relation to neural collapse. A line of recent work Papyan et al. (2020); Han et al.
(2021); Mixon et al. (2020); Zhu et al. (2021) discovered, both empirically and theoretically,
that deep networks trained via cross-entropy or mean squared losses produce neural collapse
features. That is, features from each class become identical, and different classes are
maximally separated from each other. In terms of the coding rate function, the neural
collapse solution is preferred by the objective of minimizing the coding rate of the “parts”,
namely Rc (Z(θ), ε | Π) (see Table 4 of Appendix D). However, the neural collapse
solution leads to a small coding rate for the “whole”, namely R(Z(θ)), hence is not an
optimal solution for maximizing the rate reduction. Therefore, the benefit of MCR2 in
preventing the collapsing of the features from each class and producing maximally diverse
representations can be attributed to introducing and maximizing the term R(Z(θ)).
Comparison to the geometric OLE loss. To encourage the learned features to be
uncorrelated between classes, the work of Lezama et al. (2018) has proposed to maximize the
difference between the nuclear norm of the whole Z and its subsets Z j , called the orthogonal
18. Notice that here we assume we know a good upper bound for the dimension dj of each class. This requires
us to know the intrinsic dimension of the submanifold Mj . In general, this can be a very challenging
problem itself even when the submanifold is linear but noisy, which is still an active research topic
(Hong et al., 2020). Nevertheless, in practice, we can decide the dimension empirically through ablation
experiments, see for example Table 5 for experiments on the CIFAR10 dataset.
19. In this case, the subspaces can be viewed as the independent components (Hyvärinen and Oja, 2000) of
the so-learned features. However, when the condition n ≥ Σ_{j=1}^{k} d_j is violated, the optimal subspaces may
not be orthogonal. But experiments show that they tend to be maximally incoherent, see Appendix E.4.
low-rank embedding (OLE) loss: max_θ OLE(Z(θ), Π) ≐ ‖Z(θ)‖_* − Σ_{j=1}^{k} ‖Z^j(θ)‖_*, added
as a regularizer to the cross-entropy loss (1). The nuclear norm ‖·‖_* is a nonsmooth convex
surrogate for low-rankness, and the nonsmoothness potentially poses additional difficulties in
using this loss to learn features via gradient descent, whereas log det(·) is smooth and concave
instead. Unlike the rate reduction ∆R, OLE is always negative and achieves the maximal
value 0 when the subspaces are orthogonal, regardless of their dimensions. So in contrast to
∆R, this loss serves as a geometric heuristic and does not promote diverse representations.
In fact, OLE typically promotes learning one-dimensional representations per class, whereas
MCR2 encourages learning subspaces with maximal dimensions (Figure 7 of Lezama et al.
(2018) versus our Figure 17). More importantly, as we will see in the next section, the precise
form of the rate distortion plays a crucial role in deriving the deep network operators, with
precise statistical and geometrical meaning.
In the above section, we have presented rate reduction (11) as a principled objective for
learning a linear discriminative representation (LDR) for the data. We have, however,
not specified the architecture of the feature mapping z(θ) = f (x, θ) for extracting such a
representation from input data x. A straightforward choice is to use a conventional deep
network, such as ResNet, for implementing f (x, θ). As we show in the experiments (see
Section 5.1), we can effectively optimize the MCR2 objective with a ResNet architecture
and obtain discriminative and diverse representations for real image data sets.
There remain several unanswered problems with using a ResNet. Although the learned
feature representation is now more interpretable, the network itself is still not. It is unclear
why any chosen “black-box” network is able to optimize the desired MCR2 objective at all.
The good empirical results (say with a ResNet) do not necessarily justify the particular choice
in architectures and operators of the network: Why is a deep layered model necessary;20
what do additional layers try to improve or simplify; how wide and deep is adequate; or
is there any rigorous justification for the convolutions (in the popular multi-channel form)
and nonlinear operators (e.g. ReLU or softmax) used? In this section, we show that using
gradient ascent to maximize the rate reduction ∆R(Z) naturally leads to a “white-box” deep
network that represents such a mapping. The network's layered architecture, linear/nonlinear
operators, and parameters are all explicitly constructed in a purely forward-propagation fashion.
where for simplicity we denote α = n/(mε²), α_j = n/(tr(Π^j) ε²), and γ_j = tr(Π^j)/m for j = 1, . . . , k.
The question really boils down to whether there is a constructive way of finding such a
continuous mapping f (·) from x to z? To this end, let us consider incrementally maximizing
the objective ∆R(Z) as a function of Z ⊂ Sn−1 . Although there might be many optimization
schemes to choose from, for simplicity we first consider the arguably simplest projected
gradient ascent (PGA) scheme:22
$$Z_{\ell+1} \;\propto\; Z_{\ell} + \eta\cdot \frac{\partial \Delta R}{\partial Z}\bigg|_{Z_{\ell}} \quad \text{s.t.} \quad Z_{\ell+1} \subset \mathbb{S}^{n-1}, \quad \ell = 1, 2, \dots, \qquad (13)$$
for some step size η > 0, where the iteration starts with the given data Z_1 = X.23 This scheme
can be interpreted as how one should incrementally adjust locations of the current features
Z` , initialized as the input data X, in order for the resulting Z`+1 to improve the rate
reduction ∆R(Z), as illustrated in Figure 4.
20. Especially it is already long known that even a single layer neural network is already a universal functional
approximator with tractable model complexity (Barron, 1991).
21. As we will see the necessity of such a feature extraction in the next section.
22. Notice that we use superscript j on Z j to indicate features in the jth class and subscript ` on Z` to
indicate all features at `-th iteration or layer.
23. Again, for simplicity, we here first assume the initial features Z1 are the data themselves. Hence the data
and the features have the same dimension n. This need not be the case though. As we will see in
the next section, the initial features can be some (lifted) features of the data to begin with and could in
principle have a different (much higher) dimension. All subsequent iterates have the same dimension.
Figure 4: Incremental deformation via gradient flow to both flatten data of each class
into a subspace and push different classes apart. Notice that for points whose
memberships are unknown, their gradients cannot be directly calculated.
Simple calculation shows that the gradient ∂∆R/∂Z entails evaluating the following derivatives
of the two terms in ∆R(Z):
$$\frac{1}{2}\frac{\partial \log\det(I + \alpha Z Z^{*})}{\partial Z}\bigg|_{Z_{\ell}} \;=\; \underbrace{\alpha\,(I + \alpha Z_{\ell} Z_{\ell}^{*})^{-1}}_{E_{\ell}\,\in\,\mathbb{R}^{n\times n}}\, Z_{\ell}, \qquad (14)$$
$$\frac{1}{2}\frac{\partial\,\big(\gamma_{j}\log\det(I + \alpha_{j} Z \Pi^{j} Z^{*})\big)}{\partial Z}\bigg|_{Z_{\ell}} \;=\; \gamma_{j}\,\underbrace{\alpha_{j}\,(I + \alpha_{j} Z_{\ell} \Pi^{j} Z_{\ell}^{*})^{-1}}_{C_{\ell}^{j}\,\in\,\mathbb{R}^{n\times n}}\, Z_{\ell}\,\Pi^{j}. \qquad (15)$$
Notice that in the above, the matrix E` only depends on Z` and it aims to expand all the
features to increase the overall coding rate; the matrix C`j depends on features from each
class and aims to compress them to reduce the coding rate of each class. Then the complete
gradient ∂∆R/∂Z |_{Z_ℓ} ∈ R^{n×m} is of the form:
$$\frac{\partial \Delta R}{\partial Z}\bigg|_{Z_{\ell}} \;=\; \underbrace{E_{\ell} Z_{\ell}}_{\text{Expansion}} \;-\; \sum_{j=1}^{k} \gamma_{j}\, \underbrace{C_{\ell}^{j} Z_{\ell} \Pi^{j}}_{\text{Compression}}. \qquad (16)$$
Notice that [q_ℓ]_⋆ is exactly the solution to the ridge regression by all the data points Z_ℓ
concerned. Therefore, E` (similarly for C`j ) is approximately (i.e. when m is large enough)
the projection onto the orthogonal complement of the subspace spanned by columns of Z` .
Figure 5: Interpretation of C`j and E` : C`j compresses each class by contracting the features to a
low-dimensional subspace; E` expands all features by contrasting and repelling features
across different classes.
Another way to interpret the matrix E_ℓ is through the eigenvalue decomposition of the covariance
matrix Z_ℓ Z_ℓ^*. Assuming Z_ℓ Z_ℓ^* ≐ U_ℓ Λ_ℓ U_ℓ^* with Λ_ℓ ≐ diag{λ_ℓ^1, . . . , λ_ℓ^n}, we have
$$E_{\ell} \;=\; \alpha\, U_{\ell}\, \mathrm{diag}\Big(\frac{1}{1+\alpha\lambda_{\ell}^{1}}, \,\dots,\, \frac{1}{1+\alpha\lambda_{\ell}^{n}}\Big)\, U_{\ell}^{*}. \qquad (18)$$
Therefore, the matrix E` operates on a vector z` by stretching in a way that directions of
large variance are shrunk while directions of vanishing variance are kept. These are exactly
the directions (14) in which we move the features so that the overall volume expands and
the coding rate will increase, hence the positive sign. To the opposite effect, the directions
associated with (15) are the “residuals” of the features of each class that deviate from the subspace to
which they are supposed to belong. These are exactly the directions in which the features need
to be compressed back onto their respective subspaces, hence the negative sign (see Figure 5).
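A quick numerical check of (18) on random features (a sketch with made-up data and our own variable names): the operator built from the matrix inverse in (14) coincides with the one built from the eigendecomposition of the covariance, and its eigenvalues 1/(1 + αλ) show that high-variance directions are attenuated the most.

```python
import numpy as np

n, m, eps = 8, 200, 0.5
rng = np.random.default_rng(0)
Z = rng.standard_normal((n, m))
alpha = n / (m * eps**2)

E_inv = alpha * np.linalg.inv(np.eye(n) + alpha * Z @ Z.T)      # definition from (14)
lam, U = np.linalg.eigh(Z @ Z.T)                                # covariance eigendecomposition
E_eig = alpha * U @ np.diag(1.0 / (1.0 + alpha * lam)) @ U.T    # closed form (18)
print(np.allclose(E_inv, E_eig))                                # True
```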
Essentially, the linear operations E_ℓ and C_ℓ^j in gradient ascent for rate reduction are
determined by the training data conducting “auto-regressions”. The recent renewed understanding
of ridge regression in an over-parameterized setting (Yang et al., 2020; Wu and Xu, 2020)
indicates that using seemingly redundantly sampled data (from each subspace) as regressors
does not lead to overfitting.
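To make the construction concrete, the sketch below assembles E_ℓ and {C_ℓ^j} from (14)-(15) and takes one projected gradient step (13) using the gradient (16). It is a minimal NumPy illustration with our own names, without the accelerated or spectral-domain variants discussed later.

```python
import numpy as np

def redu_operators(Z, labels, eps=0.5):
    """Expansion E_l and per-class compression C_l^j operators from (14)-(15); Z is (n, m)."""
    n, m = Z.shape
    alpha = n / (m * eps**2)
    E = alpha * np.linalg.inv(np.eye(n) + alpha * Z @ Z.T)
    Cs, gammas, classes = [], [], np.unique(labels)
    for j in classes:
        mask = (labels == j)
        alpha_j = n / (mask.sum() * eps**2)
        Zj = Z[:, mask]
        Cs.append(alpha_j * np.linalg.inv(np.eye(n) + alpha_j * Zj @ Zj.T))
        gammas.append(mask.sum() / m)
    return E, Cs, gammas, classes

def pga_step(Z, labels, eta=0.5, eps=0.5):
    """One projected gradient ascent step (13) on the training features, via gradient (16)."""
    E, Cs, gammas, classes = redu_operators(Z, labels, eps)
    grad = E @ Z
    for C, g, j in zip(Cs, gammas, classes):
        grad -= g * (C @ Z) * (labels == j)     # Pi^j zeroes out columns of other classes
    Z_new = Z + eta * grad
    return Z_new / np.linalg.norm(Z_new, axis=0)  # project each feature back onto the sphere
```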
at a point whose membership is not known, as illustrated in Figure 4. Hence, in order to find
the optimal f (x, θ) explicitly, we may consider constructing a small increment transform
g(·, θ` ) on the `-th layer feature z` to emulate the above (projected) gradient scheme:
$$z_{\ell+1} \;\propto\; z_{\ell} + \eta\, g(z_{\ell}, \theta_{\ell}) \quad \text{subject to} \quad z_{\ell+1} \in \mathbb{S}^{n-1}, \qquad (19)$$
such that [g(z_ℓ^1, θ_ℓ), . . . , g(z_ℓ^m, θ_ℓ)] ≈ ∂∆R/∂Z |_{Z_ℓ}. That is, we need to approximate the gradient
flow ∂∆R/∂Z that locally deforms all (training) features {z_ℓ^i}_{i=1}^m with a continuous mapping g(z)
defined on the entire feature space z_ℓ ∈ R^n. Notice that one may interpret the increment
(19) as a discretized version of a continuous differential equation
$$\dot z \;=\; g(z, \theta). \qquad (20)$$
Hence the (deep) network so constructed can be interpreted as a certain neural ODE (Chen
et al., 2018). Nevertheless, unlike neural ODEs, where the flow g is chosen to have some generic
structure, here our g(z, θ) is meant to emulate the gradient flow of the rate reduction on the
feature set (as shown in Figure 4):
$$\dot Z \;=\; \frac{\partial \Delta R}{\partial Z}, \qquad (21)$$
and its structure is entirely derived and fully determined from this objective, without any
other priors or heuristics.
Inspecting the structure of the gradient (16) suggests that a natural candidate for
the increment transform g(z_ℓ, θ_ℓ) is of the form:
$$g(z_{\ell}, \theta_{\ell}) \;\doteq\; E_{\ell}\, z_{\ell} \;-\; \sum_{j=1}^{k} \gamma_{j}\, C_{\ell}^{j} z_{\ell}\, \pi^{j}(z_{\ell}) \;\in\; \mathbb{R}^{n}, \qquad (22)$$
where π^j(z_ℓ) ∈ [0, 1] indicates the probability of z_ℓ belonging to the j-th class. The increment
depends on: first, a set of linear maps represented by E_ℓ and {C_ℓ^j}_{j=1}^k that depend only on
statistics of the training features Z_ℓ; second, the membership {π^j(z_ℓ)}_{j=1}^k of any feature z_ℓ.
Notice that on the training samples Z_ℓ, for which the memberships Π^j are known, the
so-defined g(z_ℓ, θ) gives exactly the values of the gradient ∂∆R/∂Z |_{Z_ℓ}.
Since we only have the membership π^j for the training samples, the function g(·) defined
in (22) can only be evaluated on the training samples. To extrapolate g(·) to the entire feature space,
we need to estimate π^j(z_ℓ) in its second term. In conventional deep learning, this map
is typically modeled as a deep network and learned from the training data, say via back
propagation. Nevertheless, our goal here is not yet to learn a precise classifier π^j(z_ℓ).
Instead, we only need a good enough estimate of the class information in order for g(·) to
approximate the gradient ∂∆R/∂Z well.
From the geometric interpretation of the linear maps E_ℓ and C_ℓ^j given by Remark 2, the
term p_ℓ^j ≐ C_ℓ^j z_ℓ can be viewed as (approximately) the projection of z_ℓ onto the orthogonal
complement of each class j. Therefore, ‖p_ℓ^j‖₂ is small if z_ℓ is in class j and large otherwise.
This motivates us to estimate its membership with the following softmax function:
$$\widehat{\pi}^{j}(z_{\ell}) \;\doteq\; \frac{\exp\big(-\lambda\,\|C_{\ell}^{j} z_{\ell}\|\big)}{\sum_{j=1}^{k} \exp\big(-\lambda\,\|C_{\ell}^{j} z_{\ell}\|\big)} \;\in\; [0, 1]. \qquad (23)$$
Hence the second term of (22) can be approximated by this estimated membership:
$$\sum_{j=1}^{k} \gamma_{j}\, C_{\ell}^{j} z_{\ell}\, \pi^{j}(z_{\ell}) \;\approx\; \sum_{j=1}^{k} \gamma_{j}\, C_{\ell}^{j} z_{\ell}\cdot \widehat{\pi}^{j}(z_{\ell}) \;\doteq\; \sigma\big([C_{\ell}^{1} z_{\ell}, \dots, C_{\ell}^{k} z_{\ell}]\big), \qquad (24)$$
which we denote as a nonlinear operator σ(·) acting on the outputs of the feature z_ℓ through k groups
of filters: [C_ℓ^1, . . . , C_ℓ^k]. Notice that the nonlinearity arises due to a “soft” assignment of
class membership based on the feature responses to those filters.
Alternatively, one may approximate the above operator with a ReLU nonlinearity:
$$\sigma(z_{\ell}) \;\propto\; z_{\ell} \;-\; \sum_{j=1}^{k} \mathrm{ReLU}\big(P_{\ell}^{j} z_{\ell}\big), \qquad (25)$$
where P_ℓ^j = (C_ℓ^j)^⊥ is (approximately) the projection onto the j-th class and ReLU(x) =
max(0, x). The above approximation is good under the more restrictive assumption that the
projection of z_ℓ onto the correct class via P_ℓ^j is mostly large and positive, yet small or
negative for the other classes.
Overall, combining (19), (22), and (24), the increment feature transform from z` to z`+1
now becomes:
$$z_{\ell+1} \;\propto\; z_{\ell} + \eta\cdot E_{\ell} z_{\ell} - \eta\cdot \sigma\big([C_{\ell}^{1} z_{\ell}, \dots, C_{\ell}^{k} z_{\ell}]\big) \;=\; z_{\ell} + \eta\cdot g(z_{\ell}, \theta_{\ell}) \quad \text{s.t.} \quad z_{\ell+1} \in \mathbb{S}^{n-1}, \qquad (26)$$
with the nonlinear function σ(·) defined above and θ_ℓ collecting all the layer-wise parameters,
that is θ_ℓ = {E_ℓ, C_ℓ^1, . . . , C_ℓ^k, γ_j, λ}. Note that features at each layer are always “normalized”
by projecting onto the unit sphere S^{n−1}, denoted P_{S^{n−1}}. The form of the increment in (26)
is illustrated by the diagram in Figure 6 (left).
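For a new feature vector whose membership is unknown, one layer of the update (26) uses the soft membership estimate (23). The sketch below assumes the operators E_ℓ and {C_ℓ^j} have already been constructed from the training features (as above); the names and the default values of η and λ are illustrative only.

```python
import numpy as np

def redunet_layer(z, E, Cs, gammas, eta=0.5, lam=500.0):
    """One ReduNet layer (26) applied to a single feature z in R^n.

    E      : (n, n) expansion operator E_l.
    Cs     : list of k compression operators C_l^j, each (n, n).
    gammas : list of k class weights gamma_j.
    """
    Cz = [C @ z for C in Cs]                              # responses to the k groups of "filters"
    norms = np.array([np.linalg.norm(c) for c in Cz])
    logits = -lam * norms
    logits -= logits.max()                                # numerically stable softmax, cf. (23)
    pi_hat = np.exp(logits) / np.exp(logits).sum()
    sigma = sum(g * p * c for g, p, c in zip(gammas, pi_hat, Cz))   # nonlinear term (24)
    z_new = z + eta * (E @ z) - eta * sigma               # increment (26)
    return z_new / np.linalg.norm(z_new)                  # project back onto the unit sphere
```

Stacking L such layers, each with its own precomputed (E_ℓ, {C_ℓ^j}), gives the forward-constructed network summarized in Algorithms 1 and 2.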
Figure 6: Network Architectures of the ReduNet and comparison with others. (a): Layer structure
of the ReduNet derived from one iteration of gradient ascent for optimizing rate reduction.
(b) (left): A layer of ResNet (He et al., 2016); and (b) (right): A layer of ResNeXt (Xie
et al., 2017a). As we will see in Section 4, the linear operators E` and C`j of the ReduNet
naturally become (multi-channel) convolutions when shift-invariance is imposed.
As this deep network is derived from maximizing the rate reduction, we call it the ReduNet.
We summarize the training and evaluation of ReduNet in Algorithm 1 and Algorithm 2,
respectively. Notice that all parameters of the network are explicitly constructed layer by
layer in a forward propagation fashion. The construction does not need any back propagation!
The so learned features can be directly used for classification, say via a nearest subspace
classifier.
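One simple instantiation of such a nearest subspace classifier (a generic sketch, not necessarily the exact evaluation protocol used in Section 5): fit an orthonormal basis to the learned features of each class via SVD, then assign a test feature to the class whose subspace leaves the smallest projection residual.

```python
import numpy as np

def fit_class_subspaces(Z, labels, dim=8):
    """Per-class orthonormal bases from the top `dim` principal components of the features."""
    bases = {}
    for j in np.unique(labels):
        U, _, _ = np.linalg.svd(Z[:, labels == j], full_matrices=False)
        bases[j] = U[:, :dim]
    return bases

def nearest_subspace(z, bases):
    """Predict the class whose subspace gives the smallest residual ||z - U U^T z||."""
    residuals = {j: np.linalg.norm(z - U @ (U.T @ z)) for j, U in bases.items()}
    return min(residuals, key=residuals.get)
```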
They require introducing additional skip connections among three layers ` − 1, ` and ` + 1.
For typical convex or nonconvex programs, the above accelerated schemes can often reduce
the number of iterations by an order of magnitude (Wright and Ma, 2021).
Notice that, structure wise, the k + 1 parallel groups of channels E, C j of the ReduNet
correspond to the “residual” channel of the ResNet (Figure 6 middle). Remarkably, here
in ReduNet, we know they precisely correspond to the residual of data auto-regression (see
Remark 2). Moreover, the multiple parallel groups actually draw resemblance to the parallel
structures that people later empirically found to further improve the ResNet, e.g. ResNeXt
(Xie et al., 2017a) (shown in Figure 6 right) or the mixture of experts (MoE) module adopted
in Shazeer et al. (2017).25 Now in ReduNet, each of those groups C j can be precisely
interpreted as an expert classifier for each class of objects. But a major difference here is
25. The latest large language model, the Switch Transformer (Fedus et al., 2021), adopts the MoE architecture,
in which the number of parallel banks (or experts) k is in the thousands and the total number of
parameters of the network is about 1.7 trillion.
that all the above networks need to be initialized randomly and trained via back propagation,
whereas all components (layers, operators, and parameters) of the ReduNet are constructed
explicitly via forward propagation. They all have precise optimization, statistical, and
geometric interpretations.
Of course, like any other deep networks, the so-constructed ReduNet is amenable to
fine-tuning via back-propagation if needed. A recent study from Giryes et al. (2018) has
shown that such fine-tuning may achieve a better trade off between accuracy and efficiency
of the unrolled network (say when only a limited number of layers, or iterations are allowed
in practice). Nevertheless, for the ReduNet, one can start with the nominal values obtained
from the forward construction, instead of random initialization. Benefits of fine-tuning and
initialization will be verified in the experimental section (see Table 2).
where “∼” indicates two features belonging to the same equivalence class. Although
convolutional operators have been common practice in deep networks to ensure invariance or
equivariance (Cohen and Welling, 2016b), it remains challenging in practice to train
an (empirically designed) convolution network from scratch that can guarantee invariance
even to simple transformations such as translation and rotation (Azulay and Weiss, 2018;
Engstrom et al., 2017). An alternative approach is to carefully design convolution filters
of each layer so as to ensure translational invariance for a wide range of signals, say using
wavelets as in ScatteringNet (Bruna and Mallat, 2013) and followup works (Wiatowski and
Bölcskei, 2018). However, in order to ensure invariance to generic signals, the number of
convolutions needed usually grows exponentially with network depth. That is the reason
why this type of network cannot be constructed very deep, usually with only a few layers.
In this section, we show that the MCR2 principle is compatible with invariance in a very
natural and precise way: we only need to assign all transformed versions {x ◦ g | g ∈ G}
into the same class as the data x and map their features z all to the same subspace S.
Hence, all group equivariant information is encoded only inside the subspace, and any
classifier defined on the resulting set of subspaces will be automatically invariant to such
group transformations. See Figure 7 for an illustration of the examples of 1D rotation and
2D translation. We will rigorously show in the next two sections (as well as Appendix B
and Appendix C) that, when the group G is circular 1D shifting or 2D translation, the
resulting deep network naturally becomes a multi-channel convolution network! Because
the so-constructed network only needs to ensure invariance for the given data X or their
features Z, the number of convolutions needed actually remains constant through a very
deep network, as opposed to the ScatteringNet.
26. Again, to simplify discussion, we assume for now that the initial features Z1 are X themselves hence
have the same dimension n. But that does not need to be the case as we will soon see that we need to
lift X to a higher dimension.
$$E_{1} z \;=\; e_{1} \circledast z,$$
where e_1 ∈ R^n is the first column vector of E_1 and “⊛” is circular convolution, defined as
$$(e_{1} \circledast z)_{i} \;\doteq\; \sum_{j=0}^{n-1} e_{1}(j)\, z(i + n - j \;\mathrm{mod}\; n).$$
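This identity is easy to verify numerically: multiplying by a circulant matrix is the same as circularly convolving with its first column, which can also be carried out in the frequency domain (the property exploited for efficient construction in the spectral domain). A small NumPy check with made-up data:

```python
import numpy as np

n = 16
rng = np.random.default_rng(0)
e1 = rng.standard_normal(n)     # stand-in for the first column of a circulant E_1
z = rng.standard_normal(n)

E1 = np.stack([np.roll(e1, j) for j in range(n)], axis=1)       # circulant matrix built from e1
direct = E1 @ z                                                 # matrix-vector product
via_fft = np.real(np.fft.ifft(np.fft.fft(e1) * np.fft.fft(z)))  # circular convolution e1 * z
print(np.allclose(direct, via_fft))                             # True
```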
Similarly, the matrices C1j associated with any subsets of Z1 are also circular convolutions.
Not only do the first-layer parameters E_1 and C_1^j of the ReduNet become circulant convolutions,
but the next-layer features also remain circulant matrices. That is, the incremental
feature transform in (26) applied to all shifted versions of a z_1 ∈ R^n, given by
$$\mathrm{circ}(z_{1}) + \eta\cdot E_{1}\,\mathrm{circ}(z_{1}) - \eta\cdot \sigma\big([C_{1}^{1}\,\mathrm{circ}(z_{1}), \dots, C_{1}^{k}\,\mathrm{circ}(z_{1})]\big), \qquad (34)$$
is a circulant matrix. This implies that there is no need to construct circulant families from
the second-layer features as we did for the first layer. By denoting
$$z_{2} \;\propto\; z_{1} + \eta\cdot g(z_{1}, \theta_{1}) \;=\; z_{1} + \eta\cdot e_{1} \circledast z_{1} - \eta\cdot \sigma\big([c_{1}^{1} \circledast z_{1}, \dots, c_{1}^{k} \circledast z_{1}]\big), \qquad (35)$$
Continuing inductively, we see that all matrices E` and C`j based on such circ(Z` ) are
circulant, and so are all features. By virtue of the properties of the data, ReduNet has taken
the form of a convolutional network, with no need to explicitly choose this structure!
One natural remedy is to improve the separability of the data by “lifting” the original
signal to a higher-dimensional space, e.g., by taking its responses to multiple filters
k_1, . . . , k_C ∈ R^n:
$$z[c] \;=\; k_{c} \circledast x \;=\; \mathrm{circ}(k_{c})\, x \;\in\; \mathbb{R}^{n}, \quad c = 1, \dots, C. \qquad (36)$$
The filters can be pre-designed invariance-promoting filters,27 or adaptively learned from
the data,28 or randomly selected as we do in our experiments. This operation lifts each
original signal x ∈ R^n to a C-channel feature, denoted as z̄ ≐ [z[1], . . . , z[C]]^* ∈ R^{C×n}.
Then, we may construct the ReduNet on vector representations of z̄, denoted as vec(z̄) ≐
[z[1]^*, . . . , z[C]^*] ∈ R^{nC}. The associated circulant version circ(z̄) and its data covariance
matrix Σ̄(z̄) for all its shifted versions are given as:
$$\mathrm{circ}(\bar z) \doteq \begin{bmatrix} \mathrm{circ}(z[1]) \\ \vdots \\ \mathrm{circ}(z[C]) \end{bmatrix} \in \mathbb{R}^{nC\times n}, \qquad \bar\Sigma(\bar z) \doteq \begin{bmatrix} \mathrm{circ}(z[1]) \\ \vdots \\ \mathrm{circ}(z[C]) \end{bmatrix} \big[\,\mathrm{circ}(z[1])^{*}, \dots, \mathrm{circ}(z[C])^{*}\,\big] \in \mathbb{R}^{nC\times nC}, \qquad (37)$$
where circ(z[c]) ∈ Rn×n with c ∈ [C] is the circulant version of the c-th channel of the feature
z̄. Then the columns of circ(z̄) will only span at most an n-dimensional proper subspace in
RnC .
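A small NumPy sketch (with random filters, as used in our experiments) of the lifting (36) and of the stacked circulant matrix in (37); the names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
n, C = 16, 4

def circ(v):
    """Circulant matrix whose columns are the circular shifts of v."""
    return np.stack([np.roll(v, i) for i in range(n)], axis=1)

x = rng.standard_normal(n)
kernels = rng.standard_normal((C, n))               # random lifting filters k_1, ..., k_C

# z[c] = k_c ⊛ x, realized as circ(k_c) x, cf. Eq. (36)
z_bar = np.stack([circ(k) @ x for k in kernels])    # shape (C, n)

# circ(z̄): stack the per-channel circulant matrices, cf. Eq. (37)
circ_z_bar = np.concatenate([circ(z_bar[c]) for c in range(C)], axis=0)   # shape (nC, n)

# Its columns (all shifted versions of z̄, viewed in R^{nC}) span at most
# an n-dimensional subspace of R^{nC}.
assert np.linalg.matrix_rank(circ_z_bar) <= n
```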
However, this simple lifting operation (if linear) is not sufficient to render the classes
separable yet—features associated with other classes will span the same n-dimensional
subspace. This reflects a fundamental conflict between invariance and linear (subspace)
modeling: one cannot hope for arbitrarily shifted and superposed signals to belong to the
same class.
One way of resolving this conflict is to leverage additional structure within each class, in the form of sparsity: signals within each class are not generated as arbitrary linear superpositions of some base atoms (or motifs), but only as sparse combinations of them and their shifted versions, as shown in Figure 8. More precisely, let $D^j = [d_1^j, \ldots, d_c^j]$ denote a matrix with a collection of atoms associated with class $j$, also known as a dictionary; then each signal $x$ in this class is sparsely generated as:
$$x = d_1^j \circledast z_1 + \cdots + d_c^j \circledast z_c = \mathrm{circ}(D^j)\, z, \qquad (38)$$
for some sparse vector $z$. Signals in different classes are then generated by different dictionaries whose atoms (or motifs) are incoherent with one another. Due to this incoherence, signals in one class are unlikely to be sparsely represented by atoms from any other class. Hence all signals from the $k$ classes can be represented as
$$x = \big[\mathrm{circ}(D^1), \mathrm{circ}(D^2), \ldots, \mathrm{circ}(D^k)\big]\, \bar{z}, \qquad (39)$$
where z̄ is sparse.29 There is a vast literature on how to learn the most compact and optimal
sparsifying dictionaries from sample data, e.g. (Li and Bresler, 2019; Qu et al., 2019) and
27. For 1D signals like audio, one may consider the conventional short time Fourier transform (STFT); for
2D images, one may consider 2D wavelets as in the ScatteringNet (Bruna and Mallat, 2013).
28. For learned filters, one can learn filters as the principal components of samples as in the PCANet (Chan
et al., 2015) or from convolution dictionary learning (Li and Bresler, 2019; Qu et al., 2019).
29. Notice that similar sparse representation models have long been proposed and used for classification purposes in applications such as face recognition, demonstrating excellent effectiveness (Wright et al., 2009; Wagner et al., 2012). Recently, the convolutional sparse coding model has been proposed by Papyan et al. (2017) as a framework for interpreting the structures of deep convolutional networks.
Figure 8: Each input signal x (an image here) can be represented as a superposition of sparse convolutions with multiple kernels $d_c$ in a dictionary $D$.
subsequently solve the inverse problem to compute the associated sparse code $z$ or $\bar{z}$. Recent studies of Qu et al. (2020a,b) even show that, under broad conditions, the convolutional dictionary learning problem can be solved effectively and efficiently.
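For concreteness, here is a minimal sketch (with a made-up dictionary, sparsity level, and signal length) of the sparse convolutional generative model (38) for one class:

```python
import numpy as np

rng = np.random.default_rng(2)
n, num_atoms, sparsity = 64, 4, 3

def circ_conv(a, b):
    """Circular convolution via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# A made-up dictionary D^j for one class: each column is an atom/motif d^j_c.
D_j = rng.standard_normal((n, num_atoms))

# Sparse codes: each atom is used at only a few (shifted) locations, cf. Eq. (38).
Z = np.zeros((n, num_atoms))
for c in range(num_atoms):
    support = rng.choice(n, size=sparsity, replace=False)
    Z[support, c] = rng.standard_normal(sparsity)

# x = d^j_1 ⊛ z_1 + ... + d^j_c ⊛ z_c : a sparse superposition of shifted atoms.
x = sum(circ_conv(D_j[:, c], Z[:, c]) for c in range(num_atoms))
```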
Nevertheless, for tasks such as classification, we are not necessarily interested in the precisely optimal dictionary, nor in the precise sparse code of each individual signal. We are mainly interested in whether, collectively, the sets of sparse codes of each class are adequately separable from those of other classes. Under the assumption of the sparse generative model, if the convolution kernels $\{k_c\}_{c=1}^C$ match well with the "transpose" or "inverse" of the above
sparsifying dictionaries D = [D 1 , . . . , D k ], also known as the analysis filters (Nam et al.,
2013; Rubinstein and Elad, 2014), signals in one class will only have high responses to a
small subset of those filters and low responses to others (due to the incoherence assumption).
Nevertheless, in practice, a sufficient number of, say $C$, random filters $\{k_c\}_{c=1}^C$ often suffice to ensure that the so-extracted $C$-channel features:
$$\big[k_1 \circledast x,\ k_2 \circledast x,\ \ldots,\ k_C \circledast x\big]^* = \big[\mathrm{circ}(k_1)x, \ldots, \mathrm{circ}(k_C)x\big]^* \in \mathbb{R}^{C\times n} \qquad (40)$$
of signals from different classes have different response patterns to different filters, hence making the classes separable (Chan et al., 2015).
Therefore, in our framework, to a large extent the number of channels (or the width of the network) truly plays the role of the statistical resource, whereas the number of layers (the depth of the network) plays the role of the computational resource. The theory of
compressive sensing precisely characterizes how many measurements are needed in order to preserve the intrinsic low-dimensional structures (including separability) of the data (Wright and Ma, 2021). As optimal sparse coding is not the focus of this paper, we will use the simple random filter design in our experiments, which is adequate to verify the concept.30

Figure 9: Estimate the sparse code $\bar{z}$ of an input signal x (an image here) by taking convolutions with multiple kernels $k_c$ and then sparsifying.
The multi-channel responses $\bar{z}$ should be sparse. So to approximate the sparse code $\bar{z}$, we may apply an entry-wise sparsity-promoting nonlinear thresholding, say $\tau(\cdot)$, to the above filter outputs by setting low (say, with absolute value below some threshold) or negative responses to zero:
$$\bar{z} \doteq \tau\big(\big[\mathrm{circ}(k_1)x, \ldots, \mathrm{circ}(k_C)x\big]^*\big) \in \mathbb{R}^{C\times n}. \qquad (41)$$
Figure 9 illustrates the basic idea. One may refer to Rubinstein and Elad (2014) for a more systematic study of the design of sparsifying thresholding operators. Nevertheless, here we are not so interested in obtaining the best sparse codes, as long as the codes are
sufficiently separable. Hence the nonlinear operator τ can be simply chosen to be a soft
thresholding or a ReLU. These presumably sparse features z̄ can be assumed to lie on a
lower-dimensional (nonlinear) submanifold of RnC , which can be linearized and separated
from the other classes by subsequent ReduNet layers, as illustrated later in Figure 11.
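A minimal sketch of the lift-then-threshold feature map (41), using random filters and a soft threshold (the threshold value here is arbitrary, and a ReLU could equally be used):

```python
import numpy as np

def soft_threshold(a, eps):
    """Entry-wise sparsifying nonlinearity tau: shrink small responses to zero."""
    return np.sign(a) * np.maximum(np.abs(a) - eps, 0.0)

def lift_and_sparsify(x, kernels, eps=0.1):
    """Approximate sparse code z̄ = tau([k_1 ⊛ x, ..., k_C ⊛ x]), cf. Eq. (41)."""
    responses = np.stack([
        np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(x))) for k in kernels
    ])                                        # shape (C, n): multi-channel responses
    return soft_threshold(responses, eps)

rng = np.random.default_rng(4)
n, C = 64, 8
x = rng.standard_normal(n)
kernels = rng.standard_normal((C, n))         # random lifting filters, as in our experiments
z_bar = lift_and_sparsify(x, kernels)         # presumably sparse C-channel feature
```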
The ReduNet constructed from the circulant version of these multi-channel features $\bar{Z} \doteq [\bar{z}^1, \ldots, \bar{z}^m] \in \mathbb{R}^{C\times n\times m}$, i.e., $\mathrm{circ}(\bar{Z}) \doteq [\mathrm{circ}(\bar{z}^1), \ldots, \mathrm{circ}(\bar{z}^m)] \in \mathbb{R}^{nC\times nm}$, retains the good invariance properties described above: the linear operators, now denoted as $\bar{E}$ and $\bar{C}^j$, remain block circulant, and represent multi-channel 1D circular convolutions. Specifically, we have the following result (see Appendix B.2 for a proof).
Proposition 5 (Multi-channel convolution structures of $\bar{E}$ and $\bar{C}^j$) The matrix
$$\bar{E} \doteq \alpha\big(I + \alpha\, \mathrm{circ}(\bar{Z})\mathrm{circ}(\bar{Z})^*\big)^{-1} \qquad (42)$$
is block circulant, i.e.,
$$\bar{E} = \begin{bmatrix} \bar{E}_{1,1} & \cdots & \bar{E}_{1,C} \\ \vdots & \ddots & \vdots \\ \bar{E}_{C,1} & \cdots & \bar{E}_{C,C} \end{bmatrix} \in \mathbb{R}^{nC\times nC}, \qquad (43)$$
where each $\bar{E}_{c,c'} \in \mathbb{R}^{n\times n}$ is a circulant matrix. Moreover, $\bar{E}$ represents a multi-channel circular convolution, i.e., for any multi-channel signal $\bar{z} \in \mathbb{R}^{C\times n}$ we have
$$\bar{E} \cdot \mathrm{vec}(\bar{z}) = \mathrm{vec}(\bar{e} \circledast \bar{z}),$$
where $\bar{e} \in \mathbb{R}^{C\times C\times n}$ is a multi-channel convolutional kernel with $\bar{e}[c, c'] \in \mathbb{R}^n$ being the first column vector of $\bar{E}_{c,c'}$, and $\bar{e} \circledast \bar{z} \in \mathbb{R}^{C\times n}$ is the multi-channel circular convolution defined as
$$(\bar{e} \circledast \bar{z})[c] \doteq \sum_{c'=1}^{C} \bar{e}[c, c'] \circledast \bar{z}[c'], \qquad \forall\, c = 1, \ldots, C.$$
30. Better learned or designed sparsifying dictionaries and sparse coding schemes may well lead to better classification performance, albeit at a higher computational cost.
Similarly, the matrices C̄ j associated with any subsets of Z̄ are also multi-channel circular
convolutions.
From Proposition 5, the shift-invariant ReduNet is, by construction, a deep convolutional network for multi-channel 1D signals. Notice that even though each lifting kernel in (41) acts on the input separately to produce its own channel, the matrix inverse in (42) for computing $\bar{E}$ (and similarly for $\bar{C}^j$) introduces "cross talk" among all $C$ channels. This multi-channel mingling effect will become clearer when we show below how to compute $\bar{E}$ and $\bar{C}^j$ efficiently in the frequency domain. Hence, unlike
Xception nets (Chollet, 2017), these multi-channel convolutions in general are not depth-wise
separable.31
Recall that a circulant matrix is diagonalized by the discrete Fourier transform (DFT): we may write $\mathrm{circ}(z[c]) = F^* \mathrm{diag}(\mathrm{DFT}(z[c]))\, F$, where $F \in \mathbb{C}^{n\times n}$ is the (unitary) DFT matrix.32 We refer the reader to Fact 5 of Appendix B.3 for more detailed properties of circulant matrices and the DFT. Hence the covariance matrix $\bar{\Sigma}(\bar{z})$ of the form (37) can be converted to a standard "blocks of diagonals" form:
$$\bar{\Sigma}(\bar{z}) = \begin{bmatrix} F^* & & \\ & \ddots & \\ & & F^* \end{bmatrix} \begin{bmatrix} D_{11}(\bar{z}) & \cdots & D_{1C}(\bar{z}) \\ \vdots & \ddots & \vdots \\ D_{C1}(\bar{z}) & \cdots & D_{CC}(\bar{z}) \end{bmatrix} \begin{bmatrix} F & & \\ & \ddots & \\ & & F \end{bmatrix} \in \mathbb{R}^{nC\times nC}, \qquad (44)$$
31. It remains open what additional structures on the data would lead to depth-wise separable convolutions.
32. Here we scale the matrix F to be unitary; hence it differs from the conventional DFT matrix by a factor of 1/√n.
where $D_{cc'}(\bar{z}) \doteq \mathrm{diag}(\mathrm{DFT}(z[c])) \cdot \mathrm{diag}(\mathrm{DFT}(z[c']))^* \in \mathbb{C}^{n\times n}$ is a diagonal matrix. The middle factor on the right-hand side of (44) becomes a block diagonal matrix after a permutation of rows and columns.
Given a collection of multi-channel features $\{\bar{z}^i \in \mathbb{R}^{C\times n}\}_{i=1}^m$, we can use the relation in (44) to compute $\bar{E}$ (and similarly $\bar{C}^j$) as
$$\bar{E} = \begin{bmatrix} F^* & & \\ & \ddots & \\ & & F^* \end{bmatrix} \cdot \alpha\Big(I + \alpha \sum_{i=1}^{m} \begin{bmatrix} D_{11}(\bar{z}^i) & \cdots & D_{1C}(\bar{z}^i) \\ \vdots & \ddots & \vdots \\ D_{C1}(\bar{z}^i) & \cdots & D_{CC}(\bar{z}^i) \end{bmatrix}\Big)^{-1} \cdot \begin{bmatrix} F & & \\ & \ddots & \\ & & F \end{bmatrix} \in \mathbb{R}^{nC\times nC}. \qquad (45)$$
The matrix inside the inverse operator is a block diagonal matrix with $n$ blocks of size $C\times C$ after a permutation of rows and columns. Hence, to compute $\bar{E}$ and $\bar{C}^j \in \mathbb{R}^{nC\times nC}$, we only need to compute, in the frequency domain, the inverses of $n$ blocks of size $C\times C$, so the overall complexity is $O(nC^3)$.33
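As a quick numerical sanity check (a sketch, not the released implementation) of the diagonalization underlying (44) and (45), one can verify that each block $\mathrm{circ}(z[c])\,\mathrm{circ}(z[c'])^*$ equals $F^* D_{cc'} F$ with the unitary DFT matrix $F$ of footnote 32:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
v, w = rng.standard_normal(n), rng.standard_normal(n)   # two channels z[c], z[c']

def circ(u):
    """Circulant matrix whose first column is u."""
    return np.stack([np.roll(u, i) for i in range(n)], axis=1)

# Unitary DFT matrix F (scaled so that F F^* = I, cf. footnote 32).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# circ(z[c]) circ(z[c'])^* = F^* D_{cc'} F,
# with D_{cc'} = diag(DFT(z[c])) diag(DFT(z[c']))^*, cf. Eq. (44).
lhs = circ(v) @ circ(w).T
D = np.diag(np.fft.fft(v)) @ np.diag(np.fft.fft(w)).conj().T
rhs = F.conj().T @ D @ F
assert np.allclose(lhs, rhs)
```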
The benefit of computation with DFT motivates us to construct the ReduNet in the
spectral domain. Let us consider the shift-invariant coding rate reduction objective for shift-invariant features $\{\bar{z}^i \in \mathbb{R}^{C\times n}\}_{i=1}^m$:
$$
\Delta R_{\mathrm{circ}}(\bar{Z}, \Pi) \doteq \frac{1}{n}\,\Delta R(\mathrm{circ}(\bar{Z}), \bar{\Pi})
= \frac{1}{2n}\log\det\!\big(I + \alpha\, \mathrm{circ}(\bar{Z})\mathrm{circ}(\bar{Z})^*\big) - \sum_{j=1}^{k} \frac{\gamma_j}{2n}\log\det\!\big(I + \alpha_j\, \mathrm{circ}(\bar{Z})\bar{\Pi}^j \mathrm{circ}(\bar{Z})^*\big), \qquad (46)
$$
where $\alpha = \frac{Cn}{mn\epsilon^2} = \frac{C}{m\epsilon^2}$, $\alpha_j = \frac{Cn}{\mathrm{tr}(\Pi^j)\,n\,\epsilon^2} = \frac{C}{\mathrm{tr}(\Pi^j)\,\epsilon^2}$, $\gamma_j = \frac{\mathrm{tr}(\Pi^j)}{m}$, and $\bar{\Pi}^j$ is the membership matrix $\Pi^j$ augmented in the obvious way. The normalization factor $n$ is introduced because the circulant matrix $\mathrm{circ}(\bar{Z})$ contains $n$ (shifted) copies of each signal. Next, we derive the ReduNet for maximizing $\Delta R_{\mathrm{circ}}(\bar{Z}, \Pi)$.
Let $\mathrm{DFT}(\bar{Z}) \in \mathbb{C}^{C\times n\times m}$ be the data in the spectral domain, obtained by taking the DFT along the second dimension, and denote by $\mathrm{DFT}(\bar{Z})(p) \in \mathbb{C}^{C\times m}$ the $p$-th slice of $\mathrm{DFT}(\bar{Z})$ along the second dimension. Then, the gradient of $\Delta R_{\mathrm{circ}}(\bar{Z}, \Pi)$ w.r.t. $\bar{Z}$ can be computed from the expansion $\bar{E} \in \mathbb{C}^{C\times C\times n}$ and compression $\bar{C}^j \in \mathbb{C}^{C\times C\times n}$ operators in the spectral domain, defined as
$$
\bar{E}(p) \doteq \alpha\big(I + \alpha \cdot \mathrm{DFT}(\bar{Z})(p) \cdot \mathrm{DFT}(\bar{Z})(p)^*\big)^{-1} \in \mathbb{C}^{C\times C}, \qquad
\bar{C}^j(p) \doteq \alpha_j\big(I + \alpha_j \cdot \mathrm{DFT}(\bar{Z})(p) \cdot \Pi^j \cdot \mathrm{DFT}(\bar{Z})(p)^*\big)^{-1} \in \mathbb{C}^{C\times C}. \qquad (47)
$$
In the above, $\bar{E}(p)$ (resp. $\bar{C}^j(p)$) is the $p$-th slice of $\bar{E}$ (resp. $\bar{C}^j$) along the last dimension. Specifically, we have the following result (see Appendix B.3 for a complete proof).
Figure 10: For invariance to 2D translation, $\bar{E}$ and $\bar{C}^j$ are automatically multi-channel 2D convolutions (the input and output are multi-channel data cubes).
and a ReduNet can be constructed in a similar fashion as before. For implementation details,
we refer the reader to Algorithm 3 of Appendix B.3.
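As a complement to the description above, here is a minimal sketch (a hypothetical helper, not Algorithm 3 from the appendix) of building the per-frequency operators in (47); the value of $\epsilon^2$ is an arbitrary hyper-parameter in this sketch:

```python
import numpy as np

def spectral_operators(Z_bar, Pi, eps_sq=0.5):
    """Construct the expansion/compression operators of Eq. (47) per frequency.
    Z_bar: array of shape (C, n, m) holding m multi-channel features z̄^i.
    Pi:    list of k diagonal (m x m) membership matrices Π^j."""
    C, n, m = Z_bar.shape
    Z_hat = np.fft.fft(Z_bar, axis=1)                 # DFT along the second dimension
    alpha = C / (m * eps_sq)
    E = np.empty((C, C, n), dtype=complex)
    Cs = [np.empty((C, C, n), dtype=complex) for _ in Pi]
    for p in range(n):
        Zp = Z_hat[:, p, :]                           # DFT(Z̄)(p), shape (C, m)
        E[:, :, p] = alpha * np.linalg.inv(np.eye(C) + alpha * Zp @ Zp.conj().T)
        for j, Pj in enumerate(Pi):
            alpha_j = C / (np.trace(Pj) * eps_sq)
            Cs[j][:, :, p] = alpha_j * np.linalg.inv(
                np.eye(C) + alpha_j * Zp @ Pj @ Zp.conj().T)
    return E, Cs    # n inverses of C x C blocks: overall cost O(nC^3), cf. (45)
```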
In our framework, multi-channel lifting, sparse coding, and a multi-layer architecture with multi-channel convolutions and their associated nonlinear operations all become necessary components for achieving the (shift-invariant) rate reduction objective effectively and efficiently. Figure 11 illustrates the overall process of learning such a representation via invariant rate reduction on the sparse codes.

Figure 11: Overview of the process for classifying multi-class signals with shift invariance: multi-channel lifting and sparse coding followed by a (convolutional) ReduNet. These operations are necessary to map shift-invariant multi-class signals to incoherent (linear) subspaces, as an LDR; the so-learned LDR facilitates subsequent tasks such as classification. Note that the architectures of most modern deep neural networks resemble this process. (Panels: Multi-Class Signals → Lifting & Sparse Coding → Invariant Rate Reduction → Incoherent Subspaces.)

Connections to convolutional and recurrent sparse coding. As we have discussed in the introduction, it has long been noticed that there are connections between sparse coding and deep networks, including Learned ISTA (Gregor and LeCun, 2010) and many of its convolutional and recurrent variants (Wisdom et al., 2016; Papyan et al., 2017; Sulam et al., 2018; Monga et al., 2019). Although both sparsity and convolution are advocated as desired characteristics of such networks, their precise roles for the classification task have never been fully revealed. For instance, Papyan et al. (2017) has suggested a convolutional sparse coding framework for interpreting CNNs, which aligns well with the lifting and sparse coding stage of the overall process described above.
of ,multi-channel
convolution
and
,multi-channel ! defined as is
R R R .the convolution ! ′z̄[c ′defined circular as convolution defined as
C×n
c′′′]z̄[c
¯or of Cis the and column vector circular
of is the .,convolution
and ],defined circular as
∈ R C×n
Ē Ē ' C
′c,c c,c
]ē! ē(ē
column z̄ ! R
∈′vector
C×n
Ē .of
c,cĒ ' c,c
.Ē
C
′c,c c,c ′and
! ′(ē
z̄ !! !
(ē z̄ z̄ ∈R
′∈
C×n
z̄)[c] = C×n =.is .the
' C
is
ē[c,convolutionthe multi-channel
ē[c, !
′multi-channel
! ′],
′],∀c
circular ∀cas
circular
= 1, = .. 1, .. ,.,.convolution
C. defined
defined as
as (26)
! )[c] vector
∈z̄)[c] R
.'
C×n
'
' C=
.of
c,c c,c
Ē is
ē[c, c,cthe ē[c, ′c , multi-channel
and c ′column
]z̄[c
ē
!
suggests, !
(ē z̄[c z̄)[c]
!z̄ vector
], ∈ .∀c
z̄)[c] R '
=
once
'
=
C
C
circular
C×n
∀c = 1,of
c,c
= such
c,c
is
Ē ē[c,
1, .c,c
c,c the
,.convolution
ē[c,C.
.′c, multi-channel
sparse
. , ]andC. c (ē
] z̄[c
ē
! (ē
codes!
(ē z̄[c !
!defined
z̄)[c]
z̄ '
.
],∈ ∀c
z̄)[c]
C '
are
z̄)[c]'
' R =
CC
C circular
=∀c
C×n
c =as.=1
′1,
obtained,
= = (26) is
.ē[c,
1, the
,.convolution
(26) C.
ē[c,
ē[c,
. . one
, multi-channel
]C. c
c ]] !
z̄[c!
needs z̄[c],defined
z̄[c ∀c
subsequent
], circular
=
∀c
∀c as
1,== (26) 1, (26)convolution
. C.
.. .
ReduNet
. ,, C.
C. to defined
transform as (26)(26) (26)
hannel convolution
convolution structures
structures of of Ē and
and C̄ C̄ )) The
The matrix
matrix
′ . . C'
c]' = .
' '
CCC ' C
ē[c, c ′
(ē
] ! ! z̄[c z̄)[c] ′
], = . ∀c ' ' '
CCC
= 1,ē[c, . . c . ′
, ] C.
(ē
!! ! !
z̄[c z̄)[c] ], = '
∀c
. '
' = C
CCCc′ =1c′ =1′′
' 1, ē[c, . . ′ . (26)
c , ! C. ] !
! z̄[c ′j ′′
], ∀c = 1, . . (26)
. , C. (26)
(26)
hannel
C C C
. . . ′
c =1′ ′ ′ ′ ′
]] ! ! ′
Similarly, ' . . the
′
c =1′ (ē
matrices ′ (ē !
z̄)[c] z̄)[c] ′ = .
],],associated = .'
. . ' C
ē[c,c ē[c,
′
with =1 c ′
c
.]],(26)
] cany ′ ] ! z̄[csubsets z̄[c j], ′ ], of∀c ∀c = =
are 1, 1,.
also . . .
.(26)
, .C. , C.
..multi-channel circular (26)
convo-(26)
of
tionĒ structures
and C̄ ) !The of Ēmatrix
and C̄ ) TheĒmatrix
j
c] .=.∀c = ' C
.ccē[c,.].(26) c! (ē
! ! ! ′! jz̄[c ′z̄)[c]
them
′ ′ ′ ],],. Similarly,
to.= ∀c the ' =
C 1,
ē[c,
desired ..′=1
(26) .].,(26) cmatrices
.cC.′ ′,]]C.(ē
! ! !
LDRs. ! z̄[c
! ′C̄j!z̄)[c]
z̄)[c]
′ ′ Through
j. ∀c =∀c.= = ' C 1, ē[c,
our c
...(26)
c.(26) =1.derivations,
,C.
c,,c!
c(26) ′′ C. ]]′]′]]any
′
! ! z̄[c ′′j ′], ′ ′ we∀c Z̄ see = 1,
how ...(26)
. .(26) ,,,C.
lifting and sparse coding (26)(26)
.′′.=1 ..,.,(26) (26)
c]
= ∀c == c= ′1, ē[c,
1,
ē[c, ..′=1
ē[c, cc(ē
,c!.′C. ]]′,c]](ē
C.
(ē (ē z̄[cz̄[c z̄[c
z̄)[c]
jz̄)[c]
′], ],Similarly, ∀c
=∀c = .∀c = c= = 1, c
]]′,](ē (ē z̄)[c] = ∀c .. c= ē[c, ē[c, ′
c′C. z̄[c
z̄[c ],], ′′],],],of ∀c∀c = 1, C. C. (26)
= ē[c, ē[c, !
!z̄[c !
z̄[c
z̄)[c] z̄)[c]
],(via ],z̄)[c] j=∀c =∀c = = 1, ē[c, .1,
the
1, ē[c,
ē[c, .1,
ē[c,
..the
.filtering .c.matrices
the ,.matrices
.C. .cc,(ē
,(ē !.′C.C. ! C.
(ē
(ē !
z̄[c z̄[c
! z̄[c
z̄)[c]
z̄[c C̄
z̄)[c] ! ], z̄)[c]
j′j],
z̄)[c] ′C̄
], j=
associated
∀c
j=
=
associated
∀c
= = = =
′′1,
1,.1,
ē[c,
1,
ē[c, .ē[c,
with
ē[c, .(26)
.ē[c, .(26)
cc.,with C. any ! C. !
z̄[c
!
z̄[csubsets z̄[c subsets ∀c
∀c Z̄ of
∀c ==are =
Z̄ 1,
1,areare
..convo-
also
1, also
....(26) (26)C.
.multi-channel
.C.
,,,C. C.multi-channel circularcircular convo-
(26)
(26)convo-
(26)
ociated ]ilarly, = ′thewith
=1
"ē[c,matricesany c ! subsets
(ē C̄
z̄[c ! ], Similarly,
associated
of
lutions. Similarly,
Z̄
∀c = are =
′ =1
1, the
also with .
ē[c, . multi-channel
. ,any matrices
cC. ! (ē
subsets
(ē z̄[cC̄ ! z̄)[c]
C̄ associated
z̄)[c]
], of circular
associated
Z̄
∀c ′= are
=
′ =1
also
1, with
convo- ē[c,
.(26). with
. , any
multi-channel C.
c any
] subsets
! z̄[c
subsets
z̄[c ], of circular
Z̄of
∀c =
are
Z̄= 1,
also
1, also .multi-channel
C. multi-channel circularcircular
j ), and convo-
(26)convo-
.. α ! I + αcirc(
imilarly,
ssociated ′ =1the with matrices any subsets Similarly,
associated
of
random are the
also with matrices multi-channel
any or subsets
sparse associated
of
c ′ circular
=1c =1
deconvolution), are
=1 also with
convo- multi-channel
any subsets
multi-channel of circular are also
convolutionsconvo- multi-channel ( Ē, C̄ circular convo-
" Z̄)circ(Z̄) " C̄ Z̄ C̄ Z̄ Z̄
′ =1
c =1
lutions. lutions. c c ′′c =1 ′
−1
ons.
y, "Ē =
utions. iated
c ′ =1 c c c
thec′ =1
′
′ ′=1
=1 =1c
matrices
with−1 any Similarly, subsets
C̄
(24)
Similarly,
j
Similarly, associated lutions.
of the lutions.
Z̄c
the
lutions. the
′ =1 c c
are
matrices
′c′ =1
=1
′ =1
with
′matrices
cmatrices also any −1 multi-channel subsets
jC̄ jj jassociated
C̄associated j
jjj jassociated
of Z̄grouping circular
c c are =1
′ =1
with
c
cc ′
c
with
=1
with=1 =1
also
′ =1
any
∗
∗ any
convo-multi-channel
−1
−1 j (·),
subsets subsets
subsets ofof of
of Z̄ circular
are are
are also also
also convo- multi-channel
multi-channel
multi-channel (·), and softcircular circular
circular convo-
convo-
convo-
y, iated the matrices
with any subsets
(24)
jSimilarly, associated of the are with
matrices also any multi-channel subsets associated of circular
are Z̄
jjj Similarly, different nonlinear =1 C̄ operations: c′also =1 any
convo-
πb multi-channel subsets normalization Z̄Z̄ circular
are also
P convo- multi-channel circular
thresholding convo-
Ē = α I+ ∗ any C̄ Z̄ the matrices
∗ C̄ associated Z̄ with ∗convo- any subsets of are also circular convo-
ated
y, ted circular
iated
are
the
Z̄)αcirc(Z̄)circ(Z̄) (24)
the are with
withmatrices
also
matrices
with
also any
convo-
any
multi-channel
any
(24)
multi-channel
subsetssubsetssubsets Similarly,
associated
associated
of ofFrom of circular
circular
are are
the are with also
withmatrices
also also
convo- any
convo-
multi-channel
anymulti-channel
multi-channel
subsets subsets associated of ofcircular circular
circular
are
circular
are withalso also convo- anyconvo-
multi-channel
multi-channel circular
circular
are also
C̄ Z̄ Sconvo-
convo- multi-channel circular convo-
Similarly, the matrices associated with any subsets of multi-channel circular convo-
n−1
(+ Z̄
Z̄)αcirc(
e Z̄)circ(
ed
the with with
matrices matrices any subsets
C̄ Similarly,
subsets C̄
Similarly,
jC̄ C̄ lutions.
Similarly,
associated associated
Similarly,
of ofZ̄ the
Z̄ Z̄
the
Z̄ are Z̄
Fromthe
are with Proposition
matrices
thewith
matrices
also matrices
also matrices
any
Proposition any
multi-channel C̄
multi-channel
subsets
C̄ subsets
jC̄ 2.2,
C̄associated
jC̄
associated
C̄ 2.2, ReduNet
associated
of of
associated ReduNet
Z̄ Z̄Z̄ Z̄
circular
are with
arewithisalso
with awith
also
is deep
any
convo-
any a convo- any
multi-channel any
deep subsets convolutional
multi-channel
subsets subsetssubsets
convolutional of
of ofZ̄
Z̄ Z̄
circular
Z̄ network
are
circular
are are also
also
networkalso
also
convo- for
multi-channel
convo- multi-channel
multi-channel
multi-channel
multi-channel for multi-channel circular 1D 1D
circular
circular
circular signals
convo-
convo-
convo-
convo-
signals by by
et ated
m ly, is the
Proposition
a with matrices
deep any convolutional subsets
lutions.
2.2, lutions.
C̄ j
lutions.
lutions. Similarly,
associated
τ
lutions.
ReduNet of
(·) From
From Z̄
construction.are are
network
From isProposition
the
all
Propositionawith
also matrices
derived
deep
Proposition for 10 multi-channel
any multi-channel
convolutional
Figure
2.2,
subsets
as
2.2, C̄ j ReduNet
necessary
2.2, 3 associated
ReduNet of
ReduNet
illustrates circular
Z̄ are
network
1D is
processes
is a
signals
the a also
with
is deep
convo-
deepa
whole for deep by convolutional
multi-channel
any from
convolutional
multi-channel process subsets
convolutional the of of
objective
rate
network
circular
Z̄ are
network
1D also
signals
network
reduction ofconvo-for
for by multi-channel
multi-channel
maximizing multi-channel
for
with multi-channel
such rate
sparse
1D
circular signals
reduction
1D and signals
1D convo-
signals
invariant
by
by by
uNet rom is Proposition
a 10 deeplutions. lutions. lutions.
convolutional
2.2,
lutions. lutions. ReduNet From
construction. network is Proposition a 10 deep for10 convolutional
multi-channel
Figure 2.2, 3ReduNet illustrates network
1D isthe signals
a deep whole forprocess by
convolutional
multi-channel
process of rate network
1D signals
reduction for with by multi-channel such sparse 1D and signals
invariant by
struction.
tes the whole Figure
10process 3 ofofconstruction.
lutions.
illustrates construction.
rate reduction the whole 10 Figure
with Figure process such 3while
3 illustrates
illustrates
sparse
of rate and the
the
reduction invariantwhole
whole with process such of
of
sparse rate
rate reduction
reduction
and invariant with
with such
such sparse
sparse and
and invariant
invariant
strates ⎡
oposition
is
onstruction. a ⎤
deep the 2.2,
convolutional
whole ReduNet
Figure From
process 3
the
features.
Proposition
illustrates
of is learned
construction.
network
construction.
rate a deep reduction the
features
for
convolutional
2.2,
whole
10
10multi-channel Figure
ReduNet
with
Figure process such 3 3
enforcing
illustrates
network
is
illustrates
sparse
of a
rate 1D deep and
reductionthe
signals
thefor shift
invariantwhole
convolutional
wholemulti-channel by invariance.
with process
process such of
network
sparse
of 1Drate
rate reduction
signals
and for
reduction invariant multi-channel
by with
with such
such sparse
1D
sparse and
signals
and invariant
by
invariant
es
es
oposition
oposition
network
spositionures.
leatures.
tion.
sition
tion.
⎤⎡
is
oposition
twork
ais
deep
a1Dathe
sroposition
aadeep
deep
deep
athe deep
1010
10 ⎤
deep
deep
signals
whole
10whole
for
Figure
2.2,
Figure
2.2,
convolutional
convolutional
for
convolutional
2.2,2.2,
convolutional
convolutional2.2, ReduNet
convolutional
2.2,
From
multi-channel
convolutional
multi-channel
by
From
process
From ReduNet
construction.
process 3
From
ReduNet
ReduNet
3ReduNet
From
From
From
From
illustrates
ReduNet
construction.
illustrates
From
construction.
From
From
construction.
Proposition
Proposition
Proposition
of
features.
network
construction.
ComparisonĒĒ
Proposition
features.
isProposition
Proposition
of rate
Proposition
rate
isis
network features.
network
Proposition
Proposition
network
is
network
1D a101D
features.
athe
aProposition
network
is
networkfeatures.
1,1deep
Proposition
is the
aadeep
reduction
a deep
deep
10
reduction
deep
signals
10signals
10whole
Figure···
for
deep
whole
10 ··· 10
for2.2,
for
2.2,
Figure
for
convolutional
Figure2.2,
for
Figure
2.2,
for
convolutional
Figure
2.2,
convolutional
forconvolutional
2.2, 2.2,
Ē
convolutional
2.2,
multi-channel
convolutional
Ē
with 3
ReduNet
2.2,
multi-channel
2.2,
multi-channel
multi-channel
by
ReduNet
process
with
multi-channel
ReduNet
multi-channel
2.2,
process
with
multi-channel
1,C 3
by
illustrates
3
ReduNet
ReduNet
ReduNet
3ReduNet
ReduNet
ReduNet
illustrates
ReduNetsuch
ReduNet
illustrates
3illustrates
such
illustrates
ScatteringNet
ofnetwork
of
isnetwork
is
rate
issparse
rate
sparse
isa1D
network
network
1D
the is
a
network
is
network
isaisis
1D
1D
1D
is
the
a1D
deep
aaa1D
deep
the
reduction
deep
whole
reduction
the
adeep
deep
deep
signals
deep
signals
aadeep
and signals
whole
deep
whole
and signals
signals for
deepwhole
convolutional
signals
for
signals
for for
convolutional
invariant
for
convolutional
for
convolutional
convolutional
process
invariant
and
multi-channel
convolutional
convolutional
multi-channel
convolutional
with multi-channel
by
process
convolutional
convolutional by
multi-channel
multi-channel
by
convolutional
multi-channel
by
with
process
process by
PCANet.
by
bysuch
such of
of
of
of rate
of
network
network
sparse
rate
network
rate
sparse
rate
rate
network
network
1D
network 1D
network
network
network
1D
1D
reduction
1D
network
1D and
reduction
reduction
reduction
Usingand
for
signals
signals
signals
for
signals
signals for
signals
formulti-channel
for
for
for
invariant
for
for
forwith
invariant
convolutional with
multi-channel
multi-channel
by
multi-channel
multi-channel
multi-channel
with by by
multi-channel
multi-channel
multi-channel
by
with multi-channel
by
multi-channel
by
with such
such
such
such
sparse
sparse sparse
sparse
1D
1D
1D
1D
1D
1D
1D
1Dsignals
1Dand
1D
and
operators/with andand
signalsby
signals
signals
signals
signals
signals
invariant
signals
signals
signals
signalsinvariant
invariant
invariant
invariant
by
by
by
by
by
by
by by
by
by
es the
eion.
reduction
tion.
se the reduction
theand
whole
whole
··· whole
Figure
invariant
Figure Ēwith
process
process
withprocess 33illustrates
such
construction.illustrates
illustrates
such
construction.
construction. of
features. ofsparse
of
rate
construction.rate
sparse
Fast
Ē rate reduction
1,1the
reduction
the
10 andreduction
and
10whole
Computation whole
Figure ··· invariant
10FigureFigure
invariant
Figure with
Ēwith
process
3 process
with
1,C 3
illustrates 3such such
illustrates
illustrates
3 in such of
the
illustrates of sparse
sparse
raterate
sparse
Spectral the reduction
the
reduction
the and
and
whole
the and
whole
whole invariant
invariant
Domain.
whole invariant
process with with
process
process
process suchsuch
The of of sparse
rate
of sparse
rate
calculation
rate reduction
reduction andandinvariant
reduction ofinvariant
with with
with
in
with (24)
such such
such
such requires sparse
sparse sparse
sparse and
and
inverting
and
and invariant
invariant
invariant
invarianta ma-
⎣ .. .. . .. ⎦
on. he 10 10
whole Figure
10 process 3 construction.of rate reduction
the 10 10
whole Figure
10 with
process 3 such
illustrates of sparserate reduction
the and whole invariant with
process such of sparse
rate reduction
and Ē
invariant with such sparse and invariant
t.tral whole Figure process 3processillustrates
construction. of rate the
reduction whole Figure with
process in3process illustrates
such of sparse
rate reduction
the and whole invariant with
aprocess such of 3sparse
rate reduction
andandinvariant awith such sparse and invariant
,1
ction. the whole Figure 1,C 3in illustratesconstruction.
of rate 1,1 the whole Figurewith 1,C such3in illustratesof sparse rate and the invariant
whole with
Ē. =
= ⎦⎣ Computation ⎦∈R
Domain. features. features.
The features.
(25) the features.
calculation
pooling Fast
Fast
Spectral Fast Computation
ofComputation to Computation
Domain.
ofensure ×(24) in
The
equivarience/invariance in
the
the
requires the Spectral
Spectral
calculation Spectral
inverting nC×nC Domain.
Domain.
of Ē Domain. ma- inprocess (24) have such The
The requires of
Thesparse3 rate
calculation
calculation
been calculation
inverting
common invariant
of
of the Ē of with
in
ma-
practice in Ē(24) (24) insuch (24)requires
requires
in sparse
requires
deep and
inverting
networksinvariant
inverting
inverting aa ma-
ma- a ma-
∈ R (25),, features. Fast Computation in the Spectral Domain.
features. trix size Ē which has complexity
nC×nC 33The 3).calculation
3By using
Ēof relationshipin (24) requires
between inverting
circulant aa ma-
Fast
s.
hral
.
pectral
R . . .. ,Ē
asofcomplexity
mputation
rix .
has . .
nC×nC
∈ .R . . ,
of
Computation
.
size nC × O(n
Domain.complexity
size
Domain.
nC in the
features.
features.
(25)
nC, which
× The
(25)
features.
nC,Fast
O(n
3in
features.
The
C(LeCun
Spectral
3the
features.
3).matrix
whichC
trix
trix 3By
Computation
calculation
Spectral
calculation
has
).
Fast
trix
of
trix
trix
of
By
using
has
nC×nC
and of
size
size
complexity
of
Domain. and
of
Computation
using
size
size
complexity
size
ofand
nC
Domain.
nC
thenC
Bengio,
of nC
Discrete
Ē nC
Ē
the
nC
×
relationship
× O(n
inin(24)
in
in nC,
nC,
nC,
the
The ×
× (24)
1995;
relationship
×
nC,
nC,Fourier
nC,
O(n
inThe
3which
which
C Cohen
Spectral
calculation
requires
3the
requires
which
which
3
which
).between
C 3By
Spectral
calculation
has
hashas
Transform
).
has
using
between
By
has
inverting
and
Domain.
nC×nC
complexity
complexity
inverting using
ofthis
complexity
circulant
Domain.
Welling,
complexity
complexity
Ē (DFT)
of Ēa ma-
the(DFT)
circulant
relationship
the
ina (24) ma-
in
The
O(n
O(n
O(n
(24)
of
2016a),
relationship O(n
O(n
requires
O(n a
C
C
The
C1D
calculation
requires
33
33). C
).
C
C
calculation
between
butBy
By ).
signal,
33
).
).
By
the
By
between
By
inverting
inverting
using
using using
this
number
using
using
ofthis Ē
the
circulant
of
thecomplexity
circulant
in
Ē
Ē
the
the
the
acomplexity
ma-
a
(24)
ma-
relationship
relationship
in
of
(24)
relationship
convolutions
relationship
relationship
requires
requires
can
between
between
be
inverting
between
significantly
between
between
inverting
needed circulant
circulant
circulant
circulant
circulant
ama-ma-
ma-
Fast Fast matrix
Computation matrix
Computation and Discrete Discrete
in the the Fourier Fourier
a3Spectral
Spectral Transform Transform
Domain.
Domain. (DFT) The of aof 1D
calculation a 1D signal, signal, this
ain (24)complexity
. requires cann×n can
be be significantly
significantly
inverting
ral
mputation
mputation
ral
matrix
rix
mputation
ation ze
ansform
l Domain.
on Domain.
inverting
Transform
complexity
Domain.
and
Domain.
of of Ē
and
Discrete
Ē (DFT)
in a in
in in
in
(24)
ma-
Discrete
(DFT)TheFast
the
(24)
The the
the of
The
The
which Fast
Fast
Fourier
Fast
requiresSpectral
aComputation
3Spectral
Spectral
calculation
trix
1D
has
calculation
requires
Fast
of Computation
calculation
calculation
3Computation
Fourierof a
has
matrix
1D
signal,
reduced.
Computation
Ē Transform
size
By matrix
never
inverting
Computation Domain.
Domain.
Domain.
inverting
matrixsignal,
Transform
complexity using of and
of this
··· of
been
of Specifically,
Ē and
Ē
and
Discrete
in
Ē
theĒ
thisin
a(DFT)
complexity
in Ē a
ma-
in
in
in
the
Discrete
in
clear
in
(24)(24)
Discrete ma-
in
(DFT) The
the
the
(24)
The
the
(24)
complexity
relationship The
which the
of
Spectral and
requires Spectral
Fouriercalculation
requiresrequires
let
Spectral
calculation
calculation
requires
Spectral
1D
Spectral
of
can
Fourier
Fourier
has Ftheir
a By 1D ∈
can
Transform
signal,
be
Domain.
C
inverting
inverting
between
significantly
inverting
Transform
n×n
Domain.
parameters
inverting
Domain.
signal, of
Transform
complexity
using be of
of
Domain. Ēbe
Ē
Ē
significantly
the a
this
circulant in(DFT)
acomplexity
inthe
in ma-ama-
aThe
(24)(24)
relationship
ma-
(DFT)
(24) ma-
(DFT) The
DFT
The
complexity
The need
The
of
3calculation
requires calculation
requires 3a
calculation
matrix,
requiresofto
calculation
of
1D
calculation
calculation
can aaBy
be 1D
1D
signal,
can
between
be
inverting
and of
signal,
learned
inverting
inverting
using
of
of
of
signal,
be of
this
significantly
DFT(z) Ē
Ē
Ē
the
circulantain
inthis
Ēsignificantlyvia
in
thisain
complexity
in(24)
ma- ma-
ma-
relationship
(24)
(24)
(24)back
(24) = .. requires
complexity
complexity F requires
requires
requires
requires ∈can
. zpropagation be
Cinverting
can inverting
inverting
can
betweeninvertingbe
inverting theaaaaama-
significantly
inverting
be significantly
be from ama-
DFTma-
ma-
significantly
circulant a
ma-
ma-
omputation
al
utation
uced.ze C putation
omain. n×n
complexity
nC
Domain.
Domain. ··· × in
Specifically,
be nC,
the
Ē in
the
The in Fast
O(n
the
The
DFTThe
Fast the Fast
Spectral
calculation
which trix trix
3 C
letComputation
SpectralSpectral
Fast
calculation
calculation
Computation
matrix,
trix of
3
of
has
F ). reduced.
Computation
Ē
Ē reduced.
of
size
size
By
∈ reduced.
C,1
Computation
Domain.
and
size
C Domain.
of
complexity n×n
using
nC
Domain.
of Ē
DFT(z)
nC ···
··· Ē×
Specifically,
in
Specifically,
be in
the
×in Specifically,
nC,
in
in Ē
(24)
the
Ē inthe
the
The
nC, O(n
(24)
(24) C,C
in
relationship
= DFT
.
the
The The
which
which Spectral
the
3
Spectral
requires
calculation
which
F3 C
requires
. z let
Spectral
requires calculation
calculation
let
matrix,
3 ∈ ).
Spectral
haslet
hasF F has
C
By ∈
n×nF
∈ Domain.
C
inverting
Domain.
inverting and ∈of
n×n
Domain.
inverting
complexity
C
complexity
betweenusing be
n×nC
Domain.
of Ē
DFT(z)
n×n
the be
Ēin
be
the a a
circulant in
ma- be
the
ma-
in
a(24)
DFTthe The
O(n
ma-
The(24)
(24)
relationship
O(n the=DFT.
The
DFT The calculation
requires
3 DFT
C
requires
requires
calculation
F 3 .C z matrix,
).
calculation
calculation
matrix,
33 3 ∈ ). matrix,
CBy
By
By and
inverting
inverting
n×n
between and
using
using of
inverting
of and
be Ē
of
DFT(z)
Ē Ē
the
DFT(z) Ē
the
the in
in
a a
DFT(z)
circulant ma-(24)
in
ma-
in
a
DFT(24)
relationship(24)
ma-(24)
relationship
relationship =
= requires
F requires
=
requires
F requires
.. zz F ∈
∈ z C inverting
∈ n×n C
inverting
inverting
between
C n×n
between be
n×n
inverting
be be
the
the a
aa
circulant ma-
the
DFT
ma-
a
ma-
DFT
circulant
circulant ma- DFT
omplexity
y
C,1 using ze
omplexity
educed.
nsform
nd tween
e ∈ nCnC
complexity
using C
nC n×n
Discrete the
× ×
circulant
× the nC,
relationship
Specifically,
(DFT) be
nC,nC,O(nO(n O(n
the
C,C O(ntrix
relationship
which
Fourier which
33
of3DFT
C
trix trix
C 3 of
matrix
3
a
CC3
trix
). size
has
3
).
of
1D of
let ).
has
By
randomly
matrix,
). By
Transform
of
between
of
size size
C,1
signal,
F and reduced.
z
nC
between
Bycomplexity
using
complexity
using
reduced.
C,1size
of ∈ ∈ nC nC
using
and
nC C ×R
DiscretenC n×n
n×
the
thisnC,×
circulant
initializedthe×,DFT(z)
(DFT)
R
where
× Specifically,
nC,
circulant
the nC,
Specifically,
nC,
,be which
relationship
O(n
complexity
where
nC, O(n
relationship
O(n
C,C
relationship
the
C,C
Fourier which
ofwhich Cones.
3
which
3
3DFT
C
= denotes
a
CC
C has
3
1D
3
3).
F ).
).
has let
has
Of
let complexity
By
between
matrix,
Transform
denotes z has
can
signal,
∈ F
between
By the
between
C ∈
complexity
using
using
course,
complexity
complexity
F be∈n×n
the
setC
and
C
n×nof
circulant
n×n
significantly
this setthe
circulant
the
one
be
(DFT)
complex
DFT(z) O(n
circulant the be
be
complexity
of
O(n
relationship
relationship
mayO(n
complex
O(n the
DFT
the
3
3of 333DFT
Calso3
a
3
Cnumbers.
DFT
=C C).1D 3
33F By
).
).
33choose
). can
z
numbers.
). matrix,
matrix,
By
By
signal,
∈ using
between
Bybetween beC We
using
an×n
using the
and
andhave
completecirculant
the
be
the
significantly
this
We relationship
DFT(z)
circulant therelationship
relationship
complexity
DFT(z)
have basis
relationship DFT == for F
F canbetween
the ∈
zzbetween
between
between
∈ be C
C
n×n
convolution
n×n circulant
be
circulant
be
circulant the
the
circulant
significantly DFT
DFT
complexity ize trix which of trix size has of
By of complexity
size ∈ nC R ∈ n ×the , where
n nC, which
,relationship
C
.which
denoteshas complexity
between
has
By the set
complexity of
circulant
the complex O(n
relationship 3
Cnumbers. By
). using
between
By We the
have
circulant
the relationship
relationship between
between circulant
circulant
is aa circulant
circulant matrix. matrix. Moreover, Moreover, Ē represents aa multi-channel
z z ×
multi-channel circular circular
3 nC nC, O(n C ).
nCensform
ndn mplexity
eplexity ∈ set
nC R nC
Discrete
× of
n ×
nC, ×
, complex
where
nC,
(DFT)
O(n nC,
O(n O(n
which 3 which
trix
Fourier
matrixCof 3
trix
3 C
matrixC
numbers.
denotes
matrix
).1D 3
has
of
aa1D ). has
of
).
By
matrix
1Dsize By
Transform
and ofsize
complexity complexity
usingthe
Ē represents
signal,
andand zusing
nC
ofWe
and
Discrete ∈ set
nC the
Discrete
×
Discrete ∈nC
have
R ofn
Discrete×
this (DFT)
nC,
R n ×
, relationship
where
complex
nC,
O(n relationship
Fourier nC,
O(n
complexity
where O(n
which Fourier
Fourier
3 which
Fourier
of C C 3 3 C C
C
3
has
denotes
anumbers. 1D
).Transform ).Byhas
).
between
Transform
denotes By between
complexity
Transform
Transformcan using
signal, complexity
using
the We
be the set the
circulant
have of
significantly
this
(DFT) set circulant
relationship
(DFT)
(DFT) complex
of relationship
complexity
O(n O(n O(n
complex
of
3
of
.F aof C∗1D 3C C
anumbers.
). 1D1D
aadiag(DFT(z))F
1D ).).
Bybetween
can
numbers.
signal, By
between
using
signal,
signal,
signal, beusing
We this
significantly
this this
We circulant
circulant
the
have complexity
complexity
3have relationship
relationship
complexity
complexity . a wide can cancan between
between
be bebe circulant
circulant
significantly
significantly
significantly
significantly
er,
n
gnal,
form
d×n
nsform
Discrete
and
×n
n×n
nnd
fsform
matrix.
and be
mis
al,
form
n×n
d
n×n
orm
thez Discrete
Ē represents
Specifically,
d×n
n×n
significantly
Discrete
this
∈
Specifically,
Discrete
(DFT)
Specifically,
circ(z)
Specifically,
bebe
DFT(z) be
this
set
Moreover,
be
DFT(z)
(DFT)
be
Discrete
be
(DFT)
R
(DFT) (DFT)
the
thethe
(DFT)
of
n complexity
the
the
, complexity
where
Fourier
the =
DFTof
DFT
DFT
Fourier
complex Fourier
DFT
DFT
=
of
a multi-channel of
amatrix
let=
of
.∗1D
let
reduced.
letF
a
Ē represents acircular
DFT
.Fmatrix
Fourier
of Fourier
of let a
matrix matrix
C
reduced.
areduced.
matrix F
F
matrix,
areduced.
1D 1D
diag(DFT(z))F
matrix,
matrix,
Freduced.
F F
1D
Transform
can
∈
matrixTransform
denotes
numbers.
matrix and
reduced.
zTransform
matrix,
matrix, signal,
∈∈
reduced.
z ∈ ∈∈and
signal,
filters can
C
signal,
C C
Cand
C
signal,
multi-channel
Transform
Transform C
signal,
signal, and
and
and
and
n×n
n×n
be
n×n
of
n×n
n×n
be
andin
and the
Discrete
and
Discrete
and
Specifically,
and
n×n this
z
significantly
z
Specifically,
Specifically,
Discrete
this
We
Discrete
Specifically, significantly
Discrete
this
each
be
this
∈
Specifically,
Specifically, (DFT)
be
circ(z)
bebe
set
DFT(z)
DFT(z)
bebe
Specifically,
DFT(z)
DFT(z) DFT(z)
this
(DFT)
Discrete (DFT)
R
circular
Discrete
(DFT)
have
the
.(DFT)
complexity
thethe the
the
complexity
the
of
n complexity
,
layer complexity
Fourier
complexity
complexity
Fourier
DFT
DFT=
where
complex
DFT
of
DFT
let
DFTDFT
=
Fourier
Fourier
Fourier
..of
= .F
of
.let
of
let
alet
=
let
to
Fourier=
Fourier
of
=
.
F.∗let
let
F
a
1D
a
matrix,
matrix,zF
FF
C
matrix,
aFF
F
∈
matrix,
z
1D
ensure 1D
Transform
adiag(DFT(z))F
1D 1D z
Transform
∗can
F z
∈ z
can
Transform
can
denotes
numbers.
Transform
∈
can
signal,
∈
∈C∈
∈the
∈
Transform
can ∈
C
signal,
can
C
n×n
C
signal,
Transform
C
Transform
signal,
signal,
be
C
C
Cbe
and
n×n
CC
be
and
n×n
n×n
n×n
be
and
n×n
be
and
n×n
n×nbe
translationalsignificantly
thesignificantly
this
We
be
significantly
n×n
n×n
n×n
n×n this
be be
significantly
this
circ(z)
(DFT)
DFT(z)
significantly
this
(DFT)
be the
DFT(z)
be
significantly
be
circ(z)
circ(z)
bebe
DFT(z)
DFT(z)bebe
set (DFT)
(DFT)
the
.(DFT)
(DFT)
thethe
the
the
complexity
have
(DFT)the
the
complexity
circ(z)
the
the
circ(z)
ofcomplexity
complexity
complexity
DFT (27)
DFT DFT
=
DFT
=
DFT
=
DFT
complex
ofDFT
of
DFT
DFT
DFT
DFT
=
=
..F
==
of
. of
equivariance/invariance
of
.F
.(27)=
= of
aaofF
matrix, F
∗∗1D
F
F
a
1D
F
aFa
matrix, z
matrix,
matrix,
matrix,
1D
1D
az∗1D∗1D
z∗can
1D
diag(DFT(z))F
diag(DFT(z))F
matrix,
matrix,
z ∈ ∈
can
∈
can
numbers.
signal,
∈
cancan
signal, C
diag(DFT(z))F
signal,
signal,
signal,
C and
signal,
signal,
diag(DFT(z))F
C
andCbe and
be
n×n
be
and
n×n
and
n×n
be
be
and
n×n
and
significantly
DFT(z)
significantly
this this
We
this
be
DFT(z)
significantly
significantly
bebethis
significantly
this
DFT(z)
DFT(z)
DFT(z)
bethe
DFT(z)
.
the
the
the
the
complexity
have
complexity
complexity
complexity
.. complexity
for=
DFT
complexity
complexity DFT
.=
.DFT
(27)DFT
.=
.. =
. .
..FFF
= .F
F
zzzand
z
z∈can
∈
can
∈ ∈
∈∈
can
can
C
range
can
can C
can
C
CCC be
n×n
be
n×n
be
n×n
n×n
be
bebe
n×n
n×n
n×n be
significantly
of
be
signals.
significantly
significantly
be the
significantly
significantly
significantly
be
be
be be the
significantly
thethe
the
the
the DFT
(27)
DFT
(27)
DFT
DFT
DFT
DFT DFT
(27) (27)
(27)
multi-channel signal
R set , where
of
signal z̄
complex C
z̄ ∈ R
denotes numbers.
of
reduced. we have
the
have
We R set Werefer ,
Specifically, of
where have complexthe Creader denotes let numbers. to C×n F Appendix ∈ C set
We of have
B.3
be complex the for DFT properties numbers. matrix, of We
and circulant have matrices = ∈ DFT.
C By be using
the the
DFT
n ∗ n . n×n . n×n
d.
Specifically,
×n
ecifically,be Specifically,
be be
the the the
circ(z) DFT DFTDFT
let reduced.
=
let
matrix,
reduced. letF matrix,
matrix,
reduced. F reduced.
of z
diag(DFT(z))F
∈ ∈
andC Specifically,
C and
and
n×n n×n
Specifically,
R
Specifically,
nbe
n Specifically,
be
DFT(z) be
,refer
the where circ(z)
the the
DFT . DFT let
DFTC
let = =
let
Fmatrix, denotesFF let F matrix,
F
matrix,z ∈ F ∈ ∈ C
diag(DFT(z))FC
∈n×n
C
the
and
n×n
C
andand
n×n n×n
setbebe be bethe
DFT(z)
of thebe the circ(z)
the
complex DFT
theDFT DFT . DFT DFT = matrix,
=
Fmatrix, F Fmatrix,
matrix,
numbers. z ∈ and
C
diag(DFT(z))F
C
and
n×n
and
and
n×n
We DFT(z)
be be
DFT(z)
be
DFT(z)
havethe the DFT
DFT .
=DFT = (27)
F FF
z zz ∈ ∈ C Cn×n
C
n×nn×n be be the
be
be the thetheDFTDFT
DFT DFT (27)
F ∈zC n×n DFT(z) = F F z ∈ C n×n DFT(z) = F ∈ CDFT.
z ∈ n×n DFT(z) = F ∈C
z ∈ n×n
multi-channel set n, where of complex
we have∈ R
C denotes numbers.
of
we the z
∈We ∈R setWeWerefer ,,of ofwhere havecomplex the the Creader denotes
reader numbers.
z to C×n the
Appendix
to Appendix We have
B.3 B.3 for for properties numbers. properties of We circulant
ofhave have
circulant matrices F z and
matrices andDFT. DFT. By using
By using the the
n n
of F of ∈ zof ∈R ∈ nR
,We n
,refer
DFT(z)
where where C C =
denotes denotes F ∈ the the set set of of
DFT(z) complex complex for= numbers. numbers.z and We We DFT(z) have =
gnal
We
we have
et
Rdix We tet set
z̄ ∈ R
,of
refern,of
have
where
where
of
have
B.3complex
complex
thecomplex
forC C
reader denotes
denotes
properties numbers.
numbers. numbers.
ztoofC×n ∈ zthe
Appendix the
of We We
set
R set We
circulant have
of
have
where
B.3 have
complex
complexthe for
matrices Creader denotes
properties numbers. numbers. to
and C×n the
Appendix
DFT. of We set
We
circulant
By of
have
have
B.3 complex
using matrices the properties numbers. of Wecirculant
By have using matrices the and DFT. By using the the
n
We R endix of,
referwhere
complex
B.3 the C
for denotes
of
reader ∗ of
numbers.
properties z
∈ of
to R
∈ zthe
z relation
n R
Appendix∈ ∈,Weof We
n
WeR
set
R ,
where
n
where
n
have
of,refer
circulant,
refer where
in
wherecomplex C
B.3 (27), the
C denotes
the forCCdenotes
matrices Ē ∗denotes
reader
reader denotes can
numbers.
properties the be to
to the
and the
computed
setAppendix
the
Appendixset We set
set
of
DFT.
of of of
have
of
complex
circulant complex
By complex
asB.3
complex
B.3 using for
for ∗∗numbers.
numbers.
matrices the numbers.
properties
numbers.
36
properties and We We
Weof
We
DFT.
ofhave circulant
have
have
circulant
By using matrices
matrices the and
and DFT.
DFT. By
By using
using the
where
puted f
tion , of where
complex
circ(z) complex C
in.as.(27), Cdenotes
= denotes
numbers.
F numbers.
of
∗∗ ∗can ∗ ofz the
diag(DFT(z))F
z ∈z the
∈R
be computed set
We Rset
relation
n We
relation , n
of ,
where
have
circ(z)
relation of where
have
complex complex
in
in as..in C (27),
.in
(27),= Cdenotes
(27), F denotes
numbers.
Ē ∗ numbers. can
diag(DFT(z))F
Ē
cancan the
can be
(be be the set
We
computed
be
computed set
We of have
circ(z)
computed of
circ(z) have
complex complex as
.as= = = as F F (27)
numbers.
∗ numbers.
diag(DFT(z))F
diag(DFT(z))F We We have have ... −1 (27) (27) (27)
(27)
Ē ∗Ē ) .= * (27) ( D ··· D . )+ ( ) (27)
(27)
circ(z)
z̄) = vec(ē ! z̄). diag(DFT(z))F circ(z) diag(DFT(z))F circ(z)
= F
∗ relation = (27),F ∗ computed . as ∗
F ∗∗∗diag(DFT(z))F
diag(DFT(z))F . (27)
circ(z)
z))F elationirc(z)
(z))F circ(z)
mputed Ē ·· vec(
=
in* = =F
as(27),
vec(z̄) F (27)
F diag(DFT(z))F
diag(DFT(z))F
diag(DFT(z))F
ĒAppendix can be circ(z)
circ(z)
computed
relation . . =in = as F (27)
(27), F (27) ∗ diag(DFT(z))F Ē
diag(DFT(z))F
Ē can
( )∗∗be
( ∗ circ(z) circ(z)
computed
circ(z) ) .) =F =
* as (27)
(27) F(27)
F * ∗ (27)
diag(DFT(z))F diag(DFT(z))F
diag(DFT(z))F ( ( )+. )+ (27)
(27) (27) ( ( ) ) (27)
(27)
(27)(27)
! z̄). = vec(ē ! z̄).
∗
xcirc(z) theB.3)= reader
for F=∗FpropertiesF ∗todiag(DFT(z))F
properties (We ( refer
of··· circ(z)
circulant
B.3
the.))+ ..for=
reader F* = matrices
properties F (∗todiag(DFT(z))F diag(DFT(z))F
Appendix
(Appendix and
( Fof F
DFT.
0
circulant
circ(z) circ(z)
B.3
0
circ(z) By
.)00)+ for
= .. )using= =
matrices
FF* F
properties F (∗the
diag(DFT(z))F
(27) diag(DFT(z))F
diag(DFT(z))F
and
( of
11
)DFT.
Dcirculant ···By1C.
1C
. )+ .using
matrices (27) the
( F (andF 0 0
DFT.
) 0 ) By By usingusing (27) the
(27)
!ec(
∗ ∗ −1 −1
xirc(z) (z) Ē = diag(DFT(z))F
diag(DFT(z))F We refer
circ(z) circ(z)01Cthe = reader
−1 F diag(DFT(z))F toAppendix F( circ(z)
.00)
∗
circ(z)B.3 = for =−1 (27) F
properties*diag(DFT(z))F diag(DFT(z))F D(
of
.11 circulant
···11 D .1CDD . matrices
)+ (27)
matrices
−1 and
0 F 0 0DFT. (27)the
culant
T.
B.3
er
B.3 in x z̄).
0z̄) = vec(ē
0the
rculant
B.3
the
ted
B.3
B.3
the B.3
By
(27),
the
he reader
0asfor
for
reader
reader
for
for
for
)using
reader
matrices
for
reader
· α properties
properties
Ēmatrices
I to
*to
properties
can
properties
to
propertiesto
theWe
to
We
Appendix
be We
Appendix
We
AppendixWe
We
Appendix
andWe
D
refer
and
relation
Appendix We(of
computed
refer We.
F
of
11
of
refer
(
refer
DFT.
refer
relation
of
D
. refer
of
∗
refer
of .F
0∗circulant
the
refer
circulant
DFT. circulant
circulant
refer
.
11
.the
circulant. B.3
D
in
B.3
the
the
circulant
B.3B.3
the
By
0B.3
the
···
the in . reader
the
By
(27),
.readeras
the
D for
0reader
(27),
for
reader
for
reader
using
reader
1C
reader
for
for
)matrices
· αmatrices
reader
using)+
matrices
matrices
reader
Ē
matrices
properties
to
properties
matrices
properties
toI· to
properties
Ē can
properties
−1* to
the
to
to
Appendix
to Ē
the
Appendix
can Ē
F
to
to
ĒαAppendix be
Appendix
( =
Appendix
and
and
Appendix
=.
be
Ē
0 and
Appendix
and
and
D
and
Fcomputed
Appendix
. computed
=
0(11
of
of
of
of
DFT.
0of
. DFT.
0
. DFT.
(
DFT.
D F
0.F
F
DFT.
circulant
···
.B.3
circulant
DFT. .
11 . By
circulant
B.3
. .
,circulant
∗ D
circulant
. .. B.3
.
B.3
∗ ···
00
B.3
B.3
By
0 0
.
. By
B.3
By
1C
B.3
. as
..∗for
By
for
By
D 0
for
using
for
using
asfor for
1C
·
·
)
for
using
for
using
matrices
)+
properties
α
matrices
using
matrices
properties
α
·using ·
matrices
properties
matrices properties
α
(28)
properties I
I
properties
the
−1
properties
properties the
I· the
*the +
+ + α and
the
Ithe
Fα(and
α + and
and
.
0 D
and of
Fα of
.of of
0of
(
of
. of
DFT.
11
0ofDFT.
DFT.
circulant
DFT.
D
D
.circulant
. DFT.
0.
.···
circulant
.
. ) ···
circulant
circulant
D
By
11. .By using
circulant
11
circulant
.
,.circulant
.. . By
···
.
By
By
. D .using
1C
matrices
using
using
1Cmatrices
1C
. matrices
. .
)+
matrices
matrices
matrices
usingmatrices (28)
−1
·
the
the
·
−1the
the
· theand
F(
and
·
0 .
.and
and
and
and
and
.
0
.
.
F
F. 0.0
and
.. DFT.
and
00
DFT.DFT.
.
DFT.
DFT.
DFT.
DFT.
DFT.
. .DFT.
,
0
0 ,By
,
)By
, ByBy
By
By
ByBy
By
using
usingusing
using
using
using
using
usingusing the
the
(28)
the
the
(28)
the
the
(28) the
the
the (28)
3is a multi-channel ..convolutional
0...kernel with being
eed
uted ted
nin
ded reader
n0. )for asproperties
(27),
as
(27),
as
(27),
as * ·Ē
Ē tocan
ĒĒ Appendix
can
canWebe of
refer
relation
be
be
(+( . circulant
relation
computed
relation ..B.3
0 in
relation
relation
computed
computed the
.inin
.(27),
..0.in
in
.) reader
formatrices
.(27),
as
(27),
(27),
(27),
as
as Ē)+
*properties
+ĒĒ
can
Ē
Ē to
Ē Appendix
can
can
be
can
can((and
α
be
·be of DFT.
=0...circulant
computed
computed
computed
be
( be
Ē computed
=
.computed
0
0) .. B.3
00. .By
as
.,F
) .0.as
..)
F 0 for
using
0matrices
as
as)+..*properties
αthe
·· α I(00·and
( I(28) + of
α
.D
DDFT.
0....circulant
) ..···. By
ē[c, D......c
)+′matrices
using ∈
..] −1 the
( R ·0·and
0
(28)n00. FDFT.
0 .)
0.) 0 By,, the
+ usingfirst
the(28)
(28)
= α
is in a
as multi-channel
(27), α can be
Irelation
Ē α=computed
relation in asconvolutional
(27), ·−1
α can
Ibe+ be
Ēα =computed kernel
as with + α ,··· ′
R n
being... F
the first
0 C1 CC
d)el relation in in
(27), (27), can can
Ē be ( computed
computed as as * ( ē[c, c)+ ] ∈ (
0kernel F convolutional
*−1−1 with kernel ∈−1(Rwith 00being the ∈first(R 0being ··· the −1−1first
∗
.
(27),
as
) * can be
( (relation
computed . in ) .. ′
(27),
as Ē.)+
* Ē can ((n
be
( computed .
0 ) . 0
0 ) .
as ′
0
)+
F .
* −1 ( ( n ).D0···. D . D
)+′ . ( 0 n 00 F 0
) 0 ∗
)27), s) can berelation
computed 0ē[c,
in (27),
as cD)+
F ]*
can be computed ē[c, as 0c ]*−1
∗ ∗ ∗ −1 ∗∗ ∗
F ) 0
)+Ē
)+
** Ē
* ( ((
((
( (
((DD
F D 0
)0 )···0
0 D
)D
F
))
0
)+
)+ Ē*
)+
* Ē −1 (( (
(0
(
((
FFD
00
11
∗ C1
F D
0F 0)0
)0
1C
10 CC
···
0 0
)···
)000D
0 )D
) F) *0
)+
)+ ***−1 (( F( ( D
0
0
D0 DF
0 )···
)
···
D
D
···D D
)+
··· )+
)+
D −1 ( (( F
F 0 00 0
11C1
0 0 0
0F
) )
0)
1C)
CC 0
F ∗ 11
C1 C1
11C1 C1 1C
CC CC
CC
D
∗ FD D ···D
······
C×n 0D···
0
···
D DD0Unlike Xception
−1
−1
−1 ((FF(
nets
FD
∗∗ 11 C1
DDF 00)···
Chollet
0···
0
∗
···F0
D
1C CC 0D0
00D
D
0)
)···(2017),)+
F
F*
D these
−1
−1 F(
D (D
multi-channel
0
D
D 00
··· )···D
···FD
···
···D D
D
D )+
convolutions
)+
D −1
−1 ∗ in( FF 0
F ∗∗
(general 00are
00 00
∗
∗11 0)) Fnot depthwise sepa-
C1 1C
∗
CC 11
11
11
1C
C1 1C
1C
CC
CC
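To make the two facts used above concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) checks numerically that multiplying by a circulant matrix circ(z) is the same as circular convolution with z computed via the DFT as in (27), and evaluates the multi-channel circular convolution (26) in the spectral domain.

    import numpy as np

    def circ(z):
        """Circulant matrix whose first column is z (columns are cyclic shifts of z)."""
        return np.stack([np.roll(z, k) for k in range(len(z))], axis=1)

    rng = np.random.default_rng(0)
    n, C = 8, 3
    z, x = rng.standard_normal(n), rng.standard_normal(n)

    # (27): circ(z) is diagonalized by the DFT, so circ(z) @ x equals the
    # circular convolution of z and x computed with FFTs.
    direct = circ(z) @ x
    spectral = np.fft.ifft(np.fft.fft(z) * np.fft.fft(x)).real
    assert np.allclose(direct, spectral)

    # (26): multi-channel circular convolution (e ⊛ zbar)[c] = sum_{c'} e[c, c'] ⊛ zbar[c'].
    e = rng.standard_normal((C, C, n))   # multi-channel kernel ē
    zbar = rng.standard_normal((C, n))   # multi-channel signal z̄
    out = np.stack([sum(np.fft.ifft(np.fft.fft(e[c, cp]) * np.fft.fft(zbar[cp])).real
                        for cp in range(C)) for c in range(C)])
    print(out.shape)  # (C, n)

The same diagonalization is what makes (28) cheap: since each D_{c,c'} acts frequency by frequency, the nC × nC inverse decouples into n independent C × C inverses.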
ScatteringNet (Bruna and Mallat, 2013) and many follow-up works (Wiatowski and Bölcskei, 2018) have shown how to use the modulus of the 2D wavelet transform to construct invariant features. However, the number of convolutions needed usually grows exponentially in the number of layers. That is the reason why ScatteringNet-type networks cannot be very deep and are typically limited to only 2-3 layers. In practice, ScatteringNet is often used in a hybrid setting (Zarka et al., 2020, 2021) for better performance and scalability. On the other hand, PCANet (Chan et al., 2015) argues that one can significantly reduce the number of convolution channels by learning features directly from the data. In contrast to ScatteringNet and PCANet, the proposed invariant ReduNet by construction learns equivariant features from the data that are discriminative between classes. As experiments on MNIST in Appendix E.5 show, the representations learned by ReduNet indeed preserve translation information. In fact, the scattering transform can be used with ReduNet in a complementary fashion: one may replace the aforementioned random (lifting) filters with the scattering transform. As experiments in Appendix E.5 show, this can yield significantly better classification performance.
5. Experimental Verification
In this section, we conduct experiments to (1). Validate the effectiveness of the proposed
maximal coding rate reduction (MCR2 ) principle analyzed in Section 2; and (2). Verify
whether the constructed ReduNet, including the basic vector case ReduNet in Section 3
and the invariance ReduNet in Section 4, achieves its design objectives through experiments.
Our goal in this work is not to push the state-of-the-art performance on any real datasets with additional engineering ideas and heuristics, although the experimental results clearly suggest this potential in the future. All code is implemented in Python, mainly using NumPy and PyTorch. All of our experiments are conducted on a computing node with a 2.1 GHz Intel Xeon Silver CPU, 256 GB of memory, and 2 Nvidia RTX 2080 GPUs. Implementation details and many more experiments can be found in Appendices D and E.
Supervised learning via rate reduction. When class labels are provided during training, we assign the membership (diagonal) matrix Π = {Π_j}_{j=1}^k as follows: for each sample x_i with label j, set Π_j(i, i) = 1 and Π_l(i, i) = 0, ∀l ≠ j. Then the mapping f(·, θ) can be learned by optimizing (11), where Π remains constant. We apply stochastic gradient descent to optimize MCR2, and for each iteration we use mini-batch data {(x_i, y_i)}_{i=1}^m to approximate the MCR2 loss.
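For concreteness, a minimal PyTorch sketch of this mini-batch objective is given below. It is our own illustration, not the authors' released code, and the exact constants in the coding-rate terms should be checked against the definitions in Section 2; the default precision ε is a placeholder.

    import torch

    def mcr2_loss(Z, y, eps=0.1):
        """Negative rate reduction -ΔR for a mini-batch.
        Z: (m, n) features (rows are samples); y: (m,) integer labels; eps: precision ε."""
        m, n = Z.shape
        I = torch.eye(n, device=Z.device, dtype=Z.dtype)
        # Expansion term R(Z, ε) for the whole batch.
        R = 0.5 * torch.logdet(I + (n / (m * eps ** 2)) * Z.T @ Z)
        # Compression term Rc(Z, ε | Π) using the diagonal membership matrices Π_j.
        Rc = 0.0
        for j in y.unique():
            Pi_j = torch.diag((y == j).to(Z.dtype))   # Π_j(i, i) = 1 iff y_i = j
            m_j = Pi_j.trace()
            Rc = Rc + (m_j / (2 * m)) * torch.logdet(I + (n / (m_j * eps ** 2)) * Z.T @ Pi_j @ Z)
        return -(R - Rc)   # minimize with SGD to maximize ΔR

In a training loop, Z would be the mini-batch output f(X, θ), and the loss is backpropagated through f as usual.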
Evaluation via classification. As we will see, in the supervised setting, the learned
representation has very clear subspace structures. So to evaluate the learned representations,
we consider a natural nearest subspace classifier. For each class of learned features Z_j, let µ_j ∈ R^n be the mean of the representation vectors of the j-th class and U_j ∈ R^{n×r_j} be the first r_j principal components for Z_j, where r_j is the estimated dimension of class j. The predicted label of a test sample x' is given by j' = argmin_{j∈{1,...,k}} ‖(I − U_j (U_j)*)(f(x', θ) − µ_j)‖_2^2.
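A minimal NumPy sketch of this nearest subspace classifier is shown below (the function names and the centering convention are ours; the paper does not prescribe them).

    import numpy as np

    def fit_subspaces(features, labels, r):
        """features: (m, n) learned representations; labels: (m,); r: dict class -> r_j."""
        models = {}
        for j in np.unique(labels):
            Zj = features[labels == j]
            mu_j = Zj.mean(axis=0)
            # First r_j principal components of the class features.
            _, _, Vt = np.linalg.svd(Zj - mu_j, full_matrices=False)
            models[j] = (mu_j, Vt[:r[j]].T)           # U_j has shape (n, r_j)
        return models

    def predict(models, z):
        """Return argmin_j ||(I - U_j U_j^T)(z - mu_j)||^2 for a test feature z = f(x', θ)."""
        def residual(mu_j, U_j):
            d = z - mu_j
            return np.linalg.norm(d - U_j @ (U_j.T @ d)) ** 2
        return min(models, key=lambda j: residual(*models[j]))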
Figure 12: Evolution of the rates of MCR2 in the training process and principal components of learned features. (a) Evolution of R, Rc, ∆R during the training process. (b) Training loss versus testing loss. (c) PCA: (red) overall data; (blue) individual classes.
Figure 12(a) illustrates how the two rates and their difference (for both training and test
data) evolve over epochs of training: after an initial phase, R gradually increases while Rc
decreases, indicating that features Z are expanding as a whole while each class Z_j is being
compressed. Figure 12(c) shows the distribution of singular values per Z_j and Figure 1
(right) shows the angles of features sorted by class. Compared to the geometric loss (Lezama
et al., 2018), our features are not only orthogonal but also of much higher dimension. We
compare the singular values of representations, both overall data and individual classes,
learned by using cross-entropy and MCR2 in Figure 17 and Figure 18 in Appendix D.3.1.
We find that the representations learned by using MCR2 loss are much more diverse than
the ones learned by using cross-entropy loss. In addition, we find that we are able to select
diverse images from the same class according to the “principal” components of the learned
features (see Figure 13 and Figure 19).
One potential caveat of MCR2 training is how to optimally select the output dimension
n and training batch size m. For a given output dimension n, a sufficiently large batch
size m is needed in order to achieve good classification performance. A more detailed study on varying the output dimension n and batch size m is given in Table 5 of Appendix D.3.2.
Table 1: Classification results with features learned with labels corrupted at different levels.
C9 C9 C9 C9
Figure 13: Visualization of principal components learned for class 2-‘Bird’ and class 8-‘Ship’. For
each class j, we first compute the top-10 singular vectors of the SVD of the learned
features Z j . Then for the l-th singular vector of class j (denoted by ulj ), and for the
feature of the i-th image of class j (denoted by zji ), we calculate the absolute value of
inner product, |hzji , ulj i|, then we select the top-10 images according to |hzji , ulj i| for
each singular vector. In the above two figures, each row corresponds to one singular
vector (component Cl ). The rows are sorted based on the magnitude of the associated
singular values, from large to small.
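The selection procedure in Figure 13 can be sketched in a few lines of NumPy (our own illustration; the array shapes are assumptions).

    import numpy as np

    def top_images_per_component(Zj, num_components=10, top_k=10):
        """Zj: (m_j, n) features of one class. Returns an index array of shape
        (num_components, top_k): for each singular vector u_l, the images with
        the largest |<z_i, u_l>|."""
        _, _, Vt = np.linalg.svd(Zj, full_matrices=False)    # rows of Vt are the u_l
        scores = np.abs(Zj @ Vt[:num_components].T)          # |<z_i, u_l>| per image and component
        return np.argsort(-scores, axis=0)[:top_k].T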
(a) ∆R(Z(θ), Π, ε). (b) R(Z(θ), ε). (c) Rc(Z(θ), ε | Π).
Figure 14: Evolution of rates R, Rc, ∆R of MCR2 during training with corrupted labels. The legend of each panel indicates the label-corruption ratio (noise = 0.0, 0.1, 0.3, 0.5).
Under the same training parameters, MCR2 is significantly more robust than cross-entropy (CE), especially with higher ratios of corrupted labels. This can be an advantage in the settings of self-supervised learning or contrastive learning, where the grouping information can be very noisy. A more detailed comparison between MCR2 and OLE (Lezama et al., 2018), Large Margin Deep Networks (Elsayed et al., 2018), and ITLM (Shen and Sanghavi, 2019) on learning from noisy labels can be found in Appendix D.4 (Table 8).
Besides the supervised learning setting, we explore the MCR2 objective in the self-supervised learning setting. We find that the MCR2 objective can learn good representations
without using any labels and achieves better performance than other highly engineered methods on clustering tasks. More details on self-supervised learning can be found in Appendix D.6.
34. We do this only to demonstrate our framework leads to stable deep networks even with thousands of
layers! In practice this is not necessary and one can stop whenever adding new layers gives diminishing
returns. For this example, a couple of hundred is sufficient. Hence the clear optimization objective gives
a natural criterion for the depth of the network needed. Remarks in Section 3.4 provide possible ideas to
further reduce the number of layers.
(a) Xtrain (3D) (b) Ztrain (3D) (c) Loss (3D)
Figure 15: Original samples and learned representations for 3D Mixture of Gaussians. We visualize
data points X (before mapping f (·, θ)) and learned features Z (after mapping f (·, θ))
by scatter plot. In each scatter plot, each color represents one class of samples. We also
show the plots for the progression of values of the objective functions.
We set the number of layers/iterations L = 40, precision ε = 0.1, and step size η = 0.5. Before the first layer, we perform lifting of the input by 1D circulant convolution with 20 random Gaussian kernels of size 5.
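This lifting step can be sketched as follows (a NumPy illustration with assumed conventions; the kernel scaling is not specified in the text).

    import numpy as np

    def random_lifting_1d(x, num_kernels=20, kernel_size=5, seed=0):
        """x: (n,) input signal. Returns a (num_kernels, n) multi-channel signal obtained
        by 1D circular convolution with random Gaussian kernels, computed via the DFT."""
        rng = np.random.default_rng(seed)
        n = len(x)
        X_hat = np.fft.fft(x)
        channels = []
        for _ in range(num_kernels):
            k = np.zeros(n)
            k[:kernel_size] = rng.standard_normal(kernel_size)   # random Gaussian kernel
            channels.append(np.fft.ifft(np.fft.fft(k) * X_hat).real)
        return np.stack(channels)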
(a) Xshift (RI-MNIST) (b) Z̄shift (RI-MNIST) (c) Similarity (RI-MNIST) (d) Loss (RI-MNIST)
(e) Xshift (TI-MNIST) (f) Z̄shift (TI-MNIST) (g) Similarity (TI-MNIST) (h) Loss (TI-MNIST)
Figure 16: (a)(b) and (e)(f) are heatmaps of cosine similarity among shifted training data Xshift and
learned features Z̄shift , for rotation and translation invariance respectively. (c)(g) are
histograms of the cosine similarity (in absolute value) between all pairs of features across
different classes: for each pair, one sample is from the training dataset (including all shifts)
and one sample is from another class in the test dataset (including all possible shifts).
There are 4 × 106 pairs for the rotation (c) and 2.56 × 106 pairs for the translation (g).
To evaluate the learned representation, each training sample is augmented by 20 of its
rotated versions, each shifted with stride = 10. We compute the cosine similarities among the
m × 20 augmented training inputs Xshift and the results are shown in Figure 16(a). We
compare the cosine similarities among the learned features of all the augmented versions,
i.e., Z̄shift, and summarize the results in Figure 16(b). As we see, the so-constructed rotation-
invariant ReduNet is able to map the training data (as well as all its rotated versions) from
the 10 different classes into 10 nearly orthogonal subspaces. That is, the learnt subspaces
are truly invariant to shift transformation in polar angle. Next, we randomly draw another
100 test samples followed by the same augmentation procedure. We compute the cosine
similarity histogram between features of all shifted training and those of the new test samples
in Figure 16(c). In Figure 16(d), we visualize the MCR2 loss on the ℓ-th layer representation
of the ReduNet on the training and test dataset. From Figure 16(c) and Figure 16(d), we
can find that the constructed ReduNet is indeed able to maximize the MCR2 loss as well as
generalize to the test data.
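The similarity heatmaps and histograms above are computed from pairwise cosine similarities of the (augmented) features; a small NumPy sketch (ours) is:

    import numpy as np

    def cosine_similarity(Z):
        """Z: (N, n) features. Returns the (N, N) matrix of absolute cosine similarities."""
        Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        return np.abs(Zn @ Zn.T)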
2D Translation Invariance on MNIST Digits. In this part, we provide experimental
results to use the invariant ReduNet to learn representations for images that are invariant
to the 2D cyclic translation. Essentially, we view the image as painted on a torus, which can be translated arbitrarily, as illustrated in Figure 22 in the Appendix. Again we use
the 10-class MNIST dataset. We use m = 100 training samples (10 samples from each
class) for constructing the ReduNet and use another 100 samples (10 samples from each
class) as the test dataset. We apply 2D circulant convolution to the (1-channel) inputs
with 75 random Gaussian kernels of size 9 × 9 before the first layer of ReduNet. For the
translation-invariant ReduNet, we set L = 25, step size η = 0.5, and precision ε = 0.1. Similar
to the rotational invariance task, to evaluate the performance of ReduNet with regard
to translation, we augment each training/testing sample by 16 of its translationally shifted versions (with stride = 7). The heatmaps of the similarities among the m × 16 augmented
training inputs Xshift and the learned features circ(Z̄)shift are compared in Figure 16(e) and
Figure 16(f). Similar to the rotation case, Figure 16(g) shows the histogram of similarity
between features of all shifted training samples and test samples. Clearly, the ReduNet can
map training samples from the 10 classes to 10 nearly orthogonal subspaces that are invariant to all possible 2D translations on the training dataset. Also, the MCR2 loss in Figure 16(h) increases with the layer index. This verifies that the constructed ReduNet can indeed maximize the objective and be invariant to the transformations it is designed for.
We compare three networks: (1). ReduNet by the forward construction, (2). ReduNet-bp initialized by the construction, (3). ReduNet-bp with random initialization using the same backbone architecture.
To initialize the networks, we use m = 500 samples (50 from each class) and set the number of layers/iterations L = 30, step size η = 0.5, and precision ε = 0.1. Before the first layer, we perform 2D circulant convolution on the inputs with 16 channels of 7 × 7 random Gaussian
kernels. For ReduNet-bp, we further train the constructed ReduNet via back propagation
by adding an extra fully connected layer and using cross-entropy loss. We update the model
parameters by using SGD to minimize the loss on the entire MNIST training data. To
evaluate the three networks, we compare the standard test accuracy on 10,000 MNIST
test samples. For ReduNet, we apply the nearest subspace classifier for prediction. For
ReduNet-bp with and without initialization, we take the argmax of the final fully connected
layer as the prediction. The results are summarized in Table 2. We find that (1). The ReduNet architecture can be optimized by SGD and achieves better standard accuracy after back propagation; (2). Using the constructed ReduNet for initialization achieves better performance compared with the same architecture with random initialization.
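The ReduNet-bp fine-tuning can be sketched as follows, assuming the constructed ReduNet is available as a torch.nn.Module (the name `redunet`, the feature dimension, and the optimizer settings below are placeholders, not values from the paper).

    import torch
    import torch.nn as nn

    def finetune(redunet, train_loader, feature_dim, num_classes=10, lr=0.01, epochs=10):
        """Append a fully connected layer to the constructed network and train
        end-to-end with cross-entropy via SGD (back propagation)."""
        model = nn.Sequential(redunet, nn.Flatten(), nn.Linear(feature_dim, num_classes))
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        return model

Prediction then takes the argmax of the final fully connected layer, as described above.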
White box versus black box. This new framework offers a constructive approach to
derive deep (convolution) networks entirely as a white box from the objective of learning
a low-dimensional linear discriminative representation for the given (mixed) data. The
goodness of the representation is measured by the intrinsic rate reduction. The resulting
network, called the ReduNet, emulates a gradient-based iterative scheme to optimize the
rate reduction objective. It shares almost all the main structural characteristics of modern
deep networks. Nevertheless, its architectures, linear and nonlinear operators and even
35. There are exceptions such as ScatteringNets (Bruna and Mallat, 2013; Wiatowski and Bölcskei, 2018),
whose operators are pre-designed and fixed, as also indicated in the table.
Table 3: Comparison between conventional deep networks and compression-based ReduNets.
their values are all derived from the data and all have precise geometric and statistical
interpretation. In particular, we find it is rather intriguing that the linear operator of
each layer has a non-parametric “data auto-regression” interpretation. Together with the
“forward construction,” they give rather basic but universal computing mechanisms that even
simple organisms/systems can use to learn good representations from the data that help
future classification tasks (e.g. object detection or recognition).
Invariance and sparsity. The new framework also reveals a fundamental trade-off be-
tween sparsity and invariance: essentially one cannot expect to separate different classes
of signals/data if signals in each class can be both arbitrarily shifted and arbitrarily super-
imposed. To achieve invariance to all translations, one must impose that signals in each
class are sparsely generated so that all shifted samples span a proper submanifold (in the
high-dimensional space) and features can be mapped to a proper subspace. Although sparse
representation for individual signals has been extensively studied and well understood in
the literature (Wright and Ma, 2021), very little is yet known about how to characterize the
distribution of sparse codes of a class of (equivariant or locally equivariant) signals and its
separability from other classes. This fundamental trade-off between sparsity and invariance
certainly merits further theoretical study, as it will lead to more precise characterization of
the statistical resource (network width) and computational resource (network depth) needed
to provide performance guarantees, say for (multi-manifold) classification (Buchanan et al.,
2020).
Further improvements and extensions. We believe the proposed rate reduction pro-
vides a principled framework for designing new networks with interpretable architectures
and operators that can provide performance guarantees (say invariance) when applied to
real-world datasets and problems. Nevertheless, the purpose of this paper is to introduce the basic principles and concepts of this new framework. We have chosen arguably the
simplest and most basic gradient-based scheme to construct the ReduNet for optimizing
the rate reduction. As we have touched upon briefly in the paper, many powerful ideas
from optimization can be further applied to improve the efficiency and performance of the
network, such as acceleration, preconditioning/normalization, and regularization. Also, there
are additional relationships and structures among the channel operators E and C j that
have not been exploited in this work for computational efficiency.
The reader may have realized that the basic ReduNet is expected to work when the data have relatively benign nonlinear structures, i.e., when each class is close to a linear subspace or a Gaussian distribution. Real data (say image classes) have much more complicated structures:
each class can have highly nonlinear geometry and topology or even be multi-modal itself.
Hence in practice, to learn a better LDR, one may have to resort to more sophisticated
strategies to control the compression (linearization) and expansion process. Real data also
have additional priors and data structures that can be exploited. For example, one can reduce the dimension of the ambient feature space whenever the intrinsic dimension of the features is low (or the features are sparse enough). Such dimension-reduction operations (e.g., pooling or striding)
are widely practiced in modern deep (convolution) networks for both computational efficiency
and even accuracy. According to the theory of compressive sensing, even a random projection
would be rather efficient and effective in preserving the (discriminative) low-dimensional
structures (Wright and Ma, 2021).
In this work, we have mainly considered learning a good representation Z for the data
X when the class label Π is given and fixed. Nevertheless, notice that in its most general
form, the maximal rate reduction objective can be optimized against both the representation
Z and the membership Π. In fact, the original work of Ma et al. (2007) precisely studies the complementary problem of learning Π by maximizing ∆R(Z, Π, ε) with Z fixed, hence equivalent to minimizing only the compression term: min_Π R_c(Z, ε | Π). Therefore, it
is obvious that this framework can be naturally extended to unsupervised settings if the
membership Π is partially known or entirely unknown and it is to be optimized together
with the representation Z. In a similar vein to the construction of the ReduNet, this may
entail us to examine the joint dynamics of the gradient of the representation Z and the
membership Π:
    Ż = η · ∂∆R/∂Z,    Π̇ = γ · ∂∆R/∂Π.    (53)
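To make the Z-part of these dynamics concrete, the PyTorch sketch below evaluates ∆R(Z, Π, ε) directly from its definition and takes a few gradient-ascent steps on Z with the membership Π held fixed; the joint update of Π in (53) is not implemented. The dimensions, the value of ε, and the step size are arbitrary assumptions of the sketch.

import torch

def rate_reduction(Z, Pi, eps=0.5):
    # ∆R(Z, Π, ε) = R(Z, ε) - Rc(Z, ε | Π) for Z of shape (n, m);
    # Pi is a list of k diagonal membership matrices Π^j of size m x m.
    n, m = Z.shape
    I = torch.eye(n, dtype=Z.dtype)
    R = 0.5 * torch.logdet(I + (n / (m * eps ** 2)) * Z @ Z.T)
    Rc = 0.0
    for Pij in Pi:
        mj = torch.trace(Pij)
        Rc = Rc + (mj / (2 * m)) * torch.logdet(I + (n / (mj * eps ** 2)) * Z @ Pij @ Z.T)
    return R - Rc

n, m, k, eta = 8, 60, 3, 0.5
Z = torch.randn(n, m, requires_grad=True)
labels = torch.randint(0, k, (m,))
Pi = [torch.diag((labels == j).float()) for j in range(k)]

for _ in range(10):
    dR = rate_reduction(Z, Pi)
    grad_Z, = torch.autograd.grad(dR, Z)
    with torch.no_grad():
        Z += eta * grad_Z                     # Ż = η · ∂∆R/∂Z with Π fixed
        Z /= Z.norm(dim=0, keepdim=True)      # keep each feature on the sphere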
Last but not least, in this work, we only considered data that are naturally embedded (as submanifolds) in a vector space (real or complex). There have been many works that study
and apply deep networks to data with additional structures or in a non-Euclidean space. For
example, in reinforcement learning and optimal control, people often use deep networks to
process data with additional dynamical structures, say linearizing the dynamics (Lusch et al.,
2018). In computer graphics or many other fields, people deal with data on a non-Euclidean
domain such as a mesh or a graph (Bronstein et al., 2017). It remains interesting to see how
the principles of data compression and linear discriminative representation can be extended
to help study or design principled white-box deep networks associated with dynamical or
graphical data and problems.
Acknowledgments
Yi would like to thank professor Yann LeCun of New York University for a stimulating
discussion in his office back in November 2019 when they contemplated a fundamental
question: what does or should a deep network try to optimize? At the time, they both
believed the low-dimensionality of the data (say sparsity) and discriminativeness of the
representation (e.g. contrastive learning) have something to do with the answer. The
conversation had inspired Yi to delve into this problem more deeply while self-isolated at
home during the pandemic.
Yi would also like to thank Dr. Harry Shum who has had many hours of conversations
with Yi about how to understand and interpret deep networks during the past couple of
years. In particular, Harry suggested how to better visualize the representations learned by
the rate reduction, including the results shown in Figure 13.
Yi acknowledges support from ONR grant N00014-20-1-2002 and the joint Simons
Foundation-NSF DMS grant #2031899, as well as support from Berkeley FHL Vive Center
for Enhanced Reality and Berkeley Center for Augmented Cognition. Chong and Yi
acknowledge support from Tsinghua-Berkeley Shenzhen Institute (TBSI) Research Fund.
Yaodong, Haozhi, and Yi acknowledge support from Berkeley AI Research (BAIR). John
acknowledges support from NSF grants 1838061, 1740833, and 1733857.
A.1 Preliminaries
Properties of the log det(·) function.
Lemma 7 The function log det(·) : S^n_{++} → R is strictly concave. That is,

    log det((1 − β)Z_1 + βZ_2) ≥ (1 − β) log det(Z_1) + β log det(Z_2)

for any β ∈ (0, 1) and {Z_1, Z_2} ⊆ S^n_{++}, with equality if and only if Z_1 = Z_2.

Proof  Consider an arbitrary line given by Z = Z_0 + t∆Z, where Z_0 and ∆Z ≠ 0 are symmetric matrices of size n × n. Let f(t) := log det(Z_0 + t∆Z) be a function defined on an interval of values of t for which Z_0 + t∆Z ∈ S^n_{++}. Following the same argument as in Boyd and Vandenberghe (2004), we may assume Z_0 ∈ S^n_{++} and get

    f(t) = log det Z_0 + Σ_{i=1}^n log(1 + tλ_i),

where {λ_i}_{i=1}^n are the eigenvalues of Z_0^{−1/2} ∆Z Z_0^{−1/2}. The second order derivative of f(t) is given by

    f″(t) = − Σ_{i=1}^n λ_i²/(1 + tλ_i)² < 0.

Therefore, f(t) is strictly concave along the line Z = Z_0 + t∆Z. By definition, we conclude that log det(·) is strictly concave.
Properties of the coding rate function. The following properties of the coding rate function, which are related to Sylvester's determinant theorem, are known from the paper of Ma et al. (2007).
Lemma 8 (Commutative property) For any Z ∈ R^{n×m} we have

    R(Z, ε) := (1/2) log det(I + (n/(mε²)) ZZ*) = (1/2) log det(I + (n/(mε²)) Z*Z).
Lemma 9 (Invariant property) For any Z ∈ Rn×m and any orthogonal matrices U ∈
Rn×n and V ∈ Rm×m we have
    R(Z, ε) = R(UZV*, ε).
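Both properties are straightforward to confirm numerically; the short NumPy sketch below checks them on a random real matrix (the sizes and the value of ε are arbitrary assumptions of the sketch).

import numpy as np

def R(Z, eps=0.5):
    n, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(n) + (n / (m * eps ** 2)) * Z @ Z.T)[1]

def R_gram(Z, eps=0.5):
    n, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(m) + (n / (m * eps ** 2)) * Z.T @ Z)[1]

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 40))

# Lemma 8 (commutative property): both expressions give the same rate.
assert np.isclose(R(Z), R_gram(Z))

# Lemma 9 (invariant property): the rate is unchanged by orthogonal U and V.
U, _ = np.linalg.qr(rng.standard_normal((16, 16)))
V, _ = np.linalg.qr(rng.standard_normal((40, 40)))
assert np.isclose(R(Z), R(U @ Z @ V.T))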
Lemma 10 For any {Z^j ∈ R^{n×m_j}}_{j=1}^k and any ε > 0, let Z = [Z^1, . . . , Z^k] ∈ R^{n×m} with m = Σ_{j=1}^k m_j. We have

    Σ_{j=1}^k (m_j/2) log det(I + (n/(m_j ε²)) Z^j(Z^j)*) ≤ (m/2) log det(I + (n/(mε²)) ZZ*)
        ≤ Σ_{j=1}^k (m/2) log det(I + (n/(mε²)) Z^j(Z^j)*),    (54)

where the first equality holds if and only if

    Z^1(Z^1)*/m_1 = Z^2(Z^2)*/m_2 = · · · = Z^k(Z^k)*/m_k,

and the second equality holds if and only if (Z^{j1})*(Z^{j2}) = 0 for all 1 ≤ j1 < j2 ≤ k.
Proof  By the strict concavity of log det(·) in Lemma 7, we have

    log det(Σ_{j=1}^k β_j S^j) ≥ Σ_{j=1}^k β_j log det(S^j),   for all {β_j > 0}_{j=1}^k with Σ_{j=1}^k β_j = 1 and {S^j ∈ S^n_{++}}_{j=1}^k,

where equality holds if and only if S^1 = S^2 = · · · = S^k. Taking β_j = m_j/m and S^j = I + (n/(m_j ε²)) Z^j(Z^j)*, we get

    (m/2) log det(I + (n/(mε²)) ZZ*) ≥ Σ_{j=1}^k (m_j/2) log det(I + (n/(m_j ε²)) Z^j(Z^j)*),

with equality if and only if Z^1(Z^1)*/m_1 = · · · = Z^k(Z^k)*/m_k. This proves the lower bound in (54).
We now prove the upper bound. By the strict concavity of log det(·), we have

    log det(Q) ≤ log det(S) + ⟨∇ log det(S), Q − S⟩,   for all {Q, S} ⊆ S^m_{++},

where equality holds if and only if Q = S. Plugging in ∇ log det(S) = S^{−1} (see, e.g., Boyd and Vandenberghe (2004)) and S^{−1} = (S^{−1})* gives

    log det(Q) ≤ log det(S) + tr(S^{−1}Q) − m.    (55)
We now take

    Q = I + (n/(mε²)) Z*Z = I + (n/(mε²)) [ (Z^1)*Z^1  (Z^1)*Z^2  ···  (Z^1)*Z^k ;
                                             (Z^2)*Z^1  (Z^2)*Z^2  ···  (Z^2)*Z^k ;
                                                 ⋮           ⋮       ⋱       ⋮     ;
                                             (Z^k)*Z^1  (Z^k)*Z^2  ···  (Z^k)*Z^k ],   and    (56)

    S = I + (n/(mε²)) [ (Z^1)*Z^1      0      ···      0     ;
                            0      (Z^2)*Z^2  ···      0     ;
                            ⋮           ⋮       ⋱       ⋮     ;
                            0           0      ···  (Z^k)*Z^k ].

It follows that

    log det(S) = Σ_{j=1}^k log det(I + (n/(mε²)) (Z^j)*Z^j).    (57)
Moreover,

    tr(S^{−1}Q) = tr [ (I + (n/(mε²))(Z^1)*Z^1)^{−1}(I + (n/(mε²))(Z^1)*Z^1)   ···   (I + (n/(mε²))(Z^1)*Z^1)^{−1}(I + (n/(mε²))(Z^1)*Z^k) ;
                                         ⋮                                      ⋱                            ⋮                            ;
                       (I + (n/(mε²))(Z^k)*Z^k)^{−1}(I + (n/(mε²))(Z^k)*Z^1)   ···   (I + (n/(mε²))(Z^k)*Z^k)^{−1}(I + (n/(mε²))(Z^k)*Z^k) ]

                = tr [ I  ∗  ···  ∗ ;  ∗  I  ···  ∗ ;  ⋮  ⋮  ⋱  ⋮ ;  ∗  ∗  ···  I ]  = m,    (58)

where “∗” denotes nonzero quantities that are irrelevant for the purpose of computing the trace. Plugging (57) and (58) back into (55) gives
    (m/2) log det(I + (n/(mε²)) Z*Z) ≤ Σ_{j=1}^k (m/2) log det(I + (n/(mε²)) (Z^j)*Z^j),

where the equality holds if and only if Q = S, which, by the formulation in (56), holds if and only if (Z^{j1})*(Z^{j2}) = 0 for all 1 ≤ j1 < j2 ≤ k. Further using the result in Lemma 8 gives

    (m/2) log det(I + (n/(mε²)) ZZ*) ≤ Σ_{j=1}^k (m/2) log det(I + (n/(mε²)) Z^j(Z^j)*),

which proves the upper bound in (54).
Lemma 11 For any Z ∈ R^{n×m}, Π ∈ Ω and ε > 0, let Z^j ∈ R^{n×m_j} be ZΠ^j with zero columns removed. We have

    ∆R(Z, Π, ε) ≤ (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z^j(Z^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z^j(Z^j)*) ].    (59)
Proof  By definition,

    ∆R(Z, Π, ε) = R(Z, ε) − R_c(Z, ε | Π)
        = (1/2) log det(I + (n/(mε²)) ZZ*) − Σ_{j=1}^k (tr(Π^j)/(2m)) log det(I + (n/(tr(Π^j)ε²)) ZΠ^jZ*)
        = (1/2) log det(I + (n/(mε²)) ZZ*) − Σ_{j=1}^k (m_j/(2m)) log det(I + (n/(m_j ε²)) Z^j(Z^j)*)
        ≤ (1/2) Σ_{j=1}^k log det(I + (n/(mε²)) Z^j(Z^j)*) − Σ_{j=1}^k (m_j/(2m)) log det(I + (n/(m_j ε²)) Z^j(Z^j)*)
        = (1/(2m)) Σ_{j=1}^k log det^m(I + (n/(mε²)) Z^j(Z^j)*) − (1/(2m)) Σ_{j=1}^k log det^{m_j}(I + (n/(m_j ε²)) Z^j(Z^j)*)
        = (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z^j(Z^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z^j(Z^j)*) ],

where the inequality follows from the upper bound in Lemma 10, and the equality holds if and only if (Z^{j1})*(Z^{j2}) = 0 for all 1 ≤ j1 < j2 ≤ k.
samples in the k classes. Given any ε > 0, n > 0 and {n ≥ d_j > 0}_{j=1}^k, consider the optimization problem

    Z_⋆ ∈ argmax_{Z ∈ R^{n×m}} ∆R(Z, Π, ε)
    s.t.  ‖ZΠ^j‖_F² = tr(Π^j),  rank(ZΠ^j) ≤ d_j,  ∀j ∈ {1, . . . , k}.    (60)

• (Between-class discriminative) (Z_⋆^{j1})*(Z_⋆^{j2}) = 0 for all 1 ≤ j1 < j2 ≤ k, i.e., Z_⋆^{j1} and Z_⋆^{j2} lie in orthogonal subspaces, and

• (Within-class diverse) For each j ∈ {1, . . . , k}, the rank of Z_⋆^j is equal to d_j and either all singular values of Z_⋆^j are equal to √(tr(Π^j)/d_j), or the d_j − 1 largest singular values of Z_⋆^j are equal and have value larger than √(tr(Π^j)/d_j),

where Z_⋆^j ∈ R^{n×tr(Π^j)} denotes Z_⋆Π^j with zero columns removed.
• There exists x_T > 0 such that f′(x) is strictly increasing in [0, x_T] and strictly decreasing in [x_T, ∞),

• f″(c/r) < 0 (equivalently, c/r > x_T),

• x_⋆ = [c/r, . . . , c/r], or
Proof The result holds trivially if r = 1. Throughout the proof we consider the case where
r > 1.
We consider the optimization problem with the inequality constraint x1 ≥ · · · ≥ xr in
(61) removed:
    max_{x = [x_1, ..., x_r] ∈ R_+^r}  Σ_{p=1}^r f(x_p)   s.t.   Σ_{p=1}^r x_p = c.    (62)
We need to show that any global solution x_⋆ = [(x_1)_⋆, . . . , (x_r)_⋆] to (62) is either x_⋆ = [c/r, . . . , c/r] or x_⋆ = [x_H, . . . , x_H, x_L] · P for some x_H > c/r, x_L > 0 and permutation matrix P ∈ R^{r×r}. Let

    L(x, λ) = Σ_{p=1}^r f(x_p) − λ_0 · (Σ_{p=1}^r x_p − c) − Σ_{p=1}^r λ_p x_p

be the Lagrangian function for (62), where λ = [λ_0, λ_1, . . . , λ_r] is the Lagrange multiplier. By the first order optimality conditions (i.e., the Karush–Kuhn–Tucker (KKT) conditions, see, e.g., (Nocedal and Wright, 2006b, Theorem 12.1)), there exists λ_⋆ = [(λ_0)_⋆, (λ_1)_⋆, . . . , (λ_r)_⋆] such that

    Σ_{p=1}^r (x_p)_⋆ = c,    (63)
By using the KKT conditions, we first show that all entries of x_⋆ are strictly positive. To prove by contradiction, suppose that x_⋆ has r_0 nonzero entries and r − r_0 zero entries for some 1 ≤ r_0 < r. Note that r_0 ≥ 1 since an all-zero vector x_⋆ does not satisfy the equality constraint (63).

Without loss of generality, we may assume that (x_p)_⋆ > 0 for p ≤ r_0 and (x_p)_⋆ = 0 otherwise. By (66), we have

    (λ_1)_⋆ = · · · = (λ_{r_0})_⋆ = 0.

Plugging it into (67), we get
Combining the last three equations above gives f′(0) − f′((x_1)_⋆) ≥ 0, contradicting the assumption that f′(0) < f′(x) for all x > 0. This shows that r_0 = r, i.e., all entries of x_⋆ are strictly positive. Using this fact and (66) gives

That is, the set {(x_p)_⋆}_{p=1}^r may contain no more than two values. To see why this is true, suppose that there exist three distinct values in {(x_p)_⋆}_{p=1}^r. Without loss of generality we may assume that 0 < (x_1)_⋆ < (x_2)_⋆ < (x_3)_⋆. If (x_2)_⋆ ≤ x_T (recall x_T := arg max_{x ≥ 0} f′(x)), then by using the fact that f′(x) is strictly increasing in [0, x_T], we must have f′((x_1)_⋆) < f′((x_2)_⋆), which contradicts (68). A similar contradiction is arrived at by considering f′((x_2)_⋆) and f′((x_3)_⋆) for the case where (x_2)_⋆ > x_T.

There are two possible cases as a consequence of (69). First, if x_L = x_H, then we have (x_1)_⋆ = · · · = (x_r)_⋆. By further using (63) we get

    (x_1)_⋆ = · · · = (x_r)_⋆ = c/r.

It remains to consider the case where x_L < x_H. First, by the unimodality of f′(x), we must have x_L < x_T < x_H, therefore
On the other hand, by using the second order necessary conditions for constrained optimization (see, e.g., (Nocedal and Wright, 2006b, Theorem 12.5)), the following result holds:

    v*∇_{xx} L(x_⋆, λ_⋆) v ≤ 0,   for all v such that ⟨∇_x(Σ_{p=1}^r (x_p)_⋆ − c), v⟩ = 0
    ⟺   Σ_{p=1}^r f″((x_p)_⋆) · v_p² ≤ 0,   for all v = [v_1, . . . , v_r] such that Σ_{p=1}^r v_p = 0.    (72)
Take v to be such that v_1 = · · · = v_{r−2} = 0 and v_{r−1} = −v_r ≠ 0. Plugging it into (72) gives

    f″((x_{r−1})_⋆) + f″((x_r)_⋆) ≤ 0,

which contradicts (71). Therefore, we may conclude that ℓ = 1. That is, x_⋆ is given by x_⋆ = [x_H, . . . , x_H, x_L], where x_H > x_L > 0. By using the condition in (63), we may further show that

    (r − 1)x_H + x_L = c  ⟹  x_H = (c − x_L)/(r − 1) < c/(r − 1),
    (r − 1)x_H + x_L = c  ⟹  (r − 1)x_H + x_H > c  ⟹  x_H > c/r,

which completes our proof.
Proof [Proof of Theorem 12]  Without loss of generality, let Z_⋆ = [Z_⋆^1, . . . , Z_⋆^k] be the optimal solution of problem (60).

To show that Z_⋆^j, j ∈ {1, . . . , k}, are pairwise orthogonal, suppose for the purpose of arriving at a contradiction that (Z_⋆^{j1})*(Z_⋆^{j2}) ≠ 0 for some 1 ≤ j1 < j2 ≤ k. By using Lemma 11, the strict inequality in (59) holds for the optimal solution Z_⋆. That is,

    ∆R(Z_⋆, Π, ε) < (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z_⋆^j(Z_⋆^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z_⋆^j(Z_⋆^j)*) ].    (73)
On the other hand, since Σ_{j=1}^k d_j ≤ n, there exist {U_j ∈ R^{n×d_j}}_{j=1}^k such that the columns of the matrix [U_1, . . . , U_k] are orthonormal. Denote by Z_⋆^j = U_⋆^j Σ_⋆^j (V_⋆^j)* the compact SVD of Z_⋆^j, and let

    Z′ = [Z′_1, . . . , Z′_k],   where   Z′_j = U_j Σ_⋆^j (V_⋆^j)*.

It follows that

    (Z′_{j1})*(Z′_{j2}) = V_⋆^{j1} Σ_⋆^{j1} (U_{j1})* U_{j2} Σ_⋆^{j2} (V_⋆^{j2})* = V_⋆^{j1} Σ_⋆^{j1} · 0 · Σ_⋆^{j2} (V_⋆^{j2})* = 0   for all 1 ≤ j1 < j2 ≤ k.    (74)

That is, the matrices Z′_1, . . . , Z′_k are pairwise orthogonal. Applying Lemma 11 to Z′ gives

    ∆R(Z′, Π, ε) = (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z′_j(Z′_j)*) / det^{m_j}(I + (n/(m_j ε²)) Z′_j(Z′_j)*) ]
                 = (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z_⋆^j(Z_⋆^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z_⋆^j(Z_⋆^j)*) ],    (75)

where the second equality follows from Lemma 9. Comparing (73) and (75) gives ∆R(Z′, Π, ε) > ∆R(Z_⋆, Π, ε), which contradicts the optimality of Z_⋆. Therefore, we must have (Z_⋆^{j1})*(Z_⋆^{j2}) = 0 for all 1 ≤ j1 < j2 ≤ k, and hence, by Lemma 11, the bound (59) is attained with equality:

    ∆R(Z_⋆, Π, ε) = (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z_⋆^j(Z_⋆^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z_⋆^j(Z_⋆^j)*) ].    (76)
We now prove the result concerning the singular values of Z_⋆^j. To start with, we claim that the following result holds:

    Z_⋆^j ∈ arg max_{Z^j} log [ det^m(I + (n/(mε²)) Z^j(Z^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z^j(Z^j)*) ]   s.t.   ‖Z^j‖_F² = m_j,  rank(Z^j) ≤ d_j.    (77)
To see why (77) holds, suppose that there exists Z̃^j such that ‖Z̃^j‖_F² = m_j, rank(Z̃^j) ≤ d_j and

    log [ det^m(I + (n/(mε²)) Z̃^j(Z̃^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z̃^j(Z̃^j)*) ] > log [ det^m(I + (n/(mε²)) Z_⋆^j(Z_⋆^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z_⋆^j(Z_⋆^j)*) ].    (78)

Denote by Z̃^j = Ũ^j Σ̃^j (Ṽ^j)* the compact SVD of Z̃^j and let

    Z′ = [Z_⋆^1, . . . , Z_⋆^{j−1}, Z′_j, Z_⋆^{j+1}, . . . , Z_⋆^k],   where   Z′_j := U_⋆^j Σ̃^j (Ṽ^j)*.

Note that ‖Z′_j‖_F² = m_j, rank(Z′_j) ≤ d_j and (Z′_j)*Z_⋆^{j′} = 0 for all j′ ≠ j. It follows that Z′ is a feasible solution to (60) and that the components of Z′ are pairwise orthogonal. By using Lemma 11, Lemma 9 and (78) we have
    ∆R(Z′, Π, ε)
      = (1/(2m)) log [ det^m(I + (n/(mε²)) Z′_j(Z′_j)*) / det^{m_j}(I + (n/(m_j ε²)) Z′_j(Z′_j)*) ]
          + Σ_{j′≠j} (1/(2m)) log [ det^m(I + (n/(mε²)) Z_⋆^{j′}(Z_⋆^{j′})*) / det^{m_{j′}}(I + (n/(m_{j′} ε²)) Z_⋆^{j′}(Z_⋆^{j′})*) ]
      = (1/(2m)) log [ det^m(I + (n/(mε²)) Z̃^j(Z̃^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z̃^j(Z̃^j)*) ]
          + Σ_{j′≠j} (1/(2m)) log [ det^m(I + (n/(mε²)) Z_⋆^{j′}(Z_⋆^{j′})*) / det^{m_{j′}}(I + (n/(m_{j′} ε²)) Z_⋆^{j′}(Z_⋆^{j′})*) ]
      > (1/(2m)) log [ det^m(I + (n/(mε²)) Z_⋆^j(Z_⋆^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z_⋆^j(Z_⋆^j)*) ]
          + Σ_{j′≠j} (1/(2m)) log [ det^m(I + (n/(mε²)) Z_⋆^{j′}(Z_⋆^{j′})*) / det^{m_{j′}}(I + (n/(m_{j′} ε²)) Z_⋆^{j′}(Z_⋆^{j′})*) ]
      = (1/(2m)) Σ_{j=1}^k log [ det^m(I + (n/(mε²)) Z_⋆^j(Z_⋆^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z_⋆^j(Z_⋆^j)*) ].
Combining it with (76) shows ∆R(Z′, Π, ε) > ∆R(Z_⋆, Π, ε), contradicting the optimality of Z_⋆. Therefore, the result in (77) holds.
Observe that the optimization problem in (77) depends on Z^j only through its singular values. That is, by letting σ_j := [σ_{1,j}, . . . , σ_{min(m_j,n),j}] be the singular values of Z^j, we have

    log [ det^m(I + (n/(mε²)) Z^j(Z^j)*) / det^{m_j}(I + (n/(m_j ε²)) Z^j(Z^j)*) ] = Σ_{p=1}^{min{m_j,n}} log [ (1 + (n/(mε²)) σ_{p,j}²)^m / (1 + (n/(m_j ε²)) σ_{p,j}²)^{m_j} ],
also, we have

    ‖Z^j‖_F² = Σ_{p=1}^{min{m_j,n}} σ_{p,j}²   and   rank(Z^j) = ‖σ_j‖_0.

Let (σ_j)_⋆ = [(σ_{1,j})_⋆, . . . , (σ_{min{m_j,n},j})_⋆] be an optimal solution to (79). Without loss of generality we assume that the entries of (σ_j)_⋆ are sorted in descending order. It follows that
and

    [(σ_{1,j})_⋆, . . . , (σ_{d_j,j})_⋆] = argmax_{[σ_{1,j}, ..., σ_{d_j,j}] ∈ R_+^{d_j}, σ_{1,j} ≥ ··· ≥ σ_{d_j,j}}  Σ_{p=1}^{d_j} log [ (1 + (n/(mε²)) σ_{p,j}²)^m / (1 + (n/(m_j ε²)) σ_{p,j}²)^{m_j} ]   s.t.   Σ_{p=1}^{d_j} σ_{p,j}² = m_j.    (80)
Then we define

    f(x; n, ε, m_j, m) := log [ (1 + (n/(mε²)) x)^m / (1 + (n/(m_j ε²)) x)^{m_j} ],

and rewrite (80) as

    max_{[x_1, ..., x_{d_j}] ∈ R_+^{d_j}, x_1 ≥ ··· ≥ x_{d_j}}  Σ_{p=1}^{d_j} f(x_p; n, ε, m_j, m)   s.t.   Σ_{p=1}^{d_j} x_p = m_j.    (81)
We compute the first and second derivatives of f with respect to x, which are given by

    f′(x; n, ε, m_j, m) = n²x(m − m_j) / [(nx + mε²)(nx + m_j ε²)],
    f″(x; n, ε, m_j, m) = n²(m − m_j)(m m_j ε⁴ − n²x²) / [(nx + mε²)²(nx + m_j ε²)²].

Note that

• f′(x) is strictly increasing in [0, x_T] and strictly decreasing in [x_T, ∞), where x_T = (ε²/n)√(m m_j), and
• by using the condition ε⁴ < (m_j/m)(n²/d_j²), we have f″(m_j/d_j) < 0.

Therefore, we may apply Lemma 13 and conclude that the unique optimal solution to (81) is either

• x_⋆ = [m_j/d_j, . . . , m_j/d_j], or

• x_⋆ = [x_H, . . . , x_H, x_L] for some x_H ∈ (m_j/d_j, m_j/(d_j − 1)) and x_L > 0,

as claimed.
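The closed-form derivatives above are easy to sanity-check numerically. The NumPy sketch below compares f″ against a finite-difference approximation and verifies that f″(m_j/d_j) < 0 under the stated condition; the particular values of n, m, m_j, d_j and ε are assumptions chosen only for the illustration.

import numpy as np

n, m, m_j, d_j, eps = 32, 100, 50, 5, 0.5

def f(x):
    return m * np.log(1 + n * x / (m * eps ** 2)) - m_j * np.log(1 + n * x / (m_j * eps ** 2))

def f2(x):
    # closed-form second derivative from the proof
    return n ** 2 * (m - m_j) * (m * m_j * eps ** 4 - n ** 2 * x ** 2) / (
        (n * x + m * eps ** 2) ** 2 * (n * x + m_j * eps ** 2) ** 2)

x_T = eps ** 2 * np.sqrt(m * m_j) / n           # point where f'' changes sign
x = np.linspace(1e-3, 3 * x_T, 2001)
num_f2 = np.gradient(np.gradient(f(x), x), x)   # numerical second derivative
assert np.allclose(num_f2[5:-5], f2(x)[5:-5], rtol=1e-2, atol=1e-3)
assert eps ** 4 < (m_j / m) * (n / d_j) ** 2 and f2(m_j / d_j) < 0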
Fact 2 (Properties of circulant matrices) Circulant matrices have the following prop-
erties:
• For a non-singular circulant matrix, its inverse is also circulant (hence representing a
circular convolution).
These properties of circulant matrices are extensively used in this work for characterizing the convolution structures of the operators E and C^j. Given a set of vectors [z^1, . . . , z^m] ∈ R^{n×m}, let circ(z^i) ∈ R^{n×n} be the circulant matrix for z^i. Then we have the following (Proposition 4 in the main body and here restated for convenience):

    Ez = e ⊛ z,

where e is the first column vector of E. Similarly, the matrices C^j associated with any subsets of Z are also circular convolutions.
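The following NumPy/SciPy sketch verifies this correspondence on a toy example: multiplication by a circulant matrix implements circular convolution, which can equivalently be computed in the spectral domain via the FFT (the vector length and random data are assumptions of the sketch).

import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
n = 8
e = rng.standard_normal(n)                # first column of the operator E
z = rng.standard_normal(n)

E = circulant(e)                          # circ(e): each column is a cyclic shift of e
conv_matrix = E @ z                       # E z
conv_fft = np.real(np.fft.ifft(np.fft.fft(e) * np.fft.fft(z)))   # e ⊛ z via the DFT
assert np.allclose(conv_matrix, conv_fft)

# The inverse of a non-singular circulant matrix is again circulant:
E_inv = np.linalg.inv(E)
assert np.allclose(E_inv, circulant(E_inv[:, 0]))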
To compute the coding rate reduction for a collection of such multi-channel 1D signals, we
may flatten the matrix representation into a vector representation by stacking the multiple
channels of z̄ as a column vector. In particular, we let
Furthermore, to obtain shift invariance for the coding rate reduction, we may generate
a collection of shifted copies of z̄ (along the temporal dimension). Stacking the vector
36. Notice that in the main paper, for simplicity, we have used n to indicate both the 1D “temporal” or 2D
“spatial” dimension of a signal, just to be consistent with the vector case, which corresponds to T here.
All notation should be clear within the context.
where α = CT/(mTε²) = C/(mε²), α_j = CT/(tr(Π^j)Tε²) = C/(tr(Π^j)ε²), γ_j = tr(Π^j)/m, and Π̄^j is the membership matrix Π^j augmented in the obvious way. Note that we introduce the normalization factor T in (90) because the circulant matrix circ(Z̄) contains T (shifted) copies of each signal.
By applying (14) and (15), we obtain the derivative of ∆R_circ(Z̄, Π) as

    (1/(2T)) · ∂ log det(I + α circ(Z̄)circ(Z̄)*)/∂ vec(Z̄)
        = (1/(2T)) · [∂ log det(I + α circ(Z̄)circ(Z̄)*)/∂ circ(Z̄)] · [∂ circ(Z̄)/∂ vec(Z̄)]
        = α (I + α circ(Z̄)circ(Z̄)*)^{−1} vec(Z̄),    (91)

where we denote Ē := α (I + α circ(Z̄)circ(Z̄)*)^{−1} ∈ R^{(C·T)×(C·T)}, and

    (γ_j/(2T)) · ∂ log det(I + α_j circ(Z̄)Π̄^j circ(Z̄)*)/∂ vec(Z̄)
        = γ_j α_j (I + α_j circ(Z̄)Π̄^j circ(Z̄)*)^{−1} vec(Z̄)Π^j,    (92)

where we denote C̄^j := α_j (I + α_j circ(Z̄)Π̄^j circ(Z̄)*)^{−1} ∈ R^{(C·T)×(C·T)}.
(92)
In the following, we show that Ē · vec(z̄) represents a multi-channel circular convolution.
Note that
    Ē = α [ I + α Σ_{i=1}^m circ(z̄^i[1])circ(z̄^i[1])*    ···    α Σ_{i=1}^m circ(z̄^i[1])circ(z̄^i[C])* ;
                              ⋮                              ⋱                         ⋮                 ;
            α Σ_{i=1}^m circ(z̄^i[C])circ(z̄^i[1])*          ···    I + α Σ_{i=1}^m circ(z̄^i[C])circ(z̄^i[C])* ]^{−1}.    (93)
By using Fact 2, the matrix in the inverse above is a block circulant matrix, i.e., a block
matrix where each block is a circulant matrix. A useful fact about the inverse of such a
matrix is the following.
Fact 3 (Inverse of block circulant matrices) The inverse of a block circulant matrix
is a block circulant matrix (with respect to the same block partition).
The main result of this subsection is the following (Proposition 5 in the main body and
here restated for convenience):
Similarly, the matrices C̄ j associated with any subsets of Z̄ are also multi-channel circular
convolutions.
Note that the calculation of Ē in (93) requires inverting a matrix of size (C ×T )×(C ×T ).
In the following, we show that this computation can be accelerated by working in the frequency
domain.
where ω_T := exp(−2π√(−1)/T) is the T-th root of unity (so that ω_T^T = 1). The matrix F_T is a unitary matrix, F_T F_T* = I, and is the well-known Vandermonde matrix. Multiplying a vector by F_T is known as the discrete Fourier transform (DFT). Be aware that the conventional DFT matrix differs from our definition of F_T here by a scale: it does not have the factor 1/√T in front. Here, for simplicity, we scale it so that F_T is a unitary matrix and its inverse is simply its conjugate transpose F_T*, whose columns represent the eigenvectors of a circulant matrix (Abidi et al., 2016).
Regarding the relationship between a circulant matrix (convolution) and discrete Fourier
transform, we have:
Fact 5 An n × n matrix M ∈ Cn×n is a circulant matrix if and only if it is diagonalizable
by the unitary matrix Fn :
Fn M Fn∗ = D or M = Fn∗ DFn , (101)
where D is a diagonal matrix of eigenvalues.
This property allows us to easily “normalize” features after each layer onto the sphere Sn−1
directly in the spectral domain (see Eq. (26) and (128)).
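Fact 5 can likewise be checked directly: building the unitary DFT matrix F_n as defined above, conjugating a random circulant matrix by it produces a diagonal matrix whose entries are the (unnormalized) DFT of its first column. The NumPy sketch below does exactly this for a small n (chosen arbitrarily).

import numpy as np
from scipy.linalg import circulant

n = 8
omega = np.exp(-2j * np.pi / n)
F = np.array([[omega ** (j * k) for k in range(n)] for j in range(n)]) / np.sqrt(n)
assert np.allclose(F @ F.conj().T, np.eye(n))          # F is unitary

c = np.random.default_rng(0).standard_normal(n)
M = circulant(c)                                       # circulant matrix with first column c
D = F @ M @ F.conj().T                                 # Fact 5: F M F* = D
assert np.allclose(D, np.diag(np.fft.fft(c)))          # D is diagonal; eigenvalues = DFT(c)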
    Ē(p) := α · (I + α · DFT(Z̄)(p) · DFT(Z̄)(p)*)^{−1} ∈ C^{C×C},    (113)
    C̄^j(p) := α_j · (I + α_j · DFT(Z̄)(p) · Π^j · DFT(Z̄)(p)*)^{−1} ∈ C^{C×C}.    (114)
In the above, Ē(p) (resp. C̄^j(p)) is the p-th slice of Ē (resp. C̄^j) along the last dimension. Then,
the gradient of ∆Rcirc (Z̄, Π) with respect to Z̄ can be calculated by the following result
(Theorem 6 in the main body and here restated for convenience).
By the above theorem, the gradient ascent update in (13) (when applied to ∆R_circ(Z̄, Π)) can be equivalently expressed as an update in the frequency domain on V̄_ℓ := DFT(Z̄_ℓ) as

    V̄_{ℓ+1}(p) ∝ V̄_ℓ(p) + η (Ē_ℓ(p) · V̄_ℓ(p) − Σ_{j=1}^k γ_j C̄_ℓ^j(p) · V̄_ℓ(p) Π^j),   p = 0, . . . , T − 1.    (127)

Similarly, the gradient-guided feature map increment in (26) can be equivalently expressed as an update in the frequency domain on v̄_ℓ := DFT(z̄_ℓ) as

    v̄_{ℓ+1}(p) ∝ v̄_ℓ(p) + η · Ē_ℓ(p) v̄_ℓ(p) − η · σ([C̄_ℓ^1(p) v̄_ℓ(p), . . . , C̄_ℓ^k(p) v̄_ℓ(p)]),   p = 0, . . . , T − 1,    (128)
subject to the constraint that ‖v̄_{ℓ+1}‖_F = ‖z̄_{ℓ+1}‖_F = 1 (the first equality follows from Fact 7).
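The per-frequency computation in (113), (114) and (127) can be sketched compactly in NumPy, as below. This is only a schematic illustration of one layer of the translation-invariant ReduNet: the shapes, ε, η and the toy data are assumptions, and the released code should be consulted for the actual implementation.

import numpy as np

def spectral_redunet_update(Z, labels, k, eps=0.1, eta=0.5):
    # Z has shape (C, T, m): C channels, T temporal samples, m signals.
    C, T, m = Z.shape
    alpha = C / (m * eps ** 2)
    V = np.fft.fft(Z, axis=1)                  # V̄ = DFT(Z̄) along the temporal axis
    V_new = np.empty_like(V)
    I = np.eye(C)
    for p in range(T):                         # one small C x C inverse per frequency
        Vp = V[:, p, :]                        # DFT(Z̄)(p), a C x m complex matrix
        E_p = alpha * np.linalg.inv(I + alpha * Vp @ Vp.conj().T)             # Eq. (113)
        expansion = E_p @ Vp
        compression = np.zeros_like(Vp)
        for j in range(k):
            mask = (labels == j)
            mj = mask.sum()
            gamma_j = mj / m
            alpha_j = C / (mj * eps ** 2)
            Vpj = Vp[:, mask]
            C_pj = alpha_j * np.linalg.inv(I + alpha_j * Vpj @ Vpj.conj().T)  # Eq. (114)
            compression[:, mask] = gamma_j * (C_pj @ Vpj)
        V_new[:, p, :] = Vp + eta * (expansion - compression)                 # Eq. (127)
    Z_new = np.real(np.fft.ifft(V_new, axis=1))
    # Normalize each new feature onto the sphere, as in the ReduNet construction.
    Z_new /= np.linalg.norm(Z_new.reshape(-1, m), axis=0)
    return Z_new

Z = np.random.default_rng(0).standard_normal((4, 16, 30))      # C=4, T=16, m=30 (toy sizes)
labels = np.random.default_rng(1).integers(0, 3, size=30)
Z_next = spectral_redunet_update(Z, labels, k=3)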
To a large degree, both conceptually and technically, the 2D case is very similar to the 1D
case that we have studied carefully in the previous Appendix B. For the sake of consistency
and completeness, we here give a brief account.
The matrix circ(z) is known as the doubly block circulant matrix associated with z (see, e.g.,
Abidi et al. (2016); Sedghi et al. (2019)).
We now consider a multi-channel 2D signal represented as a tensor z̄ ∈ RC×H×W , where
C is the number of channels. The c-th channel of z̄ is represented as z̄[c] ∈ RH×W , and the
(h, w)-th pixel is represented as z̄(h, w) ∈ RC . To compute the coding rate reduction for a
collection of such multi-channel 2D signals, we may flatten the tensor representation into a
vector representation by concatenating the vector representation of each channel, i.e., we let
Furthermore, to obtain shift invariance for coding rate reduction, we may generate a collection
of translated versions of z̄ (along two spatial dimensions). Stacking the vector representation
for such translated copies as column vectors, we obtain
    circ(z̄) := [ circ(z̄[1])
                      ⋮
                 circ(z̄[C]) ] ∈ R^{(C·H·W)×(H·W)}.    (134)
We can now define a translation invariant coding rate reduction for multi-channel 2D
signals. Consider a collection of m multi-channel 2D signals {z̄ i ∈ RC×H×W }m i=1 . Compactly
representing the data by Z̄ ∈ R C×H×W ×m where the i-th slice on the last dimension is z̄ i ,
we denote
circ(Z̄) = [circ(z̄ 1 ), . . . , circ(z̄ m )] ∈ R(C×H×W )×(H×W ×m) . (135)
Then, we define

    ∆R_circ(Z̄, Π) := (1/(HW)) ∆R(circ(Z̄), Π̄)
        = (1/(2HW)) log det(I + α · circ(Z̄) · circ(Z̄)*) − Σ_{j=1}^k (γ_j/(2HW)) log det(I + α_j · circ(Z̄) · Π̄^j · circ(Z̄)*).    (136)
Fact 9 (2D-DFT are eigenvalues of the doubly block circulant matrix) Given a sig-
nal z ∈ CH×W , we have
Doubly block circulant matrix and 2D-DFT for multi-channel signals. We now consider multi-channel 2D signals z̄ ∈ R^{C×H×W}. Let DFT(z̄) ∈ C^{C×H×W} be the tensor whose c-th slice on the first dimension is the DFT of the corresponding channel z̄[c], that is, DFT(z̄)[c] = DFT(z̄[c]) ∈ C^{H×W}. We use DFT(z̄)(p, q) ∈ C^C to denote the slice of DFT(z̄) at the frequency (p, q).
By using Fact 9, circ(z̄) and DFT(z̄) are related as follows:

    circ(z̄) = [ F* · diag(vec(DFT(z̄[1]))) · F
                               ⋮
                F* · diag(vec(DFT(z̄[C]))) · F ]

             = [ F*  ···  0
                  ⋮   ⋱   ⋮
                  0  ···  F* ] · [ diag(vec(DFT(z̄[1])))
                                            ⋮
                                   diag(vec(DFT(z̄[C]))) ] · F.    (143)
Similar to the 1D case, this relation can be leveraged to produce a fast implementation of
ReduNet in the spectral domain.
Translation-invariant ReduNet in the spectral domain. Given a collection of multi-channel 2D signals Z̄ ∈ R^{C×H×W×m}, we denote

    DFT(Z̄)(p, q) := [DFT(z̄^1)(p, q), . . . , DFT(z̄^m)(p, q)] ∈ C^{C×m}.    (144)

We introduce the notations Ē ∈ C^{C×C×H×W} and C̄^j ∈ C^{C×C×H×W}, whose (p, q)-th slices on the last two dimensions are given by

    Ē(p, q) := α · (I + α · DFT(Z̄)(p, q) · DFT(Z̄)(p, q)*)^{−1} ∈ C^{C×C},    (145)
    C̄^j(p, q) := α_j · (I + α_j · DFT(Z̄)(p, q) · Π^j · DFT(Z̄)(p, q)*)^{−1} ∈ C^{C×C}.    (146)

Then, the gradient of ∆R_circ(Z̄, Π) with respect to Z̄ can be calculated by the following result.
As proved in Theorem 12, the proposed MCR2 objective promotes within-class diversity.
In this section, we use simulated data to verify the diversity promoting property of MCR2 .
As shown in Table 4, we calculate our proposed MCR2 objective on simulated data. We
observe that orthogonal subspaces with higher dimension achieve higher MCR2 value, which
is consistent with our theoretical analysis in Theorem 12.
Training Setting. We mainly use ResNet-18 (He et al., 2016) in our experiments, where we use 4 residual blocks with layer widths {64, 128, 256, 512}. The implementations of the network architectures used in this paper are mainly based on this GitHub repo.37 For data augmentation in the supervised setting, we apply RandomCrop and RandomHorizontalFlip. For the supervised setting, we train the models for 500 epochs and use stage-wise learning rate decay every 200 epochs (decay by a factor of 10). For the self-supervised setting, we train the models for 100 epochs and use stage-wise learning rate decay at the 20-th and 40-th epochs (decay by a factor of 10).
Evaluation Details. For the supervised setting, we set the number of principal compo-
nents for nearest subspace classifier rj = 30. We also study the effect of rj in Section D.3.2.
For the CIFAR100 dataset, we consider 20 superclasses and set the cluster number as 20,
which is the same setting as in Chang et al. (2017); Wu et al. (2018).
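For reference, a minimal sketch of the nearest subspace classifier with r_j principal components per class is given below; the feature dimension, sample counts and random data are placeholders, and the actual evaluation code may differ in its details.

import numpy as np

def fit_class_subspaces(Z, labels, k, r=30):
    # Return the top-r principal directions U_j of the features of each class j.
    subspaces = []
    for j in range(k):
        Zj = Z[:, labels == j]                         # d x m_j features of class j
        Uj, _, _ = np.linalg.svd(Zj, full_matrices=False)
        subspaces.append(Uj[:, :r])
    return subspaces

def nearest_subspace_predict(z, subspaces):
    # Assign z to the class whose subspace leaves the smallest residual.
    residuals = [np.linalg.norm(z - U @ (U.T @ z)) for U in subspaces]
    return int(np.argmin(residuals))

rng = np.random.default_rng(0)
Z = rng.standard_normal((128, 500))                    # placeholder learned features
labels = rng.integers(0, 10, size=500)
subspaces = fit_class_subspaces(Z, labels, k=10, r=30)
prediction = nearest_subspace_predict(rng.standard_normal(128), subspaces)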
Datasets. We use the default datasets provided in PyTorch, including CIFAR10, CIFAR100, and STL10.
Augmentations T used for the self-supervised setting. We apply the same data augmentation to the CIFAR10 and CIFAR100 datasets, and the pseudo-code is as follows. The augmentations we use for the STL10 dataset are given in pseudo-code as well.
37. https://github.com/kuangliu/pytorch-cifar
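As an illustration only, a SimCLR-style torchvision pipeline of the kind commonly used for CIFAR-scale self-supervised training might look as follows; the specific transforms and parameters below are assumptions of this sketch and are not taken from the original pseudo-code.

import torchvision.transforms as T

cifar_augment = T.Compose([
    T.RandomResizedCrop(32),                                   # random crop and resize
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])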
Table 4: MCR2 objective on simulated data. We evaluate the proposed MCR2 objective
defined in (11), including R, Rc , and ∆R, on simulated data. The output dimension d is
set as 512, 256, and 128. We set the batch size as m = 1000 and randomly assign the label of each sample from 0 to 9, i.e., 10 classes. We generate two types of data: 1) (Random Gaussian) For comparison with data without structures, for each class we generate random vectors sampled from a Gaussian distribution (the dimension is set as the output dimension d) and normalize each vector to be on the unit sphere. 2) (Subspace) For each class, we generate vectors sampled from its corresponding subspace with dimension d_j and normalize each vector to be on the unit sphere. We consider cases where the subspaces from different classes are orthogonal or nonorthogonal to each other.
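A small NumPy sketch of the "Subspace" data generation described in the caption is given below (the dimensions match the d = 128, d_j = 10, m = 1000, 10-class setting of Table 4; the function name and random seed are assumptions of the sketch, and the "Random Gaussian" variant would simply normalize i.i.d. Gaussian vectors instead).

import numpy as np

def simulate_class_features(d=128, d_j=10, m_per_class=100, k=10, orthogonal=True, seed=0):
    # Generate unit-norm features lying on k subspaces of dimension d_j in R^d.
    rng = np.random.default_rng(seed)
    if orthogonal:
        # One orthonormal basis split into k mutually orthogonal blocks (k * d_j <= d).
        basis, _ = np.linalg.qr(rng.standard_normal((d, k * d_j)))
        bases = [basis[:, j * d_j:(j + 1) * d_j] for j in range(k)]
    else:
        bases = [np.linalg.qr(rng.standard_normal((d, d_j)))[0] for _ in range(k)]
    Z, labels = [], []
    for j, U in enumerate(bases):
        X = U @ rng.standard_normal((d_j, m_per_class))    # points in the j-th subspace
        X /= np.linalg.norm(X, axis=0)                     # project onto the unit sphere
        Z.append(X)
        labels += [j] * m_per_class
    return np.concatenate(Z, axis=1), np.array(labels)

Z, labels = simulate_class_features(d=128, d_j=10, orthogonal=True)   # Z has shape (128, 1000)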
Figure 17: Principal component analysis (PCA) of learned representations for the MCR²-trained model (first row) and the cross-entropy-trained model (second row). Panels (a)/(d): PCA of the learned features for the overall data (first 30 components); (b)/(e): PCA of the learned features for the overall data; (c)/(f): PCA of the learned features for every class.
Figure 18: Cosine similarity between learned features by using the MCR2 objective (left) and CE
loss (right).
(a) 10 representative images from each class based (b) Randomly selected 10 images from each class.
on top-10 principal components of the SVD of
learned representations by MCR2 .
Figure 19: Visualization of top-10 “principal” images for each class in the CIFAR10 dataset. (a)
For each class-j, we first compute the top-10 singular vectors of the SVD of the learned
features Z j . Then for the l-th singular vector of class j, ulj , and for the feature of the
i-th image of class j, zji , we calculate the absolute value of inner product, |hzji , ulj i|, then
we select the largest one for each singular vector within class j. Each row corresponds
to one class, and each image corresponds to one singular vector, ordered by the value of
the associated singular value. (b) For each class, 10 images are randomly selected in the
dataset. These images are the ones displayed in the CIFAR dataset website (Krizhevsky,
2009).
dimension of the overall features is nearly 120, and the output dimension is 128. In contrast, the dimension of the overall features learned using cross-entropy is slightly greater than 10, which is much smaller than that learned by MCR². From Figure 18, for MCR² training, we find that the features of different classes are almost orthogonal.
Number of components rj = 10 rj = 20 rj = 30 rj = 40 rj = 50
Mainline (Label Noise Ratio=0.0) 0.926 0.925 0.922 0.923 0.921
Label Noise Ratio=0.1 0.917 0.917 0.911 0.918 0.917
Label Noise Ratio=0.2 0.906 0.906 0.897 0.906 0.905
Label Noise Ratio=0.3 0.882 0.879 0.881 0.881 0.881
Label Noise Ratio=0.4 0.864 0.866 0.866 0.867 0.864
Label Noise Ratio=0.5 0.839 0.841 0.843 0.841 0.837
Table 6: Effect of number of components rj for nearest subspace classification in the supervised
setting.
Effect of ε² on learning from corrupted labels. To further study the proposed MCR² on learning from corrupted labels, we use different precision parameters, ε² = 0.75, 1.0, in addition to the one shown in Table 1. Except for the precision parameter ε², all the other parameters are the same as in the mainline experiment (the first row in Table 5). The first row (ε² = 0.5) in Table 7 is identical to the MCR² training in Table 10. Notice that with slightly different choices of ε², one might even see slightly improved performance over the ones reported in the main body.
Table 7: Effect of the precision ε² on classification results with features learned with labels corrupted at different levels by using MCR² training.
Table 8: Comparison with related work (OLE (Lezama et al., 2018), LargeMargin (Elsayed et al.,
2018), ITLM (Shen and Sanghavi, 2019)) on learning from noisy labels.
learning rate lr=0.1, momentum momentum=0.9 and weight decay wd=5e-4. We also use a cosine annealing learning rate scheduler during training. We show the respective test accuracy in Table 9. Although the classification result of using MCR² slightly lags behind that of using CE when the noise level is small, their performances are comparable with each other. The similar sensitivity to noise suggests that this might be due to the choice of the same network architecture. We leave the study of how to improve robustness to input noise for future work.
Table 9: Classification results of features learned with inputs corrupted by Gaussian noise at different
levels.
Figure 20: Evolution of the rates of (a, left) MCR² and (b, right) MCR²-CTRL in the training process in the self-supervised setting on the CIFAR10 dataset.

We observe that, without class labels, the overall coding rate R expands quickly and the MCR² loss saturates (at a local maximum), see Figure 20(a). Our experience suggests that learning a good representation from unlabeled data might be too ambitious when directly optimizing the original ∆R. Nonetheless, from the geometric meaning of R and R_c, one can design a different learning strategy by controlling the dynamics of expansion and compression differently during training. For instance, we may re-scale the rate by replacing R(Z, ε) with

    R̃(Z, ε) := (1/(2γ_1)) log det(I + (γ_2 d/(mε²)) ZZ*).    (151)

With γ_1 = γ_2 = k, the learning dynamics change from Figure 20(a) to Figure 20(b): All
features are first compressed then gradually expand. We denote the controlled MCR2
training by MCR2 -CTRL.
Experiments on real data. Similar to the supervised learning setting, we train exactly the same ResNet-18 network on the CIFAR10, CIFAR100, and STL10 (Coates et al., 2011) datasets. We set the mini-batch size as k = 20, the number of augmentations for each sample as n = 50, and the precision parameter as ε² = 0.5. Table 10 shows the results of the proposed MCR²-CTRL in comparison with the methods JULE (Yang et al., 2016), RTM (Nina et al., 2019), DEC (Xie et al., 2016), DAC (Chang et al., 2017), and DCCM (Wu et al., 2019) that have achieved the best results on these datasets. Surprisingly, without utilizing any inter-class or inter-sample information and heuristics on the data, the invariant features learned by our method with augmentations alone achieve better performance than other highly engineered clustering methods. More comparisons and ablation studies can be found in Appendix D.6.2.

Nevertheless, compared to the representations learned in the supervised setting where the optimal partition Π in (11) is initialized by correct class information, the representations here learned with self-supervised classes are far from being optimal. It remains wide open how to design better optimization strategies and dynamics to learn from unlabelled or
Dataset    Metric   K-Means  JULE   RTM    DEC    DAC    DCCM   MCR²-CTRL
CIFAR10    NMI      0.087    0.192  0.197  0.257  0.395  0.496  0.630
           ACC      0.229    0.272  0.309  0.301  0.521  0.623  0.684
           ARI      0.049    0.138  0.115  0.161  0.305  0.408  0.508
CIFAR100   NMI      0.084    0.103  -      0.136  0.185  0.285  0.387
           ACC      0.130    0.137  -      0.185  0.237  0.327  0.375
           ARI      0.028    0.033  -      0.050  0.087  0.173  0.178
STL10      NMI      0.124    0.182  -      0.276  0.365  0.376  0.446
           ACC      0.192    0.182  -      0.359  0.470  0.482  0.491
           ARI      0.061    0.164  -      0.186  0.256  0.262  0.290
partially-labelled data better representations (and the associated partitions) close to the
global maxima of the MCR2 objective (11).
Training details of MCR²-CTRL. For the three datasets (CIFAR10, CIFAR100, and STL10), we use ResNet-18 as in the supervised setting, and we set the output dimension d = 128, precision ε² = 0.5, mini-batch size k = 20, number of augmentations n = 50, and γ_1 = γ_2 = 20. We observe that MCR²-CTRL can achieve better clustering performance by using a smaller γ_2, i.e., γ_2 = 15, on the CIFAR10 and CIFAR100 datasets. We use SGD as the optimizer, and set the learning rate lr=0.1, weight decay wd=5e-4, and momentum=0.9.

Training dynamics comparison between MCR² and MCR²-CTRL. In the self-supervised setting, we compare the training process of MCR² and MCR²-CTRL in terms of R, R̃, R_c, and ∆R. For MCR² training, shown in Figure 20(a), the features first expand (for both R and R_c) and then compress (for R_c). For MCR²-CTRL, both R̃ and R_c first compress; then R̃ expands quickly while R_c remains small, as we have seen in Figure 20(b).
Table 11: Clustering comparison between MCR2 and MCR2 -CTRL on CIFAR10 dataset.
where Y_i is the i-th cluster in Y, C_j is the j-th cluster in C, and m is the total number of samples.

where S is the set of all one-to-one mappings from clusters to labels, and Y = [y^1, . . . , y^m], C = [c^1, . . . , c^m].
Adjusted Rand index (ARI). Suppose there are m samples, and let Y and C be two clusterings of these samples, where Y = {Y_1, . . . , Y_r} and C = {C_1, . . . , C_s}. Let m_{ij} denote the size of the intersection between Y_i and C_j, i.e., m_{ij} = |Y_i ∩ C_j|. The ARI metric is defined as

    ARI = [ Σ_{ij} C(m_{ij}, 2) − (Σ_i C(a_i, 2) · Σ_j C(b_j, 2)) / C(m, 2) ]
          / [ (1/2)(Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) − (Σ_i C(a_i, 2) · Σ_j C(b_j, 2)) / C(m, 2) ],

where a_i = Σ_j m_{ij}, b_j = Σ_i m_{ij}, and C(·, 2) denotes the number of pairs ("choose 2").
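The three metrics are standard and can be computed, for instance, with scikit-learn and SciPy; the short sketch below is illustrative (the toy label vectors are assumptions), with ACC computed via the Hungarian matching over the confusion matrix.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # ACC: best one-to-one matching between cluster indices and class labels.
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    row, col = linear_sum_assignment(cost, maximize=True)     # Hungarian matching
    return cost[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2, 0])
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("ACC:", clustering_accuracy(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))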
Comparison with Ji et al. (2019); Hu et al. (2017). We compare MCR2 with IIC (Ji
et al., 2019) and IMSAT (Hu et al., 2017) in Table 12. We find that MCR2 outperforms
IIC (Ji et al., 2019) and IMSAT (Hu et al., 2017) on both CIFAR10 and CIFAR100 by a
large margin. For STL10, Hu et al. (2017) applied pretrained ImageNet models and Ji et al.
(2019) used more data for training.
Table 13: Experiments of MCR2 -CTRL in the self-supervised setting on STL10 dataset.
Figure 21: Examples of rotated images of MNIST digits, each by 18◦ . (Left) Diagram for
polar coordinate representation; (Right) Rotated images of digit ‘0’ and ‘1’.
Figure 22: (Left) A torus on which 2D cyclic translation is defined; (Right) Cyclic translated
images of MNIST digits ‘0’ and ‘1’ (with stride=7).
Channel 5 10 15 20
Training Acc 1.000 1.000 1.000 1.000
Test Acc 0.560 0.610 0.610 0.610
Invariance Training Acc 0.838 0.972 0.990 1.000
Invariance Test Acc 0.559 0.609 0.610 0.610
Figure 23: Heatmaps of cosine similarity among shifted learned features (with different
channel sizes) Z̄shift for rotation invariance on the MNIST dataset (RI-MNIST).
Figure 24: Training and test losses of rotational invariant ReduNet for rotation invariance
on the MNIST dataset (RI-MNIST).
Channel 5 15 35 55 75
Training Acc 1.000 1.000 1.000 1.000 1.000
Test Acc 0.680 0.610 0.670 0.770 0.840
Invariance Training Acc 0.879 0.901 0.933 0.933 0.976
Invariance Test Acc 0.619 0.599 0.648 0.767 0.838
(a) 5 Channels (b) 15 Channels (c) 35 Channels (d) 55 Channels (e) 75 Channels
Figure 25: Heatmaps of cosine similarity among shifted learned features (with different
channel sizes) Z̄shift for translation invariance on the MNIST dataset (TI-MNIST).
(a) 5 Channels (b) 15 Channels (c) 35 Channels (d) 55 Channels (e) 75 Channels
Figure 26: Training and test losses of translation invariant ReduNet for translation invariance
on the MNIST dataset (TI-MNIST).
(a) X (2D): scatter plot and cosine similarity visualization; (b) Z (2D): scatter plot and cosine similarity visualization; (c) loss curves (∆R, R, R_c on training and test data); (d) X (3D): scatter plot and cosine similarity visualization; (e) Z (3D): scatter plot and cosine similarity visualization; (f) loss curves (∆R, R, R_c on training and test data).
Figure 27: Learning mixture of Gaussians in S1 and S2 . (Top) For S1 , we set σ1 = σ2 = 0.1;
(Bottom) For S2 , we set σ1 = σ2 = σ3 = 0.1.
Figure 28: Learning mixtures of Gaussian distributions with more than 2 classes. For both cases, we use step size η = 0.5 and precision ε = 0.1. For (a), we set the number of iterations L = 2,500; for (b), we set L = 4,000.
Scales     Angles   ∆R        R         Rc
2³         4        46.1418   66.0627   19.9208
2³         8        46.3207   66.1358   19.8151
2³         16       46.4162   66.1474   19.7312
2⁴         4        40.5377   58.8128   18.2751
2⁴         8        41.3374   60.1895   18.8521
2⁴         16       42.5311   61.5164   18.9854
ResNet-18           48.6497   69.2653   20.6155
Table 16: Objective values under scattering transform with varying scales and angles.
Lifting                              n      L     Acc
Scattering (2³ scales, 4 angles)     441    150   0.9734
Random filters                       441    150   0.9225
Scattering (2³ scales, 8 angles)     873    100   0.9781
Random filters                       873    100   0.9153
Scattering (2³ scales, 16 angles)    1737   50    0.9795
Random filters                       1737   50    0.9061
Scattering (2⁴ scales, 4 angles)     65     400   0.8952
Random filters                       65     400   0.8794
Scattering (2⁴ scales, 8 angles)     129    300   0.9392
Random filters                       129    300   0.9172
Scattering (2⁴ scales, 16 angles)    257    200   0.9569
Random filters                       257    200   0.8794
Table 17: Lifting by Scattering transform versus by random filters. The dimension of the
feature vector n and the number of layers used in the ReduNet L are also stated.
Figure 29: Verification of Equivariance in ReduNet. For each class of samples, and for
all training samples (and their augmentations) z^i and a test sample (and its augmentations) ẑ, we compute their inner products |⟨ẑ, z^i⟩|, ∀i. We select the top-9 largest inner products (sorted from left to right), and visualize their respective images. The test sample and its augmented version are highlighted in red.
References
Mongi A Abidi, Andrei V Gribok, and Joonki Paik. Optimization Techniques in Computer
Vision. Springer, 2016.
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via
over-parameterization. In International Conference on Machine Learning, pages 242–252.
PMLR, 2019.
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau,
Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient
descent by gradient descent. In Advances in neural information processing systems, pages
3981–3989, 2016.
Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep
matrix factorization. In Advances in Neural Information Processing Systems, pages
7411–7422, 2019.
Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly
to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
Bowen Baker, Otkrist Gupta, N. Naik, and R. Raskar. Designing neural network architectures
using reinforcement learning. ArXiv, abs/1611.02167, 2017.
Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning
from examples without local minima. Neural networks, 2(1):53–58, 1989.
A. Barron. Approximation and estimation bounds for artificial neural networks. Machine
Learning, 14:115–133, 1991.
Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin
bounds for neural networks. In Advances in Neural Information Processing Systems, pages
6241–6250, 2017.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Andrei Belitski, Arthur Gretton, Cesare Magri, Yusuke Murayama, Marcelo A. Montemurro,
Nikos K. Logothetis, and Stefano Panzeri. Low-frequency local field potentials and spikes
in primary visual cortex convey independent visual information. Journal of Neuroscience,
28(22):5696–5709, 2008. ISSN 0270-6474. doi: 10.1523/JNEUROSCI.0009-08.2008. URL
https://www.jneurosci.org/content/28/22/5696.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv e-prints, art. arXiv:1812.11118, Dec 2018.
Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university
press, 2004.
Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine,
34(4):18–42, 2017. doi: 10.1109/MSP.2017.2693418.
Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
Sam Buchanan, Dar Gilboa, and John Wright. Deep networks and the multiple manifold
problem. arXiv preprint arXiv:2008.11245, 2020.
Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, and
Rene Vidal. Dropout as a low-rank regularizer for matrix factorization. In International
Conference on Artificial Intelligence and Statistics, pages 435–444, 2018.
Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A
simple deep learning baseline for image classification? TIP, 2015.
Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep
adaptive image clustering. In Proceedings of the IEEE International Conference on
Computer Vision, pages 5879–5887, 2017.
Le Chang and Doris Tsao. The code for facial identity in the primate brain. Cell, 169:
1013–1028.e14, 06 2017. doi: 10.1016/j.cell.2017.05.011.
Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. In NeurIPS, 2018.
Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807,
2017.
Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in
unsupervised feature learning. In International Conference on Artificial Intelligence and
Statistics, pages 215–223, 2011.
Taco Cohen and Max Welling. Group equivariant convolutional networks. In International
Conference on Machine Learning, pages 2990–2999, 2016a.
Taco S. Cohen and Max Welling. Group equivariant convolutional networks. CoRR,
abs/1602.07576, 2016b. URL http://arxiv.org/abs/1602.07576.
Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs
on homogeneous spaces. In Advances in Neural Information Processing Systems, pages
9142–9153, 2019.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series
in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN
0471241954.
Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis,
Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in
classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2021.
Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably
optimizes over-parameterized neural networks. In International Conference on Learning
Representations, 2019. URL https://openreview.net/forum?id=S1eK3i09YQ.
Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large
margin deep networks for classification. In Advances in neural information processing
systems, pages 842–852, 2018.
Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry.
A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv,
2017.
Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Layer-peeled model: Toward under-
standing well-trained deep neural networks. arXiv preprint arXiv:2101.12699, 2021.
M. Fazel, H. Hindi, and S. P. Boyd. Log-det heuristic for matrix rank minimization with
applications to Hankel and Euclidean distance matrices. In Proceedings of the 2003
American Control Conference, 2003., volume 3, pages 2156–2162 vol.3, June 2003. doi:
10.1109/ACC.2003.1243393.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion
parameter models with simple and efficient sparsity. arXiv:2101.03961, 01 2021.
Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete
gradient dynamics in linear neural networks. In Advances in Neural Information Processing
Systems, pages 3196–3206, 2019.
Raja Giryes, Yonina C Eldar, Alex M Bronstein, and Guillermo Sapiro. Tradeoffs between
convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions
on Signal Processing, 66(7):1676–1690, 2018.
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity
of neural networks. CoRR, abs/1712.06541, 2017. URL http://arxiv.org/abs/1712.06541.
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning,
volume 1. MIT Press, 2016.
Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings
of the 27th International Conference on International Conference on Machine Learning,
pages 399–406, 2010.
Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient
descent on linear convolutional networks. In Advances in Neural Information Processing
Systems, pages 9461–9471, 2018.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an
invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, CVPR 2006, pages 1735–1742, 2006.
Benjamin David Haeffele, Chong You, and Rene Vidal. A critique of self-expressive deep
subspace clustering. In International Conference on Learning Representations, 2021.
XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity
to and dynamics on the central path. arXiv preprint arXiv:2106.02073, 2021.
Boris Hanin. Universal function approximation by deep neural nets with bounded width
and ReLU activations. arXiv preprint arXiv:1708.02691, 2017.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
Springer, second edition, 2009.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In CVPR, 2016.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast
for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common
corruptions and perturbations. In International Conference on Learning Representations,
2018.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415, 2016.
R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information
estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:
1735–80, 12 1997. doi: 10.1162/neco.1997.9.8.1735.
David Hong, Yue Sheng, and Edgar Dobriban. Selecting the number of components in PCA
via random signflips, 2020.
Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learn-
ing discrete representations via information maximizing self-augmented training. In
International Conference on Machine Learning, pages 1558–1567, 2017.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 2261–2269, 2017.
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):
193–218, 1985.
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. Automatic Machine Learning:
Methods, Systems, Challenges. Springer, 2019.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In International Conference on Machine Learning,
pages 448–456, 2015.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018. URL http://arxiv.org/abs/1806.07572.
Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace
clustering networks. In Advances in Neural Information Processing Systems, pages 24–33,
2017.
Xu Ji, João F Henriques, and Andrea Vedaldi. Invariant information clustering for unsuper-
vised image classification and segmentation. In Proceedings of the IEEE International
Conference on Computer Vision, pages 9865–9874, 2019.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to
escape saddle points efficiently. In International conference on machine learning, 2017a.
Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes
saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes
saddle points faster than gradient descent. In Sébastien Bubeck, Vianney Perchet, and
Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm,
Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages
1042–1085. PMLR, 2018. URL http://proceedings.mlr.press/v75/jin18a.html.
Zhao Kang, Chong Peng, Jie Cheng, and Qiang Cheng. Logdet rank minimization with
application to subspace clustering. CoRR, abs/1507.00908, 2015. URL http://arxiv.org/abs/1507.00908.
Koray Kavukcuoglu, Marc’Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant
features through topographic filter maps. In 2009 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1605–1612. IEEE, 2009.
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-
normalizing neural networks. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 972–981, 2017.
Artemy Kolchinsky, Brendan D Tracey, and Steven Van Kuyk. Caveats for information
bottleneck in deterministic scenarios. arXiv preprint arXiv:1808.07593, 2018.
Irwin Kra and Santiago R Simanca. On circulant matrices. Notices of the American
Mathematical Society, 59:368–377, 2012.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems,
pages 1097–1105, 2012.
Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
4013–4021, 2016.
Yann LeCun. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.
com/exdb/mnist/.
Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series.
In The handbook of brain theory and neural networks. MIT Press, 1995.
Yann LeCun, Lawrence D Jackel, Léon Bottou, Corinna Cortes, John S Denker, Harris
Drucker, Isabelle Guyon, Urs A Muller, Eduard Sackinger, Patrice Simard, et al. Learning
algorithms for classification: A comparison on handwritten digit recognition. Neural
networks: the statistical mechanics perspective, 261(276):2, 1995.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition
with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2,
pages II–104. IEEE, 2004.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436–444, 2015.
José Lezama, Qiang Qiu, Pablo Musé, and Guillermo Sapiro. OLE: Orthogonal low-rank
embedding-a plug and play geometric loss for deep learning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 8109–8118, 2018.
Ke Li, Shichong Peng, Tianhao Zhang, and Jitendra Malik. Multimodal image synthesis with
conditional implicit maximum likelihood estimation. International Journal of Computer
Vision, 2020.
Yanjun Li and Yoram Bresler. Multichannel sparse blind deconvolution on the sphere. IEEE
Transactions on Information Theory, 65(11):7415–7436, 2019.
Sheng Liu, Xiao Li, Yuexiang Zhai, Chong You, Zhihui Zhu, Carlos Fernandez-Granda, and
Qing Qu. Convolutional normalization: Improving deep convolutional network robustness
and training. arXiv preprint arXiv:2103.00673, 2021.
Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Deep learning for universal linear embeddings of
nonlinear dynamics. Nature Communications, 9, 2018. doi: 10.1038/s41467-018-07210-0.
Yi Ma, Harm Derksen, Wei Hong, and John Wright. Segmentation of multivariate mixed
data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(9):1546–1562, 2007.
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural
network acoustic models. In Proc. ICML, volume 30, page 3. Citeseer, 2013.
Jan MacDonald, Stephan Wäldchen, Sascha Hauch, and Gitta Kutyniok. A rate-distortion
framework for explaining neural network decisions. CoRR, abs/1905.11092, 2019. URL
http://arxiv.org/abs/1905.11092.
Najib J. Majaj, Ha Hong, Ethan A. Solomon, and James J. DiCarlo. Simple learned
weighted sums of inferior temporal neuronal firing rates accurately predict human core
object recognition performance. Journal of Neuroscience, 35(39):13402–13418, 2015. ISSN
0270-6474. doi: 10.1523/JNEUROSCI.5181-14.2015. URL https://www.jneurosci.org/
content/35/39/13402.
Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks
through FFTs, 2013.
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks:
The sequential learning problem. In Psychology of learning and motivation, volume 24,
pages 109–165. Elsevier, 1989.
W. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 5:115–133, 1943.
Song Mei and Andrea Montanari. The generalization error of random features regression: Pre-
cise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
Poorya Mianjy, Raman Arora, and Rene Vidal. On the implicit bias of dropout. In Jennifer Dy
and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Machine Learning Research, pages 3540–3548. PMLR,
10–15 Jul 2018. URL http://proceedings.mlr.press/v80/mianjy18b.html.
Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained
features. arXiv preprint arXiv:2011.11619, 2020.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normal-
ization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient
deep learning for signal and image processing. arXiv preprint arXiv:1912.10557, 2019.
S. Nam, M. E. Davies, M. Elad, and R. Gribonval. The cosparse analysis model and algorithms.
Applied and Computational Harmonic Analysis, 34(1):30–56, 2013. ISSN 1063-5203.
doi: 10.1016/j.acha.2012.03.006. URL http://www.sciencedirect.com/science/article/pii/S1063520312000450.
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of
convergence O(1/k^2). Doklady AN USSR, 269:543–547, 1983.
Oliver Nina, Jamison Moody, and Clarissa Milligan. A decoder-free approach for unsupervised
clustering and manifold learning with random triplet mining. In Proceedings of the IEEE
International Conference on Computer Vision Workshops, pages 0–0, 2019.
J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, second edition, 2006a.
Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY,
USA, second edition, 2006b.
Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation
functions: Comparison of trends in practice and research for deep learning. arXiv preprint
arXiv:1811.03378, 2018.
Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties
by learning a sparse code for natural images. Nature, 381(6583):607, 1996.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Vardan Papyan, Yaniv Romano, and Michael Elad. Convolutional neural networks analyzed
via convolutional sparse coding. The Journal of Machine Learning Research, 18(1):
2887–2938, 2017.
Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the
terminal phase of deep learning training. arXiv preprint arXiv:2008.08186, 2020.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imper-
ative style, high-performance deep learning library. In Advances in Neural Information
Processing Systems, pages 8024–8035, 2019.
Xi Peng, Jiashi Feng, Shijie Xiao, Jiwen Lu, Zhang Yi, and Shuicheng Yan. Deep sparse
subspace clustering. arXiv preprint arXiv:1709.08374, 2017.
Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic
dimension of images and its impact on learning. In International Conference on Learning
Representations, 2021. URL https://openreview.net/forum?id=XJk19XzGq2J.
Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, and Jitendra Malik. Deep isometric learning
for visual recognition. In Proceedings of the International Conference on International
Conference on Machine Learning, 2020.
Qing Qu, Xiao Li, and Zhihui Zhu. A nonconvex approach for exact and efficient multichannel
sparse blind deconvolution. In Advances in Neural Information Processing Systems, pages
4017–4028, 2019.
Qing Qu, Xiao Li, and Zhihui Zhu. Exact recovery of multichannel sparse blind deconvolution
via gradient descent. SIAM Journal on Imaging Sciences, 13(3):1630–1652, 2020a.
Qing Qu, Yuexiang Zhai, Xiao Li, Yuqian Zhang, and Zhihui Zhu. Geometric analysis of
nonconvex optimization landscapes for overcomplete learning. In International Confer-
ence on Learning Representations, 2020b. URL https://openreview.net/forum?id=
rygixkHKDH.
Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contrac-
tive auto-encoders: Explicit invariance during feature extraction. In International
Conference on Machine Learning, pages 833–840, 2011.
David Rolnick and Max Tegmark. The power of deeper networks for expressing natural
functions. CoRR, abs/1705.05502, 2017. URL http://arxiv.org/abs/1705.05502.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for
biomedical image segmentation. In International Conference on Medical image computing
and computer-assisted intervention, pages 234–241. Springer, 2015.
F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization
in the brain. Psychological Review, 65(6):386–408, 1958.
Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules.
CoRR, abs/1710.09829, 2017. URL http://arxiv.org/abs/1710.09829.
Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in
convolutional architectures for object recognition. In International conference on artificial
neural networks, pages 92–101. Springer, 2010.
Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers.
In ICLR, 2019.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,
and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts
layer. In ICLR, 2017. URL https://openreview.net/pdf?id=B1ckMDqlg.
Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed
loss minimization. In International Conference on Machine Learning, pages 5739–5748.
PMLR, 2019.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. In ICLR, 2015.
William R Softky and Christof Koch. The highly irregular firing of cortical cells is inconsistent
with temporal integration of random EPSPs. Journal of Neuroscience, 13(1):334–350,
1993.
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the
optimization landscape of over-parameterized shallow neural networks. IEEE Transactions
on Information Theory, 65(2):742–769, 2018.
Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel
methods: empirical data vs teacher-student paradigm. arXiv preprint arXiv:1905.10843,
2019.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv
preprint arXiv:1505.00387, 2015.
Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for
combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583–617,
2002.
Jeremias Sulam, Vardan Papyan, Yaniv Romano, and Michael Elad. Multilayer convolutional
sparse modeling: Pursuit and dictionary learning. IEEE Transactions on Signal Processing,
2018.
Xiaoxia Sun, Nasser M Nasrabadi, and Trac D Tran. Supervised deep sparse coding networks
for image classification. IEEE Transactions on Image Processing, 29:405–418, 2020.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199, 2013.
Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, and Donald
Metzler. Are pre-trained convolutions better than pre-trained transformers? 2021.
Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle.
In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The
missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
Nicolas Vasilache, J. Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and
Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR,
abs/1412.7580, 2015.
Rene Vidal, Yi Ma, and S. S. Sastry. Generalized Principal Component Analysis. Springer
Publishing Company, Incorporated, 1st edition, 2016. ISBN 0387878106.
Oriol Vinyals, Yangqing Jia, Li Deng, and Trevor Darrell. Learning with recursive perceptual
representations. In Proceedings of the 25th International Conference on Neural Information
Processing Systems - Volume 2, NIPS’12, pages 2825–2833, Red Hook, NY, USA, 2012.
Curran Associates Inc.
Andrew Wagner, John Wright, Arvind Ganesh, Zihan Zhou, Hossein Mobahi, and Yi Ma.
Toward a practical face recognition system: Robust alignment and illumination by sparse
representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):
372–386, 2012.
Michael B Wakin, David L Donoho, Hyeokho Choi, and Richard G Baraniuk. The multiscale
structure of non-differentiable image manifolds. In Proceedings of SPIE, the International
Society for Optical Engineering, pages 59141B–1, 2005.
Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of
dropout. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International
Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research,
pages 10181–10192. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/
wei20d.html.
Scott Wisdom, Thomas Powers, James Pitton, and Les Atlas. Interpretable recurrent neural
networks using sequential sparse recovery. ArXiv, abs/1611.07252, 2016.
John Wright and Yi Ma. High-Dimensional Data Analysis with Low-Dimensional Models:
Principles, Computation, and Applications. Cambridge University Press, 2021.
John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, and Heung-Yeung Shum. Classification
via minimum incremental coding length (MICL). In Advances in Neural Information
Processing Systems, pages 1633–1640, 2008.
John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust
face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell.,
31(2):210–227, February 2009. ISSN 0162-8828. doi: 10.1109/TPAMI.2008.79. URL
http://dx.doi.org/10.1109/TPAMI.2008.79.
Jianlong Wu, Keyu Long, Fei Wang, Chen Qian, Cheng Li, Zhouchen Lin, and Hongbin Zha.
Deep comprehensive correlation mining for image clustering. In Proceedings of the IEEE
International Conference on Computer Vision, pages 8150–8159, 2019.
Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference
on computer vision (ECCV), pages 3–19, 2018.
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning
via non-parametric instance discrimination. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
Ziyang Wu, Christina Baek, Chong You, and Yi Ma. Incremental learning via rate reduction.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
2021.
Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering
analysis. In International Conference on Machine Learning, pages 478–487, 2016.
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. In CVPR, 2017a.
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1492–1500, 2017b.
Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations
in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep represen-
tations and image clusters. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 5147–5156, 2016.
Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-
variance trade-off for generalization of neural networks. In International Conference on
Machine Learning (ICML), 2020.
Chong You, Chun-Guang Li, Daniel P Robinson, and René Vidal. Oracle based active
set algorithm for scalable elastic net subspace clustering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 3928–3937, 2016.
Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, and Yi Ma. Learning diverse
and discriminative representations via the principle of maximal coding rate reduction.
Advances in Neural Information Processing Systems, 33, 2020.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhut-
dinov, and Alexander J Smola. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems 30, pages 3391–3401. Curran Associates, Inc., 2017. URL
http://papers.nips.cc/paper/6931-deep-sets.pdf.
John Zarka, Louis Thiry, Tomás Angles, and Stéphane Mallat. Deep network classification
by scattering and homotopy dictionary learning. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=SJxWS64FwH.
John Zarka, Florentin Guth, and Stéphane Mallat. Separation and concentration in deep
networks. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=8HhkbjrWLdE.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks.
In European Conference on Computer Vision, pages 818–833. Springer, 2014.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under-
standing deep learning requires rethinking generalization. In ICLR, 2017a.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under-
standing deep learning requires rethinking generalization. In International Conference on
Learning Representations, 2017b.
Junjian Zhang, Chun-Guang Li, Chong You, Xianbiao Qi, Honggang Zhang, Jun Guo, and
Zhouchen Lin. Self-supervised convolutional subspace clustering network. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5473–5482,
2019a.
Tong Zhang, Pan Ji, Mehrtash Harandi, Richard Hartley, and Ian Reid. Scalable deep
k-subspace clustering. In Asian Conference on Computer Vision, pages 466–481. Springer,
2018.
Tong Zhang, Pan Ji, Mehrtash Harandi, Wenbing Huang, and Hongdong Li. Neural
collaborative subspace clustering. arXiv preprint arXiv:1904.10596, 2019b.
Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer
ReLU networks via gradient descent. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 1524–1534. PMLR, 2019c.
Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery
guarantees for one-hidden-layer neural networks. In International conference on machine
learning, pages 4140–4149. PMLR, 2017.
Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1596–1604, 2018.
Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu.
A geometric analysis of neural collapse with unconstrained features, 2021.
Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv
preprint arXiv:1611.01578, 2016.