
Recurrent Nets,

Convolutional Nets

Yann LeCun
NYU - Courant Institute & Center for Data Science
Facebook AI Research

Deep Learning, NYU Spring 2021



“Hypernetwork”

When the parameter vector is the output of another network H(x,u).
The weights of network G(x,w) are dynamically configured by network H(x,u).
The concept is very powerful.
The idea is very old.

[Diagram: x and u feed H(x,u), which produces the weight vector w of G(x,w); the output of G feeds the cost C(y, ȳ).]
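A minimal sketch of the idea in PyTorch (all names and sizes below are illustrative, and H is conditioned only on u for brevity): H produces the weight vector w that parameterizes G.

import torch
from torch import nn
from torch.nn import functional as F

class HyperLinear(nn.Module):
    # G(x, w): a linear layer whose weights w are produced at run time by H.
    def __init__(self, d_in, d_out, d_u):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # H(u): produces the flattened weight matrix and bias of G.
        self.H = nn.Linear(d_u, d_out * d_in + d_out)

    def forward(self, x, u):
        params = self.H(u)                                   # w = H(u)
        W = params[: self.d_out * self.d_in].view(self.d_out, self.d_in)
        b = params[self.d_out * self.d_in :]
        return F.linear(x, W, b)                             # y = G(x, w)

g = HyperLinear(d_in=8, d_out=4, d_u=16)
y = g(torch.randn(8), torch.randn(16))                       # y has 4 components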

Shared Weights for Motif Detection

Detecting motifs anywhere on an input.

[Diagram: copies of G(x,w), all sharing the same weight vector w, applied at every position of the input, followed by a MAX over positions.]
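A minimal sketch of this pattern in PyTorch (shapes and names are assumptions, not from the slides): the same detector weights are applied at every position, and a MAX over positions detects the motif anywhere in the input.

import torch
from torch import nn

# One motif detector with weights w, replicated (shared) across all window positions.
detector = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5)   # G(x, w) at every position

x = torch.randn(1, 1, 100)          # a length-100 input signal
scores = detector(x)                # detector response at each of the 96 window positions
motif_score = scores.max()          # MAX over positions: "is the motif anywhere in x?"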


Recurrent Nets

Yann LeCun
NYU - Courant Institute & Center for Data Science
Facebook AI Research

Recurrent Networks

Networks with loops. For backprop, unroll the loop.

[Diagram: a network whose output is fed back to its input through a Delay element.]

Recurrent Networks

Networks with loops. For backprop, unroll the loop.

[Diagram: the loop unrolled in time. At each step t = 0, 1, 2, Enc(x(t)) feeds the recurrent block G(h,z,w), which updates the state z(t) passed to the next step; Dec(z(t)) produces the output.]
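A minimal sketch of the unrolling in PyTorch (Enc, G and Dec below are illustrative stand-ins for the blocks in the diagram): the same G, with the same weights, is applied at every time step, and backprop through the unrolled graph is ordinary backprop.

import torch
from torch import nn

class UnrolledRNN(nn.Module):
    def __init__(self, d_x, d_z, d_y):
        super().__init__()
        self.Enc = nn.Linear(d_x, d_z)            # Enc(x(t))
        self.G = nn.Linear(2 * d_z, d_z)          # G(h, z, w): state update, shared across time
        self.Dec = nn.Linear(d_z, d_y)            # Dec(z(t))

    def forward(self, xs):                        # xs: (T, d_x)
        z = torch.zeros(self.G.out_features)
        ys = []
        for x in xs:                              # unroll the loop over time
            h = self.Enc(x)
            z = torch.tanh(self.G(torch.cat([h, z])))
            ys.append(self.Dec(z))
        return torch.stack(ys)

rnn = UnrolledRNN(d_x=3, d_z=8, d_y=2)
ys = rnn(torch.randn(5, 3))                       # 5 time steps -> (5, 2)
ys.sum().backward()                               # backprop through the unrolled graph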



RNN tricks
[Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013]

Clipping gradients (avoid exploding gradients; see the sketch after this list)
Leaky integration (propagate long-term dependencies)
Momentum (cheap 2nd-order method)
Initialization (starting in the right ballpark avoids exploding/vanishing gradients)
Sparse gradients (symmetry breaking)
Gradient propagation regularizer (avoid vanishing gradients)
LSTM self-loops (avoid vanishing gradients)
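A sketch of the gradient-clipping trick in PyTorch (the model and loss below are placeholders): clip the global gradient norm before each optimizer step.

import torch
from torch import nn

model = nn.RNN(input_size=10, hidden_size=20)            # any recurrent model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(50, 1, 10)                               # (seq_len, batch, features)
out, h = model(x)
loss = out.pow(2).mean()                                 # placeholder loss

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # avoid exploding gradients
opt.step()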

Recurrent Networks for differential equations

Differential equation: dz/dt = G(z(t), x(t), w)
Time discretization (Euler step): z(t+1) = z(t) + dt · G(z(t), x(t), w)
The network updates the state at each step.

[Diagram: the state is fed back through a Delay element.]
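A minimal sketch of such a discretized update in PyTorch (G and the step size dt are illustrative choices): one Euler step of dz/dt = G(z, x, w) per time step.

import torch
from torch import nn

class EulerCell(nn.Module):
    # z(t+1) = z(t) + dt * G(z(t), x(t), w): one Euler step of the differential equation.
    def __init__(self, d_x, d_z, dt=0.1):
        super().__init__()
        self.G = nn.Linear(d_x + d_z, d_z)
        self.dt = dt

    def forward(self, z, x):
        return z + self.dt * torch.tanh(self.G(torch.cat([z, x], dim=-1)))

cell = EulerCell(d_x=3, d_z=8)
z = torch.zeros(8)
for x in torch.randn(20, 3):      # the network updates the state at each time step
    z = cell(z, x)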

GRU (Gated Recurrent Units)


Recurrent nets quickly “forget” their state
Solution: explicit memory cells
GRU [Cho arXiv:1406.1078]
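For reference, a hand-written sketch of a GRU-style update in PyTorch, roughly following the form in [Cho arXiv:1406.1078] (torch.nn.GRUCell is the built-in equivalent; sizes are arbitrary): the update gate z and reset gate r control how the explicit memory is kept and rewritten.

import torch
from torch import nn

class GRUCellSketch(nn.Module):
    def __init__(self, d_x, d_h):
        super().__init__()
        self.Wz = nn.Linear(d_x + d_h, d_h)   # update gate
        self.Wr = nn.Linear(d_x + d_h, d_h)   # reset gate
        self.Wh = nn.Linear(d_x + d_h, d_h)   # candidate state

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=-1)
        z = torch.sigmoid(self.Wz(xh))                                # how much old state to keep
        r = torch.sigmoid(self.Wr(xh))                                # how much old state to expose
        h_tilde = torch.tanh(self.Wh(torch.cat([x, r * h], dim=-1)))  # candidate new state
        return z * h + (1 - z) * h_tilde                              # explicit, gated memory

cell = GRUCellSketch(d_x=4, d_h=8)
h = cell(torch.randn(4), torch.zeros(8))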

LSTM (Long Short-Term Memory)


Recurrent nets quickly “forget” their state
Solution: explicit memory cells
LSTM [Hochreiter & Schmidhuber 97]

[Figure: LSTM cell diagram by Guillaume Chevalier, https://commons.wikimedia.org/w/index.php?curid=71836793]
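A short usage sketch of the built-in PyTorch LSTM (sizes are arbitrary): the cell state c plays the role of the explicit memory, the hidden state h of the output.

import torch
from torch import nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)   # 2-layer LSTM

x = torch.randn(7, 1, 10)              # (seq_len, batch, input_size)
out, (h, c) = lstm(x)                  # c is the explicit memory cell
print(out.shape, h.shape, c.shape)     # (7, 1, 20), (2, 1, 20), (2, 1, 20)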



Sequence to Sequence

Sequence encoder → sequence decoder.
Multi-layer LSTM.
Applications: translation [Sutskever NIPS 2014]

[Diagram: a multi-layer stack of encoder blocks G(x,z,w) reads the input sequence x and produces a representation of the input sequence; a multi-layer stack of decoder blocks G(h,z,w) generates the output sequence y from that representation.]
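A compact sketch of the encoder → decoder pattern with PyTorch LSTMs (embeddings, teacher forcing and beam search omitted; all names and sizes are illustrative): the encoder compresses the input sequence into a fixed representation that conditions the decoder.

import torch
from torch import nn

d = 32
encoder = nn.LSTM(input_size=d, hidden_size=d)     # multi-layer in practice
decoder = nn.LSTM(input_size=d, hidden_size=d)
readout = nn.Linear(d, d)                          # maps decoder states to output symbols

x = torch.randn(9, 1, d)                           # input sequence (seq_len, batch, d)
_, (h, c) = encoder(x)                             # representation of the input sequence
y_in = torch.randn(6, 1, d)                        # shifted target sequence (teacher forcing)
dec_out, _ = decoder(y_in, (h, c))                 # decoder conditioned on that representation
y = readout(dec_out)                               # predicted output sequence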

Sequence to Sequence with Attention

Sequence encoder → sequence decoder with attention.
Applications: translation [Bahdanau, Cho, Bengio arXiv:1409.0473]

[Diagram: encoder blocks G(x,z,w) over the input sequence x; at each output position an attention module combines the encoder states, and a decoder block G(h,z,w) produces the output y.]
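A sketch of the attention step inserted between encoder and decoder (dot-product scores for brevity, rather than the additive attention of Bahdanau et al.; names and sizes are illustrative): each decoding step forms a weighted combination of the encoder states.

import torch

def attend(dec_state, enc_states):
    # dec_state: (d,), enc_states: (T, d)
    scores = enc_states @ dec_state            # one score per input position
    weights = torch.softmax(scores, dim=0)     # attention distribution over positions
    context = weights @ enc_states             # weighted sum of encoder states
    return context, weights

enc_states = torch.randn(9, 32)                # encoder outputs for 9 input positions
dec_state = torch.randn(32)                    # current decoder state
context, weights = attend(dec_state, enc_states)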
Convolutional Networks

For sequences, audio, speech, images, volumetric images, video, and other natural signals.

Shared Weights for Motif Detection

Detecting motifs anywhere on an input.

[Diagram: copies of G(x,w), all sharing the same weight vector w, applied at every position of the input, followed by a MAX over positions.]


Deep Learning = Learning Hierarchical Representations

It's deep if it has more than one stage of non-linear feature transformation.

[Diagram: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier.]

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013].

Detecting Motifs in Images

Swipe "templates" over the image to detect motifs.

[Diagram: an input image scanned by "C" and "D" detectors; each detector computes weighted sums followed by thresholded outputs, which feed a linear classifier.]

Detecting Motifs in Images

Shift invariance.

[Diagram: the "D" detector (weighted sums, thresholded outputs) applied to two input images with the motif at different positions.]

Discrete Convolution (or cross-correlation)

Definition (convolution): y[i] = Σ_j w[j] x[i-j]
In practice (cross-correlation): y[i] = Σ_j w[j] x[i+j]
In 2D (cross-correlation form): y[i,j] = Σ_{k,l} w[k,l] x[i+k, j+l]
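A small numerical check (shapes are arbitrary): what deep-learning libraries call convolution, e.g. torch.nn.functional.conv1d, computes the cross-correlation; flipping the kernel recovers the true convolution.

import torch
from torch.nn import functional as F

x = torch.randn(1, 1, 10)                          # (batch, channels, length)
w = torch.randn(1, 1, 3)                           # kernel

xcorr = F.conv1d(x, w)                             # y[i] = sum_j w[j] x[i+j]  (cross-correlation)
conv = F.conv1d(x, torch.flip(w, dims=[-1]))       # flipping w gives the true convolution

print(xcorr.shape)                                 # torch.Size([1, 1, 8])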

Backpropagating through convolutions

Convolution (really: cross-correlation): y[i] = Σ_j w[j] x[i+j]

Backprop to input: dL/dx[k] = Σ_i w[k-i] dL/dy[i], a true convolution of the output gradient with the kernel, sometimes called "back-convolution".

Backprop to weights: dL/dw[j] = Σ_i x[i+j] dL/dy[i], a cross-correlation of the input with the output gradient.
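A sketch that checks these formulas with autograd in the 1-D, stride-1, single-channel case (shapes are arbitrary): the gradient with respect to the input is a transposed convolution of dL/dy with the kernel, and the gradient with respect to the weights is a cross-correlation of the input with dL/dy.

import torch
from torch.nn import functional as F

x = torch.randn(1, 1, 10, requires_grad=True)
w = torch.randn(1, 1, 3, requires_grad=True)

y = F.conv1d(x, w)                                 # forward: cross-correlation
grad_y = torch.randn_like(y)                       # pretend this is dL/dy
y.backward(grad_y)

# Backprop to the input: a (transposed) convolution of dL/dy with w.
dx = F.conv_transpose1d(grad_y, w)
print(torch.allclose(x.grad, dx, atol=1e-5))       # True

# Backprop to the weights: a cross-correlation of x with dL/dy.
dw = F.conv1d(x.detach(), grad_y)
print(torch.allclose(w.grad, dw, atol=1e-5))       # True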

Stride and Skip: subsampling and convolution “à trous”

Regular convolution: “dense”.
Stride: subsampling convolution; reduces spatial resolution.
Skip, convolution “à trous” (pronounced “ah troo”, meaning “with holes”; also known as dilated convolution): dimensionality reduction without loss of resolution.
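A sketch of the three variants in PyTorch (sizes are arbitrary): a dense convolution, a strided (subsampling) convolution, and an "à trous" (dilated) convolution.

import torch
from torch import nn

x = torch.randn(1, 1, 100)

dense = nn.Conv1d(1, 1, kernel_size=5)                   # regular "dense" convolution
strided = nn.Conv1d(1, 1, kernel_size=5, stride=2)       # stride: subsampling convolution
atrous = nn.Conv1d(1, 1, kernel_size=5, dilation=2)      # "à trous": kernel with holes

print(dense(x).shape)    # torch.Size([1, 1, 96])
print(strided(x).shape)  # torch.Size([1, 1, 48]): reduced spatial resolution
print(atrous(x).shape)   # torch.Size([1, 1, 92]): full resolution, larger receptive field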

Convolutional Network Architecture

A stack of stages: Filter Bank + non-linearity → Pooling → Filter Bank + non-linearity → Pooling → Filter Bank + non-linearity.

[LeCun et al. NIPS 1989]



Multiple Convolutions

Animation: Andrej Karpathy http://cs231n.github.io/convolutional-networks/



Convolutional Network (vintage 1990)

Overall architecture: multiple stages of filters + tanh → pooling → filters + tanh → pooling → filters + tanh.

Normalization → Filter Bank → Non-Linearity → Pooling

[Diagram: Norm → Filter Bank → Non-Linearity → feature Pooling → Norm → Filter Bank → Non-Linearity → feature Pooling → Classifier.]

Normalization: variation on whitening (optional)
– Subtractive: average removal, high-pass filtering
– Divisive: local contrast normalization, variance normalization
Filter Bank: dimension expansion, projection on an overcomplete basis
Non-Linearity: sparsification, saturation, lateral inhibition, ...
– Rectification (ReLU), component-wise shrinkage, tanh, ...

ReLU(x) = max(x, 0)
Pooling: aggregation over space or feature type
– Max, Lp norm, log prob.

MAX: max_i(X_i);  Lp: (Σ_i X_i^p)^(1/p);  PROB: (1/b) log Σ_i exp(b X_i)
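A sketch of the three pooling operators applied to a vector X of feature responses (b is the sharpness parameter of the soft PROB pooling; values are arbitrary).

import torch

X = torch.randn(8)                                    # responses X_i of one feature over a pool
p, b = 2.0, 5.0

max_pool = X.max()                                    # MAX: max_i(X_i)
lp_pool = (X.abs() ** p).sum() ** (1 / p)             # Lp: (sum_i X_i^p)^(1/p), abs() for negative responses
prob_pool = (1 / b) * torch.logsumexp(b * X, dim=0)   # PROB: (1/b) log sum_i exp(b X_i)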

LeNet5
Simple ConvNet for MNIST [LeCun 1998]

PyTorch code → (slightly different net)

[Diagram: input 1@32x32 → 5x5 convolution → Layer 1: 6@28x28 → 2x2 pooling/subsampling → Layer 2: 6@14x14 → 5x5 convolution → Layer 3: 12@10x10 → 2x2 pooling/subsampling → Layer 4: 12@5x5 → 5x5 convolution → Layer 5: 100@1x1 → Layer 6: 10 outputs.]

LeNet5
Simple ConvNet for MNIST [LeCun 1998]

PyTorch code → github.com/activatedgeek/LeNet-5
(nn.Sequential() with an ordered dictionary argument)

[Diagram: the same architecture as above, built from 5x5 convolutions and 2x2 pooling/subsampling: Layer 1: 6@28x28 → Layer 2: 6@14x14 → Layer 3: 12@10x10 → Layer 4: 12@5x5 → Layer 5: 100@1x1 → Layer 6: 10.]
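A sketch of a LeNet5-style network in PyTorch, using nn.Sequential with an ordered dictionary as on the slide and the layer sizes from the diagram (an illustrative reconstruction, not the exact code from the linked repository; average pooling stands in for the original subsampling layers).

from collections import OrderedDict
import torch
from torch import nn

# LeNet5-style ConvNet: 1@32x32 -> 6@28x28 -> 6@14x14 -> 12@10x10 -> 12@5x5 -> 100@1x1 -> 10
lenet5 = nn.Sequential(OrderedDict([
    ('c1', nn.Conv2d(1, 6, kernel_size=5)),    # 1@32x32 -> 6@28x28
    ('tanh1', nn.Tanh()),
    ('s2', nn.AvgPool2d(kernel_size=2)),       # 6@28x28 -> 6@14x14 (pooling/subsampling)
    ('c3', nn.Conv2d(6, 12, kernel_size=5)),   # 6@14x14 -> 12@10x10
    ('tanh2', nn.Tanh()),
    ('s4', nn.AvgPool2d(kernel_size=2)),       # 12@10x10 -> 12@5x5
    ('c5', nn.Conv2d(12, 100, kernel_size=5)), # 12@5x5 -> 100@1x1
    ('tanh3', nn.Tanh()),
    ('flatten', nn.Flatten()),
    ('f6', nn.Linear(100, 10)),                # Layer 6: 10 class scores
]))

scores = lenet5(torch.randn(1, 1, 32, 32))     # -> shape (1, 10)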

Sliding Window ConvNet + Weighted Finite-State Machine



LeNet character recognition demo 1992


Running on an AT&T DSP32C (floating-point DSP, 20 MFLOPS)

How does the brain interpret images?

The ventral (recognition) pathway in the visual cortex has multiple stages:
Retina → LGN → V1 → V2 → V4 → PIT → AIT ...

[picture from Simon Thorpe]
[Gallant & Van Essen]

Hubel & Wiesel's Model of the Architecture of the Visual Cortex

[Hubel & Wiesel 1962]:
Simple cells detect local features.
Complex cells “pool” the outputs of simple cells within a retinotopic neighborhood.

[Diagram: “simple cells” realized as multiple convolutions; “complex cells” as pooling/subsampling.]

[Fukushima 1982] [LeCun 1989, 1998] [Riesenhuber 1999] ...


First ConvNets (U Toronto) [LeCun 88, 89]

Trained with backprop. 320 examples.

[Figure: networks compared: single layer; two layers FC; locally connected; shared weights; shared weights.]
- Convolutions with stride (subsampling)
- No separate pooling layers
First “Real” ConvNets at Bell Labs [LeCun et al 89]

Trained with Backprop.


USPS Zipcode digits: 7300 training, 2000 test.
Convolution with stride. No separate pooling.
