
Fall 2020 Applied Machine Learning

Guest Lecture

Neural Network Architectures


Jin Sun

Oct. 2020
Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Recap: Neural Networks and MLP

Neural networks encode a function mapping from input to output.

Such a function is a composition of multiple simpler functions, one for each layer.


Recap: Neural Networks and MLP

Multilayer Perceptrons:
Multiple fully connected layers (at least an input, a hidden, and an output layer).
Non-linear activations (sigmoid, ReLU, etc.).
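As a minimal sketch of the recap above (the layer sizes are placeholders, not values from the lecture), an MLP with one hidden layer could look like this in PyTorch:

```python
import torch
import torch.nn as nn

# A minimal MLP: input -> hidden (with a non-linearity) -> output.
# The sizes (784, 128, 10) are placeholder values for illustration.
mlp = nn.Sequential(
    nn.Linear(784, 128),  # fully connected: input -> hidden
    nn.ReLU(),            # non-linear activation
    nn.Linear(128, 10),   # fully connected: hidden -> output
)

x = torch.randn(32, 784)   # a batch of 32 flattened inputs
logits = mlp(x)            # shape: (32, 10)
```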
Recap: Neural Network Layers and Representation

Why so many NN architectures?

https://arxiv.org/abs/1605.07678
Why so many NN architectures?
Theoretically, a network with a single hidden layer and enough neurons can fit any function.

When we refer to NN architectures, we mean the number of layers, the activation functions, and the connectivity patterns in a model.

● Performance: deeper networks -> better performance
● Efficiency: the same performance with a smaller footprint
● Robustness: more stable to train, e.g., normalization layers
● Task/data-dependent: better suited to the task, e.g., LSTMs for sequences
Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Recap: Convolutional Neural Networks

A convolutional layer with 5 filters produces 5 activation maps (e.g., an output of size 5x28x28).
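To see the "5 filters -> 5 activation maps" relationship concretely, here is a minimal sketch; the 3-channel 28x28 input and the padding are assumptions made for illustration:

```python
import torch
import torch.nn as nn

# 5 filters over a 3-channel input; the padding keeps the 28x28 spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5, padding=2)

x = torch.randn(1, 3, 28, 28)   # one 3x28x28 input image
out = conv(x)
print(out.shape)                # torch.Size([1, 5, 28, 28]) -> 5 activation maps
```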

Recap: Convolutional Neural Networks

Recap: AlexNet

5 conv layers: convolution + ReLU + max pooling

3 fully connected layers
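A hedged sketch of the pattern described above: one convolution + ReLU + max-pooling stage, followed by fully connected layers. The channel counts and sizes are placeholders, not AlexNet's actual configuration:

```python
import torch.nn as nn

# One conv stage: convolution + ReLU + max pooling (AlexNet stacks 5 conv layers).
stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

# After the conv stages, the feature maps are flattened and passed through
# fully connected layers to produce class scores (AlexNet uses 3 of them).
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 256),  # placeholder sizes for a 32x32 input
    nn.ReLU(),
    nn.Linear(256, 10),
)
```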


VGG

Simply more layers.

Smaller conv filters (3x3).

Performance: 7.3% error, compared to AlexNet's 16.4% (lower is better).

https://arxiv.org/abs/1409.1556
VGG

Simply more layers.

Smaller conv filters (3x3).

Performance: 7.3% error, compared to AlexNet's 16.4% (lower is better).

Stacked small filters have a similar receptive field as a single large filter, but have (as sketched below):
1) Fewer parameters
2) More nonlinearity
https://arxiv.org/abs/1409.1556
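The claim about stacked filters can be checked directly. A small sketch (the channel count C is an arbitrary example) comparing one 5x5 convolution with two stacked 3x3 convolutions that cover the same 5x5 receptive field:

```python
import torch.nn as nn

C = 64  # example channel count

# One 5x5 convolution: C * C * 5 * 5 weights (plus biases).
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

# Two stacked 3x3 convolutions with a ReLU in between:
# same 5x5 receptive field, fewer weights, more non-linearity.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(single_5x5))   # 102464
print(n_params(stacked_3x3))  # 73856
```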
If more layers are better, can we just stack an arbitrarily large number of layers?

VGG-infinity??

...

Training a 56-layer network vs. a 20-layer network

Note that this is not overfitting: the deeper network's training error is also worse.
ResNet
This doesn’t make sense.

The deeper network should be able to learn everything the shallow network can.

In fact, isn’t the shallow network ‘part of’ the deeper network, obtained by setting some layers to identity functions?

Unless such an identity mapping is actually hard to learn when training the deeper networks.

https://arxiv.org/abs/1512.03385
Residual Block
Then let’s help the network by building in the identity mapping.
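A minimal sketch of the residual block idea: the layers learn a residual F(x) and the input is added back, so the identity mapping only requires F(x) ≈ 0. This simplified version omits the strided and projection variants used in the paper:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes ReLU(F(x) + x), where F is two 3x3 conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: the identity comes for free
```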

ResNet

ResNet-50

ResNet-101

ResNet-152

Achieves 3.6% error.

DenseNet
Within a block, each layer is connected to every other layer: it receives the feature maps of all preceding layers as input.

https://arxiv.org/abs/1608.06993
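A hedged sketch of the dense connectivity idea, where each layer takes the concatenation of all earlier feature maps in the block; the growth rate and number of layers are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer receives the concatenated outputs of all previous layers."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels_in = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels_in),
                nn.ReLU(),
                nn.Conv2d(channels_in, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # use all previous feature maps
            features.append(out)
        return torch.cat(features, dim=1)
```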
Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Sequential Data
Data distribution changes over time

Sequential Data: same value at this point, but going up or going down?

Data distribution changes over time

Sequential Data
Data distribution changes over time

Sequential Data: similar sound, similar meaning?

Data distribution changes over time

Sequential Data
Data distribution changes over time

Sequential Data: going forward or backward?

Data distribution changes over time

Learning and Using the Arrow of Time, CVPR 2018


Sequential Data
Data distribution changes over time

Sequential Data
Data distribution changes over time

Sequential Data

Data distribution changes over time.

The predictive model’s output depends on the previous inputs/states.

To model this, we need structures other than plain feedforward networks.

Recurrent Neural Networks

Neural network models that have connections from each layer to itself.
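A minimal sketch of that self-connection: the same cell and hidden state are reused at every time step (the input and hidden sizes below are placeholders):

```python
import torch
import torch.nn as nn

# A vanilla RNN cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
cell = nn.RNNCell(input_size=8, hidden_size=16)

xs = torch.randn(5, 3, 8)   # a sequence of length 5, batch of 3
h = torch.zeros(3, 16)      # initial hidden state
for x_t in xs:              # the hidden state feeds back into the cell at each step
    h = cell(x_t, h)
```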

Recurrent Neural Network - a simple example

Gated Recurrent Units (GRU)

Learning long-term dependencies is hard!

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Forget gate

Store info in cell states

Update cell states

Get output
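Putting the four steps above together, the standard LSTM update (in the notation of the blog post linked above) is:

```latex
\begin{align*}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)        && \text{forget gate} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)        && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t     && \text{cell-state update} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)        && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t)                          && \text{hidden state / output}
\end{align*}
```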

Attention & Transformers
However, RNNs are not the default choice for temporal data anymore!
Attention Model
RNNs

Attention & Transformers
However, RNNs are not the default choice for temporal data anymore!
Attention Model
RNNs

This example would be very hard for a standard RNN to model, because the useful information appears late in the sequence.

Attention & Transformers
With the attention mechanism as a building block, an architecture called the Transformer was proposed.

Attention & Transformers
However, RNNs are not the default choice for temporal data anymore!

Attention and Transformers have been very successful in recent NLP developments.

They show that you don’t really need a complex recurrent structure to perform tasks on sequential data such as language.

GPT models, based on the Transformer structure, demonstrate incredible performance when trained on massive amounts of data.
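The core building block is scaled dot-product attention: every position can attend to every other position directly, without passing information through a recurrent state. A minimal sketch of self-attention (not a full multi-head Transformer layer):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention weights sum to 1
    return weights @ V                                  # weighted sum of the values

# Example: a sequence of 10 tokens with 64-dimensional features.
x = torch.randn(10, 64)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
```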

Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Design a NN model for your problem

Things to consider:

❏ The characteristics of your data
❏ The input/output space of your data
❏ Budget on hardware, training time, ...

Data and Neural Network Models
Static Data -> Convolutional Neural Networks
Dynamic Data -> Recurrent Neural Networks
Unsupervised Data -> Generative Neural Networks

Static Data - Image

Static Data - Translation invariance in images

Static Data - Translation invariance in images

Static Data - Translation invariance in images

Convolutional Networks

Design a NN model for your problem

Things to consider:

❏ The characteristics of your data
❏ The input/output space of your data
❏ Budget on hardware, training time, ...

Predicting a category

Dense predictions

U-Net
Input is a graph

Graph Conv Net

https://github.com/tkipf/gcn
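A hedged sketch of a single graph convolution layer in the spirit of the repository linked above: node features are averaged over their (normalized) neighborhoods and then linearly transformed. This dense-matrix version is for illustration only, not the repository's implementation:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """H' = ReLU(A_norm @ H @ W), with A_norm a normalized adjacency with self-loops."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))          # add self-loops
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))    # symmetric degree normalization
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
        return torch.relu(self.linear(A_norm @ H))  # aggregate neighbors, then transform
```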
Design a NN model for your problem

Things to consider:

❏ The characteristics of your data
❏ The input/output space of your data
❏ Budget on hardware, training time, ...

Binary Networks

https://mohitjain.me/2018/07/14/bnn/
Knowledge Distillation
Train a smaller network (Student) by learning from a more powerful network (Teacher).

https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
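One common way to implement this (a sketch of one variant; the temperature T and the weight alpha are illustrative choices, not values from the lecture) is to mix the usual label loss with a KL term that matches the teacher's softened outputs:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend cross-entropy on the labels with a KL term against the teacher's soft targets."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    return alpha * hard_loss + (1 - alpha) * soft_loss
```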
Neural Network Model Search
Even with a target network architecture in mind, there is still a lot of flexibility in the exact configuration of your neural network.

How many filters?

How many layers?

Hyperparameters
Learning rate: controls how much you update the model’s weights.

Batch size: how many data points you put into the model at a time to calculate the gradient.

Regularization: e.g., weight decay, L1.

Network structure: number of layers, activation functions, etc.

Even the data itself: it is almost always the case that a model gets better with more data.
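One simple way to explore these choices is random search over a small configuration space. In the sketch below, `train_and_evaluate` is a hypothetical stand-in for your own training-and-validation loop:

```python
import random

search_space = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "weight_decay": [0.0, 1e-4, 1e-2],
    "num_layers": [2, 3, 4],
}

best_score, best_config = float("-inf"), None
for _ in range(20):  # 20 random trials
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(config)  # hypothetical: trains a model, returns validation accuracy
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```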
Tools for hyperparameter tuning

Finally...

If neural networks are so magical/capable/powerful, why don’t you train a neural network to find the best neural networks for a task?

In fact, it has been done.

AutoML, Neural Architecture Search, meta-learning…

Google Cloud already provides automated tools.


Automatic Model Search

https://cloud.google.com/automl/
Summary: Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

