
Fall 2020 Applied Machine Learning

Guest Lecture

Neural Network Architectures


Jin Sun

Oct. 2020
Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Recap: Neural Networks and MLP

Neural networks encode a function mapping from input to output.

Such a function is a composition of multiple simpler functions, one for each layer.


Recap: Neural Networks and MLP

Multilayer Perceptrons:
Multiple fully connected layers (at least an input, a hidden, and an output layer).
Non-linear activations (sigmoid, ReLU, etc.).
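As a minimal sketch of the recap above (the layer sizes are placeholders, not values from the lecture), an MLP with one hidden layer could look like this in PyTorch:

```python
import torch
import torch.nn as nn

# A minimal MLP: input -> hidden (with a non-linearity) -> output.
# The sizes (784, 128, 10) are placeholder values for illustration.
mlp = nn.Sequential(
    nn.Linear(784, 128),  # fully connected: input -> hidden
    nn.ReLU(),            # non-linear activation
    nn.Linear(128, 10),   # fully connected: hidden -> output
)

x = torch.randn(32, 784)   # a batch of 32 flattened inputs
logits = mlp(x)            # shape: (32, 10)
```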
Recap: Neural Network Layers and Representation

Why so many NN architectures?

https://arxiv.org/abs/1605.07678
Why so many NN architectures?
Theoretically, a network with a single hidden layer and enough neurons can fit any function.

When we refer to NN architectures, we mean the number of layers, the activation functions, and the connectivity patterns in a model.

● Performance: deeper networks -> better performance
● Efficiency: the same performance with a smaller footprint
● Robustness: more stable to train, e.g., normalization layers
● Task/data-dependent: better suited to the task, e.g., LSTMs for sequences
Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Recap: Convolutional Neural Networks

A convolutional layer with 5 filters produces 5 activation maps (e.g., an output of size 5x28x28).
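To see the "5 filters -> 5 activation maps" relationship concretely, here is a minimal sketch; the 3-channel 28x28 input and the padding are assumptions made for illustration:

```python
import torch
import torch.nn as nn

# 5 filters over a 3-channel input; the padding keeps the 28x28 spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5, padding=2)

x = torch.randn(1, 3, 28, 28)   # one 3x28x28 input image
out = conv(x)
print(out.shape)                # torch.Size([1, 5, 28, 28]) -> 5 activation maps
```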

Recap: Convolutional Neural Networks

Recap: AlexNet

5 conv layers: convolution + ReLU + max pooling

3 fully connected layers
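A hedged sketch of the pattern described above: one convolution + ReLU + max-pooling stage, followed by fully connected layers. The channel counts and sizes are placeholders, not AlexNet's actual configuration:

```python
import torch.nn as nn

# One conv stage: convolution + ReLU + max pooling (AlexNet stacks 5 conv layers).
stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

# After the conv stages, the feature maps are flattened and passed through
# fully connected layers to produce class scores (AlexNet uses 3 of them).
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 256),  # placeholder sizes for a 32x32 input
    nn.ReLU(),
    nn.Linear(256, 10),
)
```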


VGG

Simply more layers.

Smaller conv filters (3x3).

Performance: 7.3% error, compared to AlexNet's 16.4% (lower is better).

https://arxiv.org/abs/1409.1556
VGG

Simply more layers.

Smaller conv filters (3x3).

Performance: 7.3% error, compared to AlexNet's 16.4% (lower is better).

Stacked small filters have a similar receptive field as a single large filter, but have (as sketched below):
1) Fewer parameters
2) More nonlinearity
https://arxiv.org/abs/1409.1556
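The claim about stacked filters can be checked directly. A small sketch (the channel count C is an arbitrary example) comparing one 5x5 convolution with two stacked 3x3 convolutions that cover the same 5x5 receptive field:

```python
import torch.nn as nn

C = 64  # example channel count

# One 5x5 convolution: C * C * 5 * 5 weights (plus biases).
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

# Two stacked 3x3 convolutions with a ReLU in between:
# same 5x5 receptive field, fewer weights, more non-linearity.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(single_5x5))   # 102464
print(n_params(stacked_3x3))  # 73856
```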
If more layers are better, can we just stack an arbitrarily large number of layers?

VGG-infinity??

...

Training a 56-layer network vs. a 20-layer network

Note that this is not overfitting: the deeper network's training error is also worse.
ResNet
This doesn’t make sense.

The deeper network should be able to learn everything the shallow network can.

In fact, isn’t the shallow network ‘part of’ the deeper network, obtained by setting some layers to identity functions?

Unless such an identity mapping is actually hard to learn when training the deeper networks.

https://arxiv.org/abs/1512.03385
Residual Block
Then let’s help the network by building in the identity mapping.
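A minimal sketch of the residual block idea: the layers learn a residual F(x) and the input is added back, so the identity mapping only requires F(x) ≈ 0. This simplified version omits the strided and projection variants used in the paper:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes ReLU(F(x) + x), where F is two 3x3 conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: the identity comes for free
```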

ResNet

ResNet-50

ResNet-101

ResNet-152

Achieves 3.6% error.

DenseNet
Within a block, each layer is connected to every other layer: it receives the feature maps of all preceding layers as input.

https://arxiv.org/abs/1608.06993
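A hedged sketch of the dense connectivity idea, where each layer takes the concatenation of all earlier feature maps in the block; the growth rate and number of layers are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer receives the concatenated outputs of all previous layers."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels_in = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels_in),
                nn.ReLU(),
                nn.Conv2d(channels_in, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # use all previous feature maps
            features.append(out)
        return torch.cat(features, dim=1)
```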
Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Sequential Data
Data distribution changes over time

Sequential Data: same value at this point, but going up or going down?

Data distribution changes over time

Sequential Data
Data distribution changes over time

Sequential Data: similar sound, similar meaning?

Data distribution changes over time

Sequential Data
Data distribution changes over time

Sequential Data: going forward or backward?

Data distribution changes over time

Learning and Using the Arrow of Time, CVPR 2018


Sequential Data
Data distribution changes over time

Sequential Data
Data distribution changes over time

Sequential Data

Data distribution changes over time.

The predictive model’s output depends on the previous inputs/states.

To model this, we need structures other than plain feedforward networks.

Recurrent Neural Networks

Neural network models that have connections from each layer to itself.
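A minimal sketch of that self-connection: the same cell and hidden state are reused at every time step (the input and hidden sizes below are placeholders):

```python
import torch
import torch.nn as nn

# A vanilla RNN cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
cell = nn.RNNCell(input_size=8, hidden_size=16)

xs = torch.randn(5, 3, 8)   # a sequence of length 5, batch of 3
h = torch.zeros(3, 16)      # initial hidden state
for x_t in xs:              # the hidden state feeds back into the cell at each step
    h = cell(x_t, h)
```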

Recurrent Neural Network - a simple example

Gated Recurrent Units (GRU)

Learning long-term dependencies is hard!

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Forget gate

Store info in cell states

Update cell states

Get output
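Putting the four steps above together, the standard LSTM update (in the notation of the blog post linked above) is:

```latex
\begin{align*}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)        && \text{forget gate} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)        && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t     && \text{cell-state update} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)        && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t)                          && \text{hidden state / output}
\end{align*}
```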

Attention & Transformers
However, RNNs are not the default choice for temporal data anymore!
Attention Model
RNNs

Attention & Transformers
However, RNNs are not the default choice for temporal data anymore!
Attention Model
RNNs

This example would be very hard for a standard RNN to model, because the useful information appears late in the sequence.

Attention & Transformers
With the attention mechanism as a building block, an architecture called the Transformer was proposed.

Attention & Transformers
However, RNNs are not the default choice for temporal data anymore!

Attention and Transformers have been very successful in recent NLP developments.

They show that you don’t really need a complex recurrent structure to perform tasks on sequential data such as language.

GPT models, based on the Transformer structure, demonstrate incredible performance when trained on massive amounts of data.
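The core building block is scaled dot-product attention: every position can attend to every other position directly, without passing information through a recurrent state. A minimal sketch of self-attention (not a full multi-head Transformer layer):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention weights sum to 1
    return weights @ V                                  # weighted sum of the values

# Example: a sequence of 10 tokens with 64-dimensional features.
x = torch.randn(10, 64)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
```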

Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

Design a NN model for your problem

Things to consider:

❏ The characteristics of your data
❏ The input/output space of your data
❏ Budget on hardware, training time, ...

Data and Neural Network Models
Static Data -> Convolutional Neural Networks
Dynamic Data -> Recurrent Neural Networks
Unsupervised Data -> Generative Neural Networks

Static Data - Image

Static Data - Translation invariance in images

Static Data - Translation invariance in images

Static Data - Translation invariance in images

Convolutional Networks

Design a NN model for your problem

Things to consider:

❏ The characteristics of your data
❏ The input/output space of your data
❏ Budget on hardware, training time, ...

Predicting a category

Dense predictions

U-Net
Input is a graph

Graph Conv Net

https://github.com/tkipf/gcn
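A hedged sketch of a single graph convolution layer in the spirit of the repository linked above: node features are averaged over their (normalized) neighborhoods and then linearly transformed. This dense-matrix version is for illustration only, not the repository's implementation:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """H' = ReLU(A_norm @ H @ W), with A_norm a normalized adjacency with self-loops."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))          # add self-loops
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))    # symmetric degree normalization
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
        return torch.relu(self.linear(A_norm @ H))  # aggregate neighbors, then transform
```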
Design a NN model for your problem

Things to consider:

❏ The characteristics of your data
❏ The input/output space of your data
❏ Budget on hardware, training time, ...

Binary Networks

https://mohitjain.me/2018/07/14/bnn/
Knowledge Distillation
Train a smaller network (Student) by learning from a more powerful network (Teacher).

https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
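One common way to implement this (a sketch of one variant; the temperature T and the weight alpha are illustrative choices, not values from the lecture) is to mix the usual label loss with a KL term that matches the teacher's softened outputs:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend cross-entropy on the labels with a KL term against the teacher's soft targets."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    return alpha * hard_loss + (1 - alpha) * soft_loss
```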
Neural Network Model Search
Even with a target network architecture in mind, there is still a lot of flexibility in the exact configuration of your neural network.

How many filters?

How many layers?

Hyperparameters
Learning rate: controls how much you update the model’s weights.

Batch size: how many data points you put into the model at a time to calculate the gradient.

Regularization: e.g., weight decay, L1.

Network structure: number of layers, activation functions, etc.

Even the data itself: it is almost always the case that a model gets better with more data.
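One simple way to explore these choices is random search over a small configuration space. In the sketch below, `train_and_evaluate` is a hypothetical stand-in for your own training-and-validation loop:

```python
import random

search_space = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "weight_decay": [0.0, 1e-4, 1e-2],
    "num_layers": [2, 3, 4],
}

best_score, best_config = float("-inf"), None
for _ in range(20):  # 20 random trials
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(config)  # hypothetical: trains a model, returns validation accuracy
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```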
Tools for hyperparameter tuning

Finally...

If neural networks are so magical/capable/powerful, why don’t you train a neural network to find the best neural networks for a task?

In fact, it has been done.

AutoML, Neural Architecture Search, meta-learning…

Google Cloud already provides automated tools.


Automatic Model Search

https://cloud.google.com/automl/
Summary: Neural Network Architectures

● CNN Architectures
● RNN Architectures
● Neural Network Architecture Design

Based on course materials from http://cs231n.github.io/ and the Dive into Deep Learning book: https://d2l.ai/

