
Chapter 1

Introduction to Deep Learning


Exciting Recent Developments


What do these all have in common?


Deep Learning! 😄
https://www.youtube.com/watch?v=0iuCruB1wcs

Google DeepMind

Motivation: ZIP codes

In the 1990s, there was a great increase in documents on paper (mail, checks, books, etc.).

This motivated building a ZIP code recognizer for real U.S. mail for the postal service!


[Image: a handwritten digit, labeled “three”]

How do we know this is a three?

How does a computer know this is a three?


[Figure: the digit as a grid of pixel intensity values (0–255); this is what the computer sees]


Representing digits in the computer

How do we represent digits so that our machine can operate on them?

Represent image colors and intensities in a two-dimensional matrix of numbers (i.e. an image).

0 is white, 255 is black, and numbers in between are shades of gray.

Sometimes the inverse of this scheme is used (0 = black, 255 = white)
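As a minimal illustration of this representation (a sketch assuming NumPy; the numbers are arbitrary):

```python
import numpy as np

# A tiny 4x4 grayscale "image": 0 = white, 255 = black, as in the scheme above
image = np.array([
    [  0,   0,   0,   0],
    [  5, 120, 240,  40],
    [242, 128,   0,   0],
    [255, 240,  10,   0],
], dtype=np.uint8)

print(image.shape)   # (4, 4) -- rows x columns of pixels
print(image[3, 0])   # 255 -> a fully black pixel

# The inverted scheme (0 = black, 255 = white) is just a subtraction
inverted = 255 - image
```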


These numbers are known as pixel values (a grid of discrete samples that make up an image)

[Figure: a handwritten “0” as a grid of pixel values with row and column indices 0–25; the pixel at position [15, 15] is light]

The pixel in position [15, 15] is light.

• 0s often have a light patch in the middle
• Contrast with the digits 1 or 2, which often have darker values at these positions

Can we define a set of heuristics (i.e. rules based on our intuition) to classify digits?

A heuristic for classifying “7”:


[Figure: a handwritten “7” on a 28 × 28 grid with three probe pixels 𝑃1, 𝑃2, 𝑃3 marked]

Digit is a 7 if 𝑃1 > 128 and 𝑃2 > 128 and 𝑃3 > 128
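A sketch of this heuristic in Python (the probe coordinates are illustrative guesses, not the exact pixels marked in the figure):

```python
import numpy as np

def looks_like_seven(image: np.ndarray) -> bool:
    """Return True if the three probe pixels are all dark (> 128).

    The positions below are assumed for illustration; the figure marks
    its own P1, P2, P3.
    """
    p1 = image[5, 8]     # assumed: left end of the top stroke
    p2 = image[5, 18]    # assumed: right end of the top stroke
    p3 = image[20, 12]   # assumed: a point on the diagonal stroke
    return p1 > 128 and p2 > 128 and p3 > 128
```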


But what if...

Slanted digit? Pixel 𝑃3 is no longer dark!

[Figure: a slanted “7” on the same grid with the probe pixels marked]


An Improved Heuristic!

Digit is a 7 if 𝑃1 > 128 and 𝑃2 > 128 and (𝑃3 > 128 or 𝑃4 > 128)

[Figure: the slanted “7” with a fourth probe pixel 𝑃4 added beside 𝑃3]


Not so fast...

Digit shifted up? The pixel values are completely different.

[Figure: the same “7” shifted up on the grid, with probe pixels 𝑃1–𝑃4 marked]


Not as simple as we think!

Distortions, overlapping strokes, underlining, etc.

Heuristics can always be foiled.


Machine Learning Pipeline for Digit Recognition

Dataset → Preprocessing → Train Model → Evaluate Model


MNIST

Modified National Institute of Standards and Technology database

Handwritten digits

0–9 (10 classes)

70,000 images
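A minimal sketch of loading MNIST, assuming TensorFlow/Keras is installed (torchvision or other loaders work just as well):

```python
from tensorflow.keras.datasets import mnist

# Downloads MNIST on first use and returns NumPy arrays
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28) -- 60,000 training images
print(x_test.shape)   # (10000, 28, 28) -- 10,000 test images
print(y_train[0])     # label of the first training image, a digit 0-9
```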


Train, validation, and test sets


• Train set — used to adjust the parameters of the model
• Validation set — used to test how well we’re doing as we develop
  • Prevents overfitting, something you will learn later!
  • Also known as the development set
• Test set — used to evaluate the model once the model is done

[Diagram: the dataset split into a train set, a dev set, and a test set]
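One common way to carve out these splits (a sketch assuming scikit-learn; the 80/10/10 ratio is just an example, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data; in practice x, y would be the MNIST arrays
x = np.random.rand(1000, 28 * 28)
y = np.random.randint(0, 10, size=1000)

# Split off 10% for test, then 1/9 of the remaining 90% (10% overall) for dev
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.10, random_state=0)
x_train, x_dev, y_train, y_dev = train_test_split(x_rest, y_rest, test_size=1/9, random_state=0)

print(len(x_train), len(x_dev), len(x_test))  # 800 100 100
```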



• Each 𝑥 in our dataset is called an input
  • 𝑥 is represented by a 28 × 28 matrix of pixel values, flattened into a one-dimensional vector (more on this later)
• Each 𝑦 in our dataset is called a label
  • 𝑦 is the corresponding answer/classification, one of ten possibilities
• We refer to each (𝑥, 𝑦) as an example
• This is a supervised learning task

[Diagram: training data and training labels feed the model 𝑓̃, whose output feeds a loss function, which in turn drives an optimizer]

A (Temporary) Simplification: Binary Classification

• Classifying MNIST digits requires predicting 1 of 10 possible values
• We’ll first look at a simpler task — a binary classification problem
• Determine whether a handwritten digit is a 2 or not a 2
• The first neural network for binary classification: the Perceptron

[Image: example digits labeled “Is a 2” and “Not a 2”]


The Perceptron
• Input: a vector of numbers 𝐱 = (𝑥1, 𝑥2, … , 𝑥𝑛)
• 𝑤 and 𝑏 are parameters of the perceptron
  • Parameters: values we adjust during learning
• Let Φ = 𝑤 ∪ 𝑏 (the set of all parameters)

[Diagram: an artificial neuron (perceptron): inputs 𝑥1 … 𝑥7 with weights 𝑤1 … 𝑤7 and a bias 𝑏 feeding a summation unit Σ]

Predicting with a Perceptron

✓ Multiply each input 𝑥𝑖 by its corresponding weight 𝑤𝑖, and sum them up.
✓ Add the bias 𝑏.
✓ If the resulting value is greater than 0, return 1; otherwise return 0.

$$f_\Phi(x) = \begin{cases} 1 & \text{if } b + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

✓ As a binary classifier, 1 indicates that 𝑥 is a member of the class and 0, not a member.
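A direct NumPy sketch of this prediction rule (function and variable names are ours, not from the slides):

```python
import numpy as np

def perceptron_predict(x: np.ndarray, w: np.ndarray, b: float) -> int:
    """Return 1 if b + sum_i w_i * x_i > 0, else 0."""
    return 1 if b + np.dot(w, x) > 0 else 0

# Tiny usage example with made-up numbers
x = np.array([0.8, 0.0])
w = np.array([0.5, -0.2])
print(perceptron_predict(x, w, b=0.0))  # 1, since 0.5 * 0.8 = 0.4 > 0
```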
Parameters

• Weights — the importance of each input in determining the output
  • A weight near 0 implies this input has little influence on the output
  • A negative weight means increasing the input will decrease the output
• Bias — the a priori likelihood of the positive class
  • Ensures that even if all inputs are 0, there will be some result value

Bias: Geometric Explanation

The bias is essentially the 𝑏 term in 𝑦 = 𝑚𝑥 + 𝑏.

[Plot: the lines 𝑓(𝑥) = 2𝑥, 𝑓(𝑥) = (3/2)𝑥, and 𝑓(𝑥) = 𝑥 + 1 against sample data; only the line with a bias can fit the data]
Bias as a special type of weight

Another way to think of the bias is to represent it as an extra weight for an input/feature that is always 1:

$$(x_1, x_2, \ldots, x_n) \cdot (w_1, w_2, \ldots, w_n) + b = (x_1, x_2, \ldots, x_n, 1) \cdot (w_1, w_2, \ldots, w_n, b)$$
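A quick numerical check of this identity (a sketch with arbitrary values):

```python
import numpy as np

x = np.array([0.8, 0.0, 0.5])
w = np.array([0.1, -0.3, 0.7])
b = 1.0

lhs = np.dot(x, w) + b                            # bias as a separate term
rhs = np.dot(np.append(x, 1.0), np.append(w, b))  # bias folded in as an extra weight
print(lhs, rhs)  # both print 1.43
```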

Bias as a special type of weight

[Diagram: two equivalent perceptrons; on the left, inputs 𝑥1 … 𝑥7 with weights 𝑤1 … 𝑤7 and a separate bias 𝑏 feeding Σ; on the right, the same perceptron with an extra input fixed at 1 whose weight is 𝑏]


A Binary Perceptron for MNIST


• Inputs 𝑥1, 𝑥2, … , 𝑥𝑛 are all positive
• 𝑛 = 784 (28 × 28 pixel values)
• Output is either 0 or 1
  • 0 → input is not the digit type we’re looking for
  • 1 → input is the digit type we’re looking for

[Diagram: a perceptron with 784 inputs 𝑥1 … 𝑥784, weights 𝑤1 … 𝑤784, and bias 𝑏 feeding Σ]

Training a Perceptron
• How do we set the parameters Φ = 𝑤 ∪ 𝑏 to produce the highest accuracy on our training set?
• Iterate over the training set several times, feeding each training example into the model, producing an output, and adjusting the parameters according to whether that output was right or wrong
• Stop once we either (a) get every training example right or (b) reach 𝑁 iterations, a number set by the programmer
• 𝑁 is known as the number of epochs, where each epoch is one pass through all data points in the training set
• As a general rule of thumb, 𝑁 grows with the number of parameters



The Perceptron Learning Algorithm


1. Set the 𝑤’s to 0.
2. For 𝑁 iterations, or until the weights do not change:
   a. For each training example 𝐱ᵏ with label 𝑎ᵏ:
      i. If 𝑎ᵏ − 𝑓(𝐱ᵏ) = 0, continue.
      ii. Else, for all weights 𝑤ᵢ: Δ𝑤ᵢ = (𝑎ᵏ − 𝑓(𝐱ᵏ)) ∙ 𝑥ᵢᵏ

where
• 𝑏 = bias
• 𝑤 = weights
• 𝑁 = maximum number of training iterations
• 𝐱ᵏ = the kth training example
• 𝑎ᵏ = the label for the kth example
• 𝑤ᵢ = the weight for the ith input, where 𝑖 ≤ 𝑛
• 𝑛 = the number of pixels per image
• 𝑥ᵢᵏ = the ith input of the kth example, where 𝑖 ≤ 𝑛
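A runnable sketch of this algorithm in NumPy, treating the bias as an extra weight on a constant input of 1 as in the earlier slide (all names are ours):

```python
import numpy as np

def train_perceptron(X, a, n_epochs):
    """Perceptron learning. X is an (m, n) array of examples; a is an (m,)
    array of 0/1 labels; n_epochs is N, the maximum number of iterations."""
    X = np.hstack([X, np.ones((len(X), 1))])  # bias as a weight on a constant 1
    w = np.zeros(X.shape[1])                  # step 1: set the w's (and b) to 0
    for _ in range(n_epochs):                 # step 2: N iterations...
        changed = False
        for xk, ak in zip(X, a):              # step 2a: each example x^k, a^k
            fk = 1 if np.dot(w, xk) > 0 else 0
            if ak - fk != 0:                  # step 2a.ii: update on a mistake
                w += (ak - fk) * xk           # delta w_i = (a^k - f(x^k)) * x_i^k
                changed = True
        if not changed:                       # ...or until weights stop changing
            break
    return w[:-1], w[-1]                      # (weights, bias)

# Usage on toy data: learn "the first feature is large"
X = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
a = np.array([1, 1, 0, 0])
w, b = train_perceptron(X, a, n_epochs=20)
print(w, b)
```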



• If the output of our model matches the label, we continue:
  • If the correct label is 1 and our output is 1: 1 − 1 = 0
  • If the correct label is 0 and our output is 0: 0 − 0 = 0



• If our label 𝑎ᵏ is a 1 and our model’s output is a 0, we update the ith weight by:
  • (1 − 0) ∙ 𝑥ᵢᵏ = 𝑥ᵢᵏ
  • The output was 0 and should have been 1, so make the output more positive
• If our label 𝑎ᵏ is a 0 and our model’s output is a 1, we update the ith weight by:
  • (0 − 1) ∙ 𝑥ᵢᵏ = −𝑥ᵢᵏ
  • The output was 1 and should have been 0, so make the output more negative


Example: Predict whether a digit is a “2”

Just look at the effect of these two pixels:

[Image: a handwritten “2” with two probe pixels marked: 𝑥1 = 0.8 and 𝑥2 = 0]



• Start off training with all parameters at 0, so 𝑤1 = 0, 𝑤2 = 0, and 𝑏 = 0
• 𝑓(𝑥) = (𝑤1 ∙ 𝑥1 + 𝑤2 ∙ 𝑥2 + 𝑏) > 0
• 𝑓(𝑥) = (0 ∙ 0.8 + 0 ∙ 0 + 0 ∙ 1) > 0
• Return 0 because the value is not greater than 0
• Predict that it is not a 2!
• Correct answer: it is a 2...
• Parameter update:
  • Δ𝑤1 = (1 − 0) ∙ 0.8 = 0.8
  • Δ𝑤2 = (1 − 0) ∙ 0 = 0
  • Δ𝑏 = (1 − 0) ∙ 1 = 1
• Now:
  • 𝑤1 = 0.8
  • 𝑤2 = 0
  • 𝑏 = 1

Next example:

[Image: a different handwritten digit with the same two probe pixels: 𝑥1 = 0.9 and 𝑥2 = 0.9]



• At the end of the last iteration:
  • 𝑤1 = 0.8, 𝑤2 = 0, and 𝑏 = 1
• 𝑓(𝑥) = (𝑤1 ∙ 𝑥1 + 𝑤2 ∙ 𝑥2 + 𝑏) > 0
• 𝑓(𝑥) = (0.8 ∙ 0.9 + 0 ∙ 0.9 + 1 ∙ 1) > 0
• Return 1 because the value is greater than 0
• Predict that it is a 2!
• Correct answer: it is not a 2...
• Parameter update:
  • Δ𝑤1 = (0 − 1) ∙ 0.9 = −0.9
  • Δ𝑤2 = (0 − 1) ∙ 0.9 = −0.9
  • Δ𝑏 = (0 − 1) ∙ 1 = −1
• Now:
  • 𝑤1 = 0.8 − 0.9 = −0.1
  • 𝑤2 = 0 − 0.9 = −0.9
  • 𝑏 = 1 − 1 = 0
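The two updates above can be traced mechanically (a sketch; the data are just the two examples from these slides):

```python
import numpy as np

w, b = np.zeros(2), 0.0
examples = [(np.array([0.8, 0.0]), 1),  # is a 2:     x1 = 0.8, x2 = 0
            (np.array([0.9, 0.9]), 0)]  # is not a 2: x1 = 0.9, x2 = 0.9

for x, label in examples:
    f = 1 if np.dot(w, x) + b > 0 else 0
    w += (label - f) * x    # delta w_i = (a - f(x)) * x_i
    b += (label - f) * 1    # the bias input is the constant 1
    print(w, b)
# After example 1: w = [0.8, 0.0], b = 1.0
# After example 2: w = [-0.1, -0.9], b = 0.0
```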

Using multiple perceptrons

• We can extend perceptrons to multi-class problems by creating 𝑛 perceptrons, where 𝑛 = the number of classes
• For MNIST, we would have 10 perceptrons
• Each individual perceptron returns a value, so our model will return the class whose perceptron value is the highest
• Here, “perceptron value” refers to the value of the weighted sum before being thresholded
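A sketch of ten perceptrons sharing inputs, stacked into one weight matrix (shapes and names are ours):

```python
import numpy as np

n_classes, n_pixels = 10, 784
W = np.zeros((n_classes, n_pixels))  # one row of weights per perceptron
b = np.zeros(n_classes)              # one bias per perceptron

def predict_digit(x: np.ndarray) -> int:
    """Return the class whose pre-threshold weighted sum is highest."""
    values = W @ x + b               # all 10 perceptron values at once
    return int(np.argmax(values))

x = np.random.rand(n_pixels)         # a stand-in flattened image
print(predict_digit(x))
```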


[Diagram: two separate perceptrons over the same inputs: one predicting whether the handwritten digit is a 0 and one predicting whether it is a 9]



[Diagram: three separate perceptrons (left) redrawn as three perceptrons sharing the same inputs (right), with weights 𝑤ᵢ,ⱼ from input 𝑖 to output 𝑗]


Activation Functions

1. Binary Step Function

$$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$


Activation Functions

2. Linear Function

$$f(x) = ax$$


Activation Functions

3. Sigmoid

$$f(x) = \frac{1}{1 + e^{-x}}$$


Activation Functions

4. Tanh

$$\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$$


Activation Functions

5. ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x)$$


Activation Functions

6. Leaky ReLU
$$f(x) = \begin{cases} x, & x \ge 0 \\ 0.01x, & x < 0 \end{cases}$$


Activation Functions

7. Parameterised ReLU

$$f(x) = \begin{cases} x, & x \ge 0 \\ ax, & x < 0 \end{cases}$$


Activation Functions

8. Exponential Linear Unit

$$f(x) = \begin{cases} x, & x \ge 0 \\ a(e^x - 1), & x < 0 \end{cases}$$


Activation Functions

9. Swish Function
$$f(x) = x \cdot \mathrm{sigmoid}(x) = \frac{x}{1 + e^{-x}}$$


Activation Functions

10. Softmax Function
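The slide gives no formula; for reference, softmax turns a vector of scores into probabilities:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

A compact NumPy sketch of all ten activations (𝑎 in the parameterised variants is a free parameter; the defaults here are illustrative):

```python
import numpy as np

def binary_step(x):   return np.where(x >= 0, 1.0, 0.0)
def linear(x, a=1.0): return a * x
def sigmoid(x):       return 1.0 / (1.0 + np.exp(-x))
def tanh(x):          return 2.0 * sigmoid(2.0 * x) - 1.0  # same as np.tanh(x)
def relu(x):          return np.maximum(0.0, x)
def leaky_relu(x):    return np.where(x >= 0, x, 0.01 * x)
def prelu(x, a=0.1):  return np.where(x >= 0, x, a * x)    # a is learned in practice
def elu(x, a=1.0):    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))
def swish(x):         return x * sigmoid(x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1
```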


Choosing the right Activation Function


• Sigmoid functions and their combinations generally work better in the case of
classifiers.
• Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient
problem.
• ReLU function is a general activation function and is used in most cases these days.
• If we encounter a case of dead neurons in our networks the leaky ReLU function is
the best choice.
• Always keep in mind that ReLU function should only be used in the hidden layers.
• As a rule of thumb, you can begin with using ReLU function and then move over to
other activation functions in case ReLU doesn’t provide with optimum results.