
Deep Learning for

Computer Vision

CHAPTER 2

INTRODUCTION TO
NEURAL NETWORK

Prof. G.S. Jison Hsu 徐繼聖


• Artificial Vision Laboratory
• National Taiwan University of
Science and Technology

Deep Learning for Computer Vision


Outline
1) What is a Neural Network?
2) Single-Layer Perceptron
3) Feedforward / Multi-Layer Perceptron (MLP)
4) What is an Activation Function?
5) Gradient Descent
6) Backpropagation
7) Fully / Locally Connected Layer
8) Convolutional Layer
9) Classification Loss – Softmax and Cross-Entropy

Deep Learning for Computer Vision 2


What is a Neural Network?

https://reurl.cc/eOyQNM [19:13]
Deep Learning for Computer Vision 3
https://reurl.cc/QZ3VD5
Deep Learning for Computer Vision 4
Every pixel (neuron) holds a number; the number stored in a neuron is called its activation.

Deep Learning for Computer Vision 5


• Every neuron holds a number between 0 and 1.
• The input image is flattened into 784 dimensions (28×28 pixels) and fed into the
neural network.
• The neurons in the last layer represent the predicted classification result.

Deep Learning for Computer Vision 6


• Hidden layers sit between the input and output layers and process the data by
applying complex non-linear functions to it.
• This example has 2 hidden layers, and each hidden layer has 16 neurons.
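As a concrete illustration (a sketch, not part of the original slides), a network of this shape can be written in PyTorch as follows; the sizes 784, 16, 16, and 10 come from the example above.

import torch.nn as nn

# 784-dim flattened input -> two hidden layers of 16 neurons -> 10 output classes
model = nn.Sequential(
    nn.Linear(784, 16), nn.Sigmoid(),
    nn.Linear(16, 16), nn.Sigmoid(),
    nn.Linear(16, 10),
)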

Deep Learning for Computer Vision 7


• Each layer learns patterns from the input image.
• Different layers learn different patterns.
• Each neuron's value is a weighted sum of the activations in the previous layer.

Deep Learning for Computer Vision 8


• Activation functions introduce non-linearity into the otherwise linear outputs of the
network.
• An activation function defines the output of a layer for a given input, effectively
setting the threshold that decides whether information is passed on.

https://reurl.cc/Do9D2d [17:03-18:30]
Deep Learning for Computer Vision 9
Sigmoid Activation Function

• In order to map predicted values to probabilities, we use the Sigmoid function.
• The function maps any real value into a value between 0 and 1.
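A minimal sketch of the sigmoid, σ(x) = 1 / (1 + e^(−x)), assuming NumPy is available:

import numpy as np

def sigmoid(x):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))   # 0.5
print(sigmoid(5.0))   # ~0.993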

https://reurl.cc/Do9D2d
Deep Learning for Computer Vision 10
Sigmoid Activation Function

Deep Learning for Computer Vision 11


Bias
• Bias is employed to adjust the outcome, enabling the model to shift the activation
function towards either the positive or the negative side.
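For illustration only (not from the slide): a neuron computes a = σ(w·x + b), so changing the bias b shifts the point where the activation "turns on" for the same weights.

import numpy as np

def neuron(x, w, b):
    # Weighted sum plus bias, passed through the sigmoid activation
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, 2.0]); w = np.array([0.5, -0.3])
print(neuron(x, w, b=0.0))   # ~0.475
print(neuron(x, w, b=2.0))   # ~0.87, the bias shifts the activation toward the positive side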

Deep Learning for Computer Vision 12


• Learning involves finding the optimal weights and biases in the
neural network.

Deep Learning for Computer Vision 13


Gradient Descent

https://www.youtube.com/watch?v=IHZwWFHWa-w [21:00]
Deep Learning for Computer Vision 14
Training progress
The testing data is used to calculate the prediction accuracy.

Deep Learning for Computer Vision 15


Training progress
We aim to achieve maximum accuracy while minimizing loss.

Deep Learning for Computer Vision 16


Calculate the cost (loss)

Deep Learning for Computer Vision 17


Optimize the neural network
• Use the cost to optimize the neural network.

Deep Learning for Computer Vision 18


Gradient Descent
• Treat all the weights and biases as a single input vector, and use the calculated
gradient to minimize the cost (loss).

Deep Learning for Computer Vision 19


Gradient Descent

Deep Learning for Computer Vision 20


Gradient Descent
• Repeat the gradient descent for each training step, minimize the cost,
and get the best network.
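A toy sketch of one-dimensional gradient descent (illustrative only, not from the slides): repeatedly step opposite to the gradient of the cost until the cost stops decreasing.

# Minimize cost(w) = (w - 3)^2 by gradient descent
w = 0.0
lr = 0.1                     # learning rate
for step in range(100):
    grad = 2 * (w - 3)       # derivative of the cost w.r.t. w
    w = w - lr * grad        # move against the gradient
print(w)                     # converges toward 3, where the cost is minimal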

Deep Learning for Computer Vision 21


What is Backpropagation?

Deep Learning for Computer Vision https://reurl.cc/l7dR09 22


What is Backpropagation?

https://www.youtube.com/watch?v=Ilg3gGewQ5U [13:53]
Deep Learning for Computer Vision 23
Backpropagation

https://reurl.cc/QZ3VD5
Deep Learning for Computer Vision 24
Backpropagation

Deep Learning for Computer Vision 25


Example 2.1
Please download 2-1_Activation_Function.zip and unzip it.
• Given an architecture with input size=2, hidden layer=2, and output size=1.
• Each layer is followed by the Leaky ReLU activation function.
• For more activation functions, please refer to the following link:
https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity
• This example shows how to use the activation function (see the sketch below).
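Since the zip file is not reproduced here, the following is only a guess at what a 2-2-1 network with Leaky ReLU might look like in PyTorch:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 2)      # input size = 2, hidden layer = 2
        self.fc2 = nn.Linear(2, 1)      # output size = 1
        self.act = nn.LeakyReLU(negative_slope=0.01)

    def forward(self, x):
        x = self.act(self.fc1(x))
        return self.act(self.fc2(x))

print(TinyNet()(torch.randn(4, 2)).shape)   # torch.Size([4, 1])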

Deep Learning for Computer Vision 26


Example 2.1

Change this to use the Leaky ReLU activation function.

Deep Learning for Computer Vision 27


Example 2.2 Train an MLP Model
Please download 2-2_MLP_MNIST.zip and unzip it.
• Given an architecture with input size=28x28, hidden layer=3, and output size=10.
• The 1st hidden layer includes 500 neurons, the 2nd layer includes 250 neurons and the 3rd
layer includes 125 neurons.
• Each layer is followed by the ReLU activation function.
• Use the MNIST dataset as the input data: 90% for training and 10% for testing.
• Please follow the instructions to train an MLP-based handwritten digits classifier.

Deep Learning for Computer Vision 28


Example 2.2 Train an MLP Model

Normalize the image

Download the MNIST dataset for training and testing
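The screenshot is not reproduced here; a typical torchvision version of these two steps might look like the following (the normalization constants and the use of the standard train/test split, rather than the 90%/10% split mentioned earlier, are assumptions):

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),                       # convert to a [0, 1] tensor
    transforms.Normalize((0.1307,), (0.3081,)),  # normalize the image (assumed constants)
])
train_set = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_set = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)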

Deep Learning for Computer Vision 29


Example 2.2 Train an MLP Model

Define the MLP network

Use the ReLU activation function
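A sketch of what the pictured MLP definition plausibly contains (500/250/125 hidden neurons with ReLU, per the slide; the exact layout is an assumption):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28 * 28, 500), nn.ReLU(),
            nn.Linear(500, 250), nn.ReLU(),
            nn.Linear(250, 125), nn.ReLU(),
            nn.Linear(125, 10),              # 10 output classes for the MNIST digits
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)            # flatten the 28x28 image to 784
        return self.net(x)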

Deep Learning for Computer Vision 30


Fully Connected Layer

Deep Learning for Computer Vision


https://reurl.cc/QZ3VKq 31
Locally Connected Layer

https://reurl.cc/QZ3VKq
Deep Learning for Computer Vision 32
Convolutional Layer

https://reurl.cc/QZ3VKq
Deep Learning for Computer Vision 33
Convolutional Neural Networks

Deep Learning for Computer Vision 34


Convolutional Neural Networks (CNNs) Explained

https://reurl.cc/KQlDlj [8:36]
Deep Learning for Computer Vision 35
Convolution Animations
N.B.: Blue maps are inputs, and cyan maps are outputs.

No padding, no strides · Arbitrary padding, no strides · Half padding, no strides · Full padding, no strides

No padding, strides · Padding, strides · Padding, strides (odd)


Deep Learning for Computer Vision https://reurl.cc/qNvXYD 36
Strided Convolutions – Andrew Ng

https://reurl.cc/ZbVMol [10:44]
Deep Learning for Computer Vision 37
Pooling Layer
• Pooling is the operation of down-sampling the input. There are two reasons to use
pooling:
➢To reduce the input size.
➢To achieve local translation and rotation invariance.

0.5x down-sampling
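A quick illustration (not from the slides) of 2×2 max pooling halving the spatial size in PyTorch:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 0.5x down-sampling
x = torch.randn(1, 1, 28, 28)
print(pool(x).shape)                           # torch.Size([1, 1, 14, 14])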

Deep Learning for Computer Vision 38


Max Pooling in Convolutional Neural Networks

https://reurl.cc/4pd873 [10:49]
Deep Learning for Computer Vision 39
Pooling : Average Pooling

Computing the output values of a 3×3 average pooling operation on a 5×5 input using 1×1 strides.
Deep Learning for Computer Vision 40
Pooling : Max Pooling

Computing the output values of a 3×3 max pooling operation on a 5×5 input using 1×1 strides.
Deep Learning for Computer Vision 41
Example 2.3 Convolution Filter
1. Please download 2-3_Convolution_Example.zip and unzip it.
2. Given a face image Iimg, please generate Fconv, a 5x5 convolution filter with random
coefficients in the range [0, 1].
3. Apply “zero padding” on the given image, and convolve the image with unit stride.
4. Compute the output by convolving Iimg with Fconv and calculate the MSE loss (i.e.,
information loss) between the original image and the one after convolving.
KERNAL_SIZE = 5
STRIDE = 1
PADDING = (KERNAL_SIZE - STRIDE)/2
PADDING = int(PADDING)
img = cv2.imread('./006_01_01_051_08.png')
img = cv2.resize(img,(28,32))
img = cv2.cvtColor(img,cv2.COLOR_RGB2GRAY)
Conv_Filter = np.random.rand(KERNAL_SIZE,KERNAL_SIZE)
Conv_Filter = Conv_Filter/np.sum(Conv_Filter)
img_F = img

[Figure: Input Image (Iimg) and 5x5 Conv kernel (Fconv)]

Deep Learning for Computer Vision 42


Example 2.3 Convolution Filter

KERNAL_SIZE = 5
STRIDE = 1
PADDING = (KERNAL_SIZE - STRIDE)/2
PADDING = int(PADDING)
img = cv2.imread('./006_01_01_051_08.png')
img = cv2.resize(img,(28,32))
img = cv2.cvtColor(img,cv2.COLOR_RGB2GRAY)
Conv_Filter = np.random.rand(KERNAL_SIZE,KERNAL_SIZE)
#Conv_Filter = np.random.normal(mean,std,(KERNAL_SIZE,KERNAL_SIZE)) #Normal distribution
Conv_Filter = Conv_Filter/np.sum(Conv_Filter)
img_F = img

cv2.cvtColor(): converts an image from one color space to another.
np.random.rand(): returns samples drawn uniformly from [0, 1); the commented-out
np.random.normal() alternative draws samples from a normal distribution instead.

Deep Learning for Computer Vision 43


Example 2.3 Convolution Filter
for h in range(int((H-KERNAL_SIZE)/STRIDE)+1):
    for w in range(int((W-KERNAL_SIZE)/STRIDE)+1):
        aa = img_F[h*STRIDE:h*STRIDE + KERNAL_SIZE, w*STRIDE:w*STRIDE + KERNAL_SIZE]*Conv_Filter
        new_feature[h,w] = np.sum(aa)

        img_S = img_F.astype(np.uint8)
        img_new = new_feature.astype(np.uint8)
        cv2.rectangle(img_S, (int(w*STRIDE), int(h*STRIDE)), (int(w*STRIDE + KERNAL_SIZE), int(h*STRIDE + KERNAL_SIZE)), (255, 0, 0), 1)
        cv2.namedWindow('Conv_process', cv2.WINDOW_NORMAL)
        cv2.resizeWindow("Conv_process", 300, 300)
        cv2.imshow('Conv_process',img_S)
        cv2.namedWindow('Conv_result', cv2.WINDOW_NORMAL)
        cv2.resizeWindow("Conv_result", 300, 300)
        cv2.imshow('Conv_result',img_new)
        cv2.waitKey(100)

Element-wise multiplication of the image patch with the Conv filter, followed by a
summation, generates each output feature value.
np.astype(): copy of the array, cast to a specified type.
cv2.rectangle(): used to draw a rectangle on an image.
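Step 4 asks for the MSE (information) loss between the original and convolved images; it is not shown in the snippet above, so this is one possible way to compute it, assuming numpy is imported as np and that the zero padding kept new_feature the same size as img_F:

# MSE between the original grayscale image and the convolved output
mse_loss = np.mean((img_F.astype(np.float32) - new_feature.astype(np.float32)) ** 2)
print('MSE loss:', mse_loss)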
Deep Learning for Computer Vision 44
Example 2.4 Max Pooling
1. Please download 2-4_Pooling_Example.zip and unzip it.
2. Given a face image Iimg, please generate Fconv, a 5x5 convolution filter with random
coefficients drawn from a normal distribution (std = 0, mean = 2).
3. Use "same" zero padding with unit stride for the convolution kernel Fconv.
4. Use "max pooling" with a 3x3 kernel size and unit stride (see the sketch below).
5. Please compute the output by convolving Iimg with Fconv and calculate the MSE loss (i.e.,
information loss) between the original image and the one after convolving.
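The pooling step might be implemented with plain NumPy loops, in the same style as the convolution in Example 2.3; this is a sketch under that assumption, not the zip file's code:

import numpy as np

conv_out = np.random.rand(32, 28)              # stand-in for the convolved image (Iimg * Fconv)
POOL_SIZE, POOL_STRIDE = 3, 1
H, W = conv_out.shape
out_h = (H - POOL_SIZE) // POOL_STRIDE + 1
out_w = (W - POOL_SIZE) // POOL_STRIDE + 1
pooled = np.zeros((out_h, out_w))
for h in range(out_h):
    for w in range(out_w):
        window = conv_out[h*POOL_STRIDE:h*POOL_STRIDE + POOL_SIZE,
                          w*POOL_STRIDE:w*POOL_STRIDE + POOL_SIZE]
        pooled[h, w] = np.max(window)          # max pooling keeps the largest value in each window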

Convolution, Max Pooling
Use max pooling
Input Image → Output Image

Deep Learning for Computer Vision 45


Which Activation Function Should I Use?

https://www.youtube.com/watch?v=-7scQpJT7uo [8:58]
Deep Learning for Computer Vision 46
ReLU Activation Function

• The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero,
and f(z) is equal to z when z is greater than or equal to zero.

• Issue :
➢ All the negative values become zero, which decreases the ability of the
model to fit or train from the data properly.

https://www.youtube.com/watch?v=-7scQpJT7uo
Deep Learning for Computer Vision 47
Leaky ReLU Activation Function
It is an attempt to solve the dying ReLU problem.

The leak helps to increase the range of the ReLU function.


When a is not 0.01 then it is called Randomized ReLU.
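Both functions can be written in a couple of lines of NumPy (illustrative sketch):

import numpy as np

def relu(z):
    return np.maximum(0, z)            # zero for z < 0, z otherwise

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)   # the small slope a keeps negative inputs from "dying"

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(leaky_relu(np.array([-2.0, 3.0])))   # [-0.02  3.]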
Deep Learning for Computer Vision https://www.youtube.com/watch?v=-7scQpJT7uo 48
Loss Functions Explained

https://www.youtube.com/watch?v=IVVVjBSk9N0 [12:55]
Deep Learning for Computer Vision 49
Loss Functions
• The loss function is used to measure the performance of a prediction model in
terms of the difference between the prediction and the ground-truth output.
• From a very simplified perspective, the loss function J can be defined as a function
of the prediction and the ground-truth output.

https://reurl.cc/V4XQpY
Deep Learning for Computer Vision 50
Common Loss Functions

https://www.renom.jp/notebooks/tutorial/basic_algorithm/lossfunction/notebook.html
Deep Learning for Computer Vision 51
Classification Loss
• When a neural network is used to predict a
class label, we can consider it a classification
model.
• The number of the output nodes is the number
of classes in the data.
• Each node will represent a single class.
• The value of each output node represents the
probability of that class being the correct class.

https://www.v7labs.com/blog/cross-entropy-loss-guide
Deep Learning for Computer Vision 52
Classification Loss

The purpose of Softmax is to normalize the outputs into a probability distribution so that cross entropy can then be computed against the one-hot ground truth.

https://www.v7labs.com/blog/cross-entropy-loss-guide
Deep Learning for Computer Vision 53
Softmax Function
• The softmax function
➢ Also known as soft-argmax or the normalized exponential function
➢ It is a function that takes as input a vector of 𝐾 real numbers and
normalizes it into a probability distribution consisting of 𝐾
probabilities.
• Softmax is often used in neural networks to map the non-normalized outputs of a
network to a probability distribution over the predicted output classes.
• The standard (unit) softmax function σ: ℝ^K → ℝ^K is defined as follows:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j},   for i = 1, …, K and z = (z_1, …, z_K) ∈ ℝ^K
Deep Learning for Computer Vision 54
Softmax Function
• We apply the standard exponential function to each element 𝑧𝑖 of the
input vector 𝑧 and normalize these values by dividing by the sum of all
these exponentials; this normalization ensures that the sum of the
components of the output vector 𝜎(𝑧) is 1.
• Instead of e, a different base b > 0 can be used; choosing a larger value of b will
create a probability distribution that is more concentrated around the positions of
the largest input values. Writing b = e^β or b = e^{−β} (for real β) yields the expressions:

σ(z)_i = e^{βz_i} / Σ_{j=1}^{K} e^{βz_j}   or   σ(z)_i = e^{−βz_i} / Σ_{j=1}^{K} e^{−βz_j},   for i = 1, …, K
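A NumPy sketch of the softmax, including the base/temperature behaviour described above (a larger β concentrates the distribution on the largest inputs):

import numpy as np

def softmax(z, beta=1.0):
    e = np.exp(beta * (z - np.max(z)))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))              # ~[0.659, 0.242, 0.099]
print(softmax(z, beta=5.0))    # more concentrated around the largest value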

Deep Learning for Computer Vision 55


Cross Entropy
• Calculate the cross entropy loss (CE loss).
• Ground truth (one-hot): [1, 0, 0, 0]; predicted probabilities: [0.775, 0.116, 0.039, 0.07].

CE loss = (1/N) Σ_i −[y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
        = −(1/4) { [1 × log 0.775 + (1 − 1) log(1 − 0.775)]
                 + [0 × log 0.116 + (1 − 0) log(1 − 0.116)]
                 + [0 × log 0.039 + (1 − 0) log(1 − 0.039)]
                 + [0 × log 0.07 + (1 − 0) log(1 − 0.07)] }
        ≈ 0.05   (using base-10 logarithms)
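The arithmetic above can be checked with a few lines of NumPy (the slide's value 0.05 corresponds to base-10 logarithms):

import numpy as np

y = np.array([1, 0, 0, 0])                        # one-hot ground truth
y_hat = np.array([0.775, 0.116, 0.039, 0.07])     # predicted probabilities
ce = -np.mean(y * np.log10(y_hat) + (1 - y) * np.log10(1 - y_hat))
print(ce)                                         # ~0.053, matching the slide's 0.05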
Deep Learning for Computer Vision 56
Cross Entropy
The derivative of the loss function is needed for backpropagation.
• If the loss function were MSE (Mean Square Error), its derivative would be easy:
simply the difference between the expected and predicted output.
• Things become more complex when the error function is cross entropy.

Binary cross-entropy loss:  E(y, ŷ) = (1/N) Σ_i −[y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
𝑦𝑖 : Target value of sample 𝑖 (1 for positive class, 0 for negative class)
𝑦ො𝑖 : The predicted probability
𝑁: Number of data points
Reference: https://reurl.cc/5pda66
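As a side note (a standard result, not stated on the slide): for a single sample with sigmoid output ŷ = σ(z), combining the binary cross-entropy with the sigmoid derivative collapses the chain rule to ∂E/∂z = ŷ − y, i.e., the gradient with respect to the pre-activation is just the prediction error, which is one reason cross entropy pairs so well with sigmoid/softmax outputs during backpropagation.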
Deep Learning for Computer Vision 57
Mutual Information

https://reurl.cc/l7d4Lq
Deep Learning for Computer Vision 58
Mutual Information
• Describes the relationship between two random variables (X, Y) numerically.
• p(x, y) is the joint probability that the two events x and y occur together.
• Similar to the entropy function.
• I(X, Y) represents numerically how strongly X and Y are related.

Mutual Information:  I(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Entropy:  H(X) = Σ_x p(x) log [ 1 / p(x) ]

https://reurl.cc/l7d4Lq
Deep Learning for Computer Vision 59
Mutual Information Calculation
• We select two attributes, Furry and Herbivorous, as the random variables.

Information table:
Animal      Furry   Herbivorous   Warm-blooded
Snake       False   False         False
Sheep       True    True          True
Lion        True    False         True
Alligator   False   False         False
Ladybug     False   True          False

Deep Learning for Computer Vision 60


Mutual Information Calculation
• Calculate the probability table from the information table.

Information table:
Animal      Furry   Herbivorous
Snake       False   False
Sheep       True    True
Lion        True    False
Alligator   False   False
Ladybug     False   True

Probability table:
                     Furry(True)   Furry(False)   P(Herbivorous)
Herbivorous(True)    1/5           1/5            2/5
Herbivorous(False)   1/5           2/5            3/5
P(Furry)             2/5           3/5

Deep Learning for Computer Vision 61


Mutual Information Calculation
• Calculate all the probabilities of the variables in I(X, Y):
  1. p(Furry),  2. p(Herbivorous),  3. p(Furry, Herbivorous)

I(F, H) = Σ_{x∈F} Σ_{y∈H} p(F, H) log [ p(F, H) / (p(F) p(H)) ],   where F = Furry, H = Herbivorous
        = p(F=True, H=True) log [ p(F=True, H=True) / (p(F=True) p(H=True)) ] + ⋯

Probability table:
            F(True)   F(False)   P(H)
H(True)     1/5       1/5        2/5
H(False)    1/5       2/5        3/5
P(F)        2/5       3/5
Deep Learning for Computer Vision 62
Mutual Information Calculation
• Calculate the contribution (C) of each combination of values in I(X, Y):

I(F, H) = Σ_{x∈F} Σ_{y∈H} p(F, H) log [ p(F, H) / (p(F) p(H)) ]
        = C(F=True, H=True) + C(F=True, H=False) + C(F=False, H=True) + C(F=False, H=False)
        = (1/5) log [ (1/5) / (2/5 × 2/5) ] + ⋯

Probability table:
            F(True)   F(False)   P(H)
H(True)     1/5       1/5        2/5
H(False)    1/5       2/5        3/5
P(F)        2/5       3/5
Deep Learning for Computer Vision 63
Mutual Information Calculation
• Complete the calculation of I(F, H):

I(F, H) = Σ_{x∈F} Σ_{y∈H} p(F, H) log [ p(F, H) / (p(F) p(H)) ]
        = (1/5) log [ (1/5) / (2/5 × 2/5) ] + (1/5) log [ (1/5) / (3/5 × 2/5) ]
          + (1/5) log [ (1/5) / (2/5 × 3/5) ] + (2/5) log [ (2/5) / (3/5 × 3/5) ]
        = 0.019 + (−0.016) + (−0.016) + 0.018
        ≈ 0.006

• I(F, H) ≈ 0.006 means that Furry and Herbivorous are only weakly related.

Probability table:
            F(True)   F(False)   P(H)
H(True)     1/5       1/5        2/5
H(False)    1/5       2/5        3/5
P(F)        2/5       3/5
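The same number can be reproduced with a short NumPy check over the probability table (base-10 logarithm, as in the worked example):

import numpy as np

# Joint probability table: rows H = True/False, columns F = True/False
joint = np.array([[1/5, 1/5],
                  [1/5, 2/5]])
p_f = joint.sum(axis=0)        # marginal P(Furry)
p_h = joint.sum(axis=1)        # marginal P(Herbivorous)
mi = sum(joint[i, j] * np.log10(joint[i, j] / (p_h[i] * p_f[j]))
         for i in range(2) for j in range(2))
print(mi)                      # ~0.006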
Deep Learning for Computer Vision 64
Mutual Information Loss
• Derived from the entropy function H(X):

I(y, G(ŷ)) = H(y) − H(y | G(ŷ))

• y is the real data, and G(ŷ) is the data generated by a model G.
• Implement the MI loss to measure the similarity between real and fake data:

MI_Loss(y, G(ŷ)) = −(H(y) − H(y | G(ŷ)))

Deep Learning for Computer Vision 65


Example 2.5 Train the LeNet Network
Please download 2-5_CNN_MNIST.zip and unzip it.
• Given a LeNet network shown in the figure below.
• Use the “Softmax with Cross-Entropy” loss as the objective function.
• Train a handwritten digits (MNIST) classifier using the LeNet network, and
compute the confusion matrix for each class.

LeNet (1998)
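The slide shows the LeNet definition as a screenshot; a commonly used PyTorch rendition of LeNet-5 for 28x28 MNIST inputs looks roughly like this (padding and layer sizes are assumptions, not read from the slide):

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, 10),                     # 10 digit classes
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

criterion = nn.CrossEntropyLoss()                  # softmax + cross-entropy in one module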
Deep Learning for Computer Vision 66
Example 2.5 Train the LeNet Network

Define the LeNet network

Deep Learning for Computer Vision 67


Example 2.5 Train the LeNet Network
In the MNIST dataset, we have 10,000 test images covering the 10 digit classes (labeled GT_1 to GT_10
below), where GT denotes the ground truth of each image and Pred denotes the predicted digit.
GT_1 GT_2 GT_3 GT_4 GT_5 GT_6 GT_7 GT_8 GT_9 GT_10
Pred_1 938 0 15 5 1 19 18 5 7 11
Pred_2 0 1099 27 4 5 6 4 27 16 7
Pred_3 5 3 855 24 5 9 16 23 21 7
Pred_4 2 5 25 874 1 76 1 2 38 9
Pred_5 0 0 21 2 882 12 22 8 15 99
Pred_6 25 1 8 49 2 686 26 2 47 16
Pred_7 8 3 28 1 14 26 869 0 16 1
Pred_8 1 2 14 20 1 13 0 906 10 37
Pred_9 1 22 34 27 6 36 2 8 779 9
Pred_10 0 0 5 4 65 9 0 47 25 813

Each entry denotes the number of test images (10,000 in total). GT: ground truth
Pred: prediction
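From a confusion matrix like this, the overall accuracy and per-class recall can be read off with a few NumPy lines (a sketch; cm holds the table above with rows = Pred and columns = GT):

import numpy as np

cm = np.array([
    [938,    0,   15,    5,    1,   19,   18,    5,    7,   11],
    [  0, 1099,   27,    4,    5,    6,    4,   27,   16,    7],
    [  5,    3,  855,   24,    5,    9,   16,   23,   21,    7],
    [  2,    5,   25,  874,    1,   76,    1,    2,   38,    9],
    [  0,    0,   21,    2,  882,   12,   22,    8,   15,   99],
    [ 25,    1,    8,   49,    2,  686,   26,    2,   47,   16],
    [  8,    3,   28,    1,   14,   26,  869,    0,   16,    1],
    [  1,    2,   14,   20,    1,   13,    0,  906,   10,   37],
    [  1,   22,   34,   27,    6,   36,    2,    8,  779,    9],
    [  0,    0,    5,    4,   65,    9,    0,   47,   25,  813],
])
accuracy = np.trace(cm) / cm.sum()               # fraction of correctly classified images, ~0.87
per_class_recall = np.diag(cm) / cm.sum(axis=0)  # correct predictions per ground-truth class
print(accuracy, per_class_recall)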
Deep Learning for Computer Vision 68
