
Deep Learning for

Computer Vision

CHAPTER 2

INTRODUCTION TO
NEURAL NETWORK

Prof. G.S. Jison Hsu 徐繼聖


• Artificial Vision Laboratory
• National Taiwan University of
Science and Technology

Deep Learning for Computer Vision


Outline
1) What is a Neural Network?
2) Single-Layer Perceptron
3) Feedforward / Multi-Layer Perceptron (MLP)
4) What is an Activation Function?
5) Gradient Descent
6) Backpropagation
7) Fully / Locally Connected Layer
8) Convolutional Layer
9) Classification Loss – Softmax and Cross-Entropy

Deep Learning for Computer Vision 2


What is a Neural Network?

https://reurl.cc/eOyQNM [19:13]
Deep Learning for Computer Vision 3
https://reurl.cc/QZ3VD5
Deep Learning for Computer Vision 4
Every pixel (neuron) holds a number; the number stored in a neuron is called its activation.

Deep Learning for Computer Vision 5


• Every neuron holds a number between 0 and 1.
• The input image is flattened into 784 dimensions (28×28 pixels) and fed into the
neural network.
• The neurons in the last layer represent the predicted classification result.

Deep Learning for Computer Vision 6


• Hidden layers sit between the input and output layers and process the data by
applying complex non-linear functions to it.
• This example has 2 hidden layers, and each hidden layer has 16 neurons.
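As a concrete illustration (a sketch, not part of the original slides), a network of this shape can be written in PyTorch as follows; the sizes 784, 16, 16, and 10 come from the example above.

import torch.nn as nn

# 784-dim flattened input -> two hidden layers of 16 neurons -> 10 output classes
model = nn.Sequential(
    nn.Linear(784, 16), nn.Sigmoid(),
    nn.Linear(16, 16), nn.Sigmoid(),
    nn.Linear(16, 10),
)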

Deep Learning for Computer Vision 7


• Each layer learns patterns from the input image.
• Different layers learn different patterns.
• Each neuron's value is a weighted sum of the activations in the previous layer.

Deep Learning for Computer Vision 8


• Activation functions introduce non-linearity into the otherwise linear outputs of the
network.
• An activation function defines the output of a layer for a given input, effectively
setting the threshold that decides whether information is passed on.

https://reurl.cc/Do9D2d [17:03-18:30]
Deep Learning for Computer Vision 9
Sigmoid Activation Function

• In order to map predicted values to probabilities, we use the Sigmoid function.
• The function maps any real value into a value between 0 and 1.
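A minimal sketch of the sigmoid, σ(x) = 1 / (1 + e^(−x)), assuming NumPy is available:

import numpy as np

def sigmoid(x):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))   # 0.5
print(sigmoid(5.0))   # ~0.993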

https://reurl.cc/Do9D2d
Deep Learning for Computer Vision 10
Sigmoid Activation Function

Deep Learning for Computer Vision 11


Bias
• Bias is employed to adjust the outcome, enabling the model to shift the activation
function towards either the positive or the negative side.
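For illustration only (not from the slide): a neuron computes a = σ(w·x + b), so changing the bias b shifts the point where the activation "turns on" for the same weights.

import numpy as np

def neuron(x, w, b):
    # Weighted sum plus bias, passed through the sigmoid activation
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, 2.0]); w = np.array([0.5, -0.3])
print(neuron(x, w, b=0.0))   # ~0.475
print(neuron(x, w, b=2.0))   # ~0.87, the bias shifts the activation toward the positive side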

Deep Learning for Computer Vision 12


• Learning involves finding the optimal weights and biases in the
neural network.

Deep Learning for Computer Vision 13


Gradient Descent

https://www.youtube.com/watch?v=IHZwWFHWa-w [21:00]
Deep Learning for Computer Vision 14
Training progress
The testing data is used to calculate the prediction accuracy.

Deep Learning for Computer Vision 15


Training progress
We aim to achieve maximum accuracy while minimizing loss.

Deep Learning for Computer Vision 16


Calculate the cost (loss)

Deep Learning for Computer Vision 17


Optimize the neural network
• Use the cost to optimize the neural network.

Deep Learning for Computer Vision 18


Gradient Descent
• Treat all the weights and biases as a single input vector, and use the calculated
gradient to minimize the cost (loss).

Deep Learning for Computer Vision 19


Gradient Descent

Deep Learning for Computer Vision 20


Gradient Descent
• Repeat the gradient descent for each training step, minimize the cost,
and get the best network.
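A toy sketch of one-dimensional gradient descent (illustrative only, not from the slides): repeatedly step opposite to the gradient of the cost until the cost stops decreasing.

# Minimize cost(w) = (w - 3)^2 by gradient descent
w = 0.0
lr = 0.1                     # learning rate
for step in range(100):
    grad = 2 * (w - 3)       # derivative of the cost w.r.t. w
    w = w - lr * grad        # move against the gradient
print(w)                     # converges toward 3, where the cost is minimal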

Deep Learning for Computer Vision 21


What is Backpropagation?

Deep Learning for Computer Vision https://reurl.cc/l7dR09 22


What is Backpropagation?

https://www.youtube.com/watch?v=Ilg3gGewQ5U [13:53]
Deep Learning for Computer Vision 23
Backpropagation

https://reurl.cc/QZ3VD5
Deep Learning for Computer Vision 24
Backpropagation

Deep Learning for Computer Vision 25


Example 2.1
Please download 2-1_Activation_Function.zip and unzip it.
• Given an architecture with input size=2, hidden layer=2, and output size=1.
• Each layer is followed by the Leaky ReLU activation function.
• For more activation functions, please refer to the following link:
https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity
• This example shows how to use the activation function (see the sketch below).
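Since the zip file is not reproduced here, the following is only a guess at what a 2-2-1 network with Leaky ReLU might look like in PyTorch:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 2)      # input size = 2, hidden layer = 2
        self.fc2 = nn.Linear(2, 1)      # output size = 1
        self.act = nn.LeakyReLU(negative_slope=0.01)

    def forward(self, x):
        x = self.act(self.fc1(x))
        return self.act(self.fc2(x))

print(TinyNet()(torch.randn(4, 2)).shape)   # torch.Size([4, 1])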

Deep Learning for Computer Vision 26


Example 2.1

Change this to use the Leaky ReLU activation function.

Deep Learning for Computer Vision 27


Example 2.2 Train an MLP Model
Please download 2-2_MLP_MNIST.zip and unzip it.
• Given an architecture with input size=28x28, hidden layer=3, and output size=10.
• The 1st hidden layer includes 500 neurons, the 2nd layer includes 250 neurons and the 3rd
layer includes 125 neurons.
• Each layer is followed by the ReLU activation function.
• Use the MNIST dataset as the input data: 90% for training and 10% for testing.
• Please follow the instructions to train an MLP-based handwritten digits classifier.

Deep Learning for Computer Vision 28


Example 2.2 Train an MLP Model

Normalize the image

Download the MNIST dataset for training and testing
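The screenshot is not reproduced here; a typical torchvision version of these two steps might look like the following (the normalization constants and the use of the standard train/test split, rather than the 90%/10% split mentioned earlier, are assumptions):

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),                       # convert to a [0, 1] tensor
    transforms.Normalize((0.1307,), (0.3081,)),  # normalize the image (assumed constants)
])
train_set = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_set = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)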

Deep Learning for Computer Vision 29


Example 2.2 Train an MLP Model

Define the MLP network

Use the ReLU activation function
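A sketch of what the pictured MLP definition plausibly contains (500/250/125 hidden neurons with ReLU, per the slide; the exact layout is an assumption):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28 * 28, 500), nn.ReLU(),
            nn.Linear(500, 250), nn.ReLU(),
            nn.Linear(250, 125), nn.ReLU(),
            nn.Linear(125, 10),              # 10 output classes for the MNIST digits
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)            # flatten the 28x28 image to 784
        return self.net(x)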

Deep Learning for Computer Vision 30


Fully Connected Layer

Deep Learning for Computer Vision


https://reurl.cc/QZ3VKq 31
Locally Connected Layer

https://reurl.cc/QZ3VKq
Deep Learning for Computer Vision 32
Convolutional Layer

https://reurl.cc/QZ3VKq
Deep Learning for Computer Vision 33
Convolutional Neural Networks

Deep Learning for Computer Vision 34


Convolutional Neural Networks (CNNs) Explained

https://reurl.cc/KQlDlj [8:36]
Deep Learning for Computer Vision 35
Convolution Animations
N.B.: Blue maps are inputs, and cyan maps are outputs.

No padding, no strides · Arbitrary padding, no strides · Half padding, no strides · Full padding, no strides

No padding, strides · Padding, strides · Padding, strides (odd)


Deep Learning for Computer Vision https://reurl.cc/qNvXYD 36
Strided Convolutions – Andrew Ng

https://reurl.cc/ZbVMol [10:44]
Deep Learning for Computer Vision 37
Pooling Layer
• Pooling is the operation of down-sampling the input. There are two reasons to use
pooling:
➢To reduce the input size.
➢To achieve local translation and rotation invariance.

0.5x down-sampling
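A quick illustration (not from the slides) of 2×2 max pooling halving the spatial size in PyTorch:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 0.5x down-sampling
x = torch.randn(1, 1, 28, 28)
print(pool(x).shape)                           # torch.Size([1, 1, 14, 14])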

Deep Learning for Computer Vision 38


Max Pooling in Convolutional Neural Networks

https://reurl.cc/4pd873 [10:49]
Deep Learning for Computer Vision 39
Pooling : Average Pooling

Computing the output values of a 3×3 average pooling operation on a 5×5 input using 1×1 strides.
Deep Learning for Computer Vision 40
Pooling : Max Pooling

Computing the output values of a 3×3 max pooling operation on a 5×5 input using 1×1 strides.
Deep Learning for Computer Vision 41
Example 2.3 Convolution Filter
1. Please download 2-3_Convolution_Example.zip and unzip it.
2. Given a face image Iimg, please generate Fconv, a 5x5 convolution filter with random
coefficients in the range [0, 1].
3. Apply “zero padding” on the given image, and convolve the image with unit stride.
4. Compute the output by convolving Iimg with Fconv and calculate the MSE loss (i.e.,
information loss) between the original image and the one after convolving.
KERNAL_SIZE = 5
STRIDE = 1
PADDING = (KERNAL_SIZE - STRIDE)/2
PADDING = int(PADDING)
img = cv2.imread('./006_01_01_051_08.png')
img = cv2.resize(img,(28,32))
img = cv2.cvtColor(img,cv2.COLOR_RGB2GRAY)
Conv_Filter = np.random.rand(KERNAL_SIZE,KERNAL_SIZE)
Conv_Filter = Conv_Filter/np.sum(Conv_Filter)
img_F = img

[Figure: Input Image (Iimg) and 5x5 Conv kernel (Fconv)]

Deep Learning for Computer Vision 42


Example 2.3 Convolution Filter

KERNAL_SIZE = 5
STRIDE = 1
PADDING = (KERNAL_SIZE - STRIDE)/2
PADDING = int(PADDING)
img = cv2.imread('./006_01_01_051_08.png')
img = cv2.resize(img,(28,32))
img = cv2.cvtColor(img,cv2.COLOR_RGB2GRAY)
Conv_Filter = np.random.rand(KERNAL_SIZE,KERNAL_SIZE)
#Conv_Filter = np.random.normal(mean,std,(KERNAL_SIZE,KERNAL_SIZE)) #Normal distribution
Conv_Filter = Conv_Filter/np.sum(Conv_Filter)
img_F = img

cv2.cvtColor(): converts an image from one color space to another.
np.random.rand(): returns samples drawn uniformly from [0, 1); the commented-out
np.random.normal() alternative draws samples from a normal distribution instead.

Deep Learning for Computer Vision 43


Example 2.3 Convolution Filter
for h in range(int((H-KERNAL_SIZE)/STRIDE)+1):
    for w in range(int((W-KERNAL_SIZE)/STRIDE)+1):
        aa = img_F[h*STRIDE:h*STRIDE + KERNAL_SIZE, w*STRIDE:w*STRIDE + KERNAL_SIZE]*Conv_Filter
        new_feature[h,w] = np.sum(aa)

        img_S = img_F.astype(np.uint8)
        img_new = new_feature.astype(np.uint8)
        cv2.rectangle(img_S, (int(w*STRIDE), int(h*STRIDE)), (int(w*STRIDE + KERNAL_SIZE), int(h*STRIDE + KERNAL_SIZE)), (255, 0, 0), 1)
        cv2.namedWindow('Conv_process', cv2.WINDOW_NORMAL)
        cv2.resizeWindow("Conv_process", 300, 300)
        cv2.imshow('Conv_process',img_S)
        cv2.namedWindow('Conv_result', cv2.WINDOW_NORMAL)
        cv2.resizeWindow("Conv_result", 300, 300)
        cv2.imshow('Conv_result',img_new)
        cv2.waitKey(100)

Element-wise multiplication of the image patch with the Conv filter, followed by a
summation, generates each output feature value.
np.astype(): copy of the array, cast to a specified type.
cv2.rectangle(): used to draw a rectangle on an image.
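Step 4 asks for the MSE (information) loss between the original and convolved images; it is not shown in the snippet above, so this is one possible way to compute it, assuming numpy is imported as np and that the zero padding kept new_feature the same size as img_F:

# MSE between the original grayscale image and the convolved output
mse_loss = np.mean((img_F.astype(np.float32) - new_feature.astype(np.float32)) ** 2)
print('MSE loss:', mse_loss)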
Deep Learning for Computer Vision 44
Example 2.4 Max Pooling
1. Please download 2-4_Pooling_Example.zip and unzip it.
2. Given a face image Iimg, please generate Fconv, a 5x5 convolution filter with random
coefficients drawn from a normal distribution (std = 0, mean = 2).
3. Use "same" zero padding with unit stride for the convolution kernel Fconv.
4. Use "max pooling" with a 3x3 kernel size and unit stride (see the sketch below).
5. Please compute the output by convolving Iimg with Fconv and calculate the MSE loss (i.e.,
information loss) between the original image and the one after convolving.
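The pooling step might be implemented with plain NumPy loops, in the same style as the convolution in Example 2.3; this is a sketch under that assumption, not the zip file's code:

import numpy as np

conv_out = np.random.rand(32, 28)              # stand-in for the convolved image (Iimg * Fconv)
POOL_SIZE, POOL_STRIDE = 3, 1
H, W = conv_out.shape
out_h = (H - POOL_SIZE) // POOL_STRIDE + 1
out_w = (W - POOL_SIZE) // POOL_STRIDE + 1
pooled = np.zeros((out_h, out_w))
for h in range(out_h):
    for w in range(out_w):
        window = conv_out[h*POOL_STRIDE:h*POOL_STRIDE + POOL_SIZE,
                          w*POOL_STRIDE:w*POOL_STRIDE + POOL_SIZE]
        pooled[h, w] = np.max(window)          # max pooling keeps the largest value in each window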

Convolution, Max Pooling
Use max pooling
Input Image → Output Image

Deep Learning for Computer Vision 45


Which Activation Function Should I Use?

https://www.youtube.com/watch?v=-7scQpJT7uo [8:58]
Deep Learning for Computer Vision 46
ReLU Activation Function

• The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero,
and f(z) is equal to z when z is greater than or equal to zero.

• Issue :
➢ All the negative values become zero, which decreases the ability of the
model to fit or train from the data properly.

https://www.youtube.com/watch?v=-7scQpJT7uo
Deep Learning for Computer Vision 47
Leaky ReLU Activation Function
It is an attempt to solve the dying ReLU problem.

The leak helps to increase the range of the ReLU function.


When a is not 0.01 then it is called Randomized ReLU.
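Both functions can be written in a couple of lines of NumPy (illustrative sketch):

import numpy as np

def relu(z):
    return np.maximum(0, z)            # zero for z < 0, z otherwise

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)   # the small slope a keeps negative inputs from "dying"

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(leaky_relu(np.array([-2.0, 3.0])))   # [-0.02  3.]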
Deep Learning for Computer Vision https://www.youtube.com/watch?v=-7scQpJT7uo 48
Loss Functions Explained

https://www.youtube.com/watch?v=IVVVjBSk9N0 [12:55]
Deep Learning for Computer Vision 49
Loss Functions
• The loss function is used to measure the performance of a prediction model in
terms of the difference between the prediction and the ground-truth output.
• From a very simplified perspective, the loss function J can be defined as a function
of the prediction and the ground-truth output.

https://reurl.cc/V4XQpY
Deep Learning for Computer Vision 50
Common Loss Functions

https://www.renom.jp/notebooks/tutorial/basic_algorithm/lossfunction/notebook.html
Deep Learning for Computer Vision 51
Classification Loss
• When a neural network is used to predict a
class label, we can consider it a classification
model.
• The number of the output nodes is the number
of classes in the data.
• Each node will represent a single class.
• The value of each output node represents the
probability of that class being the correct class.

https://www.v7labs.com/blog/cross-entropy-loss-guide
Deep Learning for Computer Vision 52
Classification Loss

The purpose of Softmax is to normalize the outputs into a probability distribution so that cross entropy can then be computed against the one-hot ground truth.

https://www.v7labs.com/blog/cross-entropy-loss-guide
Deep Learning for Computer Vision 53
Softmax Function
• The softmax function
➢ Also known as soft-argmax or the normalized exponential function
➢ It is a function that takes as input a vector of 𝐾 real numbers and
normalizes it into a probability distribution consisting of 𝐾
probabilities.
• Softmax is often used in neural networks to map the non-normalized outputs of a
network to a probability distribution over the predicted output classes.
• The standard (unit) softmax function σ: ℝ^K → ℝ^K is defined as follows:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j},   for i = 1, …, K and z = (z_1, …, z_K) ∈ ℝ^K
Deep Learning for Computer Vision 54
Softmax Function
• We apply the standard exponential function to each element 𝑧𝑖 of the
input vector 𝑧 and normalize these values by dividing by the sum of all
these exponentials; this normalization ensures that the sum of the
components of the output vector 𝜎(𝑧) is 1.
• Instead of e, a different base b > 0 can be used; choosing a larger value of b will
create a probability distribution that is more concentrated around the positions of
the largest input values. Writing b = e^β or b = e^{−β} (for real β) yields the expressions:

σ(z)_i = e^{βz_i} / Σ_{j=1}^{K} e^{βz_j}   or   σ(z)_i = e^{−βz_i} / Σ_{j=1}^{K} e^{−βz_j},   for i = 1, …, K
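A NumPy sketch of the softmax, including the base/temperature behaviour described above (a larger β concentrates the distribution on the largest inputs):

import numpy as np

def softmax(z, beta=1.0):
    e = np.exp(beta * (z - np.max(z)))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))              # ~[0.659, 0.242, 0.099]
print(softmax(z, beta=5.0))    # more concentrated around the largest value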

Deep Learning for Computer Vision 55


Cross Entropy
• Calculate the cross entropy loss (CE loss).
• Ground truth (one-hot): [1, 0, 0, 0]; predicted probabilities: [0.775, 0.116, 0.039, 0.07].

CE loss = (1/N) Σ_i −[y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
        = −(1/4) { [1 × log 0.775 + (1 − 1) log(1 − 0.775)]
                 + [0 × log 0.116 + (1 − 0) log(1 − 0.116)]
                 + [0 × log 0.039 + (1 − 0) log(1 − 0.039)]
                 + [0 × log 0.07 + (1 − 0) log(1 − 0.07)] }
        ≈ 0.05   (using base-10 logarithms)
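The arithmetic above can be checked with a few lines of NumPy (the slide's value 0.05 corresponds to base-10 logarithms):

import numpy as np

y = np.array([1, 0, 0, 0])                        # one-hot ground truth
y_hat = np.array([0.775, 0.116, 0.039, 0.07])     # predicted probabilities
ce = -np.mean(y * np.log10(y_hat) + (1 - y) * np.log10(1 - y_hat))
print(ce)                                         # ~0.053, matching the slide's 0.05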
Deep Learning for Computer Vision 56
Cross Entropy
The derivative of the loss function is needed for backpropagation.
• If the loss function were MSE (Mean Square Error), its derivative would be easy:
simply the difference between the expected and predicted output.
• Things become more complex when the error function is cross entropy.

Binary cross-entropy loss:  E(y, ŷ) = (1/N) Σ_i −[y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
𝑦𝑖 : Target value of sample 𝑖 (1 for positive class, 0 for negative class)
𝑦ො𝑖 : The predicted probability
𝑁: Number of data points
Reference: https://reurl.cc/5pda66
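As a side note (a standard result, not stated on the slide): for a single sample with sigmoid output ŷ = σ(z), combining the binary cross-entropy with the sigmoid derivative collapses the chain rule to ∂E/∂z = ŷ − y, i.e., the gradient with respect to the pre-activation is just the prediction error, which is one reason cross entropy pairs so well with sigmoid/softmax outputs during backpropagation.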
Deep Learning for Computer Vision 57
Mutual Information

https://reurl.cc/l7d4Lq
Deep Learning for Computer Vision 58
Mutual Information
• Describes the relationship between two random variables (X, Y) numerically.
• p(x, y) is the joint probability that the two events x and y occur together.
• Similar to the entropy function.
• I(X, Y) represents numerically how strongly X and Y are related.

Mutual Information:  I(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Entropy:  H(X) = Σ_x p(x) log [ 1 / p(x) ]

https://reurl.cc/l7d4Lq
Deep Learning for Computer Vision 59
Mutual Information Calculation
• We select two attributes, Furry and Herbivorous, as the random variables.

Information table:
Animal      Furry   Herbivorous   Warm-blooded
Snake       False   False         False
Sheep       True    True          True
Lion        True    False         True
Alligator   False   False         False
Ladybug     False   True          False

Deep Learning for Computer Vision 60


Mutual Information Calculation
• Calculate the probability table from the information table.

Information table:
Animal      Furry   Herbivorous
Snake       False   False
Sheep       True    True
Lion        True    False
Alligator   False   False
Ladybug     False   True

Probability table:
                     Furry(True)   Furry(False)   P(Herbivorous)
Herbivorous(True)    1/5           1/5            2/5
Herbivorous(False)   1/5           2/5            3/5
P(Furry)             2/5           3/5

Deep Learning for Computer Vision 61


Mutual Information Calculation
• Calculate all the probabilities of the variables in I(X, Y):
  1. p(Furry),  2. p(Herbivorous),  3. p(Furry, Herbivorous)

I(F, H) = Σ_{x∈F} Σ_{y∈H} p(F, H) log [ p(F, H) / (p(F) p(H)) ],   where F = Furry, H = Herbivorous
        = p(F=True, H=True) log [ p(F=True, H=True) / (p(F=True) p(H=True)) ] + ⋯

Probability table:
            F(True)   F(False)   P(H)
H(True)     1/5       1/5        2/5
H(False)    1/5       2/5        3/5
P(F)        2/5       3/5
Deep Learning for Computer Vision 62
Mutual Information Calculation
• Calculate the contribution (C) of each combination of values in I(X, Y):

I(F, H) = Σ_{x∈F} Σ_{y∈H} p(F, H) log [ p(F, H) / (p(F) p(H)) ]
        = C(F=True, H=True) + C(F=True, H=False) + C(F=False, H=True) + C(F=False, H=False)
        = (1/5) log [ (1/5) / (2/5 × 2/5) ] + ⋯

Probability table:
            F(True)   F(False)   P(H)
H(True)     1/5       1/5        2/5
H(False)    1/5       2/5        3/5
P(F)        2/5       3/5
Deep Learning for Computer Vision 63
Mutual Information Calculation
• Complete the calculation of I(F, H):

I(F, H) = Σ_{x∈F} Σ_{y∈H} p(F, H) log [ p(F, H) / (p(F) p(H)) ]
        = (1/5) log [ (1/5) / (2/5 × 2/5) ] + (1/5) log [ (1/5) / (3/5 × 2/5) ]
          + (1/5) log [ (1/5) / (2/5 × 3/5) ] + (2/5) log [ (2/5) / (3/5 × 3/5) ]
        = 0.019 + (−0.016) + (−0.016) + 0.018
        ≈ 0.006

• I(F, H) ≈ 0.006 means that Furry and Herbivorous are only weakly related.

Probability table:
            F(True)   F(False)   P(H)
H(True)     1/5       1/5        2/5
H(False)    1/5       2/5        3/5
P(F)        2/5       3/5
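The same number can be reproduced with a short NumPy check over the probability table (base-10 logarithm, as in the worked example):

import numpy as np

# Joint probability table: rows H = True/False, columns F = True/False
joint = np.array([[1/5, 1/5],
                  [1/5, 2/5]])
p_f = joint.sum(axis=0)        # marginal P(Furry)
p_h = joint.sum(axis=1)        # marginal P(Herbivorous)
mi = sum(joint[i, j] * np.log10(joint[i, j] / (p_h[i] * p_f[j]))
         for i in range(2) for j in range(2))
print(mi)                      # ~0.006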
Deep Learning for Computer Vision 64
Mutual Information Loss
• Derived from the entropy function H(X):

I(y, G(ŷ)) = H(y) − H(y | G(ŷ))

• y is the real data, and G(ŷ) is the data generated by a model G.
• Implement the MI loss to measure the similarity between real and fake data:

MI_Loss(y, G(ŷ)) = −(H(y) − H(y | G(ŷ)))

Deep Learning for Computer Vision 65


Example 2.5 Train the LeNet Network
Please download 2-5_CNN_MNIST.zip and unzip it.
• Given a LeNet network shown in the figure below.
• Use the “Softmax with Cross-Entropy” loss as the objective function.
• Train a handwritten digits (MNIST) classifier using the LeNet network, and
compute the confusion matrix for each class.

LeNet (1998)
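The slide shows the LeNet definition as a screenshot; a commonly used PyTorch rendition of LeNet-5 for 28x28 MNIST inputs looks roughly like this (padding and layer sizes are assumptions, not read from the slide):

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, 10),                     # 10 digit classes
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

criterion = nn.CrossEntropyLoss()                  # softmax + cross-entropy in one module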
Deep Learning for Computer Vision 66
Example 2.5 Train the LeNet Network

Define the LeNet network

Deep Learning for Computer Vision 67


Example 2.5 Train the LeNet Network
In the MNIST dataset, we have 10,000 test images covering the 10 digit classes (labeled GT_1 to GT_10
below), where GT denotes the ground truth of each image and Pred denotes the predicted digit.
GT_1 GT_2 GT_3 GT_4 GT_5 GT_6 GT_7 GT_8 GT_9 GT_10
Pred_1 938 0 15 5 1 19 18 5 7 11
Pred_2 0 1099 27 4 5 6 4 27 16 7
Pred_3 5 3 855 24 5 9 16 23 21 7
Pred_4 2 5 25 874 1 76 1 2 38 9
Pred_5 0 0 21 2 882 12 22 8 15 99
Pred_6 25 1 8 49 2 686 26 2 47 16
Pred_7 8 3 28 1 14 26 869 0 16 1
Pred_8 1 2 14 20 1 13 0 906 10 37
Pred_9 1 22 34 27 6 36 2 8 779 9
Pred_10 0 0 5 4 65 9 0 47 25 813

Each entry denotes the number of test images (10,000 in total). GT: ground truth
Pred: prediction
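From a confusion matrix like this, the overall accuracy and per-class recall can be read off with a few NumPy lines (a sketch; cm holds the table above with rows = Pred and columns = GT):

import numpy as np

cm = np.array([
    [938,    0,   15,    5,    1,   19,   18,    5,    7,   11],
    [  0, 1099,   27,    4,    5,    6,    4,   27,   16,    7],
    [  5,    3,  855,   24,    5,    9,   16,   23,   21,    7],
    [  2,    5,   25,  874,    1,   76,    1,    2,   38,    9],
    [  0,    0,   21,    2,  882,   12,   22,    8,   15,   99],
    [ 25,    1,    8,   49,    2,  686,   26,    2,   47,   16],
    [  8,    3,   28,    1,   14,   26,  869,    0,   16,    1],
    [  1,    2,   14,   20,    1,   13,    0,  906,   10,   37],
    [  1,   22,   34,   27,    6,   36,    2,    8,  779,    9],
    [  0,    0,    5,    4,   65,    9,    0,   47,   25,  813],
])
accuracy = np.trace(cm) / cm.sum()               # fraction of correctly classified images, ~0.87
per_class_recall = np.diag(cm) / cm.sum(axis=0)  # correct predictions per ground-truth class
print(accuracy, per_class_recall)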
Deep Learning for Computer Vision 68
