
CSE489: Machine Vision

(Sheet 7)

Yehia Zakaria
yehia.Zakaria@eng.asu.edu.eg
Question 2
• Compare between Bayes’ and SVM classification techniques.

Bayes Classifier:
• Probabilistic classifier: it models the posterior probability from the class PDF, so the output is a probability of belonging to a class.
• The PDF is usually assumed, and its parameters are found during the training process.
• General Rule:
  if P(C1) p(x|C1) > P(C2) p(x|C2) then x ∈ C1
  otherwise x ∈ C2

SVM Classifier:
• Non-probabilistic classifier based on a discriminant function given by y = w · x + b.
• It tries to find a hyperplane that maximizes the margin between the classes; an optimization problem is solved for this purpose.
• General Rule:
  f(x, W) = sign(W · X + b)
  if f = +1 then x ∈ C1
  else if f = −1 then x ∈ C2
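As a quick illustration (not part of the original sheet), both decision rules can be written in a few lines of Python; the Gaussian class-conditional PDFs, priors, weight vector, and bias below are assumed example values:

import numpy as np

# --- Bayes decision rule (assumed 1-D Gaussian class PDFs with example parameters) ---
def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def bayes_classify(x, prior1=0.5, prior2=0.5):
    # if P(C1) * p(x|C1) > P(C2) * p(x|C2) then x in C1, otherwise C2
    p1 = prior1 * gaussian_pdf(x, mean=0.0, std=1.0)   # p(x|C1), assumed parameters
    p2 = prior2 * gaussian_pdf(x, mean=3.0, std=1.0)   # p(x|C2), assumed parameters
    return "C1" if p1 > p2 else "C2"

# --- Linear SVM decision rule: f(x, W) = sign(W . x + b) ---
def svm_classify(x, w, b):
    f = np.sign(np.dot(w, x) + b)
    return "C1" if f == +1 else "C2"

print(bayes_classify(1.0))                                             # closer to C1's assumed mean
print(svm_classify(np.array([2.0, 4.0]), np.array([1.0, -1.0]), 0.5))  # assumed weights and bias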
Question 3
• Give a mathematical formulation for the classification problem.

X → f(·) → Label

In a classification problem, we try to find a prediction function f that maps a feature vector X to a label corresponding to its correct class.
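One compact way to write this (a brief sketch, reusing the training-error idea stated later in Question 6):

f : R^d → {C1, C2, …, Cn},   ŷ = f(X)

where f is chosen to minimize the prediction error on the training set, e.g. f* = argmin_f (1/N) Σi L(f(Xi), yi).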

Question 4
• Write a computer algorithm for a simple linear classifier approach.

Feature Vector                Label
Xi = [x1, x2]^T               yi
X1 = [2, 4]^T                 y1 = −1
X2 = [7, 7]^T                 y2 = +1
⋮                             ⋮
Xn = [x1n, x2n]^T             yn = ⋯

The two classes are separated by the line:
w1 x1 + w2 x2 + w0 = 0

[Figure: the training samples plotted in the (x1, x2) plane with the separating line]
Question 4
• Write a computer algorithm for a simple linear classifier approach.
In three dimensions the decision boundary becomes a plane:
w1 x1 + w2 x2 + w3 x3 + w0 = 0

[Figure: the separating plane g in (x1, x2, x3) space, with offset b from the origin and distance r from a sample to the plane]
Question 4
• Write a computer algorithm for a simple linear classifier approach.
• Given labelled training samples {(X1, y1), … , (XN, yN)}, where Xi is the feature vector and yi is the label (a code sketch follows below):
1. Initialize the weight vector (W) randomly.
2. Calculate the classification error such that:
   ε(W) = Σi max(0, −yi W^T Xi)
3. Update the weights using any optimization technique.
4. Repeat steps 2 and 3 until ∂ε/∂W ≈ 0.
• After training, given an unknown sample Xu:
  o If W^T Xu > 0 then Xu ∈ C1
  o Otherwise Xu ∈ C2

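A minimal NumPy sketch of this algorithm (the two training samples are taken from the previous slide; the learning rate, iteration count, and test point are assumed; the bias w0 is absorbed by appending a constant 1 to each feature vector):

import numpy as np

# Training data: feature vectors with a constant 1 appended for the bias w0
X = np.array([[2.0, 4.0, 1.0],     # X1, label -1
              [7.0, 7.0, 1.0]])    # X2, label +1
y = np.array([-1.0, 1.0])

W = np.random.randn(3)             # 1. initialize weights randomly
lr = 0.1                           # assumed learning rate

for epoch in range(100):
    scores = X @ W
    # 2. classification error: eps(W) = sum_i max(0, -y_i * W^T X_i)
    margins = -y * scores
    error = np.sum(np.maximum(0.0, margins))
    if error == 0:                 # 4. stop when the error (and its gradient) vanishes
        break
    # 3. gradient step: d(eps)/dW = sum over misclassified i of (-y_i * X_i)
    misclassified = margins > 0
    grad = -(y[misclassified][:, None] * X[misclassified]).sum(axis=0)
    W -= lr * grad

# After training, classify an unknown sample Xu
Xu = np.array([6.0, 5.0, 1.0])     # assumed test point
print("C1" if W @ Xu > 0 else "C2")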
Question 5
• Describe algorithmically how a multi-class SVM works. What is the role of optimization in
the problem? You need to write an expression for the penalizing objective function.

f1 = w11 x1 + w21 x2 + w01
f2 = w12 x1 + w22 x2 + w02
f3 = w13 x1 + w23 x2 + w03
⋮
fn = w1n x1 + w2n x2 + w0n

f(x, W) = W x + b
Question 5
• Describe algorithmically how a multi-class SVM works. What is the role of optimization in
the problem? You need to write an expression for the penalizing objective function.
During the training:
• Initialize the weights for each class randomly.
• Calculate the scores of each class on the training data such that: Si = f(X, Wi) = Wi · X
• Define a loss function that represents the amount of error on the training data.
• A hinge loss function of the following form is used (the same rule applied in Question 7):
  Li = Σ(j ≠ yi) max(0, Sj − Syi + 1), where Syi is the score of the correct class.
Question 5
• Describe algorithmically how a multi-class SVM works. What is the role of optimization in
the problem? You need to write an expression for the penalizing objective function.
During the training:
• After calculating the loss of each training sample, the total loss is calculated using the following equation:
  L = (1/N) Σi Li + λ R(W)
• Where R(W) is a regularization term that keeps the model simpler, so that it generalizes to the testing data.
The role of optimization is to efficiently find the parameters (weights) that minimize the loss function (see the sketch below).
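A small NumPy sketch of this training-time loss computation (the weight matrix, data, and λ below are assumed example values; R(W) is taken to be the squared L2 norm, one common choice):

import numpy as np

def multiclass_svm_loss(W, X, y, lam=0.1):
    # Total loss = mean hinge loss over the samples + lam * R(W), with R(W) = sum(W**2)
    scores = X @ W.T                          # S_i = W_i . X for every class i
    N = X.shape[0]
    correct = scores[np.arange(N), y]         # score of the correct class, S_yi
    margins = np.maximum(0.0, scores - correct[:, None] + 1.0)
    margins[np.arange(N), y] = 0.0            # the correct class does not contribute
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W ** 2)           # R(W): L2 regularization (assumed choice)
    return data_loss + reg_loss

# Assumed toy example: 3 classes, 2-D features (bias omitted for brevity)
W = np.array([[1.0, 1.0], [-0.1, -0.2], [-10.0, -10.0]])
X = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
y = np.array([0, 0, 0])                       # all three samples belong to class C1
print(multiclass_svm_loss(W, X, y))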

Question 5
• Describe algorithmically how a multi-class SVM works. What is the role of optimization in
the problem? You need to write an expression for the penalizing objective function.
During Testing:
• Given an unknown sample X and a trained classifier for classes {C1, C2, …, Cn} with weights W1, W2, …, Wn, the sample is assigned to the class i with the maximum score Wi · X (see the sketch below).
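At test time the rule is just an argmax over the class scores; a tiny sketch (the weight matrix and test point are assumed example values):

import numpy as np

W = np.array([[1.0, 1.0], [-0.1, -0.2], [-10.0, -10.0]])  # assumed trained weights
X_u = np.array([-3.0, 5.0])                                # unknown sample
predicted_class = int(np.argmax(W @ X_u)) + 1              # classes numbered C1..Cn
print(f"C{predicted_class}")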

Question 6
A neural network is one of the machine learning techniques. Describe this and illustrate the
universality theorem.
A machine learning framework can be divided into two stages:
Training: Given a training set of labelled examples {(x1, y1), … , (xN, yN)}, estimate the prediction
function f by minimizing the prediction error on the training set.
Testing: Apply f to an unseen test example xu and output the predicted value yu = f(xu) to classify xu.

Universality theorem:
Any continuous function f such that
f : R^N → R^M
can be realized using a network with one hidden layer, given a sufficient number of hidden neurons.
This proves that a neural network can realize the prediction function, and hence it is one of the
machine learning techniques.
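As an illustration (a sketch, not part of the original answer), a one-hidden-layer network mapping R^N → R^M can be written in Keras; the layer sizes, hidden width, and the toy target function y = sin(x) below are assumed:

import numpy as np
import tensorflow as tf

N, M, hidden = 1, 1, 64                       # assumed dimensions and hidden width

# One hidden layer is enough in principle, given sufficiently many hidden neurons
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden, activation="relu", input_shape=(N,)),
    tf.keras.layers.Dense(M),                 # linear output layer
])
model.compile(optimizer="adam", loss="mse")

# Training stage: fit f on labelled examples {(x_i, y_i)}
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)
model.fit(x, y, epochs=200, verbose=0)

# Testing stage: apply f to an unseen sample
print(model.predict(np.array([[0.5]]), verbose=0))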

Question 7
For each training sample i, calculate the loss as follows:
Li = Σ(j ≠ yi) max(0, Sj − Syi + 1)

Class C1 samples:
        (1,2)   (2,1)   (2,2)
S1        5       5       7
S2      -0.5    -0.5    -0.7
S3      -30      10     -10
Loss      0       6       0

For (1,2):
S1 ≥ S2 + 1 :  5 ≥ -0.5 + 1  →  Loss = 0
S1 ≥ S3 + 1 :  5 ≥ -30 + 1   →  Loss = 0
Sum of Loss = 0
Question 7
For (2,1) from the same C1 table:
S1 ≥ S2 + 1 :  5 ≥ -0.5 + 1        →  Loss = 0
S1 ≥ S3 + 1 :  5 ≥ 10 + 1 fails    →  Loss = 10 - 5 + 1 = 6
Sum of Loss = 6
Question 7
For each training sample i, calculate the loss as follows:

Class C2 samples:
        (-1,-2)  (-2,-1)  (-2,-2)
S1        -7       -7       -9
S2        0.7      0.7      0.9
S3        10      -30      -10
Loss     10.3       0        0

Question 7
For each training sample i, calculate the loss as follows:

Class C3 samples:
        (-1,2)   (-2,1)   (-2,2)
S1        1        -3       -1
S2      -0.1       0.3      0.1
S3       -70      -70      -90
Loss    142.9    139.3    181.1

Question 7
For the unknown sample (-3, 5), compute the scores and classify by the maximum score:

        (-3,5)
S1        3
S2      -0.3
S3      -170
Class   Class 1

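The table values above can be checked with a few lines of NumPy (a sketch; the scores are taken directly from the tables, the underlying weights are not reproduced here):

import numpy as np

def hinge_losses(scores, correct):
    # Per-sample loss: sum over j != correct of max(0, S_j - S_correct + 1)
    margins = np.maximum(0.0, scores - scores[:, [correct]] + 1.0)
    margins[:, correct] = 0.0
    return margins.sum(axis=1)

# Rows are samples, columns are the scores (S1, S2, S3) from the tables above
c1 = np.array([[5, -0.5, -30], [5, -0.5, 10], [7, -0.7, -10]])     # true class C1
c2 = np.array([[-7, 0.7, 10], [-7, 0.7, -30], [-9, 0.9, -10]])     # true class C2
c3 = np.array([[1, -0.1, -70], [-3, 0.3, -70], [-1, 0.1, -90]])    # true class C3

print(hinge_losses(c1, 0))   # [0.  6.  0.]
print(hinge_losses(c2, 1))   # [10.3  0.  0.]
print(hinge_losses(c3, 2))   # [142.9  139.3  181.1]

# Unknown sample (-3, 5): classify by the maximum score
print("Class", 1 + int(np.argmax([3, -0.3, -170])))                # Class 1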
Question 10
Why are thin and tall networks preferred to fat and short networks?
Because:
• They can automatically learn high-level features.
• Ability to do transfer learning.
• Modularity: parts of a deep neural network can be reused as building blocks, like LEGO.

Q12: Vanishing Gradient Problem
Describe how to overcome the vanishing gradients problem? How does this problem affect the
training of neural networks?
The vanishing gradients problem happens for several reasons. One of them is that the activation function has low or zero gradient values: as can be seen from the gradients of the sigmoid and tanh functions, only a small range of input values has a significant gradient, and in the case of the sigmoid the gradient is at most 0.25 (see the sketch below).
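A quick numerical illustration (a sketch; the depth and input range are assumed): the sigmoid derivative never exceeds 0.25, so backpropagating through many sigmoid layers multiplies many factors ≤ 0.25 together and the gradient shrinks toward zero.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 1000)
print("max sigmoid gradient:", sigmoid_grad(z).max())     # ~0.25, reached at z = 0

# Chaining L sigmoid layers multiplies L such factors together (best case z = 0 assumed)
for L in (5, 10, 20):
    print(f"{L} layers -> gradient factor <= {0.25 ** L:.2e}")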

Q12: Vanishing Gradient Problem
Describe how to overcome the vanishing gradients problem? How does this problem affect the
training of neural networks?

This can be overcome by (see the sketch below):
• Choosing activation functions with higher gradient values (e.g. ReLU).
• Batch normalization.
• Gradient clipping.
• Better weight initialization techniques.
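A hedged Keras sketch of how these remedies typically appear in code (layer sizes, input dimension, and hyperparameters are assumed, not from the sheet):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",                    # higher-gradient activation
                          kernel_initializer="he_normal",            # better weight initialization
                          input_shape=(100,)),
    tf.keras.layers.BatchNormalization(),                            # batch normalization
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradient clipping is applied through the optimizer
opt = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()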

Q13: Overfitting problem
• What is the overfitting problem? How to avoid such a problem when training neural
networks?
Overfitting happens when the neural network over-tunes its parameters so that it
becomes specific to the training dataset only. In this case, the error on the
training samples becomes very low (almost zero) while the error on the testing
dataset increases. It is said that the network failed to generalize.

[Figure: training and testing error versus training epochs]

Q13: Overfitting problem
• What is the overfitting problem? How to avoid such a problem when training neural
networks?
This can be avoided by (see the sketch below):
• Using a validation dataset to check how well the network generalizes.
• Data augmentation to increase the variations in the training dataset.
• Using dropout and batch normalization techniques.
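A brief Keras sketch of these ideas (the model, data, and split ratio are assumed): a validation split to monitor generalization, plus Dropout and BatchNormalization inside the model.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.5),                      # randomly shuts down 50% of the nodes
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Assumed random data standing in for a real training set
x = np.random.randn(1000, 20)
y = np.random.randint(0, 2, size=1000)

# validation_split holds out part of the data to check how well the network generalizes
history = model.fit(x, y, epochs=5, validation_split=0.2, verbose=0)
print(history.history["val_accuracy"][-1])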


Question 14
What is the relation between ReLUs and drop out?
The output of ReLU is 0 for all negative inputs, which means that the node is effectively "shut down" during the
calculation for that sample or batch: it contributes neither to the output nor to backpropagation.
This behaviour is similar to the dropout technique, which shuts down a percentage of the nodes during both the
feedforward and the backpropagation passes.
Both have been shown to increase the ability of the network to generalize and to achieve better results on the testing
dataset.

Q17: Describe how network parameters are initialized

Parameter initialization in neural networks depends mainly on randomness. Different approaches
suggest the random distribution and its parameters, for example:

• Uniform distribution in the range [−1/r, 1/r]
• Gaussian distribution with mean 0 and standard deviation 1/r

Where r is the number of input nodes.

• Xavier initialization suggests initializing the weights with a Gaussian distribution of mean 0 and a
  standard deviation that is different in each layer, according to the following equation:
  σ = √( 2 / (r_in + r_out) )

Where r_in and r_out are the number of inputs and outputs of the layer, respectively.
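A small NumPy sketch of these schemes (the layer sizes are assumed; this mirrors the formulas above rather than any particular library's defaults):

import numpy as np

r_in, r_out = 256, 128        # assumed layer fan-in and fan-out
rng = np.random.default_rng(0)

# Uniform in [-1/r, 1/r] and Gaussian with std 1/r, where r is the number of input nodes
w_uniform = rng.uniform(-1.0 / r_in, 1.0 / r_in, size=(r_out, r_in))
w_gauss = rng.normal(0.0, 1.0 / r_in, size=(r_out, r_in))

# Xavier initialization: Gaussian with std sqrt(2 / (r_in + r_out))
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (r_in + r_out)), size=(r_out, r_in))

for name, w in [("uniform", w_uniform), ("gaussian", w_gauss), ("xavier", w_xavier)]:
    print(f"{name:8s} std = {w.std():.4f}")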
Question 18
• What are the main three properties of CNN? Which part of the network is related to which property?

The main three properties of CNN:


1. Locally Connected: As some patterns are smaller than the whole
image, a neuron doesn't have to see the whole image to detect the
pattern. Such neurons are connected to only part of the previous layer's
output (fewer parameters).

Question 18
• What are the main three properties of CNN? Which part of the network is related to which property?

The main three properties of CNN:


2. Parameter sharing: As the same pattern may exist in different regions of the image, two or more neurons
may end up doing almost the same thing. These neurons can share the same set of parameters.

Question 18
• What are the main three properties of CNN? Which part of the network is related to which property?

The main three properties of CNN:


3. Subsampling (Pooling): Subsampling the image will not change the object, so we can subsample the
pixels to get a smaller image, which means fewer network parameters.

Question 18
• What are the main three properties of CNN? Which part of the network is related to which property?

The first two properties are related to the convolution layers while the last property is related to
the pooling layer.
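A minimal Keras sketch (input size and filter counts are assumed) showing where each property lives: the Conv2D layer gives local connectivity and parameter sharing, and the MaxPooling2D layer does the subsampling.

import tensorflow as tf

model = tf.keras.Sequential([
    # Convolution: each output value sees only a 3x3 patch (locally connected),
    # and the same 3x3 filters slide over the whole image (parameter sharing)
    tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu",
                           input_shape=(28, 28, 1)),
    # Pooling: subsamples the feature maps, halving the spatial size
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()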

Question 21
The following CNN architecture is called AlexNet:-

Discuss the different structures of the network stages, listing:
i) filter sizes, ii) stride and padding amounts, and iii) max-pooling sizes.
Write a KERAS code illustrating the sequence of steps applied on an input
image to get the final output vector of the CNN.
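Since the original AlexNet figures are not reproduced here, the following is a hedged Keras sketch of the commonly cited AlexNet configuration; the exact numbers (227×227×3 input, filter sizes 11×11/5×5/3×3, the strides and padding in the comments, 3×3 max pooling with stride 2) are assumptions rather than a transcription of the slide's figure.

import tensorflow as tf

model = tf.keras.Sequential([
    # Stage 1: 96 filters of 11x11, stride 4, no padding -> 55x55x96, then 3x3/2 max pool -> 27x27x96
    tf.keras.layers.Conv2D(96, (11, 11), strides=4, activation="relu",
                           input_shape=(227, 227, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    # Stage 2: 256 filters of 5x5, stride 1, 'same' padding -> 27x27x256, then 3x3/2 max pool -> 13x13x256
    tf.keras.layers.Conv2D(256, (5, 5), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    # Stages 3-5: 3x3 filters, stride 1, 'same' padding, 384/384/256 filters, then 3x3/2 max pool -> 6x6x256
    tf.keras.layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    # Fully connected classifier: 4096 -> 4096 -> 1000-way softmax
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation="softmax"),
])
model.summary()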
Question 21
• AlexNet Explained
[Figure: the AlexNet architecture, annotated with per-stage filter sizes, strides, padding amounts, and max-pooling sizes]
Question 20
For the VGGNET structure given below, assume that the filter size is 3×3 in the convolutional layers with
a stride and a padding amount of 1. Discuss and calculate the size of each stage.

For the convolution layers:  F = 3, S = 1, P = 1, so the output width = (224 − 3 + 2(1))/1 + 1 = 224
For the max-pooling layers:  F = 2, S = 2, P = 0, so the output width = (224 − 2 + 2(0))/2 + 1 = 112

Layer                  | Input size     | Output size    | Parameters
1st Conv. Layer [64]   | [224x224x3]    | [224x224x64]   | [[3x3x3]+1]x64
2nd Conv. Layer [64]   | [224x224x64]   | [224x224x64]   | [[3x3x64]+1]x64
Max pooling            | [224x224x64]   | [112x112x64]   | 0
3rd Conv. Layer [128]  | [112x112x64]   | [112x112x128]  | [[3x3x64]+1]x128
4th Conv. Layer [128]  | [112x112x128]  | [112x112x128]  | [[3x3x128]+1]x128
Max pooling            | [112x112x128]  | [56x56x128]    | 0
5th Conv. Layer [256]  | [56x56x128]    | [56x56x256]    | [[3x3x128]+1]x256
Question 20

Layer                  | Input size     | Output size    | Parameters
6th Conv. Layer [256]  | [56x56x256]    | [56x56x256]    | [[3x3x256]+1]x256
Max pooling            | [56x56x256]    | [28x28x256]    | 0
7th Conv. Layer [512]  | [28x28x256]    | [28x28x512]    | [[3x3x256]+1]x512
8th Conv. Layer [512]  | [28x28x512]    | [28x28x512]    | [[3x3x512]+1]x512
Max pooling            | [28x28x512]    | [14x14x512]    | 0
9th Conv. Layer [512]  | [14x14x512]    | [14x14x512]    | [[3x3x512]+1]x512
10th Conv. Layer [512] | [14x14x512]    | [14x14x512]    | [[3x3x512]+1]x512
Max pooling            | [14x14x512]    | [7x7x512]      | 0
Flatten                | [7x7x512]      | [25088x1]      | 0
Question 20

Layer          | Input size   | Output size  | Parameters
1st FC [4096]  | [25088x1]    | [4096x1]     | [25088+1]x4096
2nd FC [4096]  | [4096x1]     | [4096x1]     | [4096+1]x4096
3rd FC [1000]  | [4096x1]     | [1000x1]     | [4096+1]x1000
SoftMax        | [1000x1]     | [1000x1]     | 0
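The sizes and parameter counts in these tables can be checked with a short helper (a sketch; the formulas are the ones used above: output width = (W − F + 2P)/S + 1, conv parameters = [[F x F x C_in] + 1] x C_out):

def conv_out(w, f=3, s=1, p=1):
    # Spatial output size of a convolution: (W - F + 2P)/S + 1
    return (w - f + 2 * p) // s + 1

def pool_out(w, f=2, s=2, p=0):
    return (w - f + 2 * p) // s + 1

def conv_params(c_in, c_out, f=3):
    # [[F x F x C_in] + 1] x C_out (the +1 is the bias per filter)
    return (f * f * c_in + 1) * c_out

# A couple of rows from the tables above
print(conv_out(224), conv_params(3, 64))       # 1st conv: 224, 1792
print(pool_out(224))                           # 1st max pooling: 112
print(conv_params(512, 512))                   # 9th/10th conv: 2359808
print((25088 + 1) * 4096)                      # 1st fully connected layer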

