
VIETNAM GENERAL CONFEDERATION OF LABOUR

TON DUC THANG UNIVERSITY


FACULTY OF INFORMATION TECHNOLOGY

Midterm Project

Instructor: Dr. LE ANH CUONG


Students: TRAN THIEN PHONG - 519H0336
VI THI NGOC YEN - 519H0264
Group: 6
Course: 23

HO CHI MINH CITY, YEAR 2021



THANK YOU
Thank you, Professor Le Anh Cuong, for always helping us since the day we
first joined the Introduction to Machine Learning course. We look up to you
because of your dedication and hard work.

THE PROJECT WAS COMPLETED AT


TON DUC THANG UNIVERSITY

We declare that this is our own project, carried out under the guidance of Professor Le Anh Cuong. The content and results in this report are honest and have not been published in any form before. The data in the tables, which serve the analysis and comments, were collected by the authors from sources that are listed in the references.
If any fraud is found, we will take full responsibility for the content of our report. Ton Duc Thang University is not involved in any copyright infringement that we may have caused in the process.
Ho Chi Minh City, 13 November 2021
Authors
(Signature and full name)

Phong Yen

Tran Thien Phong Vi Thi Ngoc Yen



INSTRUCTOR’S ASSESSMENT

_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
Ho Chi Minh City, 26 April 2021
(Signature and full name)

SUMMARY

Summarize the problems that were researched, how they were solved, the
results obtained and the main findings in 1 - 5 pages.

PROJECT

PART 1
Present the theory of the Feedforward Neural Network, including the following issues

1) Introduction:
*) Before turning to the Feedforward Neural Network, let’s find out what a Neural Network is:
- A neural network is a network or circuit of interconnected neurons. The connections between the neurons are modeled as weights. A positive weight represents an excitatory connection, while a negative weight represents an inhibitory one.
- Neural networks are used for predictive modeling, adaptive control and other applications where they can be trained on a dataset.
*) So what is a Feedforward Neural Network?
- "Feedforward" means that whatever you feed into the input layer travels in one direction only: from the input layer to the hidden layers, and from the hidden layers to the output layer.
- Feedforward is the algorithm that calculates the output vector from the input vector: its input is input_vector and its output is output_vector.
*) So how does a Feedforward Neural Network work?
- Feedforward neural networks were among the first and most successful learning algorithms. They are also called deep networks, multi-layer perceptrons (MLP), or simply neural networks. As data travels through the network’s artificial mesh, each layer processes an aspect of the data, filters outliers, spots familiar entities and produces the final output.

1.1) The general architecture:



- Input layer: This layer consists of the neurons that receive inputs and pass them on to the other layers. The number of neurons in the input layer should be equal to the number of attributes or features in the dataset.
- Hidden layer: In between the input and output layers, there are one or more hidden layers, depending on the type of model. Hidden layers contain neurons which apply transformations to the inputs before passing them on. As the network is trained, the weights are updated to be more predictive.
- Neuron weights: Weights refer to the strength or amplitude of a connection between two neurons. If you are familiar with linear regression, you can compare the weights on inputs to coefficients. Weights are often initialized to small random values, such as values in the range 0 to 1.
- Output layer: The output layer produces the predicted feature and depends on the type of model you’re building.

There are four steps in the FFNN computation:

1. Multiplication of weights and inputs:
- The inputs are multiplied by their assigned weight values.
For example:

(x1 * w1)
(x2 * w2)
...
(xn * wn)

2. Adding biases:
- In the next step, each product found in the previous step is added to its respective bias, and the results are summed to a single value.
For example:

(x1 * w1) + b1
(x2 * w2) + b2
...
(xn * wn) + bn
=> weighted_sum = (x1 * w1) + b1 + (x2 * w2) + b2 + ... + (xn * wn) + bn.

3. Activation:
- An activation function maps the summed weighted input to the output of the neuron. It is also called a transfer function.

4. Output signal:
- Finally, the weighted sum is turned into an output signal by feeding it into the activation function.
- There are several activation functions for different use cases. The most commonly used activation functions are relu, tanh and softmax.
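
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a single neuron with one bias per input (as in the example above) and a ReLU activation; the names relu and neuron_output are made up for this sketch.

```python
def relu(z):
    """ReLU activation: keep positive values, clamp negative ones to 0."""
    return max(0.0, z)


def neuron_output(inputs, weights, biases, activation=relu):
    # Steps 1 and 2: multiply each input by its weight, add its bias,
    # and sum everything to a single value.
    weighted_sum = sum(x * w + b for x, w, b in zip(inputs, weights, biases))
    # Steps 3 and 4: the activation function turns the sum into the output signal.
    return activation(weighted_sum)


print(neuron_output([1.0, 2.0], [0.5, -0.25], [0.1, 0.1]))
```

Passing a different function as activation (for example math.tanh) changes only the last step.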

Attributes:

- X: Input nodes (from the input layer).
- Connection: A weighted relationship between a node of one layer and a node of another layer.
- W: The weight of a connection.
- B: Bias nodes (a constant, typically set equal to 1.0).
- H: Hidden nodes (from the hidden layers).
- Y: Output nodes (a weighted sum of the last hidden layer).
- E: Total difference between the output of the network and the desired values (the total error is typically measured by estimators such as mean squared error, entropy, etc.).

1.2) The backpropagation algorithm:

1.2.1) What is backpropagation:


- The backpropagation algorithm is a gradient descent technique. Gradient
descent aims to find a local minimum of a function by iteratively moving in the
opposite direction of the gradient (i.e., the slope) of the function at the current
point. The goal of learning in neural networks is to minimize the cost function
given the training set. The cost function is a function of network weights and
biases of all the neurons in all the layers. Backpropagation iteratively computes
the gradient of cost function relative to each weight and bias, then updates the
weights and biases in the opposite direction of the gradient, to find a local
minimum.

1.2.2) How the backpropagation algorithm works:


- How the algorithm works is best explained based on a simple network, like the
one given in the next figure. It only has an input layer with 2 inputs (X1 and X2),
and an output layer with 1 output. There are no hidden layers.

-The weights of the inputs are W1 and W2, respectively.  The bias is treated as a
new input neuron to the output neuron which has a fixed value +1 and a weight
b. Both the weights and biases could be referred to as parameters.

Let’s assume that the output layer uses the sigmoid activation function defined by the following equation:

f(s) = 1/(1 + e^(-s))

Where s is the sum of products (SOP) between each input and its corresponding weight:
s = X1*W1 + X2*W2 + b
Example: We have X1 = 0.1, X2 = 0.3, W1 = 0.5, W2 = 0.2, b = 1.83, Desired Output = 0.03
The output of the neuron is computed as follows:
s = X1*W1 + X2*W2 + b
s = 0.1*0.5 + 0.3*0.2 + 1.83
s = 1.94

Applied to the sigmoid:

f(1.94) = 1/(1 + e^(-1.94)) = 0.874352143


Next, we can measure the error of our network as follows:

E = 1/2 (desired - predicted)^2 = 1/2 (0.03 - 0.874352143)^2 = 0.356465271

The result is approximately 0.357. The error just gives us an indication of how far the predicted results are from the desired results.
Once the error is calculated, the forward pass ends, and we start the backward pass to calculate the derivatives and update the parameters.
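
The forward pass of this example can be reproduced with a short script. It is a sketch that assumes the sigmoid activation and the squared error E = 1/2 (desired - predicted)^2 used in this section; the variable names are chosen for this illustration only.

```python
import math


def sigmoid(s):
    """Sigmoid activation f(s) = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))


X1, X2 = 0.1, 0.3           # inputs
W1, W2, b = 0.5, 0.2, 1.83  # weights and bias
desired = 0.03              # desired output

s = X1 * W1 + X2 * W2 + b                 # sum of products (SOP)
predicted = sigmoid(s)                    # output of the neuron
error = 0.5 * (desired - predicted) ** 2  # prediction error

print(s, predicted, error)
```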

Parameters update equation

The parameters can be changed according to the next equation:

W(n+1)=W(n)+η[d(n)-Y(n)]X(n)

For our network, these parameters have the following values:

- n: 0 (the current training step).
- W(n): [1.83, 0.5, 0.2]. The first value is the bias b, followed by W1 and W2.
- η: the learning rate. Because it is a hyperparameter, we can choose it freely, for example 0.01.
- d(n): [0.03] (the desired output).
- Y(n): [0.874352143] (the predicted output).
- X(n): [+1, 0.1, 0.3]. The first value (+1) is for the bias.

Applying the formula we have:

W(n+1) = [1.83, 0.5, 0.2] + 0.01*[0.03 - 0.874352143]*[+1, 0.1, 0.3]
= [1.821556479, 0.499155648, 0.197466943]
We have the new parameters:
b = 1.821556479, W1 = 0.499155648, W2 = 0.197466943
Based on the new parameters, we will recalculate the predicted output. The new
predicted output is used to calculate the new network error. The network
parameters are updated according to the calculated error. The process continues
to update the parameters and recalculate the predicted output until it reaches an
acceptable value for the error.
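
As a quick check of the arithmetic, the update equation W(n+1) = W(n) + η[d(n) - Y(n)]X(n) can be evaluated directly. This is a minimal sketch; the list layout [b, W1, W2] follows the order of X(n) = [+1, X1, X2] given above.

```python
eta = 0.01                # learning rate
W = [1.83, 0.5, 0.2]      # [b, W1, W2]
X = [1.0, 0.1, 0.3]       # [+1 for the bias, X1, X2]
d, Y = 0.03, 0.874352143  # desired and predicted output

# W(n+1) = W(n) + eta * (d(n) - Y(n)) * X(n), applied element by element
W_new = [w + eta * (d - Y) * x for w, x in zip(W, X)]
print(W_new)
```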

Partial derivative

One important operation used in the backward pass is to calculate derivatives. Before getting into the calculations of derivatives in the backward pass, we can start with a simple example to make things easier.

For the function Y = X^2*Z + H, we have the partial derivative:

∂Y/∂X = 2*X*Z + 0

Note that everything except X is regarded as a constant. Therefore, H is replaced by 0 after calculating the partial derivative. Here, ∂X means a tiny change of variable X. Similarly, ∂Y means a tiny change of Y. The change of Y is the result of changing X.
The small change can be an increase or decrease by a tiny value. By substituting different values of X, we can find how Y changes with respect to X.
The same procedure can be followed to learn how the NN prediction error changes W.R.T changes in network weights. So, our target is to calculate ∂E/∂W1 and ∂E/∂W2 as we have just two weights W1 and W2.

Derivatives of the Prediction Error W.R.T Parameters

Looking at this equation, Y = X^2*Z + H, it seems straightforward to calculate the partial derivative ∂Y/∂X because there’s an equation relating both Y and X. In our case, there’s no direct equation in which both the prediction error and the weights exist. So, we’re going to use the multivariate chain rule to find the partial derivative of the prediction error W.R.T the weights.

Prediction Error to Parameters Chain

The prediction error is calculated based on this equation:

E = 1/2 (desired - predicted)^2

The desired term in the previous equation is a constant, so there’s no chance of reaching the parameters through it. The predicted term is calculated based on the sigmoid function, like in the next equation:

f(s) = 1/(1 + e^(-s))

The term s is in turn calculated from the inputs and the parameters:

s = X1*W1 + X2*W2 + b
Once we’ve reached an equation that has the parameters (weights and biases), we’ve reached the end of the derivative chain. The chain of derivatives to follow to calculate the derivative of the error W.R.T the parameters is: E -> predicted -> s -> W1, W2, b.

Note that the derivative of s W.R.T the bias b (∂s/∂b) is 1, so the bias chain needs no extra factor. The walkthrough below follows only the chains for the two weights.

According to the chain of derivatives, to know how the prediction error changes W.R.T changes in the parameters we should find the following intermediate derivatives:

1. Network error W.R.T the predicted output.
2. Predicted output W.R.T the SOP.
3. SOP W.R.T each of the parameters.

For the two weights, there are four intermediate partial derivatives in total:

∂E/∂Predicted, ∂Predicted/∂s, ∂s/∂W1 and ∂s/∂W2

To calculate the derivative of the error W.R.T the weights, we multiply the derivatives along each chain:

∂E/∂W1 = ∂E/∂Predicted * ∂Predicted/∂s * ∂s/∂W1

∂E/∂W2 = ∂E/∂Predicted * ∂Predicted/∂s * ∂s/∂W2

Because the following equation seems complex to differentiate W.R.T the parameters directly, it’s preferred to use the multivariate chain rule for simplicity.

E = 1/2 (desired - 1/(1 + e^(-(X1*W1 + X2*W2 + b))))^2


Calculating partial derivative values by substitution

For the derivative of the error W.R.T the predicted output:

∂E/∂Predicted = ∂/∂Predicted (1/2 (desired - predicted)^2)
= 2 * 1/2 * (desired - predicted)^(2-1) * (0 - 1)
= (desired - predicted) * (-1)
= predicted - desired

By substituting the values:

∂E/∂Predicted = predicted - desired = 0.874352143 - 0.03
∂E/∂Predicted = 0.844352143

For the derivative of the predicted output W.R.T the SOP:

∂Predicted/∂s = ∂/∂s (1/(1 + e^(-s)))
∂Predicted/∂s = 1/(1 + e^(-s)) * (1 - 1/(1 + e^(-s)))
= 1/(1 + e^(-1.94)) * (1 - 1/(1 + e^(-1.94)))
= 0.874352143 * (0.125647857)
∂Predicted/∂s = 0.109860473

For the derivative of SOP W.R.T W1:

∂s/∂W1 = ∂/∂W1 (X1*W1 + X2*W2 + b)
= 1*X1*W1^(1-1) + 0 + 0
= X1*W1^0
= X1*(1)
∂s/∂W1 = X1

By substituting the values:

∂s/∂W1 = X1 = 0.1

For the derivative of SOP W.R.T W2:

∂s/∂W2 = ∂/∂W2 (X1*W1 + X2*W2 + b)
= 0 + 1*X2*W2^(1-1) + 0
= X2*W2^0
= X2*(1)
∂s/∂W2 = X2

By substituting the values:

∂s/∂W2 = X2 = 0.3

After calculating the individual derivatives in all chains, we can multiply all of
them to calculate the desired derivatives.

For the derivative of the error W.R.T W1:

∂E/∂W1 = 0.844352143 * 0.109860473 * 0.1
∂E/∂W1 = 0.009276093

For the derivative of the error W.R.T W2:

∂E/∂W2 = 0.844352143 * 0.109860473 * 0.3
∂E/∂W2 = 0.027828278

Finally, there are two values reflecting how the prediction error changes with respect to the weights:

0.009276093 for W1

0.027828278 for W2
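
The chain-rule result can be cross-checked numerically. The sketch below recomputes the two derivatives from the three factors and compares the first one with a central finite difference; the helper name error and the step size h are assumptions of this illustration, not part of the derivation above.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def error(W1, W2, b=1.83, X1=0.1, X2=0.3, desired=0.03):
    """Prediction error E = 1/2 (desired - predicted)^2 for given weights."""
    predicted = sigmoid(X1 * W1 + X2 * W2 + b)
    return 0.5 * (desired - predicted) ** 2


X1, X2, W1, W2, b, desired = 0.1, 0.3, 0.5, 0.2, 1.83, 0.03
predicted = sigmoid(X1 * W1 + X2 * W2 + b)

dE_dpred = predicted - desired            # dE/dPredicted
dpred_ds = predicted * (1.0 - predicted)  # dPredicted/ds (sigmoid derivative)
dE_dW1 = dE_dpred * dpred_ds * X1         # ds/dW1 = X1
dE_dW2 = dE_dpred * dpred_ds * X2         # ds/dW2 = X2

# Central finite difference as an independent check of dE/dW1
h = 1e-6
numeric_dW1 = (error(W1 + h, W2) - error(W1 - h, W2)) / (2 * h)

print(dE_dW1, dE_dW2, numeric_dW1)
```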

Interpreting results of backpropagation

- Because the ∂E/∂W1 derivative is positive, increasing W1 increases the error: to first order, the total error grows by 0.009276093 per unit increase of W1.
- Because the ∂E/∂W2 derivative is positive, increasing W2 increases the error: to first order, the total error grows by 0.027828278 per unit increase of W2.

Updating weights

For W1:

W1new = W1 - η*∂E/∂W1
= 0.5 - 0.01*0.009276093
W1new = 0.49990723907

For W2:

W2new = W2 - η*∂E/∂W2
= 0.2 - 0.01*0.027828278
W2new = 0.1997217172

The new values for the weights are:

- W1 = 0.49990723907
- W2 = 0.1997217172

Here are the new forward pass calculations, using the bias b = 1.821556479 from the parameter update equation above:

s = X1*W1 + X2*W2 + b
s = 0.1*0.49990723907 + 0.3*0.1997217172 + 1.821556479
s = 1.931463718067

f(s) = 1/(1 + e^(-s))
f(s) = 1/(1 + e^(-1.931463718067))
f(s) = 0.873411342830056

E = 1/2 (0.03 - 0.873411342830056)^2
E = 0.35567134660719907

When comparing the new error (0.35567134660719907) with the old error (0.356465271), there’s a reduction of 0.0007939243928009043. As long as there’s a reduction, we’re moving in the right direction.

The forward and backward passes should be repeated until the error is close enough to 0 or for a fixed number of epochs.
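
Repeating the two passes can be sketched as a small training loop. This is an illustration under stated assumptions: the bias is updated with its own gradient (using ∂s/∂b = 1), which differs slightly from the bias value used in the hand calculation above, and the number of epochs (1000) is an arbitrary choice.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


X1, X2, desired = 0.1, 0.3, 0.03
W1, W2, b = 0.5, 0.2, 1.83
eta = 0.01  # learning rate

errors = []
for epoch in range(1000):
    # Forward pass
    predicted = sigmoid(X1 * W1 + X2 * W2 + b)
    errors.append(0.5 * (desired - predicted) ** 2)
    # Backward pass: dE/ds is shared by all three parameters
    dE_ds = (predicted - desired) * predicted * (1.0 - predicted)
    W1 -= eta * dE_ds * X1
    W2 -= eta * dE_ds * X2
    b -= eta * dE_ds  # ds/db = 1

print(errors[0], errors[-1])
```

With such a small learning rate the error shrinks slowly but steadily from epoch to epoch.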

Advanced Issues:
1.3) Overfitting problem and how to solve it:

1.3.1) Overfitting problem:


- Overfitting occurs when a statistical model is fitted too closely to a limited set of training data, so that it learns the noise in that data along with the underlying pattern.
- An overfitted model may do well on the specific data it has been taught to scan for, but when the same process is applied to a new set of data, the results are often incorrect.
- In practice, a data set usually carries some degree of error or random noise. Thus, attempting to make the model conform too closely to slightly inaccurate data can infect the model with substantial errors and reduce its predictive power. An overfitted model also tends to rely on redundant or needless features, which makes it more complicated and less effective.
=> In conclusion, overfitting leads to high variance and low bias.

1.3.2) How to solve it:


There are many ways to solve this problem, typically:
- Reduce the complexity of the model.
- Increase the amount of training data.
- Data preparation (data cleaning, converting words to numbers, ...).
- Use dropout layers (randomly deactivate certain units by setting their outputs to zero).

The following techniques are useful for reducing overfitting:

1. Hold-out (data)
We can simply split our dataset into two sets: training and testing. A common split ratio is 80% for training and 20% for testing. We train our model until it performs well not only on the training set but also on the testing set.

2. Cross-validation (data)
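Cross-validation splits the data into several folds and uses each fold once for validation while training on the others. It allows all data to be eventually used for training but is also more computationally expensive than hold-out.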

3. Data augmentation (data)


A larger dataset would reduce overfitting. If we cannot gather more data and are
constrained to the data we have in our current dataset, we can apply data
augmentation to artificially increase the size of our dataset.

4. Feature selection (data)


- We should only select the most important features for training so that our model doesn’t have to learn from too many features and eventually overfit.

5. L1 / L2 regularization (learning algorithm)


- Regularization is a technique to constrain our network from learning a model that is too complex, which may therefore overfit. In L1 or L2 regularization, we add a penalty term to the cost function to push the estimated coefficients towards zero.
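
A minimal sketch of such a penalty term is shown below. The names l1_penalty, l2_penalty and regularized_cost, and the strength parameter lam, are hypothetical and chosen for this illustration: L1 adds the sum of absolute weights to the cost, L2 adds the sum of squared weights.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam * sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_loss, weights, lam, penalty=l2_penalty):
    # The optimizer minimizes the data loss plus the penalty, so large
    # weights are only kept when they reduce the data loss enough.
    return data_loss + penalty(weights, lam)


weights = [0.5, -0.2]
print(regularized_cost(0.35, weights, lam=0.1))
print(regularized_cost(0.35, weights, lam=0.1, penalty=l1_penalty))
```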

6. Remove layers / number of units per layer (model)


- We can directly reduce the model’s complexity by removing layers and reducing the size of our model. We may further reduce complexity by decreasing the number of neurons in the fully-connected layers.

7. Dropout (model)
- By applying dropout, which randomly sets a fraction of the units’ outputs to zero during training, we can reduce interdependent learning among units, which may have led to overfitting. However, with dropout, we would need more epochs for our model to converge.
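
The idea can be sketched in plain Python. This is an assumption-laden illustration of the common "inverted dropout" variant, in which the surviving activations are scaled up so that their expected value stays the same; the function name dropout and its parameters are made up for this sketch.

```python
import random


def dropout(activations, rate, rng=random):
    """Zero each activation with probability `rate` and scale the
    survivors by 1 / (1 - rate) (inverted dropout, training time only)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so that the run is repeatable
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
```

At prediction time dropout is switched off and the activations are used unchanged.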

8. Early stopping (model)


- We stop the training when the validation loss stops improving and save the current model. We can implement this either by monitoring the loss graph or by setting an early stopping trigger. The saved model would be the optimal model for generalization among different training epoch values.
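
An early stopping trigger can be sketched as follows. The function name early_stopping_epoch and the patience parameter (how many epochs without improvement are tolerated) are assumptions of this illustration; the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where the current model would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5]))
```

In this run the best model is the one from epoch 2, and training stops at epoch 4, before the later loss value is ever seen.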

1.4) The Adam optimizer:


- The Adam optimizer (Adaptive Moment Estimation) is an optimization algorithm based on gradient descent. It is used when working with large problems involving a lot of data.
- It is a combination of two other optimizer algorithms, which are Momentum and RMSprop.
But first we have to know about the other optimizer algorithms before going further:
- Basically, optimizer algorithms are the foundation of building a neural network. The goal is to “learn” the features or patterns of the input data and find a suitable set of weights and biases to optimize the model. But it is not practical to guess them or pick them at random and hope the problem will be solved. So we need optimizer algorithms to handle it.

1. Gradient Descent (GD):
- It is an algorithm to find the minimum of a function.
- Start with a random point on the function and move in the negative direction of the gradient to reach a local/global minimum.
Formula: xnew = xold - learning_rate*gradient(xold)
For example:
We have: y = (x + 5)^2
Taking a random point as a starter: xold = -3
Then find the gradient of the function by using the derivative: dy/dx = 2*(x + 5)
With learning_rate = 0.01:
=> Iteration 1: xnew = xold - learning_rate*gradient(xold)
xnew = -3 - (0.01)*(2*((-3) + 5)) = -3.04.
Iteration 2: xnew = xold - learning_rate*gradient(xold)
xnew = -3.04 - (0.01)*(2*((-3.04) + 5)) = -3.0792.
Keep repeating the formula as above and the results will approach the minimum, which is at x = -5.
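
The iterations above can be reproduced with a short loop. This is a minimal sketch for the example function y = (x + 5)^2; the number of iterations (1000) is an arbitrary choice for this illustration.

```python
def gradient(x):
    """Derivative of y = (x + 5)^2."""
    return 2 * (x + 5)


x = -3.0              # starting point
learning_rate = 0.01
history = [x]
for _ in range(1000):
    x = x - learning_rate * gradient(x)  # xnew = xold - learning_rate * gradient
    history.append(x)

print(history[1], history[2], x)
```

The first two entries of history after the start are the values of iteration 1 and iteration 2, and the final x lies very close to the minimum at x = -5.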

2. Stochastic Gradient Descent (SGD):

- Stochastic Gradient Descent is a variant of Gradient Descent.
- For big data sets, SGD is preferred over GD, although it needs more update steps and each step is noisier.
- If GD follows a fairly straight line to the result, SGD follows a zig-zag path. SGD randomly picks one data point from the whole data set at each iteration, which reduces the computation per iteration enormously.

3. Momentum:
- To address a limitation of GD, which may stop at a local minimum, momentum was introduced to help the optimizer move past shallow local minima towards a better one, although reaching the global minimum is not guaranteed.
- Momentum adds a term gamma*v, the previous update v scaled by a coefficient gamma, to make the solving process faster and more effective.

Formula: xnew = xold - (gamma*v + learning_rate*gradient)
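
The momentum formula can be sketched on the same example function y = (x + 5)^2. The value gamma = 0.9 and the function name momentum_descent are assumptions of this illustration; v holds the previous update.

```python
def gradient(x):
    """Derivative of y = (x + 5)^2."""
    return 2 * (x + 5)


def momentum_descent(x, steps, gamma=0.9, learning_rate=0.01):
    v = 0.0  # previous update (velocity), zero before the first step
    for _ in range(steps):
        v = gamma * v + learning_rate * gradient(x)  # gamma*v + learning_rate*gradient
        x = x - v                                    # xnew = xold - update
    return x


print(momentum_descent(-3.0, steps=1))
print(momentum_descent(-3.0, steps=500))
```

The first step equals plain gradient descent because v starts at zero; later steps reuse part of the previous update.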

4. Adagrad:
- Unlike the algorithms above, where the learning_rate stays the same during the whole training process (learning_rate is a constant), Adagrad treats the learning_rate as a parameter. It means that the effective learning_rate changes at every step t.

Formula: θ(t+1) = θ(t) - η / sqrt(G(t) + ϵ) * g(t)

η: a constant (the initial learning rate).
g(t): the gradient at step t.
ϵ: an error avoidance factor (a small value that prevents division by zero).
G(t): a diagonal matrix where each element on the diagonal (i, i) is the sum of the squares of the derivatives of parameter i up to step t.
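
A sketch of the Adagrad update for a single parameter is given below, following the usual formulation in which the accumulator sums the squared gradients over all past steps. With only one parameter the diagonal matrix G reduces to a single number; the function name adagrad and the default values are assumptions of this illustration.

```python
import math


def gradient(x):
    """Derivative of y = (x + 5)^2."""
    return 2 * (x + 5)


def adagrad(x, steps, learning_rate=1.0, eps=1e-8):
    G = 0.0  # running sum of squared gradients for this parameter
    for _ in range(steps):
        g = gradient(x)
        G += g * g
        # The effective learning rate shrinks as G grows.
        x -= learning_rate / math.sqrt(G + eps) * g
    return x


print(adagrad(-3.0, steps=1))
print(adagrad(-3.0, steps=200))
```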

5. Root Mean Square Propagation (RMSProp):

- RMSProp is an extension of GD and of Adagrad that uses a decaying average of the squared partial gradients in the adaptation of the step size for each parameter.
- This solves the learning slowdown of Adagrad (its ever-shrinking learning rate slows training down and can eventually freeze the training process).
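
The difference to Adagrad can be sketched for a single parameter: instead of summing all squared gradients, RMSProp keeps a decaying average of them. The decay value 0.9, the function name rmsprop and the other defaults are assumptions of this illustration.

```python
import math


def gradient(x):
    """Derivative of y = (x + 5)^2."""
    return 2 * (x + 5)


def rmsprop(x, steps, learning_rate=0.01, decay=0.9, eps=1e-8):
    avg_sq = 0.0  # decaying average of squared gradients
    for _ in range(steps):
        g = gradient(x)
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        x -= learning_rate / (math.sqrt(avg_sq) + eps) * g
    return x


print(rmsprop(-3.0, steps=1))
print(rmsprop(-3.0, steps=2000))
```

Because old gradients fade out of the average, the step size does not shrink towards zero as it does in Adagrad; with a constant learning rate the result typically ends up hovering close to the minimum rather than exactly on it.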

6. Adam:
- The Adam optimizer algorithm is a combination of Momentum and RMSProp.
- Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models.
- Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
- The Adam algorithm keeps two internal states for each parameter: the momentum (m) and the squared momentum (v) of the gradient.
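
Updating m and v:

m(t) = β1*m(t-1) + (1 - β1)*g(t)
v(t) = β2*v(t-1) + (1 - β2)*g(t)^2

Bias correction of m and v:

m_hat(t) = m(t) / (1 - β1^t)
v_hat(t) = v(t) / (1 - β2^t)

Updating equation:

θ(t) = θ(t-1) - α * m_hat(t) / (sqrt(v_hat(t)) + ϵ)

Here g(t) is the gradient at step t, α is the learning rate, β1 and β2 are the decay rates of the two moving averages, and ϵ is a small value that prevents division by zero.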
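
A sketch of the standard Adam update for a single parameter is given below. The hyperparameter names beta1, beta2 and eps and their default values follow the common convention and are not taken from this report; the learning rate 0.1 is an arbitrary choice for the example function y = (x + 5)^2.

```python
import math


def gradient(x):
    """Derivative of y = (x + 5)^2."""
    return 2 * (x + 5)


def adam(x, steps, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0  # momentum and squared momentum of the gradient
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1 - beta1) * g      # update m
        v = beta2 * v + (1 - beta2) * g * g  # update v
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


print(adam(-3.0, steps=1))
print(adam(-3.0, steps=1000))
```

In the first step the bias-corrected ratio m_hat / sqrt(v_hat) is the sign of the gradient, so the parameter moves by almost exactly the learning rate.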

PART 2
Study the Keras library

1. Introduction:
Keras is one of the leading high-level neural network APIs. It is written in Python and supports multiple back-end neural network computation engines. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.

Why Keras:
- Keras makes it easy to turn models into products.
- Keras has strong multi-GPU & distributed training support.
- Keras has broad adoption in the industry and the research community.

2. The Sequential model:



Setting up:
Setting up the environment:
Keras has to be installed before use, for example with pip.

Note: A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

- A Sequential model can be created by passing a list of layers to the constructor.
- The layers of the model are accessible through the layers attribute.
- Another way to create a Sequential model is to add layers one at a time with the add( ) function.
- The pop( ) function removes the last layer.
- Sequential models and layers can be named for convenience through the name argument.
- weights: The weights of a layer are created the first time the layer is called on an input, because their shape depends on the shape of the inputs.
- summary( ): Displays information about the model.
- Input(shape=( )): Passes an Input object to the model, so that it knows its input shape from the start.
- input_shape: Has the same use as Input(shape=( )) but is passed as an argument directly when adding a layer.
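
The calls listed above can be combined into one short sketch. It assumes that TensorFlow is installed and imports Keras through it; the layer sizes and the names layer1, layer2, layer3 and my_sequential are arbitrary choices for this illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Create a Sequential model from a list; the Input object tells the
# model its input shape (4 features) from the start.
model = keras.Sequential(
    [
        keras.Input(shape=(4,)),
        layers.Dense(2, activation="relu", name="layer1"),
        layers.Dense(3, activation="relu", name="layer2"),
    ],
    name="my_sequential",
)

model.add(layers.Dense(4, name="layer3"))  # append one more layer
print(len(model.layers))                   # layers are accessible as a list
model.pop()                                # remove the last layer again
print([layer.name for layer in model.layers])
model.summary()                            # layers, output shapes, parameter counts
```

Because the input shape is known from the start, the weights exist immediately and summary( ) can be called at any point while the model is being assembled.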

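Debugging workflows using the Conv2D and MaxPooling2D layers: when stacking these layers, calling summary( ) after adding each group of layers helps to track how the output shape changes.
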

3. Dense Layer.
Introduction:
- A Dense Layer is a fully connected neural network layer. That means each neuron in the Dense Layer receives input from all the neurons of the previous layer.
- The Dense Layer performs a matrix-vector multiplication. The values in the matrix are parameters that can be trained and updated with the help of backpropagation.
- The output of a dense layer is a vector with m dimensions, where m is the number of units. Therefore, a dense layer basically changes the size of the vector.

Setting up:
The Dense layer ships with Keras, so installing Keras is enough.

A Dense Layer is created with the following arguments:

- units: Positive integer, dimensionality of the output space.


- activation: Activation function to use. If you don't specify anything, no
activation is applied.
- use_bias: Boolean, whether the layer uses a bias vector.
- kernel_initializer: Initializer for the kernel weights matrix.
- bias_initializer: Initializer for the bias vector.
- kernel_regularizer: Regularizer function applied to the kernel
weights matrix.
- bias_regularizer: Regularizer function applied to the bias vector.
- activity_regularizer: Regularizer function applied to the output of the layer
(its "activation").

- kernel_constraint: Constraint function applied to the kernel weights matrix.


- bias_constraint: Constraint function applied to the bias vector.
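
What a dense layer computes can be sketched without Keras. The function dense_forward below is a hypothetical stand-in written for this illustration, not the Keras implementation: the number of kernel columns plays the role of units, passing bias=None mirrors use_bias=False, and activation=None applies no activation.

```python
def dense_forward(inputs, kernel, bias=None, activation=None):
    """Compute activation(inputs . kernel + bias) for one input vector.

    kernel has one row per input and one column per unit.
    """
    units = len(kernel[0])
    outputs = []
    for j in range(units):
        z = sum(x * row[j] for x, row in zip(inputs, kernel))
        if bias is not None:
            z += bias[j]
        outputs.append(activation(z) if activation else z)
    return outputs


kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 inputs -> 2 units
print(dense_forward([1.0, 2.0, 3.0], kernel, bias=[0.5, 0.5],
                    activation=lambda z: max(0.0, z)))
```

The input vector has 3 entries and the output has 2, which shows how the layer changes the size of the vector.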
