THANK YOU
Thank you, Professor Le Anh Cuong! Thank you for always helping
us from the day we first joined the Introduction to Machine
Learning course. We have always looked up to you for your dedication and
hard work.
We guarantee that this is our own project, carried out under the guidance of
Professor Le Anh Cuong. All of the content and results in this report are honest and
have not been published in any form before. All data in this report serve our own
analysis and comments, and material drawn from other authors is cited in the
reference files.
If any plagiarism is found, we will take full responsibility for the content of our
report. Ton Duc Thang University is not responsible for any copyright infringement
committed by us in the process.
Ho Chi Minh City, November 13, 2021
Author
(Signature and full name)
Phong Yen
INSTRUCTOR’S ASSESSMENT
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
_________________________________________________________
Ho Chi Minh City, April 26, 2021
(Signature and full name)
SUMMARY
This section recapitulates the problems that were researched, how they were
solved, the results that were obtained, and the basic findings, in 1–5 pages.
PROJECT
PART 1
Present the theory of the Feedforward Neural Network, including the following issues.
1) Introduction:
*) Before going into the Feedforward Neural Network, let’s find out what
a Neural Network is:
- A neural network is a network of interconnected neurons. The connections
between the neurons are specified in the architecture as weights. A
positive weight represents an excitatory connection, while an inhibitory
connection is shown by a negative weight.
- Neural networks are used for predictive modeling, adaptive control, and other
applications where they can be trained on a dataset.
*) So what is a Feedforward Neural Network?
- "Feed forward" is used when you input something at the input layer and
it travels from input to hidden and from hidden to output layer.
- Feed- algorithm to calculate output vector from input vector. Input for feed-
forward is input_vector, forward is output is output_vector.
*) So how does Feedforward Neural Network work?
- Feedforward neural networks were among the first and most successful
learning algorithms. They are also called deep networks, multi-layer perceptrons
(MLPs), or simply neural networks. As data travels through the network’s
artificial mesh, each layer processes an aspect of the data, filters outliers, spots
familiar entities, and produces the final output.
- Input layer: This layer consists of the neurons that receive inputs and pass
them on to the other layers. The number of neurons in the input layer should be
equal to the number of attributes or features in the dataset.
- Hidden layer: In between the input and output layer, there are hidden layers,
depending on the type of model. Hidden layers contain a vast number of neurons
which apply transformations to the inputs before passing them on. As the network is
trained, the weights are updated to be more predictive.
- Neuron weights: Weights refer to the strength or amplitude of a connection
between two neurons. If you are familiar with linear regression, you can think of
the weights on the inputs as coefficients. Weights are often initialized to small
random values, such as values in the range 0 to 1.
- Output layer: The output layer is the predicted feature and depends on the
type of model you’re building.
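The layer structure described above can be sketched in plain Python. The layer sizes, weights, and biases below are made-up illustration values, not taken from any real model:

```python
import math

def sigmoid(s):
    # squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

def layer_forward(inputs, weights, biases):
    # one fully connected layer: each output neuron takes a weighted
    # sum of every input neuron, adds its bias, then activates
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, neuron_w)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

# 2 input features -> 3 hidden neurons -> 1 output neuron
x = [0.1, 0.3]
hidden_w = [[0.5, 0.2], [0.1, 0.4], [0.3, 0.7]]  # 3 neurons, 2 weights each
hidden_b = [0.1, 0.2, 0.3]
output_w = [[0.6, 0.9, 0.2]]                     # 1 neuron, 3 weights
output_b = [0.5]

hidden = layer_forward(x, hidden_w, hidden_b)
output = layer_forward(hidden, output_w, output_b)
print(output)  # a single prediction between 0 and 1
```

Each list in `hidden_w` holds the incoming weights of one hidden neuron, matching the description of weights as connection strengths.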
1. Calculating the weighted sum:
- Each input is multiplied by its corresponding weight:
(x1 * w1)
(x2 * w2)
(xn * wn)
2. Adding biases:
- In the next step, each product found in the previous step is added to its
respective bias, and everything is summed to a single value.
For example:
(x1* w1) + b1
(x2* w2) + b2
(xn* wn) + bn
=> weighted_sum = (x1 * w1) + b1 + (x2 * w2) + b2 + ... + (xn * wn) + bn.
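The weighted sum above can be checked in a few lines of Python; the x, w, and b values here are arbitrary illustration numbers:

```python
# inputs, weights and biases for a single neuron (illustrative values)
x = [0.1, 0.3, 0.5]
w = [0.5, 0.2, 0.4]
b = [0.1, 0.2, 0.3]

# weighted_sum = (x1*w1) + b1 + (x2*w2) + b2 + (x3*w3) + b3
weighted_sum = sum(xi * wi + bi for xi, wi, bi in zip(x, w, b))
print(weighted_sum)  # ~0.91
```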
3. Activation:
- An activation function is the mapping of the summed weighted input to the
output of the neuron. It is called an activation/transfer function.
4. Output signal:
- Finally, the weighted sum obtained is turned into an output signal by
feeding it into an activation function (also called a transfer function).
- There are several activation functions for different use cases. The most
commonly used activation functions are ReLU, tanh, and softmax.
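These common activation functions are short enough to write out directly:

```python
import math

def relu(s):
    # ReLU: passes positive values through, clips negatives to 0
    return max(0.0, s)

def tanh(s):
    # tanh: squashes the input into the range (-1, 1)
    return math.tanh(s)

def softmax(scores):
    # softmax turns a vector of scores into probabilities summing to 1
    exps = [math.exp(v) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(softmax([1.0, 2.0, 3.0]))
```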
Attributes:
-The weights of the inputs are W1 and W2, respectively. The bias is treated as a
new input neuron to the output neuron which has a fixed value +1 and a weight
b. Both the weights and biases could be referred to as parameters.
Let’s assume that the output layer uses the sigmoid activation function, defined by
the following equation:
f(s) = 1/(1 + e^(-s))
Where s is the sum of products (SOP) between each input and its corresponding
weight, plus the bias:
s = X1*W1 + X2*W2 + b
Example: We have X1 = 0.1, X2 = 0.3, W1 = 0.5, W2 = 0.2, b = 1.83,
Desired Output = 0.03
Return output of the neuron:
s = X1*W1 + X2*W2 + b
s = 0.1*0.5 + 0.3*0.2 + 1.83
s = 1.94
The neuron’s output is f(1.94) ≈ 0.874352, so the error E = 1/2(0.03 - 0.874352)^2 ≈ 0.356465.
The error just gives us an indication of how far the predicted
result is from the desired result.
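This forward-pass example can be reproduced directly with the values from the text:

```python
import math

X1, X2 = 0.1, 0.3
W1, W2 = 0.5, 0.2
b = 1.83
desired = 0.03

s = X1 * W1 + X2 * W2 + b                 # sum of products plus bias
predicted = 1.0 / (1.0 + math.exp(-s))    # sigmoid activation
error = 0.5 * (desired - predicted) ** 2  # squared-error loss

print(s)          # 1.94
print(predicted)  # ~0.874352
print(error)      # ~0.356465
```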
We calculate the error, then the forward pass ends, and we should start
the backward pass to calculate the derivatives and update the parameters.
W(n+1)=W(n)+η[d(n)-Y(n)]X(n)
n: 0
W(n): [1.83, 0.5, 0.2]
η: Because it is a hyperparameter, we can choose, for example, 0.01.
d(n): [0.03].
Y(n): [0.874352143].
X(n): [+1, 0.1, 0.3]. First value (+1) is for the bias.
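Applying the update rule exactly as written, with the listed values, gives the following (the chain-rule derivation that follows refines how the gradient for each weight is obtained):

```python
# update rule: W(n+1) = W(n) + eta * [d(n) - Y(n)] * X(n)
W = [1.83, 0.5, 0.2]     # [bias, W1, W2]
eta = 0.01               # learning rate (a chosen hyperparameter)
d = 0.03                 # desired output
Y = 0.874352143          # actual output
X = [1.0, 0.1, 0.3]      # first value (+1) multiplies the bias

W_new = [w + eta * (d - Y) * x for w, x in zip(W, X)]
print(W_new)  # the bias entry becomes ~1.8215565
```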
Partial derivative
Note that in s = X1*W1 + X2*W2 + b, the derivative of each term not involving the parameter being differentiated is 0, so those terms can be omitted.
According to the previous figure, to know how prediction error changes W.R.T
changes in the parameters we should find the following intermediate derivatives:
∂E/∂Predicted, ∂Predicted/∂s, ∂s/∂W1 and ∂s/∂W2
Because this equation seems complex to calculate the derivative of the error
W.R.T the parameters directly, it’s preferred to use the multivariate chain rule
for simplicity.
∂E/∂Predicted = ∂/∂Predicted (1/2 (desired - predicted)^2)
= 2 * 1/2 (desired - predicted)^(2-1) * (0 - 1)
= (desired - predicted) * (-1)
= predicted - desired
∂E/∂Predicted = predicted - desired = 0.874352143 - 0.03
∂E/∂Predicted = 0.844352143
∂Predicted/∂s = ∂/∂s (1/(1 + e^(-s)))
∂Predicted/∂s = 1/(1 + e^(-s)) * (1 - 1/(1 + e^(-s)))
= 1/(1 + e^(-1.94)) * (1 - 1/(1 + e^(-1.94)))
= 0.874352143 * (0.125647857)
∂Predicted/∂s = 0.109860473
∂s/∂W1 = ∂/∂W1 (X1*W1 + X2*W2 + b)
= 1*X1*(W1)^(1-1) + 0 + 0
= X1*(W1)^0
= X1*(1)
∂s/∂W1 = X1
∂s/∂W1 = X1 = 0.1
∂s/∂W2 = ∂/∂W2 (X1*W1 + X2*W2 + b)
= 0 + 1*X2*(W2)^(1-1) + 0
= X2*(W2)^0
= X2*(1)
∂s/∂W2 = X2
∂s/∂W2 = X2 = 0.3
After calculating the individual derivatives in all chains, we can multiply all of
them to calculate the desired derivatives.
∂E/∂W1 = 0.844352143 * 0.109860473 * 0.1
∂E/∂W1 = 0.009276093
∂E/∂W2 = 0.844352143 * 0.109860473 * 0.3
∂E/∂W2 = 0.027828278
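These chain-rule multiplications can be verified in code, recomputing each intermediate derivative from the original values:

```python
import math

X1, X2, b = 0.1, 0.3, 1.83
W1, W2 = 0.5, 0.2
desired = 0.03

s = X1 * W1 + X2 * W2 + b
predicted = 1.0 / (1.0 + math.exp(-s))

# chain rule: dE/dW = dE/dPredicted * dPredicted/ds * ds/dW
dE_dPred = predicted - desired            # derivative of 1/2 (d - p)^2
dPred_ds = predicted * (1.0 - predicted)  # sigmoid derivative
dE_dW1 = dE_dPred * dPred_ds * X1         # ds/dW1 = X1
dE_dW2 = dE_dPred * dPred_ds * X2         # ds/dW2 = X2

print(dE_dW1)  # ~0.009276
print(dE_dW2)  # ~0.027828
```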
Finally, there are two values reflecting how the prediction error changes with
respect to the weights:
∂E/∂W1 = 0.009276093 for W1
∂E/∂W2 = 0.027828278 for W2
Updating weights
For W1:
W1new = W1 - η*∂E/∂W1
= 0.5 - 0.01*0.009276093
W1new = 0.49990723907
For W2:
W2new = W2 - η*∂E/∂W2
= 0.2 - 0.01*0.027828278
W2new = 0.1997217172
With the updated parameters W1 = 0.49990723907 and W2 = 0.1997217172, and
the bias updated to b = 1.821556479, the next forward pass gives:
s = 0.1*0.49990723907 + 0.3*0.1997217172 + 1.821556479
s = 1.931463718067
f(s) = 1/(1 + e^(-s))
f(s) = 1/(1 + e^(-1.931463718067))
f(s) = 0.873411342830056
E = 1/2 (0.03 - 0.873411342830056)^2
E = 0.35567134660719907
When comparing the new error (0.35567134660719907) with the old error
(0.356465271), there’s a reduction of 0.0007939243928009043. As long as
there’s a reduction, we’re moving in the right direction.
The forward and backward passes should be repeated until the error reaches 0 or
for a fixed number of epochs.
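Putting the forward and backward passes together into a loop, with the same starting values as above, shows the error shrinking epoch by epoch (here the bias is updated with its own chain-rule gradient, using ∂s/∂b = 1):

```python
import math

X = [0.1, 0.3]
W = [0.5, 0.2]
b = 1.83
desired = 0.03
eta = 0.01

errors = []
for epoch in range(100):
    # forward pass
    s = sum(x * w for x, w in zip(X, W)) + b
    predicted = 1.0 / (1.0 + math.exp(-s))
    errors.append(0.5 * (desired - predicted) ** 2)

    # backward pass: chain-rule gradient, then gradient-descent update
    grad_s = (predicted - desired) * predicted * (1.0 - predicted)
    W = [w - eta * grad_s * x for x, w in zip(X, W)]  # ds/dW = X
    b = b - eta * grad_s                              # ds/db = 1

print(errors[0], errors[-1])  # the error keeps decreasing
```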
Advanced Issues:
1.3) Overfitting problem and how to solve it:
- Overfitting occurs when a statistical model fits too closely to a limited set of
its training data.
- A model trained to scan for very specific patterns may do well on its training
data, but when the same process is applied to a new set of data, the results can
be incorrect.
- This makes the model miss the general pattern in its data set and instead
capture some degree of error or random noise. Attempting to make the model
conform too closely to slightly inaccurate data can infect the model with
substantial errors and reduce its predictive power. Overfitting also leaves the
model with redundant or needless features, which makes the results more
complicated and less effective.
=> In conclusion, overfitting leads to high variance and low bias.
1. Hold-out (data)
We can simply split our dataset into two sets: training and testing. A common
split ratio is 80% for training and 20% for testing. We train our model until it
performs well not only on the training set but also on the testing set.
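A hold-out split can be done in a few lines; the toy dataset below is just 100 dummy samples:

```python
import random

data = list(range(100))       # a toy dataset of 100 samples
random.seed(42)
random.shuffle(data)          # shuffle before splitting

split = int(0.8 * len(data))  # 80/20 hold-out split
train_set, test_set = data[:split], data[split:]

print(len(train_set), len(test_set))  # 80 20
```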
2. Cross-validation (data)
Cross-validation allows all data to be eventually used for training but is also more
computationally expensive than hold-out.
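A minimal k-fold split sketch makes the "all data eventually used" property concrete: each sample lands in exactly one validation fold, and the model would be trained k times:

```python
def k_fold_splits(data, k):
    # each of the k folds is used once as the validation set,
    # while the remaining folds form the training set
    fold_size = len(data) // k
    for i in range(k):
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, val

data = list(range(10))
for train, val in k_fold_splits(data, 5):
    print(val)  # every sample appears in exactly one validation fold
```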
7. Dropout (model)
- Dropout reduces interdependent learning among units, which may have led to
overfitting. However, with dropout, we need more epochs for our model to
converge.
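A sketch of "inverted" dropout, the variant commonly used in practice: during training, each unit is zeroed with probability `rate`, and survivors are scaled up so the expected activation stays the same at test time:

```python
import random

def dropout(activations, rate, training=True):
    # during training, each unit is dropped with probability `rate`;
    # survivors are scaled by 1/(1 - rate) so the expected value of
    # each activation is unchanged; at test time nothing is dropped
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
out = dropout([0.5, 0.8, 0.1, 0.9], rate=0.5)
print(out)  # some units zeroed, the others scaled up
```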
1. Gradient Descent (GD):
- It is an algorithm to find the minimum of a function.
- Start with a random point on the function and move in the negative direction of
the gradient to reach the local/global minimum.
Formula: x_new = x_old - learning_rate * gradient(x)
For example:
We have: y = (x + 5)^2
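For y = (x + 5)^2, the gradient is 2*(x + 5), so gradient descent can be run directly; the starting point and learning rate below are arbitrary choices:

```python
def gradient(x):
    # derivative of y = (x + 5)^2 is 2*(x + 5)
    return 2.0 * (x + 5.0)

x = 3.0              # random starting point
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * gradient(x)  # x_new = x_old - lr * gradient(x)

print(x)  # converges toward the minimum at x = -5
```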
3. Momentum:
- Momentum was created to overcome a limitation of GD: it helps reach the global
minimum, while plain GD may only approach a local minimum.
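A common form of the momentum update keeps a velocity term that accumulates past gradients; shown here on the same toy function (the β value of 0.9 is a typical choice):

```python
def gradient(x):
    # same toy function as before: y = (x + 5)^2
    return 2.0 * (x + 5.0)

x, v = 3.0, 0.0
learning_rate, beta = 0.1, 0.9   # beta controls how much past velocity is kept
for _ in range(200):
    v = beta * v - learning_rate * gradient(x)  # accumulate velocity
    x = x + v                                   # move by the velocity

print(x)  # converges to -5, carried along by the accumulated velocity
```

The accumulated velocity is what lets momentum roll through small bumps (local minima) that would stop plain GD.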
4. Adagrad:
- Different from the algorithms above, where the learning_rate stays almost the
same throughout the training process (learning_rate is a constant), Adagrad
treats the learning_rate as an adaptive quantity: the learning rate changes as
training moves from one epoch to the next.
η: a constant (the base learning rate).
g_t: the gradient at step t.
ϵ: an error-avoidance (smoothing) factor.
G: a diagonal matrix where each element on the diagonal (i, i) is the sum of
the squares of the derivatives of parameter i up to epoch t.
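For a single parameter, the diagonal of G reduces to one running sum of squared gradients, so Adagrad can be sketched on the same toy function y = (x + 5)^2:

```python
import math

def gradient(x):
    return 2.0 * (x + 5.0)  # derivative of y = (x + 5)^2

x = 3.0
eta = 1.0    # the base learning rate η
eps = 1e-8   # the error-avoidance factor ϵ
G = 0.0      # accumulated sum of squared gradients (diagonal of G)

for _ in range(500):
    g = gradient(x)
    G += g * g                         # the accumulator grows every step
    x -= eta / math.sqrt(G + eps) * g  # effective rate η/√G shrinks over time

print(x)  # approaches the minimum at -5
```

Because G only grows, the effective learning rate η/√(G + ϵ) keeps shrinking, which is exactly the "learning rate changes each epoch" behaviour described above.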
6. Adam:
- The Adam optimizer algorithm is a combination of Momentum and RMSProp.
- Adam is a replacement optimization algorithm for stochastic gradient descent
for training deep learning models.
- Adam combines the best properties of the AdaGrad and RMSProp algorithms
to provide an optimization algorithm that can handle sparse gradients on noisy
problems.
- The Adam algorithm uses internal states, the momentum (m) and the squared
momentum (v) of the gradient, for each parameter.
Updating m and v:
m_t = β1*m_(t-1) + (1 - β1)*g_t
v_t = β2*v_(t-1) + (1 - β2)*g_t^2
Bias-corrected estimates:
m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t)
Updating equation:
θ_t = θ_(t-1) - η*m̂_t/(√v̂_t + ϵ)
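The standard Adam update (with its usual default hyperparameters β1 = 0.9, β2 = 0.999, ϵ = 1e-8) can be sketched on the same toy function y = (x + 5)^2:

```python
import math

def gradient(x):
    return 2.0 * (x + 5.0)  # derivative of y = (x + 5)^2

x = 3.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0  # momentum (m) and squared momentum (v) internal states

for t in range(1, 1001):
    g = gradient(x)
    m = beta1 * m + (1 - beta1) * g              # update m
    v = beta2 * v + (1 - beta2) * g * g          # update v
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    x -= eta * m_hat / (math.sqrt(v_hat) + eps)  # updating equation

print(x)  # settles near the minimum at -5
```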
Part 2
Study the Keras library
1. Introduction:
Keras is one of the leading high-level neural network APIs. It is written in
Python and supports multiple back-end neural network computation engines. It
was developed with a focus on enabling fast experimentation: being able to go
from idea to result as fast as possible is key to doing good research.
Why Keras:
- Keras makes it easy to turn models into products.
- Keras has strong multi-GPU & distributed training support.
- Keras has broad adoption in the industry and the research community.
Setting up:
Setting up the environment:
Using this command to install Keras:
- The weight-creation function creates the weights depending on the shape of
the inputs.
- Input(shape=( )): passing an Input object to the model, so that it knows its
input shape from the start.
3. Dense Layer.
Introduction:
- A Dense layer is a fully connected neural network layer: each neuron in the
Dense layer receives input from every neuron of the previous layer.
- The Dense layer implements a matrix-vector multiplication. The values in the
matrix are parameters that can be trained and updated with the help of
backpropagation.
- The output of a Dense layer is a vector with m dimensions, so a Dense layer
basically changes the size of the vector.
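The matrix-vector multiplication a Dense layer performs can be sketched in plain Python; the kernel and bias values below are arbitrary illustration numbers, standing in for the trainable parameters:

```python
def dense_forward(x, kernel, bias):
    # Dense layer: output = x @ kernel + bias
    # kernel has shape (input_dim, units); the output has `units`
    # values, so the layer turns an n-dim vector into an m-dim one
    units = len(bias)
    return [
        sum(xi * kernel[i][j] for i, xi in enumerate(x)) + bias[j]
        for j in range(units)
    ]

x = [1.0, 2.0, 3.0]                    # input vector, n = 3
kernel = [[0.1, 0.4],                  # 3x2 trainable weight matrix
          [0.2, 0.5],
          [0.3, 0.6]]
bias = [0.0, 1.0]                      # one trainable bias per unit
print(dense_forward(x, kernel, bias))  # m = 2 outputs
```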
Setting up:
Using the command below to install: