Neurons
Machine Learning
Deep Learning
Traditional ML Vs DL
Artificial neuron vs Biological neuron
• The most fundamental unit of a deep neural network is called an artificial neuron
• Why is it called a neuron? Where does the inspiration come from?
• The inspiration comes from biology (more specifically, from the brain)
• biological neurons = neural cells = neural processing units
• We will first see what a biological neuron looks like
[Figure: Artificial Neuron with inputs x1, x2, x3, weights w1, w2, w3, activation σ, and output y]
• dendrite: receives signals from
other neurons
• synapse: point of connection to
other neurons
• soma: processes the information
• axon: transmits the output of this neuron
[Figure: Biological Neurons∗]
∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
• Of course, in reality, it is not just a single neuron which does all this
• There is a massively parallel interconnected network of neurons
• The sense organs relay information to the lowest layer of neurons
• Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
• These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)
• This massively parallel network also ensures that there is division of work
• Each neuron may perform a certain role or respond to a certain stimulus
[Figure: A simplified illustration]
• The neurons in the brain are
arranged in a hierarchy
• We illustrate this with the help of
visual cortex (part of the brain)
which deals with processing visual
information
• Starting from the retina, the information is relayed to several layers (follow the arrows)
• We observe that the layers V1, V2 to AIT form a hierarchy (from identifying simple visual forms to high level objects)
[Figure: Sample illustration of hierarchical processing∗]
Neurons
• ANNs are built upon simple signal processing elements (neurons) that are connected together into a large mesh.
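[Figure: M-P neurons. Top row, inputs x1, x2, x3: a generic unit with threshold θ, and units with thresholds 3 and 1. Bottom row: x1 AND !x2∗ (threshold 1), NOR (threshold 0), NOT (threshold 0)]
∗ A circle at the end of an input indicates an inhibitory input: if any inhibitory input is 1, the output will be 0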
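The threshold units above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the helper name `mp_neuron` is hypothetical, and pairing the thresholds 3 and 1 with a three-input AND and OR is an assumption based on how such units are usually drawn.

```python
def mp_neuron(excitatory, inhibitory, theta):
    """M-P unit: binary inputs, fixed hand-coded threshold theta."""
    # Any active inhibitory input forces the output to 0.
    if any(inhibitory):
        return 0
    # Otherwise the unit fires when enough excitatory inputs are on.
    return 1 if sum(excitatory) >= theta else 0


# Assumed pairing of thresholds and functions (hypothetical helper names).
def and3(x1, x2, x3):
    return mp_neuron([x1, x2, x3], [], theta=3)

def or3(x1, x2, x3):
    return mp_neuron([x1, x2, x3], [], theta=1)

def x1_and_not_x2(x1, x2):
    return mp_neuron([x1], [x2], theta=1)   # x2 is inhibitory

def nor(x1, x2):
    return mp_neuron([], [x1, x2], theta=0)  # both inputs inhibitory

def not_(x1):
    return mp_neuron([], [x1], theta=0)
```

Note that every threshold is chosen by hand and every input counts equally, which is exactly what the limitations listed next are about.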
Limitations Of M-P Neuron
• What about non-boolean (say, real) inputs?
• Do we always need to hand-code the threshold?
• Are all inputs equal? What if we want to assign more importance to some inputs?
• What about functions which are not linearly separable, such as the XOR function?
The Perceptron
• A perceptron is a very simple learning machine.
• It takes a few inputs, each of which has a weight to signify how important it is, and generates an output decision of “0” or “1”.
• When combined with many other perceptrons, it forms an artificial neural network.
The perceptron
• The most basic form of an activation function is a simple binary function that has only
two possible results.
• This function returns 1 if the input is positive or zero, and 0 for any negative input. A
neuron whose activation function is a function like this is called a perceptron.
Threshold Logic Unit (TLU)
• In a Threshold Logic Unit (TLU) the output of the unit y in response to
a particular input pattern is calculated in two stages.
• First the activation is calculated.
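• The activation a is the weighted sum of the inputs:
$$a = \sum_{i=1}^{n} w_i x_i$$
• Then the output y is obtained by comparing the activation with the threshold θ:
$$y = \begin{cases} 1 & \text{if } a \ge \theta \\ 0 & \text{if } a < \theta \end{cases}$$
[Figure: TLU with inputs x1 ... xn, weights w1 ... wn, a summation unit Σ producing the activation a, a threshold θ, and the output y]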
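The two stages of the TLU can be written out directly. The sketch below is only an illustration; the function name `tlu` and the example weights and threshold are assumptions chosen so that the unit behaves like a two-input AND.

```python
def tlu(x, w, theta):
    """Threshold Logic Unit: returns (activation a, output y)."""
    # Stage 1: the activation is the weighted sum of the inputs.
    a = sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Stage 2: the output is 1 when the activation reaches the threshold.
    y = 1 if a >= theta else 0
    return a, y


# Hypothetical example: weights (1, 1) with threshold 1.5 act as AND.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, tlu(x, w=(1, 1), theta=1.5))
```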
Linear Unit
• The perceptron is a machine learning algorithm for supervised learning of binary classification tasks.
• A perceptron can also be understood as an artificial neuron, or neural network unit, that performs computations on its input data to detect features (for example, in business intelligence applications).
• The perceptron model is one of the simplest types of artificial neural network. It is a supervised learning algorithm for binary classifiers.
• Hence, we can consider it a single-layer neural network with four main components: input values, weights and bias, net sum, and an activation function.
[Figure: linear unit with inputs x1 ... xn, weights w1 ... wn, a summation unit Σ producing the activation a, and the output y]
$$a = \sum_{i=1}^{n} w_i x_i, \qquad y = a = \sum_{i=1}^{n} w_i x_i$$
Training ANNs
• Training set S of examples (x, t)
• x is an input vector and
• t the desired target vector
• Example: logical AND
S = { ((0,0), 0), ((0,1), 0), ((1,0), 0), ((1,1), 1) }
• Iterative process
• Present a training example x , compute network output y , compare
output y with target t, adjust weights and thresholds
• Learning rule
• Specifies how to change the weights w and thresholds θ of the
network as a function of the inputs x, output y and target t.
Perceptron Learning Rule
• $w' = w + \alpha (t - y)\, x$
Or in components
• $w'_i = w_i + \Delta w_i = w_i + \alpha (t - y)\, x_i$ for $i = 1, \dots, n+1$
With $w_{n+1} = \theta$ and $x_{n+1} = -1$
• The parameter α is called the learning rate. It determines the magnitude of the weight updates $\Delta w_i$.
• If the output is correct (t = y) the weights are not changed ($\Delta w_i = 0$).
• If the output is incorrect (t ≠ y) the weights $w_i$ are changed so that the output of the TLU for the new weights $w'_i$ moves closer to the target t: for t = 1, y = 0 the weight vector is moved toward the input x, and for t = 0, y = 1 it is moved away from x.
Perceptron Training Algorithm
Repeat
    for each training vector pair (x, t)
        evaluate the output y when x is the input
        if y ≠ t then
            form a new weight vector w’ according to w’ = w + α (t - y) x
        else
            do nothing
        end if
    end for
Until y = t for all training vector pairs
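The training loop above can be tried on the logical AND example. The following is a sketch under a few assumptions that are not in the slides: weights start at zero, α = 0.5, and a `max_epochs` cap guards against data that is not linearly separable. The threshold is folded into the weight vector as the extra weight paired with a constant input of -1.

```python
def predict(w, x):
    """TLU output with the threshold stored as the last weight."""
    x_aug = list(x) + [-1]                       # constant input paired with theta
    a = sum(w_i * x_i for w_i, x_i in zip(w, x_aug))
    return 1 if a >= 0 else 0                    # same test as a >= theta


def train_perceptron(samples, alpha=0.5, max_epochs=100):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                          # assumed zero initialisation
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            y = predict(w, x)
            if y != t:
                x_aug = list(x) + [-1]
                # w' = w + alpha * (t - y) * x
                w = [w_i + alpha * (t - y) * x_i for w_i, x_i in zip(w, x_aug)]
                errors += 1
        if errors == 0:                          # y = t for all training pairs
            break
    return w


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print(w, [predict(w, x) for x, _ in AND])
```

The last entry of `w` is the learned threshold θ, so the threshold no longer has to be hand-coded.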
Perceptron Convergence Theorem
The algorithm converges to the correct classification
• if the training data is linearly separable
• and α is sufficiently small
• If two classes of vectors X1 and X2 are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector w0, such that w0 defines a TLU whose decision hyperplane separates X1 and X2 (Rosenblatt 1962).
• The solution w0 is not unique, since if w0 · x = 0 defines a hyperplane, so does w’0 = k w0 for any k > 0.
• To regularize means to make things regular or acceptable
• Regularization refers to a set of techniques that lower the complexity of a neural network model during training, and thus prevent overfitting
• Regularization penalizes the weight matrices of the nodes
What is Overfitting
• The training data contains information about the
regularities in the mapping from input to output. But it
also contains sampling error.
• There will be accidental regularities because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
• So it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well.
• This means the model will not generalize well to unseen data
Diagnosing Overfitting
Regularization Techniques
• L2 Regularization / Ridge Regularization
• L1 Regularization / Lasso Regularization
• Dropout
• Early Stopping
[Figure: plot of Salary against Experience]
L1 vs L2 Regularization Methods
• L1 regularization, also called Lasso regression, adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function.
• L2 regularization, also called Ridge regression, adds the squared magnitude of each coefficient as the penalty term to the loss function.
• The key difference between these two is the penalty term.
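To make the difference concrete, the two penalty terms can be computed side by side. This is a minimal sketch: the function names and the parameter `lam` (the regularization strength λ) are illustrative choices, not taken from any particular library.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam * sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam * sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Loss on the data plus the chosen penalty term."""
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    return data_loss + l2_penalty(weights, lam)


weights = [1.0, -2.0]
print(l1_penalty(weights, lam=0.5))  # 0.5 * (1 + 2)
print(l2_penalty(weights, lam=0.5))  # 0.5 * (1 + 4)
```

Large coefficients are punished much harder by the squared term, while the absolute-value term charges every unit of magnitude the same price.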
Ridge Regularization
[Figure: plot of Salary against Experience]
Steep slope
[Figure: plot of Salary against Experience]
• Assume lambda (λ) = 1
• Slope = 1.3
• Then cost = 0 + 1 × (1.3)² = 1.69
• Assume lambda (λ) = 1
• Slope = 1.1
• Then cost = 0 + 1 × (1.1)² = 1.21
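Written out as an equation, the arithmetic above reads as follows. This assumes that the leading 0 is the sum of squared residuals on the training points, as the slide's "0 + ..." suggests, and that the only penalized coefficient is the slope.

```latex
\text{cost} = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \cdot \text{slope}^2

\text{slope} = 1.3:\quad \text{cost} = 0 + 1 \cdot (1.3)^2 = 1.69

\text{slope} = 1.1:\quad \text{cost} = 0 + 1 \cdot (1.1)^2 = 1.21
```

With the same data term, the flatter line has the lower ridge cost, so the penalty steers the fit away from steep slopes.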
Lasso Regression
• Lasso regression also helps with feature selection, because the L1 penalty can shrink some coefficients exactly to zero
Dropout
• This is one of the most interesting types of regularization techniques.
• It also produces very good results and is consequently one of the most frequently used regularization techniques in deep learning
• To understand dropout, let’s say our neural network structure is the one shown in the figure
[Figure: neural network structure]
• At every iteration, dropout randomly selects some nodes and removes them along with all of their incoming and outgoing connections
[Figure: the same network with randomly selected nodes and their connections removed]
• So each iteration has a different set of nodes and this results in a different set of outputs. Dropout can therefore be thought of as an ensemble technique in machine learning.
• Ensemble models usually perform better than a single model because they average over many different models. Similarly, a network trained with dropout usually generalizes better than the same network trained without it
• For these reasons, dropout is usually preferred when we have a large neural network structure, in order to introduce more randomness.
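A minimal sketch of the random node removal, applied to one layer's activations, might look like this. The function name `dropout` and its parameters are illustrative. Rescaling the surviving activations by 1 / keep_prob ("inverted dropout") is an assumption that the slides do not mention; it keeps the expected activation unchanged.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (training time only).

    Requires 0 <= drop_prob < 1. Returns (new_activations, mask).
    """
    keep_prob = 1.0 - drop_prob
    # A fresh random mask is drawn on every call, i.e. at every iteration.
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Assumed inverted-dropout scaling of the nodes that are kept.
    dropped = [a * m / keep_prob for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)            # fixed seed only for reproducibility
layer_output = [0.2, 1.5, 0.7, 0.9]
print(dropout(layer_output, drop_prob=0.5, rng=rng))
print(dropout(layer_output, drop_prob=0.5, rng=rng))
```

Each call uses a different mask, which is what makes every iteration train a different thinned network.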
Early stopping
[Figure: training curves with a dotted line marking where training is stopped]
• In the above image, we stop training at the dotted line, since after that point the model starts overfitting on the training data
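One common way to locate that stopping point in code is to watch the loss on held-out validation data. The sketch below rests on assumptions the slide does not state: the function name, the use of a validation loss, and the `patience` parameter (how many non-improving epochs to tolerate) are all illustrative.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss before stopping."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            break
    return best_epoch


# Hypothetical validation losses: improving at first, then rising.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]))
```

In practice the model weights from the returned epoch would be the ones kept.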
Why Training a Neural Network Is Hard
• As the number of hidden layers increases, the amount of error information propagated back to earlier layers is dramatically reduced.
• Weights in hidden layers close to the output layer are
updated normally, whereas weights in hidden layers
close to the input layer are updated minimally or not
at all.
• Generally, this problem prevented the training of very
deep neural networks and was referred to as
the vanishing gradient problem
• Pretraining:
• Add a new hidden layer to a model.
• Allow the newly added layer to learn from the outputs of the existing hidden layer, keeping the weights for the existing hidden layers fixed.
• This gives the technique the name “layer-wise”, as the model is trained one layer at a time.
• Greedy algorithm:
• Breaks the problem into many components, then solves for the optimal version of each component in isolation
• Pretraining is based on the assumption that it is easier to train a shallow network than a deep network, and it contrives a layer-wise training process so that we are only ever fitting a shallow model
Pre-training and fine tuning
● Using dataset A, train model M
● Pre-training:
● You have a dataset B
● Before training on B, initialize some of the parameters of M with the model trained on A
● Fine-tuning:
● You train M on B
● This is one form of transfer learning
• Training a deep structure is difficult due to strong dependencies across the layers’ parameters, e.g. the relation between parts of pictures and pixels.
• To resolve this problem, two things are suggested:
• Adapt the lower layers to feed good input to the upper layers
• Adjust the upper layers to make use of that final setting of the lower layers
Greedy Algorithm
● Greedy algorithms break a problem into many
components, then solve for the optimal version of
each component in isolation
Gradient Descent Algorithm
• A gradient measures how much the output of a function changes if you change the inputs a little bit.
• In mathematical terms, the gradient is the vector of partial derivatives of a function with respect to its inputs.
• Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function.
• Gradient descent is used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
• The lowest point on the parabola occurs at x = 1.
• The objective of the gradient descent algorithm is to find the value of “x” such that “y” is minimum
• “y” here is termed the objective function that the gradient descent algorithm operates upon, to descend to the lowest point
• Find the slope of the objective function with respect to each parameter/feature. In other words, compute the gradient of the function.
• Pick a random initial value for the parameters.
• Evaluate the gradient by plugging in the current parameter values.
• Calculate the step size for each feature as: step size = gradient * learning rate.
• delta = -learning_rate * gradient
• Update the parameters with theta += delta, then repeat from the gradient evaluation until the step size becomes very small
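The steps above can be sketched for a single parameter. The objective y = (x - 1)² is an assumption: it is simply one parabola whose lowest point is at x = 1. The starting value is fixed rather than random so that the run is reproducible.

```python
def gradient(x):
    """Slope of the assumed objective y = (x - 1)**2."""
    return 2.0 * (x - 1.0)


def gradient_descent(x0, learning_rate=0.1, n_steps=100):
    x = x0
    for _ in range(n_steps):
        # Step against the gradient, scaled by the learning rate.
        delta = -learning_rate * gradient(x)
        x += delta
    return x


print(gradient_descent(x0=5.0))                      # approaches x = 1
print(gradient_descent(x0=5.0, learning_rate=1.1))   # too large: diverges
```

For this parabola each step multiplies the distance to the minimum by (1 - 2 × learning_rate), so a rate of 0.1 shrinks it steadily while a rate above 1 makes it grow.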
The learning rate should be neither too high nor too low: if it is too high the steps can overshoot the minimum, and if it is too low convergence needs many iterations.
Downsides of the gradient descent algorithm
• Suppose we have 10,000 data points and 10 features.
• Computing the derivative takes 10,000 * 10 = 100,000 computations per iteration.
• It is common to take 1,000 iterations, so in effect we need 100,000 * 1,000 = 100,000,000 computations to complete the algorithm.
• Hence gradient descent is slow on huge data
Stochastic Gradient Descent (SGD)
• SGD randomly picks one data point from the whole data set at each iteration, which reduces the computation per iteration enormously.
• Mini-batch gradient descent tries to strike a balance between the stable gradient estimates of full gradient descent and the speed of SGD
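Both variants can be sketched with one function, where `batch_size=1` gives SGD and a larger value gives mini-batch gradient descent. Everything concrete here is an assumption made for illustration: the one-parameter model y = w·x, the squared-error loss, the toy data generated from y = 2x, and the hyperparameter values.

```python
import random


def minibatch_sgd(xs, ys, batch_size=1, learning_rate=0.05, n_steps=500, seed=0):
    """Fit y = w * x by (mini-batch) stochastic gradient descent."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(n_steps):
        # Use only a random subset of the data for this update.
        batch = rng.sample(indices, batch_size)
        # Gradient of the mean squared error over the batch with respect to w.
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]                     # hypothetical noise-free data
print(minibatch_sgd(xs, ys, batch_size=1))     # SGD
print(minibatch_sgd(xs, ys, batch_size=2))     # mini-batch
```

Each update touches `batch_size` points instead of the whole data set, which is where the saving in computation comes from.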
Momentum
• Momentum does some additional processing of the gradients to make descent faster and better
• In addition to the regular gradient, it also adds on the movement from the previous step
• sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate
• delta = -learning_rate * sum_of_gradient
• theta += delta
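The three update lines above translate almost directly into Python. In this sketch the objective (theta - 1)², the starting point, and the hyperparameter values are assumptions chosen for illustration.

```python
def grad_fn(theta):
    """Gradient of the assumed objective (theta - 1)**2."""
    return 2.0 * (theta - 1.0)


def momentum_descent(grad_fn, theta0, learning_rate=0.1, decay_rate=0.9, n_steps=500):
    theta = theta0
    sum_of_gradient = 0.0
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        # Current gradient plus a decayed memory of earlier gradients.
        sum_of_gradient = gradient + sum_of_gradient * decay_rate
        delta = -learning_rate * sum_of_gradient
        theta += delta
    return theta


print(momentum_descent(grad_fn, theta0=5.0))
```

Setting `decay_rate=0.0` removes the memory term and recovers plain gradient descent.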
• Momentum simply moves faster
• Momentum has a shot at escaping local minima (because the momentum may propel it out of a local minimum)
Adaptive Gradient Descent(AdaGrad)
• One of the disadvantages of the optimizers discussed so far is that the learning rate is constant for all parameters and for each cycle.
• AdaGrad changes the learning rate ‘η’ for each parameter and at every time step ‘t’.
• Instead of keeping track of the sum of gradient, AdaGrad keeps track of the sum of gradient squared and uses that to adapt the gradient in different directions.
• sum_of_gradient_squared = previous_sum_of_gradient_squared + gradient²
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta
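A one-parameter sketch of this update follows. The objective (theta - 1)², the starting point, and the hyperparameter values are assumptions, and the small `eps` added to the denominator is a guard against division by zero rather than part of the update lines above.

```python
import math


def grad_fn(theta):
    """Gradient of the assumed objective (theta - 1)**2."""
    return 2.0 * (theta - 1.0)


def adagrad(grad_fn, theta0, learning_rate=0.5, n_steps=1000, eps=1e-8):
    theta = theta0
    sum_of_gradient_squared = 0.0
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        # The accumulator only ever grows.
        sum_of_gradient_squared += gradient ** 2
        # A large gradient history means a smaller effective step.
        delta = -learning_rate * gradient / (math.sqrt(sum_of_gradient_squared) + eps)
        theta += delta
    return theta


print(adagrad(grad_fn, theta0=5.0))
```

With several parameters, each one would keep its own accumulator, which is how the learning rate ends up differing per parameter.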
where
• θ is the parameter to be updated,
• η is the initial learning rate,
• ε is a small quantity used to avoid division by zero,
• I is the identity matrix,
• gt is the gradient estimate in time-step t that we can get with the
equation
Root Mean Square Propagation (RMSProp)
• AdaGrad can become very slow, because the sum of gradient squared only grows and never shrinks, so the effective step size keeps shrinking.
• RMSProp adds a decay factor.
• sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta
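The decayed accumulator can be sketched in the same way. As before, the objective (theta - 1)², the starting point, and the hyperparameter values are assumptions, and `eps` is an added guard against division by zero.

```python
import math


def grad_fn(theta):
    """Gradient of the assumed objective (theta - 1)**2."""
    return 2.0 * (theta - 1.0)


def rmsprop(grad_fn, theta0, learning_rate=0.01, decay_rate=0.9, n_steps=2000, eps=1e-8):
    theta = theta0
    sum_of_gradient_squared = 0.0
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        # Exponential moving average: old squared gradients fade away.
        sum_of_gradient_squared = (sum_of_gradient_squared * decay_rate
                                   + gradient ** 2 * (1 - decay_rate))
        delta = -learning_rate * gradient / (math.sqrt(sum_of_gradient_squared) + eps)
        theta += delta
    return theta


print(rmsprop(grad_fn, theta0=5.0))
```

Because the step is normalised by the recent gradient size, it stays close to `learning_rate` in magnitude, so with a constant learning rate the result hovers near the minimum rather than settling on it exactly.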
Adaptive Moment Estimation (Adam)
• RMSProp adapts the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance). Adam also makes use of the average of the first moment (the mean).
• Specifically, the algorithm calculates an exponential
moving average of the gradient and the squared
gradient, and the parameters beta1 and beta2 control
the decay rates of these moving averages.
• sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1) [Momentum]
• sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2) [RMSProp]
• delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
• theta += delta
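Combining the two moving averages gives the following one-parameter sketch. The objective (theta - 1)², the starting point, and the hyperparameter values are assumptions. Two details go beyond the update lines above and are labeled in the code: the `eps` guard, and the bias-correction terms that the published Adam algorithm uses to compensate for both averages starting at zero.

```python
import math


def grad_fn(theta):
    """Gradient of the assumed objective (theta - 1)**2."""
    return 2.0 * (theta - 1.0)


def adam(grad_fn, theta0, learning_rate=0.1, beta1=0.9, beta2=0.999,
         n_steps=1000, eps=1e-8):
    theta = theta0
    sum_of_gradient = 0.0            # first moment (momentum part)
    sum_of_gradient_squared = 0.0    # second moment (RMSProp part)
    for t in range(1, n_steps + 1):
        gradient = grad_fn(theta)
        sum_of_gradient = sum_of_gradient * beta1 + gradient * (1 - beta1)
        sum_of_gradient_squared = (sum_of_gradient_squared * beta2
                                   + gradient ** 2 * (1 - beta2))
        # Bias correction (an addition, not in the update lines above).
        m_hat = sum_of_gradient / (1 - beta1 ** t)
        v_hat = sum_of_gradient_squared / (1 - beta2 ** t)
        # eps is an added guard against division by zero.
        delta = -learning_rate * m_hat / (math.sqrt(v_hat) + eps)
        theta += delta
    return theta


print(adam(grad_fn, theta0=5.0))
```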