
Neural Network
(Slides adapted from Andrew Ng's machine learning lectures)
Logistic Regression (Recap)
• In linear regression: h_θ(x) = θ^T x
• In logistic regression: h_θ(x) = g(θ^T x) (activation function)
• Where g(z) = 1 / (1 + e^(-z)) (sigmoid function or the logistic function)
• Hence the name logistic regression
• But it is a classifier that is extended from linear regression
• Finally: h_θ(x) = 1 / (1 + e^(-θ^T x))
Logistic Regression (Recap)
• The use of the activation function is to
• introduce non-linearity
• transform the linear combination of inputs and weights into a form that can be interpreted as a
probability in the range [0, 1]
• The activation function makes logistic regression suitable for classification.

• The task is to select the parameters θ to fit the data.

[Figure: the sigmoid curve g(z), rising from 0 through 0.5 (at z = 0) up to 1.]
Logistic Regression (Recap)
• Process:
• Scale the data
• Initialize the parameters θ
• Compute the cost function
• Update parameters to reduce cost (gradient descent)
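
As a concrete illustration of these four steps, a minimal logistic regression training loop in Python/NumPy; the standardization, learning rate, and iteration count are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # 1. Scale the data (simple standardization).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # add bias column x0 = 1
    # 2. Initialize the parameters.
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        # 3. Compute the cost function (cross-entropy).
        cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
        # 4. Update the parameters to reduce the cost (gradient descent).
        theta -= lr * (X.T @ (h - y)) / m
    return theta, cost
```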
Logistic Regression (Recap)

[Figure: a single logistic unit, with inputs x, parameters (weights) θ, a sigmoid (logistic) activation function, and output h_θ(x); the unit ("neuron", say) is analogous to a neuron's cell body.]
Neural Network
• Why do we need another learning algorithm or hypothesis?
• To learn complex non-linear hypotheses.
• As the number of features in a problem increases, the complexity increases.
Example
Example (Cont.)
Neural Network
• The training of an NN has two parts:

• A feedforward neural network, aka multi-layer perceptron (MLP), is a series of logistic regression models stacked on top of each other, with the final layer being either another logistic regression or a linear regression model, depending on whether we are solving a classification or regression problem.

• A backpropagation pass, which is used to compute the gradient vectors.

• A perceptron is the basic block on which an NN is built.

• Perceptron: a model that assigns weights to the inputs, combines them in a linear fashion, and applies an activation function to produce the output (a minimal sketch follows).
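
A rough sketch of such a unit in Python/NumPy; the function names and the choice of a sigmoid activation are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_unit(x, w, b, activation=sigmoid):
    """Weight the inputs, combine them linearly, then apply the activation."""
    return activation(np.dot(w, x) + b)

# Example: one unit with two inputs.
print(perceptron_unit(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))
```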
NN and the Brain
Neuron model: Logistic unit

[Figure: the neuron modeled as a logistic unit, with inputs x₁, x₂, x₃, parameters (weights) θ, and a sigmoid (logistic) activation function; the unit is analogous to a neuron's cell body.]
Neural Network

[Figure: a neural network with three layers: Layer 1 (input layer), Layer 2 (hidden layer), Layer 3 (output layer).]
Neural Network
• a_i^(j) = "activation" of unit i in layer j
• Θ^(j) = matrix of weights controlling the function mapping from layer j to layer j + 1

• If the network has s_j units in layer j and s_(j+1) units in layer j + 1, then Θ^(j) will be of dimension s_(j+1) × (s_j + 1).
• E.g., if layer j has 2 units and layer j + 1 has 4 units, Θ^(j) is 4 × 3 (the extra column corresponds to the bias unit).
Forward propagation: Vectorized implementation

x = a^(1),  z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2))
Add a_0^(2) = 1.
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))
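
A minimal sketch of this vectorized forward pass for a three-layer network; the layer sizes, random weights, and variable names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))             # add bias unit a0^(1) = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # add bias unit a0^(2) = 1
    z3 = Theta2 @ a2
    return sigmoid(z3)                          # h_Theta(x) = a^(3)

# Example: 3 inputs, 3 hidden units, 1 output; dimensions follow s_(j+1) x (s_j + 1).
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
print(forward_propagate(np.array([1.0, 2.0, 3.0]), Theta1, Theta2))
```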
Neural Network learning its own features

[Figure: the same three-layer network (Layer 1, Layer 2, Layer 3); the hidden layer learns its own features, which are fed to the output layer.]
Other network architectures

[Figure: a deeper architecture with four layers: Layer 1 (input layer), Layers 2 and 3 (hidden layers), Layer 4 (output layer).]
Non-linear classification example: XOR/XNOR
• Features x₁, x₂ are binary (0 or 1).
• Goal: y = x₁ XOR x₂ (or x₁ XNOR x₂), which is not linearly separable.

[Figure: scatter plots of the two classes in the (x₁, x₂) plane, showing that no single straight line separates them.]
Simple example: AND
• x₁, x₂ ∈ {0, 1}, y = x₁ AND x₂
• h_Θ(x) = g(-30 + 20x₁ + 20x₂)

[Figure: the sigmoid g(z), rising from about 0.01 near z = -4.0 through 0.5 at z = 0 to about 0.99 near z = +4.0.]

x₁  x₂  h_Θ(x)
0   0   g(-30) ≈ 0
0   1   g(-10) ≈ 0
1   0   g(-10) ≈ 0
1   1   g(10) ≈ 1

Hence h_Θ(x) ≈ x₁ AND x₂.
Example: OR function
• h_Θ(x) = g(-10 + 20x₁ + 20x₂)

x₁  x₂  h_Θ(x)
0   0   g(-10) ≈ 0
0   1   g(10) ≈ 1
1   0   g(10) ≈ 1
1   1   g(30) ≈ 1

Hence h_Θ(x) ≈ x₁ OR x₂.
Negation (NOT x₁):
• h_Θ(x) = g(10 - 20x₁)

x₁  h_Θ(x)
0   g(10) ≈ 1
1   g(-10) ≈ 0

Hence h_Θ(x) ≈ NOT x₁. Similarly, a unit with weights (10, -20, -20) computes (NOT x₁) AND (NOT x₂).
Putting it together: x₁ XNOR x₂

Hidden layer:  a₁^(2) = g(-30 + 20x₁ + 20x₂)          (x₁ AND x₂; weights -30, 20, 20)
               a₂^(2) = g(10 - 20x₁ - 20x₂)            ((NOT x₁) AND (NOT x₂); weights 10, -20, -20)
Output layer:  h_Θ(x) = g(-10 + 20a₁^(2) + 20a₂^(2))   (a₁ OR a₂; weights -10, 20, 20)

x₁  x₂  a₁^(2)  a₂^(2)  h_Θ(x)
0   0   0       1       1
0   1   0       0       0
1   0   0       0       0
1   1   1       0       1

Hence h_Θ(x) ≈ x₁ XNOR x₂ (verified in the sketch below).
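
A small sketch verifying the table above by composing the three logistic units (AND, (NOT x₁) AND (NOT x₂), OR) into a two-layer network; only the weights shown on the slides are used, and the helper names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(weights, inputs):
    """One logistic unit: weights = [bias, w1, w2]."""
    return sigmoid(weights[0] + np.dot(weights[1:], inputs))

AND_W      = np.array([-30.0,  20.0,  20.0])
NOT_BOTH_W = np.array([ 10.0, -20.0, -20.0])   # (NOT x1) AND (NOT x2)
OR_W       = np.array([-10.0,  20.0,  20.0])

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = unit(AND_W,      [x1, x2])
        a2 = unit(NOT_BOTH_W, [x1, x2])
        h  = unit(OR_W,       [a1, a2])
        print(x1, x2, round(h))   # prints 1 when x1 == x2 (XNOR), else 0
```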
Neural Network intuition

[Figure: a four-layer network (Layer 1 through Layer 4), illustrating how each successive hidden layer builds more complex features from the previous one.]
Multiple output units: One-vs-all

• Task: classify an image as one of four classes (Pedestrian, Car, Motorcycle, Truck) using a network with four output units.
• Want h_Θ(x) ≈ [1, 0, 0, 0]^T when pedestrian, h_Θ(x) ≈ [0, 1, 0, 0]^T when car, h_Θ(x) ≈ [0, 0, 1, 0]^T when motorcycle, etc.
• Training set: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)), where each y^(i) is one of [1, 0, 0, 0]^T, [0, 1, 0, 0]^T, [0, 0, 1, 0]^T, [0, 0, 0, 1]^T (pedestrian, car, motorcycle, truck).
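
For completeness, a tiny sketch of how such one-of-K target vectors can be built; the class ordering is an assumption:

```python
import numpy as np

classes = ["pedestrian", "car", "motorcycle", "truck"]

def one_hot(label):
    y = np.zeros(len(classes))
    y[classes.index(label)] = 1.0
    return y

print(one_hot("car"))  # [0. 1. 0. 0.]
```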
Neural Networks: Cost Function
Neural Network (Classification)
L = total no. of layers in the network
s_l = no. of units (not counting the bias unit) in layer l

[Figure: a four-layer network (Layer 1 through Layer 4).]

Binary classification: y ∈ {0, 1}; 1 output unit.
Multi-class classification (K classes): y ∈ R^K, e.g. [1,0,0,0]^T (pedestrian), [0,1,0,0]^T (car), [0,0,1,0]^T (motorcycle), [0,0,0,1]^T (truck); K output units.
Cost function
Logistic regression:
J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} θ_j^2

Neural network: h_Θ(x) ∈ R^K, with (h_Θ(x))_k denoting the k-th output.
J(Θ) = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) log (h_Θ(x^(i)))_k + (1 - y_k^(i)) log(1 - (h_Θ(x^(i)))_k) ] + (λ/2m) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_{ji}^(l))^2
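
A sketch of this neural-network cost, assuming forward propagation has already produced the outputs; H (the m x K matrix of predictions), Y (one-hot labels), and the variable names are assumptions:

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized cross-entropy cost over m examples and K classes.

    H: (m, K) matrix of predictions h_Theta(x^(i)).
    Y: (m, K) one-hot label matrix.
    Thetas: list of weight matrices; column 0 holds the bias weights.
    lam: regularization parameter lambda.
    """
    m = Y.shape[0]
    cross_entropy = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg = (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return cross_entropy + reg
```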
Neural Networks: Backpropagation Algorithm
Gradient computation

Need to compute:
- J(Θ)
- the partial derivatives ∂J(Θ)/∂Θ_ij^(l)
Gradient computation
Given one training example (x, y):
Forward propagation:
a^(1) = x
z^(2) = Θ^(1) a^(1),  a^(2) = g(z^(2))   (add a_0^(2))
z^(3) = Θ^(2) a^(2),  a^(3) = g(z^(3))   (add a_0^(3))
z^(4) = Θ^(3) a^(3),  a^(4) = h_Θ(x) = g(z^(4))

[Figure: the four-layer network (Layer 1 through Layer 4) on which these quantities are computed.]
Gradient computation: Backpropagation algorithm
Intuition: δ_j^(l) = "error" of node j in layer l.

For each output unit (layer L = 4): δ_j^(4) = a_j^(4) - y_j
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
(there is no δ^(1), since the input layer has no error; .* denotes element-wise multiplication, and g'(z^(l)) = a^(l) .* (1 - a^(l)))

[Figure: the four-layer network (Layer 1 through Layer 4) with the δ terms flowing backwards.]
Backpropagation Algorithm
Training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}
Set Δ_ij^(l) = 0 (for all l, i, j).
For i = 1 to m:
  Set a^(1) = x^(i)
  Perform forward propagation to compute a^(l) for l = 2, 3, ..., L
  Using y^(i), compute δ^(L) = a^(L) - y^(i)
  Compute δ^(L-1), δ^(L-2), ..., δ^(2)
  Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)

D_ij^(l) := (1/m) Δ_ij^(l) + λ Θ_ij^(l)   if j ≠ 0
D_ij^(l) := (1/m) Δ_ij^(l)                 if j = 0

Derivative: ∂J(Θ)/∂Θ_ij^(l) = D_ij^(l)
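
A compact sketch of these steps for a network with one hidden layer and sigmoid activations; the variable names and shapes are assumptions, and the regularization term here uses λ/m so that the gradient matches the regularized cost J(Θ) defined earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Theta1, Theta2, lam):
    """Gradients D1, D2 for a 3-layer network (input, one hidden layer, output)."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward propagation for example i.
        a1 = np.concatenate(([1.0], X[i]))            # add bias unit
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))     # add bias unit
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)                              # h_Theta(x^(i))
        # Backpropagate the "errors".
        d3 = a3 - Y[i]
        d2 = (Theta2[:, 1:].T @ d3) * sigmoid(z2) * (1.0 - sigmoid(z2))
        # Accumulate Delta_ij^(l) += a_j^(l) * delta_i^(l+1).
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]            # do not regularize bias column
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```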
Backpropagation Intuition
Forward Propagation

[Figure: forward propagation for example (x^(i), y^(i)); the inputs x_1^(i), x_2^(i) feed the hidden units, and, for instance,
z_1^(3) = Θ_10^(2) · 1 + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2).]
Backpropagation (Intuition)

[Figure: propagating δ back and computing the new δ at layer L - 1, e.g. δ_1^(3) and δ_2^(3) from δ^(4).]

δ_j^(l) = "error" of cost for a_j^(l) (unit j in layer l).

Formally, δ_j^(l) = ∂ cost(i) / ∂ z_j^(l) (for j ≥ 0), where
cost(i) = y^(i) log h_Θ(x^(i)) + (1 - y^(i)) log(1 - h_Θ(x^(i))).
Training a neural network
Pick a network architecture (connectivity pattern between neurons):

• No. of input units: dimension of the features x^(i)
• No. of output units: number of classes
• Reasonable default: 1 hidden layer; or, if >1 hidden layer, use the same no. of hidden units in every layer (usually, the more the better)
Activation Function
Why do NNs Need an Activation Function?
• To add non-linearity to the neural network.

• Consider a neural network without activation functions.

• In that case, every neuron would only perform a linear transformation on the inputs using the weights and biases.
• It doesn't matter how many hidden layers we attach to the network; all layers would behave in the same way, because the composition of two linear functions is itself a linear function (see the sketch below).
• Although the neural network becomes simpler, learning any complex task becomes impossible, and our model would be just a linear regression model.
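
A quick numerical sketch of this point: two stacked linear layers with no activation in between collapse into a single linear layer (the sizes and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two linear layers without an activation in between...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...collapse into one equivalent linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```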
Types of Activation Functions
Sigmoid / Logistic Function
• It maps any real-valued number to a value between 0 and 1.
• For large positive input values the output approaches 1, and for large negative values the output approaches 0.
• It is commonly used in models that must output a probability. Since probabilities lie only in the range [0, 1], the sigmoid is a natural choice because of its range.
• The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values. This is reflected in the S-shape of the sigmoid activation function.
• The limitations of the sigmoid function are discussed below.
Sigmoid / Logistic Function
• Limitations
• Derivative: f'(x) = sigmoid(x) * (1 - sigmoid(x)).
• The gradient values are only significant roughly in the range -3 to 3; for values beyond ±3, the function has very small gradients (see the sketch below).
• As the gradient value approaches zero, the network stops learning and suffers from the vanishing gradient problem.
• The output of the logistic function is not symmetric around zero, so the outputs of all the neurons have the same sign. This makes the training of the neural network more difficult and unstable.
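
A short sketch of that derivative, showing how quickly the gradient shrinks beyond roughly ±3:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

for x in (0.0, 3.0, 6.0, 10.0):
    print(x, round(sigmoid_grad(x), 5))
# 0.0 -> 0.25 (maximum), 3.0 -> ~0.045, 6.0 -> ~0.0025, 10.0 -> ~0.00005
```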
Tanh Function
• The tanh function is very similar to the sigmoid function, with the difference that its output range is -1 to 1.
• Its output is zero-centered; hence we can easily interpret the output values as strongly negative, neutral, or strongly positive.
• It is usually used in the hidden layers of a neural network: because its values lie between -1 and 1, the mean of the hidden-layer activations comes out to be 0 or very close to it. This helps center the data and makes learning for the next layer much easier.
Tanh Function
• Limitation
• It also faces the problem of vanishing gradients, similar to the sigmoid activation function.

• Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered and its gradients are not restricted to move in a single direction. Therefore, in practice, the tanh nonlinearity is generally preferred to the sigmoid nonlinearity (a comparison is sketched below).
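
A small sketch contrasting the two activations on the same inputs (the input values are arbitrary): tanh outputs are centered around zero, while sigmoid outputs are all positive:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

print(np.round(np.tanh(z), 2))                 # [-0.96 -0.46  0.    0.46  0.96], centered on 0
print(np.round(1.0 / (1.0 + np.exp(-z)), 2))   # [0.12 0.38 0.5  0.62 0.88], all positive
```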
Rectified Linear Unit (ReLU)
• Although it gives the impression of a linear function, ReLU has a derivative and allows for backpropagation, while simultaneously being computationally efficient.

• The main catch is that the ReLU function does not activate all the neurons at the same time.

• A neuron is only activated if the output of the linear transformation is greater than 0, since ReLU(z) = max(0, z).
ReLU
• More computationally efficient than sigmoid and tanh, since only a subset of the neurons are activated.
• ReLU accelerates the convergence of gradient descent on the loss function due to its linear, non-saturating property.
• The limitations faced by ReLU are:
• The dying ReLU problem (see the sketch below):
• The negative side of the graph makes the gradient value zero.
• For this reason, during the backpropagation process, the weights and biases of some neurons are not updated.
• This can create dead neurons which never get activated.
• All negative input values become zero immediately, which decreases the model's ability to fit or train on the data properly.
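
A sketch of ReLU and its gradient, showing the zero gradient on the negative side that underlies the dying ReLU problem:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z))       # 0, 0, 0.5, 3
print(relu_grad(z))  # 0, 0, 1, 1 -- no gradient flows for negative inputs
```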
Softmax Function
• The softmax function can be described as a combination of multiple sigmoids.
• It calculates the relative probabilities of the classes.
• Similar to the sigmoid/logistic activation function, the softmax function returns a probability for each class.
• It is most commonly used as the activation function of the last layer of a neural network in the case of multi-class classification.

Softmax Function
• The output of the sigmoid function lies in the range 0 to 1, but the outputs do not sum to 1.
• Suppose the outputs of the last-layer neurons are [1.8, 0.9, 0.6].
• Applying sigmoid element-wise gives roughly [0.86, 0.71, 0.65], which cannot be read as probabilities over mutually exclusive classes.
• Applying the softmax function over the same values gives roughly [0.59, 0.24, 0.18], which sums to 1.
• The predicted class is the one with the largest probability, here index 0.
• So the output is the class corresponding to the 1st neuron (index 0) out of the three.
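
A quick sketch verifying these numbers with NumPy:

```python
import numpy as np

z = np.array([1.8, 0.9, 0.6])

sigmoid = 1.0 / (1.0 + np.exp(-z))
softmax = np.exp(z) / np.sum(np.exp(z))

print(np.round(sigmoid, 2))     # [0.86 0.71 0.65] -- does not sum to 1
print(np.round(softmax, 2))     # [0.59 0.24 0.18] -- sums to 1
print(int(np.argmax(softmax)))  # 0 -> predicted class is the first neuron
```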
How to choose the right Activation Function?
• The ReLU activation function should only be used in the hidden layers.
• The sigmoid/logistic and tanh functions should not be used in hidden layers, as they make the model more susceptible to problems during training (due to vanishing gradients).
