
Deep learning

• Course Code:
• Unit 1: Introduction to Deep learning
• Lecture 4: Activation function & Loss function
Activation function

• Decides whether a neuron's input to the network is important or not in the process of prediction, using simple mathematical operations.

Source: https://medium.com/@MrBam44/activation-functions-in-deep-
learning-models-how-to-choose-3ad007eaf998
Why Activation Function?
• The activation function determines which neurons in each layer will be triggered. Only the neurons carrying relevant information are activated in every layer.
• The activation takes place depending on some rule or threshold.
• The purpose of the activation function is to introduce non-linearity into the network, since most real-life data is non-linear. Example: separating green points from red points in the graph.
Can we do without an activation function?

• An activation function introduces an additional step at each layer during forward propagation, which increases complexity.
• Without one, however, every neuron would only perform a linear transformation on the inputs using the weights and biases, which makes the network simpler but unable to learn complex patterns from the data.
• Without an activation function, the network is just a linear regression model.
• The activation function introduces non-linearity into the network.



Importance of Activation Functions
• Separating green points from red points using a linear function results in an underfitting problem.
• No matter how deep and how large the network is, with a linear activation function it is just composing lines on top of lines to get another line.
• Using a non-linear function, on the other hand, generates non-linear decision boundaries in the network, which is extremely powerful for classification tasks.

• Linear activation functions produce linear decisions no matter the network size.
• Non-linearities allow us to approximate arbitrarily complex functions.
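To see why stacking purely linear layers cannot add expressive power, here is a minimal NumPy sketch (the weight and bias values are arbitrary, chosen only for illustration): two linear layers composed together collapse into a single equivalent linear layer.

Python Code:
import numpy as np

# two "layers" with only linear transformations (no activation)
W1, b1 = np.array([[1.0, 2.0], [0.5, -1.0]]), np.array([0.1, 0.2])
W2, b2 = np.array([[2.0, 0.0], [1.0, 3.0]]), np.array([-0.3, 0.4])

x = np.array([1.0, -2.0])

# forward pass through both linear layers
h = W1 @ x + b1
y = W2 @ h + b2

# the same result from one equivalent linear layer
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2
y_single = W_eq @ x + b_eq

print(np.allclose(y, y_single))   # True: depth adds nothing without non-linearity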
Types of Activation Function

• Binary step
• Linear
• Sigmoid
• Tanh
• ReLU
• Leaky ReLU
• Softmax



Binary Step

f(x) = 1, x >= 0
     = 0, x < 0

• Threshold-based classifier: if the input to the activation function is greater than the threshold, the neuron is activated; otherwise it is deactivated, i.e. its output is not passed to the next hidden layer.
• Useful for binary classification only, not for multi-class problems.
• The gradient of the step function is 0 (f'(x) = 0 for all x), which causes a hindrance in the backpropagation process.

Python Code:
def binary_step(x):
    if x < 0:
        return 0
    else:
        return 1

binary_step(5), binary_step(-1)
Output:
(1, 0)



Linear Function

f(x) = ax

• The activation is proportional to the input. The variable 'a' can be any constant value (the code below uses a = 4).
• Differentiating the function with respect to x gives the coefficient of x, which is a constant: f'(x) = a.
• Although the gradient here does not become zero, it is a constant.

Python Code:
def linear_function(x):
    return 4*x

linear_function(4), linear_function(-2)
Output:
(16, -8)



Linear Function (Gradient)

f'(x) = a

• This implies that during the backpropagation process the weights and biases get updated with the same updating factor.
• The neural network cannot improve on the error, as the gradient is the same for every iteration.
• Hence it is not suitable for capturing complex patterns from the data.

Python Code:
def linear_function(x):
    return 4*x

linear_function(4), linear_function(-2)
Output:
(16, -8)



Sigmoid

f(x) = 1/(1 + e^(-x))

• The sigmoid function is not symmetric around 0, so the outputs of all neurons are of the same sign.
• This can be addressed by scaling the sigmoid function, which is exactly what happens in the tanh function.
• As input values move away from 0, the output value becomes less sensitive: even a large change in input values results in little to no change in the output value.

Python Code:
import numpy as np

def sigmoid_function(x):
    z = 1/(1 + np.exp(-x))
    return z

sigmoid_function(7), sigmoid_function(-22)
Output:
(0.9990889488055994, 2.7894680920908113e-10)



Sigmoid (Gradient)

f(x) = 1/(1 + e^(-x))
f'(x) = f(x)(1 - f(x))

• Gradient values are significant in the range -3 to 3, but the graph is much flatter in other regions. This implies that values > 3 or < -3 will have very small gradients. As the gradient value approaches zero, the network is not really learning.

Python Code:
import numpy as np

def sigmoid_function(x):
    z = 1/(1 + np.exp(-x))
    return z

sigmoid_function(7), sigmoid_function(-22)
Output:
(0.9990889488055994, 2.7894680920908113e-10)
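Since the slide's code only evaluates the sigmoid itself, here is a minimal NumPy sketch of the gradient (the helper names sigmoid and sigmoid_grad are illustrative, not from the slides), showing how quickly it vanishes outside the -3 to 3 range:

Python Code:
import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1 - s)

# the gradient shrinks rapidly away from 0:
# about 0.25 at x = 0, about 0.045 at x = 3, and only about 0.0025 at x = 6
print(sigmoid_grad(0), sigmoid_grad(3), sigmoid_grad(6))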



Tanh

tanh(x) = 2/(1 + e^(-2x)) - 1

• Symmetric around the origin. The range of values in this case is from -1 to 1.
• Thus the inputs to the next layers will not always be of the same sign.

Python Code:
import numpy as np

def tanh_function(x):
    z = (2/(1 + np.exp(-2*x))) - 1
    return z

tanh_function(0.5), tanh_function(-1)
Output:
(0.4621171572600098, -0.7615941559557646)



Tanh (Gradient)

tanh(x) = 2/(1 + e^(-2x)) - 1

• The gradient is steeper as compared to the sigmoid function.
• Usually, tanh is preferred over the sigmoid function since it is zero-centered and the gradients are not restricted to move in a certain direction.

Python Code:
import numpy as np

def tanh_function(x):
    z = (2/(1 + np.exp(-2*x))) - 1
    return z

tanh_function(0.5), tanh_function(-1)
Output:
(0.4621171572600098, -0.7615941559557646)
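For reference, the derivative itself is 1 - tanh²(x); a minimal NumPy sketch (the helper name tanh_grad is illustrative) shows it peaking at 1 at x = 0, steeper than the sigmoid's maximum gradient of 0.25:

Python Code:
import numpy as np

def tanh_grad(x):
    # derivative of tanh: f'(x) = 1 - tanh(x)^2
    return 1 - np.tanh(x)**2

# about 1.0 at x = 0, about 0.42 at x = 1, about 0.01 at x = 3
print(tanh_grad(0), tanh_grad(1), tanh_grad(3))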



ReLU (Rectified Linear Unit)

f(x) = max(0, x)

• As compared to other activation functions, it does not activate all the neurons at the same time.
• A neuron is only deactivated if the output of the linear transformation is < 0.
• Since only a certain number of neurons are activated, ReLU is far more computationally efficient compared to the sigmoid and tanh functions.

Python Code:
def relu_function(x):
    if x < 0:
        return 0
    else:
        return x

relu_function(7), relu_function(-7)
Output:
(7, 0)



ReLU (Gradient)

f'(x) = 1, x >= 0
      = 0, x < 0

• On the negative side of the graph, the gradient value is 0. Thus, during the backpropagation process, the weights and biases of some neurons are not updated.
• This may create dead neurons which never get activated. This is taken care of by the 'Leaky' ReLU function.

Python Code:
def relu_function(x):
    if x < 0:
        return 0
    else:
        return x

relu_function(7), relu_function(-7)
Output:
(7, 0)



Leaky ReLU

f(x) = 0.01x, x < 0
     = x,     x >= 0

• In the ReLU function, the gradient is 0 for x < 0, which deactivates the neurons in that region.
• Leaky ReLU is defined to address this problem: instead of defining the ReLU function as 0 for negative values of x, we define it as an extremely small linear component of x.
• Hence we no longer encounter dead neurons in that region.

Python Code:
def leaky_relu_function(x):
    if x < 0:
        return 0.01*x
    else:
        return x

leaky_relu_function(7), leaky_relu_function(-7)
Output:
(7, -0.07)



Leaky ReLU (Gradient)

f'(x) = 1,    x >= 0
      = 0.01, x < 0

• With this small modification, the gradient on the left side of the graph comes out to be a non-zero value.
• Hence we no longer encounter dead neurons in that region.

Python Code:
def leaky_relu_function(x):
    if x < 0:
        return 0.01*x
    else:
        return x

leaky_relu_function(7), leaky_relu_function(-7)
Output:
(7, -0.07)
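A minimal sketch contrasting the two gradients (the helper names relu_grad and leaky_relu_grad are illustrative): ReLU's gradient is exactly 0 for negative inputs, while Leaky ReLU keeps a small non-zero slope, so those neurons can still be updated.

Python Code:
def relu_grad(x):
    # gradient of ReLU: 1 for x >= 0, 0 otherwise
    return 1 if x >= 0 else 0

def leaky_relu_grad(x):
    # gradient of Leaky ReLU: 1 for x >= 0, 0.01 otherwise
    return 1 if x >= 0 else 0.01

print(relu_grad(-7), leaky_relu_grad(-7))   # 0 vs 0.01: the "leak" keeps learning alive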



Softmax

• The softmax function is a combination of multiple sigmoids. Since sigmoid returns values between 0 and 1, these can be treated as probabilities of a data point belonging to a particular class.
• The softmax function can be used for multi-class classification problems. It returns the probability of a data point belonging to each individual class.

Python Code:
import numpy as np

def softmax_function(x):
    z = np.exp(x)
    z_ = z/z.sum()
    return z_

softmax_function([0.8, 1.2, 3.1])
Output:
array([0.08021815, 0.11967141, 0.80011044])



Softmax

• For a multi-class problem, the output layer has as many neurons as there are classes in the target. If there are three classes, there are three neurons in the output layer. Let the outputs from the neurons be [1.2, 0.9, 0.75].
• Applying the softmax function over these values gives the result [0.42, 0.31, 0.27].
• These represent the probabilities of the data point belonging to each class. Note that the sum of all the values is 1.

Python Code:
import numpy as np

def softmax_function(x):
    z = np.exp(x)
    z_ = z/z.sum()
    return z_

softmax_function([0.8, 1.2, 3.1])
Output:
array([0.08021815, 0.11967141, 0.80011044])
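To verify the numbers quoted above, a quick sketch using the same softmax_function:

Python Code:
import numpy as np

def softmax_function(x):
    z = np.exp(x)
    return z/z.sum()

print(np.round(softmax_function([1.2, 0.9, 0.75]), 2))
# [0.42 0.31 0.27]; the probabilities sum to 1, as expected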
What is a Loss Function?



Loss Function

• Compares the target and predicted output values to measure how well the neural network models the training data.
• The aim is to minimize this loss between the predicted and target outputs.
• There are majorly 2 types of loss function:
  • Regression loss: MSE (Mean Squared Error), MAE (Mean Absolute Error)
  • Classification loss: Binary cross-entropy, Categorical cross-entropy

Note (loss function vs cost function):
• Loss function: the loss for a single training example/input.
• Cost function: the average loss over the entire training dataset.
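A minimal NumPy sketch of the loss vs. cost distinction, using squared error as the example loss (the data values are illustrative):

Python Code:
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # example targets
y_pred = np.array([1.1, 1.8, 2.5])   # example predictions

# loss: error of a single training example (squared error here)
loss_first_example = (y_true[0] - y_pred[0])**2

# cost: average loss over the whole dataset (this is the MSE)
cost = np.mean((y_true - y_pred)**2)

print(loss_first_example, cost)   # about 0.01 and 0.1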



Mean Squared Error (MSE)

• MSE finds the average of the squared differences between the target and predicted outputs.
• Because the difference is squared, it does not matter whether the predicted value is above or below the target value; however, values with a large error are penalized heavily.
• MSE is also a convex function with a clearly defined global minimum.
• This makes it easier to utilize gradient descent optimization to set the weight values.
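In the same style as the tf.keras snippet on the MAE slide, a minimal sketch using tf.keras.losses.MeanSquaredError (the y_true and y_pred values are illustrative):

Python Code:
import tensorflow as tf

y_true = [1.0, 2.0, 3.0]   # example targets
y_pred = [1.1, 1.8, 2.5]   # example predictions

mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred)   # averages the squared differences, about 0.1 here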



Mean Absolute Error (MAE)

• MAE finds the average of the absolute differences between the target and the predicted outputs.
• MSE is highly sensitive to outliers, which can dramatically affect the loss because the distance is squared. MAE is therefore used when the training data has a large number of outliers, to mitigate this.

import tensorflow as tf

y_true, y_pred = [1.0, 2.0, 3.0], [1.1, 1.8, 2.5]   # example values (illustrative)
mae = tf.keras.losses.MeanAbsoluteError()
mae(y_true, y_pred)



Binary Cross-Entropy / Log Loss

• Binary cross-entropy compares each of the predicted probabilities to the actual class output, which can be either 0 or 1.
• It then calculates a score that penalizes the probabilities based on their distance from the expected value, i.e. how close or far they are from the actual value: predictions far from the true class receive a high penalty, close ones a low penalty.
• Advantage: the cost function is differentiable.
• Disadvantage: multiple local minima; not intuitive.
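A minimal sketch using tf.keras.losses.BinaryCrossentropy (the labels and predicted probabilities are illustrative), showing the high/low penalty behaviour described above:

Python Code:
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()   # expects probabilities by default

# confident, correct predictions -> low penalty (about 0.105)
print(bce([1.0, 0.0], [0.9, 0.1]).numpy())

# confident, wrong predictions -> high penalty (about 2.3)
print(bce([1.0, 0.0], [0.1, 0.9]).numpy())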



Categorical Cross-Entropy

• Also called Softmax Loss: a combination of a Softmax activation plus a cross-entropy loss.
• It is used for multi-class classification.
• In the specific (and usual) case of multi-class classification, the labels are one-hot encoded.
• Sparse Categorical Cross-Entropy loss function:
  • Used when the number of classes is very large (e.g. 1000).
  • Avoids one-hot encoding, which requires large memory.
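A minimal sketch contrasting the two Keras losses (the labels and predicted probabilities are illustrative): CategoricalCrossentropy takes one-hot labels, SparseCategoricalCrossentropy takes integer class indices, and both give the same result here.

Python Code:
import tensorflow as tf

y_pred = [[0.42, 0.31, 0.27]]   # predicted class probabilities for one sample

cce = tf.keras.losses.CategoricalCrossentropy()
print(cce([[1.0, 0.0, 0.0]], y_pred).numpy())   # one-hot label for class 0, about 0.87

scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce([0], y_pred).numpy())                # integer label 0, same loss, about 0.87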



• Categorical Cross-Entropy: multi-class classification
• Binary Cross-Entropy: binary classification



Summary
• Activation functions decide whether a neuron's input to the network is important or not in the process of prediction, using simple mathematical operations.
• Types of Activation Functions:
  • Binary step, Linear, ReLU, Leaky ReLU, Tanh, Sigmoid, Softmax.
• A Loss Function compares the target and predicted output values to measure how well the neural network models the training data.
• Types of Loss Function:
  • Regression Loss
  • Classification Loss

