
CNN Learning: Nonlinearity Functions - Loss Functions

Nonlinearity Functions
• A non-linearity layer in a convolutional neural network consists of an activation function that takes the feature
map generated by the convolutional layer and creates the activation map as its output.
• The activation function is an element-wise operation over the input volume and therefore the dimensions of
the input and the output are identical.
• In other words, let layer l be a non-linearity layer: it takes the feature volume Y_i^(l-1) from the convolutional layer (l-1) and generates the activation volume Y_i^(l) = f(Y_i^(l-1)), where the activation f is applied element-wise, as in the sketch below.
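As a minimal illustration (a NumPy sketch of my own, not taken from the notes), applying the activation element-wise leaves the shape of the volume unchanged:

import numpy as np

def nonlinearity_layer(feature_volume, activation):
    # The activation is applied element-wise, so the activation volume
    # has exactly the same dimensions as the input feature volume.
    return activation(feature_volume)

Y_prev = np.random.randn(28, 28, 16)          # feature volume from layer l-1
Y_curr = nonlinearity_layer(Y_prev, np.tanh)  # activation volume of layer l
assert Y_curr.shape == Y_prev.shape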
Non-Linear Activation Functions
Sigmoid
• The main reason we use the sigmoid function is that its output lies between 0 and 1. It is therefore especially suited to models that must predict a probability as the output, since any probability lies in the range [0, 1]; see the sketch below.
• The logistic sigmoid function can cause a neural network to get stuck during training, so it is mostly used in the output layer (in the case of binary classification).
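A minimal sketch of the sigmoid in NumPy (illustrative only):

import numpy as np

def sigmoid(z):
    # Squashes any real input into the open interval (0, 1), which is why
    # it is convenient when the output should be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approx. [0.018, 0.5, 0.982]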
Tanh
• Tanh is a scaled and shifted version of the sigmoid function whose range is between -1 and 1. The mean of the activations coming out of a hidden layer is closer to zero, so the data is more centered, which makes learning in the next layer easier and faster.
• One downside of both sigmoid and tanh is that when the weighted sum input (z) is either very large or very small, the gradient (also called the derivative or slope) of the function becomes very small and ends up close to zero. This can slow down gradient descent, as the sketch below illustrates.
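The saturation effect can be checked numerically; this sketch (my own, using the standard derivative formulas) shows both gradients shrinking towards zero as |z| grows:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of the sigmoid

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # derivative of tanh

z = np.array([0.0, 5.0, 10.0])
print(sigmoid_grad(z))  # approx. [0.25, 6.6e-03, 4.5e-05]
print(tanh_grad(z))     # approx. [1.0, 1.8e-04, 8.2e-09]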
ReLU
• ReLU is increasingly the default choice of activation function. If you are not sure what to use in the hidden layers, use the ReLU activation function or one of its variants. It is a bit faster to compute than other activation functions, and gradient descent does not get stuck because ReLU does not saturate for large input values, unlike the logistic function and the hyperbolic tangent function, which saturate at 1.
• One disadvantage of ReLU is that its derivative is equal to zero when the weighted sum input (z) is negative. This problem is known as the dying ReLU: if the weights in the network always lead to negative inputs into a ReLU neuron, that neuron stops contributing to the network's training, as the sketch below shows.
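A small NumPy sketch (not from the notes) showing both the function and the zero gradient for negative inputs:

import numpy as np

def relu(z):
    # max(0, z): cheap to compute and does not saturate for large positive z.
    return np.maximum(0.0, z)

def relu_grad(z):
    # The gradient is exactly 0 for z < 0, which is the root of the
    # "dying ReLU" problem described above.
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))       # [0. 0. 0. 2.]
print(relu_grad(z))  # [0. 0. 0. 1.]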
LeakyReLU
• The LeakyReLU activation function usually works better than ReLU, although it is not used that much in practice. To resolve the dead-neuron issue, the leaky ReLU was proposed in 2013 as an activation function that introduces a small negative slope into ReLU, keeping the weight updates alive throughout the propagation process.
The LeakyReLU is defined as U_a(z) = max(az, z).
• The hyperparameter alpha (a) defines how much the function "leaks". It is the slope of the function for z < 0 and is typically set to 0.01. The LeakyReLU behaves the same as the standard ReLU except that it has non-zero gradients over the entire input range.
• Also consider the Parametric ReLU (PReLU), a variant of LeakyReLU that, instead of using a predetermined slope such as 0.01, makes the slope a parameter the neural network learns itself: y = az for z < 0. A sketch follows below.
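A short NumPy sketch of the leaky variant (illustrative; the default alpha of 0.01 matches the value mentioned above):

import numpy as np

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): keeps a small non-zero slope (alpha) for z < 0,
    # so the gradient never becomes exactly zero.
    return np.maximum(alpha * z, z)

z = np.array([-3.0, -0.5, 2.0])
print(leaky_relu(z))  # approx. [-0.03, -0.005, 2.0]
# In PReLU the slope alpha is not fixed at 0.01 but learned together
# with the other network parameters.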
Softmax
• The softmax activation function is used in neural networks when we want to build a multi-class classifier, i.e. one that assigns an instance to a single class when the number of possible classes is larger than two (when there are only two classes we can simply use sigmoid). It is also possible to use sigmoid for multi-class classification, but that depends on the use case and the user's preferences.
• The basic practical difference between sigmoid and softmax is that while both give outputs in the [0, 1] range, softmax ensures that the outputs across the classes sum to 1, whereas sigmoid only forces each class output to lie between 0 and 1. In a dog/cat classification task, sigmoid might give dog = 0.95 and cat = 0.1 (the values do not sum to 1, but each one is in the [0, 1] range), while softmax would give dog = 0.98 and cat = 0.02 (the values sum to 1); see the sketch below.
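The sum-to-1 behaviour is easy to verify in a small sketch (my own; the logits are made up to roughly reproduce the dog/cat numbers above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability, exponentiate, then
    # normalize so the outputs sum to 1 across the classes.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([3.9, 0.0])  # raw scores for [dog, cat]
print(softmax(logits))         # approx. [0.98, 0.02], sums to 1
print(sigmoid(logits))         # approx. [0.98, 0.50], each in (0, 1) but sum != 1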
Loss Functions

• A loss function is a function that compares the target and predicted output values; it measures how well the neural network models the training data.
• When training, we aim to minimize this loss between the predicted and target outputs.
• Each training input is passed through the neural network in a process called forward propagation.
• Once the model has produced an output, this predicted output is compared against the given target output and the error is propagated back through the network (backpropagation); the parameters of the model are then adjusted so that it outputs a result closer to the target output.
• The parameters are adjusted to minimize the average loss: we look for the weights, w, and biases, b, that minimize the value of J (the average loss). A small sketch of one such update step follows below.
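As a rough illustration of this loop (a sketch of my own for a single linear unit trained with gradient descent on a mean-squared loss; none of these names come from the notes):

import numpy as np

def training_step(w, b, x, y, lr=0.1):
    y_hat = x @ w + b                    # forward propagation
    error = y_hat - y
    loss = np.mean(error ** 2)           # average loss J
    grad_w = 2.0 * x.T @ error / len(y)  # gradients used in backpropagation
    grad_b = 2.0 * np.mean(error)
    return w - lr * grad_w, b - lr * grad_b, loss

x = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])            # targets generated by y = 2x + 1
w, b = np.zeros(1), 0.0
for _ in range(200):
    w, b, loss = training_step(w, b, x, y)
print(w, b, loss)                        # w approaches 2, b approaches 1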
Types of Loss Functions
In supervised learning there are two main types of loss functions, corresponding to the two major types of neural networks: regression and classification loss functions.
1. Regression loss functions are used in regression neural networks: given an input value, the model predicts a corresponding output value rather than a pre-selected label. Examples: Mean Squared Error, Mean Absolute Error.
2. Classification loss functions are used in classification neural networks: given an input, the network produces a vector of probabilities of the input belonging to various pre-set categories, and we can then select the category with the highest probability. Examples: Binary Cross-Entropy, Categorical Cross-Entropy.
Mean Squared Error (MSE)
• One of the most popular loss functions, MSE finds the average of the squared differences between the target and the predicted outputs:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where yᵢ is the target output, ŷᵢ the predicted output and n the number of samples.
• MSE is also a convex function with a clearly defined global minimum — this allows us to
more easily utilize gradient descent optimization to set the weight values.
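A direct NumPy translation of the formula (illustrative values):

import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])))  # approx. 0.417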
Mean Absolute Error (MAE)
• MAE finds the average of the absolute differences between the target and the predicted outputs:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

• MAE is used when the training data contains a large number of outliers, since the absolute difference penalizes them less heavily than the squared difference used by MSE.
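And the corresponding sketch for MAE (illustrative values):

import numpy as np

def mae(y_true, y_pred):
    # Average of the absolute differences; errors are not squared, so a
    # single outlier distorts the loss less than it would with MSE.
    return np.mean(np.abs(y_true - y_pred))

print(mae(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])))  # 0.5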
Binary Cross-Entropy/Log Loss
• This is the loss function used in binary classification models, where the model takes an input and has to classify it into one of two pre-set categories:

BCE = −(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]

where yᵢ is the 0/1 target label and ŷᵢ the predicted probability.
• Binary cross-entropy is a special case of categorical cross-entropy with M = 2, i.e. the number of categories is 2.
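A minimal sketch of the formula in NumPy (the clipping constant eps is my own addition to avoid log(0)):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true holds 0/1 labels; y_pred holds the predicted probability of class 1.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))  # approx. 0.164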
Categorical Cross-Entropy Loss
• In cases where the number of classes is greater than two, we use categorical cross-entropy; this follows a very similar process to binary cross-entropy, summing the log-probability terms over all M classes. A sketch follows below.
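A matching sketch for the multi-class case (assuming one-hot targets and softmax outputs; values are illustrative):

import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot encoded; y_pred holds the softmax probability of each class.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # two samples, three classes
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_true, y_pred))       # approx. 0.29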
Regularization - Optimizers - Gradient Computation
Refer to Unit-1.
