
Practical issues in Neural Networks
Objectives
• Choice of the number of output nodes
• Choice of loss function
• Other practical issues in neural network training
• Deep learning vs. machine learning
Choice of the number of output nodes
• The choice of the number of output nodes is tied to the activation function and depends on the application at hand.
• For multi-class classification problems, the output layer typically uses the softmax activation function.
• "It takes an input vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities."
• If we have K real-valued outputs v_1, ..., v_K, the softmax activation function is defined as follows:
  o_i = exp(v_i) / (exp(v_1) + ... + exp(v_K)),   for i = 1, ..., K
[Figure: softmax layer]
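A minimal NumPy sketch of the softmax computation above; the function name and the max-subtraction trick (for numerical stability) are my own choices, not from the slides:

```python
import numpy as np

def softmax(v):
    """Normalize a vector of K real scores into K probabilities that sum to 1."""
    e = np.exp(v - np.max(v))  # subtracting the max avoids overflow; result unchanged
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))  # approx. [0.659 0.242 0.099]
```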
Choice of loss function
• The choice of loss function is sensitive to the application at hand.
• "It is important to remember that the nature of the output nodes, the activation function, and the loss function depend on the application at hand. Furthermore, these choices also depend on one another."
• There are many forms of loss function, such as:
• Squared error loss:
For regression problems involving data points with numeric labels y ∈ R, a widely used (first) choice of loss function for a single training instance is the squared error loss:
  L = (y − ŷ)²
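A minimal sketch of the squared error loss for a single instance (the function name is illustrative):

```python
def squared_error_loss(y, y_hat):
    """Squared error for one instance with numeric label y and prediction y_hat."""
    return (y - y_hat) ** 2

print(squared_error_loss(3.0, 2.5))  # 0.25
```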
Choice of loss function (Cont.)
• Hinge loss:
It has the form
  L = max{0, 1 − y·ŷ}
It is used for binary targets y ∈ {−1, +1} with real-valued predictions ŷ, and is the loss used in support vector machines (a short code sketch follows at the end of this slide).
• For multi-way (multi-class) prediction:
For multiple classes, the softmax activation function is useful as an output layer, but it requires different types of loss functions depending on whether the prediction is binary or multi-way (see the next slide).
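The sketch referenced in the hinge-loss bullet above, assuming y ∈ {−1, +1} and a real-valued prediction:

```python
def hinge_loss(y, y_hat):
    """Hinge loss: zero for confidently correct predictions, linear penalty otherwise."""
    return max(0.0, 1.0 - y * y_hat)

print(hinge_loss(+1, 2.0))  # 0.0 -- correct and outside the margin
print(hinge_loss(+1, 0.3))  # 0.7 -- correct but inside the margin
print(hinge_loss(-1, 0.3))  # 1.3 -- on the wrong side
```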

Choice of loss function (Cont.)
• Binary targets (logistic regression): in this case, the observed value y is in {+1, −1} and the prediction ŷ is numerical (with identity activation):
  L = log(1 + exp(−y·ŷ))
• Categorical cross-entropy: for a multi-class classification task, the cross-entropy (or categorical cross-entropy) for a single instance can be extended as follows (a sketch of both losses follows below):
  L = −log(ŷ_r),   where r is the index of the correct class and ŷ_r its predicted (softmax) probability

• Note 1: For discrete-valued outputs, it is common to use softmax activation with cross-entropy loss.
• Note 2: For real-valued outputs, it is common to use linear activation with squared loss.
• Note 3: One can use various combinations of activation and loss functions to achieve the same results.
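A minimal sketch of the two losses above as usually written for a single instance; the function names are mine, and the categorical case assumes the predictions are already softmax probabilities:

```python
import numpy as np

def logistic_loss(y, y_hat):
    """Binary target y in {-1, +1}, real-valued prediction y_hat (identity activation)."""
    return np.log(1.0 + np.exp(-y * y_hat))

def categorical_cross_entropy(probs, true_class):
    """probs: softmax output over the classes; true_class: index of the correct class."""
    return -np.log(probs[true_class])

print(logistic_loss(+1, 2.0))                                   # approx. 0.127
print(categorical_cross_entropy(np.array([0.7, 0.2, 0.1]), 0))  # approx. 0.357
```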
Practical issues in neural network training
• Problem of overfitting
• The vanishing and exploding gradient problems
• Difficulties in convergence
• Local and spurious optima
• Computational challenges

Problem of overfitting
• Problem of overfitting:
Fitting a model to a particular training dataset doesn't guarantee that it will provide good predictions on unseen data, even if it predicts the targets on the training data perfectly. Overfitting typically arises when:
- the number of parameters is greater than the number of data points, or
- there is a lack of training data.
"Thus, we cannot generalize our model."
In order to mitigate the impact of overfitting:
1. Regularization: constrain the model to use fewer non-zero parameters (remove less important, e.g. noisy, patterns). This is important when the available amount of data is limited.
2. Parameter sharing: use the same parameters to learn characteristics at many local positions of the data (e.g., CNNs with images).
Cont.
3. Early stopping: gradient descent is ended after only a few iterations (this restricts how far the parameters can move, effectively constraining the parameter space).
4. Trading off breadth for depth: it turns out that a neural network with more layers (i.e., greater depth) tends to require fewer units (and consequently fewer parameters) in each layer => deep learning.
5. Data augmentation: this technique changes the sample data slightly every time the model processes it.
6. Ensemble methods: in these methods, we combine predictions from several separate models. For example, the dropout and DropConnect methods can be combined with many neural network architectures to obtain an additional accuracy improvement (a short sketch of points 1 and 6 follows below).
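A minimal sketch of points 1 and 6: an L1 weight penalty that encourages sparse parameters, and inverted dropout as used to build an implicit ensemble. The penalty weight, drop probability, and function names are illustrative assumptions, not from the slides:

```python
import numpy as np

def l1_penalized_loss(base_loss, weights, lam=0.01):
    """Regularization: add a penalty that pushes unimportant weights toward zero."""
    return base_loss + lam * np.sum(np.abs(weights))

def dropout(activations, drop_prob=0.5, seed=0):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so the expected activation stays the same."""
    mask = np.random.default_rng(seed).random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

h = np.array([0.2, 1.5, -0.7, 0.9])
print(l1_penalized_loss(0.40, h))  # 0.40 + 0.01 * 3.3 = 0.433
print(dropout(h))                  # roughly half the units zeroed, the rest doubled
```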

The vanishing and exploding gradient problems
• The vanishing and exploding gradient problems:
While increasing the depth often reduces the number of parameters, it can cause gradient updates to become either vanishingly small (vanishing gradient) or increasingly large (exploding gradient) in certain types of NNs.
Solutions:
- Using the ReLU activation function, whose derivative is 1 for positive arguments, helps in some cases (a short sketch follows below).
- Using an adaptive learning rate (in some cases).
- Using batch normalization, etc. (in some cases).
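A quick sketch of why ReLU helps with the vanishing gradient: its derivative is exactly 1 for positive inputs, so repeated multiplication through many layers does not shrink the gradient the way saturating activations do (function names are mine):

```python
import numpy as np

def relu(x):
    """ReLU activation: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```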

Difficulties in convergence
• Difficulties in convergence:
Fast convergence of the optimization process is difficult with very deep networks.
"It is somewhat related to the vanishing gradient problem."
Solutions:
- Using gating networks or residual networks (discussed later).

Local and spurious optima
• Local and spurious optima:
- The objective function optimized in neural network training is nonlinear and has a lot of local optima.
- When the parameter space is large, there are many such local (and spurious) optima.
Solutions:
- Good initialization can mitigate this problem, using so-called pretraining.
- Pretraining is accomplished by using supervised and unsupervised training on shallow sub-networks of the original network to create the initial weights.
- This type of pretraining is done in a greedy, layerwise fashion: a single layer of the network is trained at a time to learn the initial weights of that layer.
• This procedure ignores irrelevant parts of the parameter space; furthermore, unsupervised pretraining helps avoid overfitting.
Computational challenges
• Computational challenges:
A significant challenge in NNs is the running time required to train the network.
Solution:
Using advanced hardware (e.g., graphics processing units, GPUs) has helped reduce the time required for NN training.

Stopping rules
• These are the rules that determine when to stop training multilayer
networks (these settings are ignored when the radial basis function algorithm is
used). Training can be stopped according to the following criteria:
• Use maximum training time (per component model):
You can choose whether to specify a maximum number of minutes for the
algorithm to run.
• Customize number of maximum training cycles:
If the maximum number of cycles is exceeded, then training stops.
• Use minimum accuracy:
With this option, training will continue until the specified accuracy is attained.
This may never happen, but you can interrupt training at any point and save the
net with the best accuracy achieved so far.
• If the error does not decrease after each cycle:
Training stops if the relative change in the training error is small, or if the ratio of the current training error to the initial error is small. (A schematic sketch of these stopping criteria follows below.)
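A schematic sketch of how the stopping criteria above could be combined in a training loop. `train_one_cycle`, the thresholds, and the return messages are placeholders, not the options of any particular tool:

```python
import time

def train_with_stopping(train_one_cycle, max_minutes=10, max_cycles=200,
                        min_accuracy=0.95, min_relative_change=1e-4):
    """Stop on whichever criterion triggers first: time budget, cycle budget,
    target accuracy, or a training error that has essentially stopped decreasing."""
    start, prev_error = time.monotonic(), None
    for cycle in range(max_cycles):              # maximum number of training cycles
        error, accuracy = train_one_cycle()      # placeholder: one pass over the data
        if (time.monotonic() - start) / 60 > max_minutes:
            return "maximum training time reached", cycle
        if accuracy >= min_accuracy:             # minimum accuracy attained
            return "target accuracy reached", cycle
        if prev_error is not None:
            rel_change = abs(prev_error - error) / max(prev_error, 1e-12)
            if rel_change < min_relative_change:  # error no longer decreasing
                return "error stopped decreasing", cycle
        prev_error = error
    return "maximum cycles exceeded", max_cycles
```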
Power of function composition
• The power of deep learning arises from the following fact:
"Repeated composition of certain types of nonlinear functions increases the representation power of the network, and therefore reduces the parameter space required for training."
- Neural networks with linear activation functions do not gain representational power from increasing the number of layers (see the check below).
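A quick NumPy check of the note above: composing layers with purely linear activations collapses to a single linear map, so depth alone adds nothing without nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

deep_linear = W3 @ (W2 @ (W1 @ x))  # three "layers", all with linear activation
collapsed = (W3 @ W2 @ W1) @ x      # one equivalent linear layer
print(np.allclose(deep_linear, collapsed))  # True
```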

Deep learning vs. machine learning
• The accuracy of a model depends strongly on the size of the dataset fed to it. When the dataset is small, traditional ML models are preferable; conversely, when the dataset is large, deep learning models are preferable.
• Thus, deep learning becomes more attractive than traditional machine learning algorithms when sufficient data and computational power are available.
