An Introduction To Neural Networks
CS -5105
Deep Learning
1
07/06/2022
• In the case of the perceptron, the choice of the sign activation function is motivated by the fact that a binary
class label needs to be predicted
• If the target variable to be predicted is real, then it makes sense to use the identity activation function
• If it is desirable to predict the probability of a binary class, it makes sense to use a sigmoid function
• The tanh function has a shape similar to that of the sigmoid function, except that it is horizontally re-scaled
and vertically translated/re-scaled to [−1, 1]
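The four activation functions above can be sketched in a few lines of numpy (the function names are illustrative):

```python
import numpy as np

def sign(x):      # binary class labels in {-1, +1}, as in the perceptron
    return np.where(x >= 0, 1.0, -1.0)

def identity(x):  # real-valued targets (linear regression)
    return x

def sigmoid(x):   # probabilities in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):      # re-scaled/translated sigmoid with range (-1, 1)
    return np.tanh(x)
```

The re-scaling relationship between the last two is tanh(x) = 2·sigmoid(2x) − 1.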
The importance of nonlinear activation functions becomes significant when one moves from the single-layered
perceptron to the multi-layered architectures
A multi-layer network that uses only the identity activation function in all its layers reduces to a single-layer
network performing linear regression
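This collapse can be checked numerically: composing two identity-activation layers is the same linear map as multiplying the weight matrices once (shapes below are arbitrary examples).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first-layer weights
W2 = rng.standard_normal((2, 4))   # second-layer weights
x = rng.standard_normal(3)

two_layer = W2 @ (W1 @ x)          # forward pass with identity activations
one_layer = (W2 @ W1) @ x          # equivalent single linear layer
assert np.allclose(two_layer, one_layer)
```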
• There is always a gap between training and test performance, and this gap is particularly large when the model is complex and the data set is small
• Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the accuracy of a
Deep Learning model when facing completely new data from the problem domain.
• Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the adjusted
loss function and prevent overfitting or underfitting.
• Using regularization, we can fit a machine learning model appropriately to the training data and hence reduce its error on unseen test data
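As a concrete sketch, one common regularizer penalizes the squared norm of the weights (L2 / ridge). The example below, for linear regression, is illustrative: the function name, learning rate, and penalty strength are assumptions, not from the slides.

```python
import numpy as np

def ridge_gradient_step(w, X, y, lam=0.1, lr=0.01):
    """One gradient step on  mean((Xw - y)^2) + lam * ||w||^2."""
    n = len(y)
    grad = 2.0 * X.T @ (X @ w - y) / n + 2.0 * lam * w  # data term + L2 penalty
    return w - lr * grad
```

The extra `2 * lam * w` term shrinks the weights toward zero at every step, which limits model complexity.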
• The most effective way of building a neural network is by constructing the architecture of the neural
network after giving some thought to the underlying data domain.
• For example, the successive words in a sentence are often related to one another, and, similarly, nearby pixels in an image are typically related.
• Furthermore, many of the parameters might be shared. For example, a convolutional neural network uses
the same set of parameters to learn the characteristics of a local block of the image.
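The parameter-sharing idea can be sketched with a one-dimensional convolution: a single small filter is slid across the whole input, so every local block is processed by the same weights (the helper name is illustrative).

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1-D convolution: the same kernel weights at every position."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

# A difference filter applied with shared weights at each position:
out = conv1d(np.array([1., 2., 3., 4.]), np.array([1., -1.]))
# out = [-1., -1., -1.]
```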
• One way to decide the stopping point is by holding out a part of the training
data, and then testing the error of the model on the held-out set.
• Early stopping essentially restricts the parameter search to a smaller neighborhood around the initial values of the parameters.
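The stopping rule above can be sketched as a loop that monitors held-out error; the `patience` parameter and function names are illustrative assumptions, not from the slides.

```python
def train_with_early_stopping(step, val_error, max_epochs=100, patience=5):
    """`step()` runs one epoch of training; `val_error()` evaluates the
    held-out set. Stop when validation error fails to improve."""
    best, waited = float("inf"), 0
    for _ in range(max_epochs):
        step()
        err = val_error()
        if err < best:
            best, waited = err, 0       # improvement: reset patience
        else:
            waited += 1
            if waited >= patience:      # no improvement for `patience` epochs
                break
    return best
```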
• Networks with more layers (i.e., greater depth) tend to require far fewer units per layer
• A two-layer neural network can be used as a universal function approximator if a large number of hidden
units are used within the hidden layer.
• Increased depth is a form of regularization, as the features in later layers are forced to obey a particular type
of structure imposed by the earlier layers.
• Increased constraints reduce the capacity of the network, which is helpful when there are limitations on the
amount of available data.
• The number of units in each layer can typically be reduced to such an extent that a deep network often has
far fewer parameters even when added up over the greater number of layers. This observation has led to an
explosion in research on the topic of deep learning.
• Ensemble methods usually produce more accurate solutions than a single model would
• Variety of ensemble methods like bagging are used in order to increase the generalization power of the
model.
• Two other methods are Dropout and DropConnect. These methods can be combined with many neural network architectures to obtain an additional accuracy improvement of about 2% in many real settings. However, the precise improvement depends on the type of data and the nature of the underlying training.
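A minimal sketch of (inverted) dropout: at training time each unit is zeroed with probability p, and the survivors are scaled by 1/(1 − p) so that no rescaling is needed at test time. The rate and function name are illustrative.

```python
import numpy as np

def dropout(activations, p=0.5, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(activations.shape) >= p   # keep each unit with prob 1-p
    return activations * mask / (1.0 - p)       # inverted scaling
```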
• While increasing depth often reduces the number of parameters of the network, it leads to different types of
practical issues.
• Propagating gradients backwards using the chain rule can make the updates unstable in networks with a large number of layers.
• In particular, the updates in earlier layers can either be negligibly small (vanishing gradient) or they can be
increasingly large (exploding gradient) in certain types of neural network architectures.
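The vanishing case can be sketched directly: backpropagating through many sigmoid layers multiplies the gradient by the derivative σ′(x) ≤ 0.25 at every layer, so the gradient at the earliest layers shrinks geometrically (pre-activations of 0 are assumed for simplicity).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

grad = 1.0
for _ in range(20):                              # 20 sigmoid layers
    grad *= sigmoid(0.0) * (1 - sigmoid(0.0))    # sigma'(0) = 0.25
# grad is now 0.25**20, on the order of 1e-12
```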
• Sufficiently fast convergence of the optimization process is difficult to achieve with very deep networks, as depth increases the network's resistance to training by hindering the smooth flow of gradients through it.
• This problem is somewhat related to the vanishing gradient problem, but has its own unique characteristics.
• The optimization function of a neural network is highly nonlinear with lots of local optima.
• Picking good initialization points is difficult when the parameter space is large and there may be many local optima
• One such method for improving neural network initialization is referred to as pretraining.
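Besides pretraining, a common sketch for picking a reasonable starting point is variance-scaled random initialization (Glorot/Xavier), which keeps the scale of activations roughly constant from layer to layer. This initializer is a standard alternative assumed here, not a method from the slides.

```python
import numpy as np

def glorot_init(fan_in, fan_out, rng=None):
    """Weights drawn with std sqrt(2 / (fan_in + fan_out))."""
    rng = np.random.default_rng(0) if rng is None else rng
    scale = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, scale, size=(fan_out, fan_in))
```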
• A significant challenge in neural network design is the running time required to train the network.
• It is not uncommon to require weeks to train neural networks in the text and image domains.
• GPUs are specialized hardware processors that can significantly speed up the kinds of operations commonly
used in neural networks.
• In this sense, some algorithmic frameworks like Torch are particularly convenient because they have GPU
support tightly integrated into the platform.
• Inputs can also be fed to layers other than the first hidden layer, giving an unconventional architecture. As long as the neural network is acyclic (or can be transformed into an acyclic representation), the weights of the underlying computation graph can be learned using dynamic programming (backpropagation).
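The dynamic-programming view can be sketched on a tiny two-layer graph: nodes are visited in reverse topological order, and each node's stored gradient is reused for all paths through it instead of being recomputed. The function name and shapes are illustrative.

```python
import numpy as np

def forward_backward(x, W1, W2):
    """Forward pass plus backpropagation for y = W2 @ tanh(W1 @ x)."""
    h = np.tanh(W1 @ x)                   # hidden layer
    y = W2 @ h                            # scalar output, identity activation
    # Backward pass in reverse topological order:
    dy = np.ones_like(y)
    dh = W2.T @ dy                        # gradient at h, reused for all paths
    dW2 = np.outer(dy, h)
    dW1 = np.outer(dh * (1 - h ** 2), x)  # tanh'(z) = 1 - tanh(z)^2
    return y, dW1, dW2
```

The returned gradients can be checked against finite differences, which is a standard sanity test for any backpropagation implementation.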