
Artificial Neural Networks

Linear Models
Statistical Models
Greedy Hill Descent
1. Start with initial random parameters.
2. Repeatedly find the direction in which to change the parameters so as to reduce the loss function, and update in that direction based on local information and the learning rate (see the sketch below).
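A minimal sketch of this loop in Python (the loss function and all names here are illustrative, with the gradient estimated by finite differences):

```python
import numpy as np

def hill_descent(loss, init_params, learning_rate=0.01, n_iters=1000):
    """Greedy descent: repeatedly step against a local gradient estimate."""
    params = init_params.copy()
    eps = 1e-6
    for _ in range(n_iters):
        # Estimate the local gradient of the loss at the current parameters.
        grad = np.zeros_like(params)
        for i in range(len(params)):
            step = np.zeros_like(params)
            step[i] = eps
            grad[i] = (loss(params + step) - loss(params - step)) / (2 * eps)
        # Update in the direction that reduces the loss, scaled by the learning rate.
        params -= learning_rate * grad
    return params

# 1. Initial random parameters; 2. repeated loss-reducing updates.
rng = np.random.default_rng(0)
best = hill_descent(lambda p: np.sum((p - 3.0) ** 2), rng.normal(size=2))
```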
Multinomial Logistic Regression
Generalizable to multinomial cases.
One set of parameters per value:
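In standard form, with a parameter vector β_k for each value k of the output variable:

```latex
P(Y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j} \exp(\beta_j^\top x)}
```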
Feature Transformations &
Non-Linear Models

Consider an OLS linear regression model generated from the dataset below.

X  Y
1  4
2  9
3  10
4  5
Feature Transformations &
Non-Linear Models
Original Data:
X  Y
1  4
2  9
3  10
4  5

Feature Transformation: T(X) = [X, X^2]

Transformed Data:
X´=X  X´´=X^2  Y
1     1        4
2     4        9
3     9        10
4     16       5

We take a non-linear transformation of our features to generate a new, transformed data set: T(X) = [X, X^2].
A feature transformation is just a function of the input features.
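As a quick sketch in Python, the transformation T(X) = [X, X^2] applied to the toy dataset (numpy assumed):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([4.0, 9.0, 10.0, 5.0])

# T(X) = [X, X^2]: each original value becomes a two-feature row.
X_transformed = np.column_stack([X, X ** 2])
print(X_transformed)
# [[ 1.  1.]
#  [ 2.  4.]
#  [ 3.  9.]
#  [ 4. 16.]]
```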
Feature Transformations &
Non-Linear Models
We now make an OLS linear
regression model from the
new transformed data. The
model is, of course, still linear.

X´=X  X´´=X^2  Y
1     1        4
2     4        9
3     9        10
4     16       5
Feature Transformations &
Non-Linear Models
Most of the plane represents impossible points. Only one curve, where X´^2 = X´´, represents possible points.

X´=X  X´´=X^2  Y
1     1        4
2     4        9
3     9        10
4     16       5
Feature Transformations &
Non-Linear Models
We can take that curve and plot the X´=X vs. Y values back into our original two-dimensional space, and we have a non-linear model!

X  Y
1  4
2  9
3  10
4  5
Feature Transformations &
Non-Linear Models
We can create non-linear models by:
• Performing non-linear feature transformations
• Creating a linear model on the transformed data
• Taking the curve of possible values (see the sketch below)
Most non-linear models are of this sort… including neural networks and deep learning models.
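A compact sketch of the whole recipe, assuming scikit-learn-style tooling; the fit is linear in [X, X^2] but traces a curve back in the original X-Y space:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([4.0, 9.0, 10.0, 5.0])

# Step 1: non-linear feature transformation T(X) = [X, X^2].
X_t = np.hstack([X, X ** 2])

# Step 2: OLS linear regression on the transformed data.
model = LinearRegression().fit(X_t, Y)

# Step 3: evaluate only along the curve of possible points (X´´ = X´^2),
# which gives a non-linear prediction curve in the original space.
grid = np.linspace(1.0, 4.0, 50)[:, None]
curve = model.predict(np.hstack([grid, grid ** 2]))
```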
Artificial Neural Networks (ANNs)
Basic ANNs are a series of one or more feature transformations
followed by linear regression (for regression tasks) or logistic
regression (for classification tasks).
• The transformations are parameterized, and the values of the
parameters are fit during training.
• ANNs learn the transformation used from the data.
Note: Multinomial logistic regression is often called softmax in the
context of ANNs.
Artificial Neural Networks
• Input, hidden, and output layers
• Dummy variables (X0, H0) serve as bias terms
• Parameters are weights on edges
• Hidden layers are feature transformations of the previous layer (see the sketch below)
• Can have multiple hidden layers
• Output layer: linear/logistic regression
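A minimal forward pass matching this picture, assuming one hidden layer; the sizes, names, and tanh activation are illustrative:

```python
import numpy as np

def forward(x, W_hidden, W_out, activation=np.tanh):
    """Weights live on the edges between layers; dummy variables supply biases."""
    x = np.append(x, 1.0)           # dummy input X0
    h = activation(W_hidden @ x)    # hidden layer: a feature transformation
    h = np.append(h, 1.0)           # dummy hidden variable H0
    return W_out @ h                # output layer: linear regression

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 3))  # 3 hidden units, 2 inputs + dummy X0
W_out = rng.normal(size=(1, 4))     # 1 output, 3 hidden units + dummy H0
y = forward(np.array([0.5, -1.2]), W_hidden, W_out)
```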
Artificial Neural Networks
• Transformations are non-linear functions of the weighted sum of the previous layer's variables.
• The sum's weights are those associated with the incoming edges.
• The non-linear function is called the 'activation' function.
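In symbols, with a_i the previous layer's variables, w_ij the weight on the incoming edge from unit i to unit j, and f the activation function:

```latex
h_j = f\left( \sum_i w_{ij} \, a_i \right)
```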
Artificial Neural Networks
Some activation functions:
• Logistic (a.k.a. Sigmoid)
• Hyperbolic Tangent
• Rectifier (a.k.a. ReLU)
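Their standard definitions, as a small Python sketch:

```python
import numpy as np

def logistic(z):            # a.k.a. sigmoid: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_tangent(z):  # tanh: squashes values into (-1, 1)
    return np.tanh(z)

def rectifier(z):           # a.k.a. ReLU: zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)
```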
Artificial Neural Networks

Logistic and tanh activation functions have problems in deep networks (the 'vanishing gradient problem'). In such cases, the rectifier is the most popular activation function.
Artificial Neural Networks
The linear model at the end of the series
of feature transformations has the
coefficients given on the final layer of
edges.
Multinomial logistic regression (softmax)
is a little more complicated, with a set of
coefficients for each value of the output
variable. This can be represented by an
additional layer.
Artificial Neural Networks
Training an ANN given training data is an optimization problem.
Most Popular Optimization Method:
Gradient Descent
Artificial Neural Networks
• Each iteration, we update all parameters based on the partial derivatives of the loss function with respect to each parameter (the gradient).
• This changes both the feature transformations and the linear output model.
• The loss function will have many local optima. It is common to run multiple restarts.
• Updates can include additional components, such as momentum.
• Backpropagation is a method for efficiently calculating the partial derivative of each parameter in a neural network (see the sketch below).
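A minimal sketch of gradient descent with backpropagation on a one-hidden-layer regression network, trained on the deck's toy dataset (layer sizes and the learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([[4.0], [9.0], [10.0], [5.0]])

# One hidden layer of 5 tanh units, then a linear output layer.
W1 = rng.normal(scale=0.5, size=(1, 5)); b1 = np.zeros(5)
W2 = rng.normal(scale=0.5, size=(5, 1)); b2 = np.zeros(1)

lr = 0.01
for _ in range(5000):
    # Forward pass.
    H = np.tanh(X @ W1 + b1)             # hidden layer (feature transformation)
    pred = H @ W2 + b2                   # linear output model
    err = pred - Y                       # gradient of 0.5 * squared error w.r.t. pred

    # Backpropagation: apply the chain rule layer by layer.
    dW2 = H.T @ err
    db2 = err.sum(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)

    # Update every parameter against its partial derivative.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```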
Artificial Neural Networks
• Calculating the gradients of the loss function over the entire training data is very time consuming.
• It is common to calculate gradients on subsets of the training data, called batches.

Trade-off: faster iterations vs. noisier convergence.

Stochastic: batch size 1. Batch: the full data. Mini-batch: moderately sized subsets of the data (see the sketch below).
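A sketch of the mini-batch split; gradient_step is a hypothetical function that updates the parameters from one batch:

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Shuffle each pass over the data, then yield moderately sized subsets."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

# batch_size=1 gives stochastic gradient descent; batch_size=len(X) is full batch.
# for X_b, Y_b in minibatches(X, Y, batch_size=32, rng=np.random.default_rng(0)):
#     gradient_step(params, X_b, Y_b)   # hypothetical per-batch update
```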
Artificial Neural Networks
Neural networks are prone to overfitting, so regularization should be used.
Regularization 'smooths' the resulting function (regression curve or decision boundary). Common methods:
• Add noise to the inputs
• Add a penalty to the loss function, penalizing large parameter values (see the sketch below)
• Dropout: at each iteration of gradient descent, randomly remove a subset of nodes from the hidden layers
• Stop before convergence of the gradient descent algorithm
