
Artificial Neural Networks

Linear Models
Statistical Models
Greedy Hill Descent
1. Start with initial random parameters.
2. Repeatedly find the direction in which to change the parameters so as to reduce the loss function, and update in that direction based on local information and the learning rate (see the sketch below).
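A minimal sketch of this loop in Python (the loss function and all names here are illustrative, with the gradient estimated by finite differences):

```python
import numpy as np

def hill_descent(loss, init_params, learning_rate=0.01, n_iters=1000):
    """Greedy descent: repeatedly step against a local gradient estimate."""
    params = init_params.copy()
    eps = 1e-6
    for _ in range(n_iters):
        # Estimate the local gradient of the loss at the current parameters.
        grad = np.zeros_like(params)
        for i in range(len(params)):
            step = np.zeros_like(params)
            step[i] = eps
            grad[i] = (loss(params + step) - loss(params - step)) / (2 * eps)
        # Update in the direction that reduces the loss, scaled by the learning rate.
        params -= learning_rate * grad
    return params

# 1. Initial random parameters; 2. repeated loss-reducing updates.
rng = np.random.default_rng(0)
best = hill_descent(lambda p: np.sum((p - 3.0) ** 2), rng.normal(size=2))
```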
Multinomial Logistic Regression
Generalizable to multinomial cases.
One set of parameters per value:
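In standard form, with a parameter vector β_k for each value k of the output variable:

```latex
P(Y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j} \exp(\beta_j^\top x)}
```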
Feature Transformations &
Non-Linear Models

Consider an OLS linear regression model generated from the dataset below.

X  Y
1  4
2  9
3  10
4  5
Feature Transformations &
Non-Linear Models
Original Data:
X  Y
1  4
2  9
3  10
4  5

Feature Transformation: T(X) = [X, X^2]

Transformed Data:
X´=X  X´´=X^2  Y
1     1        4
2     4        9
3     9        10
4     16       5

We take a non-linear transformation of our features to generate a new, transformed data set: T(X) = [X, X^2].
A feature transformation is just a function of the input features.
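As a quick sketch in Python, the transformation T(X) = [X, X^2] applied to the toy dataset (numpy assumed):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([4.0, 9.0, 10.0, 5.0])

# T(X) = [X, X^2]: each original value becomes a two-feature row.
X_transformed = np.column_stack([X, X ** 2])
print(X_transformed)
# [[ 1.  1.]
#  [ 2.  4.]
#  [ 3.  9.]
#  [ 4. 16.]]
```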
Feature Transformations &
Non-Linear Models
We now make an OLS linear
regression model from the
new transformed data. The
model is, of course, still linear.

X´=X  X´´=X^2  Y
1     1        4
2     4        9
3     9        10
4     16       5
Feature Transformations &
Non-Linear Models
Most of the plane represents impossible points. Only one curve, where X´^2 = X´´, represents possible points.

X´=X  X´´=X^2  Y
1     1        4
2     4        9
3     9        10
4     16       5
Feature Transformations &
Non-Linear Models
We can take that curve and plot the X´=X vs. Y values back into our original two-dimensional space, and we have a non-linear model!

X  Y
1  4
2  9
3  10
4  5
Feature Transformations &
Non-Linear Models
We can create non-linear models by:
• Performing non-linear feature transformations
• Creating a linear model on the transformed data
• Taking the curve of possible values (see the sketch below)
Most non-linear models are of this sort… including neural networks and deep learning models.
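A compact sketch of the whole recipe, assuming scikit-learn-style tooling; the fit is linear in [X, X^2] but traces a curve back in the original X-Y space:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([4.0, 9.0, 10.0, 5.0])

# Step 1: non-linear feature transformation T(X) = [X, X^2].
X_t = np.hstack([X, X ** 2])

# Step 2: OLS linear regression on the transformed data.
model = LinearRegression().fit(X_t, Y)

# Step 3: evaluate only along the curve of possible points (X´´ = X´^2),
# which gives a non-linear prediction curve in the original space.
grid = np.linspace(1.0, 4.0, 50)[:, None]
curve = model.predict(np.hstack([grid, grid ** 2]))
```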
Artificial Neural Networks (ANNs)
Basic ANNs are a series of one or more feature transformations
followed by linear regression (for regression tasks) or logistic
regression (for classification tasks).
• The transformations are parameterized, and the values of the
parameters are fit during training.
• ANNs learn the transformation used from the data.
Note: Multinomial logistic regression is often called softmax in the
context of ANNs.
Artificial Neural Networks
• Input, hidden, and output layers
• Dummy variables (X0, H0) serve as bias terms
• Parameters are weights on edges
• Hidden layers are feature transformations of the previous layer (see the sketch below)
• Can have multiple hidden layers
• Output layer: linear/logistic regression
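A minimal forward pass matching this picture, assuming one hidden layer; the sizes, names, and tanh activation are illustrative:

```python
import numpy as np

def forward(x, W_hidden, W_out, activation=np.tanh):
    """Weights live on the edges between layers; dummy variables supply biases."""
    x = np.append(x, 1.0)           # dummy input X0
    h = activation(W_hidden @ x)    # hidden layer: a feature transformation
    h = np.append(h, 1.0)           # dummy hidden variable H0
    return W_out @ h                # output layer: linear regression

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 3))  # 3 hidden units, 2 inputs + dummy X0
W_out = rng.normal(size=(1, 4))     # 1 output, 3 hidden units + dummy H0
y = forward(np.array([0.5, -1.2]), W_hidden, W_out)
```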
Artificial Neural Networks
• Transformations are non-linear functions of the weighted sum of the previous layer's variables.
• The sum's weights are those associated with the incoming edges.
• The non-linear function is called the 'activation' function.
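In symbols, with a_i the previous layer's variables, w_ij the weight on the incoming edge from unit i to unit j, and f the activation function:

```latex
h_j = f\left( \sum_i w_{ij} \, a_i \right)
```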
Artificial Neural Networks
Some activation functions:
• Logistic (a.k.a. Sigmoid)
• Hyperbolic Tangent
• Rectifier (a.k.a. ReLU)
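Their standard definitions, as a small Python sketch:

```python
import numpy as np

def logistic(z):            # a.k.a. sigmoid: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_tangent(z):  # tanh: squashes values into (-1, 1)
    return np.tanh(z)

def rectifier(z):           # a.k.a. ReLU: zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)
```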
Artificial Neural Networks

Logistic and tanh activation functions have problems in deep networks (the 'vanishing gradient problem'). In such cases, the rectifier is the most popular activation function.
Artificial Neural Networks
The linear model at the end of the series
of feature transformations has the
coefficients given on the final layer of
edges.
Multinomial logistic regression (softmax)
is a little more complicated, with a set of
coefficients for each value of the output
variable. This can be represented by an
additional layer.
Artificial Neural Networks
Training an ANN given training data is an optimization problem.
Most Popular Optimization Method:
Gradient Descent
Artificial Neural Networks
• Each iteration, we update all parameters based on the partial derivatives of the loss function with respect to each parameter (the gradient).
• This changes both the feature transformations and the linear output model.
• The loss function will have many local optima. It is common to run multiple restarts.
• Updates can include additional components, such as momentum.
• Backpropagation is a method for efficiently calculating the partial derivative of each parameter in a neural network (see the sketch below).
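A minimal sketch of gradient descent with backpropagation on a one-hidden-layer regression network, trained on the deck's toy dataset (layer sizes and the learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([[4.0], [9.0], [10.0], [5.0]])

# One hidden layer of 5 tanh units, then a linear output layer.
W1 = rng.normal(scale=0.5, size=(1, 5)); b1 = np.zeros(5)
W2 = rng.normal(scale=0.5, size=(5, 1)); b2 = np.zeros(1)

lr = 0.01
for _ in range(5000):
    # Forward pass.
    H = np.tanh(X @ W1 + b1)             # hidden layer (feature transformation)
    pred = H @ W2 + b2                   # linear output model
    err = pred - Y                       # gradient of 0.5 * squared error w.r.t. pred

    # Backpropagation: apply the chain rule layer by layer.
    dW2 = H.T @ err
    db2 = err.sum(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)

    # Update every parameter against its partial derivative.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```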
Artificial Neural Networks
• Calculating the gradients of the loss function over the entire training data is very time consuming.
• It is common to calculate gradients on subsets of the training data, called batches.

Trade-off: faster iterations vs. noisier convergence.

Stochastic: batch size 1. Batch: the full data. Mini-batch: moderately sized subsets of the data (see the sketch below).
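A sketch of the mini-batch split; gradient_step is a hypothetical function that updates the parameters from one batch:

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Shuffle each pass over the data, then yield moderately sized subsets."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

# batch_size=1 gives stochastic gradient descent; batch_size=len(X) is full batch.
# for X_b, Y_b in minibatches(X, Y, batch_size=32, rng=np.random.default_rng(0)):
#     gradient_step(params, X_b, Y_b)   # hypothetical per-batch update
```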
Artificial Neural Networks
Neural networks are prone to overfitting, so regularization should be used.
Regularization 'smooths' the resulting function (regression curve or decision boundary). Common methods:
• Add noise to the inputs
• Add a penalty to the loss function, penalizing large parameter values (see the sketch below)
• Dropout: at each iteration of gradient descent, randomly remove a subset of nodes from the hidden layers
• Stop before convergence of the gradient descent algorithm
