
NEURAL NETWORKS

Anush Sankaran
Senior Research Scientist
Microsoft
WHAT IS ML? TOY EXAMPLE

            Height (cm)   Weight (kg)   Gender?
Person #1   178           81            Male
Person #2   163           59            Female
Person #3   181           78            Male
Person #4   166           62            Female
Person #x   177           79            ???
WHAT IS ML? TOY EXAMPLE

Person #x   177   79   ??? → Male

How did you decide?
WHAT IS ML? TOY EXAMPLE

Treat Height as x1, Weight as x2, and Gender as y.

Mathematical form: f(x1, x2) = y. What is f(·)?
Basic ML Formulation

Mathematical form: f(x1, x2) = y. What is f(·)?

Various possible solutions:

● Rule-based systems
● Decision Trees
● Random Forests
● Logistic Regression
● Neural Networks (Multi-Layer Perceptron)
● Support Vector Machines (SVM)
● Gradient Boosted Trees
● …
WHAT IS ML? TOY EXAMPLE

Mathematical form: f(x1, x2) = y

One candidate: f(w1·x1 + w2·x2), where f is a non-linear function,
e.g. thresh(0.2·x1 + 0.8·x2):

- (0.2·x1 + 0.8·x2) > 90 → Male
- (0.2·x1 + 0.8·x2) < 90 → Female

            Height (cm)   Weight (kg)   Score = 0.2·x1 + 0.8·x2   Gender?
Person #1   178           81            100.4                     Male
Person #2   163           59            79.8                      Female
Person #3   181           78            98.6                      Male
Person #4   166           62            82.8                      Female
Person #x   177           79            98.6                      ???
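A minimal sketch of this hand-tuned rule in Python — the weights (0.2, 0.8) and the threshold 90 come from the slide above; the function name is ours:

```python
# Hand-crafted rule: score = 0.2*x1 + 0.8*x2, threshold at 90.
def predict_gender(height_cm, weight_kg, w1=0.2, w2=0.8, threshold=90.0):
    score = w1 * height_cm + w2 * weight_kg
    return "Male" if score > threshold else "Female"

people = [(178, 81), (163, 59), (181, 78), (166, 62), (177, 79)]
for height, weight in people:
    print(height, weight, predict_gender(height, weight))
# Person #x (177, 79) gets a score of 98.6, so the rule says "Male".
```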
Generalized Form: A Single Neuron

[Diagram: inputs x1 … xn feed the neuron through weights w1 … wn; the pre-activation h is followed by the activation g. The forward pass runs from the inputs to the output.]

- Input features: x1, x2, …, xn
- Parameters: w1, w2, …, wn

Neuron pre-activation: h(x) = Σi wi·xi

Neuron activation: f(x) = g(h(x)) = g(Σi wi·xi)
- g is a non-linear activation function
- Examples: sigmoid, tanh, RBF, ReLU, etc.
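A minimal sketch of the forward pass of a single neuron, assuming sigmoid as the non-linearity g (any of the activations listed above could be swapped in):

```python
import math

def sigmoid(a):
    # One possible non-linear activation g.
    return 1.0 / (1.0 + math.exp(-a))

def neuron_forward(x, w, g=sigmoid):
    # Pre-activation: h(x) = sum_i w_i * x_i
    h = sum(wi * xi for wi, xi in zip(w, x))
    # Activation: f(x) = g(h(x))
    return g(h)

print(neuron_forward([1.0, 2.0, 3.0], [0.5, -0.2, 0.1]))
```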
Generalized Form: Loss Function

Neuron activation (the prediction): y’ = f(x) = g(Σi wi·xi)

Loss function (or error function):
L(x, y) = e(y, y’) = e(y, g(Σi wi·xi))

- If the loss (error) is small, the prediction is good enough.
- If the loss (error) is large, the prediction is not good enough.
Generalized Form: Optimization Function

Neuron activation (the prediction): y’ = f(x) = g(Σi wi·xi)

- Update `w` such that the loss/error reduces:  wi = wi ± (value)

Optimization function:
- Finds the `value` for a given set of weights `w` and loss L
- Determines both the direction (±) and the magnitude of the update
- Example: Stochastic Gradient Descent (SGD)

[Diagram: the backward pass propagates the loss back through the neuron to update the weights w1 … wn.]
Generalized Form: Objective Function

Objective Function: find the parameters that minimize the total loss over the training data,

    w* = arg min over w of  Σ over training examples of  e(y, g(Σi wi·xi))
This classifier is called a Perceptron or Logistic Regression

Generalized Form: Summary

1. To predict output values:
   Activation function: y’ = g(Σi wi·xi)

2. To find out how good or bad the prediction is:
   Loss function: L(x, y) = e(y, g(Σi wi·xi))

3. To update the parameters so that the prediction improves:
   Optimization function: wi = wi ± (value)
Activation Function: Linear

● Linear activation function: g(a) = a
● No squashing of the output
● Not interesting, and does NOT help learning: a stack of purely linear neurons is still just a linear function
Activation Function: Sigmoid

● Sigmoid activation: g(a) = 1 / (1 + exp(−a))
● Squashes the output into the range (0, 1)
● Strictly increasing function
Activation Function: TanH

● Hyperbolic tangent activation: g(a) = tanh(a)
● Squashes the output into the range (−1, 1)
● Strictly increasing function
Activation Function: ReLU

● Rectified Linear activation (ReLU): g(a) = max(0, a)
● Squashes the output into the range [0, ∞)
● There is no upper bound
● Output is never negative (it is exactly 0 for all negative inputs)
● Monotonically non-decreasing (flat for negative inputs, increasing for positive ones)
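For reference, a sketch of the four activation functions discussed above in plain Python:

```python
import math

def linear(a):
    return a                           # no squashing; not useful for learning on its own

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))  # output in (0, 1)

def tanh(a):
    return math.tanh(a)                # output in (-1, 1)

def relu(a):
    return max(0.0, a)                 # output in [0, inf), no upper bound

for a in (-2.0, 0.0, 2.0):
    print(a, linear(a), sigmoid(a), tanh(a), relu(a))
```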
Loss Functions

Common choices: MSE (squared error), L1 (absolute error), cross-entropy, hinge, etc.
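A small sketch of two of these loss functions, mean squared error and binary cross-entropy; the example targets and predictions are made up:

```python
import math

def mse(y_true, y_pred):
    # Mean squared error over a batch of predictions.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}, y_pred in (0, 1); eps guards against log(0).
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred)) / len(y_true)

print(mse([1.0, 0.0], [0.9, 0.2]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```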
To be continued …

● Optimization Functions
  ○ Stochastic Gradient Descent
  ○ Challenges with Regression / Perceptron
● Multi-Layer Perceptron
  ○ Backpropagation
● Challenges with Neural Networks
● Use Cases of Neural Networks
Optimization Function: Gradient Descent

Mathematical form: f(x1, x2) = y, with f(·) = w1·x1 + w2·x2

Start with w1 = 0.2, w2 = 0.8:  f(·) = 0.2·x1 + 0.8·x2

            Height (cm)   Weight (kg)   Beauty Score   Predicted Score
            x1            x2            y              y’
Person #1   178           81            98             100.4
Person #2   163           59            97             79.8
Person #3   181           78            82             98.6
Person #4   166           62            80             82.8
Person #x   177           79            75             98.6

How good (or bad) is my prediction?
Loss function: MSE = Σ (y − y’)² = 31547.4
Optimization Function: How to Update the Parameters

Mathematical form: f(x1, x2) = y, with f(·) = w1·x1 + w2·x2

Start with w1 = 0.2, w2 = 0.8:  f(·) = 0.2·x1 + 0.8·x2
Loss: MSE = Σ (y − y’)² = 31547.4   → a very large loss; our prediction/model is bad.

Update to w1 = 0.5, w2 = 0.5:   f(·) = 0.5·x1 + 0.5·x2
Loss: MSE = Σ (y − y’)² = 4600.5    → better than before.

Update to w1 = 0.4, w2 = 0.6:   f(·) = 0.4·x1 + 0.6·x2
Loss: MSE = Σ (y − y’)² = 2429.0    → even better than before.
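A minimal sketch of this trial-and-error search: it evaluates the squared-error loss on the table's data for each candidate (w1, w2). It illustrates the procedure rather than reproducing the exact loss figures quoted above:

```python
# (height, weight, beauty score) rows from the table above.
data = [(178, 81, 98), (163, 59, 97), (181, 78, 82), (166, 62, 80), (177, 79, 75)]

def sse(w1, w2):
    # Sum of squared errors for the linear model y' = w1*x1 + w2*x2.
    return sum((y - (w1 * x1 + w2 * x2)) ** 2 for x1, x2, y in data)

for w1, w2 in [(0.2, 0.8), (0.5, 0.5), (0.4, 0.6)]:
    print(w1, w2, sse(w1, w2))
```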
The Loss Curve

Start with  w1 = 0.2, w2 = 0.8   →  L = 31547.4
Update to   w1 = 0.5, w2 = 0.5   →  L = 4600.5
Update to   w1 = 0.4, w2 = 0.6   →  L = 2429.0

[Plot: the loss curve — loss L against the parameter value θ1, with the three losses above marked at 0.2, 0.4 and 0.5.]

Update the parameters using differential calculus.
Differential Calculus to the Rescue

Question: For which values of w1 and w2 is the loss function at its minimum?

Fermat's Theorem: If f(x) has a local extremum at x = a and f is differentiable at a, then f′(a) = 0.
Differential Calculus to the Rescue

Question: How do we update the values of w1 and w2?

Move along the slope (the negative gradient of the loss) to update w1 and w2.
Gradient Descent Algorithm

Repeat until the loss stops decreasing:
    wi ← wi − η · ∂L/∂wi        (η is the learning rate)

[Diagram: the forward pass computes the prediction and the loss; the backward pass propagates the gradient of the loss back to the weights w1 … wn.]
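A minimal gradient-descent sketch for the two-weight linear model from the example; the learning rate and number of steps are arbitrary illustrative choices:

```python
# Gradient descent on L(w1, w2) = sum (y - (w1*x1 + w2*x2))^2.
data = [(178, 81, 98), (163, 59, 97), (181, 78, 82), (166, 62, 80), (177, 79, 75)]
w1, w2 = 0.2, 0.8
lr = 1e-6          # learning rate (eta), kept small because the raw features are large

for _ in range(1000):
    # Analytic gradients: dL/dw1 = sum of -2*(y - y')*x1, and similarly for w2.
    g1 = sum(-2 * (y - (w1 * x1 + w2 * x2)) * x1 for x1, x2, y in data)
    g2 = sum(-2 * (y - (w1 * x1 + w2 * x2)) * x2 for x1, x2, y in data)
    w1 -= lr * g1   # move against the gradient (downhill)
    w2 -= lr * g2

print(w1, w2)
```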
Gradient Descent in Real World
This classifier is called a Perceptron or Logistic Regression

Generalized Form: Summary

1. To predict output values:
   Activation function: y’ = sigmoid(Σi wi·xi)

2. To find out how good or bad the prediction is:
   Loss function: L(x, y) = (y − y’)²

3. To update the parameters so that the prediction improves:
   Optimization function (gradient descent): wi ← wi − η · ∂L/∂wi
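Putting the three pieces together, a sketch of training this sigmoid perceptron with gradient descent on the toy gender data; encoding Male as 1 / Female as 0, the feature scaling, the learning rate and the epoch count are our own assumptions:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy data: (height, weight) -> gender, encoded as Male = 1, Female = 0.
data = [((178, 81), 1), ((163, 59), 0), ((181, 78), 1), ((166, 62), 0)]
w = [0.0, 0.0]
lr = 0.1

for _ in range(5000):
    for (x1, x2), y in data:
        f1, f2 = x1 / 100.0, x2 / 100.0          # scale features so the sigmoid does not saturate
        y_pred = sigmoid(w[0] * f1 + w[1] * f2)  # 1. activation function
        # 2./3. gradient of the squared loss (y - y_pred)**2 w.r.t. the pre-activation,
        # then a gradient-descent step on each weight:
        grad = -2 * (y - y_pred) * y_pred * (1 - y_pred)
        w[0] -= lr * grad * f1
        w[1] -= lr * grad * f2

print(w, sigmoid(w[0] * 1.77 + w[1] * 0.79))     # prediction for Person #x
```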
Problem #1: Always Goes Through the Origin

f = g(h(x)) = g(Σi wi·xi)

With no bias term, the pre-activation is 0 at x = 0, so the decision boundary is forced through the origin.
Problem #1: Always Goes Through the Origin — Fix: Add a Bias Term

[Diagram: the same neuron with an additional bias input b alongside the weights w1 … wn.]

f = g(h(x)) = g(Σi wi·xi + b)
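A sketch of the corresponding change to the forward pass: one extra learnable parameter b added to the pre-activation:

```python
import math

def neuron_forward_with_bias(x, w, b):
    # Pre-activation now includes the bias: h(x) = sum_i w_i*x_i + b,
    # so the decision boundary no longer has to pass through the origin.
    h = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-h))   # sigmoid activation g, as before

print(neuron_forward_with_bias([1.0, 2.0], [0.5, -0.3], 0.1))
```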
Problem #2: Getting Stuck in Local Minima
Problem #2: Getting Stuck in Local Minima

Without momentum vs. with momentum: adding momentum lets each update carry along a fraction of the previous update, which helps the optimizer roll past small local minima.
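A sketch of a momentum update for a single weight; the momentum coefficient 0.9 is a typical but arbitrary choice, and the gradient values in the usage loop are made up for illustration:

```python
def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # v accumulates an exponentially decaying sum of past gradients;
    # the accumulated velocity can carry the update through small local minima.
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Usage: keep one velocity per parameter and update both every step.
w, v = 0.0, 0.0
for grad in [1.0, 0.8, -0.2, 0.1]:   # hypothetical gradient values
    w, v = sgd_momentum_step(w, v, grad)
print(w, v)
```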
Problem #3: Strictly Linear Classification

A single perceptron can only produce a linear decision boundary. How do we solve this problem?
More than One Neuron / Perceptron?

[Diagram: left — a single neuron y1’ over inputs x1 … xn with weights w1 … wn; right — two neurons y1’ and y2’ over the same inputs, each with its own weights w11, w12, …, wn1, wn2.]
Multi-Layer Perceptron (MLP) / Neural Networks
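A minimal sketch of the forward pass through a small MLP (one hidden layer plus an output layer); the layer sizes and weight values are made up for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(x, weights, biases):
    # Each row of `weights` holds the incoming weights of one neuron in the layer.
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, layers):
    # `layers` is a list of (weights, biases) pairs, applied in order.
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])   # 2 inputs -> 3 hidden
output = ([[1.0, -1.0, 0.5]], [0.0])                                  # 3 hidden -> 1 output
print(mlp_forward([1.0, 2.0], [hidden, output]))
```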
Backpropagation: How to Learn the Weights?

Backpropagation applies the chain rule to compute the gradient of the loss with respect to every weight in the network, layer by layer, so that gradient descent can update all of them.
How to Choose #Layers and #Neurons?

Problem #1: Possible Overfitting?

Problem #1: Bias-Variance Trade-Off
Solution #1: Regularization

Objective function:                          arg min over w of  Σt L(x(t), y(t))

Objective function with L1 regularization:   arg min over w of  Σt L(x(t), y(t)) + λ·Σi |wi|

Objective function with L2 regularization:   arg min over w of  Σt L(x(t), y(t)) + λ·Σi wi²
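A sketch of adding an L1 or L2 penalty to the data loss; `data_loss`, the weight list and λ (`lam`) are placeholder values you would supply:

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def regularized_objective(data_loss, weights, lam=0.01, kind="l2"):
    # Objective = data loss + regularization term; penalizing large weights
    # discourages the model from overfitting the training data.
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty

print(regularized_objective(2.5, [0.4, -1.2, 0.7], lam=0.1, kind="l1"))
```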
Problem #2: How to Choose the Learning Rate?
Problem #3: How to Initialize W & B?

● For biases: initialize all to 0
● For weights:
  ○ Cannot initialize the weights to 0 if the tanh activation function is used
    ■ All gradients will end up being zero
  ○ Cannot initialize all the weight values to the same number
    ■ It highly restricts the overall learning process
  ○ Random initialization works in most cases!
  ○ Glorot et al. showed that the following initialization is close to ideal:
    W ~ Uniform(−√(6 / (nin + nout)), +√(6 / (nin + nout)))
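A sketch of the Glorot (Xavier) uniform initialization for one weight matrix, with zero-initialized biases; fan_in and fan_out are the number of incoming and outgoing connections of the layer:

```python
import math
import random

def glorot_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): sample W uniformly from [-limit, +limit]
    # with limit = sqrt(6 / (fan_in + fan_out)).
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = glorot_uniform(fan_in=4, fan_out=3)   # 3 neurons, 4 inputs each
biases = [0.0] * 3                              # biases start at zero
print(weights, biases)
```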
Problem #4: MLP is a Universal Approximator

• A neural network with a single hidden layer is a universal approximator
• It can approximate essentially any continuous function to arbitrary accuracy, given enough hidden units
• But it is not easy to train!
Problem #5: Going Wide vs. Going Deep?

[Figure: two architectures compared — Going Wide (more neurons per layer) vs. Going Deep (more layers).]
Problem #6: Vanishing Gradient Problem

The gradient that reaches the early layers is a product of many per-layer factors λ, each with 0 ≤ λ ≤ 1; multiplying many such factors drives the gradient towards zero, so the early layers barely learn.
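A tiny numerical illustration: multiplying many per-layer factors in [0, 1] (here a constant 0.25, the maximum slope of the sigmoid, chosen purely for illustration) shrinks the gradient exponentially with depth:

```python
# Gradient reaching the first layer of a deep net is (roughly) a product of
# per-layer factors; if each factor is at most 0.25, it vanishes quickly.
grad = 1.0
for layer in range(1, 11):
    grad *= 0.25
    print(f"after {layer} layers: {grad:.2e}")
# After 10 layers the gradient has shrunk by a factor of about a million.
```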
Summary: All the Hyperparameters!

(Weights and biases are the only learnable parameters; everything below is a hyperparameter that you choose.)

● Architecture of the neural network model
  ○ Number of layers
  ○ Number of nodes in each layer
● Which activation function to use
  ○ Sigmoid, tanh, ReLU, softmax, etc.
● Which loss function to use
  ○ MSE, L1, cross-entropy, hinge, etc.
● Which optimization function to use
  ○ SGD, mini-batch gradient descent, Adam, RMSProp
● Which initialization method to use for the weights and biases
● Which learning rate to use: lr = [0.1, 0.01, 0.001, 0.0001]; dynamic learning rates are possible!
● Which regularization to use, and what regularization factor?
No Free Lunch Theorem!

● There are no universal solutions to ML problems.
● All ML approaches are equally good if we do not place strong assumptions on the input data.
● For every ML algorithm, there exists a sample or sample class where it outperforms some other method.
How To Learn Features?

How to learn these features?

Unsupervised Feature Learning

[Diagram: left — a neural network whose hidden layer turns the input data into features; right — an AutoEncoder, with an input layer, a hidden layer of features, and an output layer that reconstructs the input data.]
Going Deep: Deep Learning

Stacked AutoEncoder = Deep Learning

[Diagram: the input data is passed through encoding layers Hidden#1, Hidden#2 and Hidden#3, then through matching decoding layers that reconstruct the input data.]
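A minimal sketch of a single-hidden-layer autoencoder forward pass and its reconstruction loss; the layer sizes and weight values are made up, and training (by backpropagation, as before) is omitted:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dense(x, weights, biases):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def autoencoder_forward(x, enc, dec):
    features = dense(x, *enc)                  # encode: input -> hidden features
    reconstruction = dense(features, *dec)     # decode: features -> input size
    return features, reconstruction

def reconstruction_loss(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

enc = ([[0.4, -0.2, 0.1], [0.3, 0.5, -0.4]], [0.0, 0.0])          # 3 inputs -> 2 features
dec = ([[0.7, -0.1], [0.2, 0.6], [-0.3, 0.5]], [0.0, 0.0, 0.0])   # 2 features -> 3 outputs
x = [0.9, 0.1, 0.4]
features, x_hat = autoencoder_forward(x, enc, dec)
print(features, x_hat, reconstruction_loss(x, x_hat))
```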
Going Deep: Deep Learning Classifier

[Diagram: the encoding layers Hidden#1, Hidden#2 and Hidden#3 are kept, and the deepest features feed a classifier.]
Convolutional Neural Networks: n-Dimensional Data
Recurrent Neural Networks: Sequential Data

Everything about NNs

Questions?
