Recitation 1
MIT - 6.802 / 6.874 / 20.390 / 20.490 / HST.506 - Spring 2021
Jackie Valeri
https://bedatalab.github.io
C. Neural networks
Input x ∈ X:
• features (in machine learning)
• predictors (in statistics)
• independent variables (in statistics)
• regressors (in regression models)
• input variables
• covariates

Output y ∈ Y:
• labels (in machine learning)
• responses (in statistics)
• dependent variables (in statistics)
• regressand (in regression models)
• target variables
Training set S_training = {(x^(i), y^(i))}_{i=1}^N ∈ {X, Y}^N, where N is the number of training examples
An example is a collection of features (and an associated label)
Training: use Straining to learn functional relationship f : X → Y
What can you do with ML? Terminology
f : X → Y
f(x; θ) = ŷ
θ:
• weights and biases (intercepts)
• coefficients β
• parameters
Think of weights and biases like lots and lots of y = mx + b.
f:
• model
• hypothesis h
• classifier
• predictor
• discriminative models: P(Y|X)
• generative models: P(X, Y)
R01 Outline
C. Neural networks
Cross-validation (6-fold)
https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
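As a sketch, 6-fold cross-validation needs nothing more than a permutation of sample indices split into six folds, each serving once as the validation fold (the helper name `kfold_indices` is ours, not from the slides):

```python
import numpy as np

def kfold_indices(n_samples, k=6, seed=0):
    """Yield (train, val) index arrays; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

Every sample lands in exactly one validation fold, so each model is evaluated on data it never saw during training.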
Basics of machine learning: objective functions
An objective function J(Θ) is the function that you optimize when training machine learning models.
It usually takes the form of one of the following (or a combination of them):
https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
Basics of machine learning: objective functions
An objective function J (Θ) is the function that you optimize when training machine learning models.
It is usually in the form of (but not limited to) one or combinations of the following:
https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
Basics of machine learning: objective functions
An objective function J (Θ) is the function that you optimize when training machine learning models.
It is usually in the form of (but not limited to) one or combinations of the following:
https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
Basics of machine learning: loss
Task → Loss
• Regression (penalize large errors) → mean squared error
• Classification (binary) → binary cross-entropy
• Classification (multi-class) → categorical cross-entropy
• Generative → e.g., KL divergence
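Minimal NumPy versions of the first three losses (the clipping constants are ours, added only to avoid log(0)):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors quadratically."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities p."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs):
    """Categorical cross-entropy for one-hot labels and per-class probabilities."""
    eps = 1e-12
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))
```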
Basics of machine learning: metrics
True positive rate = # true positives / # actually positive (recall is the same as true positive rate!)
False positive rate = # false positives / # actually negative = 1 − specificity
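These definitions as a quick NumPy check (the function name is ours):

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """True and false positive rates from binary labels (1 = positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)   # recall / sensitivity
    fpr = fp / (fp + tn)   # 1 - specificity
    return tpr, fpr
```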
R01 Outline
C. Neural networks
I. Perceptrons to neurons
II. Activation functions
III. Training with backpropagation
IV. Gradient descent
V. Regularization
Neural networks: perceptrons to neurons
Neural networks: single layer feed-forward NN
Neural networks: activation functions
Step function
Sigmoid
Softmax
Hyperbolic tangent
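NumPy sketches of the four activations just listed (the max-shift inside softmax is a standard trick for numerical stability):

```python
import numpy as np

def step(z):
    """Step function: 1 where z > 0, else 0 (the original perceptron activation)."""
    return (z > 0).astype(float)

def sigmoid(z):
    """Sigmoid: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes z into (-1, 1)."""
    return np.tanh(z)

def softmax(z):
    """Softmax: turns a vector of scores into probabilities summing to 1."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()
```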
Neural networks: activation functions
Task → Output activation
• Regression (penalize error linearly) → linear (ReLU, Leaky ReLU, etc.)
• Classification (multi-class) → softmax
[Figure: L-layer feed-forward network. Input X = A0 flows through layers l = 1, …, L as Zl = Wl A(l−1) + bl and Al = fl(Zl); the final activation AL feeds the loss.]
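The forward pass in the figure can be sketched as a chain of affine maps plus activations (a row-major convention, A @ W + b, and all names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, weights, biases, activations):
    """Forward pass: A0 = X; Zl = A(l-1) @ Wl + bl; Al = fl(Zl); returns AL."""
    A = X
    for W, b, f in zip(weights, biases, activations):
        Z = A @ W + b
        A = f(Z)
    return A

# Two-layer example: 4 inputs -> 3 hidden (ReLU) -> 2 outputs (linear).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)
out = forward(rng.normal(size=(5, 4)), [W1, W2], [b1, b2], [relu, lambda z: z])
```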
First, let's break down how the loss depends on the final layer, using the shorthand from the previous figure. Then, to propagate through the whole network, we keep applying the chain rule until we reach the first layer. If you spend a few minutes looking at the matrix dimensions, it becomes clear that this is an informal derivation; here are the dimensions to think about:
The weight update at step t is

Δw_t = −ε (∂E_t/∂w_{t−1} + λ w_{t−1}) + η Δw_{t−1}

where
• ∂E_t/∂w_{t−1} is the gradient, the partial derivative of the error E_t with respect to the weight w_{t−1};
• ε is often termed the "learning rate" (usually set to 0.1 or smaller), needed so as not to overshoot the optimal solution;
• λ controls "weight decay", penalizing large weights w to prevent overfitting (discussed further below);
• η is the momentum, weighting the previous update Δw_{t−1}; when the direction of the update is consistent, this speeds up convergence.

For large datasets, stochastic gradient descent (SGD) is often used: randomly sample a batch of examples and update the weights on that batch.
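A one-function sketch of this update rule, assuming the combined form Δw_t = −ε(∂E_t/∂w + λw) + ηΔw_{t−1} (hyperparameter values here are illustrative, not from the slides):

```python
def sgd_step(w, grad, delta_prev, eps=0.01, lam=1e-4, eta=0.9):
    """One gradient-descent update with weight decay and momentum.

    delta_t = -eps * (grad + lam * w) + eta * delta_prev
    Returns (new weight, delta_t) so the caller can carry the momentum term.
    """
    delta = -eps * (grad + lam * w) + eta * delta_prev
    return w + delta, delta

# Demo: minimize E(w) = w^2 (gradient 2w) starting from w = 5.
w, d = 5.0, 0.0
for _ in range(300):
    w, d = sgd_step(w, 2.0 * w, d)
```

With a consistent downhill direction, the momentum term accumulates and the iterate approaches the minimum at w = 0.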
Neural networks: gradient descent
Neural networks: regularization
(Goodfellow 2016)
Neural networks: regularization via validation set, early stopping
Split the data into a training set, a validation set, and a test set.
Leave out a small "validation set":
• not used to train the model
• used to evaluate the model at each epoch/iteration (VCE, validation cross-entropy)
• stop when the VCE increases, to prevent overfitting
[Figure: cross-entropy per case vs. learning epoch (0–45) for training, validation, and test sets; training CE keeps falling while validation CE eventually turns upward, marking the early-stopping point.]
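Early stopping can be sketched as tracking the validation-loss minimum and stopping once it has not improved for a few epochs (the function name and the `patience` parameter are ours):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch of the validation-loss minimum; scanning stops once
    the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Validation CE falls, then rises: the model from epoch 3 is kept.
stop = early_stopping_epoch([4.6, 4.2, 3.9, 3.4, 3.6, 3.8, 4.0])
```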
Neural networks: regularization
Penalize large weights: objective function with ridge (L2) regularization,
J_reg(Θ) = J(Θ) + λ‖Θ‖²
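The ridge-regularized objective in NumPy (here `data_loss` stands for any data-fit term J(θ), and λ = 0.1 is illustrative):

```python
import numpy as np

def ridge_objective(theta, data_loss, lam=0.1):
    """J_reg(theta) = J(theta) + lam * ||theta||_2^2 : data loss plus an L2
    penalty that discourages large weights."""
    return data_loss(theta) + lam * np.sum(theta ** 2)
```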
R01 Outline
C. Neural networks
Problem Set 1
Classification
input space:
X = {0, 1, …, 255}^{28×28}
after rescaling:
X′ = [0, 1]^{28×28}
after flattening:
X″ = [0, 1]^{784}
PS1: Data
y: tf.placeholder, shape [None, 10]
x: tf.placeholder, shape [None, 784]
W: tf.Variable, shape [784, 10]
b: tf.Variable, shape [10]
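PS1 declares these as TensorFlow 1.x placeholders and variables; as a framework-free check of the shapes, here is a NumPy stand-in for the affine layer they define (the batch size 32 is arbitrary):

```python
import numpy as np

# Shapes from the table above: x [None, 784], W [784, 10], b [10], y [None, 10].
batch = 32
x = np.zeros((batch, 784))   # a batch of flattened 28x28 images
W = np.zeros((784, 10))      # weights: one column of 784 values per class
b = np.zeros(10)             # one bias per class
logits = x @ W + b           # [batch, 784] @ [784, 10] + [10] -> [batch, 10]
```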
PS1: Gradient of loss with respect to logits
Correct Label
http://cs231n.github.io/neural-networks-case-study/
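The cs231n result sketched here: for softmax followed by cross-entropy, the gradient of the mean loss with respect to the logits is simply the predicted probabilities with 1 subtracted at the correct class, divided by the batch size. A minimal NumPy version (function names are ours):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with a max-shift for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_wrt_logits(z, y_idx):
    """Gradient of mean cross-entropy loss w.r.t. logits: (probs - one_hot(y)) / n."""
    n = z.shape[0]
    dz = softmax(z)
    dz[np.arange(n), y_idx] -= 1.0  # subtract 1 at each example's correct class
    return dz / n
```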
PS1: Implementation
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
Next week
C. Neural networks
Non-parametric models
Momentum
Adam
S = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}
function knn(k, dist, S_train, S_test):
    votes = []
    for i = 1 to len(S_test)
        d = []
        for j = 1 to len(S_train)
            d[j] = dist(x_i, x_j)
        d = argsort(d)
        votes[i] = most_common(labels[d[1:k]])
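The pseudocode above as runnable Python (names and the Euclidean distance choice are ours):

```python
import numpy as np
from collections import Counter

def knn(k, dist, X_train, y_train, X_test):
    """For each test point, majority vote among the labels of its k nearest
    training points under the given distance function."""
    votes = []
    for x in X_test:
        d = np.array([dist(x, xt) for xt in X_train])
        nearest = np.argsort(d)[:k]  # indices of the k smallest distances
        votes.append(Counter(y_train[i] for i in nearest).most_common(1)[0][0])
    return votes

def euclid(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
```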
Neural networks: gradient descent - batch gradient update
https://arxiv.org/pdf/1412.6980.pdf
Neural networks: automatic differentiation
Dual numbers: forward automatic differentiation
Instead of truncating a Taylor series numerically (numerical differentiation), augment each value with a "dual number" x + ẋε, where ε² = 0; carrying the ε-part through the computation yields the exact derivative as the end result.
https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf
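A toy forward-mode AD via dual numbers, supporting only + and × (class and function names are ours):

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0; the eps-part carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps^2 = 0
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def deriv(f, x):
    """Forward-mode AD: evaluate f at x + eps; the eps coefficient is f'(x)."""
    return f(Dual(x, 1.0)).b
```

For example, f(x) = 3x² + 2x has f'(4) = 6·4 + 2 = 26, which the eps-part recovers exactly (no finite-difference error).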