
Introduction to machine learning

Recitation 1
MIT - 6.802 / 6.874 / 20.390 / 20.490 / HST.506 - Spring 2021
Jackie Valeri

Slides adapted from Sachit Saksena and previous course materials


The BE Data & Coding Lab

https://bedatalab.github.io

A peer-to-peer educational community supporting computational novices,
competent practitioners, and experts in their journey to learn new
languages and use those languages to answer important world problems.

We are a safe space where it is okay to ask for help.
We are a confidential space.
We will help you answer your questions; we will not solve your problems for you.

We can help with:
• problem sets for classes
• brainstorming ways to incorporate computational modeling into your projects or preliminary project design
• brainstorming and implementing computation into your research projects
• code review, code efficiency, and reproducibility
• much more – just ask if you aren't sure!
BE Data and Coding Lab
Inaugural Cohort

Pablo Cardenas, Dan Anderson, Itai Levin, Maxine Jonas, Patrick Holec,
Divya Ramamoorthy, Meelim Lee, Krista Pullen, Miguel Alcantar, Jackie Valeri

Python, Matlab, R, COMSOL


Onto recitation R01!

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks

D. Brief preview of pset 1


What can you do with ML?
What can you do with ML? Terminology

Input x ∈ X:
• features (in machine learning)
• predictors (in statistics)
• independent variables (in statistics)
• regressors (in regression models)
• input variables
• covariates

Output y ∈ Y:
• labels (in machine learning)
• responses (in statistics)
• dependent variables (in statistics)
• regressand (in regression models)
• target variables

Training set S_training = {(x^(i), y^(i))}, i = 1 … N, ∈ {X, Y}^N, where N is the number of training examples.
An example is a collection of features (and an associated label).
Training: use S_training to learn the functional relationship f : X → Y.
What can you do with ML? Terminology

f : X → Y
f(x; θ) = ŷ

θ:
• weights and biases (intercepts)
• coefficients β
• parameters
Think of weights and biases as lots and lots of y = mx + b.

f:
• model
• hypothesis h
• classifier
• predictor
• discriminative models: P(Y|X)
• generative models: P(X, Y)
R01 Outline

A. What can you do with ML?

B. Basics of machine learning


I. Get data
II. Identify the space of possible solutions
III. Formulate an objective
IV. Choose algorithm
V. Train (loss)
VI. Validate results (metrics)

C. Neural networks

D. Brief preview of pset 1


Basics of machine learning: data
Train-test split (1-fold)

Cross-validation (6-fold)

https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
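
A minimal sketch of both splitting strategies, assuming scikit-learn and NumPy are available (the toy arrays X and y and the fold counts are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Toy data: 12 examples, 4 features each.
X = np.arange(48).reshape(12, 4)
y = np.arange(12)

# Single train-test split (hold out 25% of the data for testing).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 6-fold cross-validation: each example lands in the held-out fold exactly once.
kf = KFold(n_splits=6, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # ...fit the model on (X_tr, y_tr), evaluate on (X_te, y_te)...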
Basics of machine learning: objective functions

An objective function J(Θ) is the function that you optimize when training machine learning models.
It is usually in the form of (but not limited to) one or combinations of the following:

Loss / cost / error function L(ŷ, y):
  Classification
  • 0-1 loss
  • cross-entropy loss
  • hinge loss
  Regression
  • mean squared error (MSE, L2 norm)
  • mean absolute error (MAE, L1 norm)
  • Huber loss (hybrid between L1 and L2 norm)
  Probabilistic inference
  • Kullback-Leibler divergence (KL divergence)

Likelihood function / posterior:
• negative log-likelihood (NLL) in maximum likelihood estimation (MLE)
• posterior in maximum a posteriori estimation (MAP)

Regularizers and constraints

https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
Basics of machine learning: loss

Task                                    Loss
Regression (penalize large errors)      mean squared error (MSE)
Regression (penalize error linearly)    mean absolute error (MAE)
Classification (binary)                 binary cross-entropy
Classification (multi-class)            categorical cross-entropy
Generative                              Kullback-Leibler divergence
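
A minimal NumPy sketch of these losses, written from their standard definitions (function names and the eps clipping are illustrative choices, not taken from the pset):

import numpy as np

def mse(y_hat, y):                    # penalizes large errors quadratically
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):                    # penalizes errors linearly
    return np.mean(np.abs(y_hat - y))

def binary_cross_entropy(p_hat, y, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    # p_hat: rows of predicted class probabilities; y_onehot: one-hot labels
    return -np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=1))

def kl_divergence(p, q, eps=1e-12):   # D_KL(p || q) for discrete distributions
    return np.sum(p * np.log((p + eps) / (q + eps)))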
Basics of machine learning: loss
Basics of machine learning: metrics

Precision = # true positives / # predicted positive

True positive rate = sensitivity = # true positives / # actually positive
Recall is the same as true positive rate!

False positive rate = 1 − specificity = # false positives / # all condition negatives
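
A minimal NumPy sketch computing these metrics from binary predictions (the example labels are made up for illustration):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives

precision = tp / (tp + fp)        # of predicted positives, how many are right
recall    = tp / (tp + fn)        # true positive rate / sensitivity
fpr       = fp / (fp + tn)        # false positive rate = 1 - specificity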
R01 Outline

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks
I. Perceptrons to neurons
II. Activation functions
III. Training with backpropagation
IV. Gradient descent
V. Regularization

D. Brief preview of pset 1


Neural networks: perceptrons to neurons

Neural networks: single layer feed-forward NN
Neural networks: activation functions

Step function

Sigmoid

Rectified linear unit

Softmax

Hyperbolic tangent
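
A minimal NumPy sketch of these activations, written from their standard definitions:

import numpy as np

def step(z):    return (z > 0).astype(float)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)
def tanh(z):    return np.tanh(z)

def softmax(z):
    # Subtract the max for numerical stability; normalize to a distribution.
    e = np.exp(z - np.max(z))
    return e / e.sum()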
Neural networks: activation functions

Task                                    Activation                       Loss
Regression (penalize large errors)      Linear (ReLU, Leaky ReLU, etc)   mean squared error (MSE)
Regression (penalize error linearly)    Linear (ReLU, Leaky ReLU, etc)   mean absolute error (MAE)
Classification (binary)                 Sigmoid, tanh                    binary cross-entropy
Classification (multi-class)            Softmax                          categorical cross-entropy
Generative                              Linear (ReLU, Leaky ReLU, etc)   Kullback-Leibler divergence

Other considerations: gradient intensity, computational cost of the activation, exploding/vanishing gradients, and depth of the network (stacked purely linear layers collapse into a single linear layer, so a deep linear network is useless).
Neural networks: training with backpropagation

[Figure: forward pass through an L-layer network. The input X enters layer 1 (weights W1, bias b1, pre-activation Z1, activation f1, output A1), then layer 2 (W2, b2, Z2, f2, A2), and so on through layer L (WL, bL, ZL, fL, AL), whose output feeds the loss.]

Inspired by 6.036 lecture notes (Leslie Kaelbling)




Neural networks: training with backpropagation

Let's use the following shorthand from the previous figure:

    Zl = Wl·Al−1 + bl,    Al = fl(Zl),    A0 = X

First, let's break down how the loss depends on the final layer. By the chain rule,

    ∂Loss/∂WL = (∂Loss/∂AL)·(∂AL/∂ZL)·(∂ZL/∂WL)

Since ∂ZL/∂WL = AL−1 and ∂AL/∂ZL = fL′(ZL), we can re-write the equation as

    ∂Loss/∂WL = (∂Loss/∂AL)·fL′(ZL)·AL−1

Since we have the outputs of every layer from the forward pass, all we need to compute for the gradient of the last layer with respect to the weights is the gradient of the loss with respect to the pre-activation output ZL.

Now, to propagate through the whole network, we can keep applying the chain rule until the first layer of the network:

    ∂Loss/∂Wl = (∂Loss/∂AL)·(∂AL/∂ZL)·(∂ZL/∂AL−1)·…·(∂Al/∂Zl)·(∂Zl/∂Wl)

If you spend a few minutes looking at matrix dimensions, it becomes clear that this is an informal derivation; writing each factor with the transposes needed for the matrix products to conform is a worthwhile exercise. On your own time!

Inspired by 6.036 lecture notes (Leslie Kaelbling)
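
A minimal NumPy sketch of one forward and backward pass for a two-layer network with MSE loss (the shapes, layer sizes, and use of ReLU are illustrative assumptions, not taken from the pset):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 examples, 3 features
y = rng.normal(size=(5, 1))

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # layer 1
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # layer 2

# Forward pass: Z_l = A_{l-1} W_l + b_l, A_l = f_l(Z_l)
Z1 = X @ W1 + b1
A1 = np.maximum(0, Z1)               # ReLU
Z2 = A1 @ W2 + b2
A2 = Z2                              # linear output
loss = np.mean((A2 - y) ** 2)

# Backward pass: apply the chain rule layer by layer.
dZ2 = 2 * (A2 - y) / len(y)          # dLoss/dZ2 (linear output layer)
dW2 = A1.T @ dZ2                     # dLoss/dW2
db2 = dZ2.sum(axis=0)
dZ1 = (dZ2 @ W2.T) * (Z1 > 0)        # propagate back through the ReLU
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)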


Neural networks: gradient descent

Gradient-based learning: use the partial derivative of the error to update the network weights.

    Δw_t = −ε·(∂E/∂w + λ·w_{t−1}) + η·Δw_{t−1}
    w_t = w_{t−1} + Δw_t

where
• ∂E/∂w is the gradient: the partial derivative of the error E with respect to weight w
• ε is the learning rate (usually set to 0.1 or smaller), kept small so as not to overshoot the optimal solution
• λ controls weight decay: it penalizes large weights w to prevent overfitting
• η is the momentum: it re-applies the previous change Δw_{t−1}, so when the direction of the update is consistent, convergence speeds up

For large datasets, stochastic gradient descent (SGD) is often used: randomly sample a batch of examples and update the weights on that batch.
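
A minimal NumPy sketch of this update rule (the hyperparameter values and the toy gradient are placeholders):

import numpy as np

def sgd_momentum_step(w, grad, delta_prev, eps=0.01, lam=1e-4, eta=0.9):
    # delta = -learning_rate * (gradient + weight_decay * w) + momentum * previous_delta
    delta = -eps * (grad + lam * w) + eta * delta_prev
    return w + delta, delta

# Usage: keep the previous update around between steps.
w = np.zeros(10)
delta = np.zeros_like(w)
for step in range(100):
    grad = 2 * w - 1                 # placeholder gradient of a toy quadratic error
    w, delta = sgd_momentum_step(w, grad, delta)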
Neural networks: gradient descent
Neural networks: regularization

We need to control model complexity for good generalization

(Goodfellow 2016)
Neural networks: regularization via validation set, early stopping

[Diagram: the data are split into a training set, a validation set, and a test set.]

Leave out a small "validation set":
• not used to train the model
• used to evaluate the model at each epoch/iteration (VCE, validation cross-entropy)
• stop when VCE increases, to prevent overfitting

[Figure: cross-entropy per case vs. learning epoch for the training, validation, and test sets. Training CE keeps decreasing with more epochs, while validation CE eventually stops improving, which signals when to stop training.]
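
A minimal sketch of early stopping in plain Python (train_one_epoch, evaluate, model, and the data sets are hypothetical helpers, not a real API):

max_epochs, patience = 50, 3
best_vce, best_weights, bad_epochs = float("inf"), None, 0

for epoch in range(max_epochs):
    train_one_epoch(model, training_set)        # hypothetical helper
    vce = evaluate(model, validation_set)       # validation cross-entropy (VCE)
    if vce < best_vce:
        best_vce, best_weights, bad_epochs = vce, model.get_weights(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # VCE stopped improving: stop early
            break

model.set_weights(best_weights)                 # roll back to the best epoch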
Neural networks: regularization

Objective function with ridge (L2) regularization: penalize large weights by adding their squared norm to the objective,

    J_reg(θ) = J(θ) + λ·‖θ‖₂²

Dropout: randomly zero out a fraction of the units during training, so the network cannot rely too heavily on any single unit.
R01 Outline

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks

D. Brief preview of pset 1


PS1: TensorFlow Warm Up
PS1: Data

Problem Set 1: Classification

input space:
X = {0, 1, …, 255}^(28×28)

after rescaling:
X′ = [0, 1]^(28×28)

after flattening:
X″ = [0, 1]^784
PS1: Data

integer-encoded label space:
Y_i = {0, 1, …, 9}

one-hot-encoded label space:
Y_h = [0, 1]^10

One-hot encoding turns labels 1, 2, 3 into

    1 0 0
    0 1 0
    0 0 1
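
A minimal NumPy sketch of one-hot encoding using the np.eye trick:

import numpy as np

labels = np.array([1, 2, 3])          # integer-encoded labels
one_hot = np.eye(10)[labels]          # each row is a length-10 indicator vector
# one_hot[0] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  (a 1 in position 1)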
PS1: Structure

Problem Set 1

x ∈ [0, 1]^784        input image (flattened)
W ∈ R^(784×10)        weights
b ∈ R^10              biases
ŷ ∈ [0, 1]^10         predicted class probabilities

f(x; W, b) = φ_softmax(Wᵀx + b)

[Computation graph: x and W feed tf.matmul, b is added (+), and tf.nn.softmax produces ŷ, which feeds the loss function and optimizer.]

Tensor   Node            Shape
y        tf.placeholder  [None, 10]
x        tf.placeholder  [None, 784]
W        tf.Variable     [784, 10]
b        tf.Variable     [10]
PS1: Gradient of loss with respect to logits

i goes from 0 to the number of training examples; j goes from 0 to 9; y_i is the actual label.

    z_i = Wᵀx_i + b                      logits: one entry per digit class
    p_k = e^{z_k} / Σ_m e^{z_m}          softmax probabilities
    L_i = −log(p_{y_i})                  cross-entropy loss for example i

    ∂L_i/∂z_j = p_{ij} − 𝟙(y_i = j)      used to update the weights for class j

If y_i (the actual label) equals j, then ∂L_i/∂z_j = p_{ij} − 1; otherwise it is p_{ij}.

Example: p_i = [0.6, 0.3, 0.1] with the correct label in the second position gives the gradient [0.6, −0.7, 0.1].

http://cs231n.github.io/neural-networks-case-study/
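
A minimal NumPy sketch of this gradient, mirroring the cs231n reference above (the logits and label are made-up values):

import numpy as np

z = np.array([[2.0, 1.3, 0.2]])        # logits for one example, 3 classes
y = np.array([1])                      # correct class index

# Softmax probabilities (stabilized by subtracting the max logit).
exp_z = np.exp(z - z.max(axis=1, keepdims=True))
p = exp_z / exp_z.sum(axis=1, keepdims=True)

loss = -np.log(p[np.arange(len(y)), y]).mean()

# Gradient of the loss with respect to the logits: p_ij - 1(y_i = j)
dz = p.copy()
dz[np.arange(len(y)), y] -= 1.0
dz /= len(y)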
PS1: Implementation

import tensorflow as tf   # written against the TensorFlow 1.x graph API
import numpy as np

rng = np.random
# n_samples and learning_rate are assumed to be defined elsewhere in the pset

# tf Graph input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights (randomly initialized scalars)
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

# Construct a linear model: pred = W*X + b
pred = tf.add(tf.multiply(X, W), b)   # tf.mul was renamed tf.multiply in TF 1.0

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)

# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
Next week

More neural network review


Convolutional neural networks
Recurrent neural networks
R01 Outline

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks

D. Brief preview of pset 1

E. BONUS CONTENT if time!


BONUS CONTENT!

Non-parametric models

Gradient descent – batch vs stochastic

Momentum

Adam

Lecture question: “how do we actually calculate these derivatives?”


Answer: automatic differentiation
Basics of machine learning: non-parametric models

D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}

from collections import Counter

def knn(k, dist, D_train, D_test):
    # D_train: list of (x, y) pairs; D_test: list of query points x
    votes = []
    for x_test in D_test:
        d = [dist(x_test, x_train) for x_train, _ in D_train]
        nearest = sorted(range(len(d)), key=lambda j: d[j])[:k]   # k closest training points
        labels = [D_train[j][1] for j in nearest]
        votes.append(Counter(labels).most_common(1)[0][0])        # majority vote
    return votes
Neural networks: gradient descent - batch gradient update

Gradient of the objective J with respect to the parameter vector W: ∇_W J

Batch gradient update: W_t = W_{t−1} − η·∇_W J(W_{t−1})

Pseudocode for the gradient update algorithm (grad_J, the function computing ∇_W J, is passed in explicitly here):

def gradient_update(W_init, eta, J, grad_J, eps):
    W_prev = W_init
    W = W_prev - eta * grad_J(W_prev)
    # Stop when the objective barely changes; this is an arbitrary update criterion.
    while abs(J(W) - J(W_prev)) > eps:
        W_prev = W
        W = W_prev - eta * grad_J(W_prev)
    return W
Neural networks: gradient descent - stochastic gradient descent

Stochastic gradient update (per randomly sampled training example):

import random

def sgd(W_init, eta, grad_f, T, n):
    # eta(t): step-size schedule; grad_f(W, i): gradient on training example i
    W = W_init
    for t in range(1, T + 1):
        i = random.randrange(n)          # randomly select i ∈ {1, 2, ..., n}
        W = W - eta(t) * grad_f(W, i)
    return W

Mini-batch gradient descent: the same idea, except each update uses the average gradient over a small randomly sampled batch of examples instead of a single example.

Optimization: momentum

How can we make momentum better? Nesterov momentum: look ahead along the velocity before computing the gradient.

Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006


Optimization: adam

https://arxiv.org/pdf/1412.6980.pdf
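
A minimal NumPy sketch of the Adam update, following Algorithm 1 of the paper above (default hyperparameters from Kingma & Ba):

import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t counts steps starting at 1 (needed for the bias correction).
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v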
Neural networks: automatic differentiation

Dual numbers and forward automatic differentiation

Augment the input of a standard Taylor series (as in numerical differentiation) with a "dual number" ε:

    f(x + ε) = f(x) + f′(x)·ε + f″(x)·ε²/2! + …

Because dual numbers have the (manufactured) property ε² = 0, the Taylor series simplifies to

    f(x + ε) = f(x) + f′(x)·ε

which recovers the function output as well as the first derivative. End result: exact derivatives, computed alongside the function itself in a single evaluation.

https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf
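
A minimal sketch of forward-mode autodiff with dual numbers in plain Python (the Dual class and operator overloads are illustrative):

class Dual:
    # Represents a + b*eps, where eps**2 == 0.
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b            # value, derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x        # f'(x) = 6x + 2

result = f(Dual(4.0, 1.0))          # seed the derivative of x with 1
print(result.a, result.b)           # 56.0 26.0 -> f(4) and f'(4)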
