
Introduction to machine learning

Recitation 1
MIT - 6.802 / 6.874 / 20.390 / 20.490 / HST.506 - Spring 2021
Jackie Valeri

Slides adapted from Sachit Saksena and previous course materials


The BE Data & Coding Lab

https://bedatalab.github.io

A peer-to-peer educational community supporting computational novices,
competent practitioners, and experts in their journey to learn new
languages and use those languages to answer important world problems.

We are a safe space where it is okay to ask for help.
We are a confidential space.
We will help you answer your questions; we will not solve your problems for you.

We can help with:
• problem sets for classes
• brainstorming ways to incorporate computational modeling into your projects or preliminary project design
• brainstorming and implementing computation into your research projects
• code review, code efficiency, and reproducibility
• much more – just ask if you aren't sure!
BE Data and Coding Lab
Inaugural Cohort

Pablo Cardenas, Dan Anderson, Itai Levin, Maxine Jonas, Patrick Holec,
Divya Ramamoorthy, Meelim Lee, Krista Pullen, Miguel Alcantar, Jackie Valeri

Python, Matlab, R, COMSOL


Onto recitation R01!

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks

D. Brief preview of pset 1


What can you do with ML?
What can you do with ML? Terminology

Input x ∈ X:
• features (in machine learning)
• predictors (in statistics)
• independent variables (in statistics)
• regressors (in regression models)
• input variables
• covariates

Output y ∈ Y:
• labels (in machine learning)
• responses (in statistics)
• dependent variables (in statistics)
• regressand (in regression models)
• target variables

Training set S_training = {(x^(i), y^(i))}, i = 1 … N, ∈ {X, Y}^N, where N is the number of training examples.
An example is a collection of features (and an associated label).
Training: use S_training to learn the functional relationship f : X → Y.
What can you do with ML? Terminology

f : X → Y
f(x; θ) = ŷ

θ:
• weights and biases (intercepts)
• coefficients β
• parameters
Think of weights and biases as lots and lots of y = mx + b.

f:
• model
• hypothesis h
• classifier
• predictor
• discriminative models: P(Y|X)
• generative models: P(X, Y)
R01 Outline

A. What can you do with ML?

B. Basics of machine learning


I. Get data
II. Identify the space of possible solutions
III. Formulate an objective
IV. Choose algorithm
V. Train (loss)
VI. Validate results (metrics)

C. Neural networks

D. Brief preview of pset 1


Basics of machine learning: data
Train-test split (1-fold)

Cross-validation (6-fold)

https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
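
A minimal sketch of both splitting strategies, assuming scikit-learn and NumPy are available (the toy arrays X and y and the fold counts are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Toy data: 12 examples, 4 features each.
X = np.arange(48).reshape(12, 4)
y = np.arange(12)

# Single train-test split (hold out 25% of the data for testing).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 6-fold cross-validation: each example lands in the held-out fold exactly once.
kf = KFold(n_splits=6, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # ...fit the model on (X_tr, y_tr), evaluate on (X_te, y_te)...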
Basics of machine learning: objective functions

An objective function J(Θ) is the function that you optimize when training machine learning models.
It is usually in the form of (but not limited to) one or combinations of the following:

Loss / cost / error function L(ŷ, y):
  Classification
  • 0-1 loss
  • cross-entropy loss
  • hinge loss
  Regression
  • mean squared error (MSE, L2 norm)
  • mean absolute error (MAE, L1 norm)
  • Huber loss (hybrid between L1 and L2 norm)
  Probabilistic inference
  • Kullback-Leibler divergence (KL divergence)

Likelihood function / posterior:
• negative log-likelihood (NLL) in maximum likelihood estimation (MLE)
• posterior in maximum a posteriori estimation (MAP)

Regularizers and constraints

https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
Basics of machine learning: loss

Task                                    Loss
Regression (penalize large errors)      mean squared error (MSE)
Regression (penalize error linearly)    mean absolute error (MAE)
Classification (binary)                 binary cross-entropy
Classification (multi-class)            categorical cross-entropy
Generative                              Kullback-Leibler divergence
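
A minimal NumPy sketch of these losses, written from their standard definitions (function names and the eps clipping are illustrative choices, not taken from the pset):

import numpy as np

def mse(y_hat, y):                    # penalizes large errors quadratically
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):                    # penalizes errors linearly
    return np.mean(np.abs(y_hat - y))

def binary_cross_entropy(p_hat, y, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    # p_hat: rows of predicted class probabilities; y_onehot: one-hot labels
    return -np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=1))

def kl_divergence(p, q, eps=1e-12):   # D_KL(p || q) for discrete distributions
    return np.sum(p * np.log((p + eps) / (q + eps)))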
Basics of machine learning: loss
Basics of machine learning: metrics

Precision = # true positives / # predicted positive

True positive rate = sensitivity = # true positives / # actually positive
Recall is the same as true positive rate!

False positive rate = 1 − specificity = # false positives / # all condition negatives
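
A minimal NumPy sketch computing these metrics from binary predictions (the example labels are made up for illustration):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives

precision = tp / (tp + fp)        # of predicted positives, how many are right
recall    = tp / (tp + fn)        # true positive rate / sensitivity
fpr       = fp / (fp + tn)        # false positive rate = 1 - specificity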
R01 Outline

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks
I. Perceptrons to neurons
II. Activation functions
III. Training with backpropagation
IV. Gradient descent
V. Regularization

D. Brief preview of pset 1


Neural networks: perceptrons to neurons

Neural networks: single layer feed-forward NN
Neural networks: activation functions

Step function

Sigmoid

Rectified linear unit

Softmax

Hyperbolic tangent
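
A minimal NumPy sketch of these activations, written from their standard definitions:

import numpy as np

def step(z):    return (z > 0).astype(float)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)
def tanh(z):    return np.tanh(z)

def softmax(z):
    # Subtract the max for numerical stability; normalize to a distribution.
    e = np.exp(z - np.max(z))
    return e / e.sum()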
Neural networks: activation functions

Task                                    Activation                       Loss
Regression (penalize large errors)      Linear (ReLU, Leaky ReLU, etc)   mean squared error (MSE)
Regression (penalize error linearly)    Linear (ReLU, Leaky ReLU, etc)   mean absolute error (MAE)
Classification (binary)                 Sigmoid, tanh                    binary cross-entropy
Classification (multi-class)            Softmax                          categorical cross-entropy
Generative                              Linear (ReLU, Leaky ReLU, etc)   Kullback-Leibler divergence

Other considerations: gradient intensity, computational cost of the activation, exploding/vanishing gradients, and depth of the network (stacked purely linear layers collapse into a single linear layer, so a deep linear network is useless).
Neural networks: training with backpropagation

[Figure: forward pass through an L-layer network. The input X enters layer 1 (weights W1, bias b1, pre-activation Z1, activation f1, output A1), then layer 2 (W2, b2, Z2, f2, A2), and so on through layer L (WL, bL, ZL, fL, AL), whose output feeds the loss.]

Inspired by 6.036 lecture notes (Leslie Kaelbling)




Neural networks: training with backpropagation

Let's use the following shorthand from the previous figure:

    Zl = Wl·Al−1 + bl,    Al = fl(Zl),    A0 = X

First, let's break down how the loss depends on the final layer. By the chain rule,

    ∂Loss/∂WL = (∂Loss/∂AL)·(∂AL/∂ZL)·(∂ZL/∂WL)

Since ∂ZL/∂WL = AL−1 and ∂AL/∂ZL = fL′(ZL), we can re-write the equation as

    ∂Loss/∂WL = (∂Loss/∂AL)·fL′(ZL)·AL−1

Since we have the outputs of every layer from the forward pass, all we need to compute for the gradient of the last layer with respect to the weights is the gradient of the loss with respect to the pre-activation output ZL.

Now, to propagate through the whole network, we can keep applying the chain rule until the first layer of the network:

    ∂Loss/∂Wl = (∂Loss/∂AL)·(∂AL/∂ZL)·(∂ZL/∂AL−1)·…·(∂Al/∂Zl)·(∂Zl/∂Wl)

If you spend a few minutes looking at matrix dimensions, it becomes clear that this is an informal derivation; writing each factor with the transposes needed for the matrix products to conform is a worthwhile exercise. On your own time!

Inspired by 6.036 lecture notes (Leslie Kaelbling)
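
A minimal NumPy sketch of one forward and backward pass for a two-layer network with MSE loss (the shapes, layer sizes, and use of ReLU are illustrative assumptions, not taken from the pset):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 examples, 3 features
y = rng.normal(size=(5, 1))

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # layer 1
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # layer 2

# Forward pass: Z_l = A_{l-1} W_l + b_l, A_l = f_l(Z_l)
Z1 = X @ W1 + b1
A1 = np.maximum(0, Z1)               # ReLU
Z2 = A1 @ W2 + b2
A2 = Z2                              # linear output
loss = np.mean((A2 - y) ** 2)

# Backward pass: apply the chain rule layer by layer.
dZ2 = 2 * (A2 - y) / len(y)          # dLoss/dZ2 (linear output layer)
dW2 = A1.T @ dZ2                     # dLoss/dW2
db2 = dZ2.sum(axis=0)
dZ1 = (dZ2 @ W2.T) * (Z1 > 0)        # propagate back through the ReLU
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)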


Neural networks: gradient descent

Gradient-based learning: use the partial derivative of the error to update the network weights.

    Δw_t = −ε·(∂E/∂w + λ·w_{t−1}) + η·Δw_{t−1}
    w_t = w_{t−1} + Δw_t

where
• ∂E/∂w is the gradient: the partial derivative of the error E with respect to weight w
• ε is the learning rate (usually set to 0.1 or smaller), kept small so as not to overshoot the optimal solution
• λ controls weight decay: it penalizes large weights w to prevent overfitting
• η is the momentum: it re-applies the previous change Δw_{t−1}, so when the direction of the update is consistent, convergence speeds up

For large datasets, stochastic gradient descent (SGD) is often used: randomly sample a batch of examples and update the weights on that batch.
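
A minimal NumPy sketch of this update rule (the hyperparameter values and the toy gradient are placeholders):

import numpy as np

def sgd_momentum_step(w, grad, delta_prev, eps=0.01, lam=1e-4, eta=0.9):
    # delta = -learning_rate * (gradient + weight_decay * w) + momentum * previous_delta
    delta = -eps * (grad + lam * w) + eta * delta_prev
    return w + delta, delta

# Usage: keep the previous update around between steps.
w = np.zeros(10)
delta = np.zeros_like(w)
for step in range(100):
    grad = 2 * w - 1                 # placeholder gradient of a toy quadratic error
    w, delta = sgd_momentum_step(w, grad, delta)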
Neural networks: gradient descent
Neural networks: regularization

We need to control model complexity for good generalization

(Goodfellow 2016)
Neural networks: regularization via validation set, early stopping

[Diagram: the data are split into a training set, a validation set, and a test set.]

Leave out a small "validation set":
• not used to train the model
• used to evaluate the model at each epoch/iteration (VCE, validation cross-entropy)
• stop when VCE increases, to prevent overfitting

[Figure: cross-entropy per case vs. learning epoch for the training, validation, and test sets. Training CE keeps decreasing with more epochs, while validation CE eventually stops improving, which signals when to stop training.]
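
A minimal sketch of early stopping in plain Python (train_one_epoch, evaluate, model, and the data sets are hypothetical helpers, not a real API):

max_epochs, patience = 50, 3
best_vce, best_weights, bad_epochs = float("inf"), None, 0

for epoch in range(max_epochs):
    train_one_epoch(model, training_set)        # hypothetical helper
    vce = evaluate(model, validation_set)       # validation cross-entropy (VCE)
    if vce < best_vce:
        best_vce, best_weights, bad_epochs = vce, model.get_weights(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # VCE stopped improving: stop early
            break

model.set_weights(best_weights)                 # roll back to the best epoch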
Neural networks: regularization

Objective function with ridge (L2) regularization: penalize large weights by adding their squared norm to the objective,

    J_reg(θ) = J(θ) + λ·‖θ‖₂²

Dropout: randomly zero out a fraction of the units during training, so the network cannot rely too heavily on any single unit.
R01 Outline

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks

D. Brief preview of pset 1


PS1: TensorFlow Warm Up
PS1: Data

Problem Set 1: Classification

input space:
X = {0, 1, …, 255}^(28×28)

after rescaling:
X′ = [0, 1]^(28×28)

after flattening:
X″ = [0, 1]^784
PS1: Data

integer-encoded label space:
Y_i = {0, 1, …, 9}

one-hot-encoded label space:
Y_h = [0, 1]^10

One-hot encoding turns labels 1, 2, 3 into

    1 0 0
    0 1 0
    0 0 1
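
A minimal NumPy sketch of one-hot encoding using the np.eye trick:

import numpy as np

labels = np.array([1, 2, 3])          # integer-encoded labels
one_hot = np.eye(10)[labels]          # each row is a length-10 indicator vector
# one_hot[0] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  (a 1 in position 1)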
PS1: Structure

Problem Set 1

x ∈ [0, 1]^784        input image (flattened)
W ∈ R^(784×10)        weights
b ∈ R^10              biases
ŷ ∈ [0, 1]^10         predicted class probabilities

f(x; W, b) = φ_softmax(Wᵀx + b)

[Computation graph: x and W feed tf.matmul, b is added (+), and tf.nn.softmax produces ŷ, which feeds the loss function and optimizer.]

Tensor   Node            Shape
y        tf.placeholder  [None, 10]
x        tf.placeholder  [None, 784]
W        tf.Variable     [784, 10]
b        tf.Variable     [10]
PS1: Gradient of loss with respect to logits

i goes from 0 to the number of training examples; j goes from 0 to 9; y_i is the actual label.

    z_i = Wᵀx_i + b                      logits: one entry per digit class
    p_k = e^{z_k} / Σ_m e^{z_m}          softmax probabilities
    L_i = −log(p_{y_i})                  cross-entropy loss for example i

    ∂L_i/∂z_j = p_{ij} − 𝟙(y_i = j)      used to update the weights for class j

If y_i (the actual label) equals j, then ∂L_i/∂z_j = p_{ij} − 1; otherwise it is p_{ij}.

Example: p_i = [0.6, 0.3, 0.1] with the correct label in the second position gives the gradient [0.6, −0.7, 0.1].

http://cs231n.github.io/neural-networks-case-study/
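
A minimal NumPy sketch of this gradient, mirroring the cs231n reference above (the logits and label are made-up values):

import numpy as np

z = np.array([[2.0, 1.3, 0.2]])        # logits for one example, 3 classes
y = np.array([1])                      # correct class index

# Softmax probabilities (stabilized by subtracting the max logit).
exp_z = np.exp(z - z.max(axis=1, keepdims=True))
p = exp_z / exp_z.sum(axis=1, keepdims=True)

loss = -np.log(p[np.arange(len(y)), y]).mean()

# Gradient of the loss with respect to the logits: p_ij - 1(y_i = j)
dz = p.copy()
dz[np.arange(len(y)), y] -= 1.0
dz /= len(y)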
PS1: Implementation

import tensorflow as tf   # written against the TensorFlow 1.x graph API
import numpy as np

rng = np.random
# n_samples and learning_rate are assumed to be defined elsewhere in the pset

# tf Graph input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights (randomly initialized scalars)
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

# Construct a linear model: pred = W*X + b
pred = tf.add(tf.multiply(X, W), b)   # tf.mul was renamed tf.multiply in TF 1.0

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)

# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
Next week

More neural network review


Convolutional neural networks
Recurrent neural networks
R01 Outline

A. What can you do with ML?

B. Basics of machine learning

C. Neural networks

D. Brief preview of pset 1

E. BONUS CONTENT if time!


BONUS CONTENT!

Non-parametric models

Gradient descent – batch vs stochastic

Momentum

Adam

Lecture question: “how do we actually calculate these derivatives?”


Answer: automatic differentiation
Basics of machine learning: non-parametric models

D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}

from collections import Counter

def knn(k, dist, D_train, D_test):
    # D_train: list of (x, y) pairs; D_test: list of query points x
    votes = []
    for x_test in D_test:
        d = [dist(x_test, x_train) for x_train, _ in D_train]
        nearest = sorted(range(len(d)), key=lambda j: d[j])[:k]   # k closest training points
        labels = [D_train[j][1] for j in nearest]
        votes.append(Counter(labels).most_common(1)[0][0])        # majority vote
    return votes
Neural networks: gradient descent - batch gradient update

Gradient of the objective J with respect to the parameter vector W: ∇_W J

Batch gradient update: W_t = W_{t−1} − η·∇_W J(W_{t−1})

Pseudocode for the gradient update algorithm (grad_J, the function computing ∇_W J, is passed in explicitly here):

def gradient_update(W_init, eta, J, grad_J, eps):
    W_prev = W_init
    W = W_prev - eta * grad_J(W_prev)
    # Stop when the objective barely changes; this is an arbitrary update criterion.
    while abs(J(W) - J(W_prev)) > eps:
        W_prev = W
        W = W_prev - eta * grad_J(W_prev)
    return W
Neural networks: gradient descent - stochastic gradient descent

Stochastic gradient update (per randomly sampled training example):

import random

def sgd(W_init, eta, grad_f, T, n):
    # eta(t): step-size schedule; grad_f(W, i): gradient on training example i
    W = W_init
    for t in range(1, T + 1):
        i = random.randrange(n)          # randomly select i ∈ {1, 2, ..., n}
        W = W - eta(t) * grad_f(W, i)
    return W

Mini-batch gradient descent: the same idea, except each update uses the average gradient over a small randomly sampled batch of examples instead of a single example.

Optimization: momentum

How can we make momentum better? Nesterov momentum: look ahead along the velocity before computing the gradient.

Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006


Optimization: adam

https://arxiv.org/pdf/1412.6980.pdf
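
A minimal NumPy sketch of the Adam update, following Algorithm 1 of the paper above (default hyperparameters from Kingma & Ba):

import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t counts steps starting at 1 (needed for the bias correction).
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v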
Neural networks: automatic differentiation

Dual numbers and forward automatic differentiation

Augment the input of a standard Taylor series (as in numerical differentiation) with a "dual number" ε:

    f(x + ε) = f(x) + f′(x)·ε + f″(x)·ε²/2! + …

Because dual numbers have the (manufactured) property ε² = 0, the Taylor series simplifies to

    f(x + ε) = f(x) + f′(x)·ε

which recovers the function output as well as the first derivative. End result: exact derivatives, computed alongside the function itself in a single evaluation.

https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf
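
A minimal sketch of forward-mode autodiff with dual numbers in plain Python (the Dual class and operator overloads are illustrative):

class Dual:
    # Represents a + b*eps, where eps**2 == 0.
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b            # value, derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x        # f'(x) = 6x + 2

result = f(Dual(4.0, 1.0))          # seed the derivative of x with 1
print(result.a, result.b)           # 56.0 26.0 -> f(4) and f'(4)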
