Lecture 10: Midterm Review and Logistic Regression
Matt Gormley
Lecture 10
Sep. 25, 2019
1
Reminders
• Homework 3: KNN, Perceptron, Lin.Reg.
– Out: Wed, Sep. 18
– Due: Wed, Sep. 25 at 11:59pm
• Midterm Exam 1
– Thu, Oct. 03, 6:30pm – 8:00pm
• Homework 4: Logistic Regression
– Out: Wed, Sep. 25
– Due: Fri, Oct. 11 at 11:59pm
• Today’s In-Class Poll
– http://p10.mlcourse.org
• Reading on Probabilistic Learning is reused later
in the course for MLE/MAP
3
MIDTERM EXAM LOGISTICS
5
Midterm Exam
• Time / Location
– Time: Evening Exam
Thu, Oct. 03 at 6:30pm – 8:00pm
– Room: We will contact each student individually with your room
assignment. The rooms are not based on section.
– Seats: There will be assigned seats. Please arrive early.
– Please watch Piazza carefully for announcements regarding room / seat
assignments.
• Logistics
– Covered material: Lecture 1 – Lecture 9
– Format of questions:
• Multiple choice
• True / False (with justification)
• Derivations
• Short answers
• Interpreting figures
• Implementing algorithms on paper
– No electronic devices
– You are allowed to bring one 8½ x 11 sheet of notes (front and back)
6
Midterm Exam
• How to Prepare
– Attend the midterm review lecture
(right now!)
– Review prior year’s exam and solutions
(we’ll post them)
– Review this year’s homework problems
– Consider whether you have achieved the
“learning objectives” for each lecture / section
7
Midterm Exam
• Advice (for during the exam)
– Solve the easy problems first
(e.g. multiple choice before derivations)
• if a problem seems extremely complicated you’re likely
missing something
– Don’t leave any answer blank!
– If you make an assumption, write it down
– If you look at a question and don’t know the
answer:
• we probably haven’t told you the answer
• but we’ve told you enough to work it out
• imagine arguing for some answer and see if you like it
8
Topics for Midterm 1
• Foundations
– Probability, Linear Algebra, Geometry, Calculus
– Optimization
• Important Concepts
– Overfitting
– Experimental Design
• Classification
– Decision Tree
– KNN
– Perceptron
• Regression
– Linear Regression
9
SAMPLE QUESTIONS
10
Sample Questions
1.4 Probability
Assume we have a sample space Ω. Answer each question with T or F. No justification is required.
(a) [1 pts.] T or F: If events A, B, and C are disjoint then they are independent.
(b) [1 pts.] T or F: $P(A \mid B) \propto P(A)\,P(B \mid A)$. (The sign '∝' means 'is proportional to')
(c) [1 pts.] T or F: $P(A \cup B) \ge P(A)$.
(d) [1 pts.] T or F: $P(A \cap B) \ge P(A)$.
[Figure: a decision tree split into subtree1 and subtree2, with hints: −log2 0.75 ≈ 0.4, −log2 0.25 = 2, and the term −0.5 log2 0.5 − 0.5 log2 0.5.]
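The question this figure accompanied is not reproduced here, but hints like these are typically used for an entropy / information-gain calculation; for example, using the rounded values given:

$H(0.75, 0.25) = -0.75\log_2 0.75 - 0.25\log_2 0.25 \approx 0.75(0.4) + 0.25(2) = 0.8 \text{ bits}$
$H(0.5, 0.5) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1 \text{ bit}$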
12
Sample Questions
(from the 10-701 Machine Learning Midterm Exam, 11/02/2016)
4 K-NN [12 pts]
In this problem, you will be tested on your knowledge of K-Nearest Neighbors (K-NN), where k indicates the number of nearest neighbors.
Now we will apply K-Nearest Neighbors using Euclidean distance to a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. A point can be its own neighbor.
1. [3 pts] For K-NN in general, are there any cons of using very large k values? Select one. Briefly justify your answer.
(a) Yes (b) No
2. [3 pts] For K-NN in general, are there any cons of using very small k values? Select one. Briefly justify your answer.
(a) Yes (b) No
3. [2 pts] What value of k minimizes leave-one-out cross-validation error for the dataset shown in Figure 5? What is the resulting error?
Figure 5
13
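Question 3 depends on Figure 5, which is not reproduced here, but the leave-one-out procedure itself is mechanical. A minimal Python sketch, with made-up stand-in data in place of the figure:

import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Majority-vote K-NN with Euclidean distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest training points
    return np.bincount(y_train[nearest]).argmax()  # majority class label

def loo_cv_error(X, y, k):
    """Leave-one-out CV error; the held-out point is excluded, so it cannot vote for itself."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i      # drop the i-th point from the training set
        pred = knn_predict(X[mask], y[mask], X[i], k)
        errors += (pred != y[i])
    return errors / len(X)

# Stand-in data only (the exam's Figure 5 is not reproduced here)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
for k in (1, 3, 5):
    print(k, loo_cv_error(X, y, k))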
Sample Questions
(from the 10-601 Machine Learning Midterm Exam, 2/29/2016)
(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.
(c) [2 pts.] Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.
14
Sample Questions
3 Linear and Logistic Regression [20 pts. + 2 Extra Credit]
(from the 10-601 Machine Learning Midterm Exam, 2/29/2016)
Given that we have an input x and we want to estimate an output y, in linear regression we assume the relationship between them is of the form y = wx + b + ε, where w and b are real-valued parameters we estimate and ε represents the noise in the data. When the noise is Gaussian, maximizing the likelihood of a dataset S = {(x_1, y_1), . . . , (x_n, y_n)} to estimate the parameters w and b is equivalent to minimizing the squared error:
$\arg\min_{w} \sum_{i=1}^{n} \big(y_i - (wx_i + b)\big)^2.$
Consider the dataset S plotted in Fig. 1 along with its associated regression line. For each of the altered data sets S_new plotted in Fig. 3, indicate which regression line (relative to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write your answers in the table below.
Figure 1: An observed data set and its associated regression line.
(a) Adding one outlier to the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.
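Since the exam figures are not reproduced here, one way to build intuition for parts like (a) and (e) is to fit least squares before and after altering a synthetic dataset. A sketch in Python, using made-up data rather than the actual Fig. 1:

import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = w*x + b (equivalent to Gaussian-noise MLE)."""
    w, b = np.polyfit(x, y, deg=1)
    return w, b

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)   # synthetic stand-in for S
w0, b0 = fit_line(x, y)
print("original line:               w=%.2f b=%.2f" % (w0, b0))

# (a) Adding one outlier pulls the fitted line toward it.
w_a, b_a = fit_line(np.append(x, 5.0), np.append(y, 40.0))
print("with one outlier:            w=%.2f b=%.2f" % (w_a, b_a))

# (e) Duplicating the data and adding points that lie exactly on the original
# fitted line leaves the least-squares solution unchanged: those points have
# zero residual, so they contribute nothing to the gradient of the squared error.
x_new = np.array([11.0, 12.0, 13.0, 14.0])
x_e = np.concatenate([x, x, x_new])
y_e = np.concatenate([y, y, w0 * x_new + b0])
w_e, b_e = fit_line(x_e, y_e)
print("duplicated + on-line points: w=%.2f b=%.2f" % (w_e, b_e))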
Matching Game
Goal: Match the Algorithm to its Update Rule
3. Perceptron
6. $\theta_k \leftarrow \theta_k + \big(h_\theta(\mathbf{x}^{(i)}) - y^{(i)}\big)\, x_k^{(i)}$, where $h_\theta(\mathbf{x}) = \mathrm{sign}(\theta^T \mathbf{x})$
26
PROBABILISTIC LEARNING
28
Maximum Likelihood Estimation
29
Learning from Data (Frequentist)
Whiteboard
– Principle of Maximum Likelihood Estimation
(MLE)
– Strawmen:
• Example: Bernoulli
• Example: Gaussian
• Example: Conditional #1
(Bernoulli conditioned on Gaussian)
• Example: Conditional #2
(Gaussians conditioned on Bernoulli)
30
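The whiteboard derivation is not reproduced in these slides. For reference, the standard closed-form result for the Bernoulli example (a sketch, writing φ for the Bernoulli parameter and x^(i) ∈ {0, 1} for the observations):

$\ell(\phi) = \log \prod_{i=1}^{N} \phi^{x^{(i)}} (1-\phi)^{1-x^{(i)}} = \Big(\sum_i x^{(i)}\Big)\log\phi + \Big(N - \sum_i x^{(i)}\Big)\log(1-\phi)$
$\frac{d\ell}{d\phi} = \frac{\sum_i x^{(i)}}{\phi} - \frac{N - \sum_i x^{(i)}}{1-\phi} = 0 \;\Longrightarrow\; \hat{\phi}_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$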
LOGISTIC REGRESSION
31
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
We are back to
classification.
32
Recall…
Linear Models for Classification
Key idea: Try to learn
this hyperplane directly
Half-spaces:
Using gradient ascent for linear
classifiers
Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn
parameters
4. Predict the class with highest probability under
the model
35
Using gradient ascent for linear classifiers
This decision function isn't differentiable:
$h(\mathbf{x}) = \mathrm{sign}(\theta^T \mathbf{x})$
[plot of sign(x)]
Use a differentiable function instead:
$p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$
$\mathrm{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$
37
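A minimal numeric illustration of the two functions above (a Python sketch; the hard threshold at zero versus the smooth probability is the point):

import numpy as np

def sign_predict(theta, x):
    # non-differentiable decision function: h(x) = sign(theta^T x)
    return np.sign(theta @ x)

def logistic(u):
    # differentiable replacement: logistic(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def prob_y1(theta, x):
    # p_theta(y = 1 | x) = logistic(theta^T x)
    return logistic(theta @ x)

theta = np.array([2.0, -1.0])
for x in (np.array([1.0, 0.5]), np.array([0.2, 1.0])):
    print(sign_predict(theta, x), prob_y1(theta, x))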
Logistic Regression
Whiteboard
– Logistic Regression Model
– Learning for Logistic Regression
• Partial derivative for Logistic Regression
• Gradient for Logistic Regression
38
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
40
Logistic Regression
41
Logistic Regression
42
LEARNING LOGISTIC REGRESSION
43
Maximum Conditional
Likelihood Estimation
Learning: finds the parameters that minimize some
objective function.
$\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
We minimize the negative log conditional likelihood:
$J(\theta) = -\log \prod_{i=1}^{N} p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})$
Why?
1. We can’t maximize likelihood (as in Naïve Bayes)
because we don’t have a joint model p(x,y)
2. It worked well for Linear Regression (least squares is
MCLE)
44
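For the binary case, with p_θ(y = 1 | x) given by the logistic function of θᵀx, the objective above expands into the familiar cross-entropy form, and its partial derivatives (derived on the whiteboard earlier) are compact. As a reference sketch:

$J(\theta) = -\sum_{i=1}^{N} \Big[ y^{(i)} \log p_\theta(y{=}1 \mid \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\big(1 - p_\theta(y{=}1 \mid \mathbf{x}^{(i)})\big) \Big]$
$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{N} \big( p_\theta(y{=}1 \mid \mathbf{x}^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$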
Maximum Conditional
Likelihood Estimation
Learning: Four approaches to solving $\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
45
Maximum Conditional
Likelihood Estimation
Learning: Four approaches to solving $\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
Answer:
At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples (2) update all
the parameters using the gradient
B. (1) compute the gradient of the log-likelihood for all examples (2) randomly
pick an example (3) update only the parameters for that example
C. (1) randomly pick a parameter, (2) compute the partial derivative of the log-
likelihood with respect to that parameter, (3) update that parameter for all
examples
D. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down,
(3) report that answer
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood
for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of
the log-likelihood for that example with respect to that parameter, (3) update
that parameter using that gradient
47
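The standard SGD step is the one described in option E: pick one example at random, compute the gradient of the log-likelihood for that example, and update all the parameters. A Python sketch (the feature matrix X, labels y in {0, 1}, and learning rate lr are illustrative stand-ins; the code descends the negative log-likelihood, which is equivalent to ascending the log-likelihood):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd_step(theta, X, y, lr, rng):
    """One SGD step for binary logistic regression."""
    i = rng.integers(len(X))             # (1) pick an example at random
    p = logistic(X[i] @ theta)           # p_theta(y = 1 | x^(i))
    grad = (p - y[i]) * X[i]             # (2) gradient of the per-example negative log-likelihood
    return theta - lr * grad             # (3) update all the parameters

rng = np.random.default_rng(0)
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # column 0 acts as a bias feature
y = np.array([1, 0, 1])
theta = np.zeros(2)
for _ in range(1000):
    theta = sgd_step(theta, X, y, lr=0.1, rng=rng)
print(theta)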
Recall…
Gradient Descent
Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ
In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives):
$\nabla_\theta J(\theta) = \left[ \frac{dJ(\theta)}{d\theta_1},\; \frac{dJ(\theta)}{d\theta_2},\; \ldots,\; \frac{dJ(\theta)}{d\theta_N} \right]^{T}$
48
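A runnable counterpart to Algorithm 1 for logistic regression (a sketch: lam stands in for the step size λ, and "while not converged" is approximated by a fixed iteration budget):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_J(theta, X, y):
    """Gradient of the negative log conditional likelihood over the whole dataset."""
    p = logistic(X @ theta)
    return X.T @ (p - y)

def gradient_descent(X, y, theta0, lam=0.1, n_iters=500):
    theta = theta0.copy()
    for _ in range(n_iters):             # stand-in for "while not converged"
        theta = theta - lam * grad_J(theta, X, y)
    return theta

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])  # bias feature in column 0
y = np.array([1.0, 0.0, 1.0, 0.0])
print(gradient_descent(X, y, theta0=np.zeros(2)))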
Recall…
Stochastic Gradient Descent (SGD)
Answer:
50
Summary
1. Discriminative classifiers directly model the
conditional, p(y|x)
2. Logistic regression is a simple linear
classifier, that retains a probabilistic
semantics
3. Parameters in LR are learned by iterative
optimization (e.g. SGD)
60
Logistic Regression Objectives
You should be able to…
• Apply the principle of maximum likelihood estimation (MLE) to
learn the parameters of a probabilistic model
• Given a discriminative probabilistic model, derive the conditional
log-likelihood, its gradient, and the corresponding Bayes
Classifier
• Explain the practical reasons why we work with the log of the
likelihood
• Implement logistic regression for binary or multiclass
classification
• Prove that the decision boundary of binary logistic regression is
linear
• For linear regression, show that the parameters which minimize
squared error are equivalent to those that maximize conditional
likelihood
61
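As a quick reference for the "decision boundary is linear" objective, one short argument (a sketch using the logistic model above, with the usual 0.5 threshold):

$p_\theta(y{=}1 \mid \mathbf{x}) \ge \tfrac{1}{2} \;\iff\; \frac{1}{1 + \exp(-\theta^T \mathbf{x})} \ge \tfrac{1}{2} \;\iff\; \exp(-\theta^T \mathbf{x}) \le 1 \;\iff\; \theta^T \mathbf{x} \ge 0$

so the decision boundary $\{\mathbf{x} : \theta^T \mathbf{x} = 0\}$ is a hyperplane, i.e. linear in $\mathbf{x}$.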