Lecture 10: Midterm Review and Logistic Regression
Matt Gormley
Lecture 10
Sep. 25, 2019
1
Reminders
• Homework 3: KNN, Perceptron, Lin.Reg.
– Out: Wed, Sep. 18
– Due: Wed, Sep. 25 at 11:59pm
• Midterm Exam 1
– Thu, Oct. 03, 6:30pm – 8:00pm
• Homework 4: Logistic Regression
– Out: Wed, Sep. 25
– Due: Fri, Oct. 11 at 11:59pm
• Today’s In-Class Poll
– http://p10.mlcourse.org
• Reading on Probabilistic Learning is reused later
in the course for MLE/MAP
3
MIDTERM EXAM LOGISTICS
5
Midterm Exam
• Time / Location
– Time: Evening Exam
Thu, Oct. 03 at 6:30pm – 8:00pm
– Room: We will contact each student individually with your room
assignment. The rooms are not based on section.
– Seats: There will be assigned seats. Please arrive early.
– Please watch Piazza carefully for announcements regarding room / seat
assignments.
• Logistics
– Covered material: Lecture 1 – Lecture 9
– Format of questions:
• Multiple choice
• True / False (with justification)
• Derivations
• Short answers
• Interpreting figures
• Implementing algorithms on paper
– No electronic devices
– You are allowed to bring one 8½ x 11 sheet of notes (front and back)
6
Midterm Exam
• How to Prepare
– Attend the midterm review lecture
(right now!)
– Review prior year’s exam and solutions
(we’ll post them)
– Review this year’s homework problems
– Consider whether you have achieved the
“learning objectives” for each lecture / section
7
Midterm Exam
• Advice (for during the exam)
– Solve the easy problems first
(e.g. multiple choice before derivations)
• if a problem seems extremely complicated you’re likely
missing something
– Don’t leave any answer blank!
– If you make an assumption, write it down
– If you look at a question and don’t know the
answer:
• we probably haven’t told you the answer
• but we’ve told you enough to work it out
• imagine arguing for some answer and see if you like it
8
Topics for Midterm 1
• Foundations
– Probability, Linear Algebra, Geometry, Calculus
– Optimization
• Important Concepts
– Overfitting
– Experimental Design
• Classification
– Decision Tree
– KNN
– Perceptron
• Regression
– Linear Regression
9
SAMPLE QUESTIONS
10
Sample Questions
1.4 Probability
Assume we have a sample space Ω. Answer each question with T or F. No justification is required.
(a) [1 pts.] T or F: If events A, B, and C are disjoint then they are independent.
(b) [1 pts.] T or F: $P(A \mid B) \propto P(A)\,P(B \mid A)$. (The sign '∝' means 'is proportional to')
(c) [1 pts.] T or F: $P(A \cup B) \ge P(A)$.
(d) [1 pts.] T or F: $P(A \cap B) \ge P(A)$.
[Figure: a decision tree split into subtree1 and subtree2, with hints: −log2 0.75 ≈ 0.4, −log2 0.25 = 2, and the term −0.5 log2 0.5 − 0.5 log2 0.5.]
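The question this figure accompanied is not reproduced here, but hints like these are typically used for an entropy / information-gain calculation; for example, using the rounded values given:

$H(0.75, 0.25) = -0.75\log_2 0.75 - 0.25\log_2 0.25 \approx 0.75(0.4) + 0.25(2) = 0.8 \text{ bits}$
$H(0.5, 0.5) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1 \text{ bit}$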
12
Sample Questions
(from the 10-701 Machine Learning Midterm Exam, 11/02/2016)
4 K-NN [12 pts]
In this problem, you will be tested on your knowledge of K-Nearest Neighbors (K-NN), where k indicates the number of nearest neighbors.
Now we will apply K-Nearest Neighbors using Euclidean distance to a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. A point can be its own neighbor.
1. [3 pts] For K-NN in general, are there any cons of using very large k values? Select one. Briefly justify your answer.
(a) Yes (b) No
2. [3 pts] For K-NN in general, are there any cons of using very small k values? Select one. Briefly justify your answer.
(a) Yes (b) No
3. [2 pts] What value of k minimizes leave-one-out cross-validation error for the dataset shown in Figure 5? What is the resulting error?
Figure 5
13
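Question 3 depends on Figure 5, which is not reproduced here, but the leave-one-out procedure itself is mechanical. A minimal Python sketch, with made-up stand-in data in place of the figure:

import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Majority-vote K-NN with Euclidean distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest training points
    return np.bincount(y_train[nearest]).argmax()  # majority class label

def loo_cv_error(X, y, k):
    """Leave-one-out CV error; the held-out point is excluded, so it cannot vote for itself."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i      # drop the i-th point from the training set
        pred = knn_predict(X[mask], y[mask], X[i], k)
        errors += (pred != y[i])
    return errors / len(X)

# Stand-in data only (the exam's Figure 5 is not reproduced here)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
for k in (1, 3, 5):
    print(k, loo_cv_error(X, y, k))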
Sample Questions
(from the 10-601 Machine Learning Midterm Exam, 2/29/2016)
(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.
(c) [2 pts.] Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.
14
Sample Questions
3 Linear and Logistic Regression [20 pts. + 2 Extra Credit]
(from the 10-601 Machine Learning Midterm Exam, 2/29/2016)
Given that we have an input x and we want to estimate an output y, in linear regression we assume the relationship between them is of the form y = wx + b + ε, where w and b are real-valued parameters we estimate and ε represents the noise in the data. When the noise is Gaussian, maximizing the likelihood of a dataset S = {(x_1, y_1), . . . , (x_n, y_n)} to estimate the parameters w and b is equivalent to minimizing the squared error:
$\arg\min_{w} \sum_{i=1}^{n} \big(y_i - (wx_i + b)\big)^2.$
Consider the dataset S plotted in Fig. 1 along with its associated regression line. For each of the altered data sets S_new plotted in Fig. 3, indicate which regression line (relative to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write your answers in the table below.
Figure 1: An observed data set and its associated regression line.
(a) Adding one outlier to the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.
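Since the exam figures are not reproduced here, one way to build intuition for parts like (a) and (e) is to fit least squares before and after altering a synthetic dataset. A sketch in Python, using made-up data rather than the actual Fig. 1:

import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = w*x + b (equivalent to Gaussian-noise MLE)."""
    w, b = np.polyfit(x, y, deg=1)
    return w, b

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)   # synthetic stand-in for S
w0, b0 = fit_line(x, y)
print("original line:               w=%.2f b=%.2f" % (w0, b0))

# (a) Adding one outlier pulls the fitted line toward it.
w_a, b_a = fit_line(np.append(x, 5.0), np.append(y, 40.0))
print("with one outlier:            w=%.2f b=%.2f" % (w_a, b_a))

# (e) Duplicating the data and adding points that lie exactly on the original
# fitted line leaves the least-squares solution unchanged: those points have
# zero residual, so they contribute nothing to the gradient of the squared error.
x_new = np.array([11.0, 12.0, 13.0, 14.0])
x_e = np.concatenate([x, x, x_new])
y_e = np.concatenate([y, y, w0 * x_new + b0])
w_e, b_e = fit_line(x_e, y_e)
print("duplicated + on-line points: w=%.2f b=%.2f" % (w_e, b_e))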
Matching Game
Goal: Match the Algorithm to its Update Rule
3. Perceptron
6. $\theta_k \leftarrow \theta_k + \big(h_\theta(\mathbf{x}^{(i)}) - y^{(i)}\big)\, x_k^{(i)}$, where $h_\theta(\mathbf{x}) = \mathrm{sign}(\theta^T \mathbf{x})$
26
PROBABILISTIC LEARNING
28
Maximum Likelihood Estimation
29
Learning from Data (Frequentist)
Whiteboard
– Principle of Maximum Likelihood Estimation
(MLE)
– Strawmen:
• Example: Bernoulli
• Example: Gaussian
• Example: Conditional #1
(Bernoulli conditioned on Gaussian)
• Example: Conditional #2
(Gaussians conditioned on Bernoulli)
30
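The whiteboard derivation is not reproduced in these slides. For reference, the standard closed-form result for the Bernoulli example (a sketch, writing φ for the Bernoulli parameter and x^(i) ∈ {0, 1} for the observations):

$\ell(\phi) = \log \prod_{i=1}^{N} \phi^{x^{(i)}} (1-\phi)^{1-x^{(i)}} = \Big(\sum_i x^{(i)}\Big)\log\phi + \Big(N - \sum_i x^{(i)}\Big)\log(1-\phi)$
$\frac{d\ell}{d\phi} = \frac{\sum_i x^{(i)}}{\phi} - \frac{N - \sum_i x^{(i)}}{1-\phi} = 0 \;\Longrightarrow\; \hat{\phi}_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$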
LOGISTIC REGRESSION
31
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
We are back to
classification.
32
Recall…
Linear Models for Classification
Key idea: Try to learn
this hyperplane directly
Half-spaces:
Using gradient ascent for linear
classifiers
Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn
parameters
4. Predict the class with highest probability under
the model
35
Using gradient ascent for linear classifiers
This decision function isn't differentiable:
$h(\mathbf{x}) = \mathrm{sign}(\theta^T \mathbf{x})$
[plot of sign(x)]
Use a differentiable function instead:
$p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$
$\mathrm{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$
37
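A minimal numeric illustration of the two functions above (a Python sketch; the hard threshold at zero versus the smooth probability is the point):

import numpy as np

def sign_predict(theta, x):
    # non-differentiable decision function: h(x) = sign(theta^T x)
    return np.sign(theta @ x)

def logistic(u):
    # differentiable replacement: logistic(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def prob_y1(theta, x):
    # p_theta(y = 1 | x) = logistic(theta^T x)
    return logistic(theta @ x)

theta = np.array([2.0, -1.0])
for x in (np.array([1.0, 0.5]), np.array([0.2, 1.0])):
    print(sign_predict(theta, x), prob_y1(theta, x))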
Logistic Regression
Whiteboard
– Logistic Regression Model
– Learning for Logistic Regression
• Partial derivative for Logistic Regression
• Gradient for Logistic Regression
38
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
40
Logistic Regression
41
Logistic Regression
42
LEARNING LOGISTIC REGRESSION
43
Maximum Conditional
Likelihood Estimation
Learning: finds the parameters that minimize some
objective function.
$\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
We minimize the negative log conditional likelihood:
$J(\theta) = -\log \prod_{i=1}^{N} p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})$
Why?
1. We can’t maximize likelihood (as in Naïve Bayes)
because we don’t have a joint model p(x,y)
2. It worked well for Linear Regression (least squares is
MCLE)
44
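For the binary case, with p_θ(y = 1 | x) given by the logistic function of θᵀx, the objective above expands into the familiar cross-entropy form, and its partial derivatives (derived on the whiteboard earlier) are compact. As a reference sketch:

$J(\theta) = -\sum_{i=1}^{N} \Big[ y^{(i)} \log p_\theta(y{=}1 \mid \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\big(1 - p_\theta(y{=}1 \mid \mathbf{x}^{(i)})\big) \Big]$
$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{N} \big( p_\theta(y{=}1 \mid \mathbf{x}^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$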
Maximum Conditional
Likelihood Estimation
Learning: Four approaches to solving $\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
45
Maximum Conditional
Likelihood Estimation
Learning: Four approaches to solving $\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
Answer:
At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples (2) update all
the parameters using the gradient
B. (1) compute the gradient of the log-likelihood for all examples (2) randomly
pick an example (3) update only the parameters for that example
C. (1) randomly pick a parameter, (2) compute the partial derivative of the log-
likelihood with respect to that parameter, (3) update that parameter for all
examples
D. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down,
(3) report that answer
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood
for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of
the log-likelihood for that example with respect to that parameter, (3) update
that parameter using that gradient
47
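The standard SGD step is the one described in option E: pick one example at random, compute the gradient of the log-likelihood for that example, and update all the parameters. A Python sketch (the feature matrix X, labels y in {0, 1}, and learning rate lr are illustrative stand-ins; the code descends the negative log-likelihood, which is equivalent to ascending the log-likelihood):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd_step(theta, X, y, lr, rng):
    """One SGD step for binary logistic regression."""
    i = rng.integers(len(X))             # (1) pick an example at random
    p = logistic(X[i] @ theta)           # p_theta(y = 1 | x^(i))
    grad = (p - y[i]) * X[i]             # (2) gradient of the per-example negative log-likelihood
    return theta - lr * grad             # (3) update all the parameters

rng = np.random.default_rng(0)
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # column 0 acts as a bias feature
y = np.array([1, 0, 1])
theta = np.zeros(2)
for _ in range(1000):
    theta = sgd_step(theta, X, y, lr=0.1, rng=rng)
print(theta)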
Recall…
Gradient Descent
Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ
In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives):
$\nabla_\theta J(\theta) = \left[ \frac{dJ(\theta)}{d\theta_1},\; \frac{dJ(\theta)}{d\theta_2},\; \ldots,\; \frac{dJ(\theta)}{d\theta_N} \right]^{T}$
48
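A runnable counterpart to Algorithm 1 for logistic regression (a sketch: lam stands in for the step size λ, and "while not converged" is approximated by a fixed iteration budget):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_J(theta, X, y):
    """Gradient of the negative log conditional likelihood over the whole dataset."""
    p = logistic(X @ theta)
    return X.T @ (p - y)

def gradient_descent(X, y, theta0, lam=0.1, n_iters=500):
    theta = theta0.copy()
    for _ in range(n_iters):             # stand-in for "while not converged"
        theta = theta - lam * grad_J(theta, X, y)
    return theta

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])  # bias feature in column 0
y = np.array([1.0, 0.0, 1.0, 0.0])
print(gradient_descent(X, y, theta0=np.zeros(2)))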
Recall…
Stochastic Gradient Descent (SGD)
Answer:
50
Summary
1. Discriminative classifiers directly model the
conditional, p(y|x)
2. Logistic regression is a simple linear
classifier, that retains a probabilistic
semantics
3. Parameters in LR are learned by iterative
optimization (e.g. SGD)
60
Logistic Regression Objectives
You should be able to…
• Apply the principle of maximum likelihood estimation (MLE) to
learn the parameters of a probabilistic model
• Given a discriminative probabilistic model, derive the conditional
log-likelihood, its gradient, and the corresponding Bayes
Classifier
• Explain the practical reasons why we work with the log of the
likelihood
• Implement logistic regression for binary or multiclass
classification
• Prove that the decision boundary of binary logistic regression is
linear
• For linear regression, show that the parameters which minimize
squared error are equivalent to those that maximize conditional
likelihood
61
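As a quick reference for the "decision boundary is linear" objective, one short argument (a sketch using the logistic model above, with the usual 0.5 threshold):

$p_\theta(y{=}1 \mid \mathbf{x}) \ge \tfrac{1}{2} \;\iff\; \frac{1}{1 + \exp(-\theta^T \mathbf{x})} \ge \tfrac{1}{2} \;\iff\; \exp(-\theta^T \mathbf{x}) \le 1 \;\iff\; \theta^T \mathbf{x} \ge 0$

so the decision boundary $\{\mathbf{x} : \theta^T \mathbf{x} = 0\}$ is a hyperplane, i.e. linear in $\mathbf{x}$.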