
Linear Regression

Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
BRACE YOURSELVES, WINTER IS COMING
BRACE YOURSELVES, HOMEWORK IS COMING
Administrative
• Office hour
• Chen Gao
• Shih-Yang Su

• Feedback (Thanks!)
• Notation?

• More descriptive slides?

• Video/audio recording?

• TA hours (uniformly spread over the week)?


Recap: Machine learning algorithms
• Supervised learning: classification (discrete output), regression (continuous output)
• Unsupervised learning: clustering (discrete), dimensionality reduction (continuous)
Recap: Nearest neighbor classifier
• Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$

• Learning
Do nothing.

• Testing
$\hat{y} = y^{(k)}$, where $k = \arg\min_i d(x, x^{(i)})$
Recap: Instance/Memory-based Learning
1. A distance metric
• Continuous? Discrete? PDF? Gene data? Learn the metric?
2. How many nearby neighbors to look at?
• 1? 3? 5? 15?
3. A weighting function (optional)
• Closer neighbors matter more
4. How to fit with the local points?
• Kernel regression

Slide credit: Carlos Guestrin
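As a concrete recap of these four design choices, here is a minimal sketch (the toy data, Euclidean distance, and Gaussian weighting are illustrative assumptions, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # 1. distance metric (Euclidean here)
    nearest = np.argsort(dists)[:k]                    # 2. how many nearby neighbors to look at
    return y_train[nearest].mean()                     # 4. fit with the local points (plain average)

def kernel_regression(X_train, y_train, x_query, bandwidth=1.0):
    """Weighted average over all points; closer neighbors matter more (Gaussian weights)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    w = np.exp(-dists ** 2 / (2 * bandwidth ** 2))     # 3. weighting function
    return float(np.dot(w, y_train) / w.sum())

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
print(knn_predict(X, y, np.array([2.2]), k=2))
print(kernel_regression(X, y, np.array([2.2]), bandwidth=0.5))
```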


Validation set
• Split off part of the training set: a “fake” test set used to tune hyper-parameters

Slide credit: CS231 @ Stanford


Cross-validation
• 5-fold cross-validation: split the training data into 5 equal folds
• Use 4 folds for training and 1 for validation; rotate so each fold is used for validation once (see the sketch below)

Slide credit: CS231 @ Stanford
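A minimal sketch of the 5-fold split described above, assuming the training data are NumPy arrays (function names are illustrative):

```python
import numpy as np

def five_fold_indices(m, seed=0):
    """Shuffle the m training indices and cut them into 5 roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(m), 5)

def cross_validate(X, y, train_and_score, seed=0):
    """For each fold: train on the other 4 folds, score on the held-out fold."""
    folds = five_fold_indices(len(X), seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return np.mean(scores)
```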


Things to remember
• Supervised Learning
• Training/testing data; classification/regression; Hypothesis
• k-NN
• Simplest learning algorithm
• With sufficient data, this “strawman” approach is very hard to beat
• Kernel regression/classification
• Set k to n (the number of data points) and choose the kernel width
• Smoother than k-NN
• Problems with k-NN
• Curse of dimensionality
• Not robust to irrelevant features
• Slow NN search: must remember (very large) dataset for prediction
Today’s plan: Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Regression
Training set (with real-valued output $y$) → Learning Algorithm → Hypothesis $h$

$x$ (size of house) → $h$ → $y$ (estimated price)
House pricing prediction
[Scatter plot: Price ($) in 1000's vs. Size in feet^2]
Training set:
  Size in feet^2 (x) | Price ($) in 1000's (y)
  2104 | 460
  1416 | 232
  1534 | 315
  852  | 178
  …    | …
  (m = 47 training examples)

• Notation:
  • $m$ = number of training examples
  • $x$ = input variable / features
  • $y$ = output variable / target variable
  • $(x, y)$ = one training example
  • $(x^{(i)}, y^{(i)})$ = $i$-th training example, e.g., $x^{(1)} = 2104$, $y^{(1)} = 460$
Slide credit: Andrew Ng
Model representation

Training set → Learning Algorithm → Hypothesis $h$
$x$ (size of house) → $h$ → $y$ (estimated price)

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$. Shorthand: $h(x)$.

[Plot: Price ($) in 1000's vs. Size in feet^2, with a linear hypothesis fit through the training points]

Univariate linear regression: linear regression with a single input variable $x$.

Slide credit: Andrew Ng
Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Training set:
  Size in feet^2 (x) | Price ($) in 1000's (y)
  2104 | 460
  1416 | 232
  1534 | 315
  852  | 178
  …    | …
  (m = 47 training examples)

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
• $\theta_0, \theta_1$: parameters/weights
• How do we choose the $\theta_i$'s?

Slide credit: Andrew Ng
$h_\theta(x) = \theta_0 + \theta_1 x$

[Three example plots of $y$ vs. $x$, each showing the line produced by a different choice of $\theta_0$ and $\theta_1$]

Slide credit: Andrew Ng


Cost function
• Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x^{(i)})$ is close to $y^{(i)}$ for our training examples $(x^{(i)}, y^{(i)})$
• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
• Cost function (squared error):

  $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

[Plot: Price ($) in 1000's vs. Size in feet^2, with the hypothesis line and the training points]

Slide credit: Andrew Ng
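A minimal NumPy sketch of the squared-error cost $J(\theta_0, \theta_1)$ defined above (the toy values reuse the four rows of the earlier table; the function name is illustrative):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2 for the univariate model."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis evaluated on all training inputs
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(cost(0.0, 0.15, x, y))
```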
Simplified (set $\theta_0 = 0$)
• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$; simplified: $h_\theta(x) = \theta_1 x$
• Parameters: $\theta_0, \theta_1$; simplified: $\theta_1$
• Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$; simplified: $J(\theta_1)$
• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$; simplified: $\min_{\theta_1} J(\theta_1)$

Slide credit: Andrew Ng
$h_\theta(x)$ as a function of $x$ (for a fixed $\theta_1$)  vs.  $J(\theta_1)$ as a function of $\theta_1$

[Sequence of paired plots: for several values of $\theta_1$, the left panel shows the line $h_\theta(x) = \theta_1 x$ against the training points, and the right panel marks the corresponding value of $J(\theta_1)$; together the marks trace out a bowl-shaped curve whose minimum is the best-fitting $\theta_1$]

Slide credit: Andrew Ng


• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

• Parameters: $\theta_0, \theta_1$

• Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Slide credit: Andrew Ng
Cost function

Slide credit: Andrew Ng


How do we find good $\theta_0, \theta_1$ that minimize $J(\theta_0, \theta_1)$?
Slide credit: Andrew Ng
Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Gradient descent
Have some function $J(\theta_0, \theta_1)$
Want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Outline:
• Start with some $\theta_0, \theta_1$
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$
until we hopefully end up at a minimum
Slide credit: Andrew Ng
Gradient descent
Repeat until convergence {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for $j = 0$ and $j = 1$)
}

$\alpha$: learning rate (step size)

$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: derivative (rate of change)

Slide credit: Andrew Ng


Gradient descent
Correct (simultaneous update):
  temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
  temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
  $\theta_0$ := temp0
  $\theta_1$ := temp1

Incorrect (sequential update):
  temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
  $\theta_0$ := temp0
  temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$   (this uses the already-updated $\theta_0$)
  $\theta_1$ := temp1

Slide credit: Andrew Ng
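In code, the simultaneous update falls out naturally when the whole gradient is computed before any parameter is overwritten; a tiny sketch with stand-in numbers:

```python
import numpy as np

theta = np.array([0.0, 0.0])      # [theta0, theta1]
alpha = 0.01
grad = np.array([0.5, -1.2])      # stand-in values for dJ/dtheta0, dJ/dtheta1

# Correct: one vectorized assignment, so both components use the old theta.
theta = theta - alpha * grad

# Incorrect pattern (avoid): updating theta[0] first and then recomputing the
# gradient for theta[1] would mix old and new parameter values.
```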


$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1)$

[Plot of $J(\theta_1)$: where the slope $\frac{\partial}{\partial \theta_1} J(\theta_1) < 0$, the update increases $\theta_1$; where the slope is $> 0$, the update decreases $\theta_1$. In both cases $\theta_1$ moves toward the minimum.]

Slide credit: Andrew Ng
Learning rate
Gradient descent for linear regression
Repeat until convergence {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for $j = 0$ and $j = 1$)
}

• Linear regression model:
  $h_\theta(x) = \theta_0 + \theta_1 x$
  $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Slide credit: Andrew Ng


Computing partial derivative
• $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

• $j = 0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
• $j = 1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

Slide credit: Andrew Ng


Gradient descent for linear regression
Repeat until convergence {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
Update $\theta_0$ and $\theta_1$ simultaneously

Slide credit: Andrew Ng
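A runnable sketch of these two updates (NumPy; the rescaled toy data and hyper-parameters are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x with squared-error cost."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        err = theta0 + theta1 * x - y                   # h(x_i) - y_i for all i
        grad0 = err.sum() / m                           # dJ/dtheta0
        grad1 = (err * x).sum() / m                     # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

# Toy data on the house-price scale (sizes rescaled to thousands of feet^2):
x = np.array([2.104, 1.416, 1.534, 0.852])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(gradient_descent(x, y))
```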


Batch gradient descent
• “Batch”: each step of gradient descent uses all the training examples
Repeat until convergence {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
$m$: number of training examples

Slide credit: Andrew Ng


Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Training dataset
Size in feet^2 (x) Price ($) in 1000’s (y)
2104 460
1416 232
1534 315
852 178
… …

$h_\theta(x) = \theta_0 + \theta_1 x$

Slide credit: Andrew Ng


Multiple features (input variables)
  Size in feet^2 (x1) | Number of bedrooms (x2) | Number of floors (x3) | Age of home in years (x4) | Price ($) in 1000's (y)
  2104 | 5 | 1 | 45 | 460
  1416 | 3 | 2 | 40 | 232
  1534 | 3 | 2 | 30 | 315
  852  | 2 | 1 | 36 | 178
  …    | … | … | …  | …

Notation:
• $n$ = number of features
• $x^{(i)}$ = input features of the $i$-th training example
• $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
Slide credit: Andrew Ng
Hypothesis
Previously: $h_\theta(x) = \theta_0 + \theta_1 x$

Now: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$

Slide credit: Andrew Ng


$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$
• For convenience of notation, define $x_0 = 1$ (for all examples), so that
  $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_n x_n = \theta^\top x$

Slide credit: Andrew Ng
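A short sketch of the vectorized hypothesis with the $x_0 = 1$ column prepended (NumPy; the zero initialization of $\theta$ is just for illustration):

```python
import numpy as np

# Design matrix for the four training examples above, with x0 = 1 prepended to each row:
X_raw = np.array([[2104, 5, 1, 45],
                  [1416, 3, 2, 40],
                  [1534, 3, 2, 30],
                  [ 852, 2, 1, 36]], dtype=float)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (m, n+1)

theta = np.zeros(X.shape[1])                            # [theta0, theta1, ..., thetan]
h = X @ theta                                           # h_theta(x^(i)) = theta^T x^(i) for every example
print(h)
```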


Gradient descent
• Previously ($n = 1$):
  Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
  }

• New algorithm ($n \ge 1$):
  Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (for $j = 0, 1, \ldots, n$)
  }

Simultaneously update $\theta_j$ for all $j$

Slide credit: Andrew Ng


Gradient descent in practice: Feature scaling
• Idea: make sure features are on a similar scale (e.g., roughly $-1 \le x_j \le 1$)
• E.g., size (0–2000 feet^2) and number of bedrooms (1–5):
  $x_1 = \frac{\text{size}}{2000}$,  $x_2 = \frac{\text{number of bedrooms}}{5}$

[Contour plots of $J$ over $(\theta_1, \theta_2)$: without scaling the contours are elongated and gradient descent zig-zags; with scaling they are closer to circles and convergence is faster]

Slide credit: Andrew Ng
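A minimal sketch of the scaling idea using mean normalization, one common variant (the data rows reuse the earlier table; nothing here is prescribed by the slides):

```python
import numpy as np

def scale_features(X):
    """Rescale each column to a similar range using mean normalization:
    x_j <- (x_j - mean_j) / (max_j - min_j)."""
    mean = X.mean(axis=0)
    span = X.max(axis=0) - X.min(axis=0)
    return (X - mean) / span

X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
print(scale_features(X))   # both columns now live in roughly [-0.5, 0.6]
```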
Gradient descent in practice: Learning rate
• Automatic convergence test: e.g., declare convergence when $J(\theta)$ decreases by less than a small threshold in one iteration
• $\alpha$ too small: slow convergence
• $\alpha$ too large: $J(\theta)$ may not decrease on every iteration; may not converge

• To choose $\alpha$, try a range of values such as

  0.001, …, 0.01, …, 0.1, …, 1

Image credit: CS231n@Stanford

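One way to pick $\alpha$ in practice, sketched below: run a few iterations for each candidate value and keep the largest $\alpha$ for which the cost still decreases (assumes $X$ already includes the $x_0 = 1$ column; names and thresholds are illustrative):

```python
import numpy as np

def cost(theta, X, y):
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def try_learning_rates(X, y, alphas=(0.001, 0.01, 0.1, 1.0), iters=100):
    """Return the largest alpha whose cost curve keeps decreasing."""
    best = None
    for alpha in alphas:
        theta = np.zeros(X.shape[1])
        prev = cost(theta, X, y)
        ok = True
        for _ in range(iters):
            grad = X.T @ (X @ theta - y) / len(y)
            theta = theta - alpha * grad
            cur = cost(theta, X, y)
            if not np.isfinite(cur) or cur > prev:   # diverging or increasing cost
                ok = False
                break
            prev = cur
        if ok:
            best = alpha
    return best
```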

House prices prediction

• Features can be combined or redefined, e.g., using a single area feature in place of separate raw measurements of the lot

Slide credit: Andrew Ng


Polynomial regression
[Plot: Price ($) in 1000's vs. Size in feet^2, with polynomial fits through the training points]

Choose features that are powers of the size:
  $x_1 = (\text{size})$,  $x_2 = (\text{size})^2$,  $x_3 = (\text{size})^3$

$h_\theta(x) = \theta_0 + \theta_1 (\text{size}) + \theta_2 (\text{size})^2 + \theta_3 (\text{size})^3$

Slide credit: Andrew Ng
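A brief sketch of fitting this cubic model by building the powers of size as extra columns; it uses an ordinary least-squares solve, and the rescaling of size is an assumption to keep the powers well-conditioned:

```python
import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0]) / 1000.0   # rescale so size^3 stays moderate
price = np.array([460.0, 232.0, 315.0, 178.0])

# Feature matrix [1, size, size^2, size^3] for each example:
X = np.column_stack([np.ones_like(size), size, size ** 2, size ** 3])

theta, *_ = np.linalg.lstsq(X, price, rcond=None)   # least-squares fit of theta
print(theta)
print(X @ theta)                                     # fitted prices on the training points
```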


Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
  x0 | Size in feet^2 (x1) | Number of bedrooms (x2) | Number of floors (x3) | Age of home in years (x4) | Price ($) in 1000's (y)
  1  | 2104 | 5 | 1 | 45 | 460
  1  | 1416 | 3 | 2 | 40 | 232
  1  | 1534 | 3 | 2 | 30 | 315
  1  | 852  | 2 | 1 | 36 | 178
  …  | …    | … | … | …  | …

$X = \begin{bmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{bmatrix}$,  $y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}$

$\theta = (X^\top X)^{-1} X^\top y$

Slide credit: Andrew Ng
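A small sketch of the normal equation applied to the table above (names are illustrative; note that with only these four rows $X^\top X$ is singular, so the sketch falls back to the pseudo-inverse):

```python
import numpy as np

X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y.  Here there are fewer examples than parameters, so
# X^T X is singular and the pseudo-inverse returns the minimum-norm solution.
# With m > n (the usual case), np.linalg.solve(X.T @ X, X.T @ y) works directly.
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)
print(theta)
print(X @ theta)   # fitted prices (match y here up to numerical precision)
```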
Least square solution: $\theta = (X^\top X)^{-1} X^\top y$

Justification/interpretation 1
• Loss minimization: choose $\theta$ that minimizes the average loss on the training set
• Least squares loss: $\ell(h_\theta(x), y) = (h_\theta(x) - y)^2$
• Empirical Risk Minimization (ERM): $\hat{\theta} = \arg\min_\theta \frac{1}{m} \sum_{i=1}^{m} \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$


Justification/interpretation 2
• Probabilistic model
  • Assume a linear model with Gaussian errors: $y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}$, with $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$

• Solving for the maximum likelihood estimate of $\theta$ recovers the least squares solution (see the derivation below)

Image credit: CS 446@UIUC
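A short sketch of that maximum-likelihood step, under the Gaussian-noise assumption stated above: the log-likelihood of the training set is

$\log L(\theta) = \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2$,

so maximizing $\log L(\theta)$ over $\theta$ is the same as minimizing $\sum_{i=1}^{m} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2$, i.e., the least squares objective.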


Justification/interpretation 3
• Geometric interpretation

[Figure: the prediction $X\theta$ lies in the column space of $X$; the residual $X\theta - y$ connects it to $y$]

• $\mathrm{col}(X)$: the column space of $X$, i.e., the span of the columns of $X$; the prediction $X\theta$ always lies in $\mathrm{col}(X)$
• At the least squares solution, the residual $X\theta - y$ is orthogonal to the column space of $X$ (see the derivation below)


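A one-line sketch of how this picture gives the normal equation: orthogonality of the residual to every column of $X$ means

$X^\top (X\theta - y) = 0 \;\Rightarrow\; X^\top X \theta = X^\top y \;\Rightarrow\; \theta = (X^\top X)^{-1} X^\top y$ (when $X^\top X$ is invertible).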
$m$ training examples, $n$ features

Gradient Descent:
• Need to choose $\alpha$
• Needs many iterations
• Works well even when $n$ is large

Normal Equation:
• No need to choose $\alpha$
• No need to iterate
• Need to compute $(X^\top X)^{-1}$
• Slow if $n$ is very large
Slide credit: Andrew Ng


Things to remember
• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n = \theta^\top x$

• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression:
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression:
  Can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
Next
• Naïve Bayes, Logistic regression, Regularization
