
Lecture 7

Supervised Learning
Linear Regression: Model and Algorithms

Dr. Reda Elbasiony


Linear regression:
The model
How much is this house worth?

We want to figure out how much this house will sell for.
Data: input → output
(x1 = sq.ft., y1 = $)
(x2 = sq.ft., y2 = $)
(x3 = sq.ft., y3 = $)
(x4 = sq.ft., y4 = $)
(x5 = sq.ft., y5 = $)

Input vs. Output:
• y is the quantity of interest
• assume y can be predicted from x

Model –
How we assume the world works

Regression model:
[Figure: regression fit of price ($) y vs. square feet (sq.ft.) x]

Model –
How we assume the world works

“Essentially, all models are wrong, but some are useful.” – George Box

[Figure: price ($) y vs. square feet (sq.ft.) x]

Regression Process
• Choose a model.

• Find the best fit to our data set.

• Use the fitted function to make future predictions.

Simple linear regression model
yi = w1 xi + εi

w1 is the parameter (regression coefficient) that we want to learn from our data set.

f(x) = w1 x

[Figure: fitted line f(x) = w1 x on price ($) vs. square feet (sq.ft.)]

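To make the model concrete, here is a small Python sketch (not from the lecture) that generates toy data from yi = w1 xi + εi; the weight, noise level, and ranges are made-up values for illustration.

import numpy as np

rng = np.random.default_rng(0)
w1_true = 280.0                         # hypothetical price per sq. ft. (made up)
x = rng.uniform(500, 3500, size=50)     # square feet for 50 houses
eps = rng.normal(0, 20000, size=50)     # noise term eps_i
y = w1_true * x + eps                   # observed sale prices

def f(x_new, w1):
    # the model's prediction f(x) = w1 * x
    return w1 * x_new
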
Predictions just based on house size
Only 1 bathroom! Obviously the price is not the same as it would be for a house with 3 bathrooms.

[Figure: price ($) vs. square feet (sq.ft.)]

Add more inputs: f(x) = w1 sq.ft. + w2 #bath

[Figure: price ($) as a function of square feet x[1] and #bathrooms x[2]]

Many possible inputs
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …

General notation
Output: y (scalar)
Inputs: x = (x[1], x[2], …, x[d]) (d-dim vector)
e.g., x[1] = sq. ft., x[2] = #baths, and so on.

Notational conventions:
training set: {(xi, yi)}i=1..n
xi = input of ith data point/observation (vector); yi is output
xi[j] = jth input of ith data point (scalar)
n = number of observations; d = number of input features
Generic linear regression model
Model: Given feature vector xi = (xi[1], xi[2], …, xi[d]),

yi = w1 xi[1] + w2 xi[2] + … + wd xi[d] + εi
   = ∑j=1..d wj xi[j] + εi = wT xi + εi = xiT w + εi

feature 1 = x[1] = sq. ft.
feature 2 = x[2] = #bath
…
feature d = x[d] = lot size
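As a quick illustration (not from the lecture), the per-observation prediction wT xi is just a dot product; the feature values and weights below are made up.

import numpy as np

x_i = np.array([2000.0, 2.0, 8000.0])   # x_i[1]=sq.ft., x_i[2]=#bath, x_i[3]=lot size
w = np.array([250.0, 10000.0, 2.0])     # one weight per feature (made up)

y_hat_i = w @ x_i                       # wT x_i; identical to x_i @ w
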
Fitting the linear regression model



“Cost” of using a given line
Residual sum of squares (RSS):

RSS(w1) = ∑i=1..n (yi - w1 xi)2

[Figure: price ($) vs. square feet (sq.ft.)]

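A minimal Python sketch of this cost, using toy data invented for illustration; it evaluates RSS(w1) for two candidate slopes to show that the better-fitting line has the lower cost.

import numpy as np

# made-up (sq.ft., price) pairs
x = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([290000.0, 430000.0, 555000.0, 700000.0])

def rss_simple(w1):
    # RSS(w1) = sum over i of (y_i - w1 * x_i)^2
    residuals = y - w1 * x
    return np.sum(residuals ** 2)

print(rss_simple(200.0), rss_simple(280.0))   # the lower value is the better line
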
RSS for multiple regression
RSS(w)

[Figure: price ($) as a function of square feet x[1] and #bathrooms x[2]]

Rewrite in matrix notation

For all observations together: y = Xw + ε, where y = (y1,…,yn)T, ε = (ε1,…,εn)T, and X is the n×d matrix whose ith row is xiT.

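One possible way to form X in code (a sketch with made-up feature values): stack one column per feature so that row i is xiT, and X @ w then gives all n predictions at once.

import numpy as np

sqft = np.array([1000.0, 1500.0, 2000.0, 2500.0])      # feature 1 (made up)
baths = np.array([1.0, 2.0, 2.0, 3.0])                 # feature 2 (made up)
y = np.array([290000.0, 430000.0, 555000.0, 700000.0])

X = np.column_stack([sqft, baths])   # shape (n, d); row i is x_iT
w = np.array([250.0, 10000.0])       # candidate weights

y_hat = X @ w                        # all n predictions at once
residual = y - y_hat                 # one entry per observation
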
RSS for multiple regression

Objective: Find the best fit, i.e., find the w that minimizes

RSS(w) = (y-Xw)T(y-Xw)

[Figure: price ($) as a function of square feet x[1] and #bathrooms x[2]]

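The same objective written in code, reusing the made-up toy data from the previous sketch.

import numpy as np

X = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 2.0], [2500.0, 3.0]])
y = np.array([290000.0, 430000.0, 555000.0, 700000.0])
w = np.array([250.0, 10000.0])

r = y - X @ w        # residual vector y - Xw
rss = r @ r          # (y-Xw)T(y-Xw); for a 1-D array r.T @ r is just r @ r
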
Our specific optimization problem



Two ways to solve our optimization problem

Objective: Find the best fit, i.e., find the w that minimizes

RSS(w) = (y-Xw)T(y-Xw)

RSS(w) is a function of w.

[Figure: price ($) as a function of square feet x[1] and #bathrooms x[2]]

1. Solve for ∇RSS(w) = 0

Gradient of RSS:

∇RSS(w) = ∇[(y-Xw)T(y-Xw)]

Closed-form solution
∇RSS(w) = ∇[(y-Xw)T(y-Xw)]
        = -2XT(y-Xw)

We want the solution to -2XT(y-Xŵ) = 0

Solution: “normal equations” XTXŵ = XTy

Solution to normal equations XTXŵ = XTy:

ŵ = (XTX)-1 XTy

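A sketch of the closed-form fit on made-up toy data; solving the normal equations with np.linalg.solve is the usual numerically safer choice over forming the explicit inverse.

import numpy as np

X = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 2.0], [2500.0, 3.0]])
y = np.array([290000.0, 430000.0, 555000.0, 700000.0])

# Solve XTX w_hat = XT y; assumes XTX is invertible.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)
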
2. Gradient descent
• Repeatedly move in the direction that reduces the value of the function.

Gradient descent for linear regression:
repeatedly move in direction of negative gradient

while not converged:
  w(t+1) ← w(t) - η ∇RSS(w(t))
         = w(t) + 2η XT(y-Xw(t))      [since ∇RSS(w(t)) = -2XT(y-Xw(t))]

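A single vectorized gradient step as a Python sketch; the step size η is a made-up default here and would need tuning (or feature rescaling) in practice.

import numpy as np

def gradient_step(w, X, y, eta=1e-9):
    # grad RSS(w) = -2 XT (y - Xw)
    grad = -2.0 * X.T @ (y - X @ w)
    return w - eta * grad            # equivalently w + 2*eta*XT(y - Xw)
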
Interpreting elementwise
Update to jth feature weight:

wj(t+1) ← wj(t) + 2η ∑i=1..n xi[j](yi - ŷi(w(t)))

[Figure: price ($) as a function of square feet x[1] and #bathrooms x[2]]

Summary of gradient descent
for multiple regression

init w(1) = 0 (or randomly, or smartly), t = 1

while ||∇RSS(w(t))|| > ε:
  for j = 1, …, d:
    partial[j] = -2 ∑i=1..n xi[j](yi - ŷi(w(t)))
    wj(t+1) ← wj(t) - η partial[j]
  t ← t + 1

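A direct Python translation of the pseudocode above (a sketch, not the lecture's code); the step size, tolerance, and iteration cap are made-up values, and in practice features are usually rescaled so that one step size works for all of them.

import numpy as np

def gradient_descent_regression(X, y, eta=1e-9, tol=1.0, max_iter=100000):
    n, d = X.shape
    w = np.zeros(d)                          # init w(1) = 0
    for _ in range(max_iter):
        y_hat = X @ w                        # predictions under the current w(t)
        partial = np.empty(d)
        for j in range(d):                   # for j = 1, ..., d
            partial[j] = -2.0 * np.sum(X[:, j] * (y - y_hat))
        if np.linalg.norm(partial) <= tol:   # stop once ||grad RSS(w)|| <= epsilon
            break
        w = w - eta * partial                # wj(t+1) <- wj(t) - eta * partial[j]
    return w
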
