
DS303: Introduction to Machine Learning

Linear Regression

Manjesh K. Hanawal
Classification vs Regression

Learning setup:
I S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)})} is a set of data points
I x^{(i)} ∈ X for all i = 1, 2, . . . , m are drawn i.i.d.
I y^{(i)} ∈ Y is the label of x^{(i)}; y^{(i)} = f(x^{(i)}) (f is unknown)



(Supervised) Learning problem: Given S (the dataset), find the ‘best’ label for a new sample x ∈ X drawn from the same distribution.
I When |Y| is finite, the learning problem is a classification problem
I When |Y| is not finite (e.g., Y ⊆ R), the learning problem is a regression problem



Linear Regression

We will focus on a simple type of relation: linear, i.e., the relation between features and labels is linear.

For example, we can relate the force for stretching a spring to the distance that the spring stretches (Hooke's law), or describe how many transistors the semiconductor industry can pack into a circuit over time (Moore's law). Despite its simplicity, linear regression is an incredibly powerful tool for analyzing data; we focus on the basics here, and a few small tweaks and extensions enable more complex analyses.

[Figure: examples where a line fit explains physical phenomena and engineering feats. (a) Hooke's law: amount of stretch (mm) vs. force on spring (Newtons), fitted line y = −0.044 + 35·x, r² = 0.999; in classical mechanics, one could empirically verify Hooke's law by dangling a mass from a spring and seeing how much the spring is stretched. (b) Moore's law: in the semiconductor industry, the number of transistors on an integrated circuit doubles roughly every two years.]

I Amount of stretch when force is applied on a spring (spring constant!)
I Number of transistors on the silicon area doubles every 18 months (Moore's Law!)
Everything is not linear!

But just because fitting a line is easy doesn't mean that it always makes sense. Recall Anscombe's Quartet: 4 datasets with very similar statistical properties under a simple quantitative analysis, but that look very different. Here they are again, this time with a linear regression line fitted to each one:

[Figure: the four Anscombe datasets, each shown with its fitted regression line.]

For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is 3.00 (to two decimal places).

I Data with similar quantitative summaries (e.g., the same means) may look very different
I Visualizing the data reveals patterns that are hidden in a purely numerical analysis

We begin with simple linear regression, in which there are only two variables of interest (e.g., force used and distance stretched); we then turn to multiple linear regression, where there are more variables.
Simple Linear Regression

I Assume: each sample has one feature/attribute (x^{(i)} ∈ R)
I We will fit a line of the form y = β0 + β1 x
I x is called the independent/predictor variable
I y is called the dependent/response variable (the label)
I β1 is the slope and β0 is the intercept
I We will get different lines for different choices of (β0, β1)
I How do we quantify how good a line is?
I Choose the best line!



Probabilistic Model for Linearly Related Data

I Instead of y^{(i)} = β1 x^{(i)} + β0, assume the data is perturbed by noise
I y^{(i)} = β1 x^{(i)} + β0 + ε^{(i)}, where ε^{(i)} is a random perturbation (noise)
I The perturbation captures the fact that the data won't fit the model perfectly
I E[y^{(i)}] = β0 + β1 x^{(i)}, i.e., the model fits in expectation
I We assume that ε^{(i)} ∼ N(0, σ²), where σ² is known

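To make the probabilistic model concrete, here is a minimal sketch in Python/numpy (an illustrative choice, not prescribed by the course) that generates synthetic data from y = β0 + β1 x + ε; the "true" values β0 = 2, β1 = 3, σ = 0.5 and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters, chosen only for illustration.
beta0, beta1, sigma = 2.0, 3.0, 0.5
m = 100

x = rng.uniform(0.0, 5.0, size=m)      # features x^(i)
eps = rng.normal(0.0, sigma, size=m)   # noise eps^(i) ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps            # labels y^(i)

# Averaging out the noise, E[y | x] = beta0 + beta1 * x.
print(y[:5])
```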


Quantify goodness of a line: Mean Squared Error
I Minimize the distance between the line and the data points
I The distance (error) of point (x^{(i)}, y^{(i)}) from the line (β0, β1) is

y^{(i)} − (β1 x^{(i)} + β0)

I Since lying above or below the line is equally bad, we can take

|y^{(i)} − (β1 x^{(i)} + β0)|   (absolute error)

(y^{(i)} − (β1 x^{(i)} + β0))²   (squared error)

I We take the goodness of the line (β0, β1) to be the average of the squared errors

\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\beta_1 x^{(i)} + \beta_0) \right)^2

\min_{(\beta_0, \beta_1)} \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\beta_1 x^{(i)} + \beta_0) \right)^2 \qquad \text{Mean Squared Error (MSE)}
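As a quick illustration of the MSE criterion, a sketch on toy data (the numbers and the candidate (β0, β1) pairs are arbitrary): different lines give different mean squared errors.

```python
import numpy as np

# Small synthetic dataset (arbitrary numbers, for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

def mse(beta0, beta1, x, y):
    """Mean squared error of the line y = beta0 + beta1 * x on the data."""
    residuals = y - (beta1 * x + beta0)
    return np.mean(residuals ** 2)

# Different choices of (beta0, beta1) give different lines and different MSEs.
for beta0, beta1 in [(0.0, 3.0), (2.0, 3.0), (2.0, 2.5)]:
    print(beta0, beta1, mse(beta0, beta1, x, y))
```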
Least Squares Solution

Minimizing the MSE gives the maximum-likelihood solution when the errors are i.i.d. ∼ N(0, σ²).


(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{(\beta_0, \beta_1)} \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\beta_1 x^{(i)} + \beta_0) \right)^2

\hat{\beta}_1 = \frac{\frac{1}{m}\sum_{i=1}^{m} x^{(i)} y^{(i)} - \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)\left(\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\right)}{\frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2 - \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)^2}

\hat{\beta}_0 = \left(\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\right) - \hat{\beta}_1 \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)

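A minimal numpy sketch of the closed-form estimates above on the same arbitrary toy data, cross-checked against numpy.polyfit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

# Closed-form least squares estimates from the slide.
beta1_hat = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
beta0_hat = np.mean(y) - beta1_hat * np.mean(x)

# Cross-check with numpy's polynomial fit (degree 1 = a straight line).
check_beta1, check_beta0 = np.polyfit(x, y, deg=1)

print(beta0_hat, beta1_hat)
print(check_beta0, check_beta1)   # should agree up to floating-point error
```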


Expressing the solutions in terms of statistics
Given a random sample (X^{(1)}, X^{(2)}, . . . , X^{(m)}):
I Sample mean: \bar{X} = \frac{1}{m} \sum_{i=1}^{m} X^{(i)}
I Sample variance: S_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(X^{(i)} - \bar{X}\right)^2
I Sample standard deviation: S_X = \sqrt{S_X^2}

For the given data S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)})}:

\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^2

\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y^{(i)}, \qquad s_y^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(y^{(i)} - \bar{y}\right)^2

r = \frac{1}{m-1} \sum_{i=1}^{m} \left( \frac{x^{(i)} - \bar{x}}{s_x} \right) \left( \frac{y^{(i)} - \bar{y}}{s_y} \right) \qquad \text{Correlation coefficient}
i=1

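These statistics can be computed directly; a sketch with numpy on the toy data (ddof=1 gives the 1/(m−1) normalization used above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
m = len(x)

x_bar, y_bar = x.mean(), y.mean()
s_x = x.std(ddof=1)   # sample standard deviation, 1/(m-1) normalization
s_y = y.std(ddof=1)

# Sample correlation coefficient, as defined on the slide.
r = np.sum((x - x_bar) / s_x * (y - y_bar) / s_y) / (m - 1)

print(x_bar, y_bar, s_x, s_y, r)
print(np.corrcoef(x, y)[0, 1])   # numpy's built-in estimate, for comparison
```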


Prediction Example: The perils of extrapolation

\hat{\beta}_1 = r \, \frac{s_y}{s_x} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Given any sample x, its predicted label is

y = \hat{\beta}_1 x + \hat{\beta}_0

For which values of x can we trust this prediction?

[Figure: M. Night Shyamalan's Rotten Tomatoes movie ratings, Rating vs. Year (1995-2015), with fitted line y = -4.6531x + 9367.7, R² = 0.6875. By fitting a line to the Rotten Tomatoes ratings of the movies M. Night Shyamalan directed over time, one may erroneously be led to believe that from 2014 onward his movies will have negative ratings, which isn't even possible!]
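A short sketch of prediction with β̂1 = r·s_y/s_x and β̂0 = ȳ − β̂1·x̄ on the same toy data; as the example above warns, the formula happily produces a prediction for any x, including values far outside the observed range, where it may be meaningless.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

r = np.corrcoef(x, y)[0, 1]
beta1_hat = r * y.std(ddof=1) / x.std(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

def predict(x_new):
    return beta1_hat * x_new + beta0_hat

print(predict(3.5))    # interpolation: x_new lies within the observed range
print(predict(100.0))  # extrapolation: far outside the data; treat with suspicion
```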
Correlation coefficient

r = \frac{1}{m-1} \sum_{i=1}^{m} \left( \frac{x^{(i)} - \bar{x}}{s_x} \right) \left( \frac{y^{(i)} - \bar{y}}{s_y} \right)

I −1 ≤ r ≤ 1. It measures how strongly y is (linearly) related to x


I if r is positive, y increases with x
I if r is negative, y decreases with x
[Figure: four scatter plots with r = −0.9, −0.2, 0.2, 0.9, illustrating correlation strength. Values of r farther from 0 indicate a stronger relationship than values closer to 0; negative values indicate an inverse relationship, while positive values indicate a direct relationship.]

I r² is called the coefficient of determination (it explains how well the data is fit)

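In the spirit of the panels above, one can sample Gaussian pairs with a prescribed correlation ρ and check that the estimated r is close to it; a sketch (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500

for rho in (-0.9, -0.2, 0.2, 0.9):
    # Bivariate normal with unit variances and correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=m)
    x, y = data[:, 0], data[:, 1]
    print(rho, np.corrcoef(x, y)[0, 1])   # estimated r should be close to rho
```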


Multiple Linear Regression
S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)})}, x^{(i)} ∈ R^d, d > 1.
Each sample point x^{(i)} = (x_1^{(i)}, x_2^{(i)}, . . . , x_d^{(i)}).
I We can write the linear relation as y^{(i)} = \sum_{j=1}^{d} x_j^{(i)} \beta_j + \beta_0
I Equivalently, y^{(i)} = \sum_{j=0}^{d} x_j^{(i)} \beta_j, where x_0^{(i)} = 1 for all i = 1, 2, . . . , m
I Set \beta^T = (\beta_0, \beta_1, \beta_2, . . . , \beta_d) and x^{(i)T} = (1, x_1^{(i)}, x_2^{(i)}, . . . , x_d^{(i)})
I Compactly, y^{(i)} = x^{(i)T} \beta for all i = 1, 2, . . . , m

\begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix} = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\ \vdots & & & & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_d^{(m)} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}

y = X^T \beta, where X is the data matrix (its columns are the vectors x^{(i)}, each prepended with a 1).
The probabilistic model is then

y = X^T \beta + \epsilon, \qquad \text{where the } \epsilon_i \text{ are i.i.d. } \sim N(0, \sigma^2)
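A numpy sketch of the slide's convention: the data matrix X has the vectors (1, x^{(i)}) as its columns, so stacking the labels gives y = X^T β (the dimensions and numbers below are arbitrary, and the data is noise-free just to check shapes):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 6, 3

features = rng.normal(size=(m, d))        # rows are the samples x^(i) in R^d
X = np.vstack([np.ones(m), features.T])   # data matrix, shape (d+1, m); columns are (1, x^(i))

beta = np.array([1.0, 0.5, -2.0, 3.0])    # (beta_0, beta_1, ..., beta_d), arbitrary
y = X.T @ beta                            # y^(i) = x^(i)T beta, stacked as y = X^T beta

print(X.shape, y.shape)
```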
Solution of Multiple Linear Regression

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m} \left( y^{(i)} - x^{(i)T} \beta \right)^2

\hat{\beta} = (X X^T)^{-1} X y

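A sketch of the least squares solution β̂ = (X X^T)^{-1} X y in the slide's convention, compared with numpy's built-in least squares solver (which is the numerically preferable route); the synthetic data is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 50, 3

features = rng.normal(size=(m, d))
X = np.vstack([np.ones(m), features.T])              # (d+1) x m data matrix
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X.T @ beta_true + rng.normal(scale=0.1, size=m)  # noisy observations

# Normal equations (fine for small, well-conditioned problems).
beta_hat = np.linalg.inv(X @ X.T) @ X @ y

# Equivalent but numerically safer: solve the least squares problem X^T beta = y.
beta_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```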


Model Evaluation:
Suppose every point y^{(i)} is very close to ȳ. Then y^{(i)} does not depend much on x^{(i)} and there is not much random error.

[Figure: an illustration of the components contributing to the difference between the average y-value ȳ and a particular point (x^{(i)}, y^{(i)}). Some of the difference, ŷ^{(i)} − ȳ, can be explained by the model; the remainder, y^{(i)} − ŷ^{(i)}, is known as the residual.]

y^{(i)} - \bar{y} = \underbrace{\left(\hat{y}^{(i)} - \bar{y}\right)}_{\text{explained by model}} + \underbrace{\left(y^{(i)} - \hat{y}^{(i)}\right)}_{\text{not explained by model}}
Coefficient of Determination

\underbrace{\sum_{i} \left(y^{(i)} - \bar{y}\right)^2}_{\text{SST}} = \underbrace{\sum_{i} \left(\hat{y}^{(i)} - \bar{y}\right)^2}_{\text{SSM}} + \underbrace{\sum_{i} \left(y^{(i)} - \hat{y}^{(i)}\right)^2}_{\text{SSE}}

1 = \underbrace{\frac{\text{SSM}}{\text{SST}}}_{r^2} + \underbrace{\frac{\text{SSE}}{\text{SST}}}_{1 - r^2}

I r² is called the coefficient of determination (the square of the correlation coefficient!)
I It captures the fraction of the variability explained by the model
I It is a measure of how much certainty one can have in predictions made from a given model
I The closer r² is to 1, the better.

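A sketch that computes SST, SSM, and SSE for a fitted line on the toy data and checks both the decomposition SST = SSM + SSE and that SSM/SST equals the squared correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssm = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained)

r_squared = ssm / sst
print(sst, ssm + sse)                            # equal up to floating-point error
print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)   # r^2 equals the squared correlation
```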


Polynomial Fit

