
DS303: Introduction to Machine Learning

Linear Regression

Manjesh K. Hanawal
Classification vs Regression

Learning setup:
I S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)})} is a set of data points
I x^{(i)} ∈ X for all i = 1, 2, . . . , m are drawn i.i.d.
I y^{(i)} ∈ Y is the label of x^{(i)}; y^{(i)} = f(x^{(i)}) (f is unknown)



(Supervised) Learning problem: Given S (the dataset), find the ‘best’ label for a new sample x ∈ X drawn from the same distribution.
I When |Y| is finite, the learning problem is a classification problem
I When |Y| is not finite (e.g., Y ⊆ R), the learning problem is a regression problem



Linear Regression

We will focus on a simple type of relation: linear, i.e., the relation between features and labels is linear.

For example, we can relate the force for stretching a spring to the distance that the spring stretches (Hooke's law), or describe how many transistors the semiconductor industry can pack into a circuit over time (Moore's law). Despite its simplicity, linear regression is an incredibly powerful tool for analyzing data; we focus on the basics here, and a few small tweaks and extensions enable more complex analyses.

[Figure: examples where a line fit explains physical phenomena and engineering feats. (a) Hooke's law: amount of stretch (mm) vs. force on spring (Newtons), fitted line y = −0.044 + 35·x, r² = 0.999; in classical mechanics, one could empirically verify Hooke's law by dangling a mass from a spring and seeing how much the spring is stretched. (b) Moore's law: in the semiconductor industry, the number of transistors on an integrated circuit doubles roughly every two years.]

I Amount of stretch when force is applied on a spring (spring constant!)
I Number of transistors on the silicon area doubles every 18 months (Moore's Law!)
Everything is not linear!

But just because fitting a line is easy doesn't mean that it always makes sense. Recall Anscombe's Quartet: 4 datasets with very similar statistical properties under a simple quantitative analysis, but that look very different. Here they are again, this time with a linear regression line fitted to each one:

[Figure: the four Anscombe datasets, each shown with its fitted regression line.]

For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is 3.00 (to two decimal places).

I Data with similar quantitative summaries (e.g., the same means) may look very different
I Visualizing the data reveals patterns that are hidden in a purely numerical analysis

We begin with simple linear regression, in which there are only two variables of interest (e.g., force used and distance stretched); we then turn to multiple linear regression, where there are more variables.
Simple Linear Regression

I Assume: each sample has one feature/attribute (x^{(i)} ∈ R)
I We will fit a line of the form y = β0 + β1 x
I x is called the independent/predictor variable
I y is called the dependent/response variable (the label)
I β1 is the slope and β0 is the intercept
I We will get different lines for different choices of (β0, β1)
I How do we quantify how good a line is?
I Choose the best line!



Probabilistic Model for Linearly Related Data

I Instead of y^{(i)} = β1 x^{(i)} + β0, assume the data is perturbed by noise
I y^{(i)} = β1 x^{(i)} + β0 + ε^{(i)}, where ε^{(i)} is a random perturbation (noise)
I The perturbation captures the fact that the data won't fit the model perfectly
I E[y^{(i)}] = β0 + β1 x^{(i)}, i.e., the model fits in expectation
I We assume that ε^{(i)} ∼ N(0, σ²), where σ² is known

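To make the probabilistic model concrete, here is a minimal sketch in Python/numpy (an illustrative choice, not prescribed by the course) that generates synthetic data from y = β0 + β1 x + ε; the "true" values β0 = 2, β1 = 3, σ = 0.5 and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters, chosen only for illustration.
beta0, beta1, sigma = 2.0, 3.0, 0.5
m = 100

x = rng.uniform(0.0, 5.0, size=m)      # features x^(i)
eps = rng.normal(0.0, sigma, size=m)   # noise eps^(i) ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps            # labels y^(i)

# Averaging out the noise, E[y | x] = beta0 + beta1 * x.
print(y[:5])
```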


Quantify goodness of a line: Mean Squared Error
I Minimize the distance between the line and the data points
I The distance (error) of point (x^{(i)}, y^{(i)}) from the line (β0, β1) is

y^{(i)} − (β1 x^{(i)} + β0)

I Since lying above or below the line is equally bad, we can take

|y^{(i)} − (β1 x^{(i)} + β0)|   (absolute error)

(y^{(i)} − (β1 x^{(i)} + β0))²   (squared error)

I We take the goodness of the line (β0, β1) to be the average of the squared errors

\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\beta_1 x^{(i)} + \beta_0) \right)^2

\min_{(\beta_0, \beta_1)} \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\beta_1 x^{(i)} + \beta_0) \right)^2 \qquad \text{Mean Squared Error (MSE)}
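As a quick illustration of the MSE criterion, a sketch on toy data (the numbers and the candidate (β0, β1) pairs are arbitrary): different lines give different mean squared errors.

```python
import numpy as np

# Small synthetic dataset (arbitrary numbers, for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

def mse(beta0, beta1, x, y):
    """Mean squared error of the line y = beta0 + beta1 * x on the data."""
    residuals = y - (beta1 * x + beta0)
    return np.mean(residuals ** 2)

# Different choices of (beta0, beta1) give different lines and different MSEs.
for beta0, beta1 in [(0.0, 3.0), (2.0, 3.0), (2.0, 2.5)]:
    print(beta0, beta1, mse(beta0, beta1, x, y))
```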
Least Squares Solution

Minimizing the MSE gives the maximum-likelihood solution when the errors are i.i.d. ∼ N(0, σ²).


(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{(\beta_0, \beta_1)} \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\beta_1 x^{(i)} + \beta_0) \right)^2

\hat{\beta}_1 = \frac{\frac{1}{m}\sum_{i=1}^{m} x^{(i)} y^{(i)} - \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)\left(\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\right)}{\frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2 - \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)^2}

\hat{\beta}_0 = \left(\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\right) - \hat{\beta}_1 \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)

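A minimal numpy sketch of the closed-form estimates above on the same arbitrary toy data, cross-checked against numpy.polyfit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

# Closed-form least squares estimates from the slide.
beta1_hat = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
beta0_hat = np.mean(y) - beta1_hat * np.mean(x)

# Cross-check with numpy's polynomial fit (degree 1 = a straight line).
check_beta1, check_beta0 = np.polyfit(x, y, deg=1)

print(beta0_hat, beta1_hat)
print(check_beta0, check_beta1)   # should agree up to floating-point error
```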


Expressing the solutions in terms of statistics
Given a random sample (X^{(1)}, X^{(2)}, . . . , X^{(m)}):
I Sample mean: \bar{X} = \frac{1}{m} \sum_{i=1}^{m} X^{(i)}
I Sample variance: S_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(X^{(i)} - \bar{X}\right)^2
I Sample standard deviation: S_X = \sqrt{S_X^2}

For the given data S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)})}:

\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^2

\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y^{(i)}, \qquad s_y^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(y^{(i)} - \bar{y}\right)^2

r = \frac{1}{m-1} \sum_{i=1}^{m} \left( \frac{x^{(i)} - \bar{x}}{s_x} \right) \left( \frac{y^{(i)} - \bar{y}}{s_y} \right) \qquad \text{Correlation coefficient}
i=1

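These statistics can be computed directly; a sketch with numpy on the toy data (ddof=1 gives the 1/(m−1) normalization used above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])
m = len(x)

x_bar, y_bar = x.mean(), y.mean()
s_x = x.std(ddof=1)   # sample standard deviation, 1/(m-1) normalization
s_y = y.std(ddof=1)

# Sample correlation coefficient, as defined on the slide.
r = np.sum((x - x_bar) / s_x * (y - y_bar) / s_y) / (m - 1)

print(x_bar, y_bar, s_x, s_y, r)
print(np.corrcoef(x, y)[0, 1])   # numpy's built-in estimate, for comparison
```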


Prediction Example: The perils of extrapolation

\hat{\beta}_1 = r \, \frac{s_y}{s_x} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Given any sample x, its predicted label is

y = \hat{\beta}_1 x + \hat{\beta}_0

For which values of x can we trust this prediction?

[Figure: M. Night Shyamalan's Rotten Tomatoes movie ratings, Rating vs. Year (1995-2015), with fitted line y = -4.6531x + 9367.7, R² = 0.6875. By fitting a line to the Rotten Tomatoes ratings of the movies M. Night Shyamalan directed over time, one may erroneously be led to believe that from 2014 onward his movies will have negative ratings, which isn't even possible!]
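A short sketch of prediction with β̂1 = r·s_y/s_x and β̂0 = ȳ − β̂1·x̄ on the same toy data; as the example above warns, the formula happily produces a prediction for any x, including values far outside the observed range, where it may be meaningless.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

r = np.corrcoef(x, y)[0, 1]
beta1_hat = r * y.std(ddof=1) / x.std(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

def predict(x_new):
    return beta1_hat * x_new + beta0_hat

print(predict(3.5))    # interpolation: x_new lies within the observed range
print(predict(100.0))  # extrapolation: far outside the data; treat with suspicion
```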
Correlation coefficient

r = \frac{1}{m-1} \sum_{i=1}^{m} \left( \frac{x^{(i)} - \bar{x}}{s_x} \right) \left( \frac{y^{(i)} - \bar{y}}{s_y} \right)

I −1 ≤ r ≤ 1. It measures how strongly y is (linearly) related to x


I if r is positive, y increases with x
I if r is negative, y decreases with x
[Figure: four scatter plots with r = −0.9, −0.2, 0.2, 0.9, illustrating correlation strength. Values of r farther from 0 indicate a stronger relationship than values closer to 0; negative values indicate an inverse relationship, while positive values indicate a direct relationship.]

I r² is called the coefficient of determination (it explains how well the data is fit)

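In the spirit of the panels above, one can sample Gaussian pairs with a prescribed correlation ρ and check that the estimated r is close to it; a sketch (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500

for rho in (-0.9, -0.2, 0.2, 0.9):
    # Bivariate normal with unit variances and correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=m)
    x, y = data[:, 0], data[:, 1]
    print(rho, np.corrcoef(x, y)[0, 1])   # estimated r should be close to rho
```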


Multiple Linear Regression
S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)})}, x^{(i)} ∈ R^d, d > 1.
Each sample point x^{(i)} = (x_1^{(i)}, x_2^{(i)}, . . . , x_d^{(i)}).
I We can write the linear relation as y^{(i)} = \sum_{j=1}^{d} x_j^{(i)} \beta_j + \beta_0
I Equivalently, y^{(i)} = \sum_{j=0}^{d} x_j^{(i)} \beta_j, where x_0^{(i)} = 1 for all i = 1, 2, . . . , m
I Set \beta^T = (\beta_0, \beta_1, \beta_2, . . . , \beta_d) and x^{(i)T} = (1, x_1^{(i)}, x_2^{(i)}, . . . , x_d^{(i)})
I Compactly, y^{(i)} = x^{(i)T} \beta for all i = 1, 2, . . . , m

\begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix} = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\ \vdots & & & & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_d^{(m)} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}

y = X^T \beta, where X is the data matrix (its columns are the vectors x^{(i)}, each prepended with a 1).
The probabilistic model is then

y = X^T \beta + \epsilon, \qquad \text{where the } \epsilon_i \text{ are i.i.d. } \sim N(0, \sigma^2)
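A numpy sketch of the slide's convention: the data matrix X has the vectors (1, x^{(i)}) as its columns, so stacking the labels gives y = X^T β (the dimensions and numbers below are arbitrary, and the data is noise-free just to check shapes):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 6, 3

features = rng.normal(size=(m, d))        # rows are the samples x^(i) in R^d
X = np.vstack([np.ones(m), features.T])   # data matrix, shape (d+1, m); columns are (1, x^(i))

beta = np.array([1.0, 0.5, -2.0, 3.0])    # (beta_0, beta_1, ..., beta_d), arbitrary
y = X.T @ beta                            # y^(i) = x^(i)T beta, stacked as y = X^T beta

print(X.shape, y.shape)
```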
Solution of Multiple Linear Regression

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m} \left( y^{(i)} - x^{(i)T} \beta \right)^2

\hat{\beta} = (X X^T)^{-1} X y

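A sketch of the least squares solution β̂ = (X X^T)^{-1} X y in the slide's convention, compared with numpy's built-in least squares solver (which is the numerically preferable route); the synthetic data is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 50, 3

features = rng.normal(size=(m, d))
X = np.vstack([np.ones(m), features.T])              # (d+1) x m data matrix
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X.T @ beta_true + rng.normal(scale=0.1, size=m)  # noisy observations

# Normal equations (fine for small, well-conditioned problems).
beta_hat = np.linalg.inv(X @ X.T) @ X @ y

# Equivalent but numerically safer: solve the least squares problem X^T beta = y.
beta_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```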


Model Evaluation:
Suppose every point y^{(i)} is very close to ȳ. Then y^{(i)} does not depend much on x^{(i)} and there is not much random error.

[Figure: an illustration of the components contributing to the difference between the average y-value ȳ and a particular point (x^{(i)}, y^{(i)}). Some of the difference, ŷ^{(i)} − ȳ, can be explained by the model; the remainder, y^{(i)} − ŷ^{(i)}, is known as the residual.]

y^{(i)} - \bar{y} = \underbrace{\left(\hat{y}^{(i)} - \bar{y}\right)}_{\text{explained by model}} + \underbrace{\left(y^{(i)} - \hat{y}^{(i)}\right)}_{\text{not explained by model}}
Coefficient of Determination

\underbrace{\sum_{i} \left(y^{(i)} - \bar{y}\right)^2}_{\text{SST}} = \underbrace{\sum_{i} \left(\hat{y}^{(i)} - \bar{y}\right)^2}_{\text{SSM}} + \underbrace{\sum_{i} \left(y^{(i)} - \hat{y}^{(i)}\right)^2}_{\text{SSE}}

1 = \underbrace{\frac{\text{SSM}}{\text{SST}}}_{r^2} + \underbrace{\frac{\text{SSE}}{\text{SST}}}_{1 - r^2}

I r² is called the coefficient of determination (the square of the correlation coefficient!)
I It captures the fraction of the variability explained by the model
I It is a measure of how much certainty one can have in predictions made from a given model
I The closer r² is to 1, the better.

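A sketch that computes SST, SSM, and SSE for a fitted line on the toy data and checks both the decomposition SST = SSM + SSE and that SSM/SST equals the squared correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 8.1, 10.9, 14.2, 16.8])

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssm = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained)

r_squared = ssm / sst
print(sst, ssm + sse)                            # equal up to floating-point error
print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)   # r^2 equals the squared correlation
```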


Polynomial Fit

