Linear Regression
Manjesh K. Hanawal
Classification vs Regression
Learning setup:
▶ $S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ is a set of data points
▶ $x^{(i)} \in \mathcal{X}$ for all $i = 1, 2, \ldots, m$ are drawn i.i.d.
▶ $y^{(i)} \in \mathcal{Y}$ is the label of $x^{(i)}$; $y^{(i)} = f(x^{(i)})$ ($f$ is unknown)
[Figure: (a) stretch of a spring vs. force on spring (Newtons); (b) transistor count on an integrated circuit vs. year.]

Figure 3.1: Examples where a line fit explains physical phenomena and engineering feats.¹ (a) In classical mechanics, one could empirically verify Hooke's law by dangling a mass with a spring and seeing how much the spring is stretched when force is applied. (b) In the semiconductor industry, Moore's law is an observation that the number of transistors on an integrated circuit doubles roughly every two years.

▶ Amount of stretch when force is applied on a spring is linear in the force (spring constant!)
▶ Number of transistors on the silicon area doubles every 18 months (Moore's Law!)
¹ The Moore's law image is by Wgsimon (own work) [CC-BY-SA-3.0 or GFDL], via Wikimedia Commons.
But just because fitting a line is easy doesn’t mean that it always makes sense. Let’s take
another look at Anscombe’s quartet to underscore this point.
Not everything is linear!
Example: Anscombe's Quartet Revisited

Recall Anscombe's Quartet: 4 datasets with very similar statistical properties under a simple quantitative analysis, but that look very different. Here they are again, but this time with linear regression lines fitted to each one:
[Figure: the four Anscombe datasets, each plotted with its fitted regression line.]
For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is 3.00 (to two decimal places). This just goes to show: visualizing data can often reveal patterns that are hidden by pure numeric analysis!

▶ Data with similar quantitative summaries (e.g., the mean) may look very different
▶ Visualization of data reveals patterns that are hidden in pure numerical analysis

We begin with simple linear regression, in which there are only two variables of interest (e.g., weight and height, or force used and distance stretched). After developing intuition for this setting, we'll then turn our attention to multiple linear regression, where there are more variables.
Disclaimer: While some of the equations in this chapter might be a little intimidating, it's important to keep in mind that as a user of statistics, the most important thing is to understand their uses and limitations. Toward this end, make sure not to get bogged down in the details of the equations, but instead focus on understanding how they fit into the big picture.
Simple Linear Regression
$$\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - (\beta_1 x^{(i)} + \beta_0)\right)^2$$

$$\min_{(\beta_0,\,\beta_1)}\ \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - (\beta_1 x^{(i)} + \beta_0)\right)^2 \qquad \text{Mean Squared Error (MSE)}$$
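As a concrete illustration, here is a minimal sketch (assuming NumPy and a made-up toy dataset; the names and values are illustrative, not from the lecture) of evaluating this objective for a candidate pair (β₀, β₁):

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def mse(beta0, beta1, x, y):
    """Mean squared error of the line y = beta1 * x + beta0 on the data."""
    residuals = y - (beta1 * x + beta0)
    return np.mean(residuals ** 2)

print(mse(0.0, 2.0, x, y))  # MSE of the candidate line y = 2x
```

Minimizing this quantity over (β₀, β₁) gives the least squares fit discussed next.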
Least Squares Solution
$$\hat{\beta}_1 = \frac{\frac{1}{m}\sum_{i=1}^{m} x^{(i)} y^{(i)} \;-\; \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)\left(\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\right)}{\frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2 \;-\; \left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)^2}$$

$$\hat{\beta}_0 = \left(\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\right) - \hat{\beta}_1\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)$$

$$\hat{\beta}_1 = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
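A minimal sketch of these closed-form estimates, again assuming NumPy and an illustrative toy dataset; it also checks the equivalent form β̂₁ = r·s_y/s_x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta1_hat = (mean(x*y) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2)
beta1_hat = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
# beta0_hat = mean(y) - beta1_hat * mean(x)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Equivalent form: beta1_hat = r * s_y / s_x (sample standard deviations)
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(beta1_hat, r * np.std(y, ddof=1) / np.std(x, ddof=1))

print(beta0_hat, beta1_hat)
```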
[Figure: Rotten Tomatoes rating vs. year for movies directed by M. Night Shyamalan, with fitted line y = -4.6531x + 9367.7 and R² = 0.6875.]

Fitting a line to the Rotten Tomatoes ratings for movies that M. Night Shyamalan directed over time, we may erroneously be led to believe that in 2014 and onward, Shyamalan's movies will have negative ratings, which isn't even possible!
Correlation coefficient
$$r = \frac{1}{m-1}\sum_{i=1}^{m}\left(\frac{x^{(i)} - \bar{x}}{s_x}\right)\left(\frac{y^{(i)} - \bar{y}}{s_y}\right)$$
▶ If r is negative, y decreases in x
[Figure: four scatter plots of data with correlation coefficients r = -0.9, -0.2, 0.2, 0.9.]

Figure 3.2: An illustration of correlation strength. Each plot shows data with a particular correlation coefficient r. Values farther from 0 (outside) indicate a stronger relationship than values closer to 0 (inside). Negative values (left) indicate an inverse relationship, while positive values (right) indicate a direct relationship.

▶ r² is called the coefficient of determination (it explains how well the data is fit)
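A small sketch, assuming NumPy and the same kind of illustrative toy arrays as above, computing r directly from the formula and squaring it to get the coefficient of determination:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m = len(x)

# r = (1 / (m - 1)) * sum of standardized products
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations
r = np.sum((x - x_bar) / s_x * (y - y_bar) / s_y) / (m - 1)

assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r, r ** 2)   # r**2 is the coefficient of determination
```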
Multiple Linear Regression

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m}\left(y^{(i)} - x^{(i)T}\beta\right)^2$$

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
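A sketch of the normal-equation solution, under the assumption that X is the m × d design matrix whose rows are the x^{(i)} (with a leading column of ones for the intercept); the data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 50 points with d = 2 features
m, d = 50, 2
X_raw = rng.normal(size=(m, d))
y = 3.0 + 1.5 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 0.1 * rng.normal(size=m)

# Append a column of ones so that beta includes an intercept term
X = np.hstack([np.ones((m, 1)), X_raw])

# beta_hat = (X^T X)^{-1} X^T y; solving the linear system avoids an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # approximately [3.0, 1.5, -2.0]
```

In practice, np.linalg.lstsq is the more numerically robust route to the same estimate.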
[Figure 3.5: An illustration of the components contributing to the difference between the average y-value ȳ and a particular point (x_i, y_i) (blue). Some of the difference, ŷ_i − ȳ, can be explained by the model (orange), and the remainder, y_i − ŷ_i, is known as the residual (green).]

Let's look at how far y^{(i)} is from the mean ȳ. We can write this difference as:

$$y^{(i)} - \bar{y} = \underbrace{\left(\hat{y}^{(i)} - \bar{y}\right)}_{\text{explained by model}} + \underbrace{\left(y^{(i)} - \hat{y}^{(i)}\right)}_{\text{not explained by model}}$$
Coefficient of Determination
$$\underbrace{\sum_{i}\left(y^{(i)} - \bar{y}\right)^2}_{\text{SST}} = \underbrace{\sum_{i}\left(\hat{y}^{(i)} - \bar{y}\right)^2}_{\text{SSM}} + \underbrace{\sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}_{\text{SSE}}$$

$$1 = \underbrace{\frac{\text{SSM}}{\text{SST}}}_{r^2} + \underbrace{\frac{\text{SSE}}{\text{SST}}}_{1 - r^2}$$
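The decomposition can be verified numerically; the sketch below (same illustrative toy data as before, assuming NumPy) computes SST, SSM, SSE and r² = SSM/SST for a fitted simple regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the simple linear regression using the closed-form solution
beta1_hat = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta1_hat * x + beta0_hat

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSM = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
SSE = np.sum((y - y_hat) ** 2)         # residual (unexplained)

assert np.isclose(SST, SSM + SSE)
r_squared = SSM / SST
print(r_squared)   # equals np.corrcoef(x, y)[0, 1] ** 2 for simple regression
```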