Statistics
Shaheena Bashir
Fall 2019
Outline
Introduction
Regression
Assumptions about The Model
Method of Least Squares
Assessment of the Model
Graphical Assessment
Regression with Categorical Predictor
Introduction
Motivating Example
- https://www.nature.com/articles/ejhg20095
- https://www.wired.com/2009/03/predicting-height-the-victorian-approach-beats-modern-genomics/
Introduction
Galton’s Dataset
[Scatterplot of Galton's data: child height (inches, 62 to 74) against mid-parent height (inches, 64 to 72)]
Deterministic Models
y = α + βx

Area = πr²

Circumference = 2πr

Fahrenheit = 32 + (9/5) × Celsius
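Each of these deterministic relationships maps an input to exactly one output, with no random error term. A minimal Python sketch (the function names are mine, not from the lecture):

```python
import math

# Deterministic models: each input determines exactly one output,
# with no random error term.
def area(r):
    return math.pi * r ** 2

def circumference(r):
    return 2 * math.pi * r

def fahrenheit(celsius):
    return 32 + (9 / 5) * celsius

print(area(2.0), circumference(2.0), fahrenheit(0.0))
```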
Regression
Example
[Scatterplot: Child height (inches, 62 to 74) against Mid-parent height (inches, 64 to 72), with the fitted regression line in red]
Example: Cont’d

- The red line is the regression line

  Y = βo + β1 x + ε,  ε ∼ N(0, σ²)

  That is, for any value of the independent variable there is a single most likely value for the dependent variable.
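The probabilistic model can be illustrated by simulation. A Python sketch with made-up coefficients (beta0, beta1, and sigma are illustrative values, not estimates from Galton's data):

```python
import random

random.seed(0)

# Illustrative coefficients, NOT estimates from Galton's data.
beta0, beta1, sigma = 20.0, 0.7, 2.0

# Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2): for each x the responses
# scatter around a single most likely value, beta0 + beta1*x.
xs = [64, 66, 68, 70, 72]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]
for x, y in zip(xs, ys):
    print(x, round(y, 2))
```

Averaging many simulated responses at a fixed x recovers the line value beta0 + beta1*x, which is what "single most likely value" means under the normal error assumption.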
Assumptions about The Model

Predicted Values
Errors

We can estimate the random errors εi in the fitted values by the vertical distances

εi = yi − βo − β1 xi

[Figure: observed values yi, fitted values ŷi, and the residual e1 = y1 − ŷ1 shown as a vertical distance from the fitted line]
Method of Least Squares
β̂o = ȳ − β̂1 x̄

β̂1 = Σ(yi − ȳ)(xi − x̄) / Σ(xi − x̄)²

where ȳ = Σi yi /n and x̄ = Σi xi /n are the sample means of the response variable and the predictor variable, respectively. (β̂o, β̂1) are also called the OLS estimates. The units of βo are the same as the units of Y, while the units of β1 are units of Y per unit of x. The least squares regression line is then

Ŷ = β̂o + β̂1 X
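The closed-form OLS formulas above can be computed directly. A Python sketch on made-up heights (the data are illustrative, not Galton's):

```python
# Compute the OLS estimates from the closed-form formulas.
# Data values are invented for illustration, not Galton's dataset.
xs = [64.0, 66.0, 68.0, 70.0, 72.0]   # predictor (e.g. mid-parent height)
ys = [65.0, 66.5, 67.0, 69.0, 70.5]   # response  (e.g. child height)

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# beta1_hat = sum((yi - ybar)(xi - xbar)) / sum((xi - xbar)^2)
beta1 = sum((y - ybar) * (x - xbar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
# beta0_hat = ybar - beta1_hat * xbar
beta0 = ybar - beta1 * xbar

print(beta0, beta1)
```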
Residuals ei

- The vertical distance from the observed yi to the fitted value ŷi is called the residual:

  ei = yi − ŷi = yi − β̂o − β̂1 xi,  i = 1, . . . , n

- The residuals can be thought of as estimates (predicted values) of the unknown errors ε1, . . . , εn.
1. The least squares line always passes through the point (x̄, ȳ).
2. The sum of the residuals ei is 0.
3. The sum of the squares of the ei is called the Residual Sum of Squares or Sum of Squared Errors (SSE).
4. An unbiased estimate of the variance σ² is given by σ̂² = SSE/(n − 2).
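Properties 1 and 2, together with SSE and the unbiased variance estimate, can be checked numerically. A sketch on invented data:

```python
# Numerical check of the least-squares properties on invented data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta1 = sum((y - ybar) * (x - xbar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta1 * xbar
resid = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# 1. The line passes through (xbar, ybar).
assert abs((beta0 + beta1 * xbar) - ybar) < 1e-9
# 2. The residuals sum to 0.
assert abs(sum(resid)) < 1e-9
# 3-4. SSE and the unbiased variance estimate SSE/(n - 2).
sse = sum(e ** 2 for e in resid)
sigma2_hat = sse / (n - 2)
print(round(sse, 4), round(sigma2_hat, 4))
```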
Assessment of the Model

Coefficient of Determination R²

The strength of the relationship between x and y is measured by the coefficient of determination R²:

R² = 1 − SSE/SSy = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
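The R² formula can be computed directly from the fit. A sketch on invented data (not Galton's):

```python
# R^2 = 1 - SSE/SSy on invented data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta1 = sum((y - ybar) * (x - xbar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta1 * xbar

sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))  # Sum(yi - yhat_i)^2
ssy = sum((y - ybar) ** 2 for y in ys)                              # Sum(yi - ybar)^2
r2 = 1 - sse / ssy
print(round(r2, 4))
```

For simple linear regression, R² equals the squared sample correlation between x and y, which gives an independent check on the computation.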
Diagnostics
Bivariate Plots
[Residual plot: residuals (−8 to 6) against fitted values (66 to 71) for the fitted model]

- The plot should look like a random scatter about the line y = 0 with constant variance.
- A pattern in the plot may indicate violation of one or more assumptions.
Normal Q−Q

[Normal Q−Q plot: standardized residuals against theoretical quantiles for lm(child ~ parent), with observations 1, 2, and 13 labelled as extreme]
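What a normal Q−Q plot computes can be sketched directly: sort the standardized residuals and pair them with theoretical normal quantiles. A Python sketch (the residual values below are invented, not from lm(child ~ parent)):

```python
from statistics import NormalDist, mean, stdev

# Invented residuals for illustration.
resid = [-1.9, -0.4, 0.1, 0.3, 0.8, 1.2, -0.6, 0.5]

# Standardize, then sort: these are the sample quantiles (y-axis).
m, s = mean(resid), stdev(resid)
std_resid = sorted((e - m) / s for e in resid)

# Theoretical N(0, 1) quantiles at probability points (i - 0.5)/n (x-axis).
n = len(std_resid)
theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# If the residuals are roughly normal, these pairs fall near a straight line.
for t, r in zip(theo, std_resid):
    print(round(t, 2), round(r, 2))
```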
Background
Regression with Categorical Predictor

Y = βo + β1 X + ε

Here

X = 1 if Male, 0 if Female

is the dummy variable. Then for Males

E(Y | X = 1) = βo + β1

while for Females

E(Y | X = 0) = βo

β̂1 is interpreted as the increase or decrease in the mean response for Males vs Females.
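With a single 0/1 dummy predictor, the OLS slope reduces to the difference of the two group means, which can be verified numerically. A Python sketch (the weights below are invented, not the lecture's data):

```python
# Dummy-variable regression: Y = beta0 + beta1*X + eps with X = 1 for
# Males, 0 for Females. Weights (kg) are invented for illustration.
weights = [52.0, 55.0, 49.0, 70.0, 66.0, 68.0]
gender  = ["F",  "F",  "F",  "M",  "M",  "M"]

xs = [1.0 if g == "M" else 0.0 for g in gender]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(weights) / n
beta1 = sum((y - ybar) * (x - xbar) for x, y in zip(xs, weights)) / \
        sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta1 * xbar

# With a 0/1 predictor, beta0_hat is the female mean and beta1_hat is
# the male mean minus the female mean.
f_mean = sum(w for w, g in zip(weights, gender) if g == "F") / 3
m_mean = sum(w for w, g in zip(weights, gender) if g == "M") / 3
print(beta0, beta1)
```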
An R Example

Remember

Weight = β̂o + β̂1 Gender

[Boxplots of Weight in Kg (45 to 70) by Gender (F, M)]