
15: Linear Regression

Expected change in Y per unit X


Introduction
(p. 15.1)

• X = independent (explanatory) variable


• Y = dependent (response) variable
• Use instead of correlation when
– distribution of X is fixed by researcher (i.e., set number at each level of X)
– studying functional dependency between X and Y
Illustrative data (bicycle.sav)
(p. 15.1)

• Same as prior chapter


• X = percent receiving reduced or free meal (RFM)
• Y = percent using helmets (HELM)
• n = 12 (outlier removed to study linear relation)
[Figure: scatterplot; X axis: % Receiving Reduced Fee School Lunch]
Regression Model (Equation)
(p. 15.2)

$\hat{y} = a + bX$ (“y hat”)
where
ŷ represents predicted average of Y at a given X
a represents the line's intercept
b represents the line's slope
How formulas determine best line
(p. 15.2)

• Distance of points from line = residuals (dotted)
• Minimizes sum of squared residuals
• Least squares regression line
[Figure: scatterplot with fitted line and dotted residuals; X axis: % Receiving Reduced Fee School Lunch]
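To make the "minimizes sum of squared residuals" idea concrete, here is a minimal sketch. It uses hypothetical toy data (not the bicycle.sav values), and the function name `sum_squared_residuals` is illustrative only. It fits the least squares line, then nudges the intercept and slope to show that any nearby line has a larger sum.

```python
def sum_squared_residuals(a, b, x, y):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data, not the bicycle.sav values.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Least squares coefficients for the toy data.
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
b = ss_xy / ss_xx
a = mean_y - b * mean_x

best = sum_squared_residuals(a, b, x, y)

# Nudge the intercept or the slope in either direction and recompute the sum.
nudged = [
    sum_squared_residuals(a + da, b + db, x, y)
    for da, db in [(0.1, 0), (-0.1, 0), (0, 0.05), (0, -0.05)]
]
print(f"least squares line: {best:.3f}")
print(f"nudged lines: {[round(s, 3) for s in nudged]}")
```

Every nudged line gives a larger sum than the least squares line, which is what "best" means here.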


Formulas for Least Squares Coefficients
with Illustrative Data (p. 15.2 – 15.3)

$b = \dfrac{SS_{XY}}{SS_{XX}} = \dfrac{-4231.1333}{7855.67} = -0.539$
$a = \bar{y} - b\bar{x} = 30.8833 - (-0.539)(30.8333) = 47.49$

SPSS output:
Alternative formula for slope
$b = r \cdot \dfrac{s_Y}{s_X}$
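As an arithmetic check, the sketch below recomputes the slope and intercept from the sums of squares and means quoted on this slide. The function name `least_squares_from_sums` is illustrative only. The minus sign on SS_XY is an assumption, taken from the negative slope (–0.54) used in the rest of these notes.

```python
def least_squares_from_sums(ss_xy, ss_xx, mean_x, mean_y):
    """Return (a, b) for the line y-hat = a + b*x."""
    b = ss_xy / ss_xx          # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Values quoted on the slide (bicycle.sav, n = 12).
a, b = least_squares_from_sums(
    ss_xy=-4231.1333,  # sign assumed negative, matching the reported slope
    ss_xx=7855.67,
    mean_x=30.8333,
    mean_y=30.8833,
)
print(f"b = {b:.3f}, a = {a:.2f}")
```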
Interpretation of Slope (b)
(p. 15.3)

• b = expected change in Y per unit X
• Keep track of units!
– Y = helmet users per 100
– X = % receiving free lunch
• e.g., b of –0.54 predicts decrease of 0.54 units of Y for each unit X
[Figure: regression line annotated with Intercept, Slope, and "1 Unit X"; X axis: % Receiving Reduced Fee School Lunch]


Predicting Average Y
• ŷ = a + bx
Predicted Y = intercept + (slope)(x)
HELM = 47.49 + (–0.54)(RFM)
• What is predicted HELM when RFM = 50?
ŷ = 47.49 + (–0.54)(50) = 20.5
Average HELM predicted to be 20.5 in neighborhood
where 50% of children receive reduced or free meal
• What is average Y when x = 20?
ŷ = 47.49 +(–0.54)(20) = 36.7
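The two predictions above can be reproduced with a short sketch. The names `INTERCEPT`, `SLOPE`, and `predict_helm` are illustrative only; the coefficients are the rounded values from the fitted line.

```python
INTERCEPT = 47.49   # a, from the fitted line
SLOPE = -0.54       # b, from the fitted line

def predict_helm(rfm):
    """Predicted average HELM (%) in a neighborhood with the given RFM (%)."""
    return INTERCEPT + SLOPE * rfm

for rfm in (50, 20):
    print(f"RFM = {rfm}: predicted HELM = {predict_helm(rfm):.1f}")
```

The line describes the observed range of X (0 to 100 percent); predictions far outside the data would be extrapolation.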
Confidence Interval for Slope Parameter
(p. 15.4)
95% confidence interval for β = $b \pm (t_{n-2,\,.975})(se_b)$
where
b = point estimate for slope
$t_{n-2,\,.975}$ = 97.5th percentile (from t table or StaTable)
$se_b$ = standard error of slope estimate (formula 5):
$se_b = \dfrac{se_{Y|x}}{\sqrt{SS_{XX}}}$
where $se_{Y|x} = \sqrt{\dfrac{SS_{YY} - b(SS_{XY})}{n - 2}}$ is the standard error of regression
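A minimal sketch of the two standard error formulas, assuming the sums of squares are already computed. The function name `slope_standard_error` and the input values are hypothetical; SS_YY for bicycle.sav is not given in these notes, so toy numbers are used.

```python
import math

def slope_standard_error(ss_xx, ss_yy, ss_xy, n):
    """Return (se_y_given_x, se_b) from the sums of squares and sample size."""
    b = ss_xy / ss_xx
    # Standard error of regression: root mean squared residual on n - 2 df.
    se_y_given_x = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
    # Standard error of the slope estimate.
    se_b = se_y_given_x / math.sqrt(ss_xx)
    return se_y_given_x, se_b

# Hypothetical toy sums of squares (not from bicycle.sav).
se_y_given_x, se_b = slope_standard_error(ss_xx=10, ss_yy=10, ss_xy=8, n=5)
print(f"se_Y|x = {se_y_given_x:.4f}, se_b = {se_b:.4f}")
```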
Illustrative Example (bicycle.sav)
• 95% confidence interval for β
= –0.54 ± (t10,.975)(0.1058)
= –0.54 ± (2.23)(0.1058)
= –0.54 ± 0.24
= (–0.78, –0.30)
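The interval above can be checked with a short sketch. The function name `slope_confidence_interval` is illustrative only. The critical value 2.23 is taken from the slide rather than computed, so no statistics library is assumed.

```python
def slope_confidence_interval(b, se_b, t_crit):
    """Return (lower, upper) limits: b +/- t_crit * se_b."""
    margin = t_crit * se_b
    return b - margin, b + margin

# b, se_b, and t(10, .975) as quoted on the slide.
lower, upper = slope_confidence_interval(b=-0.54, se_b=0.1058, t_crit=2.23)
print(f"95% CI for slope: ({lower:.2f}, {upper:.2f})")
```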
Interpret 95% confidence interval
Model:
Point estimate for slope (b) = –0.54
Standard error of slope (seb) = 0.1058
95% confidence interval for β = (–0.78, –0.30)
Interpretation:
slope estimate = –0.54 ± 0.24
We are 95% confident the slope parameter falls
between –0.78 and –0.30
Significance Test
(p. 15.5)

• H0: β = 0
• tstat (formula 7) with df = n – 2
• Convert tstat to p value
$t_{stat} = \dfrac{b}{se_b} = \dfrac{-0.539}{0.1058} = -5.09$

df = 12 – 2 = 10
p = 2×area beyond tstat on t10
Use t table and StaTable
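A minimal sketch of the test statistic, using the slide's values. The function name `slope_t_statistic` is illustrative only. Instead of computing an exact p value (which needs a t table or software), the sketch compares |tstat| with the critical value 2.23 quoted in the confidence interval example; exceeding it means the two-sided p is below .05.

```python
def slope_t_statistic(b, se_b):
    """t statistic for H0: beta = 0."""
    return b / se_b

t_stat = slope_t_statistic(b=-0.539, se_b=0.1058)
df = 12 - 2

T_CRIT_10 = 2.23  # t(10, .975), as quoted in the confidence interval example
reject_h0 = abs(t_stat) > T_CRIT_10  # two-sided test at alpha = .05

print(f"t = {t_stat:.2f}, df = {df}, reject H0 at .05: {reject_h0}")
```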
Regression ANOVA
(not in Reader & NR)

• SPSS also does an analysis of variance on regression model


• Sum of squares of fitted values around grand mean = Σ(ŷi – ȳ)²
• Sum of squares of residuals around line = Σ(yi – ŷi)²
• Fstat provides same p value as tstat
• Want to learn more about relation between ANOVA and regression?
(Take regression course)
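The sketch below illustrates why Fstat and tstat give the same p value: with one explanatory variable, Fstat equals tstat squared. The data are hypothetical toy values (not bicycle.sav), and all variable names are illustrative only.

```python
import math

# Hypothetical toy data, not the bicycle.sav values.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares fit.
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = mean_y - b * mean_x
fitted = [a + b * xi for xi in x]

# The two sums of squares from the regression ANOVA.
ss_reg = sum((fi - mean_y) ** 2 for fi in fitted)          # fitted values around grand mean
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residuals around line

# F statistic on 1 and n - 2 degrees of freedom.
f_stat = (ss_reg / 1) / (ss_res / (n - 2))

# t statistic for the slope, for comparison.
se_b = math.sqrt(ss_res / (n - 2)) / math.sqrt(ss_xx)
t_stat = b / se_b

print(f"F = {f_stat:.4f}, t squared = {t_stat ** 2:.4f}")
```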
Distributional Assumptions
(p. 15.5)

• Linearity
• Independence
• Normality
• Equal variance
[Figure: regression line with the distribution of Y shown around it; axes X and Y]
Validity Assumptions
(p. 15.6)
• Data = farr1852.sav
– X = mean elevation above sea level
– Y = cholera mortality per 10,000
• Scatterplot shows negative correlation
• Correlation and regression computations reveal:
– r = –0.88
– ŷ = 129.9 + (–1.33)x
– p = .009
• Farr used these results to support miasma theory and refute contagion theory
– But data not valid (confounded by “polluted water source”)
[Figure: scatterplot; X axis: Mean Elevation of the Ground above the Highwater Mark; Y axis: Mean mortality from Cholera per 10,000]
