
Regression Basics

Predicting a DV with a Single IV



Questions
- What are predictors and criteria?
- Write an equation for the linear regression. Describe each term.
- How do changes in the slope and intercept affect (move) the regression line?
- What does it mean to test the significance of the regression sum of squares? R-square?
- What is R-square?
- What does it mean to choose a regression line to satisfy the loss function of least squares?
- How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
- Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?

Basic Ideas
Jargon
- IV = X = Predictor (pl. predictors)
- DV = Y = Criterion (pl. criteria)
- Regression of Y on X, e.g., GPA on SAT
- Linear Model = the relation between IV and DV represented by a straight line

A score on Y has 2 parts: (1) a linear function of X and (2) error.

$Y_i = \alpha + \beta X_i + \varepsilon_i$  (population values)
Basic Ideas (2)
Sample value:

$Y_i = a + bX_i + e_i$

- Intercept: the value of Y' where X = 0.
- Slope: the change in Y' when X changes by 1 unit. Rise over run.

If error is removed, we have a predicted value for each person at X (the line):

$Y' = a + bX$

Suppose on average houses are worth about $75.00 a square foot. Then the equation relating price to size would be Y' = 0 + 75X. The predicted price for a 2,000 square foot house would be $150,000.
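A minimal sketch (not part of the original slides) of the prediction rule Y' = a + bX; the function name `predict` is mine, and the default values come from the house-price example:

```python
def predict(x, a=0.0, b=75.0):
    """Predicted value Y' = a + b*X, with the house-price example's
    intercept (a = 0) and slope (b = $75 per square foot) as defaults."""
    return a + b * x

print(predict(2000))  # 150000.0: the predicted price of a 2,000 sq ft house
```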
Linear Transformation
1 to 1 mapping of variables via line
Permissible operations are addition and
multiplication (interval data)
[Figure: "Changing the Y Intercept" — three parallel lines, Y = 5 + 2X, Y = 10 + 2X, and Y = 15 + 2X, plotted for X from 0 to 10. Adding a constant shifts the line up without changing its tilt.]

[Figure: "Changing the Slope" — three lines with the same intercept, Y = 5 + .5X, Y = 5 + X, and Y = 5 + 2X, plotted for X from 0 to 10. Multiplying X by a constant changes the tilt of the line.]

$Y' = a + bX$
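A short matplotlib sketch (an illustration added here, not the slides' own figures) redraws the two panels described above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Changing the Y intercept: adding a constant shifts the line in parallel.
for a in (5, 10, 15):
    ax1.plot(x, a + 2 * x, label=f"Y = {a} + 2X")
ax1.set_title("Changing the Y Intercept")

# Changing the slope: multiplying X by a constant tilts the line.
for b in (0.5, 1, 2):
    ax2.plot(x, 5 + b * x, label=f"Y = 5 + {b}X")
ax2.set_title("Changing the Slope")

for ax in (ax1, ax2):
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.legend()
plt.show()
```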
Linear Transformation (2)
Centigrade to Fahrenheit
Note 1 to 1 map
Intercept?
Slope?
[Figure: Degrees F plotted against Degrees C — a straight line through (0, 32) and (100, 212). Note the 1 to 1 map.]

32 degrees F = 0 degrees C; 212 degrees F = 100 degrees C.
The intercept is 32: when X (Centigrade) is 0, Y (Fahrenheit) is 32.
The slope is 1.8: when Centigrade goes from 0 to 100 (run), Fahrenheit goes from 32 to 212 (rise), and 212 - 32 = 180. Then 180/100 = 1.8, rise over run, is the slope. Y = 32 + 1.8X, i.e., F = 32 + 1.8C.

$Y' = a + bX$
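The same rise-over-run arithmetic recovers the slope and intercept from any two points on a line; a minimal sketch (the function name is my own):

```python
def line_through(p1, p2):
    """Slope and intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)  # rise over run: (212 - 32) / (100 - 0)
    a = y1 - b * x1            # intercept: the value of Y when X = 0
    return a, b

print(line_through((0, 32), (100, 212)))  # (32.0, 1.8) -> F = 32 + 1.8C
```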
Review
- What are predictors and criteria?
- Write an equation for the linear regression with 1 IV. Describe each term.
- How do changes in the slope and intercept affect (move) the regression line?
Regression of Weight on Height

Ht         Wt
61         105
62         120
63         120
65         160
65         120
68         145
69         175
70         160
72         185
75         210
N = 10     N = 10
M = 67     M = 150
SD = 4.57  SD = 33.99

[Figure: "Regression of Weight on Height" — scatterplot of Weight in Lbs against Height in Inches, with the fitted line Y = -316.86 + 6.97X and rise over run marked on the line.]

Correlation (r) = .94.
Regression equation: Y = -316.86 + 6.97X

$Y' = a + bX$
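As a numerical check (a sketch added here, not the slides' own computation), fitting the same ten points with numpy reproduces the slope, intercept, and correlation:

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])

b, a = np.polyfit(ht, wt, 1)   # degree-1 fit returns slope, then intercept
r = np.corrcoef(ht, wt)[0, 1]  # Pearson correlation

print(round(a, 2), round(b, 2), round(r, 2))  # -316.86 6.97 0.94
```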
Illustration of the Linear Model. This concept is vital!

[Figure: "Regression of Weight on Height" — the scatterplot with the fitted line. The point (65, 120) is marked, along with the mean of X, the mean of Y, the deviation from the mean of X, the deviation from the mean of Y (y), the linear part (Y'), and the error part (e).]

$Y_i = \alpha + \beta X_i + \varepsilon_i$
$Y_i = a + bX_i + e_i$

Consider Y as a deviation from the mean. Part of that deviation can be associated with X (the linear part) and part cannot (the error).

$Y' = a + bX$
$e_i = Y_i - Y'_i$
Predicted Values & Residuals
N    Ht     Wt       Y'       Resid
1    61     105      108.19   -3.19
2    62     120      115.16    4.84
3    63     120      122.13   -2.13
4    65     160      136.06   23.94
5    65     120      136.06  -16.06
6    68     145      156.97  -11.97
7    69     175      163.94   11.06
8    70     160      170.91  -10.91
9    72     185      184.84    0.16
10   75     210      205.75    4.25
M    67     150      150.00    0.00
SD   4.57   33.99    31.85    11.89
V    20.89  1155.56  1014.37  141.32
[Figure: "Regression of Weight on Height" — the same annotated scatterplot: the point (65, 120), mean of X, mean of Y, the deviation from each mean, the linear part (Y'), and the error part (e). The table above gives the numbers for the linear part and the error.]

Note that the mean of Y' equals the mean of Y (150.00) and the mean of the residuals is zero. Note also that the variance of Y is V(Y') + V(res): 1155.56 = 1014.37 + 141.32, within rounding.

$Y' = a + bX$
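These checks are easy to reproduce (a sketch on the same data; variances here use N - 1, matching the slides' numbers):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, 1)

y_pred = a + b * ht  # the linear part Y'
resid = wt - y_pred  # the error part e = Y - Y'

print(round(y_pred.mean(), 2), round(resid.mean(), 2))  # 150.0 0.0
print(round(wt.var(ddof=1), 2),      # 1155.56 = V(Y)
      round(y_pred.var(ddof=1), 2),  # ~1014   = V(Y')
      round(resid.var(ddof=1), 2))   # ~141    = V(res); V(Y) = V(Y') + V(res)
```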
Finding the Regression Line
Need to know the correlation, SDs, and means of X and Y. The correlation is the slope when both X and Y are expressed as z scores. To translate to raw scores, just bring back the original SDs for both.

$r_{XY} = \frac{\sum z_X z_Y}{N}$

Slope (rise over run): $b = r_{XY}\,\frac{SD_Y}{SD_X}$

To find the intercept, use: $a = \bar{Y} - b\bar{X}$

Suppose r = .50, SD_X = .5, M_X = 10, SD_Y = 2, M_Y = 5.

Slope: $b = .50 \times \frac{2}{.5} = 2$
Intercept: $a = 5 - 2(10) = -15$
Equation: $Y' = -15 + 2X$
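The same recipe in code (a sketch; the function name is mine). It applies equally to the review exercise on a later slide:

```python
def regression_line(r, sd_x, m_x, sd_y, m_y):
    """Slope and intercept of Y' = a + bX from summary statistics."""
    b = r * sd_y / sd_x  # slope: correlation rescaled to raw-score units
    a = m_y - b * m_x    # intercept: the line passes through (Mx, My)
    return a, b

print(regression_line(.50, .5, 10, 2, 5))  # (-15.0, 2.0) -> Y' = -15 + 2X
```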
Line of Least Squares
[Figure: "Regression of Weight on Height" — the annotated scatterplot again: the point (65, 120), mean of X, mean of Y, the deviation from each mean, the linear part (Y'), and the error part (e).]

We have some points. Assume a linear relation is reasonable, so the 2 variables can be represented by a line. Where should the line go?

Place the line so the errors (residuals) are small. The line we calculate has a sum of errors equal to 0, and a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.
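A quick demonstration of the least-squares property (a sketch added here): perturbing the fitted intercept or slope can only increase the sum of squared errors.

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, 1)

def sse(a, b):
    """Sum of squared errors for the line Y' = a + b*X."""
    return float(((wt - (a + b * ht)) ** 2).sum())

best = sse(a, b)
print(round(best, 1))          # ~1271.8, the least-squares minimum
print(sse(a + 5, b) > best)    # True: shifting the intercept does worse
print(sse(a, b + 0.5) > best)  # True: tilting the slope does worse
```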
Least Squares (2)
Review
- What does it mean to choose a regression line to satisfy the loss function of least squares?
- What are predicted values and residuals?
- Suppose r = .25, SD_X = 1, M_X = 10, SD_Y = 2, M_Y = 5. What is the regression equation (line)?
Partitioning the Sum of Squares

Definitions:

$Y = a + bX + e$
$Y' = a + bX$
$Y = Y' + e$, so $e = Y - Y'$

$(Y - \bar{Y}) = (Y' - \bar{Y}) + (Y - Y')$, where $(Y - \bar{Y}) = y$ is the deviation from the mean.

Sum of squares: $\sum (Y - \bar{Y})^2 = \sum [(Y' - \bar{Y}) + (Y - Y')]^2$

$\sum y^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$  (the cross products drop out)

In words: the sum of squared deviations from the mean = the sum of squares due to regression + the sum of squared residuals (error).

Analog: $SS_{tot} = SS_B + SS_W$
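The partition can be confirmed numerically on the height/weight data (a sketch; the slides' 9129.31 and 1271.91 differ slightly because they sum rounded predicted values):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, 1)
y_pred = a + b * ht

ss_tot = ((wt - wt.mean()) ** 2).sum()      # sum of y**2
ss_reg = ((y_pred - wt.mean()) ** 2).sum()  # SS due to regression
ss_res = ((wt - y_pred) ** 2).sum()         # SS of residuals

print(round(ss_tot, 1))           # 10400.0
print(round(ss_reg + ss_res, 1))  # 10400.0: the two parts add up exactly
```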
Partitioning SS (2)
$SS_Y = SS_{Reg} + SS_{Res}$

Total SS is regression SS plus residual SS. Can also get proportions of each. Can get variance by dividing SS by N if you want. Proportion of total SS due to regression = proportion of total variance due to regression = $R^2$ (R-square).

$\frac{SS_Y}{SS_Y} = \frac{SS_{Reg}}{SS_Y} + \frac{SS_{Res}}{SS_Y}$

$1 = R^2 + (1 - R^2)$
Partitioning SS (3)

Wt (Y)    (Y-Ȳ)²    Y'        Y'-Ȳ      (Y'-Ȳ)²    Resid (Y-Y')   Resid²
105       2025      108.19    -41.81    1748.076   -3.19          10.1761
120       900       115.16    -34.84    1213.826    4.84          23.4256
120       900       122.13    -27.87    776.7369   -2.13          4.5369
160       100       136.06    -13.94    194.3236   23.94          573.1236
120       900       136.06    -13.94    194.3236  -16.06          257.9236
145       25        156.97      6.97    48.5809   -11.97          143.2809
175       625       163.94     13.94    194.3236   11.06          122.3236
160       100       170.91     20.91    437.2281  -10.91          119.0281
185       1225      184.84     34.84    1213.826    0.16          0.0256
210       3600      205.75     55.75    3108.063    4.25          18.0625
Sum       1500      10400     1500.01   0.01       9129.307  -0.01  1271.907
Variance            1155.56                        1014.37          141.32

(Ȳ = M = 150. The variance row divides each SS by N - 1 = 9.)
Partitioning SS (4)

          Total     Regress   Residual
SS        10400     9129.31   1271.91
Variance  1155.56   1014.37   141.32

Proportion of SS: $\frac{10400}{10400} = \frac{9129.31}{10400} + \frac{1271.91}{10400}$, i.e., $1 = .88 + .12$

Proportion of Variance: $\frac{1155.56}{1155.56} = \frac{1014.37}{1155.56} + \frac{141.32}{1155.56}$, i.e., $1 = .88 + .12$

$R^2 = .88$

Note Y' is a linear function of X, so $r_{Y'X} = 1$ and $r_{YY'} = r_{XY} = .94$.

$R^2 = r_{YY'}^2 = .88$; $r_{YE} = .35$ and $r_{YE}^2 = .12$; $r_{Y'E} = 0$.
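These identities can be checked directly (a sketch on the same data):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, 1)
y_pred = a + b * ht

r2 = ((y_pred - wt.mean()) ** 2).sum() / ((wt - wt.mean()) ** 2).sum()
print(round(r2, 2))                                  # 0.88 = R-square
print(round(np.corrcoef(wt, y_pred)[0, 1], 2))       # 0.94 = r_YY' (= r_XY)
print(abs(np.corrcoef(y_pred, wt - y_pred)[0, 1]) < 1e-8)  # True: r_Y'E = 0
```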
Significance Testing

Testing for the SS due to regression = testing for the variance due to regression = testing the significance of $R^2$. All are the same.

$H_0\colon R^2_{population} = 0$

$F = \frac{SS_{reg}/df_{reg}}{SS_{res}/df_{res}} = \frac{SS_{reg}/k}{SS_{res}/(N - k - 1)}$

where k = the number of IVs (here it's 1) and N is the sample size (number of people). F has k and (N - k - 1) df.

$F = \frac{9129.31/1}{1271.91/(10 - 1 - 1)} = 57.42$

An equivalent test uses R-square instead of SS:

$F = \frac{R^2/k}{(1 - R^2)/(N - k - 1)} = \frac{.88/1}{(1 - .88)/(10 - 1 - 1)} = 58.67$

Results will be the same within rounding error.
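Both versions of the test in code (a sketch; the p-value line is an extra, using scipy's standard F survival function):

```python
from scipy import stats

ss_reg, ss_res = 9129.31, 1271.91
k, n = 1, 10
df_res = n - k - 1  # N - k - 1 = 8

f_ss = (ss_reg / k) / (ss_res / df_res)  # F from sums of squares
r2 = ss_reg / (ss_reg + ss_res)
f_r2 = (r2 / k) / ((1 - r2) / df_res)    # F from R-square

print(round(f_ss, 2), round(f_r2, 2))    # 57.42 57.42 (identical when R-square
                                         # is not rounded; rounding to .88 gives 58.67)
print(stats.f.sf(f_ss, k, df_res))       # tiny p-value: reject H0
```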
Review
- What does it mean to test the significance of the regression sum of squares? R-square?
- What is R-square?
- Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?
