Correlation and Regression
© A. Ardalan
a.ardalan07@gmail.com
1 Correlation
2 Linear Regression
Simple Linear Regression
Multiple Linear Regression
3 Chi-square test
Correlation Coefficient
Figure: six scatter plots of y against x, illustrating r = +1 (perfect positive correlation); r close to 1 (relatively strong positive correlation); medium positive correlation; random points with no correlation; r close to −1 (relatively strong negative correlation); r = −1 (perfect negative correlation).
Solution
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
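The computational formula for r can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the demo data are assumptions made here, not values from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from the raw sums in the formula above."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical demo data (not the data set used in the slides).
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]
r_demo = pearson_r(x_demo, y_demo)
print(round(r_demo, 4))
```

The function raises `ZeroDivisionError` if either variable is constant, since r is undefined in that case.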
Figure: scatter plot of Sales Volume ($100s) against Number of Commercials.
Decision rule
H0 : ρ = 0 (No significant correlation)
H1 : ρ ≠ 0 (Significant correlation)
Test statistic
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
Reject H0 if $t > t_{n-2,\alpha/2}$ or $t < -t_{n-2,\alpha/2}$
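Critical values: ±2.30 (reject H0 if t < −2.30 or t > 2.30)
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9305}{\sqrt{\dfrac{1 - (0.9305)^2}{10 - 2}}} = 7.18 $$
Since 7.18 > 2.30, the test statistic falls in the upper rejection region and H0 is rejected.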
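The calculation above can be checked with a short Python sketch. The values r = 0.9305 and n = 10 come from the example; the function name is an assumption, and the critical value 2.306 is the tabulated t value for 8 degrees of freedom at α/2 = 0.025, entered by hand rather than computed.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t_corr = correlation_t_statistic(0.9305, 10)
critical_value = 2.306          # tabulated t value, 8 df, alpha/2 = 0.025
reject_h0 = abs(t_corr) > critical_value
print(round(t_corr, 2), reject_h0)
```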
Linear regression
$\hat{y}_i = b_0 + b_1 x_i$
The individual random error terms ei have a mean of zero
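$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} $$
$$ b_0 = \bar{y} - b_1 \bar{x} $$

House   y (price, $1000s)   x (size, sq ft)   xy        x²
...
8       324                 2450              793800    6002500
9       319                 1425              454575    2030625
10      255                 1700              433500    2890000
Sum     2865                17150             5085975   30983750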
With n = 10, Σx = 17150, Σy = 2865, Σxy = 5085975 and Σx² = 30983750:
$$ b_1 = \frac{5085975 - \dfrac{2865 \times 17150}{10}}{30983750 - \dfrac{(17150)^2}{10}} = 0.11 $$
$$ b_0 = \frac{2865}{10} - 0.11 \times \frac{17150}{10} = 98.25 $$
(b0 is computed with the unrounded slope, 0.10977.)
b1 = 0.11 tells us that the value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size.
No houses had 0 square feet, so b0 = 98.25 just indicates that, for houses within the range of sizes observed, 98.25 ($1000s) is the portion of the house price not explained by square feet.
House price = 98.25 + 0.11(Size)
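The estimates can be reproduced from the column sums alone. The sketch below uses the sums given in the example (n = 10, Σx = 17150, Σy = 2865, Σxy = 5085975, Σx² = 30983750); the function name is an assumption made for illustration.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (b0, b1) for simple linear regression from the raw sums."""
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Sums from the house price example (price in $1000s, size in square feet).
b0, b1 = least_squares_from_sums(n=10, sum_x=17150, sum_y=2865,
                                 sum_xy=5085975, sum_x2=30983750)
print(round(b0, 2), round(b1, 5))
```

Keeping b1 unrounded when computing b0 matters here: using the rounded 0.11 would shift the intercept noticeably.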
Test statistic
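$$ t = \frac{b_1 - \beta_1}{s_{b_1}} $$
Reject H0 if $t > t_{n-2,\alpha/2}$ or $t < -t_{n-2,\alpha/2}$
$$ s_{b_1} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{(n - 2)\sum (x_i - \bar{x})^2}} $$
where $s_{b_1}$ is the estimator of the standard error of the slope.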
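As a rough sketch, the standard error of the slope and the t statistic can be computed from raw data as follows. The function name and the demo data are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the house price data.

```python
import math

def slope_t_test(x, y, beta1_null=0.0):
    """Return (b1, s_b1, t) for the slope of a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals, sum of (y_i - y_hat_i)^2.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_b1 = math.sqrt(sse / ((n - 2) * sxx))
    t = (b1 - beta1_null) / s_b1
    return b1, s_b1, t

# Hypothetical demo data.
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]
slope, se_slope, t_slope = slope_t_test(x_demo, y_demo)
print(round(slope, 3), round(se_slope, 4), round(t_slope, 4))
```

The resulting t is compared with the tabulated critical value for n − 2 degrees of freedom; a perfect fit (zero residuals) makes the standard error zero and the statistic undefined.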
Example
Consider the estimated regression equation of house price on size.
α = 0.05
Critical values: $t_{0.025,8} = \pm 2.3060$ (rejection regions of area α/2 = 0.025 in each tail)
Standard error of b1: $s_{b_1} = 0.033$
$$ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{0.11 - 0}{0.033} = 3.33 $$
Two-tailed p-value: P(t ≤ −3.33) + P(t ≥ 3.33) = 2(0.00515) = 0.01039 < α
Since 3.33 > 2.3060 (equivalently, p-value < α), H0 : β1 = 0 is rejected.
Figure: plot of Balance against Income and Limit.
$$ y = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}_{\text{Linear component}} + \underbrace{\varepsilon}_{\text{Random error component}} $$
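$$ Y = X\beta + \varepsilon $$
$$ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$
$$ \hat{\beta} = (X'X)^{-1} X'Y $$
Each $\hat{\beta}_j$ has a normal distribution with mean $\beta_j$ and variance $\sigma^2 v_{jj}$, where $v_{jj}$ is the jth diagonal entry of the matrix $V = (X'X)^{-1}$ (rows and columns indexed j = 0, 1, ..., k).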
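The least squares estimator can be illustrated with a small, dependency-free Python sketch. It solves the normal equations (X'X)β̂ = X'Y by Gauss-Jordan elimination, which gives the same result as forming the inverse explicitly. All function names and the demo data are assumptions made here; in practice a linear algebra library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (collinear predictors?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def ols(predictors, y):
    """Return [b0, b1, ..., bk] for rows of predictors (x_i1, ..., x_ik)."""
    X = [[1.0] + list(row) for row in predictors]   # prepend the column of ones
    columns = list(zip(*X))                          # columns of X = rows of X'
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * yi for u, yi in zip(ci, y)) for ci in columns]
    return solve_linear_system(xtx, xty)

# Hypothetical demo data generated exactly from y = 1 + 2*x1 + 3*x2.
predictors_demo = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y_demo = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors_demo]
beta_hat = ols(predictors_demo, y_demo)
print([round(b, 6) for b in beta_hat])
```

Because the demo responses contain no error term, the estimates recover the generating coefficients up to rounding.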
© A. Ardalan Associations March 27, 2022 19 / 27
Test statistic
$$ F = \frac{SSR/k}{SSE/(n - k - 1)} $$
Reject H0 if $F \ge F_{k,n-k-1,\alpha}$
where $SSR = \sum (\hat{y}_i - \bar{y})^2$ (sum of squares due to regression) and $SSE = \sum (y_i - \hat{y}_i)^2$ (sum of squares due to error).
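A minimal sketch of the F statistic, assuming the fitted values are already available. SSR is computed from the fitted values about the mean of y, and SSE from the residuals. The function name and the demo numbers are hypothetical; the fitted values shown correspond to the least squares line 2.2 + 0.6x for x = 1, ..., 5.

```python
def f_statistic(y, y_hat, k):
    """F = (SSR / k) / (SSE / (n - k - 1)) for a model with k predictors."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((fit - y_bar) ** 2 for fit in y_hat)            # due to regression
    sse = sum((yi - fit) ** 2 for yi, fit in zip(y, y_hat))   # due to error
    return (ssr / k) / (sse / (n - k - 1))

# Hypothetical demo: observed responses and fitted values from one predictor.
y_obs = [2, 4, 5, 4, 5]
y_fit = [2.8, 3.4, 4.0, 4.6, 5.2]
f_demo = f_statistic(y_obs, y_fit, k=1)
print(round(f_demo, 4))
```

With a single predictor, F equals the square of the slope t statistic, which gives a quick consistency check.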
Test statistic
$$ t = \frac{b_1 - \beta_1}{s_{b_1}} $$
Reject H0 if $t > t_{n-k-1,\alpha/2}$ or $t < -t_{n-k-1,\alpha/2}$
Example
Figure: SPSS output for a trucking company with two independent variables, miles traveled (x1) and number of deliveries (x2).
F test
Decision: Based on the p-value = 0.003 ≤ α in the last column of the ANOVA table, we can reject H0 : β1 = β2 = 0.
Conclusion: There is a significant relationship between the travel time and the independent variables (miles traveled and number of deliveries).
t test
Decision: Based on the p-values of .001 and .004 in the SPSS output for the coefficients, we can reject H0 : β1 = 0 and H0 : β2 = 0.
Conclusion: There is sufficient evidence that miles traveled and number of deliveries each have a significant effect on travel time.
Chi-square test
A Chi-Square Test of Independence is used to determine whether or
not there is a significant association between two categorical variables
whose joint counts are summarized in a contingency table.
Decision rule
H0 : The two variables are independent
H1 : The two variables are not independent
Test statistic
$$ \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$
Reject H0 if $\chi^2 > \chi^2_{(R-1)(C-1),\alpha}$ or p-value ≤ α
$o_{ij}$ is the observed cell count in the ith row and jth column of the table
$$ e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{col } j \text{ total})}{\text{grand total}} $$
is the expected cell count in the ith row and jth column of the table under H0
Example
Suppose that you are looking at the results of a survey into the source of
finance for buying a house, and find the following pattern of loan size and
source of loan – which is a contingency table. You might ask whether
there is any relationship or association between the size of a loan and the
source.
α = 0.05, df = (3 − 1)(3 − 1) = 4
Critical value: $\chi^2_{0.05,4} = 9.488$
Test statistic: $\chi^2 = 24.165$
Since 24.165 > 9.488, H0 is rejected: there is a significant association between the size of a loan and its source.

Cell    O     E        O − E     (O − E)²   (O − E)²/E
1       30    40.625   -10.625   112.891    2.779
2       55    56.250   -1.250    1.563      0.028
3       40    28.125   11.875    141.016    5.014
4       23    17.875   5.125     26.266     1.469
5       29    24.750   4.250     18.063     0.730
6       3     12.375   -9.375    87.891     7.102
7       12    6.500    5.500     30.250     4.654
8       6     9.000    -3.000    9.000      1.000
9       2     4.500    -2.500    6.250      1.389
Total   200   200                           24.165
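The test statistic can be recomputed from the observed counts with a short Python sketch. The arrangement of the nine observed counts into a 3 × 3 table (cells 1 to 3 in the first row, and so on) is an assumption, although it does reproduce the expected counts listed above; the function name is also hypothetical.

```python
def chi_square_independence(observed):
    """Return (chi2, df, expected) for a contingency table given as rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count under H0: (row total * column total) / grand total.
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Observed counts from the example, assumed to be laid out row by row.
observed = [[30, 55, 40],
            [23, 29, 3],
            [12, 6, 2]]
chi2, df, expected = chi_square_independence(observed)
print(round(chi2, 3), df)
```

The statistic is then compared with the tabulated chi-square critical value for the returned degrees of freedom.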