Correlation and Regression

Associations
© A. Ardalan
March 27, 2022
a.ardalan07@gmail.com
© A. Ardalan Associations March 27, 2022 1 / 27

Contents
Outline of this document
1 Correlation
2 Linear Regression
Simple Linear Regression
Multiple Linear Regression
3 Chi-square test

Correlation
Correlation Coefficient
The population correlation coefficient ρ (rho) measures the strength of the linear association between two variables
The sample correlation coefficient r is an estimate of ρ and is used to
measure the strength of the linear relationship in the sample
observations
Features of ρ and r
- Unit free
- Range between −1 and 1
- The closer to −1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship

Figure: six scatterplots of y against x illustrating r = +1 (perfect positive correlation), r close to 1 (relatively strong positive correlation), medium positive correlation, random points with no correlation, r close to −1 (relatively strong negative correlation), and r = −1 (perfect negative correlation).

Correlation coefficient formula

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
Example: The store's manager wants to determine the relationship between the number of weekend television commercials shown and the sales at the store during the following week.

Week   Commercials (x)   Sales volume in $100s (y)   x·y    x²    y²
1      2                 50                          100    4     2500
2      5                 57                          285    25    3249
3      1                 41                          41     1     1681
4      3                 54                          162    9     2916
5      4                 54                          216    16    2916
6      1                 38                          38     1     1444
7      5                 63                          315    25    3969
8      3                 48                          144    9     2304
9      4                 59                          236    16    3481
10     2                 46                          92     4     2116
Sum    30                510                         1629   110   26576

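Solution

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{10(1629) - (30)(510)}{\sqrt{\left[10(110) - (30)^2\right]\left[10(26576) - (510)^2\right]}} = \frac{990}{\sqrt{(200)(5660)}} = 0.9305 $$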
Figure: scatterplot of sales volume ($100s) against number of commercials.
r = 0.9305 indicates a relatively strong positive linear association between x and y.
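
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` and the variable names are illustrative choices, not part of the original slides.

```python
import math

# Data from the commercials/sales example above.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]       # x
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]    # y, in $100s


def pearson_r(x, y):
    """Sample correlation coefficient via the sums formula used on the slide."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(commercials, sales)
print(f"r = {r:.4f}")  # prints "r = 0.9305"
```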

Significance Test for Correlation
Decision rule
H0: ρ = 0 (No significant correlation)
H1: ρ ≠ 0 (Significant correlation)
Test statistic
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
Reject H0 if t > t_{n−2,α/2} or t < −t_{n−2,α/2}
Decision based on p-value: Reject H0 if p-value < α

Example: For the relationship between the number of weekend television commercials shown and the sales, we have:
α = 0.05
Critical values: ±t_{0.025,8} = ±2.3060
Test statistic:
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9305}{\sqrt{\dfrac{1 - (0.9305)^2}{10 - 2}}} = 7.18 $$
Figure: t distribution with rejection regions of area α/2 = 0.025 beyond ±2.30; the observed values ±7.18 each cut off a tail area of 0.00004690.
Two-tailed p-value: P(t ≤ −7.18) + P(t ≥ 7.18) = 2(0.00004690) = 0.00009381 < α
Decision: There is evidence to reject H0 (the t statistic is in the rejection region)
Conclusion: There is a significant correlation between the variables.
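
As a quick check of the test statistic, the following sketch recomputes t from r and n and compares it with the critical value. The critical value 2.3060 is copied from the example rather than computed, and the function name is a hypothetical helper introduced only for illustration.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r, n = 0.9305, 10
t_critical = 2.3060  # t_{0.025, 8}, taken from the example, not computed here

t_stat = correlation_t_statistic(r, n)
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The two-tailed p-value needs the t distribution's CDF, which the standard library does not provide, so it is left out of this sketch.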
Interpretation:
Linear Regression
Linear regression
Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the effect of changes in an independent variable on the dependent variable
Outcome (y): the variable we wish to explain
Explanatory variable (x): the variable used to explain the outcome

Simple Linear Regression
y = β0 + β1 x + ε
- Only one independent variable, x
- Relationship between x and y is described by a linear function
- Changes in y are assumed to be caused by changes in x
Figure: regression line with intercept β0 and slope β1, showing an observed value yi, its fitted value ŷi, and the error εi at xi.
The population regression model
y = β0 + β1 x + ε
where β0 is the intercept, β1 is the slope, β0 + β1 x is the linear component, and ε is the random error component.

Linear Regression Assumptions

- Error values (ε) are statistically independent
- Error values are normally distributed
- The probability distribution of the errors has constant variance
- The underlying relationship between the x variable and the y variable is linear
Estimated Regression Model
- The sample regression line provides an estimate of the population regression line
ŷi = b0 + b1 xi
The individual random error terms ei have a mean of zero

Estimation of Intercept and Slope

- b0 and b1 are obtained by finding the values that minimize the sum of the squared residuals (Least Squares Criterion)
$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2 $$
- The formulas for b0 and b1 are
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$
Interpretation of the Slope and the Intercept
- b0 is the estimated average value of y when the value of x is zero
- b1 is the estimated change in the average value of y as a result of a one-unit change in x

Possible regression lines in simple linear regression

Figure: three panels of y against x, each showing a regression line and its intercept β0: positive linear relationship (slope β1 is positive), negative linear relationship (slope β1 is negative), and no relationship (slope β1 is zero).

Example: A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). A random sample of 10 houses is selected.
- Outcome (y) = house price in $1000s
- Explanatory variable (x) = square feet

House   Price in $1000s (y)   Square feet (x)   x·y       x²
1       245                   1400              343000    1960000
2       312                   1600              499200    2560000
3       279                   1700              474300    2890000
4       308                   1875              577500    3515625
5       199                   1100              218900    1210000
6       219                   1550              339450    2402500
7       405                   2350              951750    5522500
8       324                   2450              793800    6002500
9       319                   1425              454575    2030625
10      255                   1700              433500    2890000
Sum     2865                  17150             5085975   30983750

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{5085975 - \dfrac{2865 \times 17150}{10}}{30983750 - \dfrac{(17150)^2}{10}} = 0.109768 \approx 0.11 $$
$$ b_0 = \bar{y} - b_1 \bar{x} = \frac{2865}{10} - 0.109768 \times \frac{17150}{10} = 98.25 $$

b1 = 0.11 tells us that the average value of a house increases by 0.109768($1000) = $109.77, on average, for each additional square foot of size.
No houses had 0 square feet, so b0 = 98.25 just indicates that, for houses within the range of sizes observed, $98,250 is the portion of the house price not explained by square feet.
House price = 98.25 + 0.11(Size)
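
The slope and intercept in this example can be checked with a short script. This is a sketch under the same sums formulas given earlier; `fit_simple_regression` is an illustrative name, not something defined in the slides.

```python
# House data from the example: price in $1000s (y) and size in square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def fit_simple_regression(x, y):
    """Least squares intercept b0 and slope b1 from the sums formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = fit_simple_regression(sqft, price)
print(f"b0 = {b0:.2f}, b1 = {b1:.5f}")

# Estimated price (in $1000s) of a 2000 square foot house,
# which lies inside the observed range of sizes.
predicted_2000 = b0 + b1 * 2000
print(f"predicted price for 2000 sq ft: {predicted_2000:.2f}")
```

Note that the intercept is computed with the unrounded slope; using the rounded value 0.11 would give a slightly different b0.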

Significance Test for Slope

Decision rule
H0: β1 = 0 (No significant linear relationship)
H1: β1 ≠ 0 (Significant linear relationship)
Test statistic
$$ t = \frac{b_1 - \beta_1}{s_{b_1}} $$
Reject H0 if t > t_{n−2,α/2} or t < −t_{n−2,α/2}
Estimator of the standard error of the slope
$$ s_{b_1} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{(n - 2)\sum (x_i - \bar{x})^2}} $$
Decision based on p-value: Reject H0 if p-value < α

Example
Consider the estimated regression equation of house price and size.
α = 0.05
Critical values: ±t_{0.025,8} = ±2.3060
Standard error of b1: s_{b1} = 0.033
Test statistic:
$$ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{0.11 - 0}{0.033} = 3.33 $$
Figure: t distribution with rejection regions of area α/2 = 0.025 beyond ±2.30; the observed values ±3.33 each cut off a tail area of 0.00515.
Two-tailed p-value: P(t ≤ −3.33) + P(t ≥ 3.33) = 2(0.00515) = 0.01039 < α
Decision: There is evidence to reject H0 (the t statistic is in the rejection region)
Conclusion: There is sufficient evidence that square footage affects house price.
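
The standard error of the slope and the resulting t statistic can be recomputed from the house data. The sketch below is self-contained and uses illustrative variable names; the critical value is copied from the example, not derived.

```python
import math

# House data: price in $1000s (y) and size in square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# Least squares fit written in deviation form (equivalent to the sums formulas).
sxx = sum((x - x_bar) ** 2 for x in sqft)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sum of squared residuals, then the standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
s_b1 = math.sqrt(sse / ((n - 2) * sxx))

t_stat = (b1 - 0) / s_b1      # test of H0: beta1 = 0
t_critical = 2.3060           # t_{0.025, 8}, taken from the example
reject_h0 = abs(t_stat) > t_critical

print(f"s_b1 = {s_b1:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```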
Interpretation:
Multiple Linear Regression

Figure: scatterplot matrix of the variables Income, Limit, Rating, and Balance.

Examine the linear relationship between one dependent variable (y) and two or more independent variables (xi)
Multiple regression model with k independent variables:
y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε
where β0 + β1 x1 + ... + βk xk is the linear component and ε is the random error component
Estimation of multiple regression model

ŷ = b0 + b1 x1 + b2 x2 + ... + bk xk
which is the fitted equation that minimizes the sum of the squared residuals (Least Squares Criterion)
- b0 is the intercept: the estimated average value of y when all xi are zero.
- Each bi represents the difference in the value of ŷ for each one-unit difference in xi if the other independent variables remain constant.
Estimation of regression coefficient

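The multiple regression model can be represented in a compact form using matrix notation as below
$$ Y = X\beta + \varepsilon $$
$$ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} $$
The least squares estimate of the coefficients is
$$ \hat{\beta} = (X'X)^{-1} X'Y $$
Each β̂j has a normal distribution with mean βj and variance σ² vjj, where vjj is the diagonal entry of the matrix V = (X'X)^{-1} corresponding to βj (rows and columns indexed from 0)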
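
To make the matrix formula concrete, here is a minimal pure-Python sketch. Instead of forming the inverse explicitly, it solves the equivalent normal equations (X'X) b = X'Y by Gauss-Jordan elimination. The data are made up for illustration (they satisfy y = 1 + 2 x1 + 3 x2 exactly), and all function names are hypothetical helpers.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def least_squares(x_rows, y):
    """Coefficients [b0, b1, ..., bk] solving (X'X) b = X'Y."""
    X = [[1.0] + list(row) for row in x_rows]   # prepend the column of ones
    Xt = transpose(X)
    XtX = [[sum(p * q for p, q in zip(r, c)) for c in Xt] for r in Xt]
    XtY = [sum(p * q for p, q in zip(r, y)) for r in Xt]
    return solve(XtX, XtY)


# Made-up data generated from y = 1 + 2*x1 + 3*x2 with no error term.
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [9, 8, 19, 18, 29]

coefficients = least_squares(x_rows, y)
print([round(c, 6) for c in coefficients])
```

Solving the linear system directly is a common choice over computing the inverse because it is cheaper and numerically steadier; the result is the same estimate.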
Significance tests for slopes

F test: used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables; we will refer to the F test as the test for overall significance
- Decision rule
H0: β1 = β2 = ··· = βk = 0
H1: One or more of the coefficients is not equal to zero
- Test statistic
$$ F = \frac{SSR / k}{SSE / (n - k - 1)} $$
Reject H0 if F ≥ F_{k,n−k−1,α}
where SSR = Σ(ŷi − ȳ)² (sum of squares due to regression) and SSE = Σ(yi − ŷi)² (sum of squares due to error)
Decision based on p-value: Reject H0 if p-value < α
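
The F statistic can be sketched in code as follows, taking SSR as the sum of squared deviations of the fitted values from ȳ and SSE as the sum of squared residuals. For illustration only, the sketch reuses the house price data from the simple regression example, so k = 1; the function name `f_statistic` is an assumption of this sketch.

```python
def f_statistic(y, y_hat, k):
    """Overall F statistic for a fitted model with k independent variables."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((fit - y_bar) ** 2 for fit in y_hat)            # regression
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # error
    return (ssr / k) / (sse / (n - k - 1))


# House data: price in $1000s (y) and size in square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in sqft]

f_stat = f_statistic(price, fitted, k=1)
print(f"F = {f_stat:.2f}")
```

With a single independent variable, F equals the square of the slope's t statistic, which gives a convenient cross-check against the t test.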

T test: if the F test shows overall significance, the t test is used to determine whether each of the individual independent variables is significant (individual significance).
- Decision rule
H0: βi = 0
H1: βi ≠ 0
- Test statistic
$$ t = \frac{b_i - \beta_i}{s_{b_i}} $$
Reject H0 if t > t_{n−k−1,α/2} or t < −t_{n−k−1,α/2}
Decision based on p-value: Reject H0 if p-value < α

Example
In a trucking company, to develop better work schedules, the managers want to estimate the total daily travel time for their drivers based on miles traveled and number of deliveries. A sample of 10 driving assignments provided the data shown in the table below.

Driving assignment   Travel time in hours (y)   Miles traveled (x1)   Number of deliveries (x2)
1                    9.3                        100                   4
2                    4.8                        50                    3
3                    8.9                        100                   4
4                    6.5                        100                   2
5                    4.2                        50                    2
6                    6.2                        80                    2
7                    7.4                        75                    3
8                    6.0                        65                    4
9                    7.6                        90                    3
10                   6.1                        90                    2

Figure: SPSS output for trucking company with two independent variables miles
traveled (x1) and number of deliveries (x2) .

F test
- Decision: Based on the p-value = 0.003 ≤ α in the last column of the ANOVA table, we can reject H0: β1 = β2 = 0.
- Conclusion: There is a significant relationship between the travel time and the independent variables (miles traveled and number of deliveries).
T test
- Decision: Based on the p-values of .001 and .004 in the SPSS output for the coefficients, we can reject H0: β1 = 0 and H0: β2 = 0.
- Conclusion: There is sufficient evidence that miles traveled and number of deliveries have a significant effect on travel time.
The regression equation
Time = −1.008 + 0.071 Miles + 0.749 Deliveries

Because b1 = .071, an estimate of the expected increase in travel time corresponding to an increase of one mile in the distance traveled, when the number of deliveries is held constant, is .071 hours. Similarly, because b2 = .749, an estimate of the expected increase in travel time corresponding to an increase of one delivery, when the number of miles traveled is held constant, is .749 hours.
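
The interpretation of the coefficients can be illustrated by evaluating the reported regression equation. In this sketch the coefficients are copied from the equation above, and the inputs (100 miles, 2 or 3 deliveries) are arbitrary values inside the observed data range, chosen only for illustration.

```python
def predict_travel_time(miles, deliveries):
    """Estimated travel time in hours from the reported regression equation."""
    return -1.008 + 0.071 * miles + 0.749 * deliveries


# Estimated time for a 100-mile assignment with 2 deliveries.
example_time = predict_travel_time(miles=100, deliveries=2)
print(f"{example_time:.2f} hours")

# Holding miles fixed, one more delivery adds b2 = 0.749 hours.
extra_delivery = predict_travel_time(100, 3) - predict_travel_time(100, 2)
print(f"{extra_delivery:.3f} hours per additional delivery")
```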

Chi-square test
Chi-square test
A Chi-Square Test of Independence is used to determine whether or not there is a significant association between two categorical variables that are described in a contingency table.
Decision rule
H0: The two variables are independent
H1: The two variables are not independent
Test statistic
$$ \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$
Reject H0 if χ² > χ²_{(R−1)(C−1),α} or p-value ≤ α
oij is the observed cell count in the ith row and jth column of the table
$$ e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}} $$
is the expected cell count in the ith row and jth column of the table under H0

Example
Suppose that you are looking at the results of a survey into the source of finance for buying a house, and find the following pattern of loan size and source of loan, presented as a contingency table. You might ask whether there is any relationship or association between the size of a loan and the source.

                     Size of loan
Source of mortgage   Less than $80,000   $80,000 to $150,000   More than $150,000   Total
Building society     30                  55                    40                   125
Bank                 23                  29                    3                    55
Elsewhere            12                  6                     2                    20
Total                65                  90                    45                   200

Example: For the relationship between the size of a loan and its source, we have:
α = 0.05, df = (3 − 1)(3 − 1) = 4
Critical value: χ²_{0.05,4} = 9.488

Cells are numbered row by row through the contingency table.

Cell    O     E        O − E     (O − E)²   (O − E)²/E
1       30    40.625   −10.625   112.891    2.779
2       55    56.250   −1.250    1.563      0.028
3       40    28.125   11.875    141.016    5.014
4       23    17.875   5.125     26.266     1.469
5       29    24.750   4.250     18.063     0.730
6       3     12.375   −9.375    87.891     7.102
7       12    6.500    5.500     30.250     4.654
8       6     9.000    −3.000    9.000      1.000
9       2     4.500    −2.500    6.250      1.389
Total   200   200                           24.165

Test statistic: χ² = 24.165
P-value: P(χ² ≥ 24.165) = 0.00007402 < α
Decision: There is evidence to reject H0 (the χ² statistic is in the rejection region)
Conclusion: The evidence supports the view that there is an association, and the size of a mortgage is related to its source.
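
The expected counts and the test statistic for this example can be reproduced with a short script. This is a sketch using only the standard library: the function name is illustrative, and the p-value line relies on the closed-form upper tail of the chi-square distribution that holds specifically for 4 degrees of freedom, P(χ² ≥ x) = e^(−x/2)(1 + x/2), not on a general routine.

```python
import math

# Observed counts: rows are sources (building society, bank, elsewhere),
# columns are loan sizes (< $80,000, $80,000 to $150,000, > $150,000).
observed = [
    [30, 55, 40],
    [23, 29, 3],
    [12, 6, 2],
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


chi2, df, expected = chi_square_independence(observed)

# Upper-tail probability; this closed form is valid only when df == 4.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.8f}")
```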
Interpretation:
