
Associations

© A. Ardalan

March 27, 2022

a.ardalan07@gmail.com

© A. Ardalan Associations March 27, 2022 1 / 27


Contents

Outline of this document

1 Correlation

2 Linear Regression
Simple Linear Regression
Multiple Linear Regression

3 Chi-square test



Correlation

Correlation Coefficient

The population correlation coefficient ρ (rho) measures the strength of the linear association between the variables.
The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.
Features of ρ and r:
- Unit free
- Range between −1 and 1
- The closer to −1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship


[Figure: six scatter plots of y against x. Panels: r = +1, perfect positive correlation; r close to 1, relatively strong positive correlation; medium positive correlation; random points with no correlation; r close to −1, relatively strong negative correlation; r = −1, perfect negative correlation.]


Correlation coefficient formula

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

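Example: The store's manager wants to determine the relationship between the number of weekend television commercials shown and the sales at the store during the following week.

| Week | Number of Commercials (x) | Sales Volume ($100s) (y) | x·y | x² | y² |
|------|---------------------------|--------------------------|------|-----|-------|
| 1 | 2 | 50 | 100 | 4 | 2500 |
| 2 | 5 | 57 | 285 | 25 | 3249 |
| 3 | 1 | 41 | 41 | 1 | 1681 |
| 4 | 3 | 54 | 162 | 9 | 2916 |
| 5 | 4 | 54 | 216 | 16 | 2916 |
| 6 | 1 | 38 | 38 | 1 | 1444 |
| 7 | 5 | 63 | 315 | 25 | 3969 |
| 8 | 3 | 48 | 144 | 9 | 2304 |
| 9 | 4 | 59 | 236 | 16 | 3481 |
| 10 | 2 | 46 | 92 | 4 | 2116 |
| Sum | 30 | 510 | 1629 | 110 | 26576 |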


Solution

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{10(1629) - (30)(510)}{\sqrt{\left[10(110) - (30)^2\right]\left[10(26576) - (510)^2\right]}} = 0.9305 $$

[Figure: scatter plot of Sales Volume ($100s) against Number of Commercials.]

r = 0.9305 indicates a relatively strong positive linear association between x and y.
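The calculation can be checked with a short Python sketch that applies the sums formula to the ten weeks of data from the example. The function name `pearson_r` and the variable names are illustrative choices, not part of the slides.

```python
from math import sqrt

# Data from the commercials example: x = number of commercials, y = sales ($100s)
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]


def pearson_r(x, y):
    """Sample correlation coefficient from the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(commercials, sales)
print(round(r, 4))  # 0.9305
```

Working from the sums mirrors the hand calculation; a deviation-based formula would give the same value.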



Significance Test for Correlation

Decision rule

H0 : ρ = 0 (No significant correlation)
H1 : ρ ≠ 0 (Significant correlation)

Test statistic

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

Reject H0 if t > t_{n−2,α/2} or t < −t_{n−2,α/2}

Decision based on p-value: Reject H0 if p-value < α


Example: For the relationship between the number of weekend television commercials shown and the sales, we have:

α = 0.05

Critical values: t_{0.025,8} = ±2.3060

Test statistic:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9305}{\sqrt{\dfrac{1 - (0.9305)^2}{10 - 2}}} = 7.18 $$

Two-tailed p-value: P(t ≤ −7.18) + P(t ≥ 7.18) = 2(0.00004690) = 0.00009381 < α

[Figure: t distribution with rejection regions of area α/2 = 0.025 beyond the critical values ±2.30; the test statistic ±7.18 cuts off a tail area of 0.00004690 on each side.]

Decision: There is evidence to reject H0 (the t statistic is in the rejection region).

Conclusion: There is a significant correlation between the variables.
Interpretation:
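The test statistic and the critical-value decision can be reproduced with a minimal Python sketch. It assumes the rounded r = 0.9305 and n = 10 from the example and takes the critical value 2.3060 from the t table rather than computing it; the function name is illustrative.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / sqrt((1 - r ** 2) / (n - 2))


r, n = 0.9305, 10      # sample correlation and sample size from the example
t_critical = 2.3060    # t_{0.025, 8}, read from a t table (not computed here)

t_stat = correlation_t_statistic(r, n)
reject_h0 = abs(t_stat) > t_critical   # two-tailed decision rule
print(t_stat, reject_h0)
```

With the rounded r the statistic comes out just above 7.18, far beyond the critical value, so the decision matches the example.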
Linear Regression

Linear regression

Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the effect of changes in an independent variable on the dependent variable

Outcome (y): the variable we wish to explain
Explanatory variable (x): the variable used to explain the outcome


Simple Linear Regression


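$$ y = \beta_0 + \beta_1 x + \varepsilon $$

- Only one independent variable, x
- Relationship between x and y is described by a linear function
- Changes in y are assumed to be caused by changes in x

[Figure: regression line with intercept β0 and slope β1; for an observation at x_i, the error ε_i is the vertical distance between the observed value y_i and the fitted value ŷ_i.]

The population regression model

$$ y = \underbrace{\beta_0 + \beta_1 x}_{\text{Linear component}} + \underbrace{\varepsilon}_{\text{Random error component}} $$

where β0 is the intercept and β1 is the slope.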


Linear Regression Assumptions


- Error values (ε) are statistically independent
- Error values are normally distributed
- The probability distribution of the errors has constant variance
- The underlying relationship between the x variable and the y variable is linear

Estimated Regression Model
- The sample regression line provides an estimate of the population regression line

$$ \hat{y}_i = b_0 + b_1 x_i $$

The individual random error terms e_i have a mean of zero


Estimation of Intercept and Slope


- b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals (Least Squares Criterion)

$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2 $$

- The formulas for b0 and b1 are

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$

Interpretation of the Slope and the Intercept

- b0 is the estimated average value of y when the value of x is zero
- b1 is the estimated change in the average value of y as a result of a one-unit change in x


Possible regression lines in simple linear regression


[Figure: three panels of y against x, each showing a regression line with intercept β0. Positive linear relationship: slope β1 is positive. Negative linear relationship: slope β1 is negative. No relationship: slope β1 is zero.]


Example: A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).

A random sample of 10 houses is selected.
- Outcome (y) = house price in $1000s
- Explanatory variable (x) = square feet

| House | Price in $1000s (y) | Square Feet (x) | x·y | x² |
|-------|---------------------|-----------------|---------|----------|
| 1 | 245 | 1400 | 343000 | 1960000 |
| 2 | 312 | 1600 | 499200 | 2560000 |
| 3 | 279 | 1700 | 474300 | 2890000 |
| 4 | 308 | 1875 | 577500 | 3515625 |
| 5 | 199 | 1100 | 218900 | 1210000 |
| 6 | 219 | 1550 | 339450 | 2402500 |
| 7 | 405 | 2350 | 951750 | 5522500 |
| 8 | 324 | 2450 | 793800 | 6002500 |
| 9 | 319 | 1425 | 454575 | 2030625 |
| 10 | 255 | 1700 | 433500 | 2890000 |
| Sum | 2865 | 17150 | 5085975 | 30983750 |

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$

$$ b_1 = \frac{5085975 - \dfrac{17150 \times 2865}{10}}{30983750 - \dfrac{(17150)^2}{10}} = 0.10977 \approx 0.11 $$

$$ b_0 = \frac{2865}{10} - 0.10977 \times \frac{17150}{10} = 98.25 $$

b1 = 0.11 tells us that the average value of a house increases by 0.10977($1000) = $109.77, on average, for each additional one square foot of size.

No houses had 0 square feet, so b0 = 98.25 just indicates that, for houses within the range of sizes observed, 98.25 (in $1000s) is the portion of the house price not explained by square feet.

House price = 98.25 + 0.11 (Size)
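The slope and intercept can be recomputed from the ten houses with a brief Python sketch of the least-squares formulas. The function name `least_squares_line` is an illustrative choice, not something defined in the slides.

```python
# House data from the example: size in square feet (x), price in $1000s (y)
size = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def least_squares_line(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares_line(size, price)
print(round(b0, 2), round(b1, 5))  # 98.25 0.10977
```

Keeping the unrounded slope when computing the intercept matters here: using the rounded 0.11 would give an intercept of about 97.85 instead of 98.25.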


Significance Test for Slope


Decision rule

H0 : β1 = 0 (No significant linear relationship)
H1 : β1 ≠ 0 (Significant linear relationship)

Test statistic

$$ t = \frac{b_1 - \beta_1}{s_{b_1}} $$

Reject H0 if t > t_{n−2,α/2} or t < −t_{n−2,α/2}

$$ s_{b_1} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{(n - 2)\sum (x_i - \bar{x})^2}} $$

is the estimator of the standard error of the slope.

Decision based on p-value: Reject H0 if p-value < α



Example
Consider the estimated regression equation of house price and size.

α = 0.05

Critical values: t_{0.025,8} = ±2.3060

Standard error of b1: s_{b1} = 0.033

Test statistic:

$$ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{0.11 - 0}{0.033} = 3.33 $$

Two-tailed p-value: P(t ≤ −3.33) + P(t ≥ 3.33) = 2(0.005195) = 0.01039 < α

[Figure: t distribution with rejection regions of area α/2 = 0.025 beyond the critical values ±2.30; the test statistic ±3.33 cuts off a tail area of about 0.0052 on each side.]

Decision: There is evidence to reject H0 (the t statistic is in the rejection region).

Conclusion: There is sufficient evidence that square footage affects house price.
Interpretation:
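The standard error of the slope and the t statistic can be recomputed from the house data with a small Python sketch. It restates the data so that it runs on its own, takes the critical value 2.3060 from the t table rather than computing it, and uses the illustrative function name `slope_t_test`.

```python
from math import sqrt

# House data from the example: size in square feet (x), price in $1000s (y)
size = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def slope_t_test(x, y):
    """Return (b1, s_b1, t) for testing H0: beta1 = 0 in simple regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # residual sum of squares
    s_b1 = sqrt(sse / ((n - 2) * sxx))
    return b1, s_b1, b1 / s_b1


b1, s_b1, t_stat = slope_t_test(size, price)
t_critical = 2.3060  # t_{0.025, 8}, read from a t table (not computed here)
print(round(s_b1, 3), round(t_stat, 2), abs(t_stat) > t_critical)  # 0.033 3.33 True
```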

[Figure: scatter-plot matrix of the variables Income, Limit, Rating and Balance.]


Multiple Linear Regression


Examine the linear relationship between one dependent variable (y) and two or more independent variables (x_i).
Multiple regression model with k independent variables:

$$ y = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}_{\text{Linear component}} + \underbrace{\varepsilon}_{\text{Random error component}} $$

Estimation of multiple regression model

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k $$

which is the fitted equation that minimizes the sum of the squared residuals (Least Squares Criterion)
- b0 is the intercept: the value of ŷ when all x_i are zero.
- Each b_i represents the difference in the value of ŷ for each one-unit difference in x_i if the other independent variables remain constant.

Estimation of regression coefficient


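The multiple regression model can be represented in a compact form using matrix notation as below

$$ Y = X\beta + \varepsilon $$

$$ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$

The least squares estimation of coefficients

$$ \hat{\beta} = (X'X)^{-1} X'Y $$

Each β̂_j has a normal distribution with mean β_j and variance σ² v_jj, where v_jj is the diagonal entry of the matrix V = (X'X)^{-1} that corresponds to β_j.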

Significance tests for slopes

F test: used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables; we will refer to the F test as the test for overall significance.
- Decision rule

H0 : β1 = β2 = · · · = βk = 0
H1 : One or more of the coefficients is not equal to zero

- Test statistic

$$ F = \frac{SSR/k}{SSE/(n - k - 1)} $$

Reject H0 if F ≥ F_{k,n−k−1,α}

where SSR = Σ(ŷ_i − ȳ)² (sum of squares due to regression) and SSE = Σ(y_i − ŷ_i)² (sum of squares due to error)

Decision based on p-value: Reject H0 if p-value < α


T test: if the F test shows overall significance, the t test is used to determine whether each of the individual independent variables is significant (individual significance).
- Decision rule

H0 : βi = 0
H1 : βi ≠ 0

- Test statistic

$$ t = \frac{b_i - \beta_i}{s_{b_i}} $$

Reject H0 if t > t_{n−k−1,α/2} or t < −t_{n−k−1,α/2}

Decision based on p-value: Reject H0 if p-value < α


Example

In a trucking company, to develop better work schedules, the managers want to estimate the total daily travel time for their drivers based on miles traveled and number of deliveries. A sample of 10 driving assignments provided the data shown in the table below.

| Driving assignment | Travel time in hours (y) | Miles traveled (x1) | Number of deliveries (x2) |
|--------------------|--------------------------|---------------------|---------------------------|
| 1 | 9.3 | 100 | 4 |
| 2 | 4.8 | 50 | 3 |
| 3 | 8.9 | 100 | 4 |
| 4 | 6.5 | 100 | 2 |
| 5 | 4.2 | 50 | 2 |
| 6 | 6.2 | 80 | 2 |
| 7 | 7.4 | 75 | 3 |
| 8 | 6.0 | 65 | 4 |
| 9 | 7.6 | 90 | 3 |
| 10 | 6.1 | 90 | 2 |


Figure: SPSS output for the trucking company with two independent variables, miles traveled (x1) and number of deliveries (x2).


F test
- Decision: Based on the p-value = 0.003 ≤ α in the last column of the ANOVA table, we can reject H0 : β1 = β2 = 0.
- Conclusion: There is a significant relationship between the travel time and the independent variables (miles traveled and number of deliveries).
T test:
- Decision: Based on the p-values of .001 and .004 in the SPSS output for the coefficients, we can reject H0 : β1 = 0 and H0 : β2 = 0.
- Conclusion: There is sufficient evidence that miles traveled and number of deliveries have a significant effect on travel time.

The regression equation

Time = −1.008 + 0.071 Miles + 0.749 Deliveries

Because b1 = .071, an estimate of the expected increase in travel time corresponding to an increase of one mile in the distance traveled, when the number of deliveries is held constant, is .071 hours. Similarly, because b2 = .749, an estimate of the expected increase in travel time corresponding to an increase of one delivery, when the number of miles traveled is held constant, is .749 hours.
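As a cross-check, the least-squares coefficients can also be computed directly from the ten tabulated driving assignments by solving the normal equations (X'X)b = X'y. The sketch below is a minimal pure-Python version; the helper names `solve_linear_system` and `fit_multiple_regression` are illustrative and are not taken from the slides or from SPSS. On the data as printed in the table it returns roughly b0 = −0.869, b1 = 0.0611 and b2 = 0.923, which is not identical to the equation quoted from the SPSS output, so that output may have been produced from slightly different input; the output figure itself is not reproduced here.

```python
# Trucking data as tabulated: miles (x1), deliveries (x2), travel time in hours (y)
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def solve_linear_system(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_regression(columns, y):
    """Least-squares coefficients [b0, b1, ..., bk] from (X'X) b = X'y."""
    rows = [[1.0] + list(values) for values in zip(*columns)]  # design matrix X
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


b0, b1, b2 = fit_multiple_regression([miles, deliveries], hours)
print(round(b0, 3), round(b1, 4), round(b2, 3))
```

Solving the normal equations by elimination avoids forming the inverse explicitly; for larger or ill-conditioned problems a dedicated linear-algebra library would be the safer choice.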



Chi-square test

Chi-square test
A Chi-Square Test of Independence is used to determine whether or not there is a significant association between two categorical variables that are described in a contingency table.
Decision rule

H0 : The two variables are independent
H1 : The two variables are not independent

Test statistic

$$ \chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$

Reject H0 if χ² > χ²_{(R−1)(C−1),α} or p-value ≤ α

o_ij is the observed cell count in the ith row and jth column of the table

$$ e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{col } j \text{ total})}{\text{grand total}} $$

is the expected cell count in the ith row and jth column of the table under H0


Example

Suppose that you are looking at the results of a survey into the source of finance for buying a house, and find the following pattern of loan size and source of loan, which is a contingency table. You might ask whether there is any relationship or association between the size of a loan and the source.

| Source of mortgage | Less than $80,000 | $80,000 to $150,000 | More than $150,000 | Total |
|--------------------|-------------------|---------------------|--------------------|-------|
| Building society | 30 | 55 | 40 | 125 |
| Bank | 23 | 29 | 3 | 55 |
| Elsewhere | 12 | 6 | 2 | 20 |
| Total | 65 | 90 | 45 | 200 |


Example: For the relationship between the size of a loan and its source, we have:

α = 0.05, df = (3 − 1)(3 − 1) = 4

Critical value: χ²_{0.05,4} = 9.488

| Cell | O | E | O − E | (O − E)² | (O − E)²/E |
|------|-----|--------|---------|----------|------------|
| 1 | 30 | 40.625 | −10.625 | 112.891 | 2.779 |
| 2 | 55 | 56.250 | −1.250 | 1.563 | 0.028 |
| 3 | 40 | 28.125 | 11.875 | 141.016 | 5.014 |
| 4 | 23 | 17.875 | 5.125 | 26.266 | 1.469 |
| 5 | 29 | 24.750 | 4.250 | 18.063 | 0.730 |
| 6 | 3 | 12.375 | −9.375 | 87.891 | 7.102 |
| 7 | 12 | 6.500 | 5.500 | 30.250 | 4.654 |
| 8 | 6 | 9.000 | −3.000 | 9.000 | 1.000 |
| 9 | 2 | 4.500 | −2.500 | 6.250 | 1.389 |
| Total | 200 | 200 | | | 24.165 |

Test statistic: χ² = 24.165

P-value: P(χ² ≥ 24.165) = 0.00007402 < α

Decision: There is evidence to reject H0 (the χ² statistic is in the rejection region).

Conclusion: The evidence supports the view that there is an association, and the size of a mortgage is related to its source.
Interpretation:
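The expected counts, the χ² statistic and the p-value can be reproduced with a short Python sketch built on the contingency table. The function name `chi_square_independence` is illustrative. The p-value line relies on the closed form of the chi-square upper tail for 4 degrees of freedom, exp(−x/2)(1 + x/2), so it is valid for this 3 × 3 table only and is not a general routine.

```python
from math import exp

# Observed counts from the contingency table (rows: source, columns: loan size)
observed = [
    [30, 55, 40],  # Building society
    [23, 29, 3],   # Bank
    [12, 6, 2],    # Elsewhere
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count under H0
            statistic += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


chi2, df = chi_square_independence(observed)
# Upper-tail probability for df = 4 only: P(X >= x) = exp(-x/2) * (1 + x/2)
p_value = exp(-chi2 / 2) * (1 + chi2 / 2)
print(round(chi2, 3), df, p_value < 0.05)  # 24.165 4 True
```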