UNIT 5

1\ Correlation
In statistics, the word "correlation" has a very specific meaning. Statistical correlation
means that, given two variables X and Y measured for each case in a sample, variation in X
corresponds (or does not correspond) to variation in Y, and vice versa. That is, extreme values of X
are associated with extreme values of Y, and less extreme X values with less extreme Y values.
The correlation coefficient (Pearson r) measures the degree of this correspondence.
Linear Correlation:
Linear correlation analysis is used to measure the strength of the association (linear
relationship) between two or more variables.
- Only concerned with strength of the relationship
- No causal effect is implied
The Simple Linear Correlation Coefficient
There is a simple and straightforward way to measure correlation between two variables. It
is called the Pearson correlation coefficient (r), named after Karl Pearson, who developed it. Its
longer name, the Pearson product-moment correlation, is sometimes used.
Value of the Correlation Coefficient:
The population correlation coefficient ρ (rho) measures the strength of the linear association between the
variables.
The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the
linear relationship in the sample observations.
The value of the correlation coefficient always lies in the range of -1 to 1:
r > 0 indicates a positive relationship between X and Y: as one gets larger, the other gets larger.
r < 0 indicates a negative relationship: as one gets larger, the other gets smaller.
r = 0 indicates no linear relationship.
Features of ρ and r:
- Unit free
- Range between -1 and 1
- The closer to -1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship

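The formula for computing the Pearson r is as follows:

$$ r = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sqrt{\left(\sum x^2 - n\,\bar{x}^2\right)\left(\sum y^2 - n\,\bar{y}^2\right)}} $$

Or:

$$ r = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{s_x\, s_y} $$

Where:

$$ s_x = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}, \qquad s_y = \sqrt{\frac{\sum y^2}{n} - \bar{y}^2} $$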
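As a rough check that the two forms of the formula agree, the following Python sketch implements both. The function names and the small data set are illustrative assumptions and are not part of the course material.

```python
import math

def pearson_r_sums(xs, ys):
    """First form: (sum(xy) - n*mean_x*mean_y) / sqrt((sum(x^2) - n*mean_x^2)(sum(y^2) - n*mean_y^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - n * mean_x * mean_y
    denominator = math.sqrt((sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2))
    return numerator / denominator

def pearson_r_means(xs, ys):
    """Second form: (sum(xy)/n - mean_x*mean_y) / (s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    s_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return (mean_xy - mean_x * mean_y) / (s_x * s_y)

# Hypothetical data, chosen only to exercise the two functions.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r1 = pearson_r_sums(xs, ys)
r2 = pearson_r_means(xs, ys)
print(round(r1, 4), round(r2, 4))

# r is unit free: rescaling x (for example metres to centimetres) leaves it unchanged.
r_scaled = pearson_r_sums([100 * x for x in xs], ys)
print(round(r_scaled, 4))
```

Both functions fail with a division by zero if every x (or every y) has the same value, because the denominator is then zero; r is undefined in that case.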
Examples:
1\ The following table shows the marks of 10 students in BUSS-304 and BUSS-307. Find the
correlation coefficient and comment on your value:

| Mark in BUSS-304 | 18 | 20 | 30 | 40 | 47 | 55 | 60 | 80 | 88 | 92 |
|---|---|---|---|---|---|---|---|---|---|---|
| Mark in BUSS-307 | 42 | 55 | 61 | 54 | 63 | 68 | 81 | 66 | 80 | 100 |
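Solution:

| Sr. | Mark in BUSS-304 (X) | Mark in BUSS-307 (Y) | x² | y² | xy |
|---|---|---|---|---|---|
| 1 | 18 | 42 | 324 | 1764 | 756 |
| 2 | 20 | 55 | 400 | 3025 | 1100 |
| 3 | 30 | 61 | 900 | 3721 | 1830 |
| 4 | 40 | 54 | 1600 | 2916 | 2160 |
| 5 | 47 | 63 | 2209 | 3969 | 2961 |
| 6 | 55 | 68 | 3025 | 4624 | 3740 |
| 7 | 60 | 81 | 3600 | 6561 | 4860 |
| 8 | 80 | 66 | 6400 | 4356 | 5280 |
| 9 | 88 | 80 | 7744 | 6400 | 7040 |
| 10 | 92 | 100 | 8464 | 10000 | 9200 |
| Total | 530 | 670 | 34666 | 47336 | 38927 |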

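$$ \bar{x} = \frac{\sum x}{n} = \frac{530}{10} = 53, \qquad \bar{y} = \frac{\sum y}{n} = \frac{670}{10} = 67 $$

$$ \frac{\sum xy}{n} = \frac{38927}{10} = 3892.7 $$

$$ s_x = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{34666}{10} - (53)^2} = 25.64 $$

$$ s_y = \sqrt{\frac{\sum y^2}{n} - \bar{y}^2} = \sqrt{\frac{47336}{10} - (67)^2} = 15.64 $$

$$ r = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{s_x\, s_y} = \frac{3892.7 - 53 \times 67}{25.64 \times 15.64} = \frac{341.7}{401.01} = 0.852 $$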
There is a strong positive linear relationship between marks of BUSS-304 and BUSS-307.
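The same value can be cross-checked with the first form of the Pearson formula, working directly from the column totals of the solution table. The sketch below is only a check of the arithmetic; the variable names are illustrative.

```python
import math

# Column totals from the solution table (n = 10 students).
n = 10
sum_x, sum_y = 530, 670
sum_x2, sum_y2, sum_xy = 34666, 47336, 38927

mean_x = sum_x / n
mean_y = sum_y / n

# First form: (sum(xy) - n*mean_x*mean_y) / sqrt((sum(x^2) - n*mean_x^2)(sum(y^2) - n*mean_y^2))
numerator = sum_xy - n * mean_x * mean_y
denominator = math.sqrt((sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2))
r = numerator / denominator
print(round(r, 3))
```

Because this form avoids rounding s_x and s_y to two decimals first, it is slightly less sensitive to rounding error than the hand calculation.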

2\ Calculate the correlation coefficient for incomes and food expenditures of seven households as
given below:

| Income (X) | 35 | 48 | 21 | 39 | 14 | 28 | 25 |
|---|---|---|---|---|---|---|---|
| Food expenditure (Y) | 9 | 14 | 7 | 11 | 5 | 8 | 9 |

3\ Find the correlation coefficient between x and y from the following data:

| X | 1 | 2 | 4 | 6 | 7 | 8 | 10 |
|---|---|---|---|---|---|---|---|
| Y | 10 | 14 | 12 | 13 | 15 | 12 | 13 |

2\ Simple Linear Regression


Regression refers to the statistical technique of modeling the relationship between variables.
In simple linear regression, we model the relationship between two variables. One of the variables,
denoted by Y, is called the dependent variable and the other, denoted by X, is called the
independent variable.
The model we will use to depict the relationship between X and Y will be a straight-line
relationship.
- Only one independent variable, x.
- Relationship between x and y is described by a linear function
- Changes in y are assumed to be caused by changes in x
A graphical sketch of the pairs (X, Y) is called a scatter plot.

The sample regression line provides an estimate of the population regression line.

The Least Squares Regression Line:


At its simplest level, linear regression is a method for fitting a straight line through an x-y scatter
plot, described by the following formula:

ŷ = a + bx
(or, equivalently, ŷ = α + βx)
where:
x = a value on the x axis
b = slope parameter
a = intercept parameter (i.e., value on the y axis where x = 0)
ŷ = a predicted value of y.
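$$ b = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{\dfrac{\sum x^2}{n} - \bar{x}^2} \qquad \text{or} \qquad b = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{s_x^2} $$

$$ a = \bar{y} - b\,\bar{x} $$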
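The formulas for b and a translate almost line by line into code. The following is a minimal Python sketch; the function name `least_squares_line` and the test points are hypothetical. The points are chosen to lie exactly on the line y = 2 + 3x, so the fit should recover a = 2 and b = 3.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator of b: sum(xy)/n - mean_x*mean_y
    covariance = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    # Denominator of b: sum(x^2)/n - mean_x^2, which equals s_x squared
    variance_x = sum(x * x for x in xs) / n - mean_x ** 2
    b = covariance / variance_x
    a = mean_y - b * mean_x
    return a, b

# Hypothetical points lying exactly on y = 2 + 3x.
xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]
a, b = least_squares_line(xs, ys)
print(a, b)
```

If all x values are identical, `variance_x` is zero and the slope is undefined, so the function raises a `ZeroDivisionError` in that case.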

Examples:
1\ Find the least squares regression line of y on x for the data of marks of BUSS-304 and BUSS-307
in the table below. Use BUSS-304 as an independent variable and BUSS-307 as a dependent
variable.
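| Mark in BUSS-304 | 18 | 20 | 30 | 40 | 47 | 55 | 60 | 80 | 88 | 92 |
|---|---|---|---|---|---|---|---|---|---|---|
| Mark in BUSS-307 | 42 | 55 | 61 | 54 | 63 | 68 | 81 | 66 | 80 | 100 |

Solution:

| Sr. | Mark in BUSS-304 (X) | Mark in BUSS-307 (Y) | x² | xy |
|---|---|---|---|---|
| 1 | 18 | 42 | 324 | 756 |
| 2 | 20 | 55 | 400 | 1100 |
| 3 | 30 | 61 | 900 | 1830 |
| 4 | 40 | 54 | 1600 | 2160 |
| 5 | 47 | 63 | 2209 | 2961 |
| 6 | 55 | 68 | 3025 | 3740 |
| 7 | 60 | 81 | 3600 | 4860 |
| 8 | 80 | 66 | 6400 | 5280 |
| 9 | 88 | 80 | 7744 | 7040 |
| 10 | 92 | 100 | 8464 | 9200 |
| Total | 530 | 670 | 34666 | 38927 |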

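$$ \bar{x} = \frac{\sum x}{n} = \frac{530}{10} = 53, \qquad \bar{y} = \frac{\sum y}{n} = \frac{670}{10} = 67 $$

$$ b = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{\dfrac{\sum x^2}{n} - \bar{x}^2} = \frac{\dfrac{38927}{10} - (53 \times 67)}{\dfrac{34666}{10} - (53)^2} = 0.52 $$

$$ a = \bar{y} - b\,\bar{x} = 67 - 0.52(53) = 39.44 $$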

ŷ = 39.44 + 0.52x

Interpretation of a and b:
Interpretation of a: Consider a student with a mark of zero in BUSS-304:
ŷ = 39.44 + 0.52(0) = 39.44 marks in BUSS-307
Interpretation of b: The slope of the regression line is positive, suggesting a direct relationship
between marks of BUSS-304 and marks of BUSS-307. The value of the slope (b = 0.52) indicates
that each one-mark increase in BUSS-304 will increase the estimated mark of BUSS-307 by 0.52
marks.
Point Estimates Using the Regression Line:
Predict the mark in BUSS-307 if the mark in BUSS-304 is 50:
ŷ = 39.44 + 0.52(50) = 65.44 marks in BUSS-307
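The hand calculation rounds b to 0.52 before computing a, which shifts the intercept slightly. The sketch below refits the line from the raw marks without intermediate rounding and compares the two predictions at x = 50. The helper `predict` is a hypothetical name used only for illustration.

```python
marks_304 = [18, 20, 30, 40, 47, 55, 60, 80, 88, 92]    # x
marks_307 = [42, 55, 61, 54, 63, 68, 81, 66, 80, 100]   # y

n = len(marks_304)
mean_x = sum(marks_304) / n
mean_y = sum(marks_307) / n

# Slope and intercept without intermediate rounding.
b = (sum(x * y for x, y in zip(marks_304, marks_307)) / n - mean_x * mean_y) / (
    sum(x * x for x in marks_304) / n - mean_x ** 2
)
a = mean_y - b * mean_x

def predict(x, intercept, slope):
    """Point estimate from the fitted line y_hat = intercept + slope * x."""
    return intercept + slope * x

full_precision = predict(50, a, b)
hand_rounded = predict(50, 39.44, 0.52)
print(round(a, 2), round(b, 4))
print(round(full_precision, 2), round(hand_rounded, 2))
```

Here the two predictions agree to two decimals even though the unrounded intercept is about 39.46 rather than 39.44; with other data the effect of early rounding can be larger.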
Coefficient of determination, r² (or R²):
The coefficient of determination, r² (or R²), gives the proportion of the variation in the dependent
variable (y) that is explained by the independent variable (x).
The further the regression line is away from the points, the less it is able to explain.
0 ≤ r² ≤ 1
For the above data, calculate the coefficient of determination:
Coefficient of determination: r² = (0.852)² = 0.726
Interpretation of r²: 72.6% of the variation in BUSS-307 marks is explained by BUSS-304 marks.
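One common way to read "proportion of variation explained" is through sums of squares: r² = 1 − SSE/SST, where SST is the total variation of y around its mean and SSE is the variation left around the fitted line. This decomposition is a standard result that these notes do not spell out, so the sketch below should be read as an illustration under that assumption, using the marks data.

```python
marks_304 = [18, 20, 30, 40, 47, 55, 60, 80, 88, 92]    # x
marks_307 = [42, 55, 61, 54, 63, 68, 81, 66, 80, 100]   # y

n = len(marks_304)
mean_x = sum(marks_304) / n
mean_y = sum(marks_307) / n

# Least squares slope and intercept (no intermediate rounding).
b = (sum(x * y for x, y in zip(marks_304, marks_307)) / n - mean_x * mean_y) / (
    sum(x * x for x in marks_304) / n - mean_x ** 2
)
a = mean_y - b * mean_x

predicted = [a + b * x for x in marks_304]

# SST: total variation of y around its mean.
sst = sum((y - mean_y) ** 2 for y in marks_307)
# SSE: variation of y left unexplained by the fitted line.
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(marks_307, predicted))

r_squared = 1 - sse / sst
print(round(sst, 1), round(sse, 1), round(r_squared, 3))
```

The closer the points lie to the line, the smaller SSE becomes relative to SST and the closer r² gets to 1.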
Computer solution: MS Excel

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.851993 |
| R Square | 0.725891 |
| Adjusted R Square | 0.691628 |
| Standard Error | 9.154708 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 1775.531 | 1775.531 | 21.18552 | 0.001749 |
| Residual | 8 | 670.4694 | 83.80868 | | |
| Total | 9 | 2446 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 39.46031 | 6.646844 | 5.936699 | 0.000347 | 24.13266 | 54.78796 | 24.13266 | 54.78796 |
| Mark in BUSS304 | 0.519617 | 0.112892 | 4.602773 | 0.001749 | 0.259287 | 0.779946 | 0.259287 | 0.779946 |
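Several entries in the Excel output can be reproduced from the raw marks. The sketch below uses the usual textbook definitions for simple regression (residual mean square with n − 2 degrees of freedom, F as the ratio of mean squares, and the slope's standard error as s / sqrt(Σ(x − x̄)²)). These formulas are not derived in these notes, so treat them as assumptions of the sketch rather than as part of the course method.

```python
import math

marks_304 = [18, 20, 30, 40, 47, 55, 60, 80, 88, 92]    # x
marks_307 = [42, 55, 61, 54, 63, 68, 81, 66, 80, 100]   # y

n = len(marks_304)
mean_x = sum(marks_304) / n
mean_y = sum(marks_307) / n

# Centred sums; these equal n times the mean-based expressions used in the hand formulas.
sxx = sum((x - mean_x) ** 2 for x in marks_304)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(marks_304, marks_307))

b = sxy / sxx
a = mean_y - b * mean_x

sst = sum((y - mean_y) ** 2 for y in marks_307)                        # "Total" SS
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(marks_304, marks_307))  # "Residual" SS
ssr = sst - sse                                                         # "Regression" SS

mse = sse / (n - 2)                     # residual mean square
standard_error = math.sqrt(mse)         # "Standard Error" in Regression Statistics
f_stat = ssr / mse                      # regression has 1 degree of freedom
r_squared = ssr / sst
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
slope_se = standard_error / math.sqrt(sxx)
t_slope = b / slope_se

print(round(standard_error, 4), round(f_stat, 4), round(adjusted_r_squared, 4))
print(round(slope_se, 4), round(t_slope, 4))
```

In simple regression the F statistic is the square of the slope's t statistic, which is why both carry the same p-value in the output.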

2\ Find the least squares regression line for the data of incomes and food expenditure on the seven
households given in the table below. Use income as an independent variable and food expenditure
as a dependent variable.
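| Income (X) | 35 | 48 | 21 | 39 | 14 | 28 | 25 |
|---|---|---|---|---|---|---|---|
| Food expenditure (Y) | 9 | 14 | 7 | 11 | 5 | 8 | 9 |

| Sr. | Income (x) | Food Exp. (y) | xy | x² |
|---|---|---|---|---|
| 1 | 35 | 9 | 315 | 1225 |
| 2 | 48 | 14 | 672 | 2304 |
| 3 | 21 | 7 | 147 | 441 |
| 4 | 39 | 11 | 429 | 1521 |
| 5 | 14 | 5 | 70 | 196 |
| 6 | 28 | 8 | 224 | 784 |
| 7 | 25 | 9 | 225 | 625 |
| Total | 210 | 63 | 2082 | 7096 |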

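$$ \bar{x} = \frac{\sum x}{n} = \frac{210}{7} = 30, \qquad \bar{y} = \frac{\sum y}{n} = \frac{63}{7} = 9 $$

$$ b = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{\dfrac{\sum x^2}{n} - \bar{x}^2} = \frac{\dfrac{2082}{7} - (30 \times 9)}{\dfrac{7096}{7} - (30)^2} = 0.24 $$

$$ a = \bar{y} - b\,\bar{x} = 9 - 0.24 \times 30 = 1.8 $$

ŷ = 1.8 + 0.24x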
Interpretation of a and b:
Interpretation of a: Consider a household with zero income:
ŷ = 1.8 + 0.24(0) = RO 1.8 hundred
Thus, we can state that a household with no income is expected to spend RO 1.8 hundred per month on food.
However, the regression line is valid only for values of x between 14 and 48, and x = 0 lies outside this range, so this estimate should be treated with caution.

Interpretation of b: The value of b in the regression model gives the change in y due to a change of
one unit in x. We can state that, on average, a RO 1 increase in the income of a household will increase
its food expenditure by RO 0.24.
Point Estimates Using the Regression Line:
Predict the food expenditure if income is 17:
ŷ = 1.8 + 0.24(17) = RO 5.88 hundred
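Since the fitted line is stated to be valid only for incomes between 14 and 48, a prediction helper can refuse to extrapolate. The sketch below is one possible design; the function name and the choice to raise an error (rather than, say, issue a warning) are assumptions made for illustration.

```python
X_MIN, X_MAX = 14, 48     # smallest and largest incomes in the sample
A, B = 1.8, 0.24          # rounded intercept and slope from the worked solution

def predict_food_expenditure(income):
    """Estimate food expenditure from y_hat = 1.8 + 0.24x, refusing to extrapolate."""
    if not X_MIN <= income <= X_MAX:
        raise ValueError(
            f"income {income} is outside the observed range {X_MIN} to {X_MAX}"
        )
    return A + B * income

estimate = predict_food_expenditure(17)
print(round(estimate, 2))

# An income of 0 lies outside the sampled range, so the helper rejects it.
try:
    predict_food_expenditure(0)
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True
print(extrapolation_blocked)
```

A stricter guard like this makes the validity range explicit, at the cost of rejecting the x = 0 case that is sometimes used to interpret the intercept.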

Coefficient of determination: r² = (0.9624)² = 0.926

Interpretation of R²: 92.6% of the variation in food expenditure is explained by income.
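The correlation coefficient for this example is not worked out by hand in the solution. A short Python sketch, applying the same mean-based formula for r to the income and food expenditure data, shows where the value comes from; the variable names are illustrative.

```python
import math

income = [35, 48, 21, 39, 14, 28, 25]    # x
food = [9, 14, 7, 11, 5, 8, 9]           # y

n = len(income)
mean_x = sum(income) / n
mean_y = sum(food) / n

s_x = math.sqrt(sum(x * x for x in income) / n - mean_x ** 2)
s_y = math.sqrt(sum(y * y for y in food) / n - mean_y ** 2)

# r = (sum(xy)/n - mean_x*mean_y) / (s_x * s_y)
r = (sum(x * y for x, y in zip(income, food)) / n - mean_x * mean_y) / (s_x * s_y)
r_squared = r ** 2
print(round(r, 4), round(r_squared, 3))
```

Keeping four decimals in r before squaring matters here: squaring a value rounded to two decimals would give a noticeably different r².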
Computer solution: MS Excel

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.962409 |
| R Square | 0.926231 |
| Adjusted R Square | 0.911477 |
| Standard Error | 0.858888 |
| Observations | 7 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 46.31156 | 46.31156 | 62.77929 | 0.000516 |
| Residual | 5 | 3.688442 | 0.737688 | | |
| Total | 6 | 50 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 1.763819 | 0.969254 | 1.819769 | 0.128447 | -0.72773 | 4.255367 | -0.72773 | 4.255367 |
| Income (X) | 0.241206 | 0.030442 | 7.923338 | 0.000516 | 0.162951 | 0.319461 | 0.162951 | 0.319461 |

3\ A production manager has compared the years of expertise of five assembly-line employees with
their hourly productivity. The data are shown here. Find the least squares regression line for the
data, and find the correlation coefficient (r).

| Employee | Years of expertise (x) | Units produced in one hour (y) |
|---|---|---|
| A | 12 | 55 |
| B | 14 | 63 |
| C | 17 | 67 |
| D | 16 | 70 |
| E | 11 | 51 |
Solution:
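| Employee | x | y | xy | x² | y² |
|---|---|---|---|---|---|
| A | 12 | 55 | 660 | 144 | 3025 |
| B | 14 | 63 | 882 | 196 | 3969 |
| C | 17 | 67 | 1139 | 289 | 4489 |
| D | 16 | 70 | 1120 | 256 | 4900 |
| E | 11 | 51 | 561 | 121 | 2601 |
| Total | 70 | 306 | 4362 | 1006 | 18984 |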

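$$ \bar{x} = \frac{\sum x}{n} = \frac{70}{5} = 14, \qquad \bar{y} = \frac{\sum y}{n} = \frac{306}{5} = 61.2 $$

$$ b = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{\dfrac{\sum x^2}{n} - \bar{x}^2} = \frac{\dfrac{4362}{5} - (14 \times 61.2)}{\dfrac{1006}{5} - (14)^2} = 3 $$

$$ a = \bar{y} - b\,\bar{x} = 61.2 - 3 \times 14 = 19.2 $$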

ŷ = 19.2 + 3x
Interpretation of a and b:

Interpretation of a: Consider an employee with zero years of expertise.
ŷ = 19.2 + 3(0) = 19.2 units of production.
Thus, we can state that an employee with no years of expertise is expected to produce 19.2 units per
hour.
Interpretation of b: The value of b in the regression model gives the change in y due to a change of
one unit in x. The slope of the regression line is positive, suggesting a direct relationship between
years of expertise and productivity. The value of the slope (b = 3) indicates that each additional
year of expertise will increase the estimated productivity by 3.0 units per hour.

Point Estimates Using the Regression Line:


Estimate the productivity of an employee with 15 years of expertise:
ŷ = 19.2 + 3(15) = 64.2 units of production

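The correlation coefficient (r):

$$ r = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{s_x\, s_y} $$

$$ s_x = \sqrt{\frac{1006}{5} - (14)^2} = 2.28 $$

$$ s_y = \sqrt{\frac{18984}{5} - (61.2)^2} = 7.17 $$

$$ r = \frac{872.4 - 14 \times 61.2}{2.28 \times 7.17} = \frac{15.6}{16.3476} = 0.95 $$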

The coefficient of correlation (r = 0.9546) is positive, reflecting that productivity (y) is directly
related to years of expertise (x).

Coefficient of determination: r² = (0.9546)² = 0.911

Interpretation of R²: 91.1% of the variation in productivity is explained by expertise.
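The formulas for r and b share the same numerator, so the slope can also be written as b = r · s_y / s_x. This identity follows from the two formulas used in this unit but is not stated in the notes; the Python sketch below checks it on the expertise and productivity data, with illustrative variable names.

```python
import math

years = [12, 14, 17, 16, 11]    # x: years of expertise
units = [55, 63, 67, 70, 51]    # y: units produced in one hour

n = len(years)
mean_x = sum(years) / n
mean_y = sum(units) / n

# Shared numerator of r and b: sum(xy)/n - mean_x*mean_y
covariance = sum(x * y for x, y in zip(years, units)) / n - mean_x * mean_y
s_x = math.sqrt(sum(x * x for x in years) / n - mean_x ** 2)
s_y = math.sqrt(sum(y * y for y in units) / n - mean_y ** 2)

r = covariance / (s_x * s_y)
b_direct = covariance / s_x ** 2     # slope from the regression formula
b_from_r = r * s_y / s_x             # slope recovered from r

print(round(r, 4), round(b_direct, 4), round(b_from_r, 4))
```

One consequence is that r and b always have the same sign, since s_x and s_y are positive.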

Computer solution: MS Excel

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.954576 |
| R Square | 0.911215 |
| Adjusted R Square | 0.88162 |
| Standard Error | 2.75681 |
| Observations | 5 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 234 | 234 | 30.78947 | 0.011542 |
| Residual | 3 | 22.8 | 7.6 | | |
| Total | 4 | 256.8 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 19.2 | 7.668918 | 2.503613 | 0.087428 | -5.20592 | 43.60592 | -5.20592 | 43.60592 |
| Years of Exp (x) | 3 | 0.540655 | 5.548826 | 0.011542 | 1.279395 | 4.720605 | 1.279395 | 4.720605 |
