
5 Regression and Correlation (1)

05 Regression and Correlation


We make decisions based on predictions of future events.
We develop a relationship between what is already known and what is to be estimated.
We use correlation analysis to determine the degree to which the variables are related. It tells us how well the estimating equation actually describes the relationship.
Examples (1)
Consider an example that investigates the relationship between the density and the 28-day compressive strength of concrete, based on 40 concrete cube test records taken during the period 8 July 1991 to 21 September 1992 and arranged in reverse chronological order.
Each row lists four pairs of density (kg/m3) and compressive strength (N/mm2):
Density Strength Density Strength Density Strength Density Strength
2437 60.5 2428 56.9 2435 57.8 2444 64.9
2437 60.9 2448 67.3 2446 60.9 2447 63.4
2425 59.8 2456 68.9 2441 61.9 2433 60.5
2427 53.4 2436 49.9 2456 67.2 2429 68.1
2435 68.3 2454 59.8 2458 61.1 2455 56.3
2471 65.7 2449 56.7 2414 50.7 2473 64.9
2472 61.5 2441 57.9 2448 59.0 2488 69.5
2445 60.0 2457 60.2 2445 63.3 2454 58.9
2436 59.6 2447 55.8 2436 52.5 2427 54.4
2450 60.5 2436 53.2 2469 54.6 2411 58.8
(Ref. Applied Statistics for Civil and Environmental Engineers, 2nd Edition, by Kottegoda and Rosso, Example 6.1)
Let x and y denote the concrete density and the compressive strength, respectively. Suppose the investigator believes that the relation between y and x is exactly given by: Y = −274.4 + 0.1368x.
If this were true, we would obtain the exact value of the strength y for any given value of x. Thus when x = 2445, the strength must be:
Y = −274.4 + 0.1368 (2445) = 60.076
But the observed value is 60.0, so there is an error of 60.076 − 60.0 = 0.076. Hence no deterministic model can be constructed to represent this experiment. A model that allows for this type of random error is known as a probabilistic model.
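The point can be checked numerically. The sketch below evaluates the investigator's line at x = 2445 and compares it with the two cubes of that density recorded in the table (strengths 60.0 and 63.3 N/mm2). The function name `predicted_strength` is ours, chosen for illustration.

```python
# Line proposed by the investigator: Y = -274.4 + 0.1368 x
def predicted_strength(density):
    """Strength (N/mm2) implied by the deterministic line for a density in kg/m3."""
    return -274.4 + 0.1368 * density

# The table contains two cubes with density 2445 kg/m3 but different strengths.
observed = [60.0, 63.3]
prediction = predicted_strength(2445)
errors = [prediction - y for y in observed]

print(f"predicted: {prediction:.3f}")
for y, e in zip(observed, errors):
    print(f"observed {y}: error {e:+.3f}")
```

Because the same density occurs with two different strengths, no single-valued function of x can reproduce both records exactly, which is why an error term is needed.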
Regression Model
A mathematical equation that allows us to predict values of one dependent variable from known values of one or more
independent variables is called a regression equation. Today the term regression is applied to all types of prediction problems and
does not necessarily imply a regression towards the population mean.
Linear Regression
We consider here the problem of estimating or predicting the value of a dependent variable Y on the basis of a known measurement of an independent and frequently controlled variable X. The variable to be estimated or predicted is termed the dependent variable, the regressand or the response variable. The variable on the basis of which the dependent variable is estimated is called the independent variable, the regressor or the predictor.
Scatter Diagram
Let us consider the data given in Example 1. The data have been plotted in the figure to give a scatter diagram.

In the scatter diagram, the points follow a straight line closely, which indicates that the two variables are to some extent linearly related. Once a reasonable linear relationship is observed, we usually try to express it mathematically by a straight-line equation Y = a + bX, called the linear regression line, where the constants a and b represent the y-intercept and the slope respectively.

Such a regression line has been drawn in the following figure. This linear regression line can be used to predict the value Y
corresponding to any given value X.

Many possible regression lines could be fitted to the sample data, but we choose that particular line which best fits that
data. The best regression line is obtained by estimating the regression parameters by the most commonly used method of least
squares.
Estimation of a Straight Line using the Method of Least Squares
Examples (2) (Weather and Traffic)
Weather and traffic are two everyday occurrences that have inherent randomness. For example, if you live in a cold
climate you know that traffic tends to be more difficult when snow falls and covers the roads. We can create a simple mathematical
model of traffic incidents as a function of snowy weather, based on known data.
In the following table, we have accumulated a record of the number of snow days occurring in a certain locality over the
past 10 years, along with the number of traffic incidents reported to police in the same year. A scatter plot of the data can be used
to visualize the possible correlation.
Snow Days 16 55 43 29 59 42 20 45 30 35
Incidents 5825 11427 9006 5963 11449 8380 5745 9104 6495 6938
The scatter plot is shown below:

We see that there is a general trend to the data, with traffic incidents increasing as the number of snow days increases.
We have added a linear trend line to the data to highlight this relationship. This linear trend is, in fact, a straight line probabilistic
model of the data.
A straight-line probabilistic model is often referred to as a linear regression, Y = a + bX.
The normal equations are:
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
The table for these normal equations is constructed as follows (the last row gives the column totals):

X Y X² XY
16 5825 256 93200
55 11427 3025 628485
43 9006 1849 387258
29 5963 841 172927
59 11449 3481 675491
42 8380 1764 351960
20 5745 400 114900
45 9104 2025 409680
30 6495 900 194850
35 6938 1225 242830
374 80332 15766 3271581
Now the normal equations:
80332 = 10a + 374b
3271581 = 374a + 15766b
By solving, we get a = 2415 and b = 150.2
The regression equation is
Y = 2415 + 150.2 X
Using this regression line, Y = 2415 + 150.2 X, we can predict the number of incidents for a given number of snow days. For example, the predicted number of incidents for 40 snow days is 8423. The estimated number of incidents for 45 snow days is 9174, whereas the observed number for 45 snow days is 9104.
Such predictions are valid only when X lies between 16 and 59, the range of the observed data. An extension of the model beyond these values may lead to unreasonable results. The value of b is called the coefficient of regression.
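The hand calculation above can be reproduced in a few lines. The following is a minimal sketch in plain Python that forms the four sums and solves the two normal equations in closed form. The helper name `fit_line` is ours.

```python
# Snow days (X) and reported traffic incidents (Y) from the table above
X = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
Y = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]

def fit_line(xs, ys):
    """Solve the two normal equations for the least-squares line Y = a + bX."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b

a, b = fit_line(X, Y)
print(f"a = {a:.1f}, b = {b:.2f}")
for days in (40, 45):
    print(days, "snow days ->", round(a + b * days), "incidents")
```

With unrounded coefficients the predictions can differ by about one incident from the values quoted in the text, which were computed from the rounded a and b.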
Examples (3)
Fit a parabola $Y = aX^2 + bX + c$ in the least-squares sense to the data

X= 10 12 15 23 20
Y= 14 17 23 25 21
Solution
We are given $Y = aX^2 + bX + c$.
The normal equations for the curve are
$$\sum Y = a\sum X^2 + b\sum X + 5c$$
$$\sum XY = a\sum X^3 + b\sum X^2 + c\sum X$$
$$\sum X^2Y = a\sum X^4 + b\sum X^3 + c\sum X^2$$
X Y X² X³ X⁴ XY X²Y
10 14 100 1000 10000 140 1400
12 17 144 1728 20736 204 2448
15 23 225 3375 50625 345 5175
23 25 529 12167 279841 575 13225
20 21 400 8000 160000 420 8400
Totals: $\sum X = 80$, $\sum Y = 100$, $\sum X^2 = 1398$, $\sum X^3 = 26270$, $\sum X^4 = 521202$, $\sum XY = 1684$, $\sum X^2Y = 30648$
Substituting the obtained values from the table in normal equations, we have

100 = 1398 a + 80 b + 5c
1684 = 26270 a + 1398 b + 80 c
30648 = 521202 a + 26270 b + 1398 c
On solving, a = −0.0695, b = 3.010, c = −8.728.
Hence the required equation is $Y = -0.0695X^2 + 3.010X - 8.728$.
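Solving the three normal equations numerically is a useful check on the hand calculation, since rounding one coefficient before back-substitution can shift the others noticeably. The sketch below builds the system from the data and solves it with a small Gauss-Jordan routine. The helper names `solve_linear`, `sx` and `sxy` are ours, and a library solver could be used instead.

```python
def solve_linear(A, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(rhs)
    M = [list(map(float, row)) + [float(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

X = [10, 12, 15, 23, 20]
Y = [14, 17, 23, 25, 21]
n = len(X)

def sx(p):
    """Sum of X**p over the data."""
    return sum(x ** p for x in X)

def sxy(p):
    """Sum of X**p * Y over the data."""
    return sum(x ** p * y for x, y in zip(X, Y))

# Rows follow the three normal equations; unknowns are ordered (a, b, c).
A = [[sx(2), sx(1), n],
     [sx(3), sx(2), sx(1)],
     [sx(4), sx(3), sx(2)]]
rhs = [sxy(0), sxy(1), sxy(2)]

a, b, c = solve_linear(A, rhs)
print(f"a = {a:.4f}, b = {b:.3f}, c = {c:.3f}")
```

The unrounded solution is approximately a = −0.0695, b = 3.010 and c = −8.728, and substituting it back reproduces the right-hand sides 100, 1684 and 30648.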
Examples (4) (Linearization)
The curve to be fitted is $Y = ae^{bX}$.
Taking logarithms to base 10:
$$\log_{10} Y = \log_{10}(ae^{bX}) = \log_{10} a + bX\log_{10} e$$
Putting $y = \log_{10} Y$, $A = \log_{10} a$ and $B = b\log_{10} e$, we get the straight line $y = A + BX$.
Equivalently, taking natural logarithms gives $\ln Y = \ln a + bX$, that is, $y = A + bX$ with $y = \ln Y$ and $A = \ln a$.
The normal equations are
$$\sum y = nA + B\sum X$$
$$\sum Xy = A\sum X + B\sum X^2$$
Table construction:
X Y y = log10 Y X² Xy
0 0.10 -1 0 0
0.5 0.45 -0.3468 0.25 -0.1734
1.0 2.15 0.3324 1.0 0.3324
1.5 9.15 0.9614 2.25 1.4421
2.0 40.35 1.6058 4.0 3.2116
2.5 180.75 2.2571 6.25 5.6428
Totals: $\sum X = 7.5$, $\sum y = 3.8099$, $\sum X^2 = 13.75$, $\sum Xy = 10.4555$

Substituting the values of the summations in the normal equations, we get
3.8099 = 6A + 7.5B and 10.4555 = 7.5A + 13.75B
On solving, A = −0.9916 and B = 1.3013, so that
$\log_{10} a = -0.9916$ and $b\log_{10} e = 1.3013$.
Hence $a = \text{antilog}_{10} A = 0.1019$ and $b = B/\log_{10} e = 2.9963$, and the curve is $Y = 0.1019\,e^{2.9963X}$.
Note: try the above example by taking natural logarithms (i.e. ln) instead.
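The linearization can be checked with a short script. The sketch below fits the transformed data by least squares using base-10 logarithms, then repeats the fit with natural logarithms, as the note suggests, to confirm that both routes recover the same a and b. The helper name `fit_line` is ours.

```python
import math

X = [0, 0.5, 1.0, 1.5, 2.0, 2.5]
Y = [0.10, 0.45, 2.15, 9.15, 40.35, 180.75]

def fit_line(xs, ys):
    """Least-squares intercept and slope of ys on xs via the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return (sy - slope * sx) / n, slope

# Base-10 route: y = log10 Y = A + B X, with A = log10 a and B = b log10 e
A10, B10 = fit_line(X, [math.log10(v) for v in Y])
a10, b10 = 10 ** A10, B10 / math.log10(math.e)

# Natural-log route: y = ln Y = A + b X, with A = ln a
Ae, be = fit_line(X, [math.log(v) for v in Y])
a_e = math.exp(Ae)

print(f"log10 route: a = {a10:.4f}, b = {b10:.4f}")
print(f"ln route:    a = {a_e:.4f}, b = {be:.4f}")
```

Both routes agree because changing the base of the logarithm only rescales y by a constant factor.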
Alternatively
$Y = ae^{bX}$
$$\ln Y = \ln(ae^{bX}) = \ln a + bX\ln e = \ln a + bX$$
Putting $y = \ln Y$ and $A = \ln a$ (so that $a = e^{A}$), we have
$y = A + bX$.
The normal equations for this straight line are
$$\sum y = nA + b\sum X$$
$$\sum Xy = A\sum X + b\sum X^2$$
Table construction:

X Y y = ln Y X2 Xy

Exercises
(1) Given below are the data relating to the thermal energy generated in Pakistan, 1981-94. The energy generation is in billion kWh.
Year 1981 1982 1983 1984 1985 1986 1987
Energy Generated 4.2 5.2 5.1 5.2 6.5 7.3 8.4
Year 1988 1989 1990 1991 1992 1993 1994
Energy Generated 10.8 11.9 14.5 16.1 19.4 19.7 23.0
Fit a straight line to the data. Find the residuals. Plot the residuals and comment on your result.
(2) The following data give the number of computers installed in labs at UET. Fit a linear regression equation of the number of computers on the years and give the annual rate of installation.
Year: 2001-2003 2003-2005 2005-2007 2007-2009 2009-2011
No of Computers installed: 139 144 150 154 158
Note
For each situation where the independent variable is a time factor, the values assigned to
2001-2003,… may be taken as 1,2,3,…
(3) A study of the department of transportation on the effect of bus ticket prices on the number of passengers produced the
following results:

Ticket Prices (Cents): 25 30 35 40 45 50 55 60


Passengers per 100 miles: 800 780 660 640 600 600 620 620

(a) Plot these data.
(b) Develop an estimating line that best describes the data.
(c) Predict the number of passengers per 100 miles if the ticket price were 50 cents.
(Statistics for Management, 7th Ed, by Richard Levin and David Rubin Prob. 12.18 )
(4) A tire manufacturing company is interested in removing pollutants from the exhaust at its factory, and cost is a concern. The company has collected data from other companies concerning the amount of money spent on environmental measures and the resulting amount of dangerous pollutants released (as a percentage of total emissions).
Money Spent ($ thousands) 8.4 10.2 16.5 21.7 9.4 8.3 11.5
Percentage of Dangerous Pollutants 35.9 31.8 24.7 25.2 36.8 35.8 33.4
Money Spent ($ thousands) 18.4 16.7 19.3 28.4 4.7 12.3
Percentage of Dangerous Pollutants 25.4 31.4 27.4 15.8 31.5 28.9
(a) Compute the regression equation
(b) Predict the percentage of dangerous pollutants released when $20,000 is spent on the control measures.
(c) Calculate the standard error of estimate.
(Statistics for Management, 7th Ed, by Richard Levin and David Rubin Prob. 12.24)
Examples (5)
Obtain a relation of the form $Y = ab^X$ for the following data by the method of least squares.

X= 2 3 4 5 6
Y= 8.3 15.4 33.1 65.2 127.4
Solution
The curve to be fitted is $Y = ab^X$, or $y = A + BX$, where $A = \log_{10} a$, $B = \log_{10} b$ and $y = \log_{10} Y$.
The normal equations are: $\sum y = 5A + B\sum X$ and $\sum Xy = A\sum X + B\sum X^2$.

X Y y = log10 Y X² Xy
2 8.3 0.9191 4 1.8382
3 15.4 1.1872 9 3.5616
4 33.1 1.5198 16 6.0792
5 65.2 1.8142 25 9.0710
6 127.4 2.1052 36 12.6312
Totals: $\sum X = 20$, $\sum y = 7.5455$, $\sum X^2 = 90$, $\sum Xy = 33.1812$

Substituting the values in the normal equations, we get
7.5455 = 5A + 20B and 33.1812 = 20A + 90B
On solving, A = 0.31 and B = 0.3.
Therefore $a = \text{antilog}_{10} A = 2.04$ and $b = \text{antilog}_{10} B = 1.995$.
Hence, the required curve is $Y = 2.04\,(1.995)^X$.
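As a check on this example, the sketch below repeats the fit in plain Python. It uses the Y values as tabulated in the solution (second value 15.4) and unrounded logarithms, so the last digits of the sums can differ slightly from the four-decimal table.

```python
import math

X = [2, 3, 4, 5, 6]
Y = [8.3, 15.4, 33.1, 65.2, 127.4]   # values as tabulated in the solution

# Linearize Y = a * b**X as y = A + B*X with y = log10 Y
y = [math.log10(v) for v in Y]
n = len(X)
sx, sy = sum(X), sum(y)
sxx = sum(x * x for x in X)
sxy = sum(x * v for x, v in zip(X, y))

# Solve the two normal equations for A and B
B = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
A = (sy - B * sx) / n

# Transform back to the original parameters
a = 10 ** A
b = 10 ** B
print(f"A = {A:.4f}, B = {B:.4f}")
print(f"Y = {a:.2f} * ({b:.3f})**X")
```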
Exercise (1)
 
Fit a least squares line for 20 pairs of observations having $\bar{X} = 2$, $\bar{Y} = 8$, $\sum X^2 = 180$ and $\sum XY = 404$.
Exercise (2)
For 5 pairs of observations, it is given that the A.M. of X is 2 and the A.M. of Y is 15. It is also known that $\sum X^2 = 30$, $\sum X^3 = 100$, $\sum X^4 = 354$, $\sum XY = 242$ and $\sum X^2Y = 850$. Fit a second degree parabola taking X as the independent variable.
Exercise (3)
Given the following sets of values:
X 6.5 5.3 8.6 1.2 4.2 2.9 1.1 3.9
Y 3.2 2.7 4.5 1.0 2.0 1.7 0.6 1.9
(a) Compute the least squares regression equation for Y values on X values.
(b) Compute the least squares regression equation for X values on Y values.
Exercise (4)
For each of the following data, determine the estimated regression equation Y = a + bX:
 
(a) $\bar{Y} = 20$, $\bar{X} = 10$, $\sum XY = 1000$, $\sum X^2 = 2000$, n = 10.
(b) $\sum X = 528$, $\sum Y = 11720$, $\sum XY = 193640$, $\sum X^2 = 11440$, n = 32.
Exercise (5)
For the following set of data: (AIOU)
(a) Plot the scatter diagram.
(b) Develop the estimating equation that best describes the data.
(c) Predict Y for X = 10, 15, 20.
X 13 16 14 11 17 9 13 17 18 12
Y 6.2 8.6 7.2 4.5 9.0 3.5 6.5 9.3 9.5 5.7
Coefficient of Determination
The coefficient of determination measures the goodness of fit of the estimated regression equation.
$$r^2 = \frac{SSR}{SST} = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum(\hat{Y} - \bar{Y})^2}{\sum(Y - \bar{Y})^2} = 1 - \frac{\sum(Y - \hat{Y})^2}{\sum(Y - \bar{Y})^2}$$

Alternative formula for Coefficient of Determination:

$$r^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$$
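To see that the definitions agree, the sketch below computes the coefficient of determination three ways for the snow-days data of Examples (2): from the explained variation, from the unexplained variation, and from the alternative formula. The variable names are ours.

```python
# Snow days (X) and traffic incidents (Y) from Examples (2)
X = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
Y = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]

n = len(X)
sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))
syy = sum(y * y for y in Y)

# Least-squares line Y = a + bX from the normal equations
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = (sy - b * sx) / n
y_bar = sy / n
y_hat = [a + b * x for x in X]

sst = sum((y - y_bar) ** 2 for y in Y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))   # unexplained variation

r2_explained = ssr / sst
r2_residual = 1 - sse / sst
r2_shortcut = (a * sy + b * sxy - n * y_bar ** 2) / (syy - n * y_bar ** 2)

print(round(r2_explained, 4), round(r2_residual, 4), round(r2_shortcut, 4))
```

All three values coincide at about 0.93, meaning that roughly 93% of the variation in incidents is explained by the number of snow days. The agreement relies on a and b being the least-squares estimates, so rounded coefficients will give slightly different answers.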

Exercise (6)
Calculate the coefficient of determination using both formulas.
Year  R&D expenses (X)  Annual profit (Y)
1st   5   31
2nd   11  40
3rd   4   30
4th   5   34
5th   3   25
6th   2   20
Totals: $\sum X = 30$, $\sum Y = 180$

Exercise (7)
The curb weight x in hundreds of pounds and braking distance y in feet, at 50 miles per hour on dry pavement, were
measured for five vehicles, with the results shown in the table.

X: 25 27.5 32.5 35 45
Y: 105 125 140 140 150

The fitted line for these data is Y = 66.34 + 1.990X and the fitted second degree parabola is $Y = -112.5 + 12.61X - 0.1510X^2$, as shown in the following figure. Compute the coefficient of determination and interpret its value in the context of vehicle weight and braking distance.
Examples (6)
An architect wants to determine the relationship between the height (in feet) of a building and the number of stories in the building. The data for a sample of 10 buildings in a city are shown below. Explain the relationship.

Stories: X 64 54 40 31 45 38 42 41 37 40
Height: Y 841 725 635 616 615 582 535 520 511 485

Correlation

Two variables are said to be correlated if they tend to vary simultaneously in some direction. If both variables tend to increase (or decrease) together, the correlation is said to be direct or positive, e.g. the length of an iron bar increases as the temperature increases. If one variable tends to increase as the other decreases, the correlation is said to be negative or inverse, e.g. the volume of a gas decreases as the pressure increases.
(1) Correlation measures the STRENGTH of the linear association between paired variables, say X and Y. Regression, on the other hand, tells us the FORM of the linear association that best predicts Y from the values of X.
(2) Correlation is calculated whenever:
o both X and Y are measured on each subject, to quantify how strongly they are linearly associated;
o in particular, Pearson's product-moment correlation coefficient is used when the assumption that both X and Y are sampled from normally distributed populations is satisfied;
o Spearman's rank-order correlation coefficient is used if the assumption of normality is not satisfied;
o correlation is not used when the variables are manipulated, for example in experiments.
The numerical measure of the strength of the linear relationship between any two variables is called the correlation coefficient. It is usually denoted by r and is defined by
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
called the Pearson Product Moment Correlation Coefficient.
Alternatively,
$$r = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{[\sum X^2 - (\sum X)^2/n][\sum Y^2 - (\sum Y)^2/n]}}$$
Its range is from −1 to +1.
If r = −1, there is a perfect negative correlation.
If r = +1, there is a perfect positive correlation.
If r is near −1, there is a strong negative correlation.
If r is near +1, there is a strong positive correlation.
If r is near 0 but negative, there is a weak negative correlation.
If r is near 0 but positive, there is a weak positive correlation.

It is important to note that r = 0 does not mean that there is no relationship at all. e.g. if all the observed values lie exactly
on a circle, there is a perfect non-linear relationship between the variables.
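Both remarks can be illustrated with a short computation. The sketch below implements the alternative (sums) formula for r in plain Python and applies it to the building data of Examples (6), then to twelve points lying exactly on a circle. The function name `pearson_r` and the choice of circle points are ours.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient via the computational (sums) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
    return num / den

# Stories (X) and height in feet (Y) for the 10 buildings of Examples (6)
stories = [64, 54, 40, 31, 45, 38, 42, 41, 37, 40]
height = [841, 725, 635, 616, 615, 582, 535, 520, 511, 485]
r_buildings = pearson_r(stories, height)

# Twelve integer points lying exactly on the circle x^2 + y^2 = 25
circle = [(5, 0), (-5, 0), (0, 5), (0, -5),
          (3, 4), (3, -4), (-3, 4), (-3, -4),
          (4, 3), (4, -3), (-4, 3), (-4, -3)]
r_circle = pearson_r([p[0] for p in circle], [p[1] for p in circle])

print(f"buildings: r = {r_buildings:.2f}")
print(f"circle:    r = {r_circle:.2f}")
```

For the buildings, r comes out at roughly 0.8, a fairly strong positive linear association. For the circle, r is zero even though every point satisfies an exact (non-linear) relation.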
Rank Correlation
Sometimes, the actual measurements of individuals or objects are either not available or cannot be assessed accurately. They are then arranged in order according to some characteristic of interest. Such an ordered arrangement is called a ranking and the order given to an individual or object is called its rank. The correlation between two such sets of rankings is called Rank Correlation.
We have
$$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the two ranks of the i-th individual and n is the number of pairs. This coefficient also ranges from −1 to +1.
Note

If two objects or observations are tied (have the same value), say for the fourth and fifth places, then both are given the mean of ranks 4 and 5, i.e. 4.5.
This situation arises in the following example.
Examples (7)
The following table shows the number of hours studied (X) by a random sample of ten students and their grades in
examination (Y):
X: 8 5 11 13 10 5 18 15 2 8
Y: 56 44 79 72 70 54 94 85 33 65
Calculate Spearman’s rank correlation coefficient.
Solution
We rank the X values by giving rank 1 to the highest value 18, rank 2 to 15, rank 3 to 13, rank 4 to 11, rank 5 to 10, rank
6.5 (mean of rank 6 and 7) to both 8, rank 8.5 (mean of rank 8 and 9) to both 5 and rank 10 to 2. Similarly we rank the values of Y by
giving 1 to the highest value 94, rank 2 to 85, rank 3 to 79, …, and rank 10 to 33 which is the smallest.
The table is given below:

X Y Rank of X Rank of Y di di²
8 56 6.5 7 - 0.5 0.25
5 44 8.5 9 - 0.5 0.25
11 79 4 3 1.0 1
13 72 3 4 - 1.0 1
10 70 5 5 0.0 0
5 54 8.5 8 0.5 0.25
18 94 1 1 0.0 0
15 85 2 2 0.0 0
2 33 10 10 0.0 0
8 65 6.5 6 0.5 0.25

$\sum d_i^2 = 3$
The value of n is 10.
Hence
$$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(3)}{10(10^2 - 1)} = 0.98$$

Compare this value with the correlation coefficient for the original values.
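The ranking step, including the mean ranks for ties, can be sketched as follows. The helper name `average_ranks` is ours. It gives rank 1 to the largest value, as in the solution above.

```python
X = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]          # hours studied
Y = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]     # examination grades

def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their ranks."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1     # best rank occupied by this value
        count = order.count(v)         # number of observations tied at v
        ranks.append(first + (count - 1) / 2)
    return ranks

rank_x = average_ranks(X)
rank_y = average_ranks(Y)
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
sum_d2 = sum(di ** 2 for di in d)

n = len(X)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print("rank of X:", rank_x)
print("sum of d^2 =", sum_d2)
print(f"r = {r_s:.2f}")
```

The two values 8 receive rank 6.5 and the two values 5 receive rank 8.5, matching the table.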
Exercise (8)
Ten competitors in a beauty contest are ranked by three judges in the following order

1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7

Use the rank correlation coefficient to discuss which pair of judges have the nearest approach to common tastes in
beauty.

Multiple Linear Regression with two Regressors


$$\hat{Y} = b_0 + b_1X_1 + b_2X_2$$
Examples (8)
A statistician wants to predict the incomes of restaurants using two independent variables: the number of restaurant employees and the restaurant floor area. He collected the following data.

Income (000)   Floor Area (000 sq. ft)   Number of Employees
Y              X1                        X2
30 10 15
22 5 8
16 10 12
7 3 7
14 2 10
Calculate the estimated multiple linear regression equation $\hat{Y} = b_0 + b_1X_1 + b_2X_2$.
Solution
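The normal equations (writing a for $b_0$) are:
$$\sum Y = na + b_1\sum X_1 + b_2\sum X_2$$
$$\sum X_1Y = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1X_2$$
$$\sum X_2Y = a\sum X_2 + b_1\sum X_1X_2 + b_2\sum X_2^2$$
Construction of the table (the last row gives the column totals):
Y X1 X2 X1² X2² X1X2 X1Y X2Y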
30 10 15 100 225 150 300 450
22 5 8 25 64 40 110 176
16 10 12 100 144 120 160 192
7 3 7 9 49 21 21 49
14 2 10 4 100 20 28 140
89 30 52 238 582 351 619 1007
Substituting the sums in the normal equations
5a + 30 b1 + 52 b2 = 89
30 a + 238 b1 + 351 b2 = 619
52 a + 351 b1 + 582 b2 = 1007
Solving simultaneously, we have
a = −1.30, b1 = 0.38, b2 = 1.62
Hence the desired estimated multiple linear regression equation is $\hat{Y} = -1.30 + 0.38X_1 + 1.62X_2$.
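The three normal equations can also be solved numerically, which avoids the drift that appears when b1 and b2 are rounded before the intercept is computed. The sketch below builds the system from the data and solves it with a small Gauss-Jordan routine. The helper names `solve_linear` and `dot` are ours.

```python
def solve_linear(A, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(rhs)
    M = [list(map(float, row)) + [float(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

Y = [30, 22, 16, 7, 14]     # income (000)
X1 = [10, 5, 10, 3, 2]      # floor area (000 sq. ft)
X2 = [15, 8, 12, 7, 10]     # number of employees
n = len(Y)

def dot(u, v):
    """Sum of element-wise products of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))

# Coefficient matrix and right-hand side of the normal equations,
# unknowns ordered (a, b1, b2)
A = [[n,       sum(X1),     sum(X2)],
     [sum(X1), dot(X1, X1), dot(X1, X2)],
     [sum(X2), dot(X1, X2), dot(X2, X2)]]
rhs = [sum(Y), dot(X1, Y), dot(X2, Y)]

a, b1, b2 = solve_linear(A, rhs)
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

The unrounded solution is approximately a = −1.30, b1 = 0.38 and b2 = 1.62.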

Assignment 2
Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co., they have collected information on overhead expenses and units produced at different plants, and want to estimate a regression equation to predict future overhead. (AIOU)
Overhead expenses (Y): 191 170 272 155 280 173 234 116 153 178
Units produced (X): 40 42 53 35 56 39 48 30 37 40

Part (i)
(a) Develop the regression equation of the form Y = a + bX for the cost accountants.
(b) Find standard deviation of regression.
(c) Calculate the coefficient of determination.
Part (ii)

a) Develop the regression equation of the form $Y = a + bX + cX^2$ for the cost accountants.
b) Find standard deviation of regression.
c) Calculate the coefficient of determination.
Hints:
Here Y denotes the overhead expenses and X denotes the number of units produced, so the regression equation to fit is Y = a + bX.
The normal equations are
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$

Columns of the table to construct:
X, Y (observed values of Y), XY, $X^2$, $\hat{Y}$ (estimated values of Y), $Y - \hat{Y}$, $(Y - \hat{Y})^2$, $Y - \bar{Y}$, $(Y - \bar{Y})^2$

Exercise 2
Ten competitors in a contest are graded by three judges in the following order
X :1st Judge 1 4 5 10 3 2 4 8 7 8

Y : 2nd Judge 3 5 8 2 7 10 2 1 3 9

Z : 3rd Judge 5 4 7 8 1 2 3 10 4 7

Use the rank correlation coefficient to discuss which pair of judges has the nearest approach to common tastes in the contest.
Columns of the table to complete (the differences are taken between ranks):
X  Y  Z  Rank of X  Rank of Y  Rank of Z  d = X − Y  d' = X − Z  d'' = Y − Z  d²  d'²  d''²
1 3 5
4 5 4
5 8 7
10 2 8
3 7 1
2 10 2
4 2 3
8 1 10
7 3 4
8 9 7
