In a scatter diagram, points that closely follow a straight line indicate that the two variables are, to some extent,
linearly related. Once a reasonable linear relationship is observed, we usually express it mathematically by a straight-line
equation Y = a + bX, called the linear regression line, where the constants a and b represent the Y-intercept and the slope respectively.
5 Regression and Correlation (2)
Such a regression line has been drawn in the following figure. This linear regression line can be used to predict the value Y
corresponding to any given value X.
Many possible regression lines could be fitted to the sample data, but we choose that particular line which best fits that
data. The best regression line is obtained by estimating the regression parameters by the most commonly used method of least
squares.
Estimation of a Straight Line using the Method of Least Squares
Examples (2) (Weather and Traffic)
Weather and traffic are two everyday occurrences that have inherent randomness. For example, if you live in a cold
climate you know that traffic tends to be more difficult when snow falls and covers the roads. We can create a simple mathematical
model of traffic incidents as a function of snowy weather, based on known data.
In the following table, we have accumulated a record of the number of snow days occurring in a certain locality over the
past 10 years, along with the number of traffic incidents reported to police in the same year. A scatter plot of the data can be used
to visualize the possible correlation.
Snow Days 16 55 43 29 59 42 20 45 30 35
Incidents 5825 11427 9006 5963 11449 8380 5745 9104 6495 6938
The scatter plot is shown below:
We see that there is a general trend to the data, with traffic incidents increasing as the number of snow days increases.
We have added a linear trend line to the data to highlight this relationship. This linear trend is, in fact, a straight line probabilistic
model of the data.
A straight-line probabilistic model is often referred to as a linear regression, $Y = a + bX$.
The normal equations are:
$\sum Y = na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
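The normal equations above come from minimizing the sum of squared vertical deviations of the observed points from the line. A brief derivation is sketched here; the symbol S for that sum of squares is notation introduced for this sketch and does not appear in the original notes.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \\
\frac{\partial S}{\partial a} &= -2\sum (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum Y = na + b\sum X \\
\frac{\partial S}{\partial b} &= -2\sum X_i (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum XY = a\sum X + b\sum X^2 \\
b &= \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2},
  \qquad a = \bar{Y} - b\bar{X}
\end{align*}
```

The last line is simply the solution of the two normal equations for a and b.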
The table constructed for these normal equations is:
        X       Y     X^2        XY
       16    5825     256     93200
       55   11427    3025    628485
       43    9006    1849    387258
       29    5963     841    172927
       59   11449    3481    675491
       42    8380    1764    351960
       20    5745     400    114900
       45    9104    2025    409680
       30    6495     900    194850
       35    6938    1225    242830
Sum   374   80332   15766   3271581
Now the normal equations:
80332 = 10a + 374b
3271581 = 374a + 15766b
By solving, we get a = 2415 and b = 150.2
The regression equation is
Y = 2415 + 150.2 X
Using this regression line Y = 2415 + 150.2X, we can predict the number of incidents for a given number of snow days. For
example, the predicted number of incidents for 40 snow days is 2415 + 150.2(40) = 8423. The estimated number of incidents for 45 snow days
is 9174, whereas the observed number of incidents for 45 snow days is 9104.
Such predictions are valid only when X lies between 16 and 59, the range of the observed data. Extending the model beyond these values may lead
to unreasonable results. The value of b is called the coefficient of regression.
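The calculation above can be checked with a few lines of Python. This is a minimal sketch that solves the two normal equations directly from the table data; the function name `fit_line` is chosen here for illustration and is not part of the notes.

```python
# Least-squares straight line Y = a + bX via the normal equations.
snow_days = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
incidents = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]


def fit_line(xs, ys):
    """Solve sum(Y) = n*a + b*sum(X) and sum(XY) = a*sum(X) + b*sum(X^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line(snow_days, incidents)
print(f"a = {a:.1f}, b = {b:.2f}")
print(f"predicted incidents for 40 snow days: {a + b * 40:.1f}")
```

Because the text rounds a and b before predicting, predictions from the unrounded coefficients can differ from the figures above by about one incident.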
Examples (3)
Fit a parabola $y = ax^2 + bx + c$ in the least-squares sense to the data
X= 10 12 15 23 20
Y= 14 17 23 25 21
Solution
We are given $y = ax^2 + bx + c$.
The normal equations for this curve (with n = 5 observations) are
$\sum Y = a\sum X^2 + b\sum X + 5c$
$\sum XY = a\sum X^3 + b\sum X^2 + c\sum X$
$\sum X^2 Y = a\sum X^4 + b\sum X^3 + c\sum X^2$
       X     Y    X^2     X^3      X^4     XY    X^2Y
      10    14    100    1000    10000    140    1400
      12    17    144    1728    20736    204    2448
      15    23    225    3375    50625    345    5175
      23    25    529   12167   279841    575   13225
      20    21    400    8000   160000    420    8400
Sum   80   100   1398   26270   521202   1684   30648
Substituting the obtained values from the table in normal equations, we have
100 = 1398 a + 80 b + 5c
1684 = 26270 a + 1398 b + 80 c
30648 = 521202 a + 26270 b + 1398 c
On solving, a ≈ -0.0695, b ≈ 3.01, c ≈ -8.73
Hence the required equation is $Y = -0.0695X^2 + 3.01X - 8.73$
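The three normal equations can also be solved numerically. The sketch below builds them from the data and applies Gauss-Jordan elimination with exact fractions, so no external library is assumed; the helper name `power_sum` is introduced here for illustration only.

```python
from fractions import Fraction

xs = [10, 12, 15, 23, 20]
ys = [14, 17, 23, 25, 21]


def power_sum(p, q=0):
    """Return the sum of x**p * y**q over the data."""
    return sum(x ** p * y ** q for x, y in zip(xs, ys))


# Augmented matrix of the normal equations, unknowns ordered (a, b, c).
rows = [
    [power_sum(2), power_sum(1), len(xs), power_sum(0, 1)],
    [power_sum(3), power_sum(2), power_sum(1), power_sum(1, 1)],
    [power_sum(4), power_sum(3), power_sum(2), power_sum(2, 1)],
]
rows = [[Fraction(v) for v in row] for row in rows]

# Gauss-Jordan elimination with exact rational arithmetic.
for i in range(3):
    pivot = rows[i][i]
    rows[i] = [v / pivot for v in rows[i]]
    for j in range(3):
        if j != i:
            factor = rows[j][i]
            rows[j] = [vj - factor * vi for vj, vi in zip(rows[j], rows[i])]

a, b, c = (float(row[3]) for row in rows)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

Note that the coefficient of X^2 is negative, so the fitted parabola opens downward over the range of the data.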
Examples (4) (Linearization)
The curve to be fitted is
$Y = a e^{bX}$
Taking logarithms to base 10 of both sides:
$\log_{10} Y = \log_{10}(a e^{bX})$
$\log_{10} Y = \log_{10} a + \log_{10} e^{bX}$
$\log_{10} Y = \log_{10} a + bX \log_{10} e$
Put $y = \log_{10} Y$, $A = \log_{10} a$ and $B = b \log_{10} e$; we get
$y = A + BX$
Substituting the values of the summations in the normal equations, we get $3.8099 = 6A + 7.5B$
and $10.4555 = 7.5A + 13.75B$
On solving, A = -0.9916, B = 1.3013, that is,
$\log_{10} a = -0.9916$, $b \log_{10} e = 1.3013$
$a = \text{antilog}_{10} A = 0.1019$, $b = B/\log_{10} e = 2.9963$, and the curve is $Y = 0.1019\, e^{2.9963X}$
Note: try the above example using the natural logarithm (ln).
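As a check on the arithmetic, the two normal equations quoted above can be solved and back-transformed in Python. This is a sketch that uses only the summations printed in the text (the underlying data table is not reproduced in these notes); the variable names are chosen here for illustration.

```python
import math

# Normal equations from the text:
#   3.8099  = 6A   + 7.5B
#   10.4555 = 7.5A + 13.75B
n, sum_x, sum_x2 = 6, 7.5, 13.75
sum_y, sum_xy = 3.8099, 10.4555  # sums of log10(Y) and X*log10(Y)

det = n * sum_x2 - sum_x ** 2
A = (sum_y * sum_x2 - sum_x * sum_xy) / det
B = (n * sum_xy - sum_x * sum_y) / det

a = 10 ** A                 # a = antilog10(A)
b = B / math.log10(math.e)  # because B = b * log10(e)
print(f"A = {A:.4f}, B = {B:.4f}, a = {a:.4f}, b = {b:.4f}")
```

A is negative here, which is why a = antilog10(A) comes out smaller than 1.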
Alternatively
$Y = a e^{bX}$
$\ln Y = \ln(a e^{bX})$
$= \ln a + \ln e^{bX}$
$= \ln a + bX \ln e$, so $\ln Y = \ln a + bX$
Put $\ln Y = y$ and $\ln a = A$ (so that $a = e^A$, the antilog of A to base e); we have
$y = A + bX$
Now find the normal equations for this straight line:
$\sum y = nA + b\sum X$
$\sum Xy = A\sum X + b\sum X^2$
Table construction:
X    Y    y = ln Y    X^2    Xy
Exercises
(1) Given below are the data relating to the thermal energy generated in Pakistan during 1981-94. The energy generated is in billion
kWh.
Year 1981 1982 1983 1984 1985 1986 1987
Energy Generated 4.2 5.2 5.1 5.2 6.5 7.3 8.4
Year 1988 1989 1990 1991 1992 1993 1994
Energy Generated 10.8 11.9 14.5 16.1 19.4 19.7 23.0
Fit a straight line to the data. Find the residuals. Plot the residuals and comment on your result.
(2) The following data give the installation of computers in labs at UET. Fit a linear regression equation of the number of computers on years
and give the annual rate of installation.
Year: 2001-2003 2003-2005 2005-2007 2007-2009 2009-2011
No of Computers installed: 139 144 150 154 158
Note
For each situation where the independent variable is a time factor, the values assigned to
2001-2003,… may be taken as 1,2,3,…
(3) A study by the department of transportation on the effect of bus ticket prices on the number of passengers produced the
following results:
X= 2 3 4 5 6
Y= 8.3 15.4 33.1 65.2 127.4
Solution
The curve to be fitted is $Y = a b^X$, or $y = A + BX$, where $A = \log_{10} a$, $B = \log_{10} b$ and $y = \log_{10} Y$.
The normal equations are: $\sum y = 5A + B\sum X$ and $\sum Xy = A\sum X + B\sum X^2$
       X       Y   y = log10 Y   X^2        Xy
       2     8.3        0.9191     4    1.8382
       3    15.4        1.1872     9    3.5616
       4    33.1        1.5198    16    6.0792
       5    65.2        1.8142    25    9.0710
       6   127.4        2.1052    36   12.6312
Sum   20                7.5455    90   33.1812
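The solution above stops at the column totals. A minimal sketch of the remaining step, solving the two normal equations with those totals and converting A and B back to a and b, might look like this (the variable names are chosen for illustration):

```python
# Column totals taken from the table above.
n, sum_x, sum_y, sum_x2, sum_xy = 5, 20, 7.5455, 90, 33.1812

# Solve  sum_y = n*A + B*sum_x  and  sum_xy = A*sum_x + B*sum_x2.
det = n * sum_x2 - sum_x ** 2
A = (sum_y * sum_x2 - sum_x * sum_xy) / det
B = (n * sum_xy - sum_x * sum_y) / det

# Back-transform: A = log10(a), B = log10(b).
a = 10 ** A
b = 10 ** B
print(f"A = {A:.4f}, B = {B:.4f}")
print(f"fitted curve: Y = {a:.3f} * {b:.3f}**X")
```

With these totals b comes out close to 2, which agrees with the data: Y roughly doubles each time X increases by 1.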
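$r^2 = \frac{SSR}{SST} = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$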
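The two forms of the coefficient of determination give the same value when the fitted values come from a least-squares line. The sketch below illustrates this with the snow-days data from Examples (2); the variable names are assumptions made for this illustration.

```python
snow_days = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
incidents = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]

n = len(snow_days)
x_bar = sum(snow_days) / n
y_bar = sum(incidents) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(snow_days, incidents)) / sum(
    (x - x_bar) ** 2 for x in snow_days
)
a = y_bar - b * x_bar
fitted = [a + b * x for x in snow_days]

sst = sum((y - y_bar) ** 2 for y in incidents)               # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)                  # explained variation
sse = sum((y - f) ** 2 for y, f in zip(incidents, fitted))   # unexplained variation

r2_explained = ssr / sst
r2_residual = 1 - sse / sst
print(f"SSR/SST = {r2_explained:.4f}, 1 - SSE/SST = {r2_residual:.4f}")
```

The agreement relies on SST = SSR + SSE, which holds for a least-squares fit with an intercept but not for an arbitrary line.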
Exercise (6)
Calculate the coefficient of determination using both formulas.
Year   R&D expenses (X)   Annual profit (Y)
1st            5                 31
2nd           11                 40
3rd            4                 30
4th            5                 34
5th            3                 25
6th            2                 20
Sum           30                180
Exercise (7)
The curb weight X in hundreds of pounds and the braking distance Y in feet, at 50 miles per hour on dry pavement, were
measured for five vehicles, with the results shown in the table.
X: 25 27.5 32.5 35 45
Y: 105 125 140 140 150
The fitted line for these data is $Y = 66.34 + 1.990X$ and the fitted second-degree parabola is
$Y = -112.5 + 12.61X - 0.1510X^2$, shown in the following figure. Compute the coefficient of determination and interpret its value in
the context of vehicle weight and braking distance.
Examples (6)
An architect wants to determine the relationship between the height (in feet) of a building and the number of stories in
the building. The data for a sample of 10 buildings in a city are shown below. Explain the relationship.
Stories: X 64 54 40 31 45 38 42 41 37 40
Height: Y 841 725 635 616 615 582 535 520 511 485
Correlation
Two variables are said to be correlated if they tend to vary together in some direction. If both variables tend to
increase (or decrease) together, the correlation is said to be direct or positive, e.g. the length of an iron bar increases as
temperature increases. If one variable tends to increase as the other decreases, the correlation is said to be negative or
inverse, e.g. the volume of a gas decreases as the pressure increases.
(1) Correlation measures the STRENGTH of the linear association between paired variables, say X and Y. Regression, on the other hand,
gives the FORM of the linear association that best predicts Y from the values of X.
(2) Correlation is calculated whenever:
o both X and Y are measured on each subject and we want to quantify how strongly they are linearly associated.
o in particular, Pearson's product moment correlation coefficient is used when the assumption that both X and Y
are sampled from normally distributed populations is satisfied,
o or Spearman's rank-order correlation coefficient is used if the assumption of normality is not satisfied.
o correlation is not used when the variables are manipulated, for example, in experiments.
The numerical measure of the strength of the linear relationship between two variables is called the correlation
coefficient. It is usually denoted by r and is defined by
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$, called the Pearson Product Moment Correlation Coefficient.
Alternatively, $r = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{[\sum X^2 - (\sum X)^2/n][\sum Y^2 - (\sum Y)^2/n]}}$
Its range is from -1 to +1.
If r = -1, there is a perfect negative correlation. If r = +1, there is a perfect positive correlation.
It is important to note that r = 0 does not mean that there is no relationship at all, e.g. if all the observed values lie exactly
on a circle, there is a perfect non-linear relationship between the variables.
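The computational form of r is easy to try out. The sketch below (function name `pearson_r` chosen here for illustration) applies it to the snow-days data from Examples (2) and to four points placed symmetrically on a unit circle, an illustrative choice of points that shows r = 0 despite an exact non-linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from the computational formula using raw sums."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


snow_days = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
incidents = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]
r_snow = pearson_r(snow_days, incidents)

# Four points on the circle x^2 + y^2 = 1.
circle_x = [1, 0, -1, 0]
circle_y = [0, 1, 0, -1]
r_circle = pearson_r(circle_x, circle_y)

print(f"snow data: r = {r_snow:.3f}")
print(f"circle:    r = {r_circle:.3f}")
```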
Rank Correlation
Sometimes, the actual measurements of individuals or objects are either not available or accurate assessment is not
possible. They are then arranged in order according to some characteristic of interest. Such an ordered arrangement is called a
ranking and the order given to an individual or object is called its rank. The correlation between two such sets of rankings is called
Rank Correlation.
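For n pairs of ranks, with $d_i$ the difference between the two ranks of the i-th individual, we have
$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$
This also ranges from -1 to +1.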
Note
If two objects or observations are tied (have the same value), say for the fourth and fifth places, then both are given the
mean of ranks 4 and 5, i.e. 4.5.
This situation arises in the following example.
Examples (7)
The following table shows the number of hours studied (X) by a random sample of ten students and their grades in
examination (Y):
X: 8 5 11 13 10 5 18 15 2 8
Y: 56 44 79 72 70 54 94 85 33 65
Calculate Spearman’s rank correlation coefficient.
Solution
We rank the X values by giving rank 1 to the highest value 18, rank 2 to 15, rank 3 to 13, rank 4 to 11, rank 5 to 10, rank
6.5 (mean of rank 6 and 7) to both 8, rank 8.5 (mean of rank 8 and 9) to both 5 and rank 10 to 2. Similarly we rank the values of Y by
giving 1 to the highest value 94, rank 2 to 85, rank 3 to 79, …, and rank 10 to 33 which is the smallest.
Table given below:
X Y Rank of X Rank of Y d_i d_i^2
8 56 6.5 7 - 0.5 0.25
5 44 8.5 9 - 0.5 0.25
11 79 4 3 1.0 1
13 72 3 4 - 1.0 1
10 70 5 5 0.0 0
5 54 8.5 8 0.5 0.25
18 94 1 1 0.0 0
15 85 2 2 0.0 0
2 33 10 10 0.0 0
8 65 6.5 6 0.5 0.25
$\sum d_i^2 = 3$
The value of n is 10.
Hence $r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(3)}{10(10^2 - 1)} = 0.98$
Compare this value with the correlation coefficient for the original values.
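The ranking with ties and the rank correlation formula can be reproduced in Python. This is a minimal sketch; the helper `average_ranks` is a name introduced here, and it follows the convention of this example (rank 1 for the highest value, tied values sharing the mean of their ranks).

```python
def average_ranks(values):
    """Rank with 1 for the largest value; ties get the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1        # rank of the first occurrence
        count = ordered.count(v)            # number of tied observations
        ranks[v] = first + (count - 1) / 2  # mean of first .. first+count-1
    return [ranks[v] for v in values]


hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
grades = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

rank_x = average_ranks(hours)
rank_y = average_ranks(grades)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))

n = len(hours)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(f"sum of d^2 = {sum_d2}, rank correlation = {r_s:.4f}")
```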
Exercise (8)
Ten competitors in a beauty contest are ranked by three judges in the following order
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to discuss which pair of judges have the nearest approach to common tastes in
beauty.
Assignment 2
Cost accountants often estimate overhead based on the level of production. At the Standard Knitting Co. they have
collected information on overhead expenses and units produced at different plants, and want to estimate a regression equation to
predict future overhead. (AIOU)
Overhead expenses (Y): 191 170 272 155 280 173 234 116 153 178
Units produced (X):     40  42  53  35  56  39  48  30  37  40
Part (i)
(a) Develop the regression equation of the form Y = a + bX for the cost accountants.
(b) Find standard deviation of regression.
(c) Calculate the coefficient of determination.
Part (ii)
a) Develop the regression equation of the form $Y = a + bX + cX^2$ for the cost accountants.
b) Find standard deviation of regression.
c) Calculate the coefficient of determination.
Hints:
Here Y denotes the overhead expenses and X denotes the number of units produced, so the regression equation to fit is Y = a + bX.
Normal equations are
$\sum Y = na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
Table columns: X, Y (observed values of Y), XY, X^2, $\hat{y}$ (estimated values of Y), $Y - \hat{y}$, $(Y - \hat{y})^2$
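The table columns listed in the hint can be produced programmatically. The sketch below does so for the snow-days data of Examples (2) rather than the assignment data, and it assumes the common definition of the standard deviation of regression, $s = \sqrt{\sum (Y - \hat{y})^2 / (n - 2)}$; these notes do not state the formula, so check it against the course definition.

```python
import math

xs = [16, 55, 43, 29, 59, 42, 20, 45, 30, 35]
ys = [5825, 11427, 9006, 5963, 11449, 8380, 5745, 9104, 6495, 6938]
n = len(xs)

# Fit Y = a + bX from the normal equations.
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Columns y_hat, Y - y_hat and (Y - y_hat)^2 of the hint table.
y_hat = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, y_hat)]
sse = sum(e ** 2 for e in residuals)

# Assumed definition: divide by n - 2 because two parameters are estimated.
s = math.sqrt(sse / (n - 2))
print(f"sum of squared residuals = {sse:.1f}, s = {s:.1f}")
```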
Exercise 2
Ten competitors in a contest are graded by three judges in the following order
X :1st Judge 1 4 5 10 3 2 4 8 7 8
Y : 2nd Judge 3 5 8 2 7 10 2 1 3 9
Z : 3rd Judge 5 4 7 8 1 2 3 10 4 7
Use the rank correlation coefficient to discuss which pair of judges have the nearest approach to common tastes in contest.
Table columns: X, Y, Z, ranked values of X, ranked values of Y, ranked values of Z, d = X-Y, d' = X-Z, d'' = Y-Z, d^2, d'^2, d''^2
1 3 5
4 5 4
5 8 7
10 2 8
3 7 1
2 10 2
4 2 3
8 1 10
7 3 4
8 9 7