
BAMS1743 QUANTITATIVE METHODS

Chapter 6 Correlation and Regression

Scatter diagram
➢ A plot to illustrate diagrammatically any relationship that
may exist between two variables, namely the dependent
variable (Y) and the independent variable (X).

➢ The scatter diagrams below show some important
patterns that should be looked for when examining a
particular X-Y relationship.

[Figure: five example scatter diagrams of Y against X, showing (i) perfect positive linear correlation, (ii) perfect negative linear correlation, (iii) strong positive linear correlation, (iv) strong negative linear correlation, and (v) no linear correlation.]

Eg 1: The incomes and food expenditures of a sample of seven
households are given below. Construct a scatter diagram
for the data.

Income (X)   Food expenditures (Y)
    35                 9
    49                15
    21                 7
    39                11
    15                 5
    28                 8
    25                 9

[Figure: scatter diagram of Food Expenditures (Y) against Income (X).] (5m)
Comment: Strong positive linear correlation (2m)

Product Moment Correlation Coefficient (r)


➢ A measure of the strength of the linear relationship that
exists between two variables, X and Y, and is denoted by
the letter r.

➢ The value of r lies between –1 and +1 (inclusive), that is
–1 ≤ r ≤ +1.

Value of r            Interpretation
r = –1                Perfect negative linear correlation
–0.99 to –0.7         Strong negative
–0.69 to –0.5         Moderate negative
between –0.5 and 0    Weak negative
r = 0                 No linear correlation
between 0 and 0.5     Weak positive
0.5 to 0.69           Moderate positive
0.7 to 0.99           Strong positive
r = +1                Perfect positive linear correlation
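The interpretation scale above can be sketched as a small lookup function. This is only an illustration: the function name `describe_correlation` is hypothetical, and treating 0.5 and 0.7 as the exact cut-off points (so that a value such as 0.695 counts as moderate) is an assumption, since the scale does not say how values between the printed bands are handled.

```python
def describe_correlation(r: float) -> str:
    """Return the verbal label from the scale above for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    # Assumed cut-offs: 0.5 and 0.7 are treated as exact band boundaries.
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear correlation"


print(describe_correlation(0.9238))  # strong positive linear correlation
print(describe_correlation(-0.17))   # weak negative linear correlation
```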

Formula for Pearson’s product moment correlation coefficient:

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$

Eg 2: The number of sales calls made and the number of units sold
by 10 salespersons are given below. Determine the correlation
coefficient and interpret the value.

Salesperson       A   B   C   D   E   F   G   H   I   J
Sales calls (X)  14  35  22  29   6  15  17  20  12  29
Units sold (Y)   28  66  38  70  22  27  28  47  14  68

Soln:
  X     Y     XY     X²     Y²
 14    28    392    196    784
 35    66   2310   1225   4356
 22    38    836    484   1444
 29    70   2030    841   4900
  6    22    132     36    484
 15    27    405    225    729
 17    28    476    289    784
 20    47    940    400   2209
 12    14    168    144    196
 29    68   1972    841   4624

∑X = 199 (0.5m)   ∑Y = 408 (0.5m)   ∑XY = 9661 (1m)   ∑X² = 4681 (1m)   ∑Y² = 20510 (1m)

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{10(9661) - (199)(408)}{\sqrt{\left[10(4681) - (199)^2\right]\left[10(20510) - (408)^2\right]}} $$ (2m)

$$ = \frac{15418}{\sqrt{(7209)(38636)}} = 0.9238 $$ (1m)

∴ Strong positive linear correlation (1 to 2m)

Eg 3: The data below relate weekly maintenance cost to the
age of ten machines of similar type in a manufacturing
company. Calculate the product moment correlation
between age and cost.

Machine     1    2    3    4    5    6    7    8    9   10
Age (X)     5   10   15   20   30   30   30   50   50   60
Cost (Y)  190  240  250  300  310  335  300  300  350  395

Soln:
  X      Y      XY     X²       Y²
  5    190     950     25    36100
 10    240    2400    100    57600
 15    250    3750    225    62500
 20    300    6000    400    90000
 30    310    9300    900    96100
 30    335   10050    900   112225
 30    300    9000    900    90000
 50    300   15000   2500    90000
 50    350   17500   2500   122500
 60    395   23700   3600   156025

∑X = 300   ∑Y = 2970   ∑XY = 97650   ∑X² = 12050   ∑Y² = 913050

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{10(97650) - (300)(2970)}{\sqrt{\left[10(12050) - (300)^2\right]\left[10(913050) - (2970)^2\right]}} $$

$$ = \frac{85500}{\sqrt{(30500)(309600)}} = 0.8799 $$

Strong positive linear correlation
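The hand calculations in Eg 2 and Eg 3 can be cross-checked with a short script. This is a minimal sketch that applies the summation formula for r directly; the function name `pearson_r` and the variable names are illustrative choices, not part of the notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient from the summation formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Eg 2: sales calls (X) and units sold (Y)
sales_calls = [14, 35, 22, 29, 6, 15, 17, 20, 12, 29]
units_sold = [28, 66, 38, 70, 22, 27, 28, 47, 14, 68]
print(round(pearson_r(sales_calls, units_sold), 4))  # 0.9238

# Eg 3: machine age (X) and weekly maintenance cost (Y)
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]
print(round(pearson_r(age, cost), 4))  # 0.8799
```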



Coefficient of determination (COD)

➢ Unless the correlation r is exactly or very nearly +1, –1 or
0, its meaning or significance is a little unclear.

➢ A more meaningful analysis is available from the square
of the coefficient expressed as a percentage, which is
called the coefficient of determination.

Coefficient of determination = r² × 100%.

➢ Note that the coefficient of determination can never be
negative.

Eg 4: Based on the value of the product moment correlation
coefficient in Eg 2, calculate the coefficient of
determination and interpret the answer.
From Eg 2, X = sales calls, Y = units sold
r = 0.9238
Coefficient of determination, COD
= r² × 100%
= (0.9238)² × 100% = 85.34% (2m)
Interpretation: 85.34% of the variation in units sold (Y)
is explained by the variation in sales calls (X). (2m)
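The arithmetic in Eg 4 is a one-line calculation, sketched below for checking purposes. The function name `coefficient_of_determination` is illustrative only.

```python
def coefficient_of_determination(r: float) -> float:
    """Return r squared expressed as a percentage."""
    return r ** 2 * 100


cod = coefficient_of_determination(0.9238)  # r taken from Eg 2
print(f"{cod:.2f}%")  # 85.34%
```

Because r is squared, the result is the same for r = 0.9238 and r = -0.9238, which is why the coefficient of determination can never be negative.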

Spearman’s Rank Correlation


➢ A measure of correlation for two sets of ordinal (rank)
data.

Rank correlation: $$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

where d = difference between the ranks of each pair
n = number of paired observations

–1 ≤ rs ≤ 1
rs > 0 ⇒ positive correlation
rs < 0 ⇒ negative correlation
rs = 0 ⇒ no association among the ranks

➢ Procedure:

Step 1: Rank the X values (to give the R1 values, also written Rx).
Step 2: Rank the Y values (to give the R2 values, also written Ry).
Step 3: For each pair of ranks, calculate d² = (R1 – R2)² and
then calculate ∑d².
Step 4: The value of the rank correlation can then be found
using the following formula:

$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

Eg 5:

 X    R1     Y    R2    d = (R1 - R2) or (R2 - R1)    d²
23     6   245     4                2                  4
35     5   236     6               -1                  1
18     7   238     5                2                  4
36     4   232     7               -3                  9
41     3   250     2                1                  1
43     2   247     3               -1                  1
48     1   252     1                0                  0
                                                 ∑d² = 20

$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(20)}{7(7^2 - 1)} = 1 - \frac{120}{336} = 0.6429 $$

Interpretation: Positive correlation
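The four-step procedure can be sketched in code using the Eg 5 data. This is a minimal illustration that assumes there are no tied values and that ranks are allocated in descending order (largest value gets rank 1), as in the table above; the helper names `rank_descending` and `spearman_rs` are hypothetical.

```python
def rank_descending(values):
    """Steps 1 and 2: rank distinct values so the largest gets rank 1."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(ranks_x, ranks_y):
    """Steps 3 and 4: sum the squared rank differences and apply the formula."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


x = [23, 35, 18, 36, 41, 43, 48]
y = [245, 236, 238, 232, 250, 247, 252]

r1 = rank_descending(x)  # [6, 5, 7, 4, 3, 2, 1]
r2 = rank_descending(y)  # [4, 6, 5, 7, 2, 3, 1]
print(round(spearman_rs(r1, r2), 4))  # 0.6429
```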

Eg 6: Male and female senior citizens ranked certain prime-time
programs by popularity. The composite rankings are:

Program     Male Ranking   Female Ranking
Football         1               5
Robin            4               1
News             3               2
Our Hero         2               4
Fun Game         5               3

(a) Draw a scatter diagram. Let male ranking be X.
(b) Compute Spearman’s rank correlation. Interpret.

Soln:
a) [Figure: scatter diagram of Female Ranking (Y) against Male Ranking (X).]

b)
R1    R2     d     d²
 1     5    -4     16
 4     1     3      9
 3     2     1      1
 2     4    -2      4
 5     3     2      4
              ∑d² = 34

$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(34)}{5(5^2 - 1)} = 1 - \frac{204}{120} = -0.7 $$

∴ Negative correlation

➢ Notes on the rank correlation procedure

(a) Clearly, if rankings are already given for one or both
sets of bivariate values, steps 1 and 2 in the procedure
would not be necessary.
(b) Ranks are usually allocated in descending order (big to
small), although it is perfectly feasible to allocate them
in ascending order. However, whichever method is selected
must be used for both variables.

(c) If one or more groups of data items have the same
value (known as tied values), the ranks that would
have been allocated separately must be averaged and
this average rank given to each item with this equal
value.
(d) Given a set of numeric bivariate data, both rank and
product moment coefficients can be calculated and in
general slightly different results will be obtained. It
should be understood that the rank coefficient is an
approximation to the product moment coefficient.

Comment on the accuracy of r and rs.

r is more accurate than rs because r uses all the actual data
values in its calculation, while rs uses only their ranks. (1m)
rs is an approximation to r. (1m)

Eg 7: A small sample of individuals revealed the following
scores on an eye perception test (X) and a mechanical
aptitude test (Y). Compute the coefficient of rank
correlation. Interpret.

Subject                   001  002  003  004  005  006  007  008  009  010
Eye perception (X)        805  777  820  628  777  810  805  840  777  820
Mechanical aptitude (Y)    23   62   60   40   70   28   30   42   55   51

Soln:
  X     R1     Y    R2      d       d²
805    5.5    23    10    -4.5    20.25
777    8      62     2     6      36
820    2.5    60     3    -0.5     0.25
628   10      40     7     3       9
777    8      70     1     7      49
810    4      28     9    -5      25
805    5.5    30     8    -2.5     6.25
840    1      42     6    -5      25
777    8      55     4     4      16
820    2.5    51     5    -2.5     6.25

R1 (1m)   R2 (1m)   ∑d² = 193 (2m)

$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(193)}{10(10^2 - 1)} = 1 - \frac{1158}{990} = -0.17 $$ (2m)

Negative correlation (1m)
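The averaging of tied ranks used in Eg 7 can be sketched as follows. This is an illustration under the same convention as the solution table (descending order, tied items share the mean of the ranks they would have occupied); the function name `average_ranks_descending` is hypothetical.

```python
def average_ranks_descending(values):
    """Rank values from largest (rank 1) down, averaging ranks of tied values."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # first rank position this value occupies
        count = ordered.count(v)       # number of items sharing this value
        # Mean of the consecutive positions first, first+1, ..., first+count-1
        ranks.append(first + (count - 1) / 2)
    return ranks


eye = [805, 777, 820, 628, 777, 810, 805, 840, 777, 820]
mech = [23, 62, 60, 40, 70, 28, 30, 42, 55, 51]

r1 = average_ranks_descending(eye)   # 805 -> 5.5, 777 -> 8, 820 -> 2.5
r2 = average_ranks_descending(mech)  # no ties, so ranks are 1 to 10

n = len(eye)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2)        # 193.0
print(round(rs, 2))  # -0.17
```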



Comment on the accuracy of the product moment correlation
(r) and rank correlation (rs).

r is more accurate than rs because r uses all the actual data
values in its calculation, while rs uses only their ranks. (1m)
rs is an approximation to r. (1m)

Least squares method


➢ Regression equation
- A mathematical equation that defines the relationship
between two variables.

➢ Least squares principle
- Determining a regression equation by minimizing the
sum of the squares of the vertical distances between the
actual Y values and the predicted Y values, Y′ or Ŷ.

➢ The regression equation is

$$ \hat{Y} = a + bX $$

where

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} $$ (slope of the line)

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$ (y-intercept)

n = number of paired observations in the sample

Prediction using regression equation


➢ A regression line is used to estimate a value of Y from a
given value of X, based on the present data.
➢ There are two distinct ways to estimate using a regression line.
(1) Interpolation
❖ Estimation carried out within the range of values
given for the independent variable.
❖ Accurate because the X value is within the observed range.

(2) Extrapolation
❖ Estimation based on values of the independent variable
in a region that has not been considered in the
calculation of the regression line.
❖ Not so accurate because the X value is not within the
observed range.

Eg 8: Refer to Eg 3.
(a) Find the least squares regression line of
maintenance cost on age and use this to predict
the maintenance cost for a machine of this type
that is 40 months old.
(b) Plot the regression line.
(c) Predict graphically the maintenance cost for a
40-month-old machine of this type.

Soln:
(a) Least squares regression line of maintenance cost on
age: Ŷ = a + bX, where n = 10, X = age, Y = cost

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{10(97650) - (300)(2970)}{10(12050) - (300)^2} = \frac{85500}{30500} = 2.803 $$

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} = \frac{2970}{10} - 2.803\left(\frac{300}{10}\right) = 212.91 $$

$$ \hat{Y} = 212.91 + 2.803X $$

When X = 40, Ŷ = 212.91 + 2.803(40) = RM 325.03
This estimate is accurate because it is an interpolation, that is,
X = 40 is within the observed range.
(b) Plotting the regression line:
X      5     60
Ŷ    227    381

[Figure: scatter diagram of Cost against Age with the fitted regression line y = 2.8033x + 212.9, R² = 0.7742.]

(c) From the graph, the maintenance cost for a 40-month-old
machine is RM325.
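The least squares calculation in Eg 8 can be reproduced with a short script. This is a sketch that applies the formulas for b and a directly; the function name `least_squares_line` is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                  # y-intercept
    return a, b


# Data from Eg 3: machine age (X) and weekly maintenance cost (Y)
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

a, b = least_squares_line(age, cost)
print(f"Y-hat = {a:.2f} + {b:.3f}X")              # Y-hat = 212.90 + 2.803X
print(f"Predicted cost at X = 40: {a + b * 40:.2f}")  # Predicted cost at X = 40: 325.03
```

The script prints a = 212.90 rather than 212.91 because it keeps b unrounded, whereas the hand calculation rounds b to 2.803 before computing a. The predicted cost agrees to two decimal places.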

Eg 9: A lecturer wants to know the relationship between the
hours of revision and the marks obtained in the final
examination.

Student             Lee  Min  Wong  Peng  Kong  Tan  Wee  Mei  Caroline  Anita
No. hours (X)        40   15    26    35    10   45   34   20        17     30
Marks obtained (Y)   65   42    55    74    35   80   75   38        28     55

(a) Draw a scatter diagram and comment on it.
(b) Determine the coefficient of correlation and
interpret.
(c) Determine the coefficient of determination and
explain.
(d) Determine the regression equation.
(e) Estimate the marks of a student who revises 28
hours.
Soln:
a)

[Figure: scatter diagram of Marks obtained (Y) against No. hours (X).]

Comment: Strong positive linear correlation


b)
  X     Y     XY     X²     Y²
 40    65   2600   1600   4225
 15    42    630    225   1764
 26    55   1430    676   3025
 35    74   2590   1225   5476
 10    35    350    100   1225
 45    80   3600   2025   6400
 34    75   2550   1156   5625
 20    38    760    400   1444
 17    28    476    289    784
 30    55   1650    900   3025

∑X = 272   ∑Y = 547   ∑XY = 16636   ∑X² = 8596   ∑Y² = 32993

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{10(16636) - (272)(547)}{\sqrt{\left[10(8596) - (272)^2\right]\left[10(32993) - (547)^2\right]}} $$

$$ = \frac{17576}{\sqrt{(11976)(30721)}} = 0.9163 $$

Interpretation: Strong positive linear correlation

c) COD = r² × 100% = (0.9163)² × 100% = 83.96%

∴ This means that 83.96% of the variation in marks obtained
(Y) is explained by the variation in number of hours (X).

d) Least squares regression line of marks obtained on number of hours:

$$ \hat{Y} = a + bX $$

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{10(16636) - (272)(547)}{10(8596) - (272)^2} = \frac{17576}{11976} = 1.468 $$

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} = \frac{547}{10} - 1.468\left(\frac{272}{10}\right) = 14.770 $$

$$ \hat{Y} = 14.770 + 1.468X $$

e) When X = 28, Ŷ = 14.770 + 1.468(28) = 55.874 ≈ 56
When X = 60, Ŷ = 14.770 + 1.468(60) = 102.85 ≈ 103

f) Comment on and compare the accuracy of the answers obtained in (e).

When X = 28, the estimate is accurate because it is within the
observed range (from X = 10 to 45).
When X = 60, the estimate is not so accurate because it is not
within the observed range.
The estimate for X = 28 is more accurate than the estimate for X = 60.
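The distinction between interpolation and extrapolation in part (f) can be sketched as a simple range check. This is an illustration only: the function name `predict_with_range_check` is hypothetical, and a and b are the rounded values from part (d).

```python
def predict_with_range_check(a, b, x, observed_xs):
    """Predict Y-hat = a + b*x and label the estimate by where x falls."""
    y_hat = a + b * x
    if min(observed_xs) <= x <= max(observed_xs):
        kind = "interpolation"   # x lies within the observed range
    else:
        kind = "extrapolation"   # x lies outside the observed range
    return y_hat, kind


hours = [40, 15, 26, 35, 10, 45, 34, 20, 17, 30]  # observed X values in Eg 9
a, b = 14.770, 1.468                               # rounded values from (d)

for x in (28, 60):
    y_hat, kind = predict_with_range_check(a, b, x, hours)
    print(x, round(y_hat, 3), kind)
```

Under these assumptions, X = 28 is labelled an interpolation and X = 60 an extrapolation, matching the comments above.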

Computer Application – Using Excel


Example
In Mr. Steve's physical fitness course, several
fitness scores were taken. The following sample is
the number of push-ups and sit-ups done by ten
randomly selected students:

Student        1   2   3   4   5   6   7   8   9  10
Push-ups (X)  27  22  15  35  30  52  35  55  40  40
Sit-ups (Y)   30  26  25  42  38  40  32  54  50  43

Follow the instructions below:

Step 1: Key in the data in an Excel worksheet as shown in Figure 1.

Step 2: Click Data → Data Analysis. Choose Regression.

Step 3: Input Y-Range → Highlight the range of Y values in your worksheet.
Input X-Range → Highlight the range of X values in your worksheet.

Step 4: Labels in first row → Check this box if you had entered the variable
name in your first cell.

Step 5: Output range → Key in one cell destination where your output will be
displayed.

Step 6: OK → When you are done, click ok.

Figure 1

The Summary Output:

1. From the Regression Statistics, you can get the r value by finding the square
root of R Square: r = √0.70465805 ≈ 0.8394 (positive, because the slope b is positive).
2. The Coefficients are the values of a and b.
a = 14.90822536, b = 0.657885317
Hence, Ŷ = 14.9082 + 0.6579X

Scatter Diagram

Step 1: Highlight both columns of data. On the Insert tab, click the
Scatter (X, Y) chart command button. Select the Chart
subtype that doesn’t include any lines as shown in Figure
2.

Figure 2

Step 2: Right-click the x axis or y axis and click Format Axis. On the Format Axis
pane, set the desired Minimum and Maximum bounds as
appropriate. Additionally, you can change the Major units that control
the spacing between the gridlines.

Figure 3

Step 3: Add Axis Titles and a Trendline by clicking the Add Chart
Element Menu.
Figure 4

A plot of the data points (scatter plot) and the fitted regression
line is shown in Figure 5.

[Figure 5: Scatter Diagram of Push-ups and Sit-ups, with Push-ups on the horizontal axis.]
