
CORRELATION AND REGRESSION

3.0 CORRELATION AND REGRESSION


3.1 Introduction

Two variables are sometimes related to each other. A change in one variable may cause a
change in another because of the influence of the first variable on the second. For example,
an increase in the price of sugar may cause the price of certain foods to increase. Higher
sugar prices raise the production costs of these foods, so food manufacturers have to
increase their selling prices. In the same way, an increase in the price of cement may raise
the costs of new houses and other construction works.

The same applies when petroleum prices increase. Petroleum prices affect the prices of other
commodities, products, and services, resulting in a series of chain reactions. Higher fuel prices
cause energy prices to increase, so producers need to increase the selling prices of their
products and services to cover the escalating operating costs. Some businesses will run at a
loss when customers are not willing to pay the higher prices, and profits dip when businesses
are unable to increase prices adequately because of competition. A good example is the tourism
industry. This industry is very much affected when customers cancel their bookings because of
high travel costs. A decrease in the number of tourists in turn affects hotels and other related
industries. Thus, it is not surprising that when world petroleum prices soar, many industries run
into difficulties and stock markets turn bearish.

From the above examples we can conclude two things. First, when an increase in one variable
causes another variable to increase, the two variables are said to have a positive linear
relationship, for example, sugar price and food price. Secondly, an increase in one variable
may cause another variable to decrease, for example, the relationship between energy prices
and companies’ profits. These two variables are said to have a negative relationship.

Correlation analysis is a statistical method used to measure the strength of the relationship
between two variables, while regression analysis is a statistical technique that is used to obtain
the equation relating the variables.

3.2 Scatter Diagram

The first step in determining whether a relationship exists between two variables is to plot a
graph of the data. Normally, the independent variable is placed on the horizontal axis and
the dependent variable on the vertical axis. The paired values are then plotted. This graph
is called a scatter diagram. If the scatter diagram forms a certain pattern (increasing or
decreasing), this indicates that there is a relationship between the two variables. If the scatter
diagram does not show any pattern, or the points are randomly scattered, we can assume that
the two variables do not have a relationship.

If the points of the scatter plot fall on a straight line, we assume that the two variables have a
perfect correlation. If this straight line has a positive gradient (or slope), then we have a perfect
positive relationship (Figure 3.1). However, if the gradient is negative, we have a perfect
negative relationship (Figure 3.2).

Examples of scatter plots and the strength of correlation between two variables are given in
Figures 3.1 to 3.5.

Figure 3.1 Perfect positive correlation

Figure 3.2 Perfect negative correlation

Figure 3.3 No correlation

Figure 3.4 Positive linear correlation

Figure 3.5 Negative linear correlation

Figure 3.6 Non-linear relationship

3.3 Linear Correlation Coefficient

Although a scatter diagram gives us some information on the relationship between two
variables, it does not give a specific measure of the strength of the relationship. The linear
correlation coefficient provides such a measure. Two coefficients are commonly used for this
purpose. The first is Pearson’s product moment correlation coefficient, which is normally
denoted by r. Pearson’s correlation coefficient is used to measure the strength of the
relationship between two variables that are quantitative in nature. This method is not suitable
for qualitative data, especially rank data.

For qualitative data, Spearman’s rank correlation coefficient is used. Spearman’s rank
correlation coefficient (normally denoted by ρ) can also be used for quantitative data. However,
the quantitative data must first be transformed into ranks before this correlation coefficient
can be used, so some accuracy is lost. Pearson’s product moment correlation coefficient is
therefore preferred over Spearman’s rank correlation coefficient for quantitative variables.
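
The text gives no formula for Spearman’s coefficient, so the following Python sketch rests on one standard approach, stated here as an assumption: replace each value by its rank and then apply Pearson’s sum-based formula to the ranks. The function names and the small data set are made up for illustration.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson's r from the sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(a * a for a in x) - sum_x ** 2 / n
    s_yy = sum(b * b for b in y) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

def spearman_rho(x, y):
    """Spearman's coefficient computed as Pearson's r of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up data: y always increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 3))     # computed on the raw values
print(round(spearman_rho(x, y), 3))  # computed on the ranks only
```

Because ranking keeps only the order of the values, the actual distances between observations are discarded, which is the loss of accuracy mentioned above.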

Pearson’s Product Moment Correlation Coefficient

Pearson’s correlation coefficient tells us two aspects of the relationship between two variables.
The sign (- or +) of r identifies the kind of relationship between the two quantitative variables,
and the magnitude of r describes the strength of the relationship.

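The mathematical formula for Pearson’s correlation coefficient r is

\[
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\left[\sum x^2 - \dfrac{(\sum x)^2}{n}\right]\left[\sum y^2 - \dfrac{(\sum y)^2}{n}\right]}}
\]

or

\[
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}
\]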

where r = correlation coefficient
      n = number of observations
      Σxy = sum of the products of x and y
      Σx² = sum of the squares of the values of variable x
      (Σx)² = square of the sum of all the values of variable x
      (Σy² and (Σy)² are defined in the same way for variable y)

The value of the correlation coefficient lies between -1.0 and 1.0. This means that -1.0 ≤ r ≤ 1.0.
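
As a quick check that the two forms of the formula agree, here is a minimal Python sketch. The function names and the data are made up for illustration; only the two formulas themselves come from the text.

```python
from math import sqrt

def _sums(x, y):
    """Return n and the five sums used by both forms of the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    return n, sum_x, sum_y, sum_xy, sum_x2, sum_y2

def pearson_r_sums(x, y):
    """First form: the sums are divided by n inside the formula."""
    n, sum_x, sum_y, sum_xy, sum_x2, sum_y2 = _sums(x, y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

def pearson_r_scaled(x, y):
    """Second form: every term is multiplied through by n."""
    n, sum_x, sum_y, sum_xy, sum_x2, sum_y2 = _sums(x, y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up data (not from the text).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]     # tends to rise with x
y_down = [10, 8, 6, 4, 2]  # falls along an exact straight line

print(round(pearson_r_sums(x, y_up), 4), round(pearson_r_scaled(x, y_up), 4))
print(pearson_r_scaled(x, y_down))
```

The second form avoids the divisions by n, which is convenient for hand calculation. Both forms give the same r, and the falling straight-line data sit at the lower limit of -1.0.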

A correlation coefficient close to -1.0 indicates that the two variables have a strong negative
relationship. A negative relationship means that an increase in one variable is accompanied by
a decrease in the other variable, and vice versa. On the other hand, a value close to 1.0
indicates that the two variables have a strong positive relationship. A positive relationship
means that an increase in one variable is accompanied by an increase in the other variable,
and vice versa.

A correlation equal to or close to zero means that there is no linear relationship between the
two variables. This means that an increase or decrease in the value of one variable is not
matched by any steady increase or decrease in the other variable. However, it does not imply
that the two variables are definitely unrelated. The two variables might be related in a
non-linear way, or they may not have any relationship at all.
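
A small worked illustration may help here. The data are made up for this note and are not from the text: take the five points (x, y) = (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4), which lie exactly on the curve y = x². Then

\[
n = 5, \qquad \sum x = 0, \qquad \sum y = 10, \qquad
\sum xy = (-2)(4) + (-1)(1) + (0)(0) + (1)(1) + (2)(4) = 0
\]

\[
n\sum xy - \sum x \sum y = 5(0) - (0)(10) = 0 \quad\Rightarrow\quad r = 0
\]

The correlation coefficient is zero even though y is completely determined by x, because the relationship is not linear.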

3.4 Simple Regression Line

In a scatter diagram, two axes are drawn. The values of the independent variable 𝑥 are plotted
on the horizontal axis and the values of the dependent variable y are plotted on the vertical
axis. Each paired value of 𝑥 and 𝑦 is then plotted on the graph. This scatter diagram enables
us to assess if there is a relationship between the two variables. The form of the relationship
can be linear, curvilinear, quadratic, or not related at all.

If the scatter diagram indicates a linear relationship, then a best fitting line can be drawn onto
the data. Two methods are normally used. An approximate method is by drawing a best fitting
line through the points plotted in such a way that the line is closest to the points and that line
passes through the point (𝑥̅ , 𝑦̅), where 𝑥̅ , 𝑦̅ are the mean of 𝑥 and 𝑦 respectively.

A more accurate method is the method of least squares. A regression line with a positive
slope indicates that there is a direct relationship between the two variables. This means that
if 𝑥 increases, 𝑦 will increase as well, and vice versa. A negative slope indicates an inverse
relationship between the two variables. This means that if 𝑥 increases, 𝑦 will decrease and if
𝑥 decreases, 𝑦 will increase. Thus 𝑥 and 𝑦 always move in opposite directions. If the slope
is zero, then we say that the two variables are not linearly related.

The linear regression equation for sample data can be written in the form 𝒚 = 𝒂 + 𝒃𝒙. In
the equation, 𝑥 is the independent variable, 𝑦 is the dependent variable, and 𝑎 and 𝑏 are
constants. The constant 𝑎 is the value of 𝑦 where the regression line intersects the 𝑦-axis,
while 𝑏 is the slope of the regression line. The value of 𝑏 can be interpreted as the change
in 𝑦 per unit change in 𝑥. The regression line can be used to forecast the value of
𝑦 for a given value of 𝑥 in the domain. The accuracy of the forecast depends on the strength
of the relationship between the two variables. The stronger the relationship between 𝑥 and 𝑦,
the more accurate the forecast.

The Least Squares Regression Line of 𝒚 on 𝒙

The least squares regression line of 𝑦 on 𝑥 for a set of data has the form 𝑦 = 𝑎 + 𝑏𝑥. By this
method, the best fitting regression line is the line for which the sum of squared deviations
between the estimated and actual values of the dependent variable for the sample data is
minimised (Figure 3.7).

Figure 3.7 Least squares regression line (d1 to d8 mark the deviations of the plotted points
from the line, measured along the y-axis)


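The values of 𝑎 and 𝑏 in the regression line 𝑦 = 𝑎 + 𝑏𝑥 can be calculated using the following
formulae.

\[
b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}
\qquad \text{or} \qquad
b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}
\]

\[
a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right)
\qquad \text{or} \qquad
a = \bar{y} - b\bar{x}
\]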

The coefficient 𝑏 can also be calculated in the following forms.

\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}
\qquad \text{or} \qquad
b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}
\]
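
The formulae for a and b translate directly into a few lines of code. The following Python sketch is illustrative only: the function names and the small data set are made up, and it simply evaluates the sum-based and deviation-based forms of b and checks that nudging the fitted line increases the sum of squared deviations.

```python
def least_squares_line(x, y):
    """Return (a, b) for y = a + b*x using the sum-based formulae."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * (sum_x / n)
    return a, b

def slope_from_deviations(x, y):
    """Slope b from deviations about the means of x and y."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator

def sum_squared_deviations(x, y, a, b):
    """Sum of squared differences between actual y and estimated a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(round(a, 4), round(b, 4), round(slope_from_deviations(x, y), 4))

# Any other line should give a larger sum of squared deviations.
print(sum_squared_deviations(x, y, a, b))
print(sum_squared_deviations(x, y, a + 0.1, b))
print(sum_squared_deviations(x, y, a, b + 0.1))
```

The fitted line also passes through the point of means, since a = ȳ - b x̄ implies ȳ = a + b x̄.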

3.5 Coefficient of Determination (R²)

The coefficient of determination is the ratio of the explained variation to the total variation. It is
normally denoted by 𝑅². For the simple regression line of 𝑦 on 𝑥, the coefficient of determination
is the square of the correlation coefficient, 𝑟. Thus, we can state that

\[
R^2 = \frac{\text{explained variation}}{\text{total variation}} = r^2
\]

For example, if the correlation coefficient 𝑟 = 0.91, then the coefficient of determination is

𝑅² = (0.91)² = 0.828

𝑅² is usually expressed as a percentage. Therefore, 𝑅² = 0.828 means that 82.8% of the total
variation in 𝑦 can be explained by the regression line using the independent variable, 𝑥.
Similarly, if 𝑅² = 0.91, then we say that 91% of the total variation in 𝑦 is explained by the
regression line using the independent variable, 𝑥.
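
The link between the ratio of variations and the square of r can be checked numerically. The sketch below is illustrative: the data are made up, and it assumes the usual definitions, which the text does not spell out, that the explained variation is Σ(ŷ - ȳ)² and the total variation is Σ(y - ȳ)², where ŷ is the value estimated by the regression line.

```python
from math import sqrt

# Made-up illustrative data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and products of deviations about the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line y = a + b*x and its estimated values.
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Assumed definitions: explained = sum((y_hat - y_bar)^2), total = sum((y - y_bar)^2).
explained_variation = sum((yh - y_bar) ** 2 for yh in y_hat)
total_variation = s_yy

r_squared = explained_variation / total_variation
r = s_xy / sqrt(s_xx * s_yy)

print(round(r_squared, 4))  # ratio of explained to total variation
print(round(r ** 2, 4))     # square of the correlation coefficient
```

For this data set both printed values agree, as expected for a simple regression line of y on x.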

Points to Note When Interpreting the Correlation Coefficient

There are certain points to note when examining the relationship between two variables on
the basis of sample data. A correlation equal to or close to zero does not necessarily imply
that the two variables are unrelated. The two variables might be related in a non-linear way.

Pearson’s r measures the strength of the linear relationship, and Spearman’s coefficient
measures how consistently one variable increases or decreases with the other. The values of
both coefficients can therefore be close to zero when the relationship is non-linear, for
example when it rises and then falls.

Also, the existence of a high correlation between two variables does not necessarily explain why
the relationship exists. Specifically, a large correlation coefficient does not mean that one
variable is the sole factor that causes the effect on the second variable, and vice versa.

An example is the increase in the prices of chicken and beef during Hari Raya. Here, it is quite
unlikely that the increase in the price of chicken would trigger an increase in the price of beef,
and vice versa. The two variables (price of chicken and price of beef) are not directly related to
each other. Instead, both price increases are due to higher demand during the Hari Raya
festive season. Therefore, caution has to be taken when interpreting the correlation coefficient.

Example 1

The data below show the number of patients and the operating cost (RM ’00) over 10 months at
Nana Paediatric Hospital.

Month        Number of patients    Operating cost (RM ’00)
January      115                   257
February      42                   130
March         30                   110
April         80                   185
May          108                   220
June          75                   170
July          90                   186
August        25                    75
September     58                   174
October      100                   205

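a) Calculate Pearson’s product moment correlation coefficient for the number of patients and
operating cost. Interpret the meaning of the value obtained.

\[
\sum x = 723, \quad \sum y = 1712, \quad \sum x^2 = 61667, \quad \sum y^2 = 319096, \quad
\sum xy = 138832, \quad n = 10
\]

\[
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\left[\sum x^2 - \dfrac{(\sum x)^2}{n}\right]\left[\sum y^2 - \dfrac{(\sum y)^2}{n}\right]}}
  = 0.963
\]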

Interpret: There is a very strong positive linear relationship between the number of
patients and the operating cost.
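
As a worked check of r = 0.963, the sums can be substituted into the formula step by step. The intermediate figures below are computed from the stated sums for this note and are not part of the original solution.

\[
r = \frac{138832 - \dfrac{(723)(1712)}{10}}
         {\sqrt{\left[61667 - \dfrac{(723)^2}{10}\right]\left[319096 - \dfrac{(1712)^2}{10}\right]}}
  = \frac{15054.4}{\sqrt{(9394.1)(26001.6)}}
  \approx \frac{15054.4}{15628.87}
  \approx 0.963
\]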

b) Calculate the coefficient of determination and interpret its meaning.

\[
R^2 = \frac{\text{explained variation}}{\text{total variation}}
\]

\[
R^2 = (0.963)^2 = 0.9274 = 92.74\%
\]

Interpret: 92.74% of the total variation in operating cost can be explained by the regression
equation using the number of patients as the independent variable.

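c) Find the linear regression equation by using the least squares method.

\[
b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = 1.6025
\]

\[
a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = 55.34
\]

\[
y = a + bx
\]

\[
\therefore\; Y = 55.34 + 1.6025X
\]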
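
The values of b and a can be verified by substituting the sums from part (a). The intermediate figures are computed for this note and rounded.

\[
b = \frac{138832 - \dfrac{(723)(1712)}{10}}{61667 - \dfrac{(723)^2}{10}}
  = \frac{15054.4}{9394.1} \approx 1.6025
\]

\[
a = \frac{1712}{10} - 1.6025\left(\frac{723}{10}\right) = 171.2 - 115.86 \approx 55.34
\]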

d) Estimate the operating cost if the number of patients is 85.

\[
\hat{Y} = 55.34 + 1.6025X
\]

\[
\hat{Y} = 55.34 + 1.6025(85) = 191.5525
\]

∴ The operating cost = 191.5525 × 100 = RM19155.25
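
The whole of Example 1 can be reproduced with a short script. This is a sketch for checking the arithmetic, not part of the original solution; the variable names are made up, and the data are copied from the table above.

```python
from math import sqrt

# Data from the Example 1 table: number of patients (x) and
# operating cost in RM '00 (y) for January to October.
patients = [115, 42, 30, 80, 108, 75, 90, 25, 58, 100]
cost = [257, 130, 110, 185, 220, 170, 186, 75, 174, 205]

n = len(patients)
sum_x = sum(patients)
sum_y = sum(cost)
sum_x2 = sum(x ** 2 for x in patients)
sum_y2 = sum(y ** 2 for y in cost)
sum_xy = sum(x * y for x, y in zip(patients, cost))

# Corrected sums used by both the correlation and regression formulae.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)     # part (a)
r_squared = r ** 2               # part (b)
b = s_xy / s_xx                  # part (c): slope
a = sum_y / n - b * (sum_x / n)  # part (c): intercept
estimate = a + b * 85            # part (d), in RM '00

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 3), round(b, 4), round(a, 2))
print(round(estimate * 100, 2))  # estimated operating cost in RM
```

Because the script carries a and b at full precision instead of rounding them first, its estimate differs from RM19155.25 by a few sen, and its R² is about 92.78% rather than the 92.74% obtained from the rounded r = 0.963.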


Example 2

The following are the MINITAB results on CGPA and starting salaries (in RM’00) of seven
graduates.

Based on the output, answer the following questions.

a) Identify the independent and dependent variable.

Independent: CGPA
Dependent: Salary of graduate

b) Explain the value of the coefficient of determination.

𝑅² = 86.1%
Interpret: 86.1% of the total variation in graduates’ starting salaries can be explained by the
regression equation using CGPA as the independent variable.

c) State the estimated regression line

𝑌̂ = 7.712 + 5.5429𝑋

d) Interpret the value of the slope of the regression line obtained in (c).

Slope: 5.5429
Interpret: When CGPA increases by 1 unit, the starting salary of graduates is estimated to
increase by RM554.29.

e) Estimate the salary obtained when the CGPA is 3.15.

\[
\hat{Y} = 7.712 + 5.5429X
\]

\[
\hat{Y} = 7.712 + 5.5429(3.15) = 25.1721
\]

∴ The salary = 25.1721 × 100 = RM2517.21
