1) Correlation Analysis
Fellow
Dr. Kawal Gill, Associate Professor
Department/College: Shri Guru Gobind Singh College of Commerce
Author
Dr. Madhu Gupta, Associate Professor
College/Department: Janki devi Memorial College of Commerce,
University of Delhi
Reviewer
Dr. Bindra Prasad, Associate Professor
College/ Department: Department of Commerce, Shaheed Bhagat
Singh College, University of Delhi
Table of Contents
(3.1) Correlation Analysis
o 1.1 Introduction
o 1.2 Scatter Diagram
o 1.3 Karl Pearson’s Coefficient Of Correlation
o 1.4 Probable Error Of Correlation Coefficient
o 1.5 Spearman’s rank correlation
o Summary
o Exercise
o References
o Glossary
1.1 Introduction
So far, we have analyzed series involving a single variable (or measurement), such as
height of different people or marks of students or wages of workers. This is referred to
as univariate data, i.e., data involving only one variable. In this Chapter and the next,
we will study ‘bivariate data’ i.e., we will collect and analyze data relating to two
variables (measurements) for each element of the population or sample. For example,
we may
measure height and weight of different persons or we may measure price and demand of
different commodities. Such bivariate data is studied to analyze the relationship between
the two variables. For example, if we study height and weight of different individuals of a
group, we must see whether these two variables have any association or covariation
between them so that if one variable changes the other also changes either in the same
direction or in the opposite direction. We may notice that taller men are usually heavier
and shorter men are usually lighter. Similarly, if we study price and demand of a
commodity we may find, as price increases demand decreases and as price decreases
demand increases. When we observe such phenomena, we say that the two or
more variables are mutually related or correlated. Under correlation, this relationship
between two or more variables is studied.
1.1.1 Meaning
Correlation is one of the most common and most useful statistics. Correlation is an
analysis of covariation between two or more variables. If, with a change in one variable,
the other variable also changes in the same or in the opposite direction, then we say the two
variables are correlated. For example, height and weight, price and demand, cost and
profits are generally correlated.
Correlation is positive when the variables move in the same direction, that is, when one
variable increases the other also increases and when one decreases the other also decreases.
Some examples of positive correlation are
Correlation is negative when the two variables move in opposite directions, that is, an
increase in the value of one variable is accompanied by a decrease in the value of the other
variable, or a decrease in the value of one variable is accompanied by an increase in the value
of the other variable. Some examples of negative correlation are
When we study the relationship between two variables, we call it simple correlation like
studying the relationship between marks in statistics and marks in accountancy of
different students.
When we study relationship among three or more variables, then it is called multiple
correlation. Thus, if we study the relationship between profit, sales, and cost of
production of any product, we call it multiple correlation.
3. Linear and Non-linear Correlation
If the ratio of the amount of change in one
variable to the amount of corresponding change in the other variable is constant,
correlation is said to be linear. In other words, when two variables X and Y form a linear
relationship Y = a + bX, it will be a case of linear correlation. Consider the following
example:
X    3    5    9   12   13
Y    9   13   21   27   29
When X changes from 3 to 5, Y changes from 9 to 13, so the ratio of the two changes is
4/2 = 2. When X changes from 5 to 9, Y
changes from 13 to 21. The ratio of two changes is 8/4 = 2 (same as before). You may
take any combination of the changes in the two variables; the ratio of changes in the
example is always 2. This is linear correlation. The values of X and Y fall on the straight
line Y = 3 + 2X. In the case of linear correlation the pairs of values of X and Y, when
plotted on a graph paper, give a straight line.
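The constant-ratio test for linearity can be checked mechanically. The following Python sketch uses the X and Y values of the example; the helper name change_ratios is an illustrative choice and not part of the source.

```python
# Data from the example above.
X = [3, 5, 9, 12, 13]
Y = [9, 13, 21, 27, 29]

def change_ratios(xs, ys):
    """Return (change in Y) / (change in X) for each pair of consecutive observations."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]

ratios = change_ratios(X, Y)

# The relationship is linear when every ratio equals the first one.
is_linear = all(abs(r - ratios[0]) < 1e-12 for r in ratios)

# Every point should also satisfy Y = 3 + 2X.
on_line = all(y == 3 + 2 * x for x, y in zip(X, Y))

print(ratios, is_linear, on_line)
```

A data set whose ratios differ from one pair of observations to the next would fail this check and would be treated as non-linear (curvilinear).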
However, if the ratio of the amount of change in one variable to the amount of change in
the other variable is not constant, correlation is said to be non-linear or curvilinear.
Correlation between two variables does not necessarily indicate causation i.e., a cause
and effect relationship between the variables. The coefficient of correlation must be thought of
only as a measure of covariation and not as something that proves causation (see web
link 1.1). However, causation will always result in correlation. An observed correlation
The above points make it clear that correlation is only a mathematical relationship, which
implies nothing in itself about cause and effect. Thus, while interpreting the correlation
coefficient it is essential to see if there is any likelihood of any relationship existing
between variables under study. If there were no such likelihood, the observed
correlations would be meaningless.
There are many methods of studying correlation. The most popular ones are:
1. Scatter diagram
2. Pearson’s co-efficient of correlation and
3. Rank method
Under this method, the observed data is plotted on a graph paper taking one variable on
X-axis and other on Y-axis. The scatter of the dots, so plotted, gives the indication
whether the correlation is positive or negative and also an idea about the degree of such
relationship. That is why it is called scatter diagram. The greater the scatter of the
plotted points on the graph, the lesser is the relationship between the two variables. If
the points fall on a straight line having either a positive or a negative slope the
correlation is perfect. If this line runs from bottom left corner to top right, the correlation
is said to be perfect positive but if it runs from top left to bottom right, it is perfect negative
correlation.
Hours of study (X)   18  16  14  12  10   8   6   4   2
Marks (Y)            30  32  28  26  22  24  20  16  18
Solution:
The points plotted are rising from the lower left hand corner to the upper right hand
corner of the graph, which shows that the two variables have positive correlation. As
the points are not on a straight line but are very close to it, there is a high degree of
positive correlation between them.
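Where graph paper is not at hand, a rough character-based scatter diagram can be produced. The sketch below is only an illustration using the hours and marks data of this example; the function name text_scatter and the grid layout are assumptions made for the sketch, and each distinct value gets its own row or column, so the spacing is even here only because the values are evenly spread.

```python
hours = [18, 16, 14, 12, 10, 8, 6, 4, 2]
marks = [30, 32, 28, 26, 22, 24, 20, 16, 18]

def text_scatter(xs, ys):
    """Return one text row per distinct Y value (largest Y first), marking each point with '*'."""
    x_levels = sorted(set(xs))                # columns, left to right
    y_levels = sorted(set(ys), reverse=True)  # rows, top to bottom
    points = set(zip(xs, ys))
    rows = []
    for y in y_levels:
        cells = "".join(" * " if (x, y) in points else " . " for x in x_levels)
        rows.append(f"{y:>3} |{cells}")
    return rows

rows = text_scatter(hours, marks)
for row in rows:
    print(row)
```

The stars drift from the lower left to the upper right of the grid, which is the pattern described in the solution.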
Merits
1. It is a very simple method of studying correlation. It is easy to draw, understand and
interpret a scatter diagram.
2. It is not affected by the values of extreme items.
Limitations
1. It does not give the precise degree of relationship between the variables. It only gives
an idea about the degree of correlation.
2. It is not amenable to mathematical treatment.
The value of correlation coefficient (r) will always lie between -1 and +1. When r = +1, it means
there is a perfect positive correlation between the two variables. When r = -1, it means the
two variables have perfect negative correlation between them. When r = 0, then there is
no correlation between the variables. When r is positive, the correlation is also positive;
when r is negative, correlation is also negative. Thus, Pearson’s coefficient of correlation
tells us both the degree as well as the direction of relationship between two variables.
Sales revenue (in Rs lakhs) X           23  27  28  28  29  30  31  33  35  36
Advertising expenditure (in 000 Rs) Y   18  20  22  27  21  29  27  29  28  29
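As a cross-check on hand calculation, the following Python sketch computes r for the sales and advertising data as ∑xy / √(∑x² ∑y²), where x and y are deviations from the respective means. It assumes this standard deviation form of the direct formula; the function name pearson_r is an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_xx * sum_yy)

sales = [23, 27, 28, 28, 29, 30, 31, 33, 35, 36]        # Rs lakhs (X)
advertising = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]  # Rs thousand (Y)

r = pearson_r(sales, advertising)
print(round(r, 2))
```

For this data the means are 30 and 25, and the sums work out to ∑xy = 123, ∑x² = 138 and ∑y² = 164, so r is about 0.82, a high positive correlation.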
The above discussed formula of finding the coefficient of correlation may be called direct
formula (i.e., the formula that comes directly from the definition of the coefficient of
correlation). The formula can be given several other shapes. Some of the shapes, that
are convenient for calculation purposes, have been adopted as indirect methods.
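The formula for calculating correlation coefficient without taking deviations from mean is
$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}}$$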
The formula is useful when calculations are done using calculator or computer.
Illustration: Calculate coefficient of correlation from the following data:
N = 10, ∑X = 60, ∑Y = 40, ∑X² = 494, ∑Y² = 212 and ∑XY = 288
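The illustration can be worked through with a few lines of Python. This is a minimal sketch that assumes the usual form of the formula without deviations, r = (N∑XY − ∑X∑Y) / [√(N∑X² − (∑X)²) √(N∑Y² − (∑Y)²)]; the function name r_from_sums is hypothetical.

```python
from math import sqrt

def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r computed only from the six summary totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = r_from_sums(n=10, sum_x=60, sum_y=40, sum_x2=494, sum_y2=212, sum_xy=288)
print(round(r, 3))
```

Here the numerator is 2880 − 2400 = 480 and the denominator is √(1340 × 520), so r comes to roughly 0.575.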
        Price        Demand         dx =       dy =       dx²     dy²      dxdy
        (in Rs) X    (in units) Y   X - 104    Y - 148
        100          150            -4         2          16      4        -8
        102          120            -2         -28        4       784      56
        104          130            0          -18        0       324      0
        107          110            3          -38        9       1444     -114
        105          120            1          -28        1       784      -28
        112          120            8          -28        64      784      -224
        105          190            1          42         1       1764     42
        101          250            -3         102        9       10404    -306
Total   836          1190           4          6          104     16292    -582
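The table can be reproduced, and the coefficient obtained, with the short Python sketch below. It assumes the assumed-mean form r = (N∑dxdy − ∑dx∑dy) / [√(N∑dx² − (∑dx)²) √(N∑dy² − (∑dy)²)] with assumed means 104 and 148 as in the table; the function name and the second pair of assumed means are illustrative.

```python
from math import sqrt

price = [100, 102, 104, 107, 105, 112, 105, 101]    # X, in Rs
demand = [150, 120, 130, 110, 120, 120, 190, 250]   # Y, in units

def r_assumed_mean(xs, ys, a, b):
    """Pearson's r using deviations dx = X - a and dy = Y - b from assumed means a and b."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dx2 = sum(d * d for d in dx)
    s_dy2 = sum(d * d for d in dy)
    s_dxdy = sum(p * q for p, q in zip(dx, dy))
    numerator = n * s_dxdy - s_dx * s_dy
    denominator = sqrt(n * s_dx2 - s_dx ** 2) * sqrt(n * s_dy2 - s_dy ** 2)
    return numerator / denominator

r = r_assumed_mean(price, demand, a=104, b=148)
# A different, arbitrary pair of assumed means should give the same r.
r_other = r_assumed_mean(price, demand, a=100, b=150)
print(round(r, 2), round(r_other, 2))
```

With the table totals this is (8 × (−582) − 4 × 6) / √(816 × 130300), about −0.45, and the value does not depend on which assumed means are chosen.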
When the data in the bivariate frequency distribution is large it can be classified into a
bivariate frequency table (or correlation table as it is generally called); taking one
variable in the column headings and the other in the stubs. In between the table, the
frequencies are specified. Similar formulae for calculating correlation coefficient may be
deduced.
Direct method
Indirect methods
Here f is the corresponding frequency and N = ∑f. The meaning of other symbols is the
same as before. The last formula is the most useful for calculation purposes.
Illustration: Find Karl Pearson’s coefficient of correlation from the following data of
income (in 000Rs) and savings (in 000 Rs).
5 - 1 - - - 1
6 2 - 3 1 - 6
7 - 5 2 1 2 10
8 - - 2 2 1 5
9 - - - 2 1 3
Total 2 6 7 6 4 25
(Note: figures in boxes are the product of cell frequency and corresponding dx and dy,
giving rise to fdxdy of the cell.)
We have ∑f = 25, ∑fdx = 4, ∑fdx² = 36, ∑fdy = 3, ∑fdy² = 27 and ∑fdxdy = 17
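From these totals the coefficient follows directly. The sketch below assumes the frequency-weighted step-deviation form r = (N∑fdxdy − ∑fdx∑fdy) / [√(N∑fdx² − (∑fdx)²) √(N∑fdy² − (∑fdy)²)]; the variable names are illustrative.

```python
from math import sqrt

# Totals quoted in the illustration above.
N, s_fdx, s_fdx2, s_fdy, s_fdy2, s_fdxdy = 25, 4, 36, 3, 27, 17

numerator = N * s_fdxdy - s_fdx * s_fdy
denominator = sqrt(N * s_fdx2 - s_fdx ** 2) * sqrt(N * s_fdy2 - s_fdy ** 2)
r = numerator / denominator
print(round(r, 2))
```

The numerator is 425 − 12 = 413 and the denominator is √(884 × 666), which gives r of about 0.54.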
Great care must be exercised in interpreting the value of Pearson’s correlation coefficient
otherwise fallacious conclusions can be drawn. The following general rules will be helpful
while interpreting the value of ‘r’:
4. There are no set guidelines for the interpretation of the values of r, which lies
between +1 and -1. The maximum we can conclude is that the closer the r to +1
or -1, the closer is the relationship between the variables.
-1 ≤ r ≤ +1
r = ±√(bxy · byx), i.e., r² = bxy · byx, where r takes the common sign of the two regression coefficients
(regression coefficients are discussed in the next chapter)
Illustration: (i) Compute the correlation coefficient between the corresponding values of
cost and revenue given in the following table.
Cost (in 000 Rs) X       2    4    5    6    8   11
Revenue (in 000 Rs) Y   18   12   10    8    7    5
(ii) Multiply each value of cost in the table by 4 and add 2. Multiply each value of revenue
in the table by 2 and subtract 5. Find the correlation coefficient between two new sets of
values. Explain why you do or do not obtain the same result as in (i).
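        X     Y     x = X - 6   y = Y - 10   x²    y²     xy
        2     18    -4          8            16    64     -32
        4     12    -2          2            4     4      -4
        5     10    -1          0            1     0      0
        6     8     0           -2           0     4      0
        8     7     2           -3           4     9      -6
        11    5     5           -5           25    25     -25
Total   36    60    0           0            50    106    -67
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{-67}{\sqrt{50 \times 106}} = -0.92$$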
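(ii) Let us define new cost data as X′ and new revenue data as Y′ so that
X′ = 4X + 2 and Y′ = 2Y – 5
Calculation of Karl Pearson’s coefficient of correlation between X′ and Y′
The deviations from the new means are x′ = 4x and y′ = 2y, so that ∑x′y′ = 8(-67) = -536,
∑x′² = 16(50) = 800 and ∑y′² = 4(106) = 424. Hence
$$r = \frac{\sum x'y'}{\sqrt{\sum x'^2 \sum y'^2}} = \frac{-536}{\sqrt{800 \times 424}} = -0.92$$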
The correlation between new series of cost and revenue is same as between the original
series of cost and revenue. This is because the correlation coefficient is independent of
the change in scale and origin.
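The independence of r from origin and scale can be confirmed numerically. The following Python sketch recomputes r for the original cost and revenue data and for the transformed series 4X + 2 and 2Y − 5; the function name pearson_r is an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

cost = [2, 4, 5, 6, 8, 11]        # X, in Rs thousand
revenue = [18, 12, 10, 8, 7, 5]   # Y, in Rs thousand

# Change of scale and origin as in part (ii).
new_cost = [4 * x + 2 for x in cost]
new_revenue = [2 * y - 5 for y in revenue]

r_original = pearson_r(cost, revenue)
r_transformed = pearson_r(new_cost, new_revenue)
print(round(r_original, 2), round(r_transformed, 2))
```

Both values come to about −0.92. The multipliers 4 and 2 are both positive here; if exactly one of the series were multiplied by a negative number, the magnitude of r would be unchanged but its sign would reverse.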
2. The two variables are affected by a number of factors so that they give rise to a
normal frequency distribution. For example, height, weight, age, demand, sales etc. are
variables which are affected by not one factor but many independent causes and thus
form a normal distribution.
3. There is a cause and effect relationship between the two variables. If variables are
independent of each other, there cannot be any relationship between them. For example,
there is no relationship between size of the shoe and income of a person even if r
calculated on the basis of a sample comes out to be a very significant figure. Such
correlations are called nonsense or spurious correlation. Sometimes two variables are
affected by a third variable which may give rise to such spurious correlation. For
example, sale of ice cream and sale of cold drinks are related to weather conditions of
the area. They may show a positive correlation but they are not related to each other.
Merits
Limitations
1. Compared to other methods, this method takes more time to compute the value of
coefficient of correlation.
2. The value of correlation coefficient is unduly affected by the presence of extreme
values.
3. It is based on a large number of assumptions (like linear relationship, normality
of the distributions, cause and effect relationship) which may not always hold well.
4. It is very much likely to be misinterpreted (see web link 1.4).
One of the measures, which help in interpreting the value of correlation coefficient is its
probable error. It helps in testing the reliability of an observed value of r so far as it
depends upon the condition of random sampling. It is an amount, which when added and
subtracted from the correlation coefficient, produces limits within which the population
coefficient of correlation will have 50% chance to lie. Probable error denoted by P.E., is
given by the following formula
$$\text{P.E.} = 0.6745 \times \frac{1 - r^2}{\sqrt{N}}$$
Where r is the correlation coefficient of the random sample and N is the number of pairs
of observations in the sample.
Probable error tells us the limit within which the various values of r of the various
samples taken out of the entire group will vary. By adding and subtracting the value of
the probable error from correlation coefficient, we get the limit within which correlation
coefficient of the population will have 50% chance to lie. Symbolically, ρ = r ± P.E.,
where ρ denotes the correlation coefficient of the population.
The following conditions must be fulfilled for the use of probable error:
If the above conditions are not satisfied, the use of probable error may lead to fallacious
conclusions.
The value (1 − r²)/√N in the formula of probable error is the standard error of correlation. The
standard error gives a measure of how well a sample represents the population. When
the sample is representative, the standard error will be small. The division by the square
root of the sample size is a reflection of the speed with which an increasing sample size
gives an improved representation of the population. The reason for taking the factor 0.6745 in
probable error is that in a normal distribution 50% of the observations lie in the range
μ ± 0.6745σ, where μ is the mean and σ is the standard deviation.
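The calculation of probable error and of the corresponding limits can be sketched as follows, assuming the formula P.E. = 0.6745 (1 − r²)/√N. The values r = 0.8 and N = 16 are hypothetical and are chosen only to illustrate the arithmetic; they do not come from the text.

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

# Hypothetical sample values chosen only for illustration.
r, n = 0.8, 16

pe = probable_error(r, n)
lower, upper = r - pe, r + pe   # limits r - P.E. and r + P.E.
print(round(pe, 4), round(lower, 4), round(upper, 4))
```

For these assumed values P.E. = 0.6745 × 0.36 / 4, about 0.0607, so the limits are roughly 0.739 and 0.861. A larger sample shrinks the probable error in proportion to 1/√N.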
Sometimes, the phenomenon under study cannot be measured quantitatively but can be
serially arranged or ranked. For example, we can rank a group of 10 girls on the basis of
beauty, intelligence or honesty, but their quantitative measurement on the basis of such
attributes is not possible. We can find the coefficient of correlation of ranks by using Karl
Figure 1.24: Charles Edward Spearman (September 10, 1863 - September 7, 1945)
$$r_s = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$
Where rs stands for Spearman’s coefficient of correlation for ranks, D for difference of
ranks between paired items in the two series and N for the number of items.
The formula has been derived from the Karl Pearson’s formula and is applicable only to
ranks. Ranks can also be assigned to the phenomena measured quantitatively.
Here each observation in both the series is given a rank according to its size. We can
rank the values either way, from the smallest to largest value or from the largest to
smallest value, but the same way it is to be done for both the series. If the smallest item
is given rank ‘1’ in series one, the smallest value in the series two should be given rank
‘1’. Under ranking method, original values are not taken into account therefore, it gives
only approximate results. The value of rs is a pure number and here also it varies from -1
to +1. It is interpreted the same way as Pearson’s coefficient of correlation i.e.
If rs = 0, there is no correlation,
If rs = +1, there is perfect positive correlation, and
If rs = -1, there is perfect negative correlation between the two variables.
1. Where items cannot be measured in quantitative terms, but they can be arrayed
or ranked, according to some variable attribute, such as beauty, intelligence and
honesty.
2. Where the variables do not show a linear relationship.
3. Where a variable departs markedly from normality. Karl Pearson’s correlation
coefficient assumes that the parent population, from which sample is drawn, is
normal. Spearman’s Rank correlation coefficient is distribution free (or non-
parametric), i.e., it does not make any assumption about the parameters of the
population.
4. Where data are irregular or extreme items are erratic or inaccurate and may
influence the value of r considerably.
5. Where the sample is very small and we want to have a rough estimate of the degree
of correlation without going through the lengthy calculations.
Two cases arise: (a) when no two items are tied for the same rank, and (b) when two or more
items are tied for the same rank.
(a) When no two items are tied for the same rank
(1) When ranks are given
In this situation, we first compute the difference of ranks and then apply the formula to
get the value of correlation coefficient (rs).
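Illustration: In a beauty contest, two judges gave the following ranks to 10 girls.
Girls                 A   B   C   D   E   F   G   H   I   J
Ranks by 1st judge    1   5   4   8   9   6  10   7   3   2
Ranks by 2nd judge    4   8   7   6   5   9  10   3   2   1
Girl    R1    R2    D² = (R1 – R2)²
A       1     4     9
B       5     8     9
C       4     7     9
D       8     6     4
E       9     5     16
F       6     9     9
G       10    10    0
H       7     3     16
I       3     2     1
J       2     1     1
Total               74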
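The remaining step of this illustration can be sketched in Python, assuming the formula rs = 1 − 6∑D² / [N(N² − 1)]. The function name spearman_from_ranks is hypothetical.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rank correlation for untied ranks: 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_1 = [1, 5, 4, 8, 9, 6, 10, 7, 3, 2]
judge_2 = [4, 8, 7, 6, 5, 9, 10, 3, 2, 1]

sum_d2 = sum((a - b) ** 2 for a, b in zip(judge_1, judge_2))
rs = spearman_from_ranks(judge_1, judge_2)
print(sum_d2, round(rs, 2))
```

With ∑D² = 74 and N = 10, rs = 1 − 444/990, about 0.55, which indicates a moderate degree of agreement between the two judges.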
When we are given the actual data and not the ranks, we assign ranks to different values
of the variables by taking either the highest value as 1 or the lowest value as 1.
However, whether we start with the lowest value or the highest value, we must follow
the same method in both the series.
Illustration: The following data relate to the marks obtained by 10 students of a class in
statistics and costing.
Statistics   39  65  62  90  82  75  25  98  36  78
Costing      47  53  58  86  62  68  60  91  51  84
X     Y     Rx    Ry    D² = (Rx – Ry)²
39    47    8     10    4
65    53    6     8     4
62    58    7     7     0
90    86    2     2     0
82    62    3     5     4
75    68    5     4     1
25    60    10    6     16
98    91    1     1     0
36    51    9     9     0
78    84    4     3     1
                        ∑D² = 30
$$r_s = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 30}{10 \times 99} = 1 - 0.182 = 0.82$$
Rank coefficient of correlation is 0.82; this means there is high positive correlation between the
given variables.
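The ranking step can also be automated. The sketch below assigns rank 1 to the highest mark in each subject, as in the table, and then applies rs = 1 − 6∑D² / [N(N² − 1)]. It assumes there are no tied values; the function name ranks_descending is illustrative.

```python
def ranks_descending(values):
    """Rank distinct values so that the largest value gets rank 1."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

statistics_marks = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
costing_marks = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

rx = ranks_descending(statistics_marks)
ry = ranks_descending(costing_marks)

n = len(rx)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(rx, ry, sum_d2, round(rs, 2))
```

Ranking both series from the lowest value instead would give the same differences D and therefore the same rs, provided the same direction is used for both series.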
b) When two or more items are tied for the same rank
When two or more values of a series have same magnitude, it is necessary to rank them as equal.
In such cases, the rank assigned to these equal values is the average of the ranks, which these
values would have got had they slightly differed from each other. For example, if three values
stand for the 5th position in a series, they would have got 5th, 6th and 7th positions had they been
slightly different. Thus, all the three will get rank (5 + 6 + 7)/3 = 6 for the purpose of calculating
coefficient of rank correlation and the next item will be ranked 8.
When equal ranks are assigned to two or more items in a series, an adjustment is made in the above formula
for calculating the rank correlation coefficient. The adjustment consists of adding (m³ − m)/12 to the
value of ∑D², where m stands for the number of items with common rank. This value is added as
many times as the number of groups having equal ranks in the two series. The formula for
Spearman’s correlation coefficient can, thus, be written as
$$r_s = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{N(N^2 - 1)}$$
Marks in statistics   10  15  23   7  35  55  15  75
Pocket money (Rs)     35  25  45  25  15   5  25  55
Solution: let X denote the marks in statistics and Y denote the Pocket money; and assign ranks to
different values of
X and Y series by taking the lowest value as 1.
X     Y     Rx     Ry    D² = (Rx – Ry)²
10    35    2      6     16
15    25    3.5    4     0.25
23    45    5      7     4
7     25    1      4     9
35    15    6      2     16
55    5     7      1     36
15    25    3.5    4     0.25
75    55    8      8     0
                         ∑D² = 81.5
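$$r_s = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2)\right]}{N(N^2 - 1)}$$
Here ∑D² = 81.5 and N = 8; in series X there are two items having equal ranks (3.5), therefore m₁
= 2, and in series Y there are three items having equal rank (4), therefore m₂ = 3. Putting values,
we get
$$r_s = 1 - \frac{6\left[81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right]}{8(8^2 - 1)} = 1 - \frac{6(81.5 + 0.5 + 2)}{504} = 1 - \frac{504}{504} = 0$$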
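The tie-adjusted calculation can be sketched in Python as follows. The sketch assumes average ranks for tied values (lowest value ranked 1) and a correction of (m³ − m)/12 added to ∑D² for each group of m tied items; the function names are illustrative.

```python
from collections import Counter

def average_ranks(values):
    """Rank with the smallest value as 1; tied values share the average of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    """Spearman's rs with the tie adjustment added to the sum of squared rank differences."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))

marks = [10, 15, 23, 7, 35, 55, 15, 75]
pocket_money = [35, 25, 45, 25, 15, 5, 25, 55]

rs = spearman_with_ties(marks, pocket_money)
print(average_ranks(marks), average_ranks(pocket_money), rs)
```

For this data the adjusted sum is 81.5 + 0.5 + 2 = 84, and 6 × 84 equals N(N² − 1) = 504, so rs works out to zero: the ranks show no correlation between marks in statistics and pocket money.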
Merits
Limitations
Summary
Under Scatter diagram method the observed data are plotted on a graph paper taking one
variable on X-axis and other on Y-axis. The scatter of the dots so plotted gives the
indication whether the correlation is positive or negative and also an idea about the degree
of such relationship.
Karl Pearson’s method of calculating coefficient of correlation is the most widely used
method in practice. Also known as product moment coefficient of correlation, it measures
the intensity or magnitude of linear relationship between two variable series on the basis of
covariance of the concerned variables. Coefficient of correlation is a pure number without
any unit.
Correlation coefficient is independent of change of origin and scale of the variable X and Y.
The value of correlation coefficient will always lie between -1 and +1. Pearson’s coefficient of
correlation tells us both the degree as well as the direction of relationship between two
variables.
Probable error helps in testing the reliability of an observed value of r so far as it depends
upon the condition of random sampling. It is an amount which when added to and
subtracted from the correlation coefficient produces limits within which the population
coefficient of correlation will have 50% chance to lie.
The standard error gives a measure of how well a sample represents the population. When
the sample is representative, the standard error will be small.
Where items cannot be measured in quantitative terms or where the variables do not show
a linear relationship or depart markedly from normality, we apply a method based on ranks
for measuring correlation.
Exercise
1.1 What is time series? What are its important components? Give an example of each component.
1.2 What is meant by analysis of time series? Discuss its importance in business and economics.
1.3 Explain cyclical variations in a time series. How do seasonal variations differ from them?
1.4 What is secular trend? How does it differ from other short term variations in a time series
data?
1.5 Explain briefly the additive and multiplicative models of time series. What are their underlying
assumptions? Which of these models is more popular in practice and why?
References
Berenson and Levine, "Basic Business Statistics: Concepts and Applications", Prentice Hall.
Chou, Ya-lun, "Statistical Analysis", Holt, Rinehart and Winston, New York.
Croxton and Cowden, "Applied General Statistics", Prentice Hall, London.
David P. Doane & Lori E. Seward, "Applied Statistics for Business and Economics", Tata
McGraw Hill Publishing Co. Ltd.
Dhingra, I.C., and M.P. Gupta, "Lectures in Business Statistics", Sultan Chand
Douglas A. Lind, William G. Marshal & Samuel A. Wathen, "Statistics techniques for business
and economics", Tata McGraw Hill Publishing Co. Ltd.
Frank, Harry and Steven C. Althoen, "Statistics: Concepts and Applications", Cambridge
Low-priced Editions, 1995.
Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House.
Gupta, S.P., and Archana Gupta, "Statistical Methods", Sultan Chand and Sons, New Delhi.
Kakkar N.K. & Vohra N.D. "Statistics-an introductory analysis" jnanada prakashan
Levin, Richard and David S. Rubin, "Statistics for Management", 7th Edition, Prentice Hall of
India.
Sharma J.K., "Business Statistics", second edition, Pearson Education.
Srivastava T.N. & Shailja Rego, "Statistics for Management", Tata McGraw Hill Publishing
Co. ltd.
Siegel, Andrew F., "Practical Business Statistics", International Edition (4th Ed.), Irwin
McGraw Hill.
Spiegel M.D., "Theory and Problems of Statistics", Schaum’s Outlines Series, McGraw Hill
Publishing Co.
Yule and Kendall, "An Introduction to the Theory of Statistics", Charles Griffin & Co., London
Web Links
1.1 http://en.wikipedia.org/wiki/Time_series
1.2 http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc41.htm
1.3
http://mathematics.nayland.school.nz/Year_13_Stats/3.1_timeseries/Time_Series_home.htm#top
1.4
http://mathematics.nayland.school.nz/Year_13_Stats/3.1_timeseries/3_seasonal_variation.htm
Glossary
Additive model: A model for the decomposition of time series which assumes that the four
components of time series interact in additive fashion in order to produce the observed values.
Analysis of time series: Analysis of figures comprising a time series for evaluation and
forecasting.
Cyclical variations: Regular but not uniformly periodic short term fluctuations in a time series
caused by business cycles.
Irregular variations: Short term variations in a time series which are purely random and are
the result of unforeseen and unpredictable forces.
Multiplicative model: A model for the decomposition of time series which assumes that the
four components of a time series interact in multiplicative fashion to produce the observed
values.
Seasonal variations: Short term fluctuations in a time series which occur regularly and
periodically within a period of less than one year.
Time series: Data collected over a period of time at regular intervals, and arranged in
chronological sequence.
Trend: General tendency of the data to increase or to decrease or to remain constant over a
long period of time.