
# Correlation & Regression

Prof. G.R.C. Nair

Correlation Analysis

• Correlation analysis is a statistical technique used to measure the strength of the association between two variables.
• It is useful for predicting future business scenarios.

Scatter Diagram

• The Dependent Variable is the variable being predicted or estimated.
• The Independent Variable provides the basis for estimation; it is the estimator.
• A Scatter Diagram is a chart that portrays the relationship between the two variables.

This scatter plot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y). x-axis: Advertising, 0 to 50; y-axis: Sales, 0 to 140.]

Direct Linear

• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.

Inverse Linear

[Figure: scatter plot of Y against X]

Direct Nonlinear

[Figure: scatter plot of Y against X]

• No association / No correlation

• Correlated?

[Figure: scatter plot of Y against X]

Perfect Negative Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]

Perfect Positive Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]

Zero Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]

Strong Positive Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]

Nature of Correlation

Correlation can be

• Positive or Negative
• Linear or Nonlinear
• Perfect / Strong / Weak

Coefficient of Correlation, r

• Karl Pearson's Coefficient of Correlation (r) is a measure of the strength of the linear relationship between two variables.
• It requires interval or ratio-scaled data.
• It can range from -1.00 to 1.00.
• Values of -1.00 or 1.00 indicate perfect correlation, and values close to them indicate strong correlation.
• Values close to 0.0 indicate weak correlation.
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.

Formula for r

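The coefficient of correlation is calculated from the following formulae, where $\bar{X}$ and $\bar{Y}$ are the sample means and $s_x$, $s_y$ are the sample standard deviations:

$$r = \frac{\operatorname{Cov}(X, Y)}{s_x s_y}, \qquad \operatorname{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$$
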
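The two forms of r above can be checked against each other with a minimal Python sketch. The function names and the small data set are illustrative assumptions, not part of the slides; the data lie on exact straight lines so that r comes out as +1 and -1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r from the deviation form: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pearson_r_from_cov(xs, ys):
    """r from the covariance form: Cov(X, Y) / (s_x * s_y), using n - 1 divisors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

# Illustrative data (not from the slides): points on exact straight lines.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # y = 2x, a perfect direct relationship
y_down = [10, 8, 6, 4, 2]  # y = 12 - 2x, a perfect inverse relationship

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
print(pearson_r_from_cov(x, y_up), pearson_r_from_cov(x, y_down))
```

The n - 1 divisors cancel between numerator and denominator, which is why both forms give the same value.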
Coefficient of Determination

• The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for (not necessarily caused) by the variation in the independent variable (X).
• It is the square of the coefficient of correlation and ranges from 0 to 1.
• It does not give any information on the direction of the relationship between the variables.
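As a rough check of the "proportion explained" reading, the sketch below computes r² in two ways for the textbook data of Example 1 later in this deck: as the square of r, and as 1 minus the share of variation left unexplained by the least-squares line. The variable names are assumptions, and the slope is computed as Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², which is algebraically equivalent to the formula for b on the regression slides.

```python
from math import sqrt

# Pages (X) and price in dollars (Y) of the eight textbooks in Example 1.
pages = [500, 700, 800, 600, 400, 500, 600, 800]
price = [84, 75, 99, 72, 69, 81, 63, 93]

n = len(pages)
mean_x = sum(pages) / n
mean_y = sum(price) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pages, price))
sxx = sum((x - mean_x) ** 2 for x in pages)
syy = sum((y - mean_y) ** 2 for y in price)  # total variation in Y

r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Fit the least-squares line, then measure the variation it leaves unexplained.
b = sxy / sxx
a = mean_y - b * mean_x
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(pages, price))
explained_share = 1 - unexplained / syy

print(f"r^2 = {r_squared:.3f}, explained share = {explained_share:.3f}")
```

Both figures come out at about 0.377, so roughly 38% of the variation in price is accounted for by the number of pages.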
Rank Correlation
• Charles Spearman's Rank Correlation Coefficient (R) is used to measure the degree of correlation between two qualitative variables, such as honesty, beauty, talent for singing or a gift for dancing, which cannot be measured directly.
• In this case the items are ranked serially, and the correlation between the ranks is calculated as

$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$

where D is the difference between the two ranks of the same item and N is the number of items ranked.
Regression

• In regression analysis we use the independent variable (x) to estimate the dependent variable (y).
• When the relationship between the variables is linear, it is called linear regression.
• Both variables must be at least interval scale.
• […] determine the equation.


Least Square Regression

The linear regression equation is:
y' = a + bx, where:
• y' is the average predicted value of the dependent variable for any value of x.
• a is the Y-intercept. It is the estimated y value when x = 0.
• b, the regression coefficient, is the slope of the line, or the average change in y for each change of one unit in x.
Regression Equation

• The least squares principle is used to obtain a and b.
• It leads to the equations ΣY = na + bΣX and ΣXY = aΣX + bΣX², which give:

$$b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}, \qquad a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$

Example -1

• Dan Ireland, the student body president at Toledo State University, is concerned about the cost to students of textbooks. He believes there is a relationship between the number of pages in the text and the selling price of the book. To provide insight into the problem he selects a sample of eight textbooks currently on sale in the bookstore. Draw a scatter diagram. Compute the correlation coefficient.

| Book | Pages (X) | Price (\$) (Y) | X − X̄ | Y − Ȳ |
|---|---|---|---|---|
| Intro to History | 500 | 84 | | |
| Basic Algebra | 700 | 75 | | |
| Intro to Psychology | 800 | 99 | | |
| Intro to Sociology | 600 | 72 | | |
| Bus. Management | 400 | 69 | | |
| Intro to Biology | 500 | 81 | | |
| Fund. of Jazz | 600 | 63 | | |
| Principles of Nursing | 800 | 93 | | |
| Total | ΣX = 4,900 | ΣY = 636 | | |

Answer: r = 0.614
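The working columns of the table can be filled in with a short Python sketch that then applies the deviation formula for r. The variable names are assumptions; the data are the eight books listed above.

```python
from math import sqrt

# (title, pages X, price Y in dollars) for the eight sampled textbooks.
books = [
    ("Intro to History", 500, 84),
    ("Basic Algebra", 700, 75),
    ("Intro to Psychology", 800, 99),
    ("Intro to Sociology", 600, 72),
    ("Bus. Management", 400, 69),
    ("Intro to Biology", 500, 81),
    ("Fund. of Jazz", 600, 63),
    ("Principles of Nursing", 800, 93),
]

n = len(books)
mean_x = sum(pages for _, pages, _ in books) / n   # 4,900 / 8
mean_y = sum(price for _, _, price in books) / n   # 636 / 8

sum_dxdy = sum_dx2 = sum_dy2 = 0.0
for title, pages, price in books:
    dx = pages - mean_x   # the X - X-bar column
    dy = price - mean_y   # the Y - Y-bar column
    sum_dxdy += dx * dy
    sum_dx2 += dx ** 2
    sum_dy2 += dy ** 2
    print(f"{title:<22} {dx:>8.1f} {dy:>6.1f}")

r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)
print(f"r = {r:.3f}")
```

The final line should agree with the stated answer of 0.614.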

[Figure: Scatter Diagram of Number of Pages and Selling Price of Text. x-axis: Pages, 400 to 800; y-axis: Price (\$), 60 to 100.]

Example 1 (continued)

Develop a regression equation for the information given in Example 1 that can be used to estimate the selling price based on the number of pages.

$$b = \frac{8(397{,}200) - (4{,}900)(636)}{8(3{,}150{,}000) - (4{,}900)^2} = 0.05143$$

$$a = \frac{636}{8} - 0.05143 \cdot \frac{4{,}900}{8} = 48.0$$

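Example 1 (continued)

The regression equation is:
Y' = 48.0 + 0.05143X

• The equation crosses the Y-axis at \$48. A book with no pages would cost \$48.
• The slope of the line is 0.05143. Each additional page adds about \$0.05 to the estimated price.

Example 1 (continued)

• The regression equation can be used to estimate values of Y.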

• Estimate the selling price of an 800-page book.

$$Y' = 48.0 + 0.05143X = 48.0 + 0.05143(800) = 89.14$$
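The fit and the estimate can be reproduced with a minimal Python sketch that applies the least-squares formulas for b and a to the Example 1 data. The names `predict`, `sum_xy` and so on are assumptions made for the illustration.

```python
# Pages (X) and price in dollars (Y) of the eight textbooks in Example 1.
pages = [500, 700, 800, 600, 400, 500, 600, 800]
price = [84, 75, 99, 72, 69, 81, 63, 93]

n = len(pages)
sum_x = sum(pages)                                  # 4,900
sum_y = sum(price)                                  # 636
sum_xy = sum(x * y for x, y in zip(pages, price))   # 397,200
sum_x2 = sum(x * x for x in pages)                  # 3,150,000

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

def predict(x):
    """Estimated price Y' for a book with x pages."""
    return a + b * x

print(f"Y' = {a:.1f} + {b:.5f} X")
print(f"800 pages: {predict(800):.2f}")
```

Carrying the unrounded slope through the calculation gives the same 89.14 as the hand computation.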
Example 2/HW

• The marks given by two judges to the contestants of a beauty contest are shown below. Find the correlation between the tastes of the two judges.

| Contestant | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| Judge X | 52 | 53 | 42 | 60 | 45 | 41 | 37 | 38 | 25 | 27 |
| Judge Y | 65 | 68 | 43 | 38 | 77 | 48 | 35 | 30 | 25 | 50 |

• Ans: 0.5394 (Spearman's rank correlation, R)
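The stated answer corresponds to the rank correlation formula R = 1 − 6ΣD² / N(N² − 1). A minimal Python sketch follows, assuming rank 1 goes to the highest mark and that there are no tied marks (true for this data); the helper `ranks` is a hypothetical name.

```python
judge_x = [52, 53, 42, 60, 45, 41, 37, 38, 25, 27]
judge_y = [65, 68, 43, 38, 77, 48, 35, 30, 25, 50]

def ranks(scores):
    """Rank 1 = highest score. Assumes no ties, which holds for this data."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

rank_x = ranks(judge_x)
rank_y = ranks(judge_y)

n = len(judge_x)
# D is the difference between the two ranks given to the same contestant.
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
R = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"sum of D^2 = {sum_d2}, R = {R:.4f}")
```

With tied marks the ranks would need averaging, which this sketch does not handle.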

Assumptions

• For each value of x, there is a group of y values.
• These y values are normally distributed.
• The means of these normal distributions of y values all lie on the straight line of regression.
• The standard deviations of these normal distributions are the same.

Standard Error

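• The standard error of estimate (S.E) measures the scatter of the actual values of y around the regression line. It is given by

$$\text{S.E} = \sqrt{\frac{\sum (Y - y')^2}{n - 2}}$$

where y' is the value estimated by the regression equation and Y is the corresponding actual value.
• Equivalently,

$$\text{S.E} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}}$$
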
Confidence Interval

• The higher the standard error, the lower the reliability of the predicted value of y.
• A confidence interval for y' at a given value of x can be constructed as y' ± z·S.E, or as y' ± t·S.E with n − 2 d.f.

Significance testing

• If the sample regression coefficient b is to be used for the whole population, its significance may be tested.
• The standard error of b is S_b:

$$S_b = \frac{\text{S.E}}{\sqrt{\sum X^2 - n\bar{X}^2}}$$

• For the t test, t = (b − B) / S_b with n − 2 d.f., where B is the population regression coefficient.
• H0: B = 0, i.e. no linear correlation in the population. H1: B ≠ 0, or B > 0, or B < 0.
• A confidence interval for b can also be constructed as b ± t·S_b.
Example 3

• Estimate the relationship between sales (Rs lakh) and ad expense (Rs lakh). Find the 95% confidence interval for the sales when the ad expense is Rs 7 lakh. Test whether advertising has a positive impact on sales at the 5% significance level.

| Sales (Y) | 3 | 15 | 6 | 20 | 9 | 25 |
|---|---|---|---|---|---|---|
| Advt (X) | 1 | 2 | 3 | 4 | 5 | 6 |

• Ans: X̄ = 3.5, Ȳ = 13, a = 2.4, b = 3.03
• y' = 2.4 + 3.03X. When X = 7, y' = 23.6.
• (a = 2.4 is the estimated sales with no advertising. For every additional rupee spent on advertising, expect sales to rise by about Rs 3.)
• S.E = √{(ΣY² − aΣY − bΣXY)/(n − 2)} = 7.1
• t for 5% (two-tailed) at 4 d.f. is 2.776.
• 95% confidence interval = 23.6 ± 2.776 × 7.1, i.e. 3.9 to 43.3.
• H0: B = 0, H1: B > 0
• S_b = S.E / √(ΣX² − nX̄²) = 1.7
• t = (b − 0) / S_b = 1.785
• Since this is less than the critical t at 4 d.f. (one tail), 2.132, we cannot conclude that there is a positive impact at the 5% significance level (95% confidence level).
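The whole of Example 3 can be retraced with a short Python sketch. It follows the slides' simplified interval y' ± t·S.E (fuller treatments add a term that widens the interval as X moves away from X̄). The critical t values are copied from the slides rather than computed, and all variable names are assumptions.

```python
from math import sqrt

advt = [1, 2, 3, 4, 5, 6]        # X, ad expense in Rs lakh
sales = [3, 15, 6, 20, 9, 25]    # Y, sales in Rs lakh
n = len(advt)

sum_x, sum_y = sum(advt), sum(sales)
sum_xy = sum(x * y for x, y in zip(advt, sales))
sum_x2 = sum(x * x for x in advt)
sum_y2 = sum(y * y for y in sales)
mean_x = sum_x / n

# Least-squares line and the estimate at X = 7.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * mean_x
y_hat = a + b * 7

# Standard error of estimate: shortcut formula, then directly from residuals.
se = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
se_resid = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(advt, sales)) / (n - 2))

# Critical t values for 4 d.f., taken from the slides (not computed here).
T_TWO_TAIL_5PCT = 2.776
T_ONE_TAIL_5PCT = 2.132

low = y_hat - T_TWO_TAIL_5PCT * se
high = y_hat + T_TWO_TAIL_5PCT * se

# One-tailed test of H0: B = 0 against H1: B > 0.
s_b = se / sqrt(sum_x2 - n * mean_x ** 2)
t_stat = (b - 0) / s_b
reject_h0 = t_stat > T_ONE_TAIL_5PCT

print(f"a = {a:.1f}, b = {b:.2f}, y'(7) = {y_hat:.1f}")
print(f"S.E = {se:.1f}, 95% interval = {low:.1f} to {high:.1f}")
print(f"S_b = {s_b:.1f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Both ways of computing S.E agree, and the test statistic stays below the one-tailed critical value, so H0 is not rejected.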
Multiple Regression

• A variable may depend on more than one independent variable.
• E.g. the yield of grain depends on rain, fertilizer used, etc.
• Y' = a + b₁X₁ + b₂X₂, which can be drawn as a three-dimensional graph.
• Or, even
• Y' = a + b₁X₁ + b₂X₂ + b₃X₃ + b₄X₄ + …
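A two-predictor fit can be sketched in plain Python by extending the least squares principle to three equations in a, b₁ and b₂ and solving them by elimination. Everything here is an assumption for illustration: the function names, the elimination routine, and the data, which are hypothetical and generated exactly from Y = 5 + 2X₁ + 3X₂ so that the fit should recover those coefficients.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for Y' = a + b1*X1 + b2*X2."""
    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    ones = [1.0] * len(y)
    cols = [ones, x1, x2]
    # Each equation sets the residuals' sum against one column to zero.
    matrix = [[dot(ci, cj) for cj in cols] for ci in cols]
    rhs = [dot(ci, y) for ci in cols]
    return solve(matrix, rhs)

# Hypothetical data (not from the slides), built from Y = 5 + 2*X1 + 3*X2.
rain = [1, 2, 3, 4, 5, 6]
fertilizer = [2, 1, 4, 3, 6, 5]
grain_yield = [5 + 2 * r + 3 * f for r, f in zip(rain, fertilizer)]

a, b1, b2 = fit_two_predictors(rain, fertilizer, grain_yield)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

In practice a numerical library would normally be used for this; the hand-written solver only keeps the sketch self-contained.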

Example 4/ HW

• A professor felt that the hours spent by students on homework and the marks they get are correlated. Test this with the given data.

| Student | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| Hrs | 45 | 30 | 90 | 60 | 105 | 65 | 90 | 80 | 55 | 75 |
| Mark | 40 | 35 | 75 | 65 | 90 | 50 | 90 | 80 | 45 | 65 |

• Predict the mark of a student who spends 95 hrs.
• Obtain a 95% confidence interval for the mark.
HW / Assignment
• IIMM Page 521,23, 42,79

• 2007 terminal make-up, part C, Q. 5a & 5b.
• 2007 terminal, part C, Q. 6b