Regression

Prof. G.R.C.Nair


Correlation Analysis

• Correlation Analysis is a statistical technique used to measure the strength of the association between two variables.
• It is useful for predicting future business scenarios.

Scatter Diagram

• The Dependent Variable is the variable being predicted or estimated.
• The Independent Variable provides the basis for estimation; it is the estimator.
• A Scatter Diagram is a chart that portrays the relationship between the two variables.

This scatter plot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.

[Figure: scatter plot of Sales (y-axis, 0 to 140) against Advertising (x-axis, 0 to 50)]

Direct Linear

• The points are distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.

[Figures: scatter plots of Y against X for the Inverse Linear and Direct Nonlinear cases]

• No association / No correlation

• Correlated ?

[Figure: scatter plot of Y against X]

Perfect Negative Correlation

[Figure: scatter plot of Y against X, both axes from 0 to 10]

Perfect Positive Correlation

[Figure: scatter plot of Y against X, both axes from 0 to 10]

Zero Correlation

[Figure: scatter plot of Y against X, both axes from 0 to 10]

Strong Positive Correlation

[Figure: scatter plot of Y against X, both axes from 0 to 10]

Nature of Correlation

Correlation can be

• Positive or Negative

• Linear or Nonlinear

• Perfect / Strong / Weak


Coefficient of Correlation, r

is a measure of the strength of the linear relationship between two variables.
It can range from -1.00 to 1.00.
Values of -1.00 or 1.00 indicate perfect correlation.
Values close to 0.0 indicate weak correlation.
Negative values indicate an inverse relationship and positive values indicate a direct relationship.

Formula for r

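r can be computed from the following formulae.
$r = \dfrac{\operatorname{Cov}(X,Y)}{s_x s_y}$, where $\operatorname{Cov}(X,Y) = \dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{n-1}$
$r = \dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \, \sum (Y-\bar{Y})^2}}$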
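The two expressions for r describe the same quantity. A short check, assuming s_x and s_y are the sample standard deviations computed with the same n − 1 divisor as the covariance (the slide does not state the divisor for them):

```latex
r = \frac{\operatorname{Cov}(X,Y)}{s_x s_y}
  = \frac{\frac{1}{n-1}\sum (X-\bar{X})(Y-\bar{Y})}
         {\sqrt{\frac{1}{n-1}\sum (X-\bar{X})^2}\;\sqrt{\frac{1}{n-1}\sum (Y-\bar{Y})^2}}
  = \frac{\sum (X-\bar{X})(Y-\bar{Y})}
         {\sqrt{\sum (X-\bar{X})^2 \, \sum (Y-\bar{Y})^2}}
```

The factors of 1/(n − 1) cancel, so r does not depend on which divisor is used as long as it is used consistently.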

Coefficient of Determination

The coefficient of determination, $r^2$, is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for (not necessarily caused) by the variation in the independent variable (X).
It is the square of the coefficient of correlation. Ranges from 0 to 1.
It does not give any information on the direction of the relationship between the variables.
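As a worked illustration, taking the answer r = 0.614 quoted for the textbook data in Example 1 of these slides:

```latex
r^2 = (0.614)^2 \approx 0.377
```

Read this way, roughly 38% of the variation in textbook price is accounted for by the variation in the number of pages. The sign of r is lost on squaring, which is why the direction of the relationship cannot be read from this value.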

Rank Correlation

• Charles Spearman's Rank Correlation Coefficient (R) is used to measure the degree of correlation between two qualitative variables, such as honesty, beauty, talent for singing or gift for dancing, which cannot be measured directly. In this case the items are ranked serially and the correlation between the ranks is calculated as $R = 1 - \dfrac{6 \sum D^2}{N(N^2-1)}$, where D is the difference between the ranks on the two variables for the same sample item and N is the number of items.

Regression

Regression analysis uses the independent variable (x) to estimate the dependent variable (y).
If the relationship between the variables is linear, it is called Linear regression.
Both variables must be at least interval scale.

Least Square Regression

y’ = a + bx, where:
• y’ is the average predicted value of the dependent variable for any value of x.
• a is the y-intercept, the estimated value of y when x = 0.
• b, the regression coefficient, is the slope of the line, or the average change in y’ for each change of one unit in x.

Regression Equation

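The following normal equations are solved to obtain a and b.
• $\sum Y = na + b\sum X$
• $\sum XY = a\sum X + b\sum X^2$
or, equivalently,
$b = \dfrac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$
$a = \dfrac{\sum Y}{n} - b\,\dfrac{\sum X}{n}$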
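The step from the two normal equations to the closed forms for b and a can be filled in as follows (a sketch of the algebra, using only the symbols on this slide):

```latex
\text{From } \sum Y = na + b\sum X: \qquad a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}

\text{Substituting into } \sum XY = a\sum X + b\sum X^2:
\qquad \sum XY = \frac{(\sum X)(\sum Y)}{n} - b\,\frac{(\sum X)^2}{n} + b\sum X^2

\text{Multiplying by } n \text{ and collecting terms in } b:
\qquad n\sum XY - (\sum X)(\sum Y) = b\left[\,n\sum X^2 - (\sum X)^2\right]
```

Dividing by the bracket gives the formula for b, and the first line then gives a.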

Example -1

Toledo State University, is concerned about

the cost to students of textbooks. He

believes there is a relationship between the

number of pages in the text and the selling

price of the book. To provide insight into the

problem he selects a sample of eight

textbooks currently on sale in the bookstore.

Draw a scatter diagram. Compute the

correlation coefficient.

Book                     Pages (X)   Price ($) (Y)   X − X̄   Y − Ȳ
Intro to History            500           84
Basic Algebra               700           75
Intro to Psychology         800           99
Intro to Sociology          600           72
Bus. Management             400           69
Intro to Biology            500           81
Fund. of Jazz               600           63
Principles of Nursing       800           93
                            ΣX            ΣY

Ans: r = 0.614
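The correlation coefficient for this table can be checked with a few lines of Python. This is a minimal sketch using the deviation form of the formula for r; the function name `pearson_r` and the variable names are illustrative choices, not part of the slides.

```python
from math import sqrt

# Pages (X) and price in dollars (Y) for the eight textbooks in Example 1.
pages = [500, 700, 800, 600, 400, 500, 600, 800]
prices = [84, 75, 99, 72, 69, 81, 63, 93]


def pearson_r(x, y):
    """Coefficient of correlation from sums of deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sums of squared deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(pages, prices)
print(round(r, 3))  # 0.614, the answer quoted on the slide
```

The intermediate sums are the ones the empty X − X̄ and Y − Ȳ columns of the table are meant to hold.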

[Figure: scatter diagram of Price ($) (y-axis, 60 to 100) against Page (x-axis, 400 to 800)]

Example 1 contn

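Develop a regression equation for the information given in Example 1 that can be used to estimate the selling price based on the number of pages.
$b = \dfrac{8(397{,}200) - (4{,}900)(636)}{8(3{,}150{,}000) - (4{,}900)^2} = 0.05143$
$a = \dfrac{636}{8} - 0.05143\left(\dfrac{4{,}900}{8}\right) = 48.0$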
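The same arithmetic can be reproduced in Python. The sketch below applies the formulas for b and a from the Regression Equation slide to the Example 1 data; `fit_line` is a hypothetical helper name introduced here for illustration.

```python
# Pages (X) and price in dollars (Y) for the eight textbooks in Example 1.
pages = [500, 700, 800, 600, 400, 500, 600, 800]
prices = [84, 75, 99, 72, 69, 81, 63, 93]


def fit_line(x, y):
    """Return (a, b) of the least squares line y' = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope and intercept from the solved normal equations.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(pages, prices)
print(round(a, 1), round(b, 5))   # 48.0 0.05143
print(round(a + b * 800, 2))      # 89.14, estimated price of an 800-page book
```

Keeping b unrounded until the final step avoids small discrepancies that appear when 0.05143 is substituted by hand.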

Example 1 contn

Y’ = 48.0 + 0.05143X
• The intercept is 48.0. A book with no pages would cost $48.
• The slope of the line is 0.05143. Each additional page costs about 5 cents.

Example 1 contn

The regression equation can be used to estimate values of Y. For a book of 800 pages:
Y’ = 48.0 + 0.05143X = 48.0 + 0.05143(800) = 89.14

Example 2/HW

• The scores given by two judges to the ten contestants of a beauty contest are given below. Find the correlation between the tastes of the 2 judges.

Contestant   A    B    C    D    E    F    G    H    I    J
Judge X      52   53   42   60   45   41   37   38   25   27
Judge Y      65   68   43   38   77   48   35   30   25   50

• Ans: 0.5394
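The quoted answer is what the rank correlation formula R = 1 − 6ΣD²/[N(N² − 1)] gives for these scores. The sketch below is one way to check it; the helper names `ranks` and `spearman_r` are illustrative, and the ranking step assumes there are no tied scores (true for this data, but not handled in general).

```python
# Scores given by the two judges to contestants A..J (Example 2).
judge_x = [52, 53, 42, 60, 45, 41, 37, 38, 25, 27]
judge_y = [65, 68, 43, 38, 77, 48, 35, 30, 25, 50]


def ranks(scores):
    """Rank 1 for the highest score. Assumes no two scores are equal."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_r(x, y):
    """Rank correlation R = 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


R = spearman_r(judge_x, judge_y)
print(round(R, 4))  # 0.5394
```

For this data the squared rank differences sum to 76, so R = 1 − 456/990.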


Assumptions

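• For each value of x, there is a group of y values. These y values are normally distributed.
• The means of these normal distributions of y values all lie on the straight line of regression.
• The standard deviations of these normal distributions are equal.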

Standard Error

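The standard error of estimate measures the scatter of the actual values about the regression line. It is given by
$S.E = \sqrt{\dfrac{\sum (Y - y')^2}{n-2}}$
where y’ is the value estimated by the regression equation and Y is the corresponding actual value.
Also, $S.E = \sqrt{\dfrac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}$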
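The two forms of the standard error give the same number when a and b are the unrounded least squares values. A minimal sketch, reusing the textbook data of Example 1 (the value printed is computed here and is not quoted anywhere in the slides):

```python
from math import sqrt

# Pages (X) and price in dollars (Y) for the eight textbooks in Example 1.
pages = [500, 700, 800, 600, 400, 500, 600, 800]
prices = [84, 75, 99, 72, 69, 81, 63, 93]
n = len(pages)

sum_x, sum_y = sum(pages), sum(prices)
sum_xy = sum(x * y for x, y in zip(pages, prices))
sum_x2 = sum(x ** 2 for x in pages)
sum_y2 = sum(y ** 2 for y in prices)

# Least squares slope and intercept, kept unrounded.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Form 1: from the residuals Y - y'.
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(pages, prices))
se_residuals = sqrt(residual_ss / (n - 2))

# Form 2: the shortcut using the sums only.
se_shortcut = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(round(se_residuals, 2), round(se_shortcut, 2))  # 10.41 10.41
```

The shortcut form subtracts large, nearly equal numbers, so substituting rounded values of a and b into it can shift the result noticeably.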


Confidence Interval

• The standard error is used to assess the reliability of the predicted value of y.
• A confidence interval for y’ for a given value of x can be constructed as y’ ± z S.E, or as y’ ± t S.E with n − 2 d.f.

Significance testing

• As b is only a sample estimate of the regression coefficient B for the whole population, its significance may be tested.
• Std error of b = Sb
• $S_b = \dfrac{S.E}{\sqrt{\sum x^2 - n\bar{x}^2}}$
• For the ‘t’ test, t = (b − B) / Sb, with d.f = n − 2
• Ho: B = 0, i.e., no linear correlation for the population. H1: B ≠ 0, or B > 0, or B < 0
• A confidence interval for B can also be constructed as b ± t Sb.

Example 3

The data below show sales and ad expense in Rs lakh. Find the 95% confidence interval for the sales when the ad expense is Rs 7 lakh. Test whether the ad has a positive impact on sales at the 5% significance level.

Sales   3   15   6   20   9   25
Advt    1    2   3    4   5    6

• y’ = 2.4 + 3.03 X. When X = 7, y’ = 23.6
• (2.4 is the expected sales without any ad. For every rupee spent on ads, expect sales to rise by about Rs 3.)

• $S.E = \sqrt{\dfrac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}} = 7.1$
• t for 5% (two tail) at d.f 4 is 2.776.
• 95% conf int = 23.6 ± 2.776 × 7.1 = 3.9 to 43.3
• Ho: B = 0, H1: B > 0
• $S_b = \dfrac{S.E}{\sqrt{\sum x^2 - n\bar{x}^2}} = 1.7$
• t = (b − 0)/Sb = 1.785
• Since this is less than the critical t at d.f 4 (one tail), 2.132, we cannot conclude that there is a positive impact at the 5% significance level (95% confidence level).
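The figures in Example 3 can be reproduced step by step. The sketch below follows the slide's own formulas, including its simplified interval y’ ± t × S.E, and takes the two critical t values (2.776 and 2.132) from the slide rather than computing them; the variable names are illustrative.

```python
from math import sqrt

# Example 3 data, in Rs lakh.
advt = [1, 2, 3, 4, 5, 6]
sales = [3, 15, 6, 20, 9, 25]
n = len(advt)

sum_x, sum_y = sum(advt), sum(sales)
sum_xy = sum(x * y for x, y in zip(advt, sales))
sum_x2 = sum(x ** 2 for x in advt)
sum_y2 = sum(y ** 2 for y in sales)

# Regression line y' = a + b*x and the estimate at x = 7.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n
y_hat = a + b * 7

# Standard error of estimate and standard error of the slope.
se = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
s_b = se / sqrt(sum_x2 - n * (sum_x / n) ** 2)

# Critical t values at 4 d.f., as quoted on the slide.
T_TWO_TAIL_5PCT = 2.776
T_ONE_TAIL_5PCT = 2.132

lower = y_hat - T_TWO_TAIL_5PCT * se
upper = y_hat + T_TWO_TAIL_5PCT * se
t_stat = (b - 0) / s_b

print(round(a, 1), round(b, 2), round(y_hat, 1))  # 2.4 3.03 23.6
print(round(se, 1), round(s_b, 1))                # 7.1 1.7
print(round(lower, 1), round(upper, 1))           # 3.9 43.3
print(round(t_stat, 3), t_stat > T_ONE_TAIL_5PCT) # 1.785 False
```

The final `False` corresponds to the slide's conclusion: the test statistic does not exceed the one-tail critical value, so Ho is not rejected.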

Multiple Regression

• Multiple regression is used when the dependent variable depends on more than one independent variable.
• e.g. the yield of grains depends on rain, fertilizer used, etc.

• Or, even

• Y’= a + b1 X1 + b2 X2 + b3 X3 + b4 X4 + ……
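One way to obtain the coefficients of such an equation is to solve the normal equations for several variables, in the same spirit as the two-variable case. The sketch below does this in plain Python with Gaussian elimination. The data are hypothetical and noise-free, generated from Y = 1 + 2·X1 + 3·X2 purely so the result can be checked; `solve` and `fit_multiple` are illustrative names, and this is not presented as the method used in the course.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    # Augmented matrix [A | b], copied so the inputs are left unchanged.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution from the last row upwards.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_multiple(xs, y):
    """Least squares coefficients [a, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + list(x) for x in xs]  # leading 1 carries the intercept a
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: five observations of (X1, X2), with Y built exactly
# from Y = 1 + 2*X1 + 3*X2 so the fitted coefficients are known in advance.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]

a, b1, b2 = fit_multiple(xs, y)
print(round(a, 6), round(b1, 6), round(b2, 6))  # 1.0 2.0 3.0
```

With real data the fit would not be exact, and a numerical library would normally be preferred over hand-written elimination.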


Example 4/ HW

• It is believed that the hours students spend on home work and the marks they get are correlated. Test it with the given data.

Student   A    B    C    D    E     F    G    H    I    J
Hrs       45   30   90   60   105   65   90   80   55   75
Mark      40   35   75   65   90    50   90   80   45   65

• Obtain a 95% confidence interval for the mark.

HW / Assignment

• IIMM Page 521,23, 42,79

& 5 b.

• 2007 terminal part C Q.6b


