
Correlation and Regression

Correlation and regression are techniques used to see whether a relationship exists between two or more different sets of data.

Learning Objectives:

∙ To identify, by diagram, whether a possible relationship exists between two variables;

∙ To quantify the strength of association between variables using the correlation coefficient;

∙ To show how a relationship can be expressed as an equation;

∙ To identify linear equations when written and when graphed;

∙ To examine regression, a widely used linear model, and to consider its uses and limitations.

Slide 1 of 16


Scatter Diagrams

A graph known as a scatter diagram is used to identify the possibility and type of relationship.

[Figure: Scatter diagram of Sales (£000s) against Advertising (£00s)]

y is defined as the variable which it is believed is being influenced (dependent).

x is defined as the variable which is doing the influencing (independent).
Slide 2 of 16


Correlation

The strength of a relationship between two sets of data is measured by Pearson’s correlation coefficient (r). It is found by the following formula:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\{n\sum y^2 - (\sum y)^2\}}} $$

EXCEL: =CORREL(Y DATA, X DATA)


Slide 3 of 16


Example 4.1 – Calculation of r


Sales (y)   Expenditure (x)   x²     y²      xy
25          8                 64     625     200
35          12                144    1225    420
29          11                121    841     319
24          5                 25     576     120
38          14                196    1444    532
12          3                 9      144     36
18          6                 36     324     108
27          8                 64     729     216
17          4                 16     289     68
30          9                 81     900     270

(first row: x² = 8*8 = 64, y² = 25*25 = 625, xy = 8*25 = 200, etc.)

Σy = 255   Σx = 80   Σx² = 756   Σy² = 7097   Σxy = 2289

and n = 10

Slide 4 of 16


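The summary values are substituted into the correlation coefficient formula and worked through:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\{n\sum y^2 - (\sum y)^2\}}} $$

$$ r = \frac{(10 \times 2289) - (80 \times 255)}{\sqrt{[(10 \times 756) - 80^2][(10 \times 7097) - 255^2]}} $$

$$ r = \frac{22890 - 20400}{\sqrt{(7560 - 6400)(70970 - 65025)}} $$

$$ r = \frac{2490}{\sqrt{6896200}} = \frac{2490}{2626.0617} $$

so r = 0.948 (to 3 d.p.)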
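The hand calculation can be checked with a short script. This is a minimal sketch in plain Python using the Example 4.1 data; the function name `pearson_r` and the variable names are illustrative choices, not part of the original slides.

```python
from math import sqrt

# Example 4.1 data: sales (y, £000s) and advertising expenditure (x, £00s).
sales = [25, 35, 29, 24, 38, 12, 18, 27, 17, 30]
advertising = [8, 12, 11, 5, 14, 3, 6, 8, 4, 9]


def pearson_r(x, y):
    """Pearson's r computed from the same summary sums used on the slide."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(advertising, sales)
print(round(r, 3))  # 0.948
```

Swapping the two arguments gives the same result, since the formula is symmetric in x and y.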

Slide 5 of 16


Interpretation of r

The value of r can only lie between -1 and +1 inclusive:

+1: Perfect positive correlation exists between the data. If x is known y can be predicted exactly.

[Figure: Perfect Positive Correlation, y against x, r = +1]

+0.8 to +1: Strong positive correlation exists between the data. As x increases y increases.

Slide 6 of 16


+0.4 to +0.8: Moderate positive correlation exists between the data. As x increases y increases.

-0.4 to +0.4: Very little correlation exists between the data.

[Figure: Very little Correlation, y against x, r approx. 0]

-0.8 to -0.4: Moderate negative correlation exists between the data. As x increases y decreases.

Slide 7 of 16


-1 to -0.8: Strong negative correlation exists between the data. As x increases y decreases.

[Figure: Strong Negative Correlation, y against x, r approx. -0.9]

-1: Perfect negative correlation exists between the data. If x is known y can be predicted exactly.
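The interpretation bands can be written as a small lookup. This is a sketch only: the slides do not say which band the boundary values 0.4 and 0.8 belong to, so placing them in the stronger band is an assumption, and `describe_r` is a hypothetical helper name.

```python
def describe_r(r):
    """Map a correlation coefficient to the verbal bands used on the slides.

    Assumption: boundary values (exactly 0.4 or 0.8 in size) are placed in
    the stronger band; the slides leave this unspecified.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1 inclusive")
    size = abs(r)
    if size < 0.4:
        return "very little correlation"
    if size == 1:
        strength = "perfect"
    elif size >= 0.8:
        strength = "strong"
    else:
        strength = "moderate"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(describe_r(0.948))  # strong positive correlation
print(describe_r(-0.5))   # moderate negative correlation
```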

Slide 8 of 16


Regression

Regression is a technique which builds a straight line relationship between two sets of data.

This relationship is of the form

y = a + bx

where a and b are found by the following formulae:

$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} $$

EXCEL: =SLOPE(Y DATA, X DATA)

$$ a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$

EXCEL: =INTERCEPT(Y DATA, X DATA)

Slide 9 of 16


Example 4.5 – Calculation of a and b

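To calculate a and b, use the summary values from the correlation calculation, i.e.

Σy = 255   Σx = 80   Σx² = 756   Σy² = 7097   Σxy = 2289   n = 10

SLOPE:

$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{(10 \times 2289) - (80 \times 255)}{(10 \times 756) - 80^2} $$

$$ b = \frac{22890 - 20400}{7560 - 6400} = \frac{2490}{1160} $$

b = 2.1465517

INTERCEPT:

$$ a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{255}{10} - 2.1465517 \times \frac{80}{10} $$

a = 25.5 - 17.172414 = 8.327586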

Slide 10 of 16


The final answers (rounded to three decimal places) are:

a = 8.328   b = 2.147

(note that 3 decimal places were chosen as the data supplied were in thousands and hundreds)

These give the linear regression equation

y = 8.328 + 2.147x

or, if preferred,

sales = 8.328 + 2.147*advertising expenditure
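The slope and intercept can be reproduced in a few lines. This is a minimal sketch in plain Python using the same sales and advertising data; `fit_line` is an illustrative name, and the code follows the summary-sum formulae for b and a rather than any particular library routine.

```python
# Sales (y, £000s) and advertising expenditure (x, £00s) from the worked example.
sales = [25, 35, 29, 24, 38, 12, 18, 27, 17, 30]
advertising = [8, 12, 11, 5, 14, 3, 6, 8, 4, 9]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_xy = sum(p * q for p, q in zip(x, y))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(advertising, sales)
print(round(a, 3), round(b, 3))  # 8.328 2.147
```

Note the argument order: x (advertising) first, then y (sales). This is the reverse of the Excel SLOPE and INTERCEPT functions, which take the y data first.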

Slide 11 of 16


Forecasts

Forecasts may be made using the resulting model.

If the x (independent) value used falls within the range of the original data set then this forecast is known as interpolation.

e.g. Advertising expenditure = £700 (inside original range), i.e. x = 7, giving

y = 8.328 + 2.147 * 7 = 23.357

i.e. sales of £23,357 are forecast.

If the x value falls outside the bounds of the original data then this forecast is known as extrapolation and care must be taken in its use.

e.g. Advertising expenditure = £1800 (outside original range), i.e. x = 18, giving

y = 8.328 + 2.147 * 18 = 46.974

i.e. sales of £46,974 are forecast.
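The distinction between the two kinds of forecast can be made explicit in code. In this sketch the bounds 3 and 14 are the smallest and largest advertising values in the example data set, and `forecast` is a hypothetical helper that returns the prediction together with a label.

```python
# Rounded regression coefficients from the worked example.
A, B = 8.328, 2.147

# Smallest and largest x (advertising, £00s) in the original data set.
X_MIN, X_MAX = 3, 14


def forecast(x):
    """Return (predicted sales in £000s, kind of forecast) for advertising x in £00s."""
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return round(A + B * x, 3), kind


print(forecast(7))   # (23.357, 'interpolation')
print(forecast(18))  # (46.974, 'extrapolation')
```

Returning the label alongside the number is one way to make sure an extrapolated value is never used without the caller being told it lies outside the observed range.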

Slide 12 of 16


Coefficient of Determination
The coefficient of determination (r²) is another measure which may be used to assess the appropriateness of a regression model. It is found by squaring Pearson’s correlation coefficient and then expressing the result as a percentage.

The resulting figure describes the percentage of the variation in the y data which can be attributed to the variation in the x data.

In the Sales – Adv. Costs example

r = 0.948 so r² = 0.899

So it may be said that 89.9% of the variation in sales of the products can be attributed to variation in the levels of advertising expenditure.
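As a quick check, r² can be computed directly from the summary values of the correlation example without rounding r first. This is a small sketch; the variable names are illustrative.

```python
# Summary values from the sales and advertising example.
n = 10
sum_x, sum_y = 80, 255
sum_x2, sum_y2, sum_xy = 756, 7097, 2289

# r squared is the squared numerator over the product under the square root.
numerator = n * sum_xy - sum_x * sum_y                      # 2490
product = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)  # 6896200
r_squared = numerator ** 2 / product

print(round(r_squared, 3))          # 0.899
print(f"{r_squared * 100:.1f}%")    # 89.9%
```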

Slide 13 of 16



Rank Correlation (Spearman's)

Used to assess evidence of a relationship between two sets of data, at least one of which has been ranked in some way.

Formula for calculation:

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

where: n = number of pairs of observations and d = difference between the rank of x and the rank of y.
Slide 14 of 16


Example 4.2 – Calculation of Spearman’s r

Top ten travel destinations: a large travel company's ranking versus a women's magazine annual reader survey.

Destination      Travel Co. Rank   Magazine Rank   d     d²
Florida          2                 1               1     1
Canary Islands   5                 6               -1    1
Greek Islands    3                 2               1     1
Germany          4                 4               0     0
Spain            6                 5               1     1
Caribbean        10                7               3     9
Australia        7                 9               -2    4
France           9                 10              -1    1
Canada           8                 8               0     0
Russia           1                 3               -2    4

Σd² = 22

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 22}{10(100 - 1)} $$

$$ r = 1 - \frac{132}{990} = 1 - 0.1333 = 0.867 \text{ (to 3 d.p.)} $$
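The same calculation can be sketched in plain Python from the two rank columns of the travel destinations example. The function name `spearman_from_ranks` is an illustrative choice.

```python
# Ranks in destination order: Florida, Canary Islands, Greek Islands, Germany,
# Spain, Caribbean, Australia, France, Canada, Russia.
travel_rank = [2, 5, 3, 4, 6, 10, 7, 9, 8, 1]
magazine_rank = [1, 6, 2, 4, 5, 7, 9, 10, 8, 3]


def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


r_s = spearman_from_ranks(travel_rank, magazine_rank)
print(round(r_s, 3))  # 0.867
```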

Slide 15 of 16


Example 4.3 – Spearman’s r where one set of data is unranked.

Top 5 floral arrangements versus Internet Sales

Arrangement         Internet Sales (100’s)   Internet Rank   d      d²
1: Lemon Posy       29                       2.5             -1.5   2.25
2: Mixed Blooms     35                       1               1      1
3: Blue Symphony    18                       4               -1     1
4: Pink Carnival    29                       2.5             1.5    2.25
5: Lover’s Knot     16                       5               0      0

Σd² = 6.5

The two arrangements tied on Internet sales of 29 would occupy ranks 2 and 3, so each is given the average rank of 2.5.

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 6.5}{5(25 - 1)} $$

$$ r = 1 - \frac{39}{120} = 1 - 0.325 = 0.675 $$

Therefore we can see that there is a reasonable level of similarity between the best sellers of the internet site and the florist’s overall best sellers.
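Ranking the unranked column, including the tied values, can be sketched as follows. The helper `average_ranks` is a hypothetical name; it gives rank 1 to the largest value and gives tied values the mean of the positions they would occupy, which matches the 2.5 used in this example.

```python
def average_ranks(values):
    """Rank values with 1 = largest; tied values share the mean of their positions."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # first position this value occupies
        count = ordered.count(v)       # how many values are tied here
        ranks.append(first + (count - 1) / 2)
    return ranks


# Florist's ranking (1 to 5) and Internet sales (100's) for the five arrangements.
florist_rank = [1, 2, 3, 4, 5]
internet_sales = [29, 35, 18, 29, 16]

internet_rank = average_ranks(internet_sales)
n = len(florist_rank)
sum_d2 = sum((f - i) ** 2 for f, i in zip(florist_rank, internet_rank))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(internet_rank)   # [2.5, 1.0, 4.0, 2.5, 5.0]
print(sum_d2)          # 6.5
print(round(r_s, 3))   # 0.675
```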
Slide 16 of 16
