Topic 6

Regression and Correlation

Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
6.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
6.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
6.2.2 Calculation of r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6.3 Least Squares Regression Line . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6.3.1 General Equation of Regression Line . . . . . . . . . . . . . . . . . . . 6
6.3.2 Theoretical Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.3.3 Calculation of the Regression Line . . . . . . . . . . . . . . . . . . . . . 7
6.4 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6.5 Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
6.5.1 Hypothesis Test for b . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
6.5.2 Hypothesis Test for r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.6 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6.7 Summary and Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Learning Objectives

• draw scatterplots to investigate for a relationship between x and y
• define the Product-Moment Correlation Co-efficient, r
• interpret values of r as giving a good or bad linear relationship between variables
• calculate r using a methodical manual method
• distinguish between an explanatory variable and a response variable
• calculate the equation of a Least Squares Regression Line through a set of points
• calculate confidence intervals for the slope of the regression equation
• carry out hypothesis tests for b and r
• solve a multiple linear regression problem with the aid of a computer package

6.1 Introduction
One of the most important techniques covered in the last topic was investigating
whether there is a significant difference between two samples. It is
often also interesting to know what type of relationship there is between two sets of
variables. This can be a very useful discovery as the statistics obtained can then be
used in decision making. The goal is to use statistical data to derive a procedure that
will allow a prediction, with a determined level of confidence, of further observations or
measurements in the future. A typical example to be discussed below is the following.
Data is collected on the housing market. It is a good guess that there will be a
dependency between the size of house and the price. The latter is simply a number
and the former can be measured through number of floors, number of rooms etc. There
are also categorical parameters that determine the price, like "good location" or "close
to grocery facilities". The collected data can be plotted on a scatterplot (x-y Cartesian
diagram) and some relationship can be looked for.
Mathematical tools can then be developed that will allow a function to be calculated
(expressing price in terms of number of rooms etc.) that will represent this dependency.
Moreover, it will be desirable to have numerical values representing the confidence that
should be placed in accepting this function. This topic will mainly be concerned with
linear relationships although others will be briefly mentioned.
The starting point in looking for a relationship between two sets of results is a
visual inspection of the given data. Scatterplots are graphs that show the relationship
between two quantitative variables.
Examples are:

• Fuel consumption against speed;
• Fuel consumption against weight;
• Spending on leisure against income;
• Prices of hard disks against memory size;
• Performance of PCs against memory (RAM or ROM) size.

Example
Problem:
Consider the following average high and low temperatures in Montreal (in degrees
Celsius) and the measured gas consumption for heating (in litres) for a house.

Jan Feb March April May June


high -4 -3 3 12 20 25
low -14 -12 -6 2 9 14
gas 894 750 692 486 279 148

© Heriot-Watt University 2004



July Aug Sept Oct Nov Dec


high 27 26 21 14 6 2
low 17 15 11 5 -1 -10
gas 77 127 248 390 584 688
Solution:
The following is a data plot of gas consumption against low and high temperatures. In
both cases the data plot suggests a strong linear relationship.

In general when interpreting a scatterplot, first look for an overall pattern, and describe
form, direction, and strength. Two variables are positively associated if above average
values of one variable tend to result in above average values of the other. Two variables
are negatively associated if above average values of one variable tend to result in below
average values of the other. In the example above the data are negatively associated.

6.2 Correlation
6.2.1 Definition
Correlation is a number that measures strength and direction of the linear relationship
between two data sets.
If random variables x and y are given, each consisting of n individual data, then the
correlation, r, between x and y is defined as

    r = (1/(n - 1)) Σ [(xᵢ - x̄)/sx] [(yᵢ - ȳ)/sy]

where x̄ and ȳ are the sample means and sx and sy are the standard deviations of the
samples.
samples.
It can be shown that correlation has the following properties:

• r does not have a dimension;
• correlation is invariant under a change of units of measurement;
• r is unchanged by interchanging the roles of x and y;
• r is always a number between -1 and 1.

r is called the Product Moment Correlation Coefficient.


Values close to -1 or 1 indicate strong linear relationships, whilst values near 0 indicate
a lack of relationship. Negative values mean that as one variable increases, the other
decreases. Some examples of scatterplots are given below.

Note
The equation given above is not usually the most efficient way to calculate r. Although
it may look more complicated, a better formula for computational purposes is:

    r = (n Σxy - Σx Σy) / √[(n Σx² - (Σx)²)(n Σy² - (Σy)²)]

6.2.2 Calculation of r

Example
Problem:
Find the correlation coefficient for the data set:

x 1 2 3 4
y 3 6 7 3
Solution:
For the computational formula

    r = (n Σxy - Σx Σy) / √[(n Σx² - (Σx)²)(n Σy² - (Σy)²)]

to be used, it is best to set up a table.

x    y    x²    y²    xy
1    3    1     9     3
2    6    4     36    12
3    7    9     49    21
4    3    16    9     12

Σx = 10, Σy = 19, Σx² = 30, Σy² = 103, Σxy = 50

Also, n = 4.

Now,

    r = (4×50 - 10×19) / √[(4×30 - 10²)(4×103 - 19²)]
      = 10 / √(20 × 51)

So, r = 10/31.94 = 0.313

This is close to 0 so it shows that there is little relationship between x and y.

6.3 Least Squares Regression Line


It is now desired to draw the "best straight line" through the points on the scatterplot.
Given two random variables (data sets), when a relationship is being investigated, one
of the variables is usually considered to be an explanatory variable (often denoted
x), and the other a response variable (often denoted y). A regression line is a line
that describes how the response variable y changes when the explanatory variable x
changes. If plotted on a graph, this line will indeed be the best one that can be drawn
through the points.


6.3.1 General Equation of Regression Line


The general equation of a straight line is usually given as y = mx + c, where m is the
gradient (slope) of the line and c is the intercept with the y axis.
In statistics, the constants are usually referred to instead by the letters a and b.
The equation y = a + bx is calculated as follows:

    b = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)

    a = (Σy - b Σx) / n
Note that n is the number of points and that b has to be calculated first as a depends on
it.
Regression lines are used, among others, for prediction. Of course, a regression line is
only useful if a linear relationship between the data sets is expected.

6.3.2 Theoretical Derivation


This section aims to show how the regression equations are formed.
The notation ŷ = a + bx is used to denote a regression line for the data sets x and y.
Here ŷ stands for the predicted value of the response variable (as opposed to the
observed value). The difference between the predicted and observed data at any point,
e = ŷ - y = (a + bx) - y, is called the error (or residual) of the observed data
y, and is the vertical distance between y and the regression line. The least squares
regression line is the regression line that minimises the sum of (the squares of) errors.

Thus, it is desired to minimise

    S = Σe² = Σ((a + bx) - y)²


The problem is then to find values a and b that will make S as small as possible. Partial
differentiation of S with respect to both a and b and equating the results to zero will
achieve this. Two simultaneous equations are formed and they can then be solved to
give the values a and b quoted earlier.
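The differentiation step can be sketched explicitly (added here for completeness; the algebra uses only the definitions above):

```latex
\frac{\partial S}{\partial a} = \sum 2\bigl((a+bx)-y\bigr) = 0,
\qquad
\frac{\partial S}{\partial b} = \sum 2\bigl((a+bx)-y\bigr)\,x = 0
% Expanding and dividing by 2 gives the two simultaneous (normal) equations:
na + b\sum x = \sum y,
\qquad
a\sum x + b\sum x^2 = \sum xy
% Solving these for b and a yields the formulas quoted earlier.
```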


    b = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)

    a = (Σy - b Σx) / n
Notes

• In least squares regression analysis the roles of x and y are distinct. Changing
them round gives different regression lines (rather than minimising errors vertically,
they are then minimised horizontally).
• A change of one standard deviation in x corresponds to a change of r standard
deviations in y.
• It is always the case that the residuals sum to zero, Σe = 0.
6.3.3 Calculation of the Regression Line
As in the correlation section, there is a convenient way of calculating the regression line
through a set of points by simply setting up a table.

Examples

1.
Problem:
Find the regression line for the data set:

x 0 1 2 3 4
y 0 5 11 13 19
It is assumed that x is the explanatory variable and y the response.
Solution:

x    y    x²    xy
0    0    0     0
1    5    1     5
2    11   4     22
3    13   9     39
4    19   16    76
Sum = 10  Sum = 48  Sum = 30  Sum = 142
Note that n = 5.

    b = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)

    a = (Σy - b Σx) / n

Now,

    n Σxy - Σx Σy = 5×142 - 10×48 = 230

    n Σx² - (Σx)² = 5×30 - 10² = 50

so

    b = 230/50 = 4.6    and    a = (48 - 4.6×10)/5 = 0.4

The equation is therefore y = 0.4 + 4.6x


To draw this line on a scatterplot, simply choose two values of x (usually one high and
one low) and calculate y.
x = 0 → y = 0.4 + 4.6 × 0 = 0.4
x = 5 → y = 0.4 + 4.6 × 5 = 23.4
The line is then plotted on the scattergraph as seen below.

Notice that here, since the example was purely theoretical, it did not matter which values
should be used for x and which for y - they were simply used "as they came". However
sometimes the order is important. If, for example, you were examining the effect of how
cholesterol levels of humans depend on age, it would be necessary to make cholesterol
the response variable (y) and age the explanatory variable (x). It would be nonsensical
to have an equation that calculates a person’s age given their cholesterol level.
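The table method above can be cross-checked with a short sketch (added for illustration; not part of the original module):

```python
def least_squares(x, y):
    """Return (a, b) for the regression line y = a + bx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)
    a = (sy - b * sx) / n
    return a, b

# Data from example 1 above
a, b = least_squares([0, 1, 2, 3, 4], [0, 5, 11, 13, 19])
print(round(a, 4), round(b, 4))  # 0.4 4.6

# The two points used to draw the line on the scatterplot:
print(round(a + b * 0, 4), round(a + b * 5, 4))  # 0.4 23.4
```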

2. Montreal heating data


Problem:
Similar calculations to those in the abstract example above are carried out on the data
relating to the use of gas for heating in section 6.1 and the following parameters for the
least squares regression lines are found to be:


r b a
High -0.9940 -23.9072 743.8478
Low -0.9912 -24.6422 508.6054
Solution:
The two equations for the regression lines are thus
Gas Used = 743.8478 - 23.9072 × High Temperature
Gas Used = 508.6054 - 24.6422 × Low Temperature
Again, it is much more sensible to find equations for "gas used" in terms of "temperature"
rather than the other way round, so "gas used" is taken as the response variable and
"temperature" the explanatory one.
The lines can now be drawn on the earlier graphs as follows:

3.
Problem:
The following four data sets are from Francis J. Anscombe, "Graphs in Statistical Analysis".

x1 10 8 13 9 11 14 6 4 12 7 5
y1 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
x2 10 8 13 9 11 14 6 4 12 7 5
y2 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.10 9.13 7.26 4.74
x3 10 8 13 9 11 14 6 4 12 7 5
y3 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
x4 8 8 8 8 8 8 8 8 8 8 19
y4 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.50
Solution:
The correlation and least squares lines for the four data sets are shown in the following
list:


Set 1: r = 0.8164, y1 = 0.5001x1 + 3.0001


Set 2: r = 0.8162, y2 = 0.5001x2 + 3.0009
Set 3: r = 0.8163, y3 = 0.4997x3 + 3.0025
Set 4: r = 0.8165, y4 = 0.4999x4 + 3.0017
The scatterplots can then be plotted as usual.

On inspection of the plots it is realised that only the third and fourth represent strong
linear relationships (both with one influential outlier). The first data set represents a
moderate linear relationship, while the second represents a curved relationship. This
shows that correlation and the least squares regression line can suggest a linear
relationship even if there is none so visual inspection of the data is vital.
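Anscombe's point can be verified numerically. The following sketch (added for illustration; not part of the original module) computes r for all four sets with the computational formula from section 6.2:

```python
from math import sqrt

def correlation(x, y):
    """Product Moment Correlation Coefficient via the computational formula."""
    n = len(x)
    sx, sy, sxy = sum(x), sum(y), sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(v * v for v in x), sum(v * v for v in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets 1-3
x4 = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]

# All four correlations are nearly identical despite very different plots
for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(correlation(x, y), 4))  # ≈ 0.8164, 0.8162, 0.8163, 0.8165
```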

EC Data
The following data represent the number of members in the EC Council of Ministers of
(1) current EC members and of (2) potential EC members, and the populations of the
member states:
(1) Current members


Country Current Members Population (millions)


Germany 29 82.038
Great Britain 29 59.247
France 29 58.966
Italy 29 57.610
Spain 27 39.394
Netherlands 13 15.760
Greece 12 10.533
Belgium 12 10.213
Portugal 12 9.980
Sweden 10 8.854
Austria 10 8.082
Denmark 7 5.313
Finland 7 5.160
Ireland 7 3.744
Luxembourg 4 0.429
(2) The number of seats agreed for potential members at the EC meeting in Nice at the
end of 2000:

Country Potential Members Population(millions)


Poland 27 33.667
Romania 14 22.489
Hungary 12 10.092
Bulgaria 10 10.230
Slovakia 7 5.393
Czech Republic 12 10.290
Lithuania 7 3.701
Latvia 4 2.439
Slovenia 4 1.978
Estonia 4 1.446
Cyprus 4 0.752
Malta 3 0.379
Calculate the correlation coefficient and the equations of the regression lines for both
sets of results, draw the scatterplots including the least squares regression lines and
comment on the suitability of the equation. Make sure you choose the response and
explanatory variables sensibly.

6.4 Confidence Intervals


The last section showed that regression lines can always be found even if the graph
looks anything but linear! It is therefore important to develop a procedure in order to


gain confidence in the least squares regression line.


Some assumptions are made about the error term ε:

• For the random variable ε, E(ε) = 0;
• The standard deviation of ε is a constant σ;
• The distribution of ε is Normal;
• Errors associated with different observations are independent.

Under the assumptions above let SSE = Σ(y - ŷ)². Then SSE/σ² has a chi-square
distribution with (n - 2) degrees of freedom, and s² = SSE/(n - 2) is an unbiased
estimator for σ².
It is desirable to find confidence intervals for the slope of the line, b. It can be shown that
b follows the t distribution with (n - 2) degrees of freedom, with mean b and
standard error

    s_b = s / √(Σ(x - x̄)²)

Note that the standard error can also be calculated from the formula

    s_b = √(Σ(y - ŷ)² / (n - 2)) / √(Σ(x - x̄)²)

Thus the end-points of a 95% confidence interval for the slope b are

    b ± t_(n-2),0.025 × s_b

The value t_(n-2),0.025 can be obtained from tables in the same way as it was when the
difference between two sample means was being examined.
In general, the end-points of a (1 - 2α)100% confidence interval are given by:

    b ± t_(n-2),α × s_b

Example
Problem:
Consider the data below:

x 1 2 3 4 5 6
y 1 2 2 4 4 6
Find a 95% confidence interval for the slope of the regression line.


Solution:
It is easily confirmed that a = -0.1333, b = 0.9429 and r = 0.9613.
The standard error is

    s_b = √(Σ(y - ŷ)² / (n - 2)) / √(Σ(x - x̄)²) = 0.1351

The "cut-off" points for the t distribution with 4 degrees of freedom and a 95% confidence
interval are ±2.776.
Therefore, the 95% confidence interval is 0.9429 ± 2.776 × 0.1351.
This gives the interval [0.5679, 1.3179].
The two extreme regression lines are y = 0.5679x - 0.1333 and y = 1.3179x - 0.1333.
All three equations are drawn on the graph below.

Thus, with 95% confidence it can be said that the true linear regression line lies within
the cone displayed in the data plot, the boundaries of the cone being the two regression
lines with slopes 0.5679 and 1.3179 respectively. The cone seems very large, but
keep in mind that it is based on only a small number of data points.
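The whole confidence interval calculation can be reproduced in a short sketch (added for illustration; small rounding differences against the hand calculation are expected):

```python
from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 2, 4, 4, 6]
n = len(x)

# Least squares estimates
b = (n * sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(v * v for v in x) - sum(x) ** 2)
a = (sum(y) - b * sum(x)) / n

# Standard error of b: s divided by sqrt of the x sum of squares
xbar = sum(x) / n
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))
se_b = s / sqrt(sum((xi - xbar) ** 2 for xi in x))

t_crit = 2.776  # t cut-off, 4 degrees of freedom, 95% (from tables)
lo = b - t_crit * se_b
hi = b + t_crit * se_b
print(round(lo, 3), round(hi, 3))  # ≈ 0.568 1.318, the text's [0.5679, 1.3179] to rounding
```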

Response time to 999 calls.


A study is made into the response time to 999 calls. It measures the time, y, (in minutes)
it takes the ambulance to arrive at the scene against the distance, x, (in kilometres)
between the station and the scene. The following data are collected:

x 3.4 1.8 4.6 2.3 3.1 5.2 0.6 2.9 2.7 4.0 2.3 1.0 6.3 4.5 3.5
y 2.5 1.8 3.0 2.6 2.9 3.9 1.6 2.3 2.0 3.4 2.5 1.8 2.6 2.8 2.8
Calculate the least squares regression line and draw it on a scatterplot together with two
more lines that determine a cone formed by the 95% confidence interval values for b.



6.5 Hypothesis Tests


Hypothesis tests can be used to check whether certain parameters produced in the
methods of analysis in this chapter are significantly different from zero.

6.5.1 Hypothesis Test for b


It would be useful to know if a calculated regression line really does have an upward (or
downward) slope, or whether the points actually lie on a horizontal line. If it can
be shown that the value of b is zero then it can be concluded that the line is horizontal
(or at least that it does not have a significant slope). A hypothesis test can be used to do
this. It has already been mentioned that b follows a t distribution with (n - 2) degrees of
freedom, and the formula for the standard error of b (the slope of the regression
line) has been given as

    s_b = s / √(Σ(x - x̄)²)

This can be used in a test statistic for b, calculated in much the same way as for small
samples in Topic 4.
As an example, take the data used in the last section.

x 1 2 3 4 5 6
y 1 2 2 4 4 6
b is calculated as 0.9429 and the standard error as 0.1351
Set up the hypotheses:
H0 : b = 0
H1 : b ≠ 0

The test statistic is t = b / s_b.
This calculates as 0.9429 / 0.1351 = 6.98.
Drawing a t distribution curve with 4 degrees of freedom and a significance level of 5%
results in the following diagram.


Since the test statistic is in the shaded area the alternative hypothesis (H1) is accepted.
There is evidence at the 5% level that the slope of the regression line is not equal to
zero.
Note that for convenience b is used as the notation in the hypotheses even although in
the hypothesis test it actually refers to the population as opposed to the sample.
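The test statistic can be checked with a short sketch (added for illustration), reusing the least squares and standard error formulas from earlier sections:

```python
from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 2, 4, 4, 6]
n = len(x)

# Slope, intercept and standard error of the slope
b = (n * sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(v * v for v in x) - sum(x) ** 2)
a = (sum(y) - b * sum(x)) / n
xbar = sum(x) / n
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
se_b = s / sqrt(sum((xi - xbar) ** 2 for xi in x))

t = b / se_b
print(round(t, 2))  # ≈ 6.98: well beyond the two-tailed 5% cut-off of 2.776 for 4 df
```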

6.5.2 Hypothesis Test for r


As has already been mentioned, the correlation coefficient, r, measures the closeness of
the linear relationship between any two variables. The values range from -1 to +1 with
the extreme values meaning a very good relationship and values close to 0 implying no
relationship.
A t statistic can be used for r, in much the same way as it was with b, to determine
whether r is significantly different from zero.
The test statistic is given by the equation

    t = r √(n - 2) / √(1 - r²)

with (n - 2) degrees of freedom.

Examples

1.
Problem:

x 1 3 4 5 3 8
y 2 6 5 8 1 3
The scatterplot is drawn below

Solution:
The plot does not look very linear and this is borne out by calculation of r, which is
0.2241.
A hypothesis test is now carried out.
H0 : r = 0
H1 : r ≠ 0

The test statistic is t = r √(n - 2) / √(1 - r²).
This calculates as 0.2241 × √4 / √(1 - 0.2241²) = 0.4599.
Compare with the t distribution with 4 degrees of freedom.

Since the test statistic is not in the shaded region the null hypothesis is accepted. There
is no evidence, at the 5% level, of a significant relationship between the two variables.
One-tailed tests can also be used.
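The test statistic is simple enough to wrap in a small helper (a sketch added for illustration; the function name is ours):

```python
from math import sqrt

def t_for_r(r, n):
    """Test statistic for the hypothesis test on r, with (n - 2) degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

print(round(t_for_r(0.2241, 6), 4))    # example 1 above: ≈ 0.4599 (not significant)
print(round(t_for_r(-0.9940, 12), 1))  # Montreal heating data: ≈ -28.7
```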

2.
Problem:
Recall that the Montreal heating data in section 6.1 produced the following results:

r b a
High -0.9940 -23.9072 743.8478
Low -0.9912 -24.6422 508.6054
The two equations for the regression lines were
Gas Used = 743.8478 - 23.9072 × High Temperature
Gas Used = 508.6054 - 24.6422 × Low Temperature
There were 12 data points considered.
Solution:
Consider the first value of r, namely -0.9940. A hypothesis test can be used to confirm
that this is significantly less than 0.
H0 : r ≥ 0
H1 : r < 0
The test statistic, t = r √(n - 2) / √(1 - r²), calculates as -28.7.
The t curve is drawn below (10 degrees of freedom) with a significance level of 0.1%
shaded.


Since the test statistic is in the shaded area, the null hypothesis is rejected. There is
evidence using a significance level of 0.001 that the value of r is less than 0. It can be
said, then, that there is a very highly significant negative correlation between the two
variables.
This can similarly be repeated for the other value of r in the example.
Once again in this section the "r" in the hypothesis test actually refers to the population
rather than the sample.

6.6 Multiple Linear Regression


This topic ends with a brief look at multiple linear regression. The calculations are
carried out from first principles but would more usually be analysed by a statistical
computer package such as Minitab.
Consider the following table showing information about apartment blocks sold in a big
city:

Number Apartments (x) Number Floors (y) Price (in $million) (z)
60 10 78.2
40 5 45.4
80 10 100.0
30 6 35.7
60 3 80.5
40 6 42.8
90 12 120.4
80 7 90.5
Using the method of least squares, it is desired to find the equation of a plane that will
predict the price of an apartment block (z) in terms of the numbers of apartments (x)
and number of floors (y).
The equation of the plane is z = a + bx + cy where a, b and c are parameters to be
estimated.


For an observed data point z, the error is given by: e = z - ẑ = z - (a + bx + cy).
For least squares, it is required to minimise Σe².
Therefore the problem becomes:
Minimise S = Σ(z - a - bx - cy)².
This equation can be partially differentiated with respect to a, b and c to obtain:

    ∂S/∂a = Σ 2(z - a - bx - cy)(-1)
    ∂S/∂b = Σ 2(z - a - bx - cy)(-x)
    ∂S/∂c = Σ 2(z - a - bx - cy)(-y)

To minimise, equate all three to 0 (the "2"s cancel):

    a Σ1 + b Σx + c Σy = Σz
    a Σx + b Σx² + c Σxy = Σxz
    a Σy + b Σxy + c Σy² = Σyz

Now, for the data above, the various sums can be calculated and substituted into these
equations. It can be shown that:

    Σx = 480, Σx² = 32200, Σxy = 3840, Σy = 59, Σy² = 499,
    Σz = 593.5, Σxz = 40197, Σyz = 4799.8

Also note that Σ1 = n, the number of points (in this case 8).
The equations become:
8a + 480b + 59c = 593.5
480a + 32200b + 3840c = 40197
59a + 3840b + 499c = 4799.8
The method of solution of simultaneous equations can be used to solve for a, b and c
(matrix methods can also be employed).
The solution is a = -7.76106, b = 1.30665 and c = 0.48129
The equation of the least squares plane through the data is
z = -7.76106 + 1.30665x + 0.48129y
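The solution of the three normal equations can be checked with a short sketch (added for illustration; a statistics package such as Minitab would normally be used, as noted above, but plain Gaussian elimination suffices here):

```python
def solve3(m, v):
    """Solve a 3x3 linear system m·[a, b, c] = v by Gaussian elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented matrix
    for i in range(3):
        # Partial pivoting for numerical stability
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            m[r] = [mr - f * mi for mr, mi in zip(m[r], m[i])]
    # Back substitution
    sol = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        sol[i] = (m[i][3] - sum(m[i][j] * sol[j] for j in range(i + 1, 3))) / m[i][i]
    return sol

# The three normal equations from the apartment-block data
coeffs = [[8, 480, 59], [480, 32200, 3840], [59, 3840, 499]]
rhs = [593.5, 40197, 4799.8]
a, b, c = solve3(coeffs, rhs)
print(round(a, 5), round(b, 5), round(c, 5))  # ≈ -7.76106 1.30665 0.48129
```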


6.7 Summary and Assessment


At this stage you should be able to:

• draw scatterplots to investigate for a relationship between x and y
• define the Product-Moment Correlation Co-efficient, r
• interpret values of r as giving a good or bad linear relationship between variables
• calculate r using a methodical manual method
• distinguish between an explanatory variable and a response variable
• calculate the equation of a Least Squares Regression Line through a set of points
• calculate confidence intervals for the slope of the regression equation
• carry out hypothesis tests for b and r
• solve a multiple linear regression problem with the aid of a computer package


Glossary
Correlation
is a number that measures strength and direction of the linear relationship between
two data sets.


Answers to questions and activities


6 Regression and Correlation
EC Data (page 10)

The correlation co-efficient, r, calculates as 0.9552, which suggests a good linear


correlation. However the graph does appear rather curved.
The regression equation is y = 7.130 + 0.347x

For the second table, r calculates as 0.9700 and the regression equation is y = 0.6452x
+ 3.3924. The graph is shown below.

The first scatter plot does not necessarily suggest a linear relationship between number
of inhabitants and seats, but the second does look linear. The combined data set does,
in fact, suggest a logarithmic relationship instead of a linear one.

Response time to 999 calls. (page 13)

If you have Microsoft Excel, the regression equation and correlation coefficient can be
obtained very quickly. The command "Correl" (from the function menu) will produce the
Product Moment Correlation Co-efficient, r, whilst the commands "slope" and "intercept"
will give you b and a respectively.


This package is used on the response times for 999 calls data and the values obtained
are:
r = 0.746049
b = 0.29909
a = 1.60559
The regression equation is therefore y = 0.29909x + 1.60559
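These Excel results can also be reproduced directly from the formulas in this topic (sketch added for illustration):

```python
from math import sqrt

x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.2, 0.6, 2.9, 2.7, 4.0, 2.3, 1.0, 6.3, 4.5, 3.5]
y = [2.5, 1.8, 3.0, 2.6, 2.9, 3.9, 1.6, 2.3, 2.0, 3.4, 2.5, 1.8, 2.6, 2.8, 2.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(p * q for p, q in zip(x, y))

# Computational formulas for r, b and a from sections 6.2 and 6.3
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
b = (n * sxy - sx * sy) / (n * sxx - sx**2)
a = (sy - b * sx) / n
print(round(r, 5), round(b, 5), round(a, 5))  # ≈ 0.74605 0.29909 1.60559
```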
Now, confidence intervals for b can be calculated. The standard error of b is given by

    s_b = √(Σ(y - ŷ)² / (n - 2)) / √(Σ(x - x̄)²) = 0.074040
The value of t0.025, 13 is found from tables to be 2.160. (If you are using Excel, you
might like to try using the "Tinv" function to also obtain this number, but note that it
automatically does a two-tailed test so 0.05 has to be entered as opposed to 0.025).
So the 95% confidence interval for b is 0.29909 ± 2.160 × 0.074040
This gives an interval of [0.1392, 0.4590].
Therefore, the two lines to produce the cone are given by
y = 0.1392x + 1.60559
y = 0.4590x + 1.60559
The lines are drawn on the scatterplot as follows
