
CORRELATION AND REGRESSION ANALYSIS

LINEAR REGRESSION AND CORRELATION ANALYSIS

The purpose of linear regression is to develop a mathematical relationship (model) between variables that can be used to estimate the value of one variable when the value of another variable is known. The relationship has the form of a straight line, which is why it is called linear regression.

Linear regression analysis is classified into two types: simple linear regression and multiple linear regression.

SIMPLE LINEAR REGRESSION

In simple linear regression, we develop a relationship between one dependent variable and one independent (explanatory) variable. The relationship we need to establish has the form

Y = α + βX + ε

where
Y is the dependent variable
α is the intercept on the Y-axis
β is the gradient (slope) of the relationship
X is the independent (explanatory) variable
ε is the random error in Y

Since α and β cannot be known exactly from sample data, as is usual in inferential statistics, we estimate the relationship by

Y = a + bX

where
Y is the dependent variable
a is the estimate of α
b is the estimate of β
X is the independent variable

To establish the relationship, we need to find the values of ‘a’ and ‘b’. The method of least squares gives:

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$

$$a = \frac{\sum Y - b\sum X}{n}$$

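The two formulas above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `least_squares_fit` and the sample data are hypothetical, chosen only to show the arithmetic.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the fitted line Y = a + bX using the sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are identical, so the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = (sum_y - b * sum_x) / n                     # intercept
    return a, b


# Made-up data lying exactly on the line Y = 2X (assumption, for illustration).
a, b = least_squares_fit([1, 2, 3], [2, 4, 6])
print(a, b)  # prints: 0.0 2.0
```

Because the sample points lie exactly on a line through the origin, the fit recovers an intercept of 0 and a slope of 2.
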
INTERPRETATION OF THE REGRESSION COEFFICIENTS

‘a’ and ‘b’ are sometimes called regression coefficients and have the following interpretations:

‘a’ (the Y-intercept) shows the value that the dependent variable takes when the explanatory variable is zero.

‘b’ (the slope) has two interpretations.

First, it shows the direction of the relationship. If its value is positive, there is a positive relationship between the two variables. Conversely, if the value of the slope is negative, the two variables are negatively related.

Second, it shows the amount by which Y changes when X increases by one unit.


OR

Simple linear regression is the most commonly used technique for determining how one variable of interest (the response or dependent variable) is affected by changes in another variable (the explanatory, independent, or predictor variable). For example, a teacher might want to relate the weights of students to their heights using a linear regression model.

There are three main purposes of simple linear regression:

1. To describe the linear dependence of one variable on another.

2. To predict values of one variable from the values of another variable for which more data are available.

3. To correct for the linear dependence of one variable on another, in order to clarify other features of its variability.


Regression line: The line that best
fits a set of data points.


Regression equation: The equation
of the regression line.
Example

Distance and transport cost from the industry to the market are presented in the table below.

Distance (km), x:                 2    4    6    8
Transport cost (million Tsh), y:  3    7    5    10


a) Find the linear regression equation for the data above.

b) Graph the regression equation and the data points.

c) Describe the apparent relationship between distance and transport cost.

d) Interpret the slope of the regression line in terms of transport cost.

e) Use the regression equation to predict the transport cost for a market that is 3 km away from the industry.

SOLUTION

a) Using the least-squares formulas:

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$

$$a = \frac{\sum Y - b\sum X}{n}$$

x:    2    4    6    8
y:    3    7    5    10
xy:   6    28   30   80
x²:   4    16   36   64
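Substituting the totals from the table into the two formulas shows where the coefficients come from. The totals (n = 4, ΣX = 20, ΣY = 25, ΣXY = 144, ΣX² = 120) are computed here from the tabulated rows; the source does not list them.

```latex
b = \frac{4(144) - (20)(25)}{4(120) - (20)^2}
  = \frac{576 - 500}{480 - 400}
  = \frac{76}{80} = 0.95

a = \frac{25 - 0.95(20)}{4} = \frac{6}{4} = 1.5
```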

b = 0.95 and a = 1.5

Substituting into Y = a + bX, the linear regression equation is

Y = 1.5 + 0.95X

b) Plot the graph of the regression line together with the data points.

c) Because the slope of the regression line is positive, transport cost tends to increase as distance increases, which is no particular surprise.

d) Because x represents distance in km and y represents transport cost in million Tsh, the slope of 0.95 indicates that each additional kilometre increases the estimated transport cost by Tsh 0.95 million.

e) For a market 3 km away, x = 3, and the regression equation yields the predicted cost

Y = 1.5 + 0.95 × 3 = 4.35

Interpretation: The estimated transport cost for a market that is 3 km away is Tsh 4.35 million.

Individual Activity

Given the following two sets of data:

x:   1   2   3   4
y:   2   4   5   7

a) Find the least-squares regression line.

b) Estimate the value of y when x = 10.

SIMPLE CORRELATION ANALYSIS

Definition

The linear correlation coefficient is a descriptive measure of the strength and direction of the linear (straight-line) relationship between two variables.

In simple correlation analysis, we are interested in assessing the strength of the relationship between the two variables that we have regressed. The assessment is done using a special measure called Pearson’s product-moment correlation coefficient. This coefficient is denoted by ‘r’ and is defined by:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\{n\sum X^2 - (\sum X)^2\}\{n\sum Y^2 - (\sum Y)^2\}}}$$

r takes values between −1 and +1, i.e.

−1 ≤ r ≤ +1
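
The definition of r can be sketched in Python using only the standard library. The function name `pearson_r` and the sample data are hypothetical and serve only to illustrate the formula; the data are made up so that the points fall exactly on a downward-sloping line.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient from the sum formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when X or Y has no variation")
    return numerator / denominator


# Made-up data on the line Y = 10 - 2X (assumption, for illustration).
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # prints: -1.0
```

A perfectly linear, downward-sloping data set gives r = −1, the lower end of the range stated above.
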
INTERPRETATION OF THE CORRELATION COEFFICIENT

- When ‘r’ is 1 or very close to it, there is a very strong positive relationship between the dependent and the independent variable.

- When ‘r’ is −1 or very close to it, the relationship between X and Y is very strongly negative.

- When ‘r’ is 0, there is no linear relationship between the two variables being regressed.


Understanding the Linear Correlation Coefficient

The sign of r reflects the slope of the scatter plot. The linear correlation coefficient is positive when the scatter plot shows a positive slope and negative when the scatter plot shows a negative slope.

The magnitude of r indicates the strength of the linear relationship. A value of r close to −1 or to 1 indicates a strong linear relationship between the variables and that the variable x is a good linear predictor of the variable y.

The sign of r suggests the type of linear relationship. A positive value of r suggests that the variables are positively linearly correlated, and a negative value of r suggests that the variables are negatively linearly correlated.

Example

An auto manufacturing company wanted to investigate how the price of one of its car models depreciates with age. The research department at the company took a sample of 11 cars of this model and collected the following information on the ages (in years) and prices (in hundreds of dollars) of these cars.

Age (years), X:             5     4     6     5     5     5     6     6     2     7     7
Price (hundreds of $), Y:   85    103   70    82    89    98    66    95    169   70    48

(a) Determine the regression equation for the data.

(b) Graph the regression equation and the data points.

(c) Describe the apparent relationship between age and price of car.

(d) Interpret the slope of the regression line in terms of prices for cars.

(e) Use the regression equation to predict the price of a 3-year-old car.

(f) Compute the linear correlation coefficient, r, of the data.

(g) Interpret the value of r obtained in part (f) in terms of the linear relationship between the variables age and price of car.

(h) Discuss the graphical implications of the value of r.

SOLUTION

a) Creating a table that will help in finding the regression equation:

Age (X):     5      4      6      5      5      5      6      6      2      7      7
Price (Y):   85     103    70     82     89     98     66     95     169    70     48
xy:          425    412    420    410    445    490    396    570    338    490    336
x²:          25     16     36     25     25     25     36     36     4      49     49
y²:          7225   10609  4900   6724   7921   9604   4356   9025   28561  4900   2304

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$

$$a = \frac{\sum Y - b\sum X}{n}$$
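Substituting the column totals from the table gives the coefficients. The totals (n = 11, ΣX = 58, ΣY = 975, ΣXY = 4732, ΣX² = 326) are computed here from the tabulated values.

```latex
b = \frac{11(4732) - (58)(975)}{11(326) - (58)^2}
  = \frac{52052 - 56550}{3586 - 3364}
  = \frac{-4498}{222} \approx -20.26

a = \frac{975 - (-20.2613)(58)}{11}
  \approx \frac{2150.15}{11} \approx 195.47
```

Note that the intercept is computed with the unrounded slope; rounding b to −20.26 before substituting gives 195.46 instead of 195.47.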

Y = a + bX

Y = 195.47 − 20.26X

b) Draw the graph of price versus age, showing the regression line and the data points.

c) Since the slope of the regression line is negative, price tends to decrease as age increases, which is no particular surprise.

d) Since x represents age in years and y represents price in hundreds of dollars, the slope of −20.26 indicates that the price of the car depreciates by an estimated $2,026 per year.

e) For a 3-year-old car, the regression equation yields the predicted price

ŷ = 195.47 − 20.26 × 3 = 134.69

Interpretation: The estimated price of a 3-year-old car is $13,469.

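f)

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\{n\sum X^2 - (\sum X)^2\}\{n\sum Y^2 - (\sum Y)^2\}}}$$

$$r = \frac{11(4732) - (58)(975)}{\sqrt{\{11(326) - (58)^2\}\{11(96129) - (975)^2\}}} = \frac{-4498}{\sqrt{(222)(106794)}} \approx -0.924$$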

g) Interpretation: The linear correlation coefficient, r = −0.924, suggests a strong negative linear correlation between age and price of car. In particular, it indicates that as age increases, there is a strong tendency for price to decrease, which is not surprising. It also implies that the regression equation, ŷ = 195.47 − 20.26x, is extremely useful for making predictions.

h) Because the correlation coefficient, r = −0.924, is quite close to −1, the data points should be clustered closely about the regression line.
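
The arithmetic in parts (a) and (f) can be checked with a short Python sketch. It recomputes the sums from the age and price data given in the example; the variable names are illustrative only and do not come from the source.

```python
import math

# Ages (years) and prices (hundreds of dollars) of the 11 sampled cars.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(ages)

sum_x = sum(ages)
sum_y = sum(prices)
sum_xy = sum(x * y for x, y in zip(ages, prices))
sum_x2 = sum(x * x for x in ages)
sum_y2 = sum(y * y for y in prices)

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Pearson's correlation coefficient.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"b = {b:.2f}, a = {a:.2f}, r = {r:.3f}")
# prints: b = -20.26, a = 195.47, r = -0.924
```
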
SCATTER DIAGRAMS

Scatter diagrams are plots on the X-Y plane of the dependent variable data against the independent variable data. The common forms of scatter diagrams are indicated below.

[Figure: common forms of scatter diagrams]

COEFFICIENT OF DETERMINATION

The coefficient of determination is a measure used in statistical analysis that assesses how well a model explains and predicts future outcomes. It is indicative of the level of explained variability in the dataset. In Example 1, we determined the regression equation ŷ = 1.5 + 0.95x for data on distance and transport cost, where x represents distance, in km, and ŷ represents predicted transport cost, in million Tsh. We also applied the regression equation to predict the transport cost of a market located 3 km away:

ŷ = 1.5 + 0.95 × 3 = 4.35

But how valuable are such predictions? Is the regression equation useful for predicting transport cost, or could we do just as well by ignoring distance?

In general, several methods exist for evaluating the utility of a regression equation for making predictions. One method is to determine the percentage of variation in the observed values of the response variable that is explained by the regression (or predictor variable), as discussed below. To find this percentage, we need to define two measures of variation:

i. the total variation in the observed values of the response variable, and

ii. the amount of variation in the observed values of the response variable that is explained by the regression.
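In symbols, these two measures and their ratio can be written as follows. The labels SST and SSR are conventional names assumed here; the source does not name them. ȳ is the mean of the observed values and ŷᵢ is the value predicted by the regression equation for xᵢ.

```latex
\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2
\qquad
\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
\qquad
\frac{\text{SSR}}{\text{SST}} = \text{proportion of variation explained by the regression}
```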

The coefficient of determination, R², is used to analyze how differences in one variable can be explained by differences in a second variable. For example, the date on which a person becomes pregnant is directly related to the date on which they give birth.

More specifically, R² gives the percentage of the variation in y that is explained by the x-variables. It ranges from 0 to 1; that is, between 0% and 100% of the variation in y can be explained by the x-variables.

R² is closely related to the correlation coefficient, r. The correlation coefficient tells you how strong the linear relationship between two variables is. In simple linear regression, R² is the square of the correlation coefficient, r (hence the term r squared).

Finding R Squared / The Coefficient of Determination

Step 1: Find the correlation coefficient, r (it may be given to you in the question). Example: r = 0.543.

Step 2: Square the correlation coefficient.
0.543² = 0.295

Step 3: Convert the squared value to a percentage.
0.295 = 29.5%
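
The three steps can be tried on the distance and transport cost data from Example 1. The sketch below is illustrative only (the variable names are not from the source); it also cross-checks the squared correlation against the ratio of explained variation to total variation described earlier.

```python
import math

# Distance (km) and transport cost (million Tsh) from Example 1.
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Step 1: correlation coefficient r.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Step 2: square it.
r_squared = r ** 2

# Cross-check: explained variation divided by total variation.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
mean_y = sum_y / n
explained = sum((a + b * x - mean_y) ** 2 for x in xs)
total = sum((y - mean_y) ** 2 for y in ys)

# Step 3: express the result as a percentage.
print(f"r = {r:.3f}, r squared = {r_squared:.3f} ({r_squared:.1%})")
print(f"explained / total = {explained / total:.3f}")
# prints:
# r = 0.821, r squared = 0.675 (67.5%)
# explained / total = 0.675
```

On these four data points, about 67.5% of the variation in transport cost is explained by distance.
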
Meaning of the Coefficient of Determination

The coefficient of determination can be thought of as a percentage. It indicates how much of the variation in the observed values of the dependent variable is accounted for by the regression equation, not how many data points lie on the line. The higher the coefficient, the more closely the data points cluster about the regression line when both are plotted. If the coefficient is 0.80, then 80% of the variation in Y is explained by the regression on X. A value of 1 indicates that all the data points lie exactly on the regression line, and a value of 0 indicates that the line explains none of the variation. A higher coefficient indicates a better goodness of fit for the observations.
