
Simple Linear Regression and Correlation

• Regression Analysis is concerned with describing and
evaluating the relationship between a dependent variable
and one or more independent variables.
• Regression Analysis is a statistical technique that can be
used to develop a mathematical equation showing how
variables are related.
• It is used to bring out the nature of the relationship and to
obtain the best approximate value of one variable from the
other.
• Therefore, we will deal with the problem of estimating
and/or predicting the population mean/average value of
the dependent variable on the basis of known values of the
independent variable(s).
• The variable whose value is to be
estimated/predicted is known as the dependent
variable.
• The variables which help us in determining the
value of the dependent variable are known as
independent variables.
• A regression equation which involves only two
variables, a dependent and an independent one, is
referred to as simple regression.
• This model assumes that the dependent
variable is influenced only by one systematic
variable and the error term.
• However, when several variables (more than two
in total) are included in the model, it is
called multiple/multivariate regression.
• The relationship between any two variables
may be linear or non-linear.
• Linear implies a constant absolute change in
the dependent variable in response to a unit
change in the independent variable.
• Non-linear implies varying marginal change in
the dependent variable in response to
changes in the independent variable.
The Scatter Diagram
• Consider the following data collected by taking a
sample of five industries in a given industrial
sector on their input (number of workers) and
output (thousands of birr).
Industry   Output (Y), thousands of birr   Input (X), no. of workers   Paired data (X, Y)
1          4                               2                           (2, 4)
2          7                               3                           (3, 7)
3          3                               1                           (1, 3)
4          9                               5                           (5, 9)
5          17                              9                           (9, 17)

Output level (Yi) is believed to depend on the
number of workers (Xi). Accordingly, Yi is the
dependent variable and Xi is the independent
variable.
In order to visualize the form of regression we plot these points
on a graph as shown in fig. 6.1. What we get is a scatter diagram.
• When carefully observed, the scatter diagram at
least shows the nature of relationship; whether
positive or negative and whether the curve is
linear or non-linear.
• When the general course of movement of the
paired points is best described by a straight line
• the next task is to fit a regression line which lies
as close as possible to every point on the scatter
diagram.
• This can be done by means of either freehand
drawing or the method of least squares.
• However, the latter is the more widely used
method.
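• As a quick illustration, the scatter diagram for the five industries above can be drawn with a short Python sketch (assuming matplotlib is available; the variable names are ours):

import matplotlib.pyplot as plt

# Data from the table above: input X = number of workers, output Y = thousands of birr
workers = [2, 3, 1, 5, 9]
output = [4, 7, 3, 9, 17]

plt.scatter(workers, output)
plt.xlabel("Input X (number of workers)")
plt.ylabel("Output Y (thousands of birr)")
plt.title("Scatter diagram of output against input")
plt.show()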
The Regression Equation
• Regression equation is a statement of equality
that defines the relationship between two
variables.
• The equation of the line which is to be used in
predicting the value of the dependent variable
takes the form Ye = a + bX, where a is the
Y-intercept and b is the slope.
• The most universally used and statistically
accepted method of fitting such an equation is
the method of least squares.
The Method of Least Squares
• This method requires that the straight line be fitted
so that the sum of the squared vertical deviations of the
observed Y values from the straight line (the predicted Y
values) is a minimum.
• If e1, e2, …, en are the vertical deviations of the observed Y
values from the straight line (the predicted Y values, Ye),
fitting a straight line in keeping with the above condition
requires that (for a sample of size n)

   ∑ei² = ∑(Yi − Ye)² = ∑(Yi − a − bXi)²   be a minimum.

• ei is the error made when taking Ye instead of Yi;
therefore, ei = Yi − Ye.
• This can be done by partially differentiating ∑ei² with
respect to "a" and "b" and equating the derivatives to zero.
Solving the resulting normal equations gives

   b = (∑XiYi − n·X̄·Ȳ) / (∑Xi² − n·X̄²)   and   a = Ȳ − b·X̄
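• The formulas for b and a above translate directly into code; the sketch below is a minimal plain-Python version (the helper name fit_least_squares is ours):

def fit_least_squares(x, y):
    # b = (sum(XiYi) - n*Xbar*Ybar) / (sum(Xi^2) - n*Xbar^2), a = Ybar - b*Xbar
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
        (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b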
Example
• Suppose we want to study the relationship
between input (number of workers) and
output (thousands of Birr) of the five factories
given in the table above.
• To fit the regression line of Yi (thousands of
Birr) on Xi (number of workers), we can employ
the method of least squares as follows:
 Arrange the data in tabular form:
        Yi    Xi    Yi·Xi    Xi²
        4     2     8        4
        7     3     21       9
        3     1     3        1
        9     5     45       25
        17    9     153      81
Total   40    20    230      120
Mean    8     4
Solution
• Substituting these values in the above equations, we get

   b = (230 − 5×4×8) / (120 − 5×4²) = 70/40 = 1.75
   a = 8 − 1.75×4 = 1

 Therefore, the least squares regression equation equals

   Ye = 1 + 1.75X

• Estimate the level of output (in thousands of Birr) a factory
will have if it has 8 workers, i.e., Xi = 8:

   Ye = 1 + 1.75×8 = 15

• Consequently, if a factory has 8 workers, its level
of output will be 15 thousand ETB.
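• The arithmetic of the Solution can be checked with a few lines of Python using the column totals of the table (a sketch for verification only):

n = 5
sum_y, sum_x, sum_xy, sum_x2 = 40, 20, 230, 120   # totals from the table
x_bar, y_bar = sum_x / n, sum_y / n               # 4 and 8

b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # 1.75
a = y_bar - b * x_bar                                          # 1.0
print(a, b)          # Ye = 1 + 1.75X
print(a + b * 8)     # 15.0 -> 15 thousand Birr for 8 workers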
Example 6.2. In what follows you are provided with
sample observations on price and quantity
supplied of a commodity X by a competitive firm.
a) Construct the scatter diagram
b) What is the linear regression of Yi (quantity
supplied) on Xi (price of commodity X)?
c) Suppose the price of commodity X is 32; what
will be the quantity supplied by the firm?
• Tab. 6.3. Data on price and quantity supplied.
• If the price of x is 32, the estimated quantity
supplied will be approximately equal to 51
units.
Regression of X on Y
• In the above sub-topic we have explored
regression of Y on X type.
• Sometimes, it is possible and of interest to fit
a regression of the X on Y type, i.e., with Y as the
independent and X as the dependent variable.
• In such cases, the general form of the equation
is given by

   Xe = a0 + b0·Y

• Where Xe = expected value of X
• a0 – X-intercept
• b0 – slope of the regression
• Applying the principle of least squares as
before, the constants a0 and b0 are given as
follows:

   b0 = (∑XiYi − n·X̄·Ȳ) / (∑Yi² − n·Ȳ²)   and   a0 = X̄ − b0·Ȳ

N.B. The regression equation of Y on X type and
of X on Y type coincide at (X̄, Ȳ).
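• A small Python sketch (reusing the factory data from the earlier example; the variable names are ours) fits the X on Y regression and confirms that the line passes through (X̄, Ȳ):

x = [2, 3, 1, 5, 9]     # workers
y = [4, 7, 3, 9, 17]    # output (thousands of birr)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b0 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(yi ** 2 for yi in y) - n * y_bar ** 2)
a0 = x_bar - b0 * y_bar

print(a0, b0)                  # intercept and slope of X on Y
print(a0 + b0 * y_bar, x_bar)  # both approximately 4.0: the line passes through (X̄, Ȳ)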
Correlation
• The correlation coefficient measures the degree to
which two variables are related/associated.
• Correlation between two variables is called simple
correlation and is denoted by r.
• For more than two variables we have multiple
correlation.
• Two variables may have either positive
correlation, negative correlation or may not be
correlated.
• Furthermore, depending on the form of
relationship the correlation between two variables
may be linear or non-linear.
• When higher values of X are associated with higher values
of Y and lower values of X are associated with lower
values of Y, then the correlation is said to be positive or
direct.
• Examples:
– Income and expenditure
– Number of hours spent in studying and the score obtained
– Height and weight
– Distance covered and fuel consumed by car.
• When higher values of X are associated with lower values
of Y and lower values of X are associated with higher
values of Y, then the correlation is said to be negative or
inverse.
• Examples:
– Demand and supply
• The correlation between X and Y may be one of the following:
 Perfect positive (r = 1)
 Positive (r between 0 and 1)
 No correlation (r = 0)
 Negative (r between -1 and 0)
 Perfect negative (r = -1)
• The presence of correlation between two variables may be due
to three reasons:
One variable being the cause of the other. The
cause is called “subject” or “independent” variable,
while the effect is called “dependent” variable.
Both variables being the result of a common cause.
That is, the correlation that exists between two
variables is due to their being related to some third
force.
Example:
Let X1 = ESLCE result
    Y1 = rate of surviving in the University
    Y2 = rate of getting a scholarship
• Both X1 & Y1 and X1 & Y2 have high positive correlation; likewise,
• Y1 & Y2 have positive correlation, but they are not directly
related; rather, they are related to each other via X1.

Chance:
• The correlation that arises by chance is called spurious
correlation.

• Examples:
• Price of teff in Addis Ababa and grades of students in the USA.
• Weight of individuals in Ethiopia and income of individuals in
Kenya.
• Therefore, in this section, we shall be concerned with
quantifying the degree of association between two
variables with a linear relationship.
• Contrary to regression analysis explained in the
previous section, the computation of the coefficient of
correlation does not require one variable to be designated
as dependent and the other as independent.
• The measure of the degree of relationship between
any two variables, known as the Pearsonian
coefficient of correlation and usually denoted by r, is
defined as

   r = ∑(Xi − X̄)(Yi − Ȳ) / √[ ∑(Xi − X̄)² · ∑(Yi − Ȳ)² ]

• This is termed the product–moment formula.
• It can be further simplified as

   r = (∑XY − n·X̄·Ȳ) / √[ (∑X² − n·X̄²)(∑Y² − n·Ȳ²) ]

NB. The building blocks of this formula are, therefore,
∑XY, ∑X², ∑Y², X̄, Ȳ, and n (the sample size).
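• The simplified product–moment formula can be written as a short Python helper (a sketch; the name pearson_r is ours):

from math import sqrt

def pearson_r(x, y):
    # r = (sum(XY) - n*Xbar*Ybar) / sqrt((sum(X^2) - n*Xbar^2)(sum(Y^2) - n*Ybar^2))
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    den = sqrt((sum(xi ** 2 for xi in x) - n * x_bar ** 2) *
               (sum(yi ** 2 for yi in y) - n * y_bar ** 2))
    return num / den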
Properties of pearsonian coefficient of correlation

• When r = 0, there is no linear correlation.
• When r = 1 (or −1), there is perfect positive
(or negative) correlation.
• Adding a constant number to each value of
X and Y, as well as multiplying each value
by a positive constant, does not affect the value of r.
• When r is positive and close to 1 then there is
high positive correlation while when it is close
to zero it shows low positive correlation.
• Similarly, when r is negative and close to -1 then
there is high negative correlation while when it
is close to zero it shows low negative correlation.
• r is free of the units in which X and Y are measured.
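• The origin-and-scale property above is easy to verify numerically; the sketch below (assuming numpy is installed) shifts and rescales the factory data and shows that r is unchanged:

import numpy as np

x = np.array([2, 3, 1, 5, 9])
y = np.array([4, 7, 3, 9, 17])

r_original = np.corrcoef(x, y)[0, 1]
r_transformed = np.corrcoef(3 * x + 10, 0.5 * y - 2)[0, 1]   # shift and positive rescale
print(r_original, r_transformed)   # both approximately 0.994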
• Example: Find the Pearsonian coefficient of
correlation for the two variables in the data of
the following table.
No.          Yi    Xi    Xi²    Yi²    Xi·Yi
1            4     2     4      16     8
2            7     3     9      49     21
3            3     1     1      9      3
4            9     5     25     81     45
5            17    9     81     289    153
∑ (total)    40    20    120    444    230
• r = (230 − 5×4×8) / √[(120 − 5×4²)(444 − 5×8²)] = 70 / √(40×124) ≈ 0.99
• Interpretation: r ≈ 0.99 implies a strong positive linear
relationship between X and Y.
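• The value of r can be re-computed from the column totals with a few lines of Python (a sketch for checking the arithmetic):

from math import sqrt

n = 5
sum_y, sum_x, sum_x2, sum_y2, sum_xy = 40, 20, 120, 444, 230
x_bar, y_bar = sum_x / n, sum_y / n

r = (sum_xy - n * x_bar * y_bar) / sqrt(
    (sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2))
print(round(r, 3))   # about 0.994 -> strong positive linear relationship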
Spearman’s Rank Correlation Coefficient
• The above formula and procedure are only applicable to
quantitative data; when we have qualitative data
such as efficiency, honesty, intelligence, etc.,
• we calculate what is called Spearman's rank correlation
coefficient as follows:
• The Pearsonian coefficient of correlation cannot be
used in cases where direct quantitative measurement
of the phenomenon under study is not possible.
• In such cases, we make use of the rank correlation
coefficient.
Steps involved to calculate Spearman's coefficient of
rank correlation:
1. Rank the X values among themselves, giving rank (1) to
the largest (or smallest) value and rank (2) to the next
largest (or smallest) value, and so on.
2. Rank the Y values among themselves in a similar way
to that of X.
3. When there are ties in rank, i.e., when there are values
sharing the same rank, assign to each of the tied
observations the mean of the ranks they jointly
occupy; the next rank(s) are then skipped.
4. Find the sum of the squares of the differences
between the ranks of the two variables.
5. Apply the formula

   rs = 1 − (6·∑di²) / [n(n² − 1)]

   where n = number of pairs of observations and
   di = difference between the ranks of X and Y for the i-th pair.

• As the steps above indicate, rs may also be calculated
for numerical data after ranking the values
according to numerical size.
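• The steps above can be sketched in Python as follows (the helper names mean_ranks and spearman_rs are ours; ties receive the mean of the ranks they occupy, as in step 3):

def mean_ranks(values):
    # Rank from smallest (rank 1) upward; tied values share the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rs(x, y):
    # rs = 1 - 6*sum(d^2) / (n(n^2 - 1)), with d = difference of ranks
    rx, ry = mean_ranks(x), mean_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))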
Example: Consider the ranks given by two judges for five ladies in a beauty contest:

Ladies    Judge A (RA)    Judge B (RB)    di    di²
AZEB      1               2               1     1
TIZITA    3               4               1     1
FATUMA    4               3               -1    1
LEMLEM    2               1               -1    1
CHALTU    5               5               0     0
Total                                           ∑di² = 4
• rs = 1 − (6×4) / [5(5² − 1)] = 1 − 24/120 = 0.8
• Interpretation: Since rs = 0.8, it implies that there is a
strong similarity between the ranks of Judge A and Judge B.
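• The judges' result can be checked directly (ranks typed in by hand from the table above):

rank_a = [1, 3, 4, 2, 5]   # Judge A: AZEB, TIZITA, FATUMA, LEMLEM, CHALTU
rank_b = [2, 4, 3, 1, 5]   # Judge B

n = len(rank_a)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, rs)          # 4 and 0.8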
• Exercise: Aster and Almaz were asked to rank 7 different
types of lipsticks. Check whether there is correlation
between the tastes of the two ladies.
Summary
• In a simple relationship, there are only two variables under
study.
• In multiple relationships, many variables are under study
• A positive correlation exists when both variables increase
or decrease at the same time
• In a negative correlation, as one variable increases the
other variable decreases, and vice versa
• If the correlation coefficient is close to +1, we may say
there is a high degree of positive correlation.
• On the other hand, if the correlation coefficient is close to
–1, we may say there is a high degree of negative
correlation.
A University wishes to establish whether there is
a connection between academic and sporting
achievements. Calculate the Spearman’s rank
correlation coefficient. Eight pupils are selected
and ranked.
• We say the relationship is perfectly positive, if an
increase or decrease in one variable is
accompanied by the same amount of increase or
decrease in the other variable.
• Perfectly negative implies an increase or
decrease in one variable is accompanied by the
same amount of change in the other variable in
the reverse direction.
   rs = 1 − (6·∑d²) / [n(n² − 1)],   with n = 8 and ∑d² = 62

   rs = 1 − (6×62) / [8(8² − 1)] = 1 − 372/504 ≈ 0.26

The connection between academic and sporting
achievement shows a weak positive correlation.
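• The arithmetic can be checked with two lines of Python (using n = 8 and ∑d² = 62 from the computation above):

n, sum_d2 = 8, 62
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(round(rs, 2))   # 0.26 -> weak positive correlation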
