
Simple Regression And Correlation

CHAPTER 5

Abdul Wali Khan University Mardan Pakistan. www.awkum.edu.pk 1


Simple Regression

A regression model is a mathematical equation that describes the
relationship between two or more variables. A simple regression model
includes only two variables: one independent and one dependent. The
dependent variable is the one being explained, and the independent
variable is the one used to explain the variation in the dependent variable.
For example, suppose we want to estimate the heights of children on the
basis of their ages. The heights would be the dependent variable and the
ages would be the independent variable.
As a second example, consider estimating the yield of a crop on the basis
of the amount of fertilizer used. Here the yield is the dependent variable
and the amount of fertilizer is the independent variable.



Linear Regression

The relationship between two variables in a regression
analysis is expressed by a mathematical equation called a
regression equation or model. A regression equation, when
plotted, may assume one of many possible shapes,
including a straight line. A regression equation that gives a
straight-line relationship between two variables is called a
linear regression model; otherwise, the model is called a
nonlinear regression model. In this chapter, only linear
regression models are studied.



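Simple Linear Regression Model
The Population Regression Model, OR
Probabilistic Regression Model

$Y_i = \alpha + \beta X_i + \varepsilon_i$

The terms of this model go by several names:
• $Y_i$: dependent variable, explained variable, predicted variable, random variable, response variable, regressand.
• $\alpha$: intercept (an unknown constant; a population parameter).
• $\beta$: slope (an unknown constant; a population parameter), also called the regression coefficient.
• $X_i$: independent variable, fixed variable, predictor, explanatory variable, regressor.
• $\varepsilon_i$: error term, disturbance term, noise term.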

Deterministic Regression Model OR
Statistic Regression Model
A model in which we can determine a unique value of the dependent variable for each
value of the independent variable is called a deterministic model.

$y_i = a + b x_i$

• $y_i$: dependent variable, random variable, endogenous variable.
• $a$ (y-intercept) and $b$ (slope): unknown constants, also called regression coefficients; they are unbiased estimators of $\alpha$ and $\beta$.
• $x_i$: independent variable, fixed variable, exogenous variable.

The values of $\alpha$ and $\beta$ in the population regression line are called the true values
of the y-intercept and slope, respectively.
However, population data are difficult to obtain. As a result, we almost always use sample
data to estimate the probabilistic model. The values of the y-intercept and slope calculated
from sample data on x and y are called the estimated values of $\alpha$ and $\beta$ and are denoted
by a and b, respectively. Using a and b, we write the estimated regression model as
$\hat{y} = a + bx$
where $\hat{y}$ (read as y hat) is the estimated or predicted value of y for a given value of x. This
equation is called the estimated regression model; it gives the regression of y on x.



Interpretation of a and b

Interpretation of a
a is the estimated conditional mean of Y when x = 0.

Interpretation of b
b is the slope, also stated as the change in mean of Y per 1 unit change in x.
Note that when b is positive, an increase in x will lead to an increase in y, and a
decrease in x will lead to a decrease in y. In other words, when b is positive, the
movements in x and y are in the same direction. Such a relationship between x and y
is called a positive linear relationship. The regression line in this case slopes upward
from left to right. On the other hand, if the value of b is negative, an increase in x will
lead to a decrease in y, and a decrease in x will cause an increase in y. The changes in
x and y in this case are in opposite directions. Such a relationship between x and y is
called a negative linear relationship. The regression line in this case slopes downward
from left to right.



Simple Linear Regression Assumptions

• A linear relationship exists between the dependent and independent variables. We say the
relationship between Y and x is linear if the means of the conditional distributions of Y|x
lie on a straight line.
Note: if the relation is not linear, it may be possible to transform one or both
variables so that there is a linear relation.
• The independent variable is uncorrelated with the disturbance term; that is, the
independent variable is not random.
• The expected value of the disturbance term is zero; that is, $E(\varepsilon_i) = 0$.
• $\mathrm{Var}(\varepsilon_i) = E(\varepsilon_i^2) = \sigma^2$ for all i; that is, the variance of the error term is
constant, so the disturbance terms are all drawn from a distribution with an identical
variance. In other words, the disturbance terms are homoscedastic. [A violation of
this is referred to as heteroscedasticity.]
• The disturbance terms are independently distributed; that is, the disturbance for
one observation is not correlated with that of another observation:
$E(\varepsilon_i \varepsilon_j) = 0$ for all $i \ne j$. [A violation of this is referred to as autocorrelation.]
• The disturbance term is normally distributed with a mean of zero and a constant
variance $\sigma^2$.



Estimation

We wish to use the sample data to estimate the population parameters: the slope β
and the intercept α .

•Least squares estimation


To choose the ‘best fitting line’ using least squares estimation, we minimize the
sum of the squared vertical distances of each point to the fitted line.

[Figure: scatter plot of the observed data values (y) with the estimated regression line.]

We let ‘hats’ denote predicted values or estimates of parameters, so we have:
$\hat{y}_i = a + b x_i$
where $\hat{y}_i$ is the estimated conditional mean for $x_i$, a is the estimator for $\alpha$,
and b is the estimator for $\beta$.
We wish to choose a and b such that we minimize the sum of the squared
vertical distances of each point to the fitted line, i.e. minimize
$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$
or, equivalently, minimize the function g:
$g(a, b) = \sum (y_i - \hat{y}_i)^2$

$g(a, b) = \sum (y_i - a - b x_i)^2$

• The vertical distance of a point from the fitted line is called a residual. The
residual for observation i is denoted $e_i$, and

$e_i = y_i - \hat{y}_i$

• So, in least squares estimation, we wish to minimize the sum of the squared
residuals (the error sum of squares, SSE).
• To minimize
$g(a, b) = \sum (y_i - a - b x_i)^2$
we take the partial derivatives of g with respect to a and b, set them equal to zero, and solve.

$\frac{\partial g}{\partial a} = -2 \sum (y_i - a - b x_i) = 0$

$\frac{\partial g}{\partial b} = -2 \sum (y_i - a - b x_i) x_i = 0$

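Simplifying the above gives:
$\sum y_i = n a + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
These two equations are known as the least squares normal
equations.
Direct Elimination:
Estimate of the slope (three equivalent forms):
$b_{y.x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
$b_{y.x} = \frac{\sum xy - (\sum x \sum y)/n}{\sum x^2 - (\sum x)^2/n}$
$b_{y.x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$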

Estimate of the Y-intercept a:

$a = \frac{\sum y - b \sum x}{n}$
OR $a = \bar{y} - b\bar{x}$
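
The direct elimination step is named on the slides but not written out. One way to carry it out, starting from the two normal equations and using only the sums already defined there, is sketched below; the equation labels (1) and (2) are added here purely for reference.

```latex
\begin{align*}
\sum y  &= n a + b \sum x            \tag{1} \\
\sum xy &= a \sum x + b \sum x^2     \tag{2} \\
\intertext{Multiply (1) by $\sum x$ and (2) by $n$, then subtract to eliminate $a$:}
n \sum xy - \sum x \sum y &= b \left[ n \sum x^2 - \Bigl(\sum x\Bigr)^2 \right] \\
b &= \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \bigl(\sum x\bigr)^2} \\
\intertext{Dividing (1) by $n$ then gives the intercept:}
a &= \frac{\sum y - b \sum x}{n} = \bar{y} - b\,\bar{x}
\end{align*}
```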



Example
The values of x and their corresponding values of y are shown in the table below

x 0 1 2 3 4
y 2 3 5 4 6

a) Find the least squares regression line $\hat{y} = a + bx$.
b) Estimate the value of y when x = 10.
Solution:
The estimated regression line of y on x is
$\hat{y} = a + b x_i$
and the two normal equations are
$\sum y_i = n a + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$

We use a table to calculate a and b

x         y         xy         x²
0         2         0          0
1         3         3          1
2         5         10         4
3         4         12         9
4         6         24         16
Σx = 10   Σy = 20   Σxy = 49   Σx² = 30

We now calculate a and b using the least squares formulas for a and b.
$b_{y.x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$


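$b_{y.x} = \frac{5(49) - (10)(20)}{5(30) - (10)^2}$

$b_{y.x} = \frac{245 - 200}{150 - 100} = \frac{45}{50} = 0.9$

$a = \bar{y} - b\bar{x}$
$\bar{y} = \sum y / n = 20 / 5 = 4$ and $\bar{x} = \sum x / n = 10 / 5 = 2$
$a = 4 - (0.9 \times 2)$
$a = 4 - 1.8 = 2.2$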

The estimated least squares regression line of y on x is therefore
$\hat{y} = 2.2 + 0.9x$
Substitute x = 10 to find the corresponding estimate of y:
$\hat{y} = 2.2 + (0.9 \times 10)$
$\hat{y} = 2.2 + 9$
$\hat{y} = 11.2$
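
The hand calculation in this example can be checked with a short script. The sketch below is only an illustration: the function name `fit_line` and the variable names are chosen here and do not come from the slides. It computes a and b from the same sums used in the table (Σx, Σy, Σxy, Σx²).

```python
# Data from the worked example.
x = [0, 1, 2, 3, 4]
y = [2, 3, 5, 4, 6]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    # Slope from the normal equations.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = y-bar - b * x-bar.
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(x, y)
y_at_10 = a + b * 10  # predicted value at x = 10
print(round(a, 4), round(b, 4), round(y_at_10, 4))
```

Up to floating-point rounding, the values agree with the hand calculation (a = 2.2, b = 0.9, estimate 11.2 at x = 10). Note that x = 10 lies outside the observed range 0 to 4, so this estimate is an extrapolation.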



Properties of the Least Square
Regression Line
The least squares linear regression line has the following properties.
1. The least squares regression line always goes through the point $(\bar{x}, \bar{y})$, the
means of the data.
2. The sum of the deviations of the observed values of $y_i$ from the least
squares regression line is always equal to zero, i.e. $\sum (y - \hat{y}) = 0$.
3. The sum of the squares of the deviations of the observed values from the
least squares regression line is a minimum, i.e. $\sum (y - \hat{y})^2$ = minimum.
4. The least squares regression line obtained from a random sample is the
line of best fit because a and b are unbiased estimates of the
parameters $\alpha$ and $\beta$.
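
Properties 1 to 3 can be checked numerically. The sketch below reuses the data and the fitted values a = 2.2, b = 0.9 from the earlier worked example; the helper name `sse` and the small perturbation grid of ±0.1 are assumptions made here for illustration only, and the grid check is a spot check rather than a proof of minimality.

```python
x = [0, 1, 2, 3, 4]
y = [2, 3, 5, 4, 6]
a, b = 2.2, 0.9  # least squares estimates from the worked example

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n


def sse(a0, b0):
    """Sum of squared residuals for the line y-hat = a0 + b0*x."""
    return sum((yi - (a0 + b0 * xi)) ** 2 for xi, yi in zip(x, y))


# Property 1: the fitted line passes through (x-bar, y-bar).
passes_through_means = abs((a + b * x_bar) - y_bar) < 1e-9

# Property 2: the residuals sum to zero.
residual_sum = sum(yi - (a + b * xi) for xi, yi in zip(x, y))

# Property 3: nudging a or b in any direction increases the SSE.
base = sse(a, b)
nearby = [
    sse(a + da, b + db)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]
is_minimum_on_grid = all(value > base for value in nearby)

print(passes_through_means, abs(residual_sum) < 1e-9, is_minimum_on_grid)
```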



Standard Deviation of Regression or
Standard Error of Estimate
The standard deviation of regression (or residual standard deviation) is a
statistical term that describes how far the observed values typically fall from
the values predicted by the regression line.
The residual standard deviation is also referred to as the standard deviation of points
around a fitted line or the standard error of estimate.
The standard deviation of the residuals measures how much the data points spread
around the regression line.
The result is used to gauge the accuracy of the regression line's predictions.

The formula for the population residual standard deviation is

$\sigma_{y.x} = \sqrt{\frac{\sum (Y - \alpha - \beta X)^2}{N}}$
where N is the population size.


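For sample data, we estimate $\sigma_{y.x}$ by $s_{y.x}$, which is defined as

$s_{y.x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$

Alternate formula

$s_{y.x} = \sqrt{\frac{\sum y^2 - a \sum y - b \sum xy}{n - 2}}$
where n is the sample size.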



Coefficient of Determination r2

The coefficient of determination ($r^2$) is a statistical measure that
represents the proportion of the variance of a dependent
variable that is explained by an independent variable or
variables in a regression model.
OR
The coefficient of determination, $r^2$, is the percentage of
variation in the dependent variable explained by the
independent variables.

So, if the $r^2$ of a model is 0.50, then approximately half of the
observed variation can be explained by the model's inputs.



Coefficient of Determination r2
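Total variation = Unexplained variation + Explained variation
$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$
SST = SSE + SSR

The coefficient of determination, denoted by $r^2$, represents the proportion of SST
that is explained by the use of the regression model. The computational formula
for $r^2$ is

$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST}$

$r^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$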

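$r^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$
Alternate formula
$r^2 = \frac{a \sum y + b \sum xy - (\sum y)^2 / n}{\sum y^2 - (\sum y)^2 / n}$

• If $r^2 = 1$, 100% of the variation is explained by the
regression line.
• If $r^2 = 0$, none of the variability is
explained by the regression line.
This shows that $0 \le r^2 \le 1$.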



Example

Suppose you have a set of four observed values from an unnamed experiment.
The table below shows the y values observed and recorded for given values of x:

x    y
1    1
2    4
3    6
4    7

Solution:
The estimated regression line of y on x is
$\hat{y} = a + b x_i$
and the two normal equations are
$\sum y_i = n a + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$

We use a table to calculate a and b

x         y         xy         x²
1         1         1          1
2         4         8          4
3         6         18         9
4         7         28         16
Σx = 10   Σy = 18   Σxy = 55   Σx² = 30

We now calculate a and b using the least squares formulas for a and b.
$b_{y.x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$


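$b_{y.x} = \frac{4(55) - (10)(18)}{4(30) - (10)^2}$

$b_{y.x} = \frac{220 - 180}{120 - 100} = \frac{40}{20} = 2$

$a = \bar{y} - b\bar{x}$
$\bar{y} = \sum y / n = 18 / 4 = 4.5$ and $\bar{x} = \sum x / n = 10 / 4 = 2.5$
$a = 4.5 - (2 \times 2.5)$
$a = 4.5 - 5 = -0.5$
The estimated least squares regression line of y on x is therefore
$\hat{y} = -0.5 + 2x$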


With the fitted line $\hat{y} = -0.5 + 2x$, where $\hat{y}$ is the predicted y value, the residual
for each observation can be found.
The residual is equal to $(y - \hat{y})$. For the first observation, the actual y value is 1
and the predicted value given by the equation is $\hat{y} = -0.5 + 2(1) = 1.5$.
The residual is thus 1 - 1.5 = -0.5, a negative residual.
For the second observation (x = 2, y = 4), the predicted value is
$\hat{y} = -0.5 + 2(2) = 3.5$.
The residual is thus 4 - 3.5 = 0.5, a positive residual.
The same process gives the remaining two residuals: 6 - 5.5 = 0.5 for x = 3 and
7 - 7.5 = -0.5 for x = 4. A residual is zero only when the actual and predicted
values are the same, which does not happen for any observation here.


x        y        ŷ      Residual (y − ŷ)   (y − ŷ)²        y²
1        1        1.5    -0.5               0.25            1
2        4        3.5    0.5                0.25            16
3        6        5.5    0.5                0.25            36
4        7        7.5    -0.5               0.25            49
Σx = 10  Σy = 18         Σ(y − ŷ) = 0       Σ(y − ŷ)² = 1   Σy² = 102

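Observe that the sum of the squared residuals is 1, which is the
numerator under the square root in the residual standard deviation formula.

$s_{y.x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{1}{4 - 2}}$

$s_{y.x} = \sqrt{0.5} \approx 0.70711$
$r^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$
$\sum (y - \bar{y})^2 = \sum y^2 - (\sum y)^2 / n = 102 - (18)^2 / 4$
$= 102 - 81 = 21$
$r^2 = 1 - \frac{1}{21} = 1 - 0.047619 = 0.952381$
This means that about 95% of the variation in y is explained by the regression line.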
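
The residuals, the standard error of estimate, and $r^2$ for this example can be recomputed in a few lines. This is a minimal sketch using the fitted line $\hat{y} = -0.5 + 2x$ from the example; the variable names are chosen here for illustration.

```python
import math

# Data and fitted coefficients from the four-point example.
x = [1, 2, 3, 4]
y = [1, 4, 6, 7]
a, b = -0.5, 2.0

n = len(x)
y_hat = [a + b * xi for xi in x]                    # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # y - y-hat

sse = sum(e * e for e in residuals)                 # sum of squared residuals
s_yx = math.sqrt(sse / (n - 2))                     # standard error of estimate

sst = sum(yi * yi for yi in y) - sum(y) ** 2 / n    # total variation
r2 = 1 - sse / sst                                  # coefficient of determination

print(residuals)
print(round(sse, 6), round(s_yx, 5), round(sst, 6), round(r2, 6))
```

The sum of squared residuals comes out as 1 (not 6), which is the value that reproduces $s_{y.x} \approx 0.70711$ and $r^2 \approx 0.952381$.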



Comparison of Standard Error of the
Regression vs. R-squared
• The standard error of the regression provides the absolute measure of the typical
distance that the data points fall from the regression line. S is in the units of the
dependent variable.
• R-squared provides the relative measure of the percentage of the dependent
variable variance that the model explains. R-squared can range from 0 to 100%.
• The standard error of the regression has several advantages. S tells you straight up
how precise the model’s predictions are using the units of the dependent variable.
This statistic indicates how far the data points are from the regression line on
average. You want lower values of S because it signifies that the distances between
the data points and the fitted values are smaller. S is also valid for both linear and
nonlinear regression models. This fact is convenient if you need to compare the fit
between both types of models.
• For R-squared, you want the regression model to explain higher percentages of the
variance. Higher R-squared values indicate that the data points are closer to the
fitted values. While higher R-squared values are good, they don’t tell you how far the
data points are from the regression line. Additionally, R-squared is valid
only for linear models. You can’t use R-squared to compare a linear model
to a nonlinear model.



Correlation Coefficient
The correlation coefficient, r, is a measure of the strength of the linear
relationship between two variables.
In other words, the linear correlation coefficient measures how closely the
points in a scatter diagram are spread around the regression line. The
correlation coefficient calculated for the population data is denoted by $\rho_{X,Y}$
(Greek letter rho) and the one calculated for sample data is denoted by r.
• Note that the square of the correlation coefficient is equal to the coefficient
of determination.
Note: Correlation does not imply causation. We may say that two
variables X and Y are correlated, but that does not mean that X causes Y or
that Y causes X – they simply are related or associated with one another.



Scatter Diagram

Scatter diagrams are convenient tools for studying the
correlation between two random variables.
As the name suggests, the data points corresponding to the
variables of interest are scattered over a sheet of paper
(a plot). Judging by the shape of the pattern that the data
points form, we can determine the association between the
two variables and then apply the most suitable correlation
analysis technique.
OR
A plot of paired observations is called a scatter
diagram.



Interpretation of Scatter Diagrams



Direction of the Correlation
The direction is indicated by the sign of the correlation coefficient: (+) or (-).
• Positive relationship: Variables change in the same direction.
  - As X is increasing, Y is increasing
  - As X is decreasing, Y is decreasing
  For example:
  - As height increases, so does weight.
  - Water consumption and temperature.
  - Study time and grades.

• Negative relationship: Variables change in opposite directions.
  - As X is increasing, Y is decreasing
  - As X is decreasing, Y is increasing
  For example:
  - As laptop time increases, grades decrease.
  - Price and quantity demanded.



Example
Question: Draw the scatter diagram for the given pair of variables and identify the
type of correlation between them.

Marks obtained (out of 100)    No. of Students
40-50                          12
50-60                          10
60-70                          8
70-80                          7
80-90                          5
90-100                         2

Solution:
Here, we take the two variables for consideration as:
M: The marks obtained out of 100
S: Number of students
Since the values of M are given as bins (class intervals), we can use the centre
point of each class in the scatter diagram instead.
So let us first choose the axes of our diagram.
X-axis – Marks obtained out of 100
Y-axis – Number of Students

The data points that we need to plot according to the given dataset are –
(45,12), (55,10), (65,8), (75,7), (85,5), (95,2)


From the downward pattern of the plotted points, the number of
students falls as the marks rise: only a few students get high
marks. This implies a negative correlation between the two variables.
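
The plotted points in this example can be derived from the class intervals directly. The sketch below is an illustration only (the names `classes`, `points`, and `always_decreasing` are chosen here): it computes the centre point of each class and checks that the student count falls each time the marks rise, which is the pattern read off the scatter diagram.

```python
# (lower bound, upper bound, number of students) for each class in the table.
classes = [
    (40, 50, 12),
    (50, 60, 10),
    (60, 70, 8),
    (70, 80, 7),
    (80, 90, 5),
    (90, 100, 2),
]

# Use the centre point of each class as the x value.
points = [((low + high) / 2, students) for low, high, students in classes]

# Direction check: does the count fall every time the midpoint rises?
counts = [students for _, students in points]
always_decreasing = all(
    later < earlier for earlier, later in zip(counts, counts[1:])
)

print(points)
print("Counts fall as marks rise:", always_decreasing)
```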



Pearson Product Moment Correlation
Coefficient (r)
The Pearson correlation (r) is the most common correlation
coefficient. It measures the strength and direction of the
linear relationship between two variables. It cannot capture
nonlinear relationships between two variables and cannot
differentiate between dependent and independent variables. It is
also called the coefficient of simple correlation or total correlation.

The population correlation coefficient for a bivariate distribution,
denoted by $\rho_{y,x}$, is defined as
$\rho_{y,x} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$



Short Computational Formula

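Sample correlation coefficient formula (equivalent forms)

$r_{y.x} = \frac{s_{xy}}{s_x s_y}$
$r_{y.x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

$r_{y.x} = \frac{\sum xy - (\sum x \sum y)/n}{\sqrt{[\sum x^2 - (\sum x)^2/n][\sum y^2 - (\sum y)^2/n]}}$

$r_{y.x} = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$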



Interpretation of Correlation
Coefficient (r)
• The value of the correlation coefficient r ranges from -1 to +1.
• If r = 1, the correlation between the two variables is said
to be perfect and positive: an increase in one variable is
matched by an increase in the second variable.
• If r = -1, the correlation between the two variables is said
to be perfect and negative: the variables move in opposite
directions, so an increase in one variable is matched by a
decrease in the second variable.
• If r = 0, there is no correlation between the two variables;
that is, there is no linear relationship between them.


Correlation          Strength of          Correlation          Strength of
coefficient (+r)     relationship         coefficient (-r)     relationship
values               (Positive)           values               (Negative)

1.0                  Perfect (+)          -1.0                 Perfect (-)
0.8 to 0.99          Very Strong (+)      -0.8 to -0.99        Very Strong (-)
0.6 to 0.8           Strong (+)           -0.6 to -0.8         Strong (-)
0.4 to 0.6           Moderate (+)         -0.4 to -0.6         Moderate (-)
0.2 to 0.4           Weak (+)             -0.2 to -0.4         Weak (-)
0 to 0.2             Very weak (+)        0 to -0.2            Very weak (-)
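
The table can be turned into a small lookup function. The sketch below is only one possible reading of it: the function name `strength` is hypothetical, and two assumptions are made because the table does not settle them. A value that sits exactly on a shared boundary (for example 0.8) is assigned to the stronger band, and values between 0.99 and 1.0 are treated as very strong.

```python
def strength(r):
    """Label a correlation coefficient using the bands in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    # Assumption: boundary values go to the stronger band.
    if size == 1.0:
        label = "Perfect"
    elif size >= 0.8:
        label = "Very strong"
    elif size >= 0.6:
        label = "Strong"
    elif size >= 0.4:
        label = "Moderate"
    elif size >= 0.2:
        label = "Weak"
    else:
        label = "Very weak"
    # The sign gives the direction; r = 0 has no direction.
    if r > 0:
        return label + " (+)"
    if r < 0:
        return label + " (-)"
    return label


for value in (1.0, 0.85, -0.7, 0.45, -0.3, 0.1):
    print(value, strength(value))
```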



Example
Calculate the linear correlation coefficient for the following data.

X 4 8 12 16
Y 5 10 15 20

Solution:
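To find the linear correlation coefficient of these data, we first need to
construct a table of the required values.

$r_{y.x} = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$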

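x          y          x²          y²          xy
4          5          16          25          20
8          10         64          100         80
12         15         144         225         180
16         20         256         400         320
Σx = 40    Σy = 50    Σx² = 480   Σy² = 750   Σxy = 600

According to the formula for linear correlation we have, with n = 4,

$r_{y.x} = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$
$r_{y.x} = \frac{4(600) - (40)(50)}{\sqrt{[4(480) - (40)^2][4(750) - (50)^2]}}$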


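$r_{y.x} = \frac{2400 - 2000}{\sqrt{(1920 - 1600)(3000 - 2500)}}$

$r_{y.x} = \frac{400}{\sqrt{320 \times 500}}$

$r_{y.x} = \frac{400}{\sqrt{160000}} = \frac{400}{400}$

$r_{y.x} = 1$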
Hence there is perfect positive correlation between X and Y.
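
The same calculation can be scripted from the column sums. This is a minimal sketch of the computational formula for the sample correlation coefficient; the function name `pearson_r` is chosen here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient from the computational formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    sum_y2 = sum(yi * yi for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


X = [4, 8, 12, 16]
Y = [5, 10, 15, 20]
r = pearson_r(X, Y)
print(r)
```

Here every Y value equals 1.25 times the matching X value, so the points lie exactly on a rising straight line, which is why the coefficient reaches its maximum of 1.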



Properties of Correlation Coefficient

The sample correlation coefficient has the following properties.
1. The correlation coefficient is symmetrical with respect to X and Y, i.e. $r_{x.y} = r_{y.x}$.
2. The correlation coefficient is the geometric mean of the two regression
coefficients, i.e.
$r = \pm\sqrt{b_{y.x} \times b_{x.y}}$
where r takes the common sign of the two regression coefficients.
3. The correlation coefficient is independent of origin and scale. By this we
mean that if we take deviations of X and Y from some suitable origins, or
transform X and Y into u and v respectively, the correlation coefficient is
not affected, i.e. $r_{x.y} = r_{u.v}$.
4. As a rule of thumb, correlation coefficient values between -0.8 and +0.8 are not
considered significant.
5. The correlation coefficient lies between -1 and +1, i.e. $-1 \le r \le +1$.
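
Properties 1 to 3 can be checked numerically. The sketch below reuses the data from the first regression example of this chapter; the helper names `slope` and `pearson_r` and the particular change of origin and scale (subtract 2 and divide by 2 for x, subtract 4 and multiply by 3 for y) are assumptions made here for illustration. Both scale factors are positive; a change of scale with factors of opposite sign would flip the sign of r.

```python
import math


def slope(xs, ys):
    """Least squares slope of ys regressed on xs."""
    n = len(xs)
    num = n * sum(p * q for p, q in zip(xs, ys)) - sum(xs) * sum(ys)
    den = n * sum(p * p for p in xs) - sum(xs) ** 2
    return num / den


def pearson_r(xs, ys):
    """Sample correlation coefficient (computational formula)."""
    n = len(xs)
    num = n * sum(p * q for p, q in zip(xs, ys)) - sum(xs) * sum(ys)
    den = math.sqrt(
        (n * sum(p * p for p in xs) - sum(xs) ** 2)
        * (n * sum(q * q for q in ys) - sum(ys) ** 2)
    )
    return num / den


x = [0, 1, 2, 3, 4]
y = [2, 3, 5, 4, 6]

# Property 1: symmetry in x and y.
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)

# Property 2: r squared equals the product of the two regression slopes.
b_yx = slope(x, y)  # regression of y on x
b_xy = slope(y, x)  # regression of x on y

# Property 3: change of origin and scale (illustrative values only).
u = [(xi - 2) / 2 for xi in x]
v = [(yi - 4) * 3 for yi in y]
r_uv = pearson_r(u, v)

print(round(r_xy, 6), round(r_yx, 6), round(b_yx * b_xy, 6), round(r_uv, 6))
```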



Assignment 5
Q1: The data on ages (in years) and prices (in hundreds of dollars) for eight cars of
a specific model are given below.
Age     8    3    6    9    2     5    6    3
Price   18   94   50   21   145   42   36   99

a) Find the regression line with price as the dependent variable and age as the
independent variable.
b) Predict the price of a 7-year-old car of this model.
c) Estimate the price of an 18-year-old car of this model. Comment on this finding.
Q2: Calculate and analyze the correlation coefficient between the number of study
hours and the number of sleeping hours of different students.
Number of Study Hours      2    4   6   8   10
Number of Sleeping Hours   10   9   8   7   6


Q3: The following information is obtained from a sample data set.
n = 12, $\sum x = 66$, $\sum y = 588$, $\sum xy = 2244$, $\sum x^2 = 396$ and $\sum y^2 = 58734$
Find the value of the standard error of estimate and the coefficient of determination $r^2$.

Q4: The following data give the experience (in years) and monthly salaries (in
hundreds of dollars) of nine randomly selected secretaries.
Experience       14   3    5    6    4    9    18   5    16
Monthly salary   62   29   37   43   35   60   67   32   60

a. Find the least squares regression line with experience as the independent
variable and monthly salary as the dependent variable.