Chapter 5

Simple Regression And Correlation
Abdul Wali Khan University Mardan Pakistan. www.awkum.edu.pk 1

Simple Regression
A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent. The dependent variable is the one being explained, and the independent variable is the one used to explain the variation in the dependent variable.
For example, suppose we want to estimate the heights of children on the basis of their ages. Height is the dependent variable and age is the independent variable.
As a second example, consider estimating the yield of a crop on the basis of the amount of fertilizer used. Here yield is the dependent variable and the amount of fertilizer is the independent variable.

Linear Regression
The relationship between two variables in a regression analysis is expressed by a mathematical equation called a regression equation or model. A regression equation, when plotted, may assume one of many possible shapes, including a straight line. A regression equation that gives a straight-line relationship between two variables is called a linear regression model; otherwise, the model is called a nonlinear regression model. In this chapter, only linear regression models are studied.

Simple Linear Regression Model
The Population Regression Model OR
Probabilistic Regression Model
$Y_i = \alpha + \beta X_i + \varepsilon_i$
• $Y$: dependent variable (also called the explained variable, predicted variable, random variable, response variable, or regressand)
• $\alpha$ and $\beta$: intercept and slope (unknown constants; population parameters; regression coefficients)
• $X$: independent variable (also called the fixed variable, predictor, explanatory variable, or regressor)
• $\varepsilon$: error term (also called the disturbance term or noise term)
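The probabilistic model can be illustrated with a short simulation. The sketch below is only an illustration: the function name `simulate_population_model`, the parameter values (alpha = 2, beta = 1, sigma = 0.5), and the fixed seed are assumptions made here, not values from the chapter.

```python
import random


def simulate_population_model(alpha, beta, xs, sigma, seed=0):
    """Draw one y for each x from Y = alpha + beta * x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


xs = [0, 1, 2, 3, 4]
# Hypothetical parameter values, chosen only for illustration.
ys_noisy = simulate_population_model(alpha=2.0, beta=1.0, xs=xs, sigma=0.5)
# With sigma = 0 the error term vanishes and every point lies on the line.
ys_exact = simulate_population_model(alpha=2.0, beta=1.0, xs=xs, sigma=0.0)

print(ys_noisy)
print(ys_exact)
```

Setting sigma to zero removes the error term, which is one way to see the difference between the probabilistic model and a deterministic one.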

Continue
Deterministic Regression Model OR
Statistic Regression Model
A model in which we can determine a unique value of the dependent variable for each value of the independent variable is called a deterministic model.
$y = a + b x_i$
• $y$: dependent variable (random variable; endogenous variable)
• $a$ and $b$: y-intercept and slope (unknown constants; unbiased estimators of $\alpha$ and $\beta$; regression coefficients)
• $x$: independent variable (fixed variable; exogenous variable)

Continue
The values of α and β in the population regression line are called the true values of the y-intercept and slope, respectively.
However, population data are difficult to obtain. As a result, we almost always use sample data to estimate the probabilistic model. The values of the y-intercept and slope calculated from sample data on x and y are called the estimated values of α and β and are denoted by a and b, respectively. Using a and b, we write the estimated regression model as
$\hat{y} = a + bx$
where $\hat{y}$ (read as y hat) is the estimated or predicted value of y for a given value of x. This equation is called the estimated regression model; it gives the regression of y on x.

Interpretation of a and b
Interpretation of a
a is the estimated conditional mean of y when x = 0.
Interpretation of b
b is the slope, that is, the change in the mean of y per one-unit change in x.
Note that when b is positive, an increase in x will lead to an increase in y, and a
decrease in x will lead to a decrease in y. In other words, when b is positive, the
movements in x and y are in the same direction. Such a relationship between x and y
is called a positive linear relationship. The regression line in this case slopes upward
from left to right. On the other hand, if the value of b is negative, an increase in x will
lead to a decrease in y, and a decrease in x will cause an increase in y. The changes in
x and y in this case are in opposite directions. Such a relationship between x and y is
called a negative linear relationship. The regression line in this case slopes downward
from left to right.

Simple Linear Regression Assumptions
• A linear relationship exists between the dependent and independent variables. We say the relationship between Y and x is linear if the means of the conditional distributions of Y|x lie on a straight line.
Note: if the relation is not linear, it may be possible to transform one or both variables so that there is a linear relation.
• The independent variable is uncorrelated with the disturbance term; that is, the independent variable is not random.
• The expected value of the disturbance term is zero; that is, $E(\varepsilon_i) = 0$.
• $\mathrm{Var}(\varepsilon_i) = E(\varepsilon_i^2) = \sigma^2$ for all i; that is, the variance of the error term is constant, so the disturbance terms are all drawn from distributions with an identical variance. In other words, the disturbance terms are homoscedastic. [A violation of this is referred to as heteroscedasticity.]
• The disturbance terms are independently distributed; that is, the disturbance for one observation is not correlated with that of another observation: $E(\varepsilon_i \varepsilon_j) = 0$ for all $i \neq j$. [A violation of this is referred to as autocorrelation.]
• The disturbance term is normally distributed with a mean of zero and a constant variance $\sigma^2$.

Estimation
We wish to use the sample data to estimate the population parameters: the slope β
and the intercept α .
• Least squares estimation
To choose the 'best fitting line' using least squares estimation, we minimize the sum of the squared vertical distances of each point to the fitted line.
[Figure: scatter plot of the observed data values (y) with the estimated regression line]

Continue
We let 'hats' denote predicted values or estimates of parameters, so we have:
$\hat{y}_i = a + b x_i$
where $\hat{y}_i$ is the estimated conditional mean for $x_i$, a is the estimator for α, and b is the estimator for β.
We wish to choose a and b such that we minimize the sum of the squared vertical distances of each point to the fitted line, i.e. minimize
$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$
Or minimize the function g:
$g(a,b) = \sum (y_i - \hat{y}_i)^2$
$g(a,b) = \sum (y_i - a - b x_i)^2$

Continue
• This vertical distance of a point from the fitted line is called a residual. The residual for observation i is denoted $e_i$, and
$e_i = y_i - \hat{y}_i$
• So, in least squares estimation, we wish to minimize the sum of the squared residuals (or error sum of squares, SSE).
• To minimize
$g(a,b) = \sum (y_i - a - b x_i)^2$
we take the partial derivatives of g with respect to a and b, set them equal to zero, and solve:
$\frac{\partial g}{\partial a} = -2 \sum (y_i - a - b x_i) = 0$
$\frac{\partial g}{\partial b} = -2 \sum (y_i - a - b x_i) x_i = 0$
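The minimization can be checked numerically. The sketch below evaluates g(a, b) and its two partial derivatives for five illustrative points; the function names `sse` and `gradient` are made up for this example, and the values a = 2.2 and b = 0.9 are the least squares solution for these particular points.

```python
def sse(a, b, xs, ys):
    """g(a, b): sum of squared vertical distances from the line y = a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


def gradient(a, b, xs, ys):
    """Partial derivatives of g with respect to a and b."""
    dg_da = -2 * sum(y - a - b * x for x, y in zip(xs, ys))
    dg_db = -2 * sum((y - a - b * x) * x for x, y in zip(xs, ys))
    return dg_da, dg_db


xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]
a_hat, b_hat = 2.2, 0.9  # least squares solution for these five points

best = sse(a_hat, b_hat, xs, ys)
print(best)                             # value of g at the minimum
print(gradient(a_hat, b_hat, xs, ys))   # both components are approximately 0
print(sse(a_hat + 0.1, b_hat, xs, ys))  # nudging a increases g
print(sse(a_hat, b_hat - 0.1, xs, ys))  # nudging b increases g
```

Both partial derivatives vanish (up to rounding) at the least squares solution, and moving either coefficient away from it makes the sum of squared residuals larger.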

Continue
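Simplifying the above gives:
$\sum y_i = na + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
These two equations are known as the least squares normal equations.
Direct Elimination:
Estimate of the slope:
$b_{y.x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
$b_{y.x} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2 / n}$
$b_{y.x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$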

Continue
Estimate of the Y-intercept a:
$a = \frac{\sum y - b \sum x}{n}$
OR $a = \bar{y} - b\bar{x}$
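The slope and intercept obtained from the normal equations translate directly into code. The following is a minimal sketch, assuming paired lists of equal length; the function name `fit_line` and the five data points are chosen here only for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = sum_y / n - b * sum_x / n                   # intercept: y-bar minus b times x-bar
    return a, b


a, b = fit_line([0, 1, 2, 3, 4], [2, 3, 5, 4, 6])
print(a, b)  # approximately 2.2 and 0.9
```

The guard against a zero denominator covers the case where every x value is identical, in which no unique line exists.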

Example
The values of x and their corresponding values of y are shown in the table below
x 0 1 2 3 4
y 2 3 5 4 6
a) Find the least squares regression line y = a + b x.
b) Estimate the value of y when x = 10.
Solution:
The estimated regression line of y on x is
$\hat{y} = a + b x_i$
and the two normal equations are
$\sum y_i = na + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$

Continue
We use a table to calculate a and b.
x    y    xy    x²
0    2    0     0
1    3    3     1
2    5    10    4
3    4    12    9
4    6    24    16
Σx = 10    Σy = 20    Σxy = 49    Σx² = 30
We now calculate a and b using the least squares regression formulas for a and b.
$b_{y.x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$

Continue
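$b_{y.x} = \frac{5(49) - (10)(20)}{5(30) - (10)^2}$
$b_{y.x} = \frac{245 - 200}{150 - 100} = \frac{45}{50} = 0.9$
$a = \bar{y} - b\bar{x}$
$\bar{y} = \sum y / n = 20 / 5 = 4$ and $\bar{x} = \sum x / n = 10 / 5 = 2$
a = 4 – (0.9 × 2)
a = 4 – 1.8 = 2.2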

Continue
We now have the estimated least squares regression line of y on x:
$\hat{y} = 2.2 + 0.9x$
Substitute x = 10 to find the corresponding value of y.
$\hat{y} = 2.2 + (0.9 \times 10)$
$= 2.2 + 9$
$= 11.2$

Properties of the Least Square
Regression Line
The least squares linear regression line has the following properties.
1. The least squares regression line always goes through the point $(\bar{x}, \bar{y})$, the means of the data.
2. The sum of the deviations of the observed values $y_i$ from the least squares regression line is always equal to zero, i.e. $\sum (y - \hat{y}) = 0$.
3. The sum of the squares of the deviations of the observed values from the least squares regression line is a minimum, i.e. $\sum (y - \hat{y})^2$ = minimum.
4. The least squares regression line obtained from a random sample is the line of best fit because a and b are unbiased estimates of the parameters α and β.
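The first two properties are easy to verify numerically. The sketch below fits a line to five illustrative points using the deviation form of the slope, then checks that the line passes through the point of means and that the residuals sum to zero; all variable names are chosen for this example.

```python
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope from deviations about the means, then the intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
value_at_x_bar = a + b * x_bar

print(value_at_x_bar, y_bar)  # the fitted line evaluated at x-bar equals y-bar
print(sum(residuals))         # approximately zero, up to rounding error
```

Because of floating-point rounding, the residual sum may print as a very small number rather than exactly zero.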

Standard Deviation of Regression or
Standard Error of Estimate
The standard deviation of regression (or residual standard deviation) is a statistical term used to describe the typical difference between the observed values and the values predicted by the regression line.
Residual standard deviation is also referred to as the standard deviation of points around a fitted line or the standard error of estimate.
The standard deviation of the residuals measures how much the data points spread around the regression line.
The result is used to measure the error of the regression line's predictions.
The formula for the population residual standard deviation is
$\sigma_{y.x} = \sqrt{\frac{\sum (y_i - \alpha - \beta x_i)^2}{N}}$
where N is the population size.

Continue
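For sample data, we estimate $\sigma_{y.x}$ by $s_{y.x}$, which is defined as
$s_{y.x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$
Alternate formula
$s_{y.x} = \sqrt{\frac{\sum y^2 - a \sum y - b \sum xy}{n - 2}}$
where n is the sample size.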
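As a rough check, the sample standard error of estimate can be computed two ways: from the residuals directly, and from the sums of y², y, and xy. The sketch below assumes the usual n − 2 divisor for simple regression; the function names and the four data points are illustrative, and a = −0.5, b = 2 are the least squares coefficients for those points.

```python
import math


def standard_error_of_estimate(xs, ys, a, b):
    """s_y.x from the residuals: sqrt(sum((y - y_hat)^2) / (n - 2))."""
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


def standard_error_alternate(xs, ys, a, b):
    """Same quantity from sums; valid only when a and b are the least squares fit."""
    n = len(xs)
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))


xs = [1, 2, 3, 4]
ys = [1, 4, 6, 7]
a, b = -0.5, 2.0  # least squares intercept and slope for these four points

s_direct = standard_error_of_estimate(xs, ys, a, b)
s_alternate = standard_error_alternate(xs, ys, a, b)
print(s_direct, s_alternate)  # both approximately 0.7071
```

The two results agree only because a and b are the least squares estimates; for any other line the shortcut formula does not equal the residual sum of squares.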

Coefficient of Determination r2
The coefficient of determination (r²) is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by an independent variable or variables in a regression model.
OR
The coefficient of determination, r², is the percentage of variation in the dependent variable explained by the independent variables.
So, if the r² of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs.

Coefficient of Determination r2
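Total variation = Unexplained variation + Explained variation
$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$
SST = SSE + SSR
The coefficient of determination, denoted by r², represents the proportion of SST that is explained by the use of the regression model. The computational formula for r² is
$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST}$
$r^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$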

Continue
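$r^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$
Alternate formula
$r^2 = \frac{a \sum y + b \sum xy - n\bar{y}^2}{\sum y^2 - n\bar{y}^2}$
• If r² = 1, 100% of the variation is explained by the regression line.
• If r² = 0, none of the variability is explained by the regression line.
This shows that $0 \le r^2 \le 1$.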
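The decomposition SST = SSE + SSR can be confirmed on a small data set. The sketch below uses five illustrative points and fits the least squares line inline; the variable names are assumptions made for this example.

```python
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept (deviation form).
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar
y_hats = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                     # total variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # unexplained variation
ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)          # explained variation

r_squared = ssr / sst
print(sst, sse, ssr)              # approximately 10, 1.9, 8.1
print(r_squared, 1 - sse / sst)   # both approximately 0.81
```

Both forms of r² give the same value here because the line is the least squares fit, which is what makes the cross term in the decomposition vanish.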

Example
For example, assume you have a set of four observed values for an unnamed experiment. The table below shows the y values observed and recorded for given values of x:
x    y
1    1
2    4
3    6
4    7
Solution:
The estimated regression line of y on x is
$\hat{y} = a + b x_i$
and the two normal equations are
$\sum y_i = na + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
Continue
We use a table to calculate a and b.
x    y    xy    x²
1    1    1     1
2    4    8     4
3    6    18    9
4    7    28    16
Σx = 10    Σy = 18    Σxy = 55    Σx² = 30
We now calculate a and b using the least squares regression formulas for a and b.
$b_{y.x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$

Continue
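$b_{y.x} = \frac{4(55) - (10)(18)}{4(30) - (10)^2}$
$b_{y.x} = \frac{220 - 180}{120 - 100} = \frac{40}{20} = 2$
$a = \bar{y} - b\bar{x}$
$\bar{y} = \sum y / n = 18 / 4 = 4.5$ and $\bar{x} = \sum x / n = 10 / 4 = 2.5$
a = 4.5 – (2 × 2.5)
a = 4.5 – 5 = –0.5
We now have the estimated least squares regression line of y on x:
$\hat{y} = -0.5 + 2x$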

Continue
If the linear equation predicted by the data in the model is given as $\hat{y} = -0.5 + 2x$, where $\hat{y}$ is the predicted y value, the residual for each observation can be found.
The residual is equal to $(y - \hat{y})$, so for the first observation, the actual y value is 1 and the predicted value given by the equation is $\hat{y} = -0.5 + 2(1) = 1.5$.
The residual value is thus 1 – 1.5 = –0.5, a negative residual value.
For the second observation, where x is 2 and y is 4, the predicted y value can be calculated as –0.5 + 2(2) = 3.5.
The residual value is thus 4 – 3.5 = 0.5, a positive residual value.
The same process gives the predicted values for the remaining two observations: 5.5 for x = 3 (residual 6 – 5.5 = 0.5) and 7.5 for x = 4 (residual 7 – 7.5 = –0.5).

Continue
x    y    ŷ      Residual (y – ŷ)    (y – ŷ)²    y²
1    1    1.5    –0.5                0.25        1
2    4    3.5    0.5                 0.25        16
3    6    5.5    0.5                 0.25        36
4    7    7.5    –0.5                0.25        49
Σx = 10    Σy = 18    Σ(y – ŷ) = 0    Σ(y – ŷ)² = 1    Σy² = 102

Continue
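Observe that the sum of the squared residuals is 1, which is the numerator under the square root in the residual standard deviation equation.
$s_{y.x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{1}{4 - 2}}$
$s_{y.x} = 0.70711$
$r^2 = 1 - \frac{SSE}{SST}$
$SST = \sum y^2 - (\sum y)^2 / n = 102 - (18)^2 / 4$
= 102 – 81 = 21
$r^2 = 1 - \frac{1}{21} = 1 - 0.047619 = 0.952381$
This means that about 95% of the variation is explained by the regression line.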

Comparison of Standard Error of the
Regression vs. R-squared
• The standard error of the regression provides the absolute measure of the typical
distance that the data points fall from the regression line. S is in the units of the
dependent variable.
• R-squared provides the relative measure of the percentage of the dependent
variable variance that the model explains. R-squared can range from 0 to 100%.
• The standard error of the regression has several advantages. S tells you straight up
how precise the model’s predictions are using the units of the dependent variable.
This statistic indicates how far the data points are from the regression line on
average. You want lower values of S because it signifies that the distances between
the data points and the fitted values are smaller. S is also valid for both linear and
nonlinear regression models. This fact is convenient if you need to compare the fit
between both types of models.
• For R-squared, you want the regression model to explain higher percentages of the variance. Higher R-squared values indicate that the data points are closer to the fitted values. While higher R-squared values are good, they don't tell you how far the data points are from the regression line. Additionally, R-squared is valid only for linear models. You can't use R-squared to compare a linear model to a nonlinear model.

Correlation Coefficient
The correlation coefficient , r, is a measure of the strength of the relationship
between or among variables.
In other words, the linear correlation coefficient measures how closely the
points in a scatter diagram are spread around the regression line. The
correlation coefficient calculated for the population data is denoted by ρX,Y
(Greek letter rho) and the one calculated for sample data is denoted by r.
• Note that, in simple linear regression, the square of the correlation coefficient is equal to the coefficient of determination.
Note: Correlation does not imply causation. We may say that two
variables X and Y are correlated, but that does not mean that X causes Y or
that Y causes X – they simply are related or associated with one another.

Scatter Diagram
Scatter diagrams are convenient tools for studying the correlation between two random variables. As the name suggests, they are plots on which the data points corresponding to the variables of interest are scattered. Judging by the shape of the pattern that the data points form, we can determine the association between the two variables and can then apply the most suitable correlation analysis technique.
OR
A plot of paired observations is called a scatter diagram.

Interpretation of Scatter Diagrams

Direction of the Correlation
Indicated by sign: (+) or (–)
• Positive relationship: Variables change in the same direction.
  • As X is increasing, Y is increasing
  • As X is decreasing, Y is decreasing
For Example
  • As height increases, so does weight.
  • Water consumption and temperature.
  • Study time and grades.
• Negative relationship: Variables change in opposite directions.
  • As X is increasing, Y is decreasing
  • As X is decreasing, Y is increasing
For Example
  • As laptop time increases, grades decrease.
  • Price and quantity demanded.

Example
Question: Draw the scatter diagram for the given pair of variables and identify the type of correlation between them.
Marks obtained (out of 100)    No. of Students
40-50                          12
50-60                          10
60-70                          8
70-80                          7
80-90                          5
90-100                         2
Solution:
Here, we take the two variables for consideration as:
M: The marks obtained out of 100
S: Number of students
Since the values of M are in the form of bins, we can use the centre point of each class in the scatter diagram instead.
So let us first choose the axes of our diagram.
X-axis – Marks obtained out of 100
Y-axis – Number of Students

Continue
The data points that we need to plot according to the given dataset are –
(45,12), (55,10), (65,8), (75,7), (85,5), (95,2)
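The class midpoints used as x-coordinates can be computed mechanically. This short sketch assumes each class is stored as a (lower, upper) pair together with its student count; the variable names are chosen for illustration.

```python
# Each entry: ((lower limit, upper limit), number of students)
classes = [
    ((40, 50), 12),
    ((50, 60), 10),
    ((60, 70), 8),
    ((70, 80), 7),
    ((80, 90), 5),
    ((90, 100), 2),
]

# The centre point of a class is the average of its two limits.
points = [((lower + upper) / 2, count) for (lower, upper), count in classes]
print(points)
```

The resulting pairs are the coordinates to plot, with marks on the X-axis and number of students on the Y-axis.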

Continue
From the shape of the plotted points, it is clear that fewer students get high marks. This implies a negative correlation between the two variables.

Pearson Product Moment Correlation
Coefficient (r)
Pearson correlation (r) is the most common correlation coefficient. It measures the strength and direction of the linear relationship between two variables. It cannot capture nonlinear relationships between two variables and cannot differentiate between dependent and independent variables. It is also called the coefficient of simple correlation or total correlation.
The population correlation coefficient for a bivariate distribution, denoted by $\rho_{y,x}$, has already been defined as
$\rho_{y,x} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

Short Computational Formula
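Sample correlation coefficient formula
$r_{y.x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
$r_{y.x} = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{(\sum x^2 - n\bar{x}^2)(\sum y^2 - n\bar{y}^2)}}$
$r_{y.x} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{[\sum x^2 - (\sum x)^2/n][\sum y^2 - (\sum y)^2/n]}}$
$r_{y.x} = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$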
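The computational formula for the sample correlation coefficient can be sketched as a small function. The name `pearson_r` and the five data points are assumptions made for this illustration; the function uses the form based on raw sums.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient from raw sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


r = pearson_r([0, 1, 2, 3, 4], [2, 3, 5, 4, 6])
print(r)       # approximately 0.9
print(r ** 2)  # approximately 0.81
```

Swapping the two arguments leaves the result unchanged, since the formula treats x and y symmetrically.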

Interpretation of Correlation
Coefficient (r)
• The value of the correlation coefficient r ranges from –1 to +1.
• If r = 1, the correlation between the two variables is said to be perfect and positive. An increase in one variable is accompanied by an increase in the second variable.
• If r = –1, the correlation between the two variables is said to be perfect and negative. The variables move in opposite directions: an increase in one variable is accompanied by a decrease in the second variable.
• If r = 0, there is no correlation between the two variables. It means there is no linear relationship between them.

Continue
Correlation Coefficient (+r) values    Strength of Relationship (Positive)    Correlation Coefficient (–r) values    Strength of Relationship (Negative)
1.0                                    Perfect (+)                            –1.0                                   Perfect (–)
0.8 to 0.99                            Very Strong (+)                        –0.8 to –0.99                          Very Strong (–)
0.6 to 0.8                             Strong (+)                             –0.6 to –0.8                           Strong (–)
0.4 to 0.6                             Moderate (+)                           –0.4 to –0.6                           Moderate (–)
0.2 to 0.4                             Weak (+)                               –0.2 to –0.4                           Weak (–)
0 to 0.2                               Very weak (+)                          0 to –0.2                              Very weak (–)
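The strength bands can be expressed as a simple lookup. In the sketch below, the function name `describe_correlation` is hypothetical, and treating each lower band edge as inclusive (so that 0.8 counts as very strong) is an assumption, because the bands as listed share their end points.

```python
def describe_correlation(r):
    """Return a strength label for r; band edges are an assumption (lower edge inclusive)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "No linear correlation"
    size = abs(r)
    if size == 1.0:
        label = "Perfect"
    elif size >= 0.8:
        label = "Very strong"
    elif size >= 0.6:
        label = "Strong"
    elif size >= 0.4:
        label = "Moderate"
    elif size >= 0.2:
        label = "Weak"
    else:
        label = "Very weak"
    sign = "(+)" if r > 0 else "(-)"
    return f"{label} {sign}"


print(describe_correlation(0.9))
print(describe_correlation(-0.5))
```

Such labels are a descriptive convenience; they say nothing about statistical significance, which also depends on the sample size.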

Example
Calculate the linear correlation coefficient for the following data.
X 4 8 12 16
Y 5 10 15 20
Solution:
To find the linear correlation coefficient for these data, we first construct a table of the required values.
$r_{y.x} = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$

Continue
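x     y     x²     y²     xy
4     5     16     25     20
8     10    64     100    80
12    15    144    225    180
16    20    256    400    320
Σx = 40    Σy = 50    Σx² = 480    Σy² = 750    Σxy = 600
According to the formula of linear correlation we have,
$r_{y.x} = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$
$r_{y.x} = \frac{4(600) - (40)(50)}{\sqrt{[4(480) - (40)^2][4(750) - (50)^2]}}$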

Continue
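$r_{y.x} = \frac{2400 - 2000}{\sqrt{(1920 - 1600)(3000 - 2500)}}$
$r_{y.x} = \frac{400}{\sqrt{(320)(500)}}$
$r_{y.x} = \frac{400}{\sqrt{160000}} = \frac{400}{400}$
$r_{y.x} = 1$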
Hence there is perfect positive correlation between X and Y.

Properties of Correlation Coefficient
The sample correlation coefficient has the following properties.

1. The correlation coefficient is symmetrical with respect to X and Y, i.e. $r_{x.y} = r_{y.x}$.
2. The correlation coefficient is the geometric mean of the two regression coefficients, i.e.
$r = \pm\sqrt{b_{y.x} \times b_{x.y}}$
where r takes the common sign of the two regression coefficients.
3. The correlation coefficient is independent of origin and scale. By this we mean that if we take deviations of X and Y from some suitable origins, or transform X and Y into u and v respectively by a change of origin and scale, the correlation coefficient is not affected, i.e. $r_{x.y} = r_{u.v}$.
4. As a rule of thumb, correlation coefficient values between –0.8 and +0.8 (that is, less than +0.8 and greater than –0.8) are not considered significant.
5. The correlation coefficient lies between –1 and +1, i.e. $-1 \le r \le +1$.
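The first three properties can be checked numerically. The sketch below uses five illustrative points; the helper names `slope` and `pearson_r`, and the particular transformations u = 10x + 3 and v = (y − 4)/2, are assumptions made for this example (both scale factors are positive, so the sign of r is preserved).

```python
import math


def slope(xs, ys):
    """Regression coefficient of ys on xs (least squares slope)."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    return numerator / (n * sum(x * x for x in xs) - sum(xs) ** 2)


def pearson_r(xs, ys):
    """Sample correlation coefficient from raw sums."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    var_x = n * sum(x * x for x in xs) - sum(xs) ** 2
    var_y = n * sum(y * y for y in ys) - sum(ys) ** 2
    return numerator / math.sqrt(var_x * var_y)


xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]

r = pearson_r(xs, ys)
b_yx = slope(xs, ys)  # regression of y on x
b_xy = slope(ys, xs)  # regression of x on y

# Geometric mean of the two regression coefficients, carrying their common sign.
r_from_slopes = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# Change of origin and scale (positive scale factors).
us = [10 * x + 3 for x in xs]
vs = [(y - 4) / 2 for y in ys]
r_uv = pearson_r(us, vs)

print(r, pearson_r(ys, xs))  # symmetry
print(r_from_slopes)         # geometric mean of b_yx and b_xy
print(r_uv)                  # unchanged by origin and scale
```

All four printed values agree up to rounding, which is what properties 1 to 3 predict for these data.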

Assignment 5
Q1: The data on ages (in years) and prices (in hundreds of dollars) for eight cars of a specific model are given below.
Age      8     3     6     9     2      5     6     3
Price    18    94    50    21    145    42    36    99
a) Find the regression line with price as a dependent variable and age as an independent variable.
b) Predict the price of a 7-year-old car of this model.
c) Estimate the price of an 18-year-old car of this model. Comment on this finding.
Q2: Calculate and analyze the correlation coefficient between the number of study hours and the number of sleeping hours of different students.
Number of Study Hours       2     4    6    8    10
Number of Sleeping Hours    10    9    8    7    6

Continue
Q3: The following information is obtained from a sample data set.
n = 12, ∑x = 66, ∑y = 588, ∑xy = 2244, ∑x² = 396 and ∑y² = 58734
Find the value of the standard error and the coefficient of determination r².
Q4: The following data give the experience (in years) and monthly salaries (in hundreds of dollars) of nine randomly selected secretaries.
Experience        14    3     5     6     4     9     18    5     16
Monthly salary    62    29    37    43    35    60    67    32    60
a. Find the least squares regression line with experience as an independent variable and monthly salary as a dependent variable.
