

Engineering Analysis &

Lect. # 11
Regression Analysis

Dr. Nazeer Ahmad Anjum

Mechanical Engineering Program
Engineering University Taxila

Regression Analysis 2
A statistical procedure used to find relationships
among a set of variables
For example, in a chemical process, suppose
that the yield of the product is related to the
process-operating temperature.
Regression analysis can be used to build a
model to predict yield at a given temperature

Regression Analysis 3

Relating two data matrices/tables to each other

Purpose: prediction and interpretation

Y-data X-data


Correlation 4
A correlation is a relationship between two variables.
Typically, we take x to be the independent variable.
We take y to be the dependent variable. Data is
represented by a collection of ordered pairs (x, y).
The strength and direction of a linear relationship
between two variables is represented by the
correlation coefficient.
Suppose that there are n ordered pairs (x, y) that
make up a sample from a population. The correlation
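coefficient r is given by:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$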

Regression Analysis 5
In regression analysis, there is a dependent
variable, which is the one you are trying to
explain, and one or more independent variables
that are related to it.
Example: predicting sales as a function of
marketing, where sales is the dependent variable (y) and
marketing is the independent variable (x).
You can express the relationship as a linear
equation, such as:

y = a + bx

Regression Analysis 6
y = a + bx
• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal
to b
• Some relationships are perfectly linear and fit this equation
exactly. Your cell phone bill, for instance, may be:

Total Charges = Base Fee + 30¢ × (overage minutes)

If you know the base fee and the number of overage
minutes, you can predict the total charges exactly.
Regression Analysis 7
y = a + bx


Example Problem 8
The time x in years that an employee spent at a
company and the employee's hourly pay, y, for 5
employees are listed in the table below. Calculate
and interpret the correlation coefficient and the
equation of the regression line. Also predict the hourly
pay rate of an employee who has worked for 20 years.

x (years)        5    3    4    10   15
y (hourly pay)   25   20   21   35   38

Solution Example 9
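x      y      x²     y²      xy
5      25     25     625     125
3      20     9      400     60
4      21     16     441     84
10     35     100    1225    350
15     38     225    1444    570
Σx = 37   Σy = 139   Σx² = 375   Σy² = 4135   Σxy = 1189

Calculate the numerator: $n\sum xy - \sum x \sum y = 5(1189) - (37)(139) = 802$

Calculate the denominator: $\sqrt{[5(375) - 37^2][5(4135) - 139^2]} = \sqrt{(506)(1354)} = 827.72$

r = 802 / 827.72 = 0.97, which indicates a strong positive linear correlation.

Equation of the regression line: y = 16.11 + 1.58x

Hourly pay rate of an employee who has worked for 20 years:
y = 16.11 + 1.58(20) = 47.71
2/7/2019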
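The hand calculation above can be checked with a short script. This is a minimal sketch in plain Python, assuming the computational form of r built from the column sums (Σx, Σy, Σx², Σy², Σxy) in the solution table; the names `predict_pay`, `numerator` and `denominator` are illustrative and not part of the lecture.

```python
from math import sqrt

# Data from the example: years at the company (x) and hourly pay (y).
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
n = len(x)

# Column sums, as in the solution table.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(xv * yv for xv, yv in zip(x, y))

# Correlation coefficient r.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Least squares slope b and intercept a of the line y = a + b*x.
b = numerator / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def predict_pay(years):
    """Predicted hourly pay from the fitted line (hypothetical helper)."""
    return a + b * years


print(f"numerator = {numerator}, denominator = {denominator:.2f}, r = {r:.2f}")
print(f"regression line: y = {a:.2f} + {b:.2f}x")
print(f"predicted pay after 20 years = {predict_pay(20):.2f}")
```

Carrying full precision gives an intercept of about 16.07 and a prediction of about 47.77. The values 16.11 and 47.71 in the solution arise when the slope is rounded to 1.58 before the intercept is computed.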

Problems 10
Problem 1
Consider the following set of points: {(-2, -1), (1, 1), (3, 2)}
a) Find the least squares regression line for the given data points.
b) Plot the given points and the regression line in the same rectangular system
of axes.

Problem 2
a) Find the least squares regression line for the following set of data:
{(-1, 0), (0, 2), (1, 4), (2, 5)}
b) Plot the given points and the regression line in the same rectangular
system of axes.
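One way to check answers to part (a) of both problems is a small helper that applies the least squares formulas for the line y = a + bx. The sketch below is an assumption-laden illustration, not the lecture's own method: `least_squares_line` is a hypothetical name, and exact fractions are used only so that the results are free of rounding. Plotting for part (b) is not covered here.

```python
from fractions import Fraction


def least_squares_line(points):
    """Return (intercept a, slope b) of y = a + b*x as exact fractions."""
    n = len(points)
    sum_x = sum(Fraction(px) for px, _ in points)
    sum_y = sum(Fraction(py) for _, py in points)
    sum_xy = sum(Fraction(px) * py for px, py in points)
    sum_x2 = sum(Fraction(px) ** 2 for px, _ in points)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


problem_1 = [(-2, -1), (1, 1), (3, 2)]
problem_2 = [(-1, 0), (0, 2), (1, 4), (2, 5)]

for name, pts in (("Problem 1", problem_1), ("Problem 2", problem_2)):
    a, b = least_squares_line(pts)
    print(f"{name}: y = {a} + ({b})x, approximately "
          f"y = {float(a):.4f} + {float(b):.4f}x")
```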

Regression Analysis 11
Regression analysis develops a procedure for finding an
equation that fits a set of data.
The equation is used to predict the value of a
parameter that depends on the same data.
In statistics terminology, the variable being
predicted is called the dependent variable and the
variable(s) used to predict it are called
independent variables.
Example: in tensile testing, force is the independent
variable and elongation is the dependent variable.


Simple Linear Regression Model 12

Simple linear regression: one independent and one dependent variable,
with the relationship fitted by a straight line (linear).
Multiple regression analysis involves two or more independent variables;
curvilinear functions can also be used to analyze data.
Restaurant example: Sales ‘y’ are dependent on student
population ‘x’. The equation that describes how ‘y’ is related
to ‘x’ and an error term is called the REGRESSION MODEL,

$y = \beta_0 + \beta_1 x + \varepsilon$ (1)

$\beta_0$ and $\beta_1$ are the parameters of the model, and $\varepsilon$ is the error
term. The error term accounts for the variability in ‘y’ that
cannot be explained by the linear relationship between x and y.
Regression Equation 13
The population of restaurants can also be divided into
subpopulations. One subpopulation consists of all KFC
restaurants located near college campuses with 8000
students. Another consists of all restaurants located near
college campuses with 9000 students, and so on. Thus, one
distribution of ‘y’ values is associated with campuses with
8000 students and another distribution of ‘y’ values with 9000
students. Each distribution of ‘y’ values has its own mean or
expected value.
The equation that describes how the expected value of ‘y’,
denoted by E(y), is related to ‘x’ is called the regression equation.


Regression Equation 14


$E(y) = \beta_0 + \beta_1 x$ (2)
The graph of the simple linear regression equation
is a straight line; $\beta_0$ is the y-intercept of the
regression line, $\beta_1$ is the slope, and E(y) is the
mean or expected value of y for a given value of x.

Regression Equation 15

The regression line shows
that the mean value of y is
related positively to x, with
larger values of E(y)
associated with larger values
of x.


Regression Equation 16

The regression line here
shows the mean value of y is
related negatively to x, with
smaller values of E(y)
associated with larger values
of x.

Regression Equation 17

The regression line here
shows the case in which the
mean value of y is not related
to x; that is, the mean value
of y is the same for every
value of x.


Estimated Regression Equation 18

$\beta_0$ and $\beta_1$ must be estimated using sample data.
Sample statistics ($b_0$ and $b_1$) are computed as
estimates of the population parameters ($\beta_0$ and $\beta_1$).
Substituting them into the regression equation gives the
estimated regression equation.
The estimated regression equation for simple linear
regression is

$\hat{y} = b_0 + b_1 x$ (3)

$\hat{y}$ is the point estimate of E(y), i.e. the mean value
of y for a given value of x.

Estimated Regression Equation 19


Regression 20
Dependent variable (y)

Independent variable (x)

Regression is the attempt to explain the variation in a
dependent variable using the variation in independent
variables.
Regression is thus an explanation of interconnection.
If the independent variable(s) sufficiently explain the
variation in the dependent variable, the model can be used
for prediction.
Regression Analysis 21
Some of the relationships may not be so exact.
Weight, for instance, is to some degree a function of
height, but there are variations that height does
not explain.
If you take a sample of actual heights and
weights, you might see something like the graph to
the right.
[Figure: scatter plot of weight (axis from 100 to 220) against height (axis from 60 to 75)]


Regression Analysis 22
The line in the graph shows the
average relationship described by
the equation. Often, none of the
actual observations lie on the line.
The difference between the line and
any individual observation is the error.
The equation is:

Weight = C + 5.7*Height + ε

This equation does not mean that
people who are short enough will have
a negative weight. The
observations that contributed to this
analysis were all for heights between
5’ and 6’4”. The model will likely
provide a reasonable estimate for
anyone in this height range.
[Figure: scatter plot of weight (axis from 100 to 220) against height (axis from 60 to 75) with the fitted regression line]

Regression Analysis 23
Regression finds the line that best fits the observations. It
does this by finding the line that results in the lowest sum of
squared errors.
Since the line describes the mean of the effects of the
independent variables, by definition, the sum of the actual
errors will be zero.
If you add up all of the values of the dependent variable and
you add up all the values predicted by the model, the sum is
the same.
That is, the sum of the negative errors (for points below the
line) will exactly offset the sum of the positive errors (for points
above the line).
Summing just the errors wouldn’t be useful because the sum
is always zero. So, instead, regression uses the sum of the
squares of the errors. An Ordinary Least Squares (OLS)
regression finds the line that results in the lowest sum of
squared errors.
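Both claims above (the errors of the fitted line sum to zero, and no other line has a smaller sum of squared errors) can be illustrated numerically. The following is a minimal sketch in plain Python that reuses the employee data from Example Problem 8; the function `sse` and the list `nudges` are hypothetical names, and the check tries only a few alternative lines rather than proving the result.

```python
# Employee data from Example Problem 8: years (x) and hourly pay (y).
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least squares slope and intercept.
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean


def sse(a, b):
    """Sum of squared errors of the line y = a + b*x on the data."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


predictions = [intercept + slope * xi for xi in x]
errors = [yi - pi for yi, pi in zip(y, predictions)]

# Positive and negative errors cancel (up to floating-point rounding),
# so the actual and predicted values have the same total.
print(f"sum of errors      = {sum(errors):.10f}")
print(f"sum of actual y    = {sum(y):.4f}")
print(f"sum of predicted y = {sum(predictions):.4f}")

# Shifting the intercept or the slope always increases the SSE.
best_sse = sse(intercept, slope)
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]
for da, db in nudges:
    shifted = sse(intercept + da, slope + db)
    print(f"shift ({da:+.1f}, {db:+.1f}): SSE = {shifted:.3f} > {best_sse:.3f}")
```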

Linear regression 24
• Linear dependence: constant rate of increase of one
variable with respect to another.
• Regression analysis describes the relationship
between two (or more) variables.
– Income and educational level
– Demand for electricity and the weather
– Home sales and interest rates
• Our focus:
– Gain some understanding of the mechanics.
• The regression line
• Regression error
– Learn how to interpret and use the results.
– Learn how to set up a regression analysis.
Simple Linear Regression 25

Dependent variable (y)

$y = b_0 + b_1 x \pm \varepsilon$

$b_0$ = y-intercept
$b_1$ = slope = ∆y/∆x

Independent variable (x)

The output of a regression is a function that
predicts the dependent variable based upon
values of the independent variables.
Simple regression fits a straight line to the data.


Simple Linear Regression 26

Observation: y
Dependent variable

Prediction: $\hat{y}$

Independent variable (x)

The function will make a prediction for each
observed data point.
The observation is denoted by y and the
prediction is denoted by $\hat{y}$.
Simple Linear Regression 27

Prediction error: $\varepsilon$

Observation: y
Prediction: $\hat{y}$

For each observation, the variation can be described as:

$y = \hat{y} + \varepsilon$
Actual = Explained + Error

Regression Analysis 28
Least squares (LS) is used for estimation of the regression coefficients.

$y = b_0 + b_1 x + \varepsilon$

[Figure: simple linear regression line with intercept $b_0$ on the y axis]

Least Squares Regression 29
A least squares regression selects the line with the lowest
total sum of squared prediction errors. This value is called
the Sum of Squares of Error, or SSE.
To estimate the parameters, sample data are used. For the ith restaurant,
the estimated regression equation is

$\hat{y}_i = b_0 + b_1 x_i$ (3)

Restaurant   Student Population   Quarterly Sales
i            (1000s), xi          (Rs. 1000s), yi
1            2                    58
2            6                    105
3            8                    88
4            8                    118
5            12                   117
6            16                   137
7            20                   157
8            20                   169
9            22                   149
10           26                   202

Least Squares Regression 30
[Figure: scatter diagram of student population and quarterly sales]

Calculating SSR 31

Dependent variable (y)
Mean: $\bar{y}$

Independent variable (x)

The Sum of Squares Regression (SSR) is the sum of
the squared differences between the prediction for
each observation and the mean of the dependent variable, $\bar{y}$.


Least Squares Regression 32
$\hat{y}_i$ = estimated value of quarterly sales (Rs. 1000s) for the
ith restaurant
$b_0$ = the y-intercept of the estimated regression line
$b_1$ = the slope of the estimated regression line
$x_i$ = size of the student population (1000s) for the ith restaurant

The difference between the observed values $y_i$ and the estimated
values $\hat{y}_i$ must be made as small as possible, using the LEAST SQUARES method.
Least Squares Criterion:

$$\min \sum (y_i - \hat{y}_i)^2$$

$y_i$ = observed value of the dependent variable for the ith observation
$\hat{y}_i$ = estimated value of the dependent variable for the ith
observation
Least Squares Regression 33
The parameters can be calculated by differential calculus:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$

$x_i$ = value of the independent variable for the ith observation
$y_i$ = value of the dependent variable for the ith observation
$\bar{x}$ = mean value for the independent variable
$\bar{y}$ = mean value for the dependent variable
n = total number of observations
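The differential calculus step is not written out on the slide. A sketch of the standard argument, assuming the quantity to be minimized is the sum of squared errors written as a function $S(b_0, b_1)$ (the name $S$ is introduced here only for illustration), sets both partial derivatives to zero:

```latex
% Requires the amsmath package (align*, \text).
\begin{align*}
S(b_0, b_1) &= \sum (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum (y_i - b_0 - b_1 x_i) = 0
  \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial S}{\partial b_1} &= -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0 \\
\text{substituting } b_0: \quad
  \sum x_i (y_i - \bar{y}) &= b_1 \sum x_i (x_i - \bar{x}) \\
b_1 &= \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}
     = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because $\sum \bar{x}(y_i - \bar{y}) = 0$ and $\sum \bar{x}(x_i - \bar{x}) = 0$, so subtracting these zero terms from the numerator and denominator changes nothing.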

Least Squares Regression 34
Restaurant   Student Population   Quarterly Sales
i            (1000s), xi          (Rs. 1000s), yi
1            2                    58
2            6                    105
3            8                    88
4            8                    118
5            12                   117
6            16                   137
7            20                   157
8            20                   169
9            22                   149
10           26                   202
Total        140                  1300

n = 10, $\sum x_i = 140$, $\sum y_i = 1300$

$\bar{x} = \frac{\sum x_i}{n} = \frac{140}{10} = 14$
$\bar{y} = \frac{\sum y_i}{n} = \frac{1300}{10} = 130$
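Least Squares Regression 35
Rest.   xi    yi     xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1       2     58     -12      -72      864                144
2       6     105    -8       -25      200                64
3       8     88     -6       -42      252                36
4       8     118    -6       -12      72                 36
5       12    117    -2       -13      26                 4
6       16    137    2        7        14                 4
7       20    157    6        27       162                36
8       20    169    6        39       234                36
9       22    149    8        19       152                64
10      26    202    12       72       864                144
Total   140   1300                     2840               568

n = 10

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{2840}{568} = 5$$
$$b_0 = \bar{y} - b_1 \bar{x} = 130 - 5(14) = 60$$

Estimated regression equation: $\hat{y} = 60 + 5x$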
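The slope and intercept for the restaurant example can be reproduced with a few lines of plain Python. This is a sketch under one stated assumption: the sales for restaurants 6 and 7 are taken as 137 and 157, the values consistent with the stated total of 1300 and with the results b1 = 5 and b0 = 60. The variable names are illustrative.

```python
# Student population (1000s) and quarterly sales (Rs. 1000s), 10 restaurants.
# Assumption: restaurants 6 and 7 have sales of 137 and 157, which matches
# the stated total (sum of y = 1300) and the fitted line y = 60 + 5x.
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(population)
x_mean = sum(population) / n
y_mean = sum(sales) / n

# Sum of cross-products and sum of squared deviations about the means.
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(population, sales))
sxx = sum((xi - x_mean) ** 2 for xi in population)

b1 = sxy / sxx              # slope
b0 = y_mean - b1 * x_mean   # intercept

print(f"x mean = {x_mean}, y mean = {y_mean}")
print(f"sum of cross-products = {sxy}, sum of squared x deviations = {sxx}")
print(f"estimated regression equation: y = {b0} + {b1}x")
```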

Graph of Estimated Regression Equation 36

Coefficient of Determination 37

How well does the estimated regression equation fit the
data? To answer this, the coefficient of determination is used. It
provides a measure of goodness of fit.
For the ith observation, the difference between the observed
value of the dependent variable, $y_i$, and the estimated value
of the dependent variable, $\hat{y}_i$, is called the ith residual. Thus the ith
residual is $y_i - \hat{y}_i$.


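Coefficient of Determination 38
The sum of squares due to error (SSE) is a
measure of the error.

$$\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$$

In the example, for restaurant 1, $x_1 = 2$ and $y_1 = 58$:

$\hat{y}_1 = 60 + 5x_1 = 60 + 5(2) = 70$
$y_1 - \hat{y}_1 = 58 - 70 = -12$
$(y_1 - \hat{y}_1)^2 = 144$

After computing and squaring the residuals for every restaurant, sum them to
obtain SSE, the error in using the estimated regression equation to predict sales.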

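Coefficient of Determination 39
Calculation of SSE from the estimated regression equation:

Rest.   xi    yi     Predicted Sales    Error        Squared Error
i                    ŷi = 60 + 5xi      (yi − ŷi)    (yi − ŷi)²
1       2     58     70                 -12          144
2       6     105    90                 15           225
3       8     88     100                -12          144
4       8     118    100                18           324
5       12    117    120                -3           9
6       16    137    140                -3           9
7       20    157    160                -3           9
8       20    169    160                9            81
9       22    149    170                -21          441
10      26    202    190                12           144
                                        SSE =        1530

Without knowing the size of the student population, and without any
related variables, sales would have to be estimated from the mean of y alone.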
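The SSE total can be reproduced row by row. The sketch below is plain Python under two assumptions: the fitted line is ŷ = 60 + 5x as stated in the table header, and the sales for restaurants 6 and 7 are 137 and 157 (the values consistent with SSE = 1530). The list names are illustrative.

```python
# Student population (1000s) and quarterly sales (Rs. 1000s).
# Assumption: restaurants 6 and 7 have sales of 137 and 157.
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60, 5  # estimated regression equation: y_hat = 60 + 5x

predicted = [b0 + b1 * xi for xi in population]
errors = [yi - yhat for yi, yhat in zip(sales, predicted)]
squared_errors = [e ** 2 for e in errors]
sse = sum(squared_errors)

# Print one row per restaurant: x, y, prediction, error, squared error.
for i, row in enumerate(zip(population, sales, predicted, errors, squared_errors), 1):
    print(i, *row)
print("SSE =", sse)
```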

Coefficient of Determination 40
The Total Sum of Squares (SST) is equal to SSR (sum of
squares due to regression) + SSE (sum of squares due to error).

$\mathrm{SSR} = \sum (\hat{y}_i - \bar{y})^2$ (sum of squares due to regression, the explained variation)
$\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$ (measure of unexplained variation)
$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE} = \sum (y_i - \bar{y})^2$ (measure of total variation in y)

Coefficient of Determination 41
Computation of total sum of squares

Rest.   xi    yi     Deviation    Squared Deviation
i                    yi − ȳ       (yi − ȳ)²
1       2     58     -72          5184
2       6     105    -25          625
3       8     88     -42          1764
4       8     118    -12          144
5       12    117    -13          169
6       16    137    7            49
7       20    157    27           729
8       20    169    39           1521
9       22    149    19           361
10      26    202    72           5184
                     SST =        15730

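Coefficient of Determination 42

For the best fit, $\hat{y}_i = y_i$ for every observation, so SSE = 0,
SST = SSR, and SSR/SST = 1.

For the poorest fit, SSR = 0, so SST = SSE and SSR/SST = 0.

The ratio SSR/SST lies between zero and 1 and is called the
coefficient of determination, $r^2 = \mathrm{SSR}/\mathrm{SST}$. For the
considered example, SSR = SST − SSE = 15730 − 1530 = 14200, so
$r^2 = 14200/15730 = 0.9027$.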
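The three sums of squares and the coefficient of determination for the restaurant example can be verified together. This is a minimal plain-Python sketch, assuming the fitted line ŷ = 60 + 5x and sales of 137 and 157 for restaurants 6 and 7 (the values consistent with SST = 15730); the variable names are illustrative.

```python
# Student population (1000s) and quarterly sales (Rs. 1000s).
# Assumption: restaurants 6 and 7 have sales of 137 and 157.
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60, 5  # estimated regression equation: y_hat = 60 + 5x
y_mean = sum(sales) / len(sales)
predicted = [b0 + b1 * xi for xi in population]

sst = sum((yi - y_mean) ** 2 for yi in sales)                     # total variation
ssr = sum((yhat - y_mean) ** 2 for yhat in predicted)             # explained
sse = sum((yi - yhat) ** 2 for yi, yhat in zip(sales, predicted)) # unexplained

r_squared = ssr / sst

print(f"SST = {sst}, SSR = {ssr}, SSE = {sse}")
print(f"SSR + SSE = {ssr + sse}")
print(f"r^2 = {r_squared:.4f}")
```

An r² of about 0.90 means that roughly 90% of the variation in quarterly sales is explained by the linear relationship with student population.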

Coefficient of Determination 43




Exercise problems 44

Book: Statistics for Business and Economics, 11th
Edition, Anderson, Williams, Sweeney
Chapter # 14, Simple Linear Regression Analysis
Problems 1 through 4, 6, 8, 12, 15, 16, 17, 18, 20

