
ME-5101

Engineering Analysis & Statistics
Lect. # 11
Regression Analysis

Dr. Nazeer Ahmad Anjum


Mechanical Engineering Program
Engineering University Taxila

Regression Analysis 2
A statistical procedure used to find relationships
among a set of variables
For example, in a chemical process, suppose
that the yield of the product is related to the
process-operating temperature.
Regression analysis can be used to build a
model to predict yield at a given temperature
level.

2/7/2019
Regression Analysis 3

Relating two data matrices/tables to each other

Purpose: prediction and interpretation

Y-data X-data


Correlation 4
A correlation is a relationship between two variables.
Typically, we take x to be the independent variable.
We take y to be the dependent variable. Data is
represented by a collection of ordered pairs (x, y).
The strength and direction of a linear relationship
between two variables is represented by the
correlation coefficient.
Suppose that there are n ordered pairs (x, y) that
make up a sample from a population. The correlation
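coefficient r is given by:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$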
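As a rough illustration, the correlation coefficient can be computed from raw sums (n, Σx, Σy, Σxy, Σx², Σy²). The sketch below is not part of the lecture: the function name `correlation_coefficient` and the sample points are assumptions chosen so that the result is easy to check by hand.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical points lying exactly on the line y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
r_up = correlation_coefficient(xs, ys)                   # perfectly increasing line
r_down = correlation_coefficient(xs, [-y for y in ys])   # perfectly decreasing line
print(r_up, r_down)
```

Points on a rising straight line give r very close to +1, and points on a falling line give r very close to -1; real data usually falls somewhere in between.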

Regression Analysis 5
In regression analysis, there is a dependent
variable, which is the one you are trying to
explain, and one or more independent variables
that are related to it.
Example: predicting sales as a function of
marketing. Sales is the dependent variable (y) and
marketing is the independent variable (x).
You can express the relationship as a linear
equation, such as:

y = a + bx

Regression Analysis 6
y = a + bx
• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal
to b
• Some relationships are perfectly linear and fit this equation
exactly. Your cell phone bill, for instance, may be:

Total Charges = Base Fee + 30% (overage minutes)

If you know the base fee and the number of overage
minutes, you can predict the total charges exactly.
Regression Analysis 7
y = a + bx


Example Problem 8
The time x in years that an employee spent at a
company and the employee's hourly pay, y, for 5
employees are listed in the table below. Calculate
and interpret the correlation coefficient and
equation of regression line. Also predict the hourly
pay rate of an employee who has worked for 20
years.
x 5 3 4 10 15
y 25 20 21 35 38

Solution Example 9
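x      y      x²     y²     xy
5      25     25     625    125
3      20     9      400    60
4      21     16     441    84
10     35     100    1225   350
15     38     225    1444   570
Σx = 37, Σy = 139, Σx² = 375, Σy² = 4135, Σxy = 1189

Numerator of r: nΣxy - (Σx)(Σy) = 5(1189) - (37)(139) = 802
Denominator of r: √(5(375) - 37²) · √(5(4135) - 139²) = √506 · √1354 ≈ 827.72
r = 802 / 827.72 ≈ 0.97, which is close to +1 and indicates a strong
positive linear relationship between years of service and hourly pay.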

Equation of regression line is: y = 16.11 + 1.58x

Predicted hourly pay rate for an employee who has worked for 20 years:
y = 16.11 + 1.58(20) = 47.71
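The arithmetic of this example can be checked with a short script. This is a minimal sketch using only the five (x, y) pairs from the problem; the variable names are chosen here for illustration. Note that the slide rounds the slope to 1.58 before computing the intercept, which gives 16.11 and 47.71, whereas carrying full precision gives an intercept of about 16.07 and a prediction of about 47.77.

```python
import math

# Data from the example: years at the company (x) and hourly pay (y)
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                  # 37, 139
sum_x2 = sum(v * v for v in x)                 # 375
sum_y2 = sum(v * v for v in y)                 # 4135
sum_xy = sum(a * b for a, b in zip(x, y))      # 1189

numerator = n * sum_xy - sum_x * sum_y         # 802
sxx = n * sum_x2 - sum_x ** 2                  # 506
syy = n * sum_y2 - sum_y ** 2                  # 1354

r = numerator / math.sqrt(sxx * syy)           # about 0.97
b = numerator / sxx                            # slope, about 1.58
a = sum_y / n - b * sum_x / n                  # intercept, about 16.07 unrounded
pay_20 = a + b * 20                            # about 47.77 unrounded

print(round(r, 2), round(b, 2), round(a, 2), round(pay_20, 2))
```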

Problems 10
Problem 1
Consider the following set of points: {(-2, -1), (1, 1), (3, 2)}
a) Find the least squares regression line for the given data points.
b) Plot the given points and the regression line in the same rectangular system
of axes.

Problem 2
a) Find the least squares regression line for the following set of data:
{(-1, 0), (0, 2), (1, 4), (2, 5)}
b) Plot the given points and the regression line in the same rectangular
system of axes.
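One way to check hand calculations for part (a) of both problems is sketched below. The helper `least_squares_line` is a hypothetical name introduced here; it uses the raw-sum form of the slope and intercept for the line y = a + bx. Plotting for part (b) is left out of this sketch.

```python
def least_squares_line(points):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

problem_1 = [(-2, -1), (1, 1), (3, 2)]
problem_2 = [(-1, 0), (0, 2), (1, 4), (2, 5)]

a1, b1 = least_squares_line(problem_1)   # a = 5/19, b = 23/38
a2, b2 = least_squares_line(problem_2)   # a = 1.9, b = 1.7
print(f"Problem 1: y = {a1:.4f} + {b1:.4f}x")
print(f"Problem 2: y = {a2:.4f} + {b2:.4f}x")
```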
Regression Analysis 11
Regression analysis develops a procedure for finding an
equation that fits a set of data.
The equation is used to predict the value of a
parameter that depends on that data.
In statistical terminology, the variable being
predicted is called the dependent variable, and the
variable(s) used to predict it are called
independent variables.
Example: in tensile testing, force is the independent
variable and elongation is the dependent variable.

Simple Linear Regression Model 12

Simple linear regression involves a single independent variable
and a single dependent variable, with the relationship between
them approximated by a straight line.
Multiple regression analysis involves two or more independent
variables; curvilinear relationships need functions other than
a straight line.
Restaurant example: sales y depend on student population x.
The equation that describes how y is related to x and an error
term is called the regression model.

SIMPLE LINEAR REGRESSION MODEL

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (1)$$

β0 and β1 are the parameters of the model, and ε is the error
term. The error term accounts for the variability in y that
cannot be explained by the linear relationship between x and
y.
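The role of the error term can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values β0 = 60 and β1 = 5, the normally distributed error with standard deviation 10, and the function name `simulate_sales` are all hypothetical choices made for illustration, not values given in the lecture.

```python
import random

def simulate_sales(x, beta0, beta1, sigma, rng):
    """One draw from y = beta0 + beta1*x + epsilon, with epsilon ~ Normal(0, sigma)."""
    epsilon = rng.gauss(0.0, sigma)   # assumed error distribution
    return beta0 + beta1 * x + epsilon

rng = random.Random(0)                    # fixed seed so the run is repeatable
beta0, beta1, sigma = 60.0, 5.0, 10.0     # hypothetical parameter values

# Many restaurants that all have the same student population x = 8 (thousands)
draws = [simulate_sales(8, beta0, beta1, sigma, rng) for _ in range(10_000)]
mean_y = sum(draws) / len(draws)
print(min(draws), max(draws))   # individual y values vary because of the error term
print(mean_y)                   # close to beta0 + beta1 * 8 = 100
```

Individual sales figures scatter around the line, but their average at a fixed x settles near β0 + β1x.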
Regression Equation 13
The population of restaurants can also be divided into
subpopulations. One subpopulation consists of all KFC
restaurants located near college campuses with 8000
students. Another consists of all restaurants located near
college campuses with 9000 students, and so on. Thus, one
distribution of y values is associated with campuses with
8000 students, and another distribution of y values with
campuses with 9000 students. Each distribution of y values
has its own mean or expected value.
The equation that describes how the expected value of y,
denoted by E(y), is related to x is called the regression equation.

Regression Equation 14

SIMPLE LINEAR REGRESSION EQUATION

$$E(y) = \beta_0 + \beta_1 x \qquad (2)$$
The graph of the simple linear regression equation
is a straight line; β0 is the y-intercept of the
regression line, β1 is the slope, and E(y) is the
mean or expected value of y for a given value of x.
Regression Equation 15
POSITIVE LINEAR RELATIONSHIP

The regression line shows that the mean value of y is
related positively to x, with larger values of E(y)
associated with larger values of x.

Regression Equation 16
NEGATIVE LINEAR RELATIONSHIP

The regression line here shows that the mean value of y is
related negatively to x, with smaller values of E(y)
associated with larger values of x.
Regression Equation 17
NO RELATIONSHIP

The regression line here shows the case in which the
mean value of y is not related to x; that is, the mean
value of y is the same for every value of x.

Estimated Regression Equation 18

β0 and β1 must be estimated using sample data.
Sample statistics (b0 and b1) are computed as
estimates of the population parameters (β0 and β1).
Substituting them into the regression equation gives the
estimated regression equation.
The estimated regression equation for simple linear
regression is

$$\hat{y} = b_0 + b_1 x \qquad (3)$$

ŷ is the point estimate of E(y), i.e. the mean value
of y for a given value of x.
Estimated Regression Equation 19


Regression Analysis
Regression 20
[Figure: dependent variable (y) plotted against independent variable (x)]

Regression is the attempt to explain the variation in a
dependent variable using the variation in independent
variables.
Regression is thus an explanation of interconnection.
If the independent variable(s) sufficiently explain the
variation in the dependent variable, the model can be used
for prediction.
Regression Analysis 21
Some of the relationships may not be so exact.
Weight, for instance, is to some degree a function of
height, but there are variations that height does
not explain.
If you take a sample of actual heights and
weights, you might see something like the graph to
the right.
[Figure: graph of Weight (axis from 100 to 220) against Height (axis from 60 to 75)]

Regression Analysis 22
The line in the graph shows the
average relationship described by
the equation. Often, none of the
actual observations lie on the line.
The difference between the line and
any individual observation is the
error.
The equation is:

Weight = C + 5.7*Height + ε

This equation does not mean that
people who are short enough will have
a negative weight. The
observations that contributed to this
analysis were all for heights between
5' and 6'4". The model will likely
provide a reasonable estimate for
anyone in this height range.
[Figure: graph of Weight (axis from 100 to 220) against Height (axis from 60 to 75), with the fitted line]
Regression Analysis 23
Regression finds the line that best fits the observations. It
does this by finding the line that results in the lowest sum of
squared errors.
Since the line describes the mean of the effects of the
independent variables, by definition, the sum of the actual
errors will be zero.
If you add up all of the values of the dependent variable and
you add up all the values predicted by the model, the sum is
the same.
That is, the sum of the negative errors (for points below the
line) will exactly offset the sum of the positive errors (for points
above the line).
Summing just the errors wouldn’t be useful because the sum
is always zero. So, instead, regression uses the sum of the
squares of the errors. An Ordinary Least Squares (OLS)
regression finds the line that results in the lowest sum of
squared errors.
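Both properties described above (the errors of the fitted line sum to zero, and the fitted line has the lowest sum of squared errors) can be checked numerically. The following is a minimal sketch; the height and weight values are made-up illustration data, and `fit_line` and `sum_squared_errors` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between observations and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical height (inches) and weight (pounds) observations
heights = [60, 62, 65, 68, 70, 72, 75]
weights = [115, 130, 138, 160, 172, 168, 205]

b0, b1 = fit_line(heights, weights)
errors = [w - (b0 + b1 * h) for h, w in zip(heights, weights)]
print(sum(errors))   # zero up to floating-point rounding

sse_ols = sum_squared_errors(heights, weights, b0, b1)
sse_shifted = sum_squared_errors(heights, weights, b0 + 1.0, b1)   # same slope, line moved up
sse_tilted = sum_squared_errors(heights, weights, b0, b1 + 0.1)    # same intercept, steeper line
print(sse_ols < sse_shifted, sse_ols < sse_tilted)
```

Any line other than the least squares line, whether shifted or tilted, gives a larger sum of squared errors on the same data.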

Linear regression 24
•Linear dependence: constant rate of increase of one
variable with respect to another.
•Regression analysis describes the relationship
between two (or more) variables.
•Examples:
– Income and educational level
– Demand for electricity and the weather
– Home sales and interest rates
•Our focus:
–Gain some understanding of the mechanics.
• The regression line
• Regression error
– Learn how to interpret and use the results.
– Learn how to set up a regression analysis.
Simple Linear Regression 25

[Figure: dependent variable (y) against independent variable (x), showing the fitted line ŷ = b0 + b1X ± ε, where b0 is the y-intercept and b1 = ∆y/∆x is the slope]

The output of a regression is a function that
predicts the dependent variable based upon
values of the independent variables.
Simple regression fits a straight line to the data.

Simple Linear Regression 26

[Figure: dependent variable against independent variable (x), marking an observation y and its prediction ŷ]

The function will make a prediction for each
observed data point.
The observation is denoted by y and the
prediction is denoted by ŷ.
Simple Linear Regression 27

[Figure: an observation y, its prediction ŷ, and the prediction error ε between them]

For each observation, the variation can be described as:

$$y = \hat{y} + \varepsilon$$
Actual = Explained + Error
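The split of each observation into an explained part and an error can be tabulated directly. The sketch below reuses the years-of-service and hourly-pay data from the earlier example problem and fits the line at full precision; the variable names are illustrative choices.

```python
# Example data from the lecture: years of service (x) and hourly pay (y)
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

rows = []
for xi, yi in zip(x, y):
    explained = b0 + b1 * xi     # prediction (y-hat)
    error = yi - explained       # prediction error
    rows.append((yi, explained, error))
    print(f"actual={yi:5.2f}  explained={explained:5.2f}  error={error:6.2f}")
```

Each row satisfies Actual = Explained + Error, with positive errors for points above the line and negative errors for points below it.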

Regression Analysis 28
Least squares (LS) is used for estimation of the regression
coefficients b0 and b1 in the simple linear regression

$$y = b_0 + b_1 x + \varepsilon$$
Least Squares Regression 29
A least squares regression selects the line with the lowest
total sum of squared prediction errors. This value is called
the Sum of Squares of Error, or SSE.
To estimate the parameters, sample data are used. For the ith restaurant,
the estimated regression equation is

$$\hat{y}_i = b_0 + b_1 x_i \qquad (3)$$

Restaurant i   Student Population (1000s), xi   Quarterly Sales (Rs. 1000s), yi
1              2                                58
2              6                                105
3              8                                88
4              8                                118
5              12                               117
6              16                               137
7              20                               157
8              20                               169
9              22                               149
10             26                               202
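To make the least squares idea concrete, the SSE of a few candidate lines can be compared on the restaurant data. This is a sketch only: the three candidate lines are hypothetical choices, and the sales values 137 and 157 for restaurants 6 and 7 are the ones consistent with the total Σyi = 1300 stated in the lecture.

```python
# Restaurant data: student population (1000s) and quarterly sales (Rs. 1000s)
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

def sse(b0, b1):
    """Sum of squared prediction errors for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(population, sales))

# Hypothetical candidate lines; least squares prefers the one with the smallest SSE
candidates = [(50, 6), (60, 5), (70, 4)]
for b0, b1 in candidates:
    print(b0, b1, sse(b0, b1))

best = min(candidates, key=lambda line: sse(*line))
print(best)
```

Among these three candidates, the line with intercept 60 and slope 5 has the smallest SSE.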

Least Squares Regression 30
[Figure: scatter diagram of student population and quarterly sales]
Least Regression Analysis
Calculating SSR 31

[Figure: dependent variable against independent variable (x), with the mean ȳ marked]

The Sum of Squares Regression (SSR) is the sum of
the squared differences between the prediction for
each observation and the mean of the observed y values, ȳ.

Least Regression Analysis


Calculating SSR 32
ŷi = estimated value of quarterly sales (Rs. 1000s) for the
ith restaurant
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith
restaurant

The difference between the observed values yi and the estimated
values ŷi must be made as small as possible, using the LEAST SQUARES
CRITERION:

$$\min \sum (y_i - \hat{y}_i)^2$$

yi = observed value of the dependent variable for the ith
observation
ŷi = estimated value of the dependent variable for the ith
observation
Least Regression Analysis
Calculating SSR 33
The parameters can be calculated using differential calculus:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \qquad (4)$$

xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value for the independent variable
ȳ = mean value for the dependent variable
n = total number of observations
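The calculus step can be sketched as follows. This is the standard derivation written with the symbols defined above, not a transcription of the slide: the sum of squared errors S is differentiated with respect to b0 and b1, each derivative is set to zero, and the first result is substituted into the second.

```latex
\begin{aligned}
S(b_0, b_1) &= \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right) = 0
  \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \left(y_i - b_0 - b_1 x_i\right) = 0
  \;\Rightarrow\; b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
```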

Least Regression Analysis


Calculating SSR 34
Restaurant i   Student Population (1000s), xi   Quarterly Sales (Rs. 1000s), yi
1              2                                58
2              6                                105
3              8                                88
4              8                                118
5              12                               117
6              16                               137
7              20                               157
8              20                               169
9              22                               149
10             26                               202
Total          Σxi = 140                        Σyi = 1300

n = 10

$$\bar{x} = \frac{\sum x_i}{n} = \frac{140}{10} = 14, \qquad \bar{y} = \frac{\sum y_i}{n} = \frac{1300}{10} = 130$$
Least Regression Analysis
Calculating SSR 35
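Rest. i   xi    yi     xi - x̄   yi - ȳ   (xi - x̄)(yi - ȳ)   (xi - x̄)²
1         2     58     -12      -72      864                144
2         6     105    -8       -25      200                64
3         8     88     -6       -42      252                36
4         8     118    -6       -12      72                 36
5         12    117    -2       -13      26                 4
6         16    137    2        7        14                 4
7         20    157    6        27       162                36
8         20    169    6        39       234                36
9         22    149    8        19       152                64
10        26    202    12       72       864                144
Total     140   1300                     2840               568

n = 10

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{2840}{568} = 5, \qquad b_0 = \bar{y} - b_1 \bar{x} = 130 - 5(14) = 60$$

$$\hat{y}_i = 60 + 5x_i$$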
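The slope and intercept for the restaurant data can be reproduced with a few lines of Python. This is a minimal sketch; the variable names and the `predict` helper are introduced here for illustration, and the sales values 137 and 157 for restaurants 6 and 7 are the ones consistent with the stated total Σyi = 1300.

```python
# Restaurant data: student population (1000s) and quarterly sales (Rs. 1000s)
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(population)
x_bar = sum(population) / n                                               # 14.0
y_bar = sum(sales) / n                                                    # 130.0
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(population, sales))   # 2840.0
sxx = sum((x - x_bar) ** 2 for x in population)                           # 568.0

b1 = sxy / sxx              # slope: 5.0
b0 = y_bar - b1 * x_bar     # intercept: 60.0

def predict(x):
    """Estimated quarterly sales (Rs. 1000s) for a student population of x thousand."""
    return b0 + b1 * x

print(b0, b1, predict(16))
```

For a campus with 16,000 students, the estimated regression equation predicts quarterly sales of 60 + 5(16) = 140 (Rs. 1000s).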

Graph of Estimated Regression Equation 36
Coefficient of Determination 37

How well does the estimated regression equation fit the
data? The coefficient of determination is used to answer this. It
provides a measure of goodness of fit.
For the ith observation, the difference between the observed
value of the dependent variable, yi, and the estimated value
of the dependent variable, ŷi, is called the ith residual. Thus the ith
residual is yi - ŷi.

Coefficient of Determination 38
The sum of squares due to error (SSE) is a
measure of the error.

$$SSE = \sum (y_i - \hat{y}_i)^2$$

In the example, for restaurant 1, xi = 2 and yi = 58:

ŷi = 60 + 5xi = 60 + 5(2) = 70
yi - ŷi = 58 - 70 = -12
(yi - ŷi)² = 144

After computing and squaring the residual for each observation, sum the
squared residuals to obtain SSE, a measure of the error in using the
estimated regression equation to predict sales.
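Coefficient of Determination 39
Computation of SSE using the estimated regression equation ŷi = 60 + 5xi.
(Estimating sales without knowing the size of the student population, and
without any related variables, would rely on the sample mean ȳ alone.)

Rest. i   xi    yi     Predicted Sales ŷi = 60 + 5xi   Error yi - ŷi   Squared Error (yi - ŷi)²
1         2     58     70                              -12             144
2         6     105    90                              15              225
3         8     88     100                             -12             144
4         8     118    100                             18              324
5         12    117    120                             -3              9
6         16    137    140                             -3              9
7         20    157    160                             -3              9
8         20    169    160                             9               81
9         22    149    170                             -21             441
10        26    202    190                             12              144
                                                                       SSE = 1530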
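The residuals and SSE for the restaurant data can be recomputed as a check. This sketch assumes the estimated regression equation ŷ = 60 + 5x from the lecture and uses the sales values 137 and 157 for restaurants 6 and 7, which are the ones consistent with the stated totals; the variable names are illustrative.

```python
# Restaurant data: student population (1000s) and quarterly sales (Rs. 1000s)
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60, 5   # estimated regression equation from the lecture

predicted = [b0 + b1 * x for x in population]
residuals = [y - y_hat for y, y_hat in zip(sales, predicted)]
squared = [e ** 2 for e in residuals]
sse = sum(squared)

print(residuals[0], squared[0])   # first restaurant: -12 and 144
print(sse)                        # 1530
```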

Coefficient of Determination 40
The Total Sum of Squares (SST) is equal to SSR (sum of
squares due to regression) + SSE (sum of squares due to
error).
Mathematically,
$SSR = \sum (\hat{y}_i - \bar{y})^2$ (sum of squares due to regression)
$SSE = \sum (y_i - \hat{y}_i)^2$ (measure of unexplained variation)
$SST = SSR + SSE = \sum (y_i - \bar{y})^2$ (measure of total variation in y)
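The identity SST = SSR + SSE can be verified numerically on the restaurant data. The sketch below assumes the estimated regression equation ŷ = 60 + 5x from the lecture and the sales values 137 and 157 for restaurants 6 and 7 (consistent with the stated total of 1300); the variable names are illustrative.

```python
# Restaurant data: student population (1000s) and quarterly sales (Rs. 1000s)
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60, 5                                  # estimated regression equation
y_bar = sum(sales) / len(sales)                 # 130.0
predicted = [b0 + b1 * x for x in population]

sst = sum((y - y_bar) ** 2 for y in sales)                        # total variation: 15730.0
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)            # explained variation: 14200.0
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted)) # unexplained variation: 1530

print(sst, ssr, sse, ssr + sse == sst)
```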
Coefficient of Determination 41
Computation of total sum of squares

Rest. i   xi    yi     Deviation yi - ȳ   Squared Deviation (yi - ȳ)²
1         2     58     -72                5184
2         6     105    -25                625
3         8     88     -42                1764
4         8     118    -12                144
5         12    117    -13                169
6         16    137    7                  49
7         20    157    27                 729
8         20    169    39                 1521
9         22    149    19                 361
10        26    202    72                 5184

SST = 15730

Coefficient of Determination 42

For the best fit, yi = ŷi for every observation, so SSE = 0 and
SST = SSR, giving SSR/SST = 1.

For the poorest fit, SSR = 0 and SST = SSE, giving SSR/SST = 0.

The ratio SSR/SST lies between zero and 1 and is called the
coefficient of determination, r² = SSR/SST. For the
example considered, r² = 14200/15730 = 0.9027.
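The coefficient of determination for the example follows directly from the sums of squares. The sketch below takes SST = 15730 and SSE = 1530 from the lecture; the last step, recovering the sample correlation coefficient as the square root of r² with the sign of the slope, is a standard result for simple linear regression that goes slightly beyond what the slide states.

```python
import math

sst = 15730          # total sum of squares from the lecture
sse = 1530           # sum of squares due to error from the lecture
ssr = sst - sse      # sum of squares due to regression: 14200

r_squared = ssr / sst                          # about 0.9027
b1 = 5                                         # slope of the estimated regression equation
r = math.copysign(math.sqrt(r_squared), b1)    # correlation takes the sign of the slope

print(round(r_squared, 4), round(r, 4))
```

An r² of about 0.90 means that roughly 90% of the variation in quarterly sales is explained by the linear relationship with student population.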
Coefficient of Determination 43
[Figure: graph labelled with SSE, SST, and SSR]

Exercise Problems 44

Book: Statistics for Business and Economics - 11th
Edition - Anderson, Williams, Sweeney
Chapter # 14, Simple Linear Regression Analysis
Problems 1 through 4, 6, 8, 12, 15, 16, 17, 18, 20
