
Building Empirical Models

Introduction
• When two or more variables of interest are related, but the mechanistic model relating them is unknown, a model relating the variables based on observed data is called an empirical model.
Introduction
• Suppose y is the salt concentration (milligrams/liter) found in surface streams in a particular watershed and x is the percentage of the watershed area consisting of paved roads.
• There is no obvious physical mechanism that relates the salt concentration to the roadway area, but the scatter diagram indicates that some relationship, possibly linear, does exist.
Introduction
A linear relationship will not pass exactly
through all of the points, but there is an
indication that the points are scattered
randomly about a straight line. Therefore, it is
probably reasonable to assume that the mean
of the random variable Y (the salt
concentration) is related to roadway area x by
the following straight-line relationship:

E(Y|x) = μY|x = β0 + β1x

where the slope β1 and intercept β0 of the line are unknown parameters.
Introduction
• The notation E(Y|x) represents the expected value of the response variable Y at a particular value of the regressor variable x.
• Although the mean of Y is a linear
function of x, the actual observed
value y does not fall exactly on a
straight line.
• The appropriate way to generalize this
to a probabilistic linear model is to
assume that the expected value of Y is
a linear function of x, but that for a
fixed value of x the actual value of Y is
determined by the mean value
function (the linear model) plus a
random error term 𝝐.
Introduction
• The probabilistic linear model is

Y = β0 + β1x + ϵ     (6-1)

• If x is fixed, the random component ϵ on the right-hand side of the model in Equation 6-1 determines the properties of Y.
• Suppose that the mean and variance of ϵ are 0 and σ², respectively. Then the mean of Y given x is

E(Y|x) = β0 + β1x

• The variance of Y given x is

V(Y|x) = σ²
Figure: The distribution of Y for a given value of x for the salt concentration – roadway area data.
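The two moment formulas above can be made concrete with a small simulation sketch. The slope and intercept below are the fitted values quoted later in these notes (2.6765 and 17.5467); σ = 2 is an assumed value chosen purely for illustration.

```python
import random

# Sketch of the probabilistic model Y = beta0 + beta1*x + eps with
# E(eps) = 0 and Var(eps) = sigma^2. The slope/intercept are the fitted
# values quoted later in these notes; sigma = 2 is an assumed value.
beta0, beta1, sigma = 2.6765, 17.5467, 2.0
x = 1.25  # a fixed value of the regressor

random.seed(1)
ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for _ in range(100_000)]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)

print(round(mean_y, 2))  # close to E(Y|x) = beta0 + beta1*x = 24.61
print(round(var_y, 2))   # close to sigma^2 = 4.0
```

At a fixed x the simulated Y values scatter around the line's height with variance σ², exactly as the figure above depicts.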
Introduction
● The true regression model μY|x = β0 + β1x is a line of mean values; that is, the height of the regression line at any value of x is simply the expected value of Y for that x.
● The slope, β1, can be interpreted as the change in the mean of Y for a unit change in x.
● The variability of Y at a particular value of x is determined by the error variance σ². This implies that there is a distribution of Y values at each x and that the variance of this distribution is the same at each x.
Introduction
• There are many empirical model building situations in which there is more than
one regressor variable. Once again, a regression model can be used to describe
the relationship. A regression model that contains more than one regressor
variable is called a multiple regression model.
• As an example, suppose that the effective life of a cutting tool depends on the cutting speed and the tool angle. A multiple regression model that might describe this relationship is

Y = β0 + β1x1 + β2x2 + ϵ

• where Y represents the tool life, x1 represents the cutting speed, x2 represents the tool angle, and ϵ is a random error term. This is a multiple linear regression model with two regressors.
• The term linear is used because the equation is a linear function of the unknown parameters β0, β1, and β2.
Introduction
The parameter 𝛽0 is the intercept of
the plane. We sometimes call 𝛽1 and
𝛽2 partial regression coefficients
because 𝛽1 measures the expected
change in Y per unit change in x1
when x2 is held constant, and 𝛽2
measures the expected change in Y
per unit change in x2 when x1 is held
constant.
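The partial-coefficient interpretation above can be checked with a tiny numerical sketch. The coefficient values here are invented for illustration and are not taken from any dataset.

```python
# Two-regressor mean function E(Y) = b0 + b1*x1 + b2*x2.
# Coefficient values below are made up purely for illustration.
b0, b1, b2 = 50.0, -2.0, 7.0

def mean_response(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# With x2 held constant, a unit change in x1 shifts the mean by exactly b1:
delta = mean_response(11.0, 3.0) - mean_response(10.0, 3.0)
print(delta)  # -2.0, i.e. the partial regression coefficient b1
```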

Figure: A contour plot of the regression model, that is, lines of constant E(Y) as a function of x1 and x2. Note that the contour lines in this plot are straight lines.
Introduction
• In general, the response Y may be related to k regressor variables through the multiple linear regression model

Y = β0 + β1x1 + β2x2 + . . . + βkxk + ϵ

• The parameters βj, j = 0, 1, . . . , k, are called regression coefficients. This model describes a hyperplane in the space of the regressor variables xj and Y. The parameter βj represents the expected change in response Y per unit change in xj when all the remaining regressors xi (i ≠ j) are held constant.
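The same interpretation can be checked numerically for the k-regressor hyperplane; a minimal sketch with invented coefficient values:

```python
# General mean function E(Y) = beta0 + sum_j beta_j * x_j.
# betas[0] is the intercept; all values are invented for illustration.
betas = [1.0, 0.5, -1.5, 2.0]

def mean_response(xs):
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

x = [2.0, 1.0, 3.0]
bumped = x.copy()
bumped[1] += 1.0  # increase x2 by one unit, holding x1 and x3 fixed

# The mean response changes by exactly beta2:
delta = mean_response(bumped) - mean_response(x)
print(delta)  # -1.5
```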
Least Squares Estimation
• The case of simple linear regression considers a single regressor or predictor x and a
dependent or response variable Y.
• Suppose that the true relationship between Y and x is a straight line and that the
observation Y at each level of x is a random variable.
• The expected value of Y for each value of x is

E(Y|x) = β0 + β1x

• where the intercept β0 and the slope β1 are unknown regression coefficients.
• It is assumed that each observation, Y, can be described by the model

Y = β0 + β1x + ϵ

• where ϵ is a random error with mean zero and variance σ². The random errors corresponding to different observations are also assumed to be uncorrelated random variables.
Least Squares Estimation
• Suppose that we have n pairs of observations (x1, y1), (x2, y2), . . . , (xn, yn).
• A typical scatter plot of observed data shows the points together with a candidate for the estimated regression line.
• The estimates of β0 and β1 should result in a line that is (in some sense) a "best fit" to the data.

Figure: Deviations of the data from the estimated regression model.
Least Squares Estimation
• This approach to estimating the regression coefficients is called the method of least squares.
• For the n observations in the sample, the model becomes

Yi = β0 + β1xi + ϵi,   i = 1, 2, . . . , n

• and the sum of the squares of the deviations of the observations from the true regression line is

L = ∑ ϵi² = ∑ (yi − β0 − β1xi)²   (sum over i = 1, . . . , n)
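The minimization of this criterion can be checked numerically; a small self-contained sketch with made-up data, using the standard closed-form least squares solution:

```python
# Least squares for simple linear regression on made-up data:
# minimize L(b0, b1) = sum_i (y_i - b0 - b1*x_i)^2.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [11.0, 20.0, 29.0, 42.0, 49.0]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx          # standard closed-form slope estimate
b0 = ybar - b1 * xbar   # intercept estimate

def sse(c0, c1):
    """Sum of squared deviations of the data from the line y = c0 + c1*x."""
    return sum((y - c0 - c1 * x) ** 2 for x, y in zip(xs, ys))

# The closed-form estimates give a smaller criterion than nearby lines:
assert sse(b0, b1) <= sse(b0 + 0.1, b1)
assert sse(b0, b1) <= sse(b0, b1 + 0.1)
print(round(b1, 4), round(b0, 4))  # 19.6 0.8
```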
Least Squares Estimation
• The fitted or estimated regression line is therefore

ŷ = β̂0 + β̂1x

• Note that each pair of observations satisfies the relationship

yi = ŷi + ei,   i = 1, 2, . . . , n

• where ei = yi − ŷi is called the residual.
• The residual describes the error in the fit of the model to the ith observation yi.
• The residuals are used to provide information about the adequacy of the fitted model.
Practical interpretation:
• Using the linear regression model, we would predict that the salt concentration in surface streams, where the percentage of paved roads in the watershed is 1.25%, is ŷ = 2.6765 + 17.5467(1.25) = 24.61 milligrams/liter.
• The predicted value can be interpreted either as an estimate of the mean salt concentration when roadway area x = 1.25% or as an estimate of a new observation when x = 1.25%.
• These estimates are, of course, subject to error; that is, it is unlikely that either the true mean salt concentration or a future observation would be exactly 24.61 milligrams/liter when the roadway area is 1.25%.
• Successively substitute each value of xi (i = 1, 2, . . . , n) in the sample into the fitted regression model, calculate the fitted values ŷi = β̂0 + β̂1xi, and then find the residuals as ei = yi − ŷi, i = 1, 2, . . . , n.
• For example, the tenth observation has x10 = 0.60 and y10 = 9.3, and the regression model predicts that ŷ10 = 13.205.
• The corresponding residual is e10 = 9.3 − 13.205 = −3.905.
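The arithmetic in the example above can be verified directly; a quick check using the fitted coefficients quoted earlier in these notes:

```python
# Verify the fitted value and residual for the tenth observation of the
# salt concentration data, with the fitted line yhat = 2.6765 + 17.5467*x.
b0, b1 = 2.6765, 17.5467
x10, y10 = 0.60, 9.3

yhat10 = b0 + b1 * x10  # fitted value
e10 = y10 - yhat10      # residual

print(round(yhat10, 3))  # 13.205
print(round(e10, 3))     # -3.905
```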
Error Sum of Squares
• The residuals from the fitted regression model are used to estimate the variance of the model errors σ². Recall that σ² determines the amount of variability in the observations on the response y at a given value of the regressor variable x.
• The sum of the squared residuals is used to obtain the estimate of σ².
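The estimator itself is not displayed on this slide; the standard result for simple linear regression (dividing by n − 2 because two parameters, β0 and β1, are estimated) is:

```latex
\hat{\sigma}^2 = \frac{SS_E}{n-2},
\qquad
SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```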
Coefficient Estimators, Simple Linear Regression
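The formulas this slide refers to appear to have been lost in extraction; the standard least squares estimators for the simple linear regression coefficients are:

```latex
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}
             = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                    {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```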
Standard Error of the Slope and Intercept, Simple Linear Regression
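This slide's display also appears to be missing; the standard estimated standard errors, using the error variance estimate σ̂², are:

```latex
\widehat{se}\!\left(\hat{\beta}_1\right) = \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}},
\qquad
\widehat{se}\!\left(\hat{\beta}_0\right) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}
```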
Regression and Analysis of Variance
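The content of this slide was not recoverable; the standard analysis-of-variance identity it presumably presented partitions the total variability in y into a part explained by the regression and a residual part:

```latex
SS_T = SS_R + SS_E:\quad
\sum_{i=1}^{n} (y_i - \bar{y})^2
  = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
  + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

Significance of regression is then tested with F0 = MSR/MSE = (SSR/1)/(SSE/(n − 2)), which has an F distribution with 1 and n − 2 degrees of freedom when β1 = 0.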
