
Introduction to Statistical Methods

BITS Pilani ISM Team



DSECL ZC413

Lecture No. 12
Assessing Model Adequacy
• A plot of the observed pairs (xi, yi) is a necessary first step in deciding on the form of a mathematical
relationship between x and y.

• It is possible to fit many functions other than a linear one (y = b0 + b1x) to the data, using either the
principle of least squares or another fitting method.

• Once a function of the chosen form has been fitted, it is important to check the fit of the model to see
whether it is in fact appropriate.
• One way to study the fit is to superimpose a graph of the best-fit function on the scatter plot of the data (a plotting sketch follows this list).

• However, any tilt or curvature of the best-fit function may obscure some aspects of the fit that should be
investigated.

• Furthermore, the scale on the vertical axis may make it difficult to assess the extent to which observed
values deviate from the best-fit function.
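As a concrete illustration of this first step, here is a minimal Python sketch that superimposes the least squares line on a scatter plot; the (x, y) arrays are hypothetical placeholders, not data from the course.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical observed pairs
y = np.array([1.1, 2.3, 2.8, 4.2, 4.8])

b1, b0 = np.polyfit(x, y, deg=1)           # least squares fit of y = b0 + b1*x
grid = np.linspace(x.min(), x.max(), 100)

plt.scatter(x, y, label="observed pairs")
plt.plot(grid, b0 + b1 * grid, label="best-fit line")
plt.xlabel("x"); plt.ylabel("y"); plt.legend()
plt.show()
```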



Residuals and Standardized Residuals
• A more effective approach to assessing model adequacy is to compute the fitted (predicted) values ŷi and the residuals ei = yi − ŷi, and then plot various functions of these computed quantities.

• We then examine the plots either to confirm our choice of model or for indications that the model is not appropriate. Suppose the simple linear regression model is correct, and let ŷ = β̂0 + β̂1x be the equation of the estimated regression line. Then the ith residual is ei = yi − (β̂0 + β̂1xi).

• To derive properties of the residuals, let Yi − Ŷi represent the ith residual as a random variable (rv) before observations are actually made. Then E(Yi − Ŷi) = E(Yi) − E(Ŷi) = 0.

• Because Ŷi is a linear function of the Yj's, so is Yi − Ŷi (the coefficients depend on the xj's). Thus the normality of the Yj's implies that each residual is normally distributed.
Residuals and Standardized Residuals
It can also be shown that

V(Yi − Ŷi) = σ²·[1 − 1/n − (xi − x̄)²/Sxx]    (13.2)

Replacing σ² by s² and taking the square root of Equation (13.2) gives the estimated standard deviation of a residual.

Let's now standardize each residual by subtracting the mean value (zero) and then dividing by the estimated standard deviation. The standardized residuals are given by

ei* = (yi − ŷi) / (s·√(1 − 1/n − (xi − x̄)²/Sxx)),   i = 1, …, n

If, for example, a particular standardized residual is 1.5, then the residual itself is 1.5 (estimated) standard deviations larger than what would be expected from fitting the correct model.
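A short numpy sketch of these computations under the simple linear regression setup above; the helper name standardized_residuals is our own.

```python
import numpy as np

def standardized_residuals(x, y):
    """Fit y = b0 + b1*x by least squares; return fitted values,
    residuals e_i, and standardized residuals e_i* per Eq. (13.2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    fitted = b0 + b1 * x
    e = y - fitted                         # ordinary residuals
    s2 = np.sum(e ** 2) / (n - 2)          # s^2 = SSE / (n - 2)
    # estimated variance of the ith residual: s^2*(1 - 1/n - (xi-xbar)^2/Sxx)
    v = 1.0 - 1.0 / n - (x - xbar) ** 2 / Sxx
    e_star = e / np.sqrt(s2 * v)           # standardized residuals
    return fitted, e, e_star
```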



Example 1
We previously presented data on x = burner area liberation rate and y = NOx emissions.
Here we reproduce the data and give the fitted values, residuals, and standardized residuals. The
estimated regression line is ŷ = −45.55 + 1.71x, and r² = .961. The standardized residuals are not a
constant multiple of the residuals because the residual variances differ somewhat from one another.



Diagnostic Plots
The basic plots that many statisticians recommend for an assessment of model validity and usefulness are
the following:
1. ei (or ei) on the vertical axis versus xi on the horizontal axis
2. ei (or ei) on the vertical axis versus on the horizontal axis
3. on the vertical axis versus yi on the horizontal axis
4. A normal probability plot of the standardized residuals

• Plots 1 and 2 are called residual plots (against the independent variable and fitted values,
respectively), whereas Plot 3 is fitted against observed values.
• If Plot 3 yields points close to the 45° line [slope +1 through (0, 0)], then the estimated regression
function gives accurate predictions of the values actually observed. Thus Plot 3 provides a visual
assessment of model effectiveness in making predictions. Provided that the model is correct, neither
residual plot should exhibit distinct patterns.
• Plot 4 allows the analyst to assess the plausibility of the assumption that ε has a normal distribution.
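The four plots can be produced in a few lines of matplotlib/scipy; this sketch assumes the arrays returned by the residual computation shown earlier.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def diagnostic_plots(x, y, fitted, e_star):
    """Draw the four model-adequacy plots listed above."""
    fig, ax = plt.subplots(2, 2, figsize=(9, 7))
    ax[0, 0].scatter(x, e_star); ax[0, 0].axhline(0, ls="--")
    ax[0, 0].set(xlabel="x", ylabel="e*", title="1. e* versus x")
    ax[0, 1].scatter(fitted, e_star); ax[0, 1].axhline(0, ls="--")
    ax[0, 1].set(xlabel="fitted", ylabel="e*", title="2. e* versus fitted")
    ax[1, 0].scatter(y, fitted)
    lims = [min(y.min(), fitted.min()), max(y.max(), fitted.max())]
    ax[1, 0].plot(lims, lims)              # the 45-degree reference line
    ax[1, 0].set(xlabel="y", ylabel="fitted", title="3. fitted versus y")
    stats.probplot(e_star, dist="norm", plot=ax[1, 1])  # 4. normal prob. plot
    fig.tight_layout()
    plt.show()
```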



Example 2

The plot of ŷ versus y confirms the impression given by r² that x is effective in
predicting y and also indicates that there is no observed y for which the
predicted value is terribly far off the mark.

Both residual plots show no unusual pattern or discrepant values. There is one
standardized residual slightly outside the interval (−2, 2), but this is not
surprising in a sample of size 14.

The normal probability plot of the standardized residuals is reasonably straight.

In summary, the plots leave us with no qualms about either the appropriateness
of a simple linear relationship or the fit to the given data.

[Plots for the data from Example 1]
Difficulties
Quite frequently the plots will suggest one or more of the following difficulties:
1. A nonlinear probabilistic relationship between x and y is appropriate.
2. The variance of ε (and of Y) is not a constant σ² but depends on x.
3. The selected model fits the data well except for a very few discrepant or
outlying data values, which may have greatly influenced the choice of the
best-fit function.
4. The error term ε does not have a normal distribution.
5. When the subscript i indicates the time order of the observations, the εi's exhibit
dependence over time.
6. One or more relevant independent variables have been omitted from the model.



Regression with Transformed Variables
A function relating y to x is intrinsically linear if, by means of a transformation on x and/or y, the function
can be expressed as y′ = β0 + β1x′, where x′ = the transformed independent variable and y′ = the
transformed dependent variable.

Four of the most useful intrinsically linear functions are given in Table 13.1.

Table 13.1: Useful Intrinsically Linear Functions

Function | Transformation(s) to linearize | Linear form
a. Exponential: y = αe^(βx) | y′ = ln(y) | y′ = ln(α) + βx
b. Power: y = αx^β | y′ = log(y), x′ = log(x) | y′ = log(α) + βx′
c. y = α + β·log(x) | x′ = log(x) | y = α + βx′
d. Reciprocal: y = α + β·(1/x) | x′ = 1/x | y = α + βx′

In each case, the appropriate transformation is either a log transformation—either base 10 or natural
logarithm (base e)—or a reciprocal transformation.



Regression with Transformed Variables
Representative graphs of the four functions appear in Figure 13.3.

[Figure 13.3: Graphs of the intrinsically linear functions given in Table 13.1]



Regression with Transformed Variables
A probabilistic model relating Y to x is intrinsically linear if, by means of a transformation on Y and/or x, it can be
reduced to a linear probabilistic model

Y′ = β0 + β1x′ + ε′

The intrinsically linear probabilistic models that correspond to the four functions of Table 13.1 are as follows:



Regression with Transformed Variables
a. Y = ex  , a multiplicative exponential model, from which ln(Y) = Y = 0 + 1x +  with
x = x, 0 = ln(), 1 = , and  = ln().

b. Y = x  , a multiplicative power model, so that log(Y) = Y = 0 + 1x +  with x = log(x),


0 = log(x) + , and  = log().

c. Y =  +  log(x) + , so that x = log(x) immediately linearizes the model.

d. Y =  +   1/x + , so that x = 1/x yields a linear model.

The additive exponential and power models, Y = ex +  and Y = x + , are not intrinsically
linear.
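A minimal numpy sketch of fitting models (a) and (b) by least squares on the transformed scale and back-transforming the intercept; the function names are our own, and natural logs are used throughout.

```python
import numpy as np

def fit_exponential(x, y):
    """Model (a), Y = alpha*exp(beta*x)*eps: regress ln(y) on x."""
    b1, b0 = np.polyfit(x, np.log(y), deg=1)   # slope, intercept
    return np.exp(b0), b1                      # alpha_hat, beta_hat

def fit_power(x, y):
    """Model (b), Y = alpha*x**beta*eps: regress ln(y) on ln(x)."""
    b1, b0 = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(b0), b1                      # alpha_hat, beta_hat
```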
Regression with Transformed Variables
Notice that both (a) and (b) require a transformation on Y and, as a result, a
transformation on the error variable ε.

In fact, if ε has a lognormal distribution with E(ε) = e^(σ²/2) and with V(ε)
independent of x, then the transformed models for
both (a) and (b) will satisfy all the assumptions regarding the linear
probabilistic model; this in turn implies that all inferences for the parameters
of the transformed model based on these assumptions will be valid.

If σ² is small, μ_Y·x ≈ αe^(βx) in (a) or αx^β in (b).



Regression with Transformed Variables
The major advantage of an intrinsically linear model is that the parameters β0 and β1 of the transformed model can be
immediately estimated using the principle of least squares, simply by substituting x′ and y′ into the estimating
formulas:

β̂1 = Σ(x′i − x̄′)(y′i − ȳ′) / Σ(x′i − x̄′)²,   β̂0 = ȳ′ − β̂1x̄′    (13.5)

Parameters of the original nonlinear model can then be estimated by transforming back β̂0 and/or β̂1 if necessary.

Once a prediction interval for y′ when x′ = x*′ has been calculated, reversing the transformation gives a PI for y itself.

In cases (a) and (b), when σ² is small, an approximate CI for μ_Y·x* results from taking antilogs of the limits in the
CI for β0 + β1x*′. (Strictly speaking, taking antilogs gives a CI for the median of the Y distribution, i.e., for μ̃_Y·x*.
Because the lognormal distribution is positively skewed, μ > μ̃; the two are approximately equal if σ² is close to
0.)
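For instance, if the transformation was y′ = ln(y), reversing it just means exponentiating both interval limits; the numbers here are placeholders.

```python
import numpy as np

lo_prime, hi_prime = 2.1, 2.9           # hypothetical PI limits for y' = ln(y)
lo_y, hi_y = np.exp(lo_prime), np.exp(hi_prime)   # PI limits for y itself
# For a CI in cases (a) and (b), the antilogged interval is, strictly
# speaking, an interval for the median of the Y distribution, as noted above.
```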



Logistic Regression
Consider a dichotomous response variable with possible values 1 and 0 corresponding to success and
failure.

Let p = P(S) = P(Y = 1). Frequently, the value of p will depend on the value of some quantitative variable x.
For example, the probability that a car needs warranty service of a certain kind might well depend on the
car’s mileage, or the probability of avoiding an infection of a certain type might depend on the dosage
in an inoculation.

Instead of using just the symbol p for the success probability, we now use p(x) to emphasize the
dependence of this probability on the value of x.

The simple linear regression equation Y = β0 + β1x + ε is no longer appropriate, for taking the mean value
on each side of the equation gives p(x) = β0 + β1x.


Logistic Regression
• Whereas p(x) is a probability and therefore must be between 0 and 1, β0 + β1x need
not be in this range.
• Instead of letting the mean value of Y be a linear function of x, we now consider a
model in which some function of the mean value of Y is a linear function of x.
• In other words, we allow p(x) to be a function of β0 + β1x rather than β0 + β1x itself. A
function that has been found quite useful in many applications is the logit function

p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

• For β1 > 0, as x increases, the probability of success increases. For β1 negative, the success
probability would be a decreasing function of x.

Figure 13.8 shows a graph of p(x) for particular values of β0 and β1 with β1 > 0.
[Figure 13.8: A graph of a logit function]



Logistic Regression
Logistic regression means assuming that p(x) is related to x by the logit function. Straightforward
algebra shows that

p(x) / (1 − p(x)) = e^(β0 + β1x)

• The expression on the left-hand side is called the odds.

• If, for example, p(60)/(1 − p(60)) = 3, then when x = 60 a success is three times as likely as a failure.

• Taking logarithms, ln[p(x)/(1 − p(x))] = β0 + β1x, so the logarithm of the odds is a linear function of the
predictor. In particular, the slope parameter β1 is the change in the log odds associated with a one-unit increase in x.

• This implies that the odds itself changes by the multiplicative factor e^(β1) when x increases by 1 unit.
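These relationships are easy to verify numerically; a small sketch with placeholder parameter values (the function names below are our own):

```python
import numpy as np

def p_logit(x, b0, b1):
    """Success probability under the logit model."""
    return np.exp(b0 + b1 * x) / (1.0 + np.exp(b0 + b1 * x))

def odds(x, b0, b1):
    """p(x) / (1 - p(x)), which algebra reduces to exp(b0 + b1*x)."""
    p = p_logit(x, b0, b1)
    return p / (1.0 - p)

b0, b1 = -2.0, 0.05                     # hypothetical parameter values
x = 60.0
ratio = odds(x + 1, b0, b1) / odds(x, b0, b1)
print(np.isclose(ratio, np.exp(b1)))    # True: odds change by factor e^(b1)
```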



Example 6
• Fitting the logistic regression to sample data requires that the parameters β0 and β1 be estimated. This
is usually done using the maximum likelihood technique (a fitting sketch appears after the display below).

• The details are quite involved, but fortunately the most popular statistical computer packages will do
this on request and provide quantitative and pictorial indications of how well the model fits.
• Here is data, in the form of a comparative stem-and-leaf display, on launch temperature and the
incidence of failure of O-rings in 23 space shuttle launches prior to the Challenger disaster of 1986 (Y =
yes, failed; N = no, did not fail).

• Observations on the left side of the display tend to be smaller than those on the right side.

[Comparative stem-and-leaf display of launch temperatures. Stem: Tens digit; Leaf: Ones digit]
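As a sketch of what such a package does, logistic regression can be fit by maximum likelihood with statsmodels; the temperature and failure arrays below are hypothetical placeholders, not the 23 actual observations.

```python
import numpy as np
import statsmodels.api as sm

temp = np.array([53.0, 57.0, 63.0, 70.0, 75.0])   # hypothetical temperatures
fail = np.array([1, 1, 0, 0, 0])                  # 1 = O-ring failure

X = sm.add_constant(temp)        # design matrix: intercept and temperature
res = sm.Logit(fail, X).fit()    # maximum likelihood estimation
print(res.params)                # beta0_hat, beta1_hat
print(res.bse)                   # their estimated standard deviations
```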



Example 6 cont’d

• Figure 13.9 shows Minitab output for a logistic regression analysis and a graph of the estimated logit function from the R
software.
• We have chosen to let p denote the probability of failure. The graph of p̂(x) decreases as temperature increases because
failures tended to occur at lower temperatures than did successes.

• The estimate of β1 and its estimated standard deviation are β̂1 = −.232 and s_β̂1 = .1082, respectively.

• We assume that the sample size n is large enough here so that β̂1 has approximately a normal distribution.

[Figure 13.9: (a) Logistic regression output from Minitab; (b) graph of estimated logistic function and classification probabilities from R]
Example 6 cont’d

• If β1 = 0 (i.e., temperature does not affect the likelihood of O-ring failure), the
test statistic z = β̂1/s_β̂1 has approximately a standard normal distribution.

• The reported value of this ratio is z = −2.14, with a corresponding two-tailed P-value
of .032 (some packages report a chi-square value, which is just z², with
the same P-value).

• At significance level .05, we reject the null hypothesis of no temperature
effect.

• The estimated odds of failure for any particular temperature value x is
p̂(x)/(1 − p̂(x)) = e^(β̂0 + β̂1x)


Example 6 cont’d

• This implies that the odds ratio, the odds of failure at a temperature of x + 1
divided by the odds of failure at a temperature of x, is e^(β̂1) = e^(−.232) = .79.

• The interpretation is that for each additional degree of temperature, we
estimate that the odds of failure will decrease by a factor of .79 (i.e., by 21%). A 95%
CI for the true odds ratio also appears on the output (see the sketch after this list).

• In addition, Minitab provides three different ways of assessing model lack of
fit: the Pearson, deviance, and Hosmer-Lemeshow tests. Large P-values are
consistent with a good model.
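A quick check of these quantities from the reported estimates β̂1 = −.232 and s_β̂1 = .1082; the 95% CI shown is one standard large-sample construction (exponentiating the normal-theory CI for β1), which may differ slightly from the package's own interval.

```python
import numpy as np

b1_hat, se_b1 = -0.232, 0.1082
print(np.exp(b1_hat))                    # odds ratio, about .79
print(b1_hat / se_b1)                    # z statistic, about -2.14
lo, hi = np.exp(b1_hat + np.array([-1.96, 1.96]) * se_b1)
print(lo, hi)                            # approx. 95% CI for e^(beta1)
```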
Polynomial Regression
The kth-degree polynomial regression model equation is

Y = β0 + β1x + β2x² + … + βkx^k + ε    (13.6)

where ε is a normally distributed random variable with

μ_ε = 0 and σ²_ε = σ²    (13.7)

The special cases k = 2 and k = 3 are the quadratic and cubic regression models, respectively.
From (13.6) and (13.7), it follows immediately that

μ_Y·x = β0 + β1x + … + βkx^k,   σ²_Y·x = σ²

In words, the expected value of Y is a kth-degree polynomial function of x, whereas the variance of Y, which controls the
spread of observed values about the regression function, is the same for each value of x.

The observed pairs (x1, y1), . . . ,(xn, yn) are assumed to have been generated independently from the model
(13.6).



Estimating Parameters
To estimate the βi's, consider a trial regression function y = b0 + b1x + … + bkx^k. Then the goodness
of fit of this function to the observed data can be assessed by computing the sum of squared
deviations

f(b0, b1, …, bk) = Σ [yi − (b0 + b1xi + b2xi² + … + bkxi^k)]²    (13.9)

According to the principle of least squares, the estimates β̂0, β̂1, …, β̂k are those
values of b0, b1, …, bk that minimize Expression (13.9).

It should be noted that when x1, x2, …, xn are all different, there is a polynomial of degree n − 1 that
fits the data perfectly, so that the minimizing value of (13.9) is 0 when k = n − 1.

However, in virtually all applications, the polynomial model (13.6) with large k is quite unrealistic.
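A minimal least squares fit of a kth-degree polynomial in numpy, with hypothetical data; np.polyfit returns coefficients from the highest degree down, so they are reversed to obtain b0, …, bk.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical data
y = np.array([1.8, 1.3, 2.5, 5.1, 9.9, 17.0])

k = 2
coef_desc = np.polyfit(x, y, deg=k)             # highest degree first
b = coef_desc[::-1]                             # b0, b1, ..., bk
sse = np.sum((y - np.polyval(coef_desc, x)) ** 2)  # (13.9) at its minimum
```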



Estimating Parameters
To find the minimizing values in (13.9), take the k + 1 partial derivatives
∂f/∂b0, ∂f/∂b1, …, ∂f/∂bk and equate them to 0. This gives a system of normal
equations for the estimates.

Because the trial function b0 + b1x + … + bkx^k is linear in b0, …, bk (though not in x),
the k + 1 normal equations are linear in these unknowns:

b0·n + b1·Σxi + … + bk·Σxi^k = Σyi
b0·Σxi + b1·Σxi² + … + bk·Σxi^(k+1) = Σxi·yi
⋮
b0·Σxi^k + b1·Σxi^(k+1) + … + bk·Σxi^(2k) = Σxi^k·yi
(13.10)
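In matrix form the system (13.10) is (VᵀV)b = Vᵀy, where V is the Vandermonde matrix whose (i, j) entry is xi^j. A direct numpy solution (equivalent to np.polyfit, though np.polyfit is numerically preferable):

```python
import numpy as np

def poly_normal_equations(x, y, k):
    """Solve the normal equations (13.10) for b0, ..., bk."""
    V = np.vander(np.asarray(x, float), N=k + 1, increasing=True)
    return np.linalg.solve(V.T @ V, V.T @ np.asarray(y, float))
```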



σ² and R²

To make further inferences, the error variance σ² must be estimated.

With ŷi = β̂0 + β̂1xi + … + β̂kxi^k, the ith residual is yi − ŷi, and the sum of
squared residuals (error sum of squares) is SSE = Σ(yi − ŷi)².

The estimate of σ² is then

σ̂² = s² = SSE / [n − (k + 1)]    (13.11)

where the denominator n − (k + 1) is used because k + 1 df are lost in
estimating β0, β1, …, βk.



σ² and R²

• If we again let SST = Σ(yi − ȳ)², then SSE/SST is the proportion of the total variation in the observed
yi's that is not explained by the polynomial model.

• The quantity 1 − SSE/SST, the proportion of variation explained by the model, is called the coefficient
of multiple determination and is denoted by R².

• Consider fitting a cubic model to the data in Example 7. Because this model includes the quadratic as a
special case, the fit will be at least as good as the fit to a quadratic. More generally, with SSEk = the
error sum of squares from a kth-degree polynomial, SSEk′ ≤ SSEk and R²k′ ≥ R²k whenever k′ > k.

• Because the objective of regression analysis is to find a model that is both simple (relatively few
parameters) and provides a good fit to the data, a higher-degree polynomial may not specify a better
model than a lower-degree model despite its higher R² value.
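A sketch computing s² from (13.11) and R² for a degree-k fit, which also lets one watch R² rise (and SSE fall) as k grows; the function name is our own.

```python
import numpy as np

def s2_and_R2(x, y, k):
    """Return (s^2, R^2) for the least squares polynomial of degree k."""
    n = len(y)
    fitted = np.polyval(np.polyfit(x, y, deg=k), x)
    sse = np.sum((y - fitted) ** 2)          # error sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    s2 = sse / (n - (k + 1))                 # k+1 df lost to estimation
    return s2, 1.0 - sse / sst
```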



Statistical Intervals and Test Procedures
A 100(1 − α)% CI for βi, the coefficient of x^i in the polynomial regression function, is

β̂i ± t_{α/2, n−(k+1)} · s_β̂i

A test of H0: βi = βi0 is based on the t statistic value

t = (β̂i − βi0) / s_β̂i

The test is based on n − (k + 1) df and is upper-, lower-, or two-tailed according to
whether the inequality in Ha is >, <, or ≠.
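A sketch of both procedures, computing s_β̂i from the diagonal of s²(VᵀV)⁻¹ with V the Vandermonde design matrix; the function name is our own.

```python
import numpy as np
from scipy import stats

def beta_inference(x, y, k, i, alpha=0.05, beta_i0=0.0):
    """CI for beta_i and t statistic for H0: beta_i = beta_i0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    V = np.vander(x, N=k + 1, increasing=True)
    b = np.linalg.solve(V.T @ V, V.T @ y)
    s2 = np.sum((y - V @ b) ** 2) / (n - (k + 1))
    se = np.sqrt(s2 * np.linalg.inv(V.T @ V)[i, i])     # s_{beta_i hat}
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - (k + 1))
    ci = (b[i] - tcrit * se, b[i] + tcrit * se)
    t_stat = (b[i] - beta_i0) / se
    return ci, t_stat
```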



Statistical Intervals and Test Procedures
A point estimate of μ_Y·x*, that is, of β0 + β1x* + … + βk(x*)^k, is
μ̂_Y·x* = β̂0 + β̂1x* + … + β̂k(x*)^k.

The estimated standard deviation of the corresponding estimator is rather
complicated. Many computer packages will give this estimated standard
deviation for any x value upon request. This, along with an appropriate
standardized t variable, can be used to justify the procedures on the next slide.



Statistical Intervals and Test Procedures
Let x denote a specified value of x. A 100(1 – )% CI for Y  x is

With denoting the


calculated value of for the given data, and denoting the estimated standard deviation of the statistic
the formula for the CI is much like the one in the case of simple linear regression:

A 100(1 – )% PI for a future y value to be observed when


x = x is
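A sketch of both intervals via the usual linear-model algebra, in which s_ŷ = s·√(v(VᵀV)⁻¹vᵀ) with v = (1, x*, …, x*^k); the names are our own.

```python
import numpy as np
from scipy import stats

def ci_pi_at_xstar(x, y, k, xstar, alpha=0.05):
    """CI for the mean response and PI for a future y at x = xstar."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    V = np.vander(x, N=k + 1, increasing=True)
    b = np.linalg.solve(V.T @ V, V.T @ y)
    s2 = np.sum((y - V @ b) ** 2) / (n - (k + 1))
    v = xstar ** np.arange(k + 1)                 # (1, x*, ..., x*^k)
    yhat = v @ b
    s_yhat = np.sqrt(s2 * (v @ np.linalg.inv(V.T @ V) @ v))
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - (k + 1))
    ci = (yhat - tcrit * s_yhat, yhat + tcrit * s_yhat)
    half = tcrit * np.sqrt(s2 + s_yhat ** 2)      # PI adds the error variance
    return ci, (yhat - half, yhat + half)
```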



Multiple Regression Analysis
The general additive multiple regression model equation is

Y = β0 + β1x1 + β2x2 + … + βkxk + ε    (13.15)

where E(ε) = 0 and V(ε) = σ². In addition, for purposes of testing hypotheses and calculating CIs or PIs,
it is assumed that ε is normally distributed.

Let x*1, …, x*k be particular values of x1, …, xk. Then (13.15) implies that

μ_Y·x*1,…,x*k = β0 + β1x*1 + … + βkx*k

Thus, just as β0 + β1x describes the mean Y value as a function of x in simple linear regression, the true
(or population) regression function β0 + β1x1 + … + βkxk gives the expected value of Y as a function of x1, …, xk.
The βi's are the true (or population) regression coefficients. The regression coefficient β1 is interpreted as the
expected change in Y associated with a 1-unit increase in x1 while x2, …, xk are held fixed. Analogous
interpretations hold for β2, …, βk.
Estimating Parameters
• With k predictors, the data consist of n (k + 1)-tuples (x11, x21, …, xk1, y1),
(x12, x22, …, xk2, y2), …, (x1n, x2n, …, xkn, yn), where xij is the value of the ith
predictor xi associated with the observed value yj.

• The observations are assumed to have been obtained independently of one
another according to the model (13.15). To estimate the parameters β0, β1, …,
βk using the principle of least squares, form the sum of squared deviations of
the observed yj's from a trial function y = b0 + b1x1 + … + bkxk:

f(b0, b1, …, bk) = Σ [yj − (b0 + b1x1j + b2x2j + … + bkxkj)]²

• The least squares estimates are those values of the bi's that minimize f(b0, …, bk).
Estimating Parameters
Taking the partial derivative of f with respect to each bi (i = 0, 1, …, k) and
equating all partials to zero yields the following system of normal equations:

b0·n + b1·Σx1j + b2·Σx2j + … + bk·Σxkj = Σyj

and, for r = 1, …, k,

b0·Σxrj + b1·Σxrj·x1j + … + bk·Σxrj·xkj = Σxrj·yj    (13.18)

These equations are linear in the unknowns b0, b1, …, bk. Solving (13.18) yields the least
squares estimates β̂0, β̂1, …, β̂k. (A numerical sketch follows.)
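In practice the system is solved numerically; a minimal numpy sketch, where X holds one row per observation and one column per predictor, and the function name is our own.

```python
import numpy as np

def multiple_ls(X, y):
    """Least squares estimates b0, b1, ..., bk for the model (13.15)."""
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, float)])
    # lstsq solves the normal equations (13.18) in a numerically stable way
    b, *_ = np.linalg.lstsq(X1, np.asarray(y, float), rcond=None)
    return b
```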
Example 12
The article “How to Optimize and Control the Wire Bonding Process: Part II” (Solid State Technology, Jan.
1991: 67–72) described an experiment carried out to assess the impact of the variables x1 = force (gm),
x2 = power (mW), x3 = temperature (°C), and x4 = time (msec) on y = ball bond shear strength (gm).
The following data was generated to be consistent with the information given in the article:



Example 12
• A statistical computer package gave the following least squares estimates:

β̂0 = −37.48,  β̂1 = .2117,  β̂2 = .4983,  β̂3 = .1297,  β̂4 = .2583

• Thus we estimate that .1297 gm is the average change in strength
associated with a 1-degree increase in temperature when the other three
predictors are held fixed; the other estimated coefficients are interpreted
in a similar manner.



Example 12

• The estimated regression equation is

ŷ = −37.48 + .2117x1 + .4983x2 + .1297x3 + .2583x4

• A point prediction of strength resulting from a force of 35 gm, power of 75 mW,
temperature of 200°C, and time of 20 msec is

ŷ = −37.48 + .2117(35) + .4983(75) + .1297(200) + .2583(20) = 38.41 gm

• This is also a point estimate of the mean value of strength for the specified values of
force, power, temperature, and time.
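The arithmetic is easy to check; using the estimates reported above:

```python
import numpy as np

b = np.array([-37.48, .2117, .4983, .1297, .2583])   # b0, b1, b2, b3, b4
xstar = np.array([1.0, 35.0, 75.0, 200.0, 20.0])     # 1, force, power, temp, time
print(b @ xstar)                                     # about 38.41 gm
```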



Inferences in Multiple Regression
• Before testing hypotheses, constructing CIs, and making predictions, the adequacy of the
model should be assessed and the impact of any unusual observations investigated.

• Because each β̂i is a linear function of the yi's, the standard deviation of each β̂i is the
product of σ and a function of the xij's.

• An estimate s_β̂i of this SD is obtained by substituting s for σ.
The function of the xij's is quite complicated, but all standard statistical software packages
compute and show the s_β̂i's.

• Inferences concerning a single βi are based on the standardized variable

T = (β̂i − βi) / S_β̂i

which has a t distribution with n − (k + 1) df.



Inferences in Multiple Regression
The point estimate of μ_Y·x*1,…,x*k, the expected value of Y when
x1 = x*1, …, xk = x*k, is

μ̂_Y·x*1,…,x*k = β̂0 + β̂1x*1 + … + β̂kx*k

The estimated standard deviation of the corresponding estimator is again a complicated
expression involving the sample xij's.

However, appropriate software will calculate it on request. Inferences about μ_Y·x*1,…,x*k are
based on standardizing its estimator to obtain a t variable having n − (k + 1) df.



Inferences in Multiple Regression
1. A 100(1 − α)% CI for βi, the coefficient of xi in the regression function, is

β̂i ± t_{α/2, n−(k+1)} · s_β̂i

2. A test for H0: βi = βi0 uses the t statistic value

t = (β̂i − βi0) / s_β̂i

based on n − (k + 1) df. The test is upper-, lower-, or two-tailed according to
whether Ha contains the inequality >, <, or ≠.



Inferences in Multiple Regression
3. A 100(1 − α)% CI for μ_Y·x*1,…,x*k is

μ̂_Y ± t_{α/2, n−(k+1)} · s_μ̂_Y

where μ̂_Y is the statistic β̂0 + β̂1x*1 + … + β̂kx*k and s_μ̂_Y is the calculated value of its
estimated standard deviation.

