Both least squares and logistic regression methods estimate parameters in the model
so that the fit of the model is optimized. Least squares regression minimizes the sum of
squared errors to obtain parameter estimates, whereas Minitab's logistic regression obtains
maximum likelihood estimates of the parameters.
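The least-squares criterion described above can be sketched in a few lines of Python. This is a minimal illustration on made-up data, not Minitab's implementation: the closed-form normal equations for a single predictor, with a check that any other line has a larger sum of squared errors.

```python
# Sketch (not Minitab's implementation): simple least squares via the
# closed-form normal equations, on made-up data.

def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def sse(x, y, b0, b1):
    """Sum of squared errors for the line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]          # roughly y = 2x, invented data

b0, b1 = ols_fit(x, y)

# Perturbing either coefficient can only increase the sum of squared errors:
assert sse(x, y, b0, b1) <= sse(x, y, b0 + 0.1, b1)
assert sse(x, y, b0, b1) <= sse(x, y, b0, b1 - 0.1)
```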
Use...            To...                                            Response type  Estimation method
Regression        perform simple, multiple, or polynomial          continuous     least squares
                  regression
Stepwise          perform stepwise, forward selection, or          continuous     least squares
                  backward elimination to identify a useful
                  subset of predictors
Best Subsets      identify subsets of the predictors based on      continuous     least squares
                  the maximum R² criterion
Fitted Line Plot  perform linear and polynomial regression         continuous     least squares
                  with a single predictor and plot a
                  regression line through the data
PLS               perform regression with ill-conditioned data     continuous     biased, non-least squares
Binary Logistic   perform logistic regression on a categorical     categorical    maximum likelihood
                  response with only two possible values,
                  such as presence or absence
Ordinal Logistic  perform logistic regression on a categorical     categorical    maximum likelihood
                  response with three or more possible values
                  that have a natural order, such as none,
                  mild, or severe
Nominal Logistic  perform logistic regression on a categorical     categorical    maximum likelihood
                  response with three or more possible values
                  that have no natural order, such as sweet,
                  salty, or sour
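To make the "maximum likelihood" column concrete, here is a minimal sketch (in Python, with invented data; Minitab's actual fitting is more general): for a binary response with no predictors, maximizing the Bernoulli log-likelihood recovers the sample proportion of events.

```python
# Sketch: maximum likelihood for the simplest binary-response model.
# With no predictors, the MLE of the event probability is the sample
# proportion. Data are made up for illustration.
import math

def log_likelihood(p, y):
    """Bernoulli log-likelihood of probability p for 0/1 outcomes y."""
    return sum(math.log(p) if yi else math.log(1 - p) for yi in y)

y = [1, 0, 1, 1, 0, 1, 1, 0]   # 5 events out of 8

# Grid search over candidate probabilities in (0, 1):
best = max((p / 1000 for p in range(1, 1000)),
           key=lambda p: log_likelihood(p, y))
print(best)   # the sample proportion, 5/8 = 0.625
```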
Regression Data
Enter response and predictor variables in numeric columns of equal length so that each
row in your worksheet contains measurements on one observation or subject.
Minitab omits all observations that contain missing values in the response or in the predictors,
from calculations of the regression equation and the ANOVA table items.
If you have variables which are either closely related or are nearly constant, Minitab
may judge them as ill-conditioned and omit some or all of them from the model. If your data are
ill-conditioned, you can modify how Minitab handles these variables by using the TOLERANCE
command in the Session window.
To Do a Linear Regression
REGRESSION GRAPHS
• Residuals for Plots: You can specify the type of residual to display on the residual
plots.
• Regular: Choose to plot the regular or raw residuals.
• Standardized: Choose to plot the standardized residuals.
• Deleted: Choose to plot the Studentized deleted residuals.
• Residual Plots
• Individual plots: Choose to display one or more plots.
• Histogram of residuals: Check to display a histogram of the residuals.
• Normal plot of residuals: Check to display a normal probability plot of the residuals.
• Residuals versus fits: Check to plot the residuals versus the fitted values.
• Residuals versus order: Check to plot the residuals versus the order of the data. The
row number for each data point is shown on the x-axis (for example, 1 2 3 4... n).
• Four in one: Choose to display a layout of a histogram of residuals, a normal plot of
residuals, a plot of residuals versus fits, and a plot of residuals versus order.
• Residuals versus the variables: Enter one or more columns containing the variables
against which you want to plot the residuals. Minitab displays a separate graph for each
column.
• Histogram of residuals. Long tails in the plot may indicate skewness in the data. If one
or two bars are far from the others, those points may be outliers. Because the appearance
of the histogram changes depending on the number of intervals used to group the data, use
the normal probability plot and goodness-of-fit tests to assess the normality of the
residuals.
• Normal plot of residuals. The points in this plot should generally form a straight line if
the residuals are normally distributed. If the points on the plot depart from a straight line,
the normality assumption may be invalid. If your data have fewer than 50 observations,
the plot may display curvature in the tails even if the residuals are normally distributed.
As the number of observations decreases, the probability plot may show substantial
variation and nonlinearity even if the residuals are normally distributed. Use the
probability plot and goodness-of-fit tests, such as the Anderson-Darling statistic, to
assess whether the residuals are normally distributed.
You can display the Anderson-Darling statistic (AD) on the plot, which can
indicate whether the data are normal.
o If the p-value is lower than the chosen α-level, the data do not follow a normal
distribution.
o To display the Anderson-Darling statistic, choose Tools > Options > Individual
Graphs > Residual Plots.
• Residuals versus fits. This plot should show a random pattern of residuals on both
sides of 0. If a point lies far from the majority of points, it may be an outlier. Also, there
should not be any recognizable patterns in the residual plot. The following may indicate
error that is not random:
o a series of increasing or decreasing points
o a predominance of positive residuals, or a predominance of negative residuals
o patterns, such as increasing residuals with increasing fits
Characteristics of an
adequate regression model      Check using...                   Possible solutions
Linear relationship between    Lack-of-fit tests                • Add higher-order term to model.
response and predictors.       Residuals vs variables plot      • Transform variables.
Residuals have constant        Residuals vs fits plot           • Transform variables.
variance.                                                       • Weighted least squares.
Residuals are independent of   Durbin-Watson statistic          • Add new predictor.
(not correlated with) one      Residuals vs order plot          • Use time series analysis.
another.                                                        • Add lag variable.
Residuals are normally         Histogram of residuals           • Transform variables.
distributed.                   Normal plot of residuals         • Check for outliers.
                               Residuals vs fits plot
                               Normality test
No unusual observations or     Residual plots                   • Transform variables.
outliers.                      Leverages                        • Remove outlying observation.
                               Cook's distance
                               DFITS
Data are not ill-conditioned.  Variance inflation factor (VIF)  • Remove predictor.
                               Correlation matrix of            • Partial least squares regression.
                               predictors                       • Transform variables.
Eliza P. Dela Cruz, Ph.D. Mathematics Education
Isabela State University, Echague, Isabela
If you determine that your model does not meet the criteria listed above, you should:
1. Check to see whether your data are entered correctly, especially observations identified
as unusual.
2. Try to determine the cause of the problem. You may want to see how sensitive your
model is to the issue. For example, if you have an outlier, run the regression without
that observation and see how the results differ.
3. Consider using one of the possible solutions listed above. See [9], [28] for more
information.
REGRESSION OPTIONS
ILL-CONDITIONED DATA
Ill-conditioned data relates to problems in the predictor variables, which can cause both
statistical and computational difficulties. There are two types of problems: multicollinearity and
a small coefficient of variation. The checks for ill-conditioned data in Minitab have been heavily
influenced by Velleman et al.
Multicollinearity
Multicollinearity means that some predictors are correlated with other predictors. If this
correlation is high, Minitab displays a warning message and continues computation. The
predicted values and residuals still are computed with high statistical and numerical accuracy,
but the standard errors of the coefficients will be large and their numerical accuracy may be
affected. If the correlation of a predictor with other predictors is very high, Minitab eliminates
the predictor from the model, and displays a message.
To identify predictors that are highly collinear, you can examine the correlation structure
of the predictor variables and regress each suspicious predictor on the other predictors. You
can also review the variance inflation factors (VIF), which measure how much the variance of
an estimated regression coefficient increases if your predictors are correlated. A VIF of 1
indicates no multicollinearity, whereas a VIF greater than 1 suggests that the predictors may
be correlated. Montgomery and Peck suggest that when the VIF is between 5 and 10, the
regression coefficients are poorly estimated.
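The VIF calculation described above can be sketched directly: regress one predictor on the others, take R², and compute 1 / (1 − R²). The sketch below uses invented data and a single "other" predictor, so the auxiliary regression is a simple one.

```python
# Sketch: VIF for one predictor, via the R^2 from regressing it on the
# other predictor. Illustrative made-up data, not Minitab output.

def r_squared(x, y):
    """R^2 of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy * sxy / (sxx * syy)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]     # nearly a copy of x1

vif = 1.0 / (1.0 - r_squared(x1, x2))
print(vif)   # large, flagging strong collinearity between x1 and x2
```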
Some possible solutions to the problem of multicollinearity are:
• Eliminate predictors from the model, especially if deleting them has little effect
on R2.
• Change predictors by taking linear combinations of them using partial least
squares regression or principal components analysis.
• If you are fitting polynomials, subtract a value near the mean of a predictor
before squaring it.
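The last suggestion, subtracting a value near the mean before squaring, can be demonstrated numerically. In this sketch (invented data), the raw square of a predictor is almost perfectly correlated with the predictor itself, while the centered square is not.

```python
# Sketch: centering a predictor before squaring it reduces its correlation
# with the squared term. Illustrative data: x = 1..10.

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

x = [float(i) for i in range(1, 11)]
mean_x = sum(x) / len(x)

raw_sq      = [xi ** 2 for xi in x]
centered_sq = [(xi - mean_x) ** 2 for xi in x]

c_raw  = corr(x, raw_sq)       # close to 1: x and x^2 nearly collinear
c_cent = corr(x, centered_sq)  # 0 here, since x is symmetric about its mean
print(c_raw, c_cent)
```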
Small coefficient of variation
If your data are extremely ill-conditioned, Minitab removes one of the problematic
columns from the model. You can use the TOLERANCE subcommand with REGRESS
to force Minitab to keep that column in the model. Lowering the tolerance can be
dangerous, possibly producing numerically inaccurate results. See Session command
help for more information.
DETECTING AUTOCORRELATION IN RESIDUALS
In linear regression, it is assumed that the residuals are independent of (not correlated
with) one another. If the independence assumption is violated, some model fitting results may
be questionable. For example, positive correlation between error terms tends to inflate the t-
values for coefficients, making predictors appear significant when they may not be.
Minitab provides two methods to determine if residuals are correlated:
• A graph of residuals versus data order (1 2 3 4... n) provides a means to
visually inspect residuals for autocorrelation. A positive correlation is indicated by a clustering
of residuals with the same sign. A negative correlation is indicated by rapid changes in the
signs of consecutive residuals.
• The Durbin-Watson statistic tests for the presence of autocorrelation in
regression residuals by determining whether or not the correlation between two adjacent error
terms is zero. The test is based upon an assumption that errors are generated by a first-order
autoregressive process. If there are missing observations, these are omitted from the
calculations, and only the nonmissing observations are used.
To reach a conclusion from the test, you will need to compare the displayed statistic with lower
and upper bounds in a table. If D > upper bound, no correlation exists; if D < lower bound,
positive correlation exists; if D is in between the two bounds, the test is inconclusive.
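The Durbin-Watson statistic itself is a short computation: D = Σ(eₜ − eₜ₋₁)² / Σeₜ². The sketch below uses invented residual sequences (the comparison against the tabled lower and upper bounds is not coded, since those depend on n, the number of predictors, and α).

```python
# Sketch: Durbin-Watson statistic from a list of residuals,
# D = sum (e_t - e_{t-1})^2 / sum e_t^2. Residual values are made up.

def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

# Runs of same-sign residuals suggest positive autocorrelation:
positively_correlated = [1.0, 0.8, 0.9, 0.7, -0.6, -0.8, -0.7, -0.9]
# Rapid sign changes suggest negative autocorrelation:
alternating           = [1.0, -1.0, 0.9, -0.9, 1.1, -1.1, 0.8, -0.8]

d_pos = durbin_watson(positively_correlated)
d_neg = durbin_watson(alternating)
print(d_pos)   # well below 2 (positive autocorrelation)
print(d_neg)   # well above 2 (negative autocorrelation)
```

Values of D near 2 are consistent with uncorrelated errors; the formal decision still requires the tabled bounds described above.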
REGRESSION RESULTS
REGRESSION STORAGE
IDENTIFYING OUTLIERS
Outliers are observations with unusually extreme response or predictor values. Minitab
provides several ways to identify outliers, including residual plots and three stored statistics:
leverages, Cook's distance, and DFITS, which are described below. It is important to identify
outliers because they can significantly influence your model, providing potentially misleading or
incorrect results. If you identify an outlier in your data, you should examine the observation to
understand why it is unusual and identify an appropriate remedy.
• Leverage values provide information about whether an observation has unusual
predictor values compared to the rest of the data. Leverages are a measure of the
distance between the x-values for an observation and the mean of x-values for all
observations. A large leverage value indicates that the x-values of an observation are
far from the center of x-values for all observations. Observations with large leverage
may exert considerable influence on the fitted value, and thus the regression model.
Leverage values fall between 0 and 1. A leverage value greater than 2p/n or 3p/n,
where p is the number of predictors plus the constant and n is the number of
observations, is considered large and should be examined. Minitab identifies
observations with leverage over 3p/n or .99, whichever is smaller, with an X in the table
of unusual observations.
• Cook's distance or D is an overall measure of the combined impact of each observation
on the fitted values. Because D is calculated using leverage values and standardized
residuals, it considers whether an observation is unusual with respect to both x- and y-
values. Geometrically, Cook's distance is a measure of the distance between the fitted
values calculated with and without the ith observation. Large values, which signify
unusual observations, can occur because the observation has 1) a large residual and
moderate leverage, 2) a large leverage and moderate residual, or 3) a large residual
and leverage. Some statisticians recommend comparing D to the F-distribution (p, n-p).
If D is greater than the F-value at the 50th percentile, then D is considered extreme and
should be examined. Other statisticians recommend comparing the D statistics to one
another, identifying values that are extremely large relative to the others. An easy way
to compare D values is to graph them using Graph > Time Series Plot, where the x-axis
represents the observations, not an index or time period.
• DFITS provides another measure to determine whether an observation is unusual. It
uses the leverage and deleted (Studentized) residual to calculate the difference
between the fitted value calculated with and without the ith observation. DFITS
represents roughly the number of estimated standard deviations that the fitted value
changes when the ith observation is removed from the data. Some statisticians suggest
that an observation with a DFITS value greater than 2·sqrt(p/n) is influential. Other
statisticians recommend comparing DFITS values to one another, identifying values that
are extremely large relative to the others. An easy way to compare DFITS values is to
graph them using Graph > Time Series Plot.
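The three stored statistics above can be computed by hand for a one-predictor regression. This sketch uses the textbook formulas (hat values, internally and externally Studentized residuals) on invented data in which the last observation has an unusual x-value; it is an illustration, not Minitab's stored output.

```python
# Sketch: leverage, Cook's distance, and DFITS for a simple regression.
# p counts the constant plus the predictor, so p = 2 here. Data made up.

def outlier_stats(x, y):
    n, p = len(x), 2
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)              # MSE
    lev = [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]    # hat values h_i
    stats = []
    for e, h in zip(resid, lev):
        r = e / (s2 * (1 - h)) ** 0.5                     # standardized residual
        t = r * ((n - p - 1) / (n - p - r * r)) ** 0.5    # deleted residual
        cook = r * r * h / (p * (1 - h))                  # Cook's distance
        dfits = t * (h / (1 - h)) ** 0.5                  # DFITS
        stats.append((h, cook, dfits))
    return stats

x = [1.0, 2.0, 3.0, 4.0, 10.0]    # last point: x far from the center
y = [1.2, 1.9, 3.1, 4.2, 9.8]
stats = outlier_stats(x, y)
for h, cook, dfits in stats:
    print(round(h, 3), round(cook, 3), round(dfits, 3))
```

The final observation gets by far the largest leverage, since its x-value sits far from the mean of the x-values.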
TO DO A STEPWISE REGRESSION
STEPWISE METHODS
STEPWISE OPTIONS
Stepwise proceeds automatically by steps and then pauses. You can set the number of
steps between pauses in the Options subdialog box.
The number of steps can be as small as one; the default and maximum are determined by
the output width. Set a smaller value if you wish to intervene more often.
You must check Editor > Enable Commands in order to intervene and use the
procedure interactively. If you do not, the procedure will run to completion without pausing.
At the pause, Minitab displays a MORE? prompt. At this prompt, you can continue the
display of steps, terminate the procedure, or intervene by typing a subcommand.
To...                                                        Type
display another "page" of steps (or until no more            YES
predictors can enter or leave the model)
terminate the procedure                                      NO
enter a set of variables                                     ENTER C...C
remove a set of variables                                    REMOVE C...C
force a set of variables to be in the model                  FORCE C...C
display the next best alternate predictors                   BEST K
set the number of steps between pauses                       STEPS K
change F to enter                                            FENTER K
change F to remove                                           FREMOVE K
change α to enter                                            AENTER K
change α to remove                                           AREMOVE K