SAS Procedures

Anil Kumar
PROC tabulate
• Summarizes the data in a well-organized
table
• Syntax:

PROC TABULATE DATA=dataname;
CLASS class-variables;
VAR analysis-variables;
TABLE page, row, column description / options;
RUN;
PROC tabulate – example (1)
proc tabulate data=sashelp.Class;
class sex;
var height weight;
table sex, height weight;
run;

Result:
PROC tabulate – example (2)
proc tabulate data=sashelp.Class;
class sex;
var height weight age;
table sex all, (age height weight)*(std mean sum);
run;

Result:
Gplot – A simple example
• The SAS/GRAPH module features the
flexible PROC GPLOT
• A simple example:

proc gplot data=sashelp.Class;
symbol i=none v=star;
plot height*weight;
run;
quit;
Resulting graph
Gplot – further example
• The following example shows more
flexibility of the procedure
goptions reset=all;
proc gplot data=sashelp.Class;
/* one SYMBOL statement per level of Sex */
symbol1 color=green i=join v=diamond line=1 w=2 h=2;
symbol2 color=red i=join v=star line=2 w=2 h=2;
/* legend definition referenced by the PLOT statement */
legend1 down=1 across=1 position=(top center inside)
cshadow=blue frame value=(f=duplex)
label=(font=duplex h=1.5);
title f=zapf color=blue h=5pct 'Testing the graph';
plot Height*Weight=Sex / hminor=0 legend=legend1;
run;
quit;
General Outline of Model Choices

Model     Output Variable                         Types of Inputs                             Assumptions
ANOVA     Interval                                Categorical, Fixed Effects Only             Normality
REG       Interval                                Interval, Fixed Effects Only                Normality
LOGISTIC  Binary                                  Categorical, Interval, Fixed Effects Only   Log-Normal
GLM       Interval                                Categorical, Interval, Fixed Effects Only   Normality
GENMOD    Categorical, Interval                   Categorical, Interval, Fixed Effects Only   Exponential Family
MIXED     Interval                                Categorical, Interval, Random Effects       Normality
GLIMMIX   Categorical, Interval, Random Effects   Categorical, Interval, Random Effects       Exponential Family

Source: Cerrito (2005)
PROC REG
• Inputs and output are interval
• Ordinal data may be included
• Assumptions on the error term ε
– ε is normally distributed
– ε has mean zero and constant variance
– The errors are independent
• Residual analysis should be a routine part
of the analysis
Residuals
• The studentized residual, the RSTUDENT
statistic, is similar to the standardized
residual except that the mean square error
is calculated omitting the observation.
• Observations with studentized residual
absolute values of greater than 2 are
potential outliers.
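The residual check described above can be sketched as follows. This is a minimal illustration, not the presenter's original code: the choice of sashelp.Class, the Weight-on-Height model, and the output data set name work.reg_diag are all assumptions made for the example.

```sas
/* Sketch: fit a simple regression and save residual diagnostics.
   Data set, model, and output names are illustrative assumptions. */
proc reg data=sashelp.class;
   model weight = height;
   /* RSTUDENT= saves the studentized residual computed with the
      observation itself omitted from the error variance estimate */
   output out=work.reg_diag predicted=pred residual=resid rstudent=rstud;
run;
quit;

/* List observations whose |RSTUDENT| exceeds 2 (potential outliers) */
proc print data=work.reg_diag;
   where abs(rstud) > 2;
   var name height weight pred resid rstud;
run;
```

The cutoff of 2 is the rule of thumb quoted on the slide. Flagged observations are candidates for inspection, not automatic deletions.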
Regression Example
Output
Scatterplot With Regression Line
Residuals
PROC ANOVA
• PROC ANOVA is designed for balanced data:
each treatment (every level of each
categorical input) should have exactly the
same number of observations.
• Caution: If you use PROC ANOVA for
analysis of unbalanced data, you must
assume responsibility for the validity of the
results.
• For unbalanced data, use PROC GLM instead.
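A minimal sketch of the PROC GLM alternative follows. It assumes sashelp.Class as example data, with Sex as the only classification variable; the PROC FREQ step is there so you can confirm whether the group sizes are equal before choosing a procedure.

```sas
/* Check the group sizes first: unequal counts mean unbalanced data */
proc freq data=sashelp.class;
   tables sex;
run;

/* One-way analysis with PROC GLM, which does not require balance.
   Model and data set are illustrative assumptions. */
proc glm data=sashelp.class;
   class sex;
   model height = sex;
   means sex;
run;
quit;
```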
Categorical Procedures
Model     Output Variable                         Types of Inputs                             Assumptions
LOGISTIC  Binary                                  Categorical, Interval, Fixed Effects Only   Log-Normal
CATMOD    Analyzes data that can be represented by a two-dimensional contingency table.
          Input can be raw data, cell counts, or direct input of a covariance matrix.
GENMOD    Categorical, Interval                   Categorical, Interval, Fixed Effects Only   Exponential Family
GLIMMIX   Categorical, Interval, Random Effects   Categorical, Interval, Random Effects       Exponential Family
PROC CATMOD
• PROC CATMOD provides a wide variety
of categorical data analyses.
• Now that PROC LOGISTIC handles
classification variables, there is less of a
need to use PROC CATMOD for
regression.
• PROC CATMOD should not be used when
a continuous input variable has many
distinct values.
Output
Logistic Regression
• Binary outcomes
• Allows for any combination of nominal, ordinal or
continuous explanatory variables
• Computes predicted values, the receiver
operating characteristic (ROC) curve, an
approximation to the area beneath the curve (c),
and a number of regression diagnostics
• If the occurrence is rare, use the Poisson
distribution in PROC GENMOD.
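The points above can be sketched with a short PROC LOGISTIC step. This is an illustration under stated assumptions: it uses the sashelp.Heart sample data, treats Status='Dead' as the event, and picks Sex, AgeAtStart, and Cholesterol as predictors purely for demonstration. The output data set name work.roc is also an assumption.

```sas
/* Sketch: binary logistic regression with one categorical and two
   continuous predictors. Variable choices are illustrative only. */
proc logistic data=sashelp.heart;
   class sex / param=ref;
   /* OUTROC= saves the points of the ROC curve; the c statistic is
      reported in the association table of the printed output */
   model status(event='Dead') = sex ageatstart cholesterol
         / outroc=work.roc;
run;
```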
Generalized Linear Models
In generalized linear models the response is assumed to
possess a probability distribution of exponential form.
That is, the probability density of the response Y for
continuous response variables, or the probability function for
discrete responses, can be expressed as

$$f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

for some functions a, b, and c that determine the specific distribution
(omitting some requirements for these functions). Here θ is the natural
(canonical) parameter and φ is the dispersion parameter.
Expressions for the mean and variance are

$$E(Y) = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\, a(\phi)$$

Important to note is that the exponential family (or form) of
distributions constitutes a broad class of probability density functions.
Don’t confuse this broad family with the exponential pdf.
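As a worked illustration (not part of the original slides), the Poisson distribution can be written in the standard exponential-family form $f(y) = \exp\{[y\theta - b(\theta)]/a(\phi) + c(y,\phi)\}$, with $\theta$ the natural parameter and $\phi$ the dispersion parameter. The symbol $\mu$ for the Poisson mean is notation introduced here.

```latex
\begin{align*}
f(y) &= \frac{\mu^{y} e^{-\mu}}{y!}
      = \exp\{\, y \log\mu - \mu - \log y! \,\} \\
\theta &= \log\mu, \qquad b(\theta) = e^{\theta}, \qquad
a(\phi) = 1, \qquad c(y,\phi) = -\log y! \\
E(Y) &= b'(\theta) = e^{\theta} = \mu, \qquad
\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = e^{\theta} = \mu
\end{align*}
```

The mean and variance both equal $\mu$, as expected for a Poisson response, and $\theta = \log\mu$ is why the log link is the natural default for this distribution.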
Distributions and Associated Default
Link Functions Available in PROC
GENMOD
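As a sketch of how a distribution and link are requested in PROC GENMOD, the step below fits a Poisson model with a log link to a small count data set. The data set work.insure_toy and all of its values are hypothetical, made up only so the example runs on its own.

```sas
/* Hypothetical toy data: claim counts by car size and age group */
data work.insure_toy;
   input policyholders claims car $ agegroup;
   log_n = log(policyholders);   /* offset: log of exposure */
   datalines;
450 38 small 1
1100 41 medium 1
150 3 large 1
380 90 small 2
520 66 medium 2
310 17 large 2
;
run;

/* Poisson regression: DIST= picks the distribution, LINK= the link */
proc genmod data=work.insure_toy;
   class car agegroup;
   model claims = car agegroup / dist=poisson link=log offset=log_n;
run;
```

If LINK= is omitted, PROC GENMOD uses the default link for the chosen distribution, which for the Poisson distribution is the log link.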
Interval (Quantitative) Procedures

Model     Output Variable                         Types of Inputs                             Assumptions
ANOVA     Interval                                Categorical, Fixed Effects Only             Normality
REG       Interval                                Interval, Fixed Effects Only                Normality
GLM       Interval                                Categorical, Interval, Fixed Effects Only   Normality
GENMOD    Categorical, Interval                   Categorical, Interval, Fixed Effects Only   Exponential Family
MIXED     Interval                                Categorical, Interval, Random Effects       Normality
GLIMMIX   Categorical, Interval, Random Effects   Categorical, Interval, Random Effects       Exponential Family
Assessing Goodness of Fit -
Akaike’s Information Criterion (AIC)
• Information criteria use the covariance matrix and the
number of parameters in a model to calculate a statistic
that summarizes the information represented by the
model, balancing a lack-of-fit term against a penalty
term.
• SAS calculates Akaike’s Information Criterion (AIC)
for each of the 2^p possible models for p ≤ 10 independent
variables.
• AIC estimates a measure of the difference between a
given model and the “true” model. The model with the
smallest AIC among all competing models is deemed the
best model.
• Beal’s example provides SAS code that can be used to
simultaneously evaluate up to 1024 models to determine
the best subset of variables that minimizes the
information criteria among all possible subsets.
Minimum AIC
• The AIC statistic is widely used to select
the best model among alternative
parametric models.
• AIC = - 2( maximum log-likelihood) +
2( number of free parameters)
• The absolute value of AIC is not meaningful by itself;
only differences between competing models matter.
• The difference between two AIC values is
considered insignificant if it is far less than 1.
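One way to see AIC values for competing models is to request them in PROC REG, as in the sketch below. The data set and the two candidate predictors are assumptions chosen for illustration; this is not Beal's code.

```sas
/* Sketch: print AIC (and SBC) for every subset of the candidate
   predictors. Data set and variables are illustrative assumptions. */
proc reg data=sashelp.class;
   model weight = height age / selection=rsquare aic sbc;
run;
quit;
```

As far as I know, PROC REG computes AIC for least-squares models as n ln(SSE/n) + 2p. With n fixed this differs from the −2 log-likelihood form only by a constant, so the ranking of models is unchanged. Pick the model with the smallest value and read only the differences.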
Beal’s Simulation
• Implements five common statistical techniques
to determine the best linear model
– minimizing the RMSE
– maximizing R²
– forward selection
– backward elimination
– stepwise regression
• The RMSE is a function of the sum of squared
errors (SSE), number of observations n and the
number of parameters p:
RMSE =sqrt(SSE/(n - p))
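The five techniques listed above map onto PROC REG selection options roughly as sketched below. This is a hedged illustration on sashelp.Class, not the simulation code from Beal's paper; the entry and stay significance levels of 0.15 and the model labels are assumptions.

```sas
/* Sketch: the same candidate model under several selection methods.
   Labels before each MODEL statement name the resulting output. */
proc reg data=sashelp.class;
   /* all subsets, reporting R-square and RMSE for each */
   subsets:  model weight = height age / selection=rsquare rmse;
   forward:  model weight = height age / selection=forward slentry=0.15;
   backward: model weight = height age / selection=backward slstay=0.15;
   stepwise: model weight = height age / selection=stepwise
                                         slentry=0.15 slstay=0.15;
run;
quit;
```

In the all-subsets output, the model with the smallest RMSE and the model with the largest R² can be read directly from the table. R² never decreases as variables are added, so on its own it always favors the full model.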
Generate the Data
Partial Code for Regressions
Simulation Results: n=1000
Simulation Result: n=10000
AIC Selected Coefficients
for Five Runs
Generalized Linear Mixed Models
PROC MIXED
• The mixed model generalizes the standard linear model:
y = Xβ + Zγ + ε
• β is an unknown vector of fixed-effects parameters with
known design matrix X, γ is an unknown vector of
random-effects parameters with known design matrix Z,
and ε is an unknown random error vector whose elements
are no longer required to be independent and homogeneous.
• PROC MIXED is a generalization of the GLM procedure
in the sense that PROC GLM fits standard linear models,
and PROC MIXED fits the wider class of mixed linear
models.
• Both procedures have similar CLASS, MODEL,
CONTRAST, ESTIMATE, and LSMEANS statements.
• But their RANDOM and REPEATED statements differ.
RANDOM and REPEATED Statements
in PROC GLM and PROC MIXED
• The RANDOM statement in PROC MIXED incorporates random
effects constituting the γ vector in the mixed model.
• However, in PROC GLM, effects specified in the RANDOM
statement are still treated as fixed as far as the model fit is
concerned, and they serve only to produce corresponding expected
mean squares.
• The REPEATED statement in PROC MIXED is used to specify
covariance structures for repeated measurements on subjects.
• The REPEATED statement in PROC GLM is used to specify various
transformations with which to conduct the traditional univariate or
multivariate tests.
• In repeated measures situations, the mixed model approach used in
PROC MIXED is more flexible and more widely applicable than
either the univariate or multivariate approaches.
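The difference in how the two procedures treat the RANDOM statement can be sketched as follows. The example uses sashelp.Class and treats Age as a grouping factor purely to show the syntax; that modeling choice is an assumption and is not meant as a sensible analysis.

```sas
/* PROC GLM: Age must also appear in the MODEL statement and is still
   fit as a fixed effect; RANDOM only adds expected mean squares */
proc glm data=sashelp.class;
   class age;
   model height = weight age;
   random age;
run;
quit;

/* PROC MIXED: Age appears only in RANDOM and enters the model as a
   random effect (a column block of Z with coefficients in gamma) */
proc mixed data=sashelp.class;
   class age;
   model height = weight / solution;
   random age;
run;
```

A REPEATED statement in PROC MIXED would be used the same way to name a within-subject covariance structure, for example with SUBJECT= and TYPE= options, when the data contain repeated measurements on each subject.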
PROC GLIMMIX
• The GLIMMIX procedure fits statistical models to
data with correlations or nonconstant variability
and where the response is not necessarily
normally distributed.
• These models are known as generalized linear
mixed models (GLMM).
• November 2005: Production-level version can
now be downloaded from
http://support.sas.com/rnd/app/da/glimmix.html
PROC GLIMMIX (continued)
• The GLMMs, like linear mixed models, assume
normal (Gaussian) random effects.
• Conditional on these random effects, data can
have any distribution in the exponential family.
• The binary, binomial, Poisson, and negative
binomial distributions, for example, are discrete
members of this family.
• The normal, beta, gamma, and chi-square
distributions are representatives of the
continuous distributions in this family.
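A minimal GLMM sketch in PROC GLIMMIX follows: a binomial response with a normally distributed random clinic intercept. The data set work.clinic_toy and every value in it are hypothetical, invented only so the step runs on its own.

```sas
/* Hypothetical toy data: events out of n patients per clinic and arm */
data work.clinic_toy;
   input clinic $ treatment $ events n;
   datalines;
A new 4 20
A std 7 20
B new 3 25
B std 9 25
C new 6 30
C std 8 30
D new 2 15
D std 5 15
;
run;

/* Conditional on the clinic effect the response is binomial;
   the clinic intercepts themselves are assumed Gaussian */
proc glimmix data=work.clinic_toy;
   class clinic treatment;
   model events/n = treatment / dist=binomial link=logit solution;
   random intercept / subject=clinic;
run;
```

With so few clinics the variance component is poorly determined, so treat the output as a syntax demonstration only.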
Summary
• Know what your assumptions are and check
them.
• Theory, methods and techniques evolve.
• Consider using
– PROC GLIMMIX
– Enterprise Guide
• Fit the model to the data!
References
• Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood
Principle," in Petrov and Csaki, eds., "Proceedings of the Second International
Symposium on Information Theory," 267-281.
• Beal, Dennis J. (2005), “SAS Code to Select the Best Multiple Linear Regression Model
for Multivariate Data Using Information Criteria,” Proceedings, Southeast SAS Users
Group Conference.
• Bickel, Peter J. and Doksum, Kjell A. (2001), Mathematical Statistics, Prentice-Hall, Inc.,
Upper Saddle River, NJ.
• Cerrito, Patricia B. (2005), “From GLM to GLIMMIX-Which Model to Choose?” Workshop,
Southeast SAS Users Group Conference.
• Long, J.Scott (1997), Regression Models for Categorical and Limited Dependent
Variables, Thousand Oaks, CA: Sage Publications, Inc.
• McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, Second Edition,
London: Chapman and Hall.
• Seber, G.A.F. (1984), Multivariate Observations, John Wiley & Sons, New York.
• Stokes, M.E., Davis, C.S., and Koch, G.G. (2000), Categorical Data Analysis Using the
SAS System, Second Edition, Cary, NC: SAS Institute Inc.
• SAS Online Documentation, http://www.sas.com
• GLIMMIX Procedure Documentation, “The GLIMMIX Procedure, Nov. 2005”, SAS
Institute.

Note: EG project files, programs and other SAS source used in the original presentation
are available by request, but they are not contained in this online version - kmg