
Modeling Relationships
Association and Causation
Association
Correlation is not causation!
Data set FITNESS for correlation example
CORR PROCEDURE
Univariate Statistics

Pearson Correlation Coefficients


Cronbach’s alpha

The observed value Y is divided into two components, a true value T and a measurement error E. The measurement error is assumed to be independent of the true value; that is,

$$Y = T + E, \qquad \mathrm{Cov}(T, E) = 0$$

Suppose there are $p$ variables $Y_j = T_j + E_j$, where $j = 1, \ldots, p$.
Let $Y_0 = \sum_j Y_j$ and $T_0 = \sum_j T_j$.

Then Cronbach’s coefficient alpha is given by

$$\alpha \;=\; \frac{p}{p-1}\,\frac{\sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j)}{V(Y_0)} \;=\; \frac{p}{p-1}\left(1 - \frac{\sum_j V(Y_j)}{V(Y_0)}\right)$$
If the covariances among the items are small compared to the variance of their sum, it implies that the total variation is due less to the true values $T_j$ and more to the errors $E_j$, since Cov(T, E) = 0 by assumption.
Hint: to obtain the second equality for alpha, expand $V(Y_0)$, the variance of $(Y_1 + Y_2 + \cdots + Y_p)$, and substitute out the covariance terms using $V(Y_0) = \sum_j V(Y_j) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j)$.
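
Written out, the hint is a two-step rearrangement:

$$V(Y_0) = \sum_j V(Y_j) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) \quad\Rightarrow\quad \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) = V(Y_0) - \sum_j V(Y_j)$$

$$\alpha = \frac{p}{p-1}\,\frac{V(Y_0) - \sum_j V(Y_j)}{V(Y_0)} = \frac{p}{p-1}\left(1 - \frac{\sum_j V(Y_j)}{V(Y_0)}\right)$$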
Cronbach's alpha is known as a measure of internal consistency. Generally, this measure will increase as the inter-correlations among test items increase.

Theoretically, if all test items measured the same construct, the inter-correlations among them would be maximized. In this sense, Cronbach’s alpha is seen as indirectly reflecting the degree to which the items in the test set measure a single unidimensional latent construct.

Note: if the variances of the items V(Yj) differ greatly, you can standardize the items to unit variance before computing Cronbach’s alpha.
Requesting Cronbach’s alpha
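
In SAS, Cronbach’s alpha is requested through PROC CORR with the ALPHA option (paired with NOMISS, since the alpha computation needs complete cases); the output reports both the raw and the standardized coefficient. A minimal sketch, assuming the FITNESS data set from the earlier correlation example with item variables such as runtime, runpulse, rstpulse, and maxpulse (the exact variable list is an assumption):

proc corr data=sasuser.fitness alpha nomiss;
   /* ALPHA adds Cronbach's coefficient alpha (raw and standardized)
      plus an "alpha with deleted variable" table */
   var runtime runpulse rstpulse maxpulse;
run;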
Another Example: Fish Data

A commonly suggested acceptability threshold of 0.70 for Cronbach’s alpha is given by Nunnally and Bernstein (1994).
Effect of deleting variables on Cronbach’s alpha

If the standardized alpha decreases after removing a variable from the construct, then this variable is strongly correlated with the other variables in the scale.

On the other hand, if the standardized alpha increases after removing a variable from the construct, then removing this variable from the scale makes the construct more reliable.
So removing the Width variable increases internal consistency, because the standardized alpha increases substantially when Width is removed.
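
The deleted-variable table behind this conclusion can be reproduced with the same ALPHA option. A sketch, assuming the Fish example refers to the SASHELP.FISH sample data with measurement items Weight, Length1–Length3, Height, and Width:

proc corr data=sashelp.fish alpha nomiss;
   /* compare the "Cronbach Coefficient Alpha with Deleted Variable"
      rows across items; Width is the candidate for removal here */
   var weight length1 length2 length3 height width;
run;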
Regression: cause-and-effect relationship
Diagnostics
Exploring data: looking at correlations among variables

proc insight data=sasuser.crime;
   /* scatterplot matrix: each variable in the first list
      plotted against each variable in the second list */
   scatter crime pctmetro poverty single * crime pctmetro poverty single;
run;
quit;
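
PROC INSIGHT is a legacy interactive product that has been dropped from newer SAS releases; a comparable scatterplot matrix can be produced with PROC SGSCATTER, sketched here under that assumption:

proc sgscatter data=sasuser.crime;
   matrix crime pctmetro poverty single;   /* all pairwise scatterplots */
run;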

The variables are:
• state id (sid),
• state name (state),
• violent crimes per 100,000 people (crime),
• murders per 1,000,000 (murder),
• % of the population living in metropolitan areas (pctmetro),
• % of the population that is white (pctwhite),
• % of the population with at least a high school education (pcths),
• % of population living under poverty line (poverty), and
• % of population that are single parents (single).
Unusual and influential data points

Observations can be unusual in three ways:

❑ Outliers:
Observations with large residuals, i.e., observations whose actual values are far from their predicted values. These may indicate a data peculiarity or a data-entry problem.

❑ Observations with high leverage:
These are observations with extreme values on the predictor variables. Leverage is a measure of how far an observation deviates from the mean of a predictor variable. High-leverage points can affect the estimates of the regression coefficients.

❑ Influence: An observation is said to be influential if removing it substantially changes the estimates of the parameters. Influence is a product of leverage and outlier-ness.
Regression assumptions
Consider the regression equation

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$$

We assume:
• E(u) = 0
• u is i.i.d. normally distributed
• E(X·u) = 0 (the regressors are uncorrelated with the error)
• No heteroscedasticity (the distribution of u is the same at each X)
• No autocorrelation
• No multicollinearity
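
On releases with ODS Graphics, PROC REG can draw a standard panel of diagnostic plots (residuals vs. fitted values, a Q-Q plot of residuals, leverage, Cook's D) for eyeballing these assumptions. A minimal sketch, using the crime model introduced on the next slide:

ods graphics on;
proc reg data=sasuser.crime plots=diagnostics;
   /* PLOTS=DIAGNOSTICS requests the panel of residual, Q-Q,
      leverage, and Cook's D plots */
   model crime=pctmetro poverty single;
run;
quit;
ods graphics off;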
Running regression

PROC REG is used to run the regression:

proc reg data=sasuser.crime;
   model crime=pctmetro poverty single;
   /* write diagnostics to a data set:
      RSTUDENT= studentized residuals, H= leverage (hat diagonal),
      COOKD= Cook's D, DFFITS= DFFIT */
   output out=crime1res(keep=sid state crime pctmetro poverty single r lev cd dffit)
          rstudent=r h=lev cookd=cd dffits=dffit;
run;
quit;
How to read regression output
Regression diagnostics are output to the data set crime1res: Cook’s D, leverage, studentized residuals, and DFFIT.
Analysis of studentized residuals
With PROC UNIVARIATE:

proc univariate data=crime1res plots plotsize=30;
   var r;   /* r = the studentized residuals saved by the OUTPUT statement */
run;
Univariate analysis of studentized residuals: outliers at obs 9, 25, 51
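
To list the flagged observations, a common convention (an assumption here, not something the slides prescribe) is to print cases whose studentized residuals exceed 2 in absolute value:

proc print data=crime1res;
   where abs(r) > 2;   /* |studentized residual| > 2 flags potential outliers */
   var r crime pctmetro poverty single state;
run;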
Leverage: Identifying unusual observations

proc univariate data=crime1res plots plotsize=30;
   var lev;   /* lev = leverage (hat diagonal) */
run;
Leverage: Identifying unusual observations

Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where

k = the number of predictors and
n = the number of observations.

In our example, k = 3 and n = 51 (US states), so the cutoff works out to (2*3+2)/51 = 0.15686, and we can do the following:

proc print data=crime1res;
   var crime pctmetro poverty single state;
   where lev > .156;
run;
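
Rather than hard-coding .156, the cutoff can be computed once with macro variables so the same code adapts to other models; a sketch (%SYSEVALF does the floating-point arithmetic):

%let k = 3;    /* number of predictors */
%let n = 51;   /* number of observations */
%let lev_cut = %sysevalf((2*&k + 2) / &n);

proc print data=crime1res;
   var crime pctmetro poverty single state lev;
   where lev > &lev_cut;
run;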
Printing unusual observations with high leverage (> 0.156)

Obs  crime  pctmetro  poverty   single  state
 47   1062    75.000  26.4000  14.9000  la
 48    208    41.800  22.2000   9.4000  wv
 49    434    30.700  24.7000  14.7000  ms
 50    761    41.800   9.1000  14.3000  ak
 51   2922   100.000  26.4000  22.1000  dc

You may scrutinize these observations more carefully to understand why they have more influence on your regression results.
Other measures of influence: Cook’s D
(The conventional cut-off point is 4/n.)

proc print data=crime1res;
   where cd > (4/51);   /* 4/n with n = 51 */
   var crime pctmetro poverty single state cd;
run;
Other measures of influence: DFFITS
(The conventional cut-off point is 2*sqrt(k/n).)

proc print data=crime1res;
   where abs(dffit) > (2*sqrt(3/51));   /* 2*sqrt(k/n) with k = 3, n = 51 */
   var crime pctmetro poverty single state dffit;
run;

Again the same states show up as having high influence.
Measures of specific influence: DFBETA
DFBETA measures how each coefficient changes when an observation is deleted.

proc reg data=sasuser.crime;
   model crime=pctmetro poverty single / influence;   /* INFLUENCE requests DFBETAS */
   ods output OutputStatistics=crimedfbetas;          /* capture the statistics in a data set */
   id state;
run;
quit;
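
To inspect the captured DFBETAS for a single state such as Alaska, something like the following can be used (a sketch: the DFB_-prefixed variable names are an assumption about how the captured OutputStatistics table names its columns):

proc print data=crimedfbetas;
   where state = 'ak';   /* state codes are lowercase in this data set */
   var state dfb_pctmetro dfb_poverty dfb_single;
run;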
For Alaska, DFBETA for the predictor variable SINGLE is 0.1452. This means that including Alaska in the analysis increases the coefficient for SINGLE by about 0.14 standard errors, i.e., 0.14 times the standard error of B_single.

DFBETA for B_single = 0.14

The contribution of the Alaska observation to B_single = DFBETA(B_single) * SE(B_single).

That is, the contribution of the Alaska observation to the coefficient of the variable SINGLE is 0.14 times the standard error of B_single [0.14 * SE(B_single)].

The standard error of the coefficient is SE(B_single) = 15.5.

So the Alaska observation's contribution to B_single is 2.17 (= 0.14 * 15.5).

Therefore, if we exclude Alaska, the coefficient B_single will decrease by the contribution of the Alaska observation, or by 2.17. In other words, B_single will decrease from 132.408 to about 130.24 (= 132.408 − 2.17).
