
Modeling Relationships
Association and Causation
Association
Correlation is not causation!
Data set FITNESS for correlation example
CORR PROCEDURE
Univariate Statistics

Pearson Correlation Coefficients


Cronbach’s alpha

The observed value Y is divided into two components, a true value T and a measurement error E. The measurement error is assumed to be independent of the true value; that is,

$$Y = T + E, \qquad \mathrm{Cov}(T, E) = 0$$

Suppose there are $p$ variables $Y_j = T_j + E_j$, where $j = 1, \ldots, p$.
Let $Y_0 = \sum_j Y_j$ and $T_0 = \sum_j T_j$.

Then Cronbach’s coefficient alpha is given by

$$\alpha \;=\; \frac{p}{p-1}\,\frac{\sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j)}{V(Y_0)} \;=\; \frac{p}{p-1}\left(1 - \frac{\sum_j V(Y_j)}{V(Y_0)}\right)$$
If the covariances among the items are small compared to the variance of their sum, it implies that the total variation is due less to the true values $T_j$ and more to the errors $E_j$, since Cov(T, E) = 0 by assumption.
Hint: to obtain the second equality for alpha, expand $V(Y_0)$, the variance of $(Y_1 + Y_2 + \cdots + Y_p)$, and substitute out the covariance terms using $V(Y_0) = \sum_j V(Y_j) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j)$.
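
Written out, the hint is a two-step rearrangement:

$$V(Y_0) = \sum_j V(Y_j) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) \quad\Rightarrow\quad \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) = V(Y_0) - \sum_j V(Y_j)$$

$$\alpha = \frac{p}{p-1}\,\frac{V(Y_0) - \sum_j V(Y_j)}{V(Y_0)} = \frac{p}{p-1}\left(1 - \frac{\sum_j V(Y_j)}{V(Y_0)}\right)$$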
Cronbach's alpha is known as a measure of internal consistency. Generally, this measure will increase as the inter-correlations among test items increase.

Theoretically, if all test items measured the same construct, the inter-correlations among them would be maximized. In this sense, Cronbach’s alpha is seen as indirectly reflecting the degree to which the items in the test set measure a single unidimensional latent construct.

Note: if the variances of the items V(Yj) differ greatly, you can standardize the items to unit variance before computing Cronbach’s alpha.
Requesting Cronbach’s alpha
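
In SAS, Cronbach’s alpha is requested through PROC CORR with the ALPHA option (paired with NOMISS, since the alpha computation needs complete cases); the output reports both the raw and the standardized coefficient. A minimal sketch, assuming the FITNESS data set from the earlier correlation example with item variables such as runtime, runpulse, rstpulse, and maxpulse (the exact variable list is an assumption):

proc corr data=sasuser.fitness alpha nomiss;
   /* ALPHA adds Cronbach's coefficient alpha (raw and standardized)
      plus an "alpha with deleted variable" table */
   var runtime runpulse rstpulse maxpulse;
run;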
Another Example: Fish Data

A commonly suggested acceptability threshold of 0.70 for Cronbach’s alpha is given by Nunnally and Bernstein (1994).
Effect of deleting variables on Cronbach’s alpha

If the standardized alpha decreases after removing a variable from the construct, then this variable is strongly correlated with the other variables in the scale.

On the other hand, if the standardized alpha increases after removing a variable from the construct, then removing this variable from the scale makes the construct more reliable.
So removing the Width variable increases internal consistency, because the standardized alpha increases substantially when Width is removed.
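
The deleted-variable table behind this conclusion can be reproduced with the same ALPHA option. A sketch, assuming the Fish example refers to the SASHELP.FISH sample data with measurement items Weight, Length1–Length3, Height, and Width:

proc corr data=sashelp.fish alpha nomiss;
   /* compare the "Cronbach Coefficient Alpha with Deleted Variable"
      rows across items; Width is the candidate for removal here */
   var weight length1 length2 length3 height width;
run;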
Regression: cause-and-effect relationship
Diagnostics
Exploring data: looking at correlations among variables

proc insight data=sasuser.crime;
   /* scatterplot matrix: each variable in the first list
      plotted against each variable in the second list */
   scatter crime pctmetro poverty single * crime pctmetro poverty single;
run;
quit;
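
PROC INSIGHT is a legacy interactive product that has been dropped from newer SAS releases; a comparable scatterplot matrix can be produced with PROC SGSCATTER, sketched here under that assumption:

proc sgscatter data=sasuser.crime;
   matrix crime pctmetro poverty single;   /* all pairwise scatterplots */
run;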

The variables are:
• state id (sid),
• state name (state),
• violent crimes per 100,000 people (crime),
• murders per 1,000,000 (murder),
• % of the population living in metropolitan areas (pctmetro),
• % of the population that is white (pctwhite),
• % of the population with at least a high school education (pcths),
• % of population living under poverty line (poverty), and
• % of population that are single parents (single).
Unusual and influential data points

Observations can be unusual in three ways:

❑ Outliers:
Observations with large residuals, i.e., observations whose actual values are far from their predicted values. These may indicate a data peculiarity or a data-entry problem.

❑ Observations with high leverage:
These are observations with extreme values on the predictor variables. Leverage is a measure of how far an observation deviates from the mean of a predictor variable. High-leverage points can affect the estimates of the regression coefficients.

❑ Influence: An observation is said to be influential if removing it substantially changes the estimates of the parameters. Influence is a product of leverage and outlier-ness.
Regression assumptions
Consider the regression equation

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$$

We assume:
• E(u) = 0
• u is i.i.d. normally distributed
• E(X·u) = 0 (the regressors are uncorrelated with the error)
• No heteroscedasticity (the distribution of u is the same at each X)
• No autocorrelation
• No multicollinearity
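
On releases with ODS Graphics, PROC REG can draw a standard panel of diagnostic plots (residuals vs. fitted values, a Q-Q plot of residuals, leverage, Cook's D) for eyeballing these assumptions. A minimal sketch, using the crime model introduced on the next slide:

ods graphics on;
proc reg data=sasuser.crime plots=diagnostics;
   /* PLOTS=DIAGNOSTICS requests the panel of residual, Q-Q,
      leverage, and Cook's D plots */
   model crime=pctmetro poverty single;
run;
quit;
ods graphics off;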
Running regression

PROC REG is used to run the regression:

proc reg data=sasuser.crime;
   model crime=pctmetro poverty single;
   /* write diagnostics to a data set:
      RSTUDENT= studentized residuals, H= leverage (hat diagonal),
      COOKD= Cook's D, DFFITS= DFFIT */
   output out=crime1res(keep=sid state crime pctmetro poverty single r lev cd dffit)
          rstudent=r h=lev cookd=cd dffits=dffit;
run;
quit;
How to read regression output
Regression diagnostics are output to the data set crime1res: Cook’s D, leverage, studentized residuals, and DFFIT.
Analysis of studentized residuals
With PROC UNIVARIATE:

proc univariate data=crime1res plots plotsize=30;
   var r;   /* r = the studentized residuals saved by the OUTPUT statement */
run;
Univariate analysis of studentized residuals: outliers at obs 9, 25, 51
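
To list the flagged observations, a common convention (an assumption here, not something the slides prescribe) is to print cases whose studentized residuals exceed 2 in absolute value:

proc print data=crime1res;
   where abs(r) > 2;   /* |studentized residual| > 2 flags potential outliers */
   var r crime pctmetro poverty single state;
run;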
Leverage: Identifying unusual observations

proc univariate data=crime1res plots plotsize=30;
   var lev;   /* lev = leverage (hat diagonal) */
run;
Leverage: Identifying unusual observations

Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where

k = the number of predictors and
n = the number of observations.

In our example, k = 3 and n = 51 (US states), so the cutoff works out to (2*3+2)/51 = 0.15686, and we can do the following:

proc print data=crime1res;
   var crime pctmetro poverty single state;
   where lev > .156;
run;
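
Rather than hard-coding .156, the cutoff can be computed once with macro variables so the same code adapts to other models; a sketch (%SYSEVALF does the floating-point arithmetic):

%let k = 3;    /* number of predictors */
%let n = 51;   /* number of observations */
%let lev_cut = %sysevalf((2*&k + 2) / &n);

proc print data=crime1res;
   var crime pctmetro poverty single state lev;
   where lev > &lev_cut;
run;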
Printing unusual observations with high leverage (> 0.156)

Obs  crime  pctmetro  poverty   single  state
 47   1062    75.000  26.4000  14.9000  la
 48    208    41.800  22.2000   9.4000  wv
 49    434    30.700  24.7000  14.7000  ms
 50    761    41.800   9.1000  14.3000  ak
 51   2922   100.000  26.4000  22.1000  dc

You may scrutinize these observations more carefully to understand why they have more influence on your regression results.
Other measures of influence: Cook’s D
(The conventional cut-off point is 4/n.)

proc print data=crime1res;
   where cd > (4/51);   /* 4/n with n = 51 */
   var crime pctmetro poverty single state cd;
run;
Other measures of influence: DFFITS
(The conventional cut-off point is 2*sqrt(k/n).)

proc print data=crime1res;
   where abs(dffit) > (2*sqrt(3/51));   /* 2*sqrt(k/n) with k = 3, n = 51 */
   var crime pctmetro poverty single state dffit;
run;

Again the same states show up as having high influence.
Measures of specific influence: DFBETA
DFBETA measures how each coefficient changes when an observation is deleted.

proc reg data=sasuser.crime;
   model crime=pctmetro poverty single / influence;   /* INFLUENCE requests DFBETAS */
   ods output OutputStatistics=crimedfbetas;          /* capture the statistics in a data set */
   id state;
run;
quit;
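
To inspect the captured DFBETAS for a single state such as Alaska, something like the following can be used (a sketch: the DFB_-prefixed variable names are an assumption about how the captured OutputStatistics table names its columns):

proc print data=crimedfbetas;
   where state = 'ak';   /* state codes are lowercase in this data set */
   var state dfb_pctmetro dfb_poverty dfb_single;
run;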
For Alaska, DFBETA for the predictor variable SINGLE is 0.1452. This means that including Alaska in the analysis increases the coefficient for SINGLE by about 0.14 standard errors, i.e., 0.14 times the standard error of B_single.

DFBETA for B_single = 0.14

The contribution of the Alaska observation to B_single = DFBETA(B_single) * SE(B_single).

That is, the contribution of the Alaska observation to the coefficient of the variable SINGLE is 0.14 times the standard error of B_single [0.14 * SE(B_single)].

The standard error of the coefficient is SE(B_single) = 15.5.

So the Alaska observation's contribution to B_single is 2.17 (= 0.14 * 15.5).

Therefore, if we exclude Alaska, the coefficient B_single will decrease by the contribution of the Alaska observation, or by 2.17. In other words, B_single will decrease from 132.408 to about 130.24 (= 132.408 − 2.17).
