Complete SAS Code For Workout

Data workout SAS code

Data treatment demonstration
• Converting categorical variables into indicator variables
• Missing value treatment
• Knowing when we don't need to create indicator variables for a categorical variable
(Note: a categorical variable is either a character variable or a numeric variable whose values look numeric but carry no numeric meaning, such as sector 1, 2, etc.)
/* Proc Means to check the values of variable TargetD for distinct values of variable
TargetB */
proc means data=shan.gift;

var TargetD;
class TargetB;
run;
The MEANS Procedure
Analysis Variable : TargetD Target Gift Amount

Target Gift
Flag           N Obs       N          Mean       Std Dev       Minimum       Maximum
------------------------------------------------------------------------------------
0               4843       0             .             .             .             .
1               4843    4843    15.6243444    12.4451371     1.0000000   200.0000000
------------------------------------------------------------------------------------
/* Checking for missing value count*/
data chk;
set shan.gift;
where GiftAvgCard36=.;
run;
NOTE: There were 1780 observations read from the data set SHAN.GIFT.
WHERE GiftAvgCard36=.;
NOTE: The data set WORK.CHK has 1780 observations and 28 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.01 seconds
/*missing values for age*/
data chk;
set shan.gift;
where DemAge=.;
run;
WHERE DemAge=.;
NOTE: The data set WORK.CHK has 2407 observations and 28 variables.
cpu time 0.02 second
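The two WHERE-based checks above can also be done in a single pass. The following is a minimal sketch, assuming both variables are numeric (as the `=.` comparisons imply); N and NMISS are standard PROC MEANS statistic keywords.

```sas
/* Sketch: count non-missing (N) and missing (NMISS) values in one step */
proc means data=shan.gift n nmiss;
  var DemAge GiftAvgCard36;
run;
```

The NMISS column should agree with the observation counts reported in the logs above (2407 for DemAge and 1780 for GiftAvgCard36).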
/*missing value treatment*/
data shan.gift1;
set shan.gift;
if DemAge=. then
do;
DemAge=0;
flagDemAge=1;
end;
else flagDemage=0;
if GiftAvgCard36=. then
do;
GiftAvgCard36=0;
flagGiftAvgCard36=1;
end;
else flagGiftAvgCard36=0;
run;
NOTE: This SAS session is using a registry in WORK. All changes will be lost at the end of
this session.
23 data shan.gift1;
24 set shan.gift;
25 if DemAge=. then do;
26 DemAge=0;
27 flagDemAge=1;
28 end;
29 else flagDemage=0;
30 if GiftAvgCard36=. then do;
31 GiftAvgCard36=0;
32 flagGiftAvgCard36=1;
33 end;
34 else flagGiftAvgCard36=0;
35 run;
NOTE: The data set SHAN.GIFT1 has 9686 observations and 30 variables.
/*Find the significance of character variables on the dependent variable*/
proc freq data=shan.gift1;

tables StatusCat96NK*TargetB/chisq;
run;
The FREQ Procedure

Table of StatusCat96NK by TargetB
Cell contents: Frequency / Percent / Row Pct / Col Pct

StatusCat96NK              TargetB (Target Gift Flag)
(Status Category 96NK)          0          1      Total
A                            3032       2794       5826
                            31.30      28.85      60.15
                            52.04      47.96
                            62.61      57.69
E                              95        132        227
                             0.98       1.36       2.34
                            41.85      58.15
                             1.96       2.73
F                             403        257        660
                             4.16       2.65       6.81
                            61.06      38.94
                             8.32       5.31
L                              17         17         34
                             0.18       0.18       0.35
                            50.00      50.00
                             0.35       0.35
N                             310        264        574
                             3.20       2.73       5.93
                            54.01      45.99
                             6.40       5.45
S                             986       1379       2365
                            10.18      14.24      24.42
                            41.69      58.31
                            20.36      28.47
Total                        4843       4843       9686
                            50.00      50.00     100.00

Statistics for Table of StatusCat96NK by TargetB

Statistic                      DF       Value     Prob
Chi-Square                      5    117.0430   <.0001
Likelihood Ratio Chi-Square     5    117.6493   <.0001
Mantel-Haenszel Chi-Square      1     50.8320   <.0001
Phi Coefficient                        0.1099
Contingency Coefficient                0.1093
Cramer's V                             0.1099
Sample Size = 9686
/* Combine E and S from the cross-tabulation output: their response rates are almost
identical (row percentages 58.15 and 58.31 for TargetB=1) */

/* A, L and N can also be combined to reduce the number of distinct categories in the
categorical variable */
/* Remove the categorical variable from the dataset once the indicator variables are
created */
data shan.gift1;
set shan.gift1;
ind_stcat96nk_E_or_S=0;
ind_stcat96nk_A_L_N=0;
if StatusCat96NK in ('E','S') then ind_stcat96nk_E_or_S=1;
else if StatusCat96NK in ('A','L','N') then ind_stcat96nk_A_L_N=1;
run;

/* Check the new indicator variables against the original categories */
proc freq data=shan.gift1;
tables ind_stcat96nk_E_or_S*StatusCat96NK
ind_stcat96nk_A_L_N*StatusCat96NK/norow nocol nocum
nopercent chisq;
run;
The FREQ Procedure

Table of ind_stcat96nk_E_or_S by StatusCat96NK (cell contents: Frequency)

ind_stcat96nk_E_or_S    StatusCat96NK (Status Category 96NK)
                           A      E      F      L      N      S    Total
0                       5826      0    660     34    574      0     7094
1                          0    227      0      0      0   2365     2592
Total                   5826    227    660     34    574   2365     9686

Statistics for Table of ind_stcat96nk_E_or_S by StatusCat96NK

Statistic                      DF         Value     Prob
Chi-Square                      5     9686.0000   <.0001
Likelihood Ratio Chi-Square     5    11252.4170   <.0001
Mantel-Haenszel Chi-Square      1     6831.6503   <.0001
Phi Coefficient                          1.0000
Contingency Coefficient                  0.7071
Cramer's V                               1.0000
Sample Size = 9686

Table of ind_stcat96nk_A_L_N by StatusCat96NK (cell contents: Frequency)

ind_stcat96nk_A_L_N     StatusCat96NK (Status Category 96NK)
                           A      E      F      L      N      S    Total
0                          0    227    660      0      0   2365     3252
1                       5826      0      0     34    574      0     6434
Total                   5826    227    660     34    574   2365     9686

Statistics for Table of ind_stcat96nk_A_L_N by StatusCat96NK

Statistic                      DF         Value     Prob
Chi-Square                      5     9686.0000   <.0001
Likelihood Ratio Chi-Square     5    12362.6467   <.0001
Mantel-Haenszel Chi-Square      1     6385.9061   <.0001
Phi Coefficient                          1.0000
Contingency Coefficient                  0.7071
Cramer's V                               1.0000
Sample Size = 9686
/* Creating indicator variables for the DemCluster variable */

/* First run the chi-square test */
proc freq data=shan.gift1;
tables DemCluster*TargetB/chisq;
run;
/*output of chisq*/
The SAS System
The FREQ Procedure

Frequency Table of DemCluster by TargetB
Percent
Row Pct DemCluster(Demographic Cluster) TargetB(Target Gift Flag) Total
Col Pct
0 1
00 100 140 240

1.03 1.45 2.48
41.67 58.33
2.06 2.89
01 54 67 121
0.56 0.69 1.25
44.63 55.37
1.12 1.38
02 92 99 191
0.95 1.02 1.97
48.17 51.83
1.90 2.04
03 68 85 153
0.70 0.88 1.58
44.44 55.56
1.40 1.76
04 21 30 51
0.22 0.31 0.53
41.18 58.82
0.43 0.62
05 48 47 95
0.50 0.49 0.98
50.53 49.47
0.99 0.97
06 31 22 53
0.32 0.23 0.55
58.49 41.51
0.64 0.45
07 34 44 78
0.35 0.45 0.81
43.59 56.41
0.70 0.91
08 102 80 182
1.05 0.83 1.88
56.04 43.96
2.11 1.65
09 36 34 70
0.37 0.35 0.72
51.43 48.57
0.74 0.70
10 106 69 175
1.09 0.71 1.81
60.57 39.43
2.19 1.42
11 109 127 236

1.13 1.31 2.44
46.19 53.81
2.25 2.62
12 163 160 323

1.68 1.65 3.33
50.46 49.54
3.37 3.30
13 138 171 309

1.42 1.77 3.19
44.66 55.34
2.85 3.53
14 121 127 248

1.25 1.31 2.56
48.79 51.21
2.50 2.62
15 56 52 108
0.58 0.54 1.12
51.85 48.15
1.16 1.07
16 97 104 201
1.00 1.07 2.08
48.26 51.74
2.00 2.15
17 92 86 178
0.95 0.89 1.84
51.69 48.31
1.90 1.78
18 153 168 321

1.58 1.73 3.31
47.66 52.34
3.16 3.47
19 25 25 50
0.26 0.26 0.52
50.00 50.00
0.52 0.52
20 78 93 171
0.81 0.96 1.77
45.61 54.39
1.61 1.92
21 91 74 165
0.94 0.76 1.70
55.15 44.85
1.88 1.53
22 60 65 125
0.62 0.67 1.29
48.00 52.00
1.24 1.34
23 60 71 131
0.62 0.73 1.35
45.80 54.20
1.24 1.47
24 185 216 401

1.91 2.23 4.14
46.13 53.87
3.82 4.46
25 71 64 135
0.73 0.66 1.39
52.59 47.41
1.47 1.32
26 49 51 100
0.51 0.53 1.03
49.00 51.00
1.01 1.05
27 165 166 331

1.70 1.71 3.42
49.85 50.15
3.41 3.43
28 85 109 194
0.88 1.13 2.00
43.81 56.19
1.76 2.25
29 33 40 73
0.34 0.41 0.75
45.21 54.79
0.68 0.83
30 153 109 262

1.58 1.13 2.70
58.40 41.60
3.16 2.25
31 63 62 125
0.65 0.64 1.29
50.40 49.60
1.30 1.28
32 45 27 72
0.46 0.28 0.74
62.50 37.50
0.93 0.56
33 26 26 52
0.27 0.27 0.54
50.00 50.00
0.54 0.54
34 64 68 132
0.66 0.70 1.36
48.48 51.52
1.32 1.40
35 182 202 384

1.88 2.09 3.96
47.40 52.60
3.76 4.17
36 216 185 401

2.23 1.91 4.14
53.87 46.13
4.46 3.82
37 56 43 99
0.58 0.44 1.02
56.57 43.43
1.16 0.89
38 53 65 118
0.55 0.67 1.22
44.92 55.08
1.09 1.34
39 118 124 242

1.22 1.28 2.50
48.76 51.24
2.44 2.56
40 197 235 432

2.03 2.43 4.46
45.60 54.40
4.07 4.85
41 113 84 197
1.17 0.87 2.03
57.36 42.64
2.33 1.73
42 67 73 140
0.69 0.75 1.45
47.86 52.14
1.38 1.51
43 123 104 227

1.27 1.07 2.34
54.19 45.81
2.54 2.15
44 111 74 185
1.15 0.76 1.91
60.00 40.00
2.29 1.53
45 123 105 228

1.27 1.08 2.35
53.95 46.05
2.54 2.17
46 92 104 196
0.95 1.07 2.02
46.94 53.06
1.90 2.15
47 52 34 86
0.54 0.35 0.89
60.47 39.53
1.07 0.70
48 48 48 96
0.50 0.50 0.99
50.00 50.00
0.99 0.99
49 175 148 323

1.81 1.53 3.33
54.18 45.82
3.61 3.06
50 35 35 70
0.36 0.36 0.72
50.00 50.00
0.72 0.72
51 119 101 220

1.23 1.04 2.27
54.09 45.91
2.46 2.09
52 19 13 32
0.20 0.13 0.33
59.38 40.63
0.39 0.27
53 70 88 158
0.72 0.91 1.63
44.30 55.70
1.45 1.82
Total 4843 4843 9686

50.00 50.00 100.00
Statistics for Table of DemCluster by TargetB

Statistic                      DF      Value     Prob
Chi-Square                     53    90.3768   0.0010
Likelihood Ratio Chi-Square    53    90.7359   0.0010
Mantel-Haenszel Chi-Square      1     8.5881   0.0034
Cramer's V                            0.0966
Sample Size = 9686

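/* Re-run showing only row percentages, which are easier to compare across clusters */
proc freq data=shan.gift1;
tables DemCluster*TargetB/nocol nofreq nocum
nopercent chisq;
run;
/* Let's take this output to Excel and check it */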
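data shan.gift1;
set shan.gift1;
/* Initialise all four cluster indicators to 0 */
ind_demclus_1=0;
ind_demclus_2=0;
ind_demclus_3=0;
ind_demclus_4=0;
/* Only indicators 1 and 2 are assigned in this listing;
   ind_demclus_3 and ind_demclus_4 keep their initial value of 0 here */
if DemCluster in ( '32') then ind_demclus_1 = 1 ;
else if DemCluster in ( '41') then ind_demclus_2 = 1 ;
else if DemCluster in ( '36') then ind_demclus_2 = 1 ;
run;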
/* DemGender variable significance */
proc freq data=shan.gift1;
tables DemGender*TargetB/ chisq;
run;
The FREQ Procedure

Table of DemGender by TargetB
Cell contents: Frequency / Percent / Row Pct / Col Pct

DemGender      TargetB (Target Gift Flag)
(Gender)            0          1      Total
M                1963       1962       3925
                20.27      20.26      40.52
                50.01      49.99
                40.53      40.51
U                 266        272        538
                 2.75       2.81       5.55
                49.44      50.56
                 5.49       5.62
Total            4843       4843       9686
                50.00      50.00     100.00

Statistics for Table of DemGender by TargetB

Statistic                      DF      Value     Prob
Chi-Square                      2     0.0720   0.9647
Cramer's V                            0.0027
Sample Size = 9686
/* Converting DemGender to indicator variables is not required: the chi-square test
is not significant (p = 0.9647) */
proc freq data=shan.gift1;
tables DemHomeOwner*TargetB/ chisq;
run;
The FREQ Procedure

Table of DemHomeOwner by TargetB
Cell contents: Frequency / Percent / Row Pct / Col Pct

DemHomeOwner   TargetB (Target Gift Flag)
(Home Owner)        0          1      Total
U                2174       2135       4309
                22.44      22.04      44.49
                50.45      49.55
                44.89      44.08
Total            4843       4843       9686
                50.00      50.00     100.00

Statistics for Table of DemHomeOwner by TargetB

Statistic                      DF      Value     Prob
Chi-Square                      1     0.6359   0.4252
Continuity Adj. Chi-Square      1     0.6037   0.4372
Phi Coefficient                      -0.0081
Cramer's V                           -0.0081
/* Converting DemHomeOwner to an indicator variable is not required: the chi-square test
is not significant (p = 0.4252) */
Multicollinearity Treatment

• Randomly dividing the dataset into two parts: one for development and one for validation of the model
• Multicollinearity treatment steps
o Knowing the individual strength of each variable in explaining the dependent variable
o Knowing which variables have high multicollinearity
o Knowing which variables they are collinear with
o Deciding which one to keep among collinear variables
/* Dividing the data into development (test) and validation (val) datasets, roughly 70/30 */
data test val ;

set shan.gift1;
if ranuni(1)<=0.7 then output test;
else output val;
run;
/* log file*/
NOTE: There were 9686 observations read from the data set SHAN.GIFT1.
NOTE: The data set WORK.TEST has 6793 observations and 36 variables.
NOTE: The data set WORK.VAL has 2893 observations and 36 variables.
cpu time 0.01 second
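The ranuni() split above gives approximately 70% of rows to the development set (6793 of 9686 here). If an exact sampling rate is preferred, PROC SURVEYSELECT is one alternative. This is a sketch only, not part of the original workout: the dataset names split, dev and val2 are hypothetical, chosen so that they do not overwrite the test and val datasets used in the rest of the code.

```sas
/* Sketch: 70/30 split with PROC SURVEYSELECT.
   OUTALL keeps every row and adds a 0/1 variable named Selected. */
proc surveyselect data=shan.gift1 out=split
     method=srs samprate=0.7 seed=1 outall;
run;

data dev val2;          /* hypothetical names */
  set split;
  if Selected = 1 then output dev;
  else output val2;
  drop Selected;
run;
```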
/* Knowing bi-variate strength of the independent variables in explaining the dependent
variable*/
proc logistic data = test ;
model targetB =
DemAge
DemMedHomeValue
DemMedIncome
DemPctVeterans
GiftAvg36
GiftAvgAll
GiftAvgCard36
GiftAvgLast
GiftCnt36
GiftCntAll
GiftCntCard36
GiftCntCardAll
GiftTimeFirst
GiftTimeLast
PromCnt12
PromCnt36
PromCntAll
PromCntCard12
PromCntCard36
PromCntCardAll
StatusCatStarAll
flagDemAge
flagGiftAvgCard36
ind_demclus_1
ind_demclus_2
ind_demclus_3
ind_demclus_4
ind_stcat96nk_A_L_N
ind_stcat96nk_E_or_S/selection = stepwise maxstep=1 details;
ods output EffectNotInModel = log_data ;
run;
/* Multi collinearity treatment – step 01
Note – we also dropped some insignificant variables based on bi-variate strength*/
proc reg data = test ;

model targetB =
DemAge
DemMedHomeValue
GiftAvg36
GiftAvgAll
GiftAvgLast
GiftCnt36
GiftCntAll
GiftCntCard36
GiftCntCardAll
GiftTimeFirst
GiftTimeLast
PromCnt12
PromCnt36
PromCntAll
PromCntCard12
PromCntCard36
PromCntCardAll
StatusCatStarAll
flagGiftAvgCard36
ind_demclus_1
ind_demclus_2
ind_demclus_4
ind_stcat96nk_A_L_N
ind_stcat96nk_E_or_S
/ vif collin;
ods output CollinDiag = collin_data (drop = intercept)
ParameterEstimates = para_data;
run;
/* In the first cycle of VIF checks we removed PromCntCardAll */
/* Take a close look at the complete multicollinearity removal output */
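To decide which variable to drop in each cycle, it helps to list the variables with the highest variance inflation first. The following is a sketch under assumptions: it relies on the para_data dataset written by the ODS OUTPUT statement above, it assumes the VIF column in that dataset is named VarianceInflation, and the cutoff of 2 is an example value, not one stated in the workout.

```sas
/* Sketch: rank variables by VIF and show those above an assumed cutoff */
proc sort data=para_data out=vif_sorted;   /* vif_sorted is a hypothetical name */
  by descending VarianceInflation;
run;

proc print data=vif_sorted;
  where VarianceInflation > 2;             /* assumed example cutoff */
  var Variable Estimate VarianceInflation;
run;
```

After removing one variable, PROC REG would be re-run and the list checked again until no variable exceeds the chosen cutoff.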

Final Model development
• Creating a model based on the variables left after multicollinearity treatment
o Checking model estimates and variable significance
o Selecting the best variables using stepwise regression
o Generating variable coefficients on the validation dataset
o Checking coefficient stability
/*Model with 10 variables to show the significance of all the variables coming after
multicollinearity test*/
proc logistic data = test outmodel = model_1;
model targetb(event = '1') =
DemAge
DemMedHomeValue
GiftAvg36
GiftCnt36
GiftCntCardAll
GiftTimeLast
ind_demclus_1
ind_demclus_2
ind_demclus_4
ind_stcat96nk_E_or_S/
details;
run;
/* Final model variable selection using stepwise regression */
proc logistic data = test ;
model targetb(event = '1') =
DemAge
DemMedHomeValue
GiftAvg36
GiftCnt36
GiftCntCardAll
GiftTimeLast
ind_demclus_1
ind_demclus_2
ind_demclus_4
ind_stcat96nk_E_or_S/
selection = stepwise maxstep=8 details;
run;
/* Developing final model variable coefficients on the validation dataset */
proc logistic data = val ;
model targetb(event = '1') =
DemMedHomeValue
GiftAvg36
GiftCnt36
GiftCntCardAll
GiftTimeLast
ind_demclus_1
ind_demclus_2
ind_demclus_4
;
run;
/* Take a look at coefficient stability worksheet */

Knowing model strength
• Obtaining other popular measures of model strength
o Generating the score in the development dataset
o Understanding what the score actually is
o How to compute the score manually
o Calculating the KS statistic on development data
o Checking the scoring ability of the model: calculating the KS statistic on validation data (using the model developed on development data)
/* Keeping the final model coefficients in a dataset (model_1) */
proc logistic data = test outmodel = model_1;
model targetb(event = '1') =
DemMedHomeValue
GiftAvg36
GiftCnt36
GiftCntCardAll
GiftTimeLast
ind_demclus_1
ind_demclus_2
ind_demclus_4
;
run;
/* Generating score in the test data */
proc logistic inmodel = model_1;

score data= test out = predicted;
run;
/* Proc contents of test data just to see what extra fields were added */
proc contents data=predicted;

run;
/* Understand how proc logistic generates the score in the dataset,
and what the score actually is */
data predicted;
set predicted;
P_0_D = round(P_0*1000,0.1);
log_odds=0.2751 +
DemMedHomeValue*9.425E-7 +
GiftAvg36*-0.00915 +
GiftCnt36*0.0847 +
GiftCntCardAll*0.0273 +
GiftTimeLast*-0.0362 +
ind_demclus_1*-0.3611 +
ind_demclus_2*-0.2279 +
ind_demclus_4*0.1434 ;
prob=exp(log_odds)/(1+exp(log_odds));
run;
/* P_0 = predicted probability of outcome 0 from the model
P_1 = predicted probability of outcome 1
prob = exp(log_odds)/(1+exp(log_odds)) is the probability derived manually from the
model equation; it should be (almost) equal to P_1 in the predicted dataset
P_0_D = P_0 multiplied by 1000 to make it easier to read */
proc print data=predicted (obs=50);

var DemMedHomeValue
GiftAvg36
GiftCnt36
GiftCntCardAll
GiftTimeLast
ind_demclus_1
ind_demclus_2
ind_demclus_4
P_0
P_1
log_odds
Prob
P_0_D;
run;
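Rather than comparing prob and P_1 by eye in the printed rows, the gap between them can be summarised over the whole dataset. This is a minimal sketch, assuming the predicted dataset built above (with both prob and P_1); chk_score and abs_diff are hypothetical names.

```sas
/* Sketch: how far is the manually computed probability from P_1? */
data chk_score;
  set predicted;
  abs_diff = abs(prob - P_1);
run;

proc means data=chk_score n mean max;
  var abs_diff;
run;
```

Small non-zero differences are expected, because the coefficients typed into the log_odds equation are rounded.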
/* Creates ten deciles for the score variable in the dataset.

The deciles are in ascending order of P_0.
Note that a lower value of P_0 corresponds to a higher value of P_1, hence more outcomes = 1
*/
proc rank data=predicted out=practice group=10 ties=low ;

var P_0_D;
ranks P_Final;
run;
/* Check what the actual score and the ranked variable look like */
proc print data=practice(obs=50);
var P_0_D P_Final ;
run;
/* Getting figures to calculate KS and Gini on the development dataset */

proc sql ;
select P_final, min(P_0_D) as Min_score, max(P_0_D) as Max_score,
sum(1*targetB) as responder, count(targetB) as population
from practice
group by P_final
order by P_final
;
quit;
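The decile figures above are the inputs to the KS statistic, which is the largest gap between the cumulative share of responders and the cumulative share of non-responders across the deciles. The following sketch computes it in SAS instead of a spreadsheet. It assumes that targetB is a numeric 0/1 variable and that the practice dataset from PROC RANK is available; decile_summary, ks_table and the macro variables are hypothetical names.

```sas
/* Sketch: decile-level KS from the ranked dataset */
proc sql noprint;
  /* Overall totals of responders and non-responders */
  select sum(targetB), sum(1 - targetB)
    into :tot_resp, :tot_nonresp
  from practice;

  /* One row per decile, lowest P_0 (highest P_1) first */
  create table decile_summary as
  select P_Final,
         sum(targetB) as responder,
         count(*)     as population
  from practice
  group by P_Final
  order by P_Final;
quit;

data ks_table;
  set decile_summary;
  nonresponder = population - responder;
  cum_resp    + responder;       /* running totals via sum statements */
  cum_nonresp + nonresponder;
  cum_pct_resp    = cum_resp    / &tot_resp;
  cum_pct_nonresp = cum_nonresp / &tot_nonresp;
  ks = abs(cum_pct_resp - cum_pct_nonresp);
run;

proc sql;
  select max(ks) as KS_statistic format=6.4
  from ks_table;
quit;
```

Running the same steps after scoring the validation data gives the validation KS for comparison.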
/*Scoring the validation dataset – using coefficients obtained on development data*/
proc logistic inmodel = model_1;

score data= val out = predicted;
run;
data predicted;
set predicted;
P_0_D = round(P_0*1000,0.1);
run;
proc rank data=predicted out=practice group=10 ties=low ;

var P_0_D;
ranks P_Final;
run;
proc print data=practice(obs=50);

var P_0_D P_Final ;
run;
proc sql ;
select P_final, min(P_0_D) as Min_score, max(P_0_D) as Max_score,
sum(1*targetB) as responder, count(targetB) as population
from practice
group by P_final
order by P_final
;
quit;
