PGPM 2019-2020
Group-1-Section-1
Data Preparation:
Data Source:
The dataset is the Adult (Census Income) dataset from the UCI Machine
Learning Repository, read from a local file income.csv.
Data Description:
o Age
o Work class – type of organization of the employer
o Fnlwgt – Final weight (the sampling weight assigned to the record)
o Education – level of education of the individual
o Education-num – numeric score given to the education level (categorical
mapped to continuous)
o Marital status
o Occupation
o Relationship
o Race
o Gender
o Capital-gain
o Capital-loss
o Hours-per-week
o Native-country
Dependent Variable:
o Income class
Data Understanding:
o Data Volume – The dataset contains 48842 rows, some of which contain
null or unknown ("?") values.
o Attributes – There are 6 continuous variables and 8 nominal predictor
variables.
fnlwgt: continuous
age: continuous
education-num: continuous
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
o The dataset is split into training and test datasets according to the
split factor given as input.
o The model is built on the training data and validated on the test data.
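The split described above can be sketched as a small helper that takes the split factor as input (the function name, default seed, and signature are assumptions for illustration; the report itself hardcodes a 0.7 split later):

```r
# Split a data frame into training and test sets by a given split factor.
split_data <- function(df, split = 0.7, seed = 77850) {
  set.seed(seed)
  train.index <- sample(seq_len(nrow(df)), floor(nrow(df) * split))
  list(train = df[train.index, ], test = df[-train.index, ])
}
# Usage: parts <- split_data(income.data, split = 0.7)
# parts$train and parts$test then play the roles of train.df and test.df.
```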
Data Dictionary:
This attribute information provides the type of value that each column in
the dataset can hold.
o age: continuous.
o workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov,
Local-gov, State-gov, Without-pay, Never-worked.
o fnlwgt: continuous.
o education: Bachelors, Some-college, 11th, HS-grad, Prof-school,
Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th,
Doctorate, 5th-6th, Preschool.
o education-num: continuous.
o marital-status: Married-civ-spouse, Divorced, Never-married,
Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
o occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-
managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct,
Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
Protective-serv, Armed-Forces.
o relationship: Wife, Own-child, Husband, Not-in-family, Other-
relative, Unmarried.
o race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
o sex: Female, Male.
o capital-gain: continuous.
o capital-loss: continuous.
o hours-per-week: continuous.
o native-country: United-States, Cambodia, England, Puerto-Rico,
Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan,
Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy,
Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-
Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
o Income class: >50K, <=50K
Business Understanding:
DATA PREPARATION
There are values in the dataset marked as "?"; we treated them as "NA".
There are too many levels in some of the factors, hence we merged them:
After removing missing rows, we can observe that there are no "NA" values
left in the observations. Hence, we can go ahead with splitting the data
into "train.df" and "test.df":
We made a 70-30 partition of the "income" data into "train.df" and "test.df".
Model Selection:
1. Decision Tree
2. Logistic Regression Model
1. Decision Tree
#classification tree
default.ct = rpart(income~., data = train.df, method = "class")
#plot tree
prp(default.ct, type = 5, cex=0.5)
The fully grown tree achieves accuracy close to 100% on the training data.
Such near-perfect training accuracy, however, usually signals overfitting,
so we prune the tree using cross-validation before trusting it.
Performing cross-validation:
We take the cp value with the minimum cross-validation error to generate the
"best pruned tree".
cv.ct = rpart(income~., data = train.df, method="class", cp=0.00001, minsplit=5, xval=5)
printcp(cv.ct)
min.cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"]
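The report computes min.cp but the pruning call itself is not shown; a minimal sketch of the missing step follows (the object name pruned.ct is an assumption):

```r
# Prune the cross-validated tree back to the complexity parameter with the
# lowest cross-validation error, giving the "best pruned tree".
pruned.ct <- prune(cv.ct, cp = min.cp)
prp(pruned.ct, type = 5, cex = 0.5)  # plot the pruned tree
```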
> varImp(cv.ct)
Overall
age 2652.9003
capital.gain 3794.0838
capital.loss 1161.7316
education 3881.9123
educational.num 3583.1706
fnlwgt 2503.3892
gender 246.7625
hours.per.week 1879.5588
marital.status 2555.8307
native.country 720.3707
occupation 2956.7303
race 284.5883
relationship 2676.5847
workclass 980.2665
> varImp(deeper.ct)
Overall
age 3615.3792
capital.gain 3962.4568
capital.loss 1263.6892
education 4380.2679
educational.num 3973.0590
fnlwgt 3934.1958
gender 396.5312
hours.per.week 2597.3727
marital.status 2704.1844
native.country 926.8374
occupation 3454.9851
race 556.2974
relationship 2918.5779
workclass 1394.9231
Logistic Regression:
Performing logistic regression using the important variables found from the
Decision Tree. The pseudo R-squared, computed as
1 - (residual deviance / null deviance), is:
[1] 0.4185798
test.df$predicted_val <- predict(training_logit, newdata=test.df, type="response")
test.df$score[test.df$predicted_val >= 0.5] <- ">50K"
test.df$score[test.df$predicted_val < 0.5] <- "<=50K"
test.df$score <- factor(test.df$score, levels = levels(test.df$income))
confusionMatrix(test.df$score, test.df$income)
varImp(training_logit)
> summary(training_logit)
Call:
glm(formula = income ~ ., family = binomial(link = "logit"),
data = train.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.1152 -0.5163 -0.1940 -0.0212 3.7828
m1 = stepAIC(training_logit)
Start: AIC=20797.94
income ~ age + workclass + fnlwgt + education + educational.num +
marital.status + occupation + relationship + race + gender +
capital.gain + capital.loss + hours.per.week + native.country
Step: AIC=20797.94
income ~ age + workclass + fnlwgt + education + marital.status +
occupation + relationship + race + gender + capital.gain +
capital.loss + hours.per.week + native.country
Df Deviance AIC
<none> 20606 20798
- race 4 20626 20810
- fnlwgt 1 20628 20818
- native.country 40 20715 20827
- gender 1 20686 20876
- workclass 6 20732 20912
- marital.status 6 20742 20922
- relationship 5 20822 21004
- age 1 20826 21016
- hours.per.week 1 20879 21069
- capital.loss 1 20911 21101
- occupation 13 21238 21404
- education 15 21593 21755
- capital.gain 1 22428 22618
Interpretations
From the Decision Tree model, we found that all the variables used to
predict the income level are important.
Accuracy is around 83.92%; balanced accuracy is around 84.31%.
From the Logistic Regression model, we found that all the predictors are
significant at alpha levels 0.01, 0.05, and 0.1.
About 41.85% of the deviance in the dependent variable is explained by the
predictor variables (pseudo R-squared = 1 - residual deviance/null deviance).
The stepwise-selected model has an AIC of 20798. A lower AIC indicates a
better trade-off between goodness of fit and model complexity.
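For reference, AIC penalizes model complexity while rewarding fit:

```latex
\mathrm{AIC} = 2k - 2\ln(\hat{L})
```

where k is the number of estimated parameters and L-hat is the maximized likelihood; among candidate models fitted to the same data, the one with the smallest AIC is preferred, which is what stepAIC searches for.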
The logistic model's AUC is 90.69%, which indicates strong discriminative
power: the ROC curve lies well above the diagonal baseline, showing that the
model is accurate and significant.
By Logistic Regression we identified the following 13 important variables:
age, workclass, fnlwgt, education, marital.status, occupation, relationship,
race, gender, capital.gain, capital.loss, hours.per.week, native.country.
Final Code-
setwd("C:/Users/Desktop/dataset")
library(caret)
library(rpart)
library(rpart.plot)
library(MASS)
income = read.csv("income.csv")
str(income)
summary(income)
#preprocess begin
table(income$workclass)
#convert to character so new merged levels can be assigned safely
income$workclass <- as.character(income$workclass)
income$workclass[income$workclass == "Without-pay" |
                   income$workclass == "Never-worked"] <- "Unemployed"
income$workclass[income$workclass == "State-gov" |
                   income$workclass == "Local-gov"] <- "SL-gov"
income$workclass[income$workclass == "Self-emp-inc" |
                   income$workclass == "Self-emp-not-inc"] <- "Self-employed"
income$workclass <- as.factor(income$workclass)
table(income$workclass)
table(income$marital.status)
income$marital.status <- as.character(income$marital.status)
income$marital.status[income$marital.status == "Married-AF-spouse" |
                        income$marital.status == "Married-civ-spouse" |
                        income$marital.status == "Married-spouse-absent"] <- "Married"
income$marital.status[income$marital.status == "Divorced" |
                        income$marital.status == "Separated" |
                        income$marital.status == "Widowed"] <- "Not-Married"
income$marital.status <- as.factor(income$marital.status)
table(income$marital.status)
table(income$native.country)
income[income == "?"] <- NA      #treat "?" entries as missing, as described above
income.data <- na.omit(income)   #drop rows with missing values
str(income.data)
####End Preprocess
##################################################
############################################################
####Partition Data 70% training
set.seed(77850)
train.index<- sample(c(1:dim(income.data)[1]),dim(income.data)[1]*0.7)
train.df = income.data[train.index,]
test.df = income.data[-train.index,]
dim(train.df)
dim(test.df)
library(rpart)
library(rpart.plot)
set.seed(77850)
#classification tree
default.ct = rpart(income~., data = train.df, method = "class")
#plot tree
prp(default.ct, type = 5, cex=0.5)
library(caret)
#cross-validated classification tree (definition repeated from the earlier section)
cv.ct = rpart(income~., data = train.df, method="class", cp=0.00001, minsplit=5, xval=5)
printcp(cv.ct)
min.cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"]
#prune back to the cp value with the lowest cross-validation error
pruned.ct = prune(cv.ct, cp = min.cp)
#deeper tree used for variable importance (assumed definition; not shown in the report)
deeper.ct = rpart(income~., data = train.df, method="class", cp=0.00001, minsplit=1)
varImp(cv.ct)
varImp(deeper.ct)
#Logistic Regression
training_logit = glm(income ~ age + workclass + fnlwgt + education + marital.status +
occupation + relationship + race + gender + capital.gain +
capital.loss + hours.per.week + native.country, data = train.df, family =
binomial(link="logit"))
with(training_logit,1-(deviance/null.deviance))
summary(training_logit)
m1 = stepAIC(training_logit)
m1
test.df$predicted_val <- predict(training_logit, newdata=test.df, type="response")
test.df$score[test.df$predicted_val >= 0.5] <- ">50K"
test.df$score[test.df$predicted_val < 0.5] <- "<=50K"
test.df$score <- factor(test.df$score, levels = levels(test.df$income))
confusionMatrix(test.df$score, test.df$income)
varImp(training_logit)
library(ROCR)
pred<-prediction(test.df$predicted_val,test.df$income)
perf <- performance(pred,"tpr","fpr")
plot(perf)
abline(a=0,b=1)
#Create AUC data
aucval<-performance(pred,"auc")
#Calculate AUC
logistic_auc<-as.numeric(aucval@y.values)
#Display the auc value
logistic_auc