
GROUP ASSIGNMENT: MACHINE LEARNING

TOPIC: Predicting Income from Census Data Using Machine Learning Techniques

Group Members:
Simran Saha
Srinidhi Narsimhan
Kalai Anbumani
Indushree AnandRaj
Problem Statement:
A census dataset is provided. The task is to evaluate various machine learning methods, determine
which works best for the given data, and predict whether a person earns more than 50K.
The following points are to be considered.
1. How to tackle missing values? Should these be tackled at all?
2. Is there any need to normalize data?
3. What can be said about class imbalance?
4. Which machine learning algorithm works the best without tuning?
5. How to determine the tuning parameters?

Solution:

Exploratory Data Analysis (EDA)


&#xF0B7; The dependent variable of this dataset is “Income”. This is a classification problem since
there are two classes, “<=50K” and “>50K”.
 Dataset Split

&#xF0B7; Split the dataset into train (70%) and test (30%) data.
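As an illustrative sketch outside Azure ML, the same 70/30 split could be done with scikit-learn. The target column name "income" is an assumption, not taken from the actual dataset schema.

```python
# Hypothetical sketch of the 70/30 train/test split described above.
# Assumes the label column is named "income"; stratifying preserves the
# class ratio in both partitions.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df, target="income", test_size=0.30, seed=42):
    """Split a dataframe into stratified train (70%) and test (30%) parts."""
    train, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[target]
    )
    return train, test
```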

Creating the model using Azure ML


PART 1
1. Clean Missing Data

&#xF0B7; We can see that the categorical variables below have a few missing values in the dataset.
o Workclass
o Occupation
o Native.country
&#xF0B7; Since the dataset is small, we cannot simply omit rows with missing data. The better
option is therefore to replace the missing values with the mode. We replace with the
mode because all the affected variables are categorical in nature.
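A pandas equivalent of the Clean Missing Data step could look like the sketch below; the column names match the variables listed above, but the exact dataframe layout is an assumption.

```python
# Hypothetical sketch: mode imputation for categorical columns,
# mirroring Azure ML's "Clean Missing Data" with a custom substitution.
import pandas as pd

def impute_with_mode(df, columns=("workclass", "occupation", "native.country")):
    """Replace missing values in categorical columns with the column mode."""
    df = df.copy()
    for col in columns:
        # mode() returns a Series; [0] picks the most frequent value
        df[col] = df[col].fillna(df[col].mode()[0])
    return df
```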
2. Visualizing imbalance in the Train dataset
&#xF0B7; The percentage of class imbalance in the train dataset has been visualized:
class 1 “<=50K” has 76% of the data
class 2 “>50K” has around 24% of the data in the train set
&#xF0B7; We see a risk from this modest imbalance in the data. However, we can proceed
with our validation, later apply SMOTE, and check whether the results
differ.
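The balance check, and a simple rebalancing step, can be sketched in pandas. Note this sketch uses plain random oversampling as a stand-in; SMOTE proper (which synthesizes new minority samples rather than duplicating rows) would come from a library such as imbalanced-learn.

```python
# Hypothetical sketch: measure class balance and oversample the minority
# class. Random duplication is a simpler stand-in for SMOTE.
import pandas as pd

def class_balance(df, target="income"):
    """Return the proportion of each class in the target column."""
    return df[target].value_counts(normalize=True)

def oversample_minority(df, target="income", seed=42):
    """Duplicate random minority-class rows until both classes are equal."""
    counts = df[target].value_counts()
    minority = counts.idxmin()
    extra = df[df[target] == minority].sample(
        counts.max() - counts.min(), replace=True, random_state=seed
    )
    return pd.concat([df, extra], ignore_index=True)
```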

3. Clip Values
&#xF0B7; Performed clipping at a 99th-percentile upper threshold for the variable below to
handle outliers. We can see an outlier: a sudden jump from 22000 to 99999,
which looks suspicious. Therefore, to avoid incorrect results in the model, we
clip the values.
o Capital.gain
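An equivalent of the Clip Values module in pandas might look like this sketch; the column name follows the variable listed above.

```python
# Hypothetical sketch: clip a numeric column at its 99th percentile,
# mirroring Azure ML's "Clip Values" with an upper-threshold percentile.
import pandas as pd

def clip_upper(df, column="capital.gain", quantile=0.99):
    """Cap a numeric column at its given upper quantile to tame outliers."""
    df = df.copy()
    upper = df[column].quantile(quantile)
    df[column] = df[column].clip(upper=upper)
    return df
```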
4. Feature Engineering
&#xF0B7; Instead of keeping two columns for capital gain and capital loss, we can take their
difference to get the net capital gain/loss, which reduces the number of
features in the dataset.
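This feature-engineering step is a one-liner in pandas; the new column name "net.capital" is an assumption for illustration.

```python
# Hypothetical sketch: replace capital.gain and capital.loss with one
# net column, reducing the feature count by one.
import pandas as pd

def add_net_capital(df):
    """Derive net capital gain/loss and drop the two source columns."""
    df = df.copy()
    df["net.capital"] = df["capital.gain"] - df["capital.loss"]
    return df.drop(columns=["capital.gain", "capital.loss"])
```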

5. Normalize Data
&#xF0B7; We need to scale all the numeric data in the dataset to get a more reliable model.
Hence, scaling is performed on the newly added net capital gain/loss column.
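Azure ML's Normalize Data module offers several methods; a min-max scaling sketch with scikit-learn is shown below as one plausible choice (the source does not state which method was used).

```python
# Hypothetical sketch: min-max scale numeric columns into [0, 1],
# one of the transformations offered by "Normalize Data".
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_columns(df, columns):
    """Min-max scale the given numeric columns in place on a copy."""
    df = df.copy()
    df[columns] = MinMaxScaler().fit_transform(df[columns])
    return df
```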
PART 2
1. The normalized dataset from Part 1 is used as input to Part 2.
2. While selecting the columns from the dataset, we can eliminate the “education.num” variable,
as it is strongly correlated with the “education” variable. Eliminating this
variable should have little impact on the result, since it is just a numeric
recoding of “education”.
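The claim that "education.num" is just a recoding of "education" can be verified before dropping the column, as in this sketch:

```python
# Hypothetical sketch: confirm "education.num" maps one-to-one onto
# "education", then drop it as redundant.
import pandas as pd

def drop_redundant(df, keep="education", drop="education.num"):
    """Drop a column that is a one-to-one recoding of another column."""
    # Each education level should correspond to exactly one numeric code.
    codes_per_level = df.groupby(keep)[drop].nunique()
    if not (codes_per_level == 1).all():
        raise ValueError(f"{drop} is not a one-to-one recoding of {keep}")
    return df.drop(columns=[drop])
```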

3. A plain, untuned model is created using the normalized dataset.


Comparison between different Model Performances

a) Two-Class Support Vector Model

Train Dataset
&#xF0B7; The SVM model gives the ROC curve below, with an accuracy of 83.8%.
Test Dataset
&#xF0B7; The same model evaluated on the test dataset gives its ROC curve with an accuracy of
82.8%.
b) Two-Class Logistic Regression Model

Train Dataset

&#xF0B7; This model gives a better ROC curve, with an accuracy of 84.7%.

Test Dataset

&#xF0B7; The same model on the test dataset gives an accuracy of 84.1%.
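The two-model comparison above can be sketched with scikit-learn equivalents of the Azure ML modules. The synthetic data here is only a placeholder, so the accuracies will not match the 83-84% figures reported for the census data.

```python
# Hypothetical sketch: fit a linear SVM and a logistic regression on the
# same split and compare test accuracies, mirroring the comparison above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def compare_models(X, y, seed=42):
    """Return test-set accuracy for an SVM and a logistic regression."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y
    )
    scores = {}
    for name, model in [
        ("svm", LinearSVC(random_state=seed)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ]:
        model.fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, model.predict(X_te))
    return scores
```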
4. Tune the Model

&#xF0B7; We need to tune the model to bring down the sum of squared errors (SSE).
<Yet to do>
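One common way to determine the tuning parameters (question 5 of the problem statement) is a cross-validated grid search, as in this sketch; the parameter grid and the choice of logistic regression are illustrative assumptions, since the tuning step is still to be done.

```python
# Hypothetical sketch: pick the regularization strength C by 5-fold
# cross-validated grid search, one way to answer "how to determine the
# tuning parameters".
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune_logreg(X, y):
    """Grid-search C for logistic regression; return best params and score."""
    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=5,
        scoring="accuracy",
    )
    grid.fit(X, y)
    return grid.best_params_, grid.best_score_
```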
