
Adult Income Prediction using Machine Learning Algorithms

Submitted by:
Sanchit Kaushal (2K19/BMBA/14)
Ritika (2K19/BMBA/13)
Research Questions

 Does education play a major role in salary and what is the minimum level
of education needed to ensure a high salary?
 Does marital status affect a person's salary?
 With all other factors being equal, does a person's sex determine whether they
earn a higher salary?
 Does a person's age play a significant role in determining salary?
 Is a person's race a significant factor in determining salary?
Overview

 The data set we are analysing is census data with a focus on income
of the population.
 Total size of the data: 32,561 rows and 15 columns (14 predictors plus the target column, income).
 Following is a row of data from the dataset:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family,
White, Male, 2174, 0, 40, United-States, <=50K

Goals: Identify how salary is affected by demographics.


Briefing with Dataset
The dataset has 14 predictor variables and the target column, income. The variables are as follows:
Import Libraries and Load Data

We will first load the Python libraries that we are going to use, as well as the adult
data. The last column will be our target variable, ‘income’, and the rest will be the
features.
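
As a minimal sketch of this step, the snippet below reads the data with pandas and separates the target from the features. The slides do not name the libraries used, so pandas is an assumption, and the column names follow the commonly used Adult census schema rather than anything stated here (only "fnlwgt", "education_num" and "income" appear in the slides). An in-memory string holding the sample row from the overview stands in for the real file.

```python
import io

import pandas as pd

# Column names are an assumption based on the commonly used Adult schema;
# the slides only name "fnlwgt", "education_num" and "income" explicitly.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# Stand-in for the real file: the single sample row quoted in the overview.
sample_csv = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

# skipinitialspace drops the blank that follows each comma in the raw data.
data = pd.read_csv(sample_csv, header=None, names=COLUMNS, skipinitialspace=True)

# The last column is the target; every other column is a feature.
target = data["income"]
features = data.drop(columns="income")

print(features.shape)
print(target.iloc[0])
```

With the real dataset, the path to the data file would replace `sample_csv`.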
Data Analysis

An initial exploration of the dataset, such as finding the number of records and
the number of individuals making more or less than 50K, will show us how many
individuals fit in each group.
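
The counts described above could be computed along the following lines. This is only a sketch: the five income labels are toy values invented for illustration, not figures from the census data, and the variable names are hypothetical.

```python
import pandas as pd

# Toy income labels invented for illustration; they are NOT the real data.
data = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]})

# Total number of records.
n_records = len(data)

# Number of individuals in each income group.
counts = data["income"].value_counts()
n_greater_50k = int(counts.get(">50K", 0))
n_at_most_50k = int(counts.get("<=50K", 0))

# Share of individuals making more than 50K.
percent_greater = 100 * n_greater_50k / n_records

print(f"Total records: {n_records}")
print(f"Making more than 50K: {n_greater_50k}")
print(f"Making at most 50K: {n_at_most_50k}")
print(f"Percentage making more than 50K: {percent_greater:.2f}%")
```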
Data Pre-processing
Data must be preprocessed before it can be used in Machine Learning algorithms. This preprocessing phase includes
cleaning and preparing the data.

 Missing Values:

 Missing values, which were denoted by “?”, were removed.

 na.omit() was used to remove those rows.
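
na.omit() is an R function. A rough pandas equivalent, offered as an assumption about how the same step might look in Python, is to turn the “?” markers into missing values and then drop the affected rows. The three rows below are toy values for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the second row carries the "?" placeholder.
data = pd.DataFrame({
    "age": [39, 50, 28],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": ["Adm-clerical", "?", "Sales"],
})

# Mark "?" as missing, then drop every row that has a missing value.
cleaned = data.replace("?", np.nan).dropna().reset_index(drop=True)

print(f"Rows before: {len(data)}, rows after: {len(cleaned)}")
```

In the raw file each value follows a comma and a space, so the placeholder may appear as " ?" unless the leading whitespace is stripped while loading.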
Data Pre-processing

 Data Modification

 Removed less significant columns (“fnlwgt” and “education_num”).

 Data Binning – Grouping multiple categories into a smaller number of bins.
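
The two modifications above might look as follows in pandas. The slides do not say which categories were merged, so the marital-status grouping below is purely hypothetical, as are the `marital_status` column name and the toy rows.

```python
import pandas as pd

# Toy rows for illustration only.
data = pd.DataFrame({
    "fnlwgt": [77516, 83311, 215646],
    "education_num": [13, 13, 9],
    "marital_status": ["Never-married", "Married-civ-spouse", "Divorced"],
})

# Drop the two columns that the slides describe as less significant.
data = data.drop(columns=["fnlwgt", "education_num"])

# Hypothetical binning: collapse seven marital-status categories into two.
MARITAL_BINS = {
    "Never-married": "Single",
    "Divorced": "Single",
    "Separated": "Single",
    "Widowed": "Single",
    "Married-civ-spouse": "Married",
    "Married-AF-spouse": "Married",
    "Married-spouse-absent": "Married",
}
data["marital_status"] = data["marital_status"].map(MARITAL_BINS)

print(data)
```

Any category missing from the mapping becomes a missing value under `map`, which makes unhandled categories easy to spot.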


Normalization

 It is recommended to perform some type of scaling on numerical features. Scaling
changes the values of numeric columns in the dataset to a common scale,
without distorting differences in the ranges of values.
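
The slides do not say which scaling method was used. As one possible choice, min-max scaling maps each numeric column to the range [0, 1] via x' = (x - min) / (max - min). The sketch below applies it to toy values, with column names assumed for illustration.

```python
import pandas as pd

# Toy numeric features for illustration only.
data = pd.DataFrame({
    "age": [20, 40, 60],
    "hours_per_week": [20, 40, 50],
})
numeric_cols = ["age", "hours_per_week"]

# Per-column minimum and maximum.
col_min = data[numeric_cols].min()
col_max = data[numeric_cols].max()

# Min-max scaling: every column ends up in the range [0, 1].
scaled = (data[numeric_cols] - col_min) / (col_max - col_min)

print(scaled)
```

A column with a single constant value would cause a division by zero here, so such columns would need to be handled separately.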
