
Data Preparation

The data set was checked for duplicate rows and missing values; none were found.

> sum(duplicated(banks))
[1] 0
> sum(!complete.cases(banks))
[1] 0

Although there are no missing values in the data set, many entries are recorded as 'unknown'. All 'unknown' entries were therefore converted to NA.
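This conversion can be sketched in base R as follows (the file name "bank.csv" is an assumption; use the actual source file):

```r
# Read the data; stringsAsFactors = FALSE keeps text columns as characters
banks <- read.csv("bank.csv", stringsAsFactors = FALSE)

# Replace every "unknown" entry with NA across the whole data frame
banks[banks == "unknown"] <- NA

# Count the resulting NA values per column
colSums(is.na(banks))
```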

From the above result, it can be observed that the contact and poutcome columns contain nearly 96% of the NA values. Deleting the rows with NA in these two columns would cause a significant loss of data, so the two columns were dropped from the analysis instead.

banks1 <- subset(banks, select = -c(contact, poutcome))

After dropping these two columns, all rows that still contained missing values were deleted, and the cleaned data set was rechecked.

The cleaned data were saved in CSV format, and the "banks.clean" data set was then renamed back to "banks" for convenience.
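These two steps can be sketched in base R as follows (the output file name is an assumption):

```r
# Drop the remaining rows that contain NA values and recheck
banks.clean <- na.omit(banks1)
sum(!complete.cases(banks.clean))   # expected to be 0 after cleaning

# Save the cleaned data, then rename it back to "banks" for convenience
write.csv(banks.clean, "banks_clean.csv", row.names = FALSE)
banks <- banks.clean
```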

DATA STRUCTURE OF CLEANED DATA SET

The cleaned data set (banks) has 43193 observations of 15 variables, the variables "contact" and "poutcome" having been dropped.

The response variable "y" is categorical with two levels, "yes" and "no". It needs to be converted into a binary variable ("yes" = 1, "no" = 0) for further analysis.
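One way to recode the response in base R (a minimal sketch using the column name from the data set):

```r
# Recode "yes"/"no" as 1/0
banks$y <- ifelse(banks$y == "yes", 1, 0)

# Confirm the recoding
table(banks$y)
```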

Splitting the Data Set

Before proceeding to the analysis, the data set was split into two parts: training data and testing data. The training data will be used to build the model, and the model will then be tested on the testing data for verification.
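A split can be sketched in base R as follows (the 70/30 ratio and the seed are assumptions; the report does not state the split used):

```r
# Split into training and testing sets
set.seed(123)                                   # for reproducibility
train_idx <- sample(seq_len(nrow(banks)), size = floor(0.7 * nrow(banks)))
train <- banks[train_idx, ]
test  <- banks[-train_idx, ]
```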
Testing for Imbalance

When tested for imbalance, the data set was found to be skewed towards 'no', which accounts for almost 88% of the observations.
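The class proportions can be checked with a one-liner (assuming the training set is named `train` and the recoded response is `y`):

```r
# Proportion of each class in the response variable
prop.table(table(train$y))
```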

To balance the data set for better predictive accuracy, the model should be trained on roughly equal proportions of 'yes' and 'no'. To achieve this, the data set must be under-sampled or over-sampled.

Under-sampling was chosen in the present case because there are sufficient data for the analysis and a smaller data set is easier to work with (total rows with over-sampling = 61096; total rows with under-sampling = 8014).
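Random under-sampling can be sketched in base R as follows (packages such as ROSE or caret provide equivalent helpers; the object names are assumptions):

```r
# Random under-sampling of the majority class
set.seed(123)
minority <- train[train$y == 1, ]
majority <- train[train$y == 0, ]

# Keep only as many majority rows as there are minority rows
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
train_bal <- rbind(minority, majority_down)

# The balanced set should now be roughly 50/50
prop.table(table(train_bal$y))
```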
