

1 Data Manipulation

The data set used in this report contains 1000 rows and 21 columns. It was generated from the information of 1000 bank customers, and the target output is to classify each customer as good or bad. The data description is given in the following figure (see Fig. 1).

Fig. 1. Original data set.

As can be seen from the data, it needs to be manipulated before the algorithms can be applied. The first step was to give each column a clear name that explains the feature. The second step was to convert the data completely to numeric format. Both steps were carried out with the help of the data description, and as a result the following data set (see Fig. 2) was obtained.

Fig. 2. Manipulated data set.
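A minimal sketch of these two steps, assuming pandas and a CSV export of the data; the file name and the column names shown here are illustrative, not the actual ones from the data description:

import pandas as pd

# load the raw data (hypothetical file name)
df = pd.read_csv("credit_data.csv")

# Step 1: give the columns clear, descriptive names (only a few shown)
df = df.rename(columns={"V1": "checking_status", "V2": "duration",
                        "V5": "credit_amount", "V21": "target"})

# Step 2: convert every categorical column to numeric codes
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes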



2 Data Cleaning
In this section, the data has been checked for incomplete or irrelevant values, and the following picture (see Fig. 3) shows that the data is clean.

Fig. 3. Data cleaning check.
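A short sketch of how such a check could look with pandas, assuming the manipulated data frame df from the previous section:

# count missing values per column; all zeros means the data is complete
print(df.isnull().sum())

# look for exact duplicate rows
print(df.duplicated().sum())

# dtypes and non-null counts per column
df.info()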

3 Exploratory Data Analysis


In this section, the distribution and correlation of the data have been explored. The distribution was examined for each column, both on its own and in relation to the other columns (see Fig. 4). In general, it can be concluded that the distributions of the age and credit amount features are right-skewed, while the current job column is roughly normal; the trusty column shows the number of people for whom the customer is liable to provide maintenance. Furthermore, the correlation results show that the highest correlation is between the credit amount and the duration of the credit (0.62, positive) (see Fig. 5). Additionally, the data can be considered unbalanced, as seen from the following picture (see Fig. 6).
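One possible way to produce these plots with pandas, matplotlib and seaborn; the label column name "target" is an assumption:

import matplotlib.pyplot as plt
import seaborn as sns

# per-column distributions (age and credit amount appear right-skewed)
df.hist(figsize=(12, 10))
plt.show()

# pairwise correlations, e.g. credit amount vs. duration of the credit
sns.heatmap(df.corr(), annot=True, fmt=".2f")
plt.show()

# class balance of good vs. bad customers
df["target"].value_counts().plot(kind="bar")
plt.show()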

4 Data Pre-Processing
In this section, standardization was applied to the data so that all features are brought to a common scale (zero mean and unit variance).

Fig. 4. Distribution

Fig. 5. Correlation

Fig. 6. Exploratory check of class balance.

Therefore, sklearn.preprocessing.StandardScaler was used as the tool. The data set was also split into training and test data, with 30 percent of the data used for testing and the remainder for training.
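A minimal sketch of this step; the label column "target" and the random seed are assumptions. Fitting the scaler on the training split only (and reusing it for the test split) is one common way to avoid information leaking from the test data:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["target"])
y = df["target"]

# 30 percent of the data for testing, the rest for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)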

5 Model Building

Here, to distinguish good and bad customers, Logistic Regression was first applied to the data. The results obtained on the training data are shown in the following picture (see Fig. 7).
Looking at this result, the accuracy is acceptable and there is no under-fitting. The next step was to apply the trained model to the test data, and the following results were obtained (see Fig. 8).
These results show that there is no over-fitting. The second algorithm applied to the data is Random Forest, implemented on the same training and test data. The training results are shown in the following picture (see Fig. 9), where the result appears almost perfect.
The test results are shown in the following picture (see Fig. 10): the test accuracy is almost the same as the logistic regression accuracy, which makes it clear that the Random Forest model is over-fitting.
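The two models could be trained and compared along these lines; the hyper-parameters shown are defaults or assumptions, not necessarily the ones behind the reported figures:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("LogReg train accuracy:", logreg.score(X_train, y_train))
print("LogReg test accuracy: ", logreg.score(X_test, y_test))

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("RF train accuracy:", rf.score(X_train, y_train))  # near-perfect -> over-fitting
print("RF test accuracy: ", rf.score(X_test, y_test))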

5.1 Cross Validation

The cross-validation method was applied to obtain a more reliable estimate: the data is split into several folds, each fold is used in turn for testing while the remaining folds are used for training, and the results are averaged over the folds. Cross validation was applied only to the logistic regression, and an accuracy of 0.75 was obtained.
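A sketch of cross-validation for the logistic regression; the number of folds (5) is an assumption. Wrapping the scaler and the model in a pipeline keeps the standardization inside each fold:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())  # reported accuracy was about 0.75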

Fig. 7. Logistic Regression train results.

Fig. 8. Logistic Regression test results.



Fig. 9. Random Forest train result.

Fig. 10. Random Forest test result.



5.2 Recursive Feature Elimination


In this subsection, the aim was to select the best features for the Logistic Regression and Random Forest algorithms. For that reason, the Recursive Feature Elimination algorithm was used and the following result was obtained (see Fig. 11). The result shows that 13 features were selected and the others were eliminated as weak features for Logistic Regression; for Random Forest the best number of features is also 13 (see Fig. 12).

Fig. 11. Recursive Feature Elimination on Logistic Regression
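A minimal sketch of recursive feature elimination with scikit-learn, selecting 13 features as in the reported result; the estimator settings are assumptions:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=13)
rfe.fit(X_train, y_train)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected, higher = eliminated earlier

# the same procedure can be repeated with RandomForestClassifier as the estimator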

5.3 Clustering
For the clustering problem, the K-means algorithm was applied. As a first step, Principal Component Analysis (PCA), one of the dimension reduction techniques, was implemented and two principal components were selected (see Fig. 13). After applying PCA, clustering was run with 1 to 10 clusters, and the elbow method was used to determine which number of clusters is best (see Fig. 14).
From the elbow result, we can see that the best cluster number is 3. For the evaluation, we used the Silhouette Coefficient, a metric for measuring the goodness of a clustering. The clustering results are shown in the following figures.
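The PCA, elbow and silhouette steps could be implemented roughly as follows; the random seed is an assumption:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# reduce the standardized features to two principal components
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_std)

# elbow method: inertia for 1 to 10 clusters
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42)
            .fit(X_pca).inertia_ for k in range(1, 11)]

# silhouette coefficient for a few candidate cluster counts
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_pca)
    print(k, silhouette_score(X_pca, labels))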

Fig. 12. Recursive Feature Elimination on Random Forest.



Fig. 13. PCA result.



Fig. 14. Elbow result.

Fig. 15. Cluster 2



Fig. 16. Cluster 3

Fig. 17. Cluster 4



Fig. 18. Cluster 5
