FINAL DATA SCIENCE PROJECT: MODELING FOR A BANK

The purpose of this data science project is to show whether the methodology we apply to identify
which banknotes are genuine and which are forged works well. The analysis is based on the data
provided by the bank.

First, we wanted to see what the features of this data look like (see Fig [1]) so that we could
decide how to proceed.

Fig [1]. Variance vs Skewness

Once we knew what the features looked like, we decided to standardize the data to make it easier
to understand and, at the same time, to look for outliers. We tried to find the outliers using an
ellipse, treating points more than two standard deviations from the mean as outliers, as shown in
Fig [2].

Fig [2]. Variance vs Skewness-Standardized
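
The standardization and the two-standard-deviation ellipse described above can be sketched as follows. This is only a minimal illustration, not the code used in the project: the function names `standardize` and `outside_ellipse` are hypothetical, the sample values are made up (they are not the bank's data), and the ellipse is assumed to be axis-aligned and centered on the mean. The population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(x - mu) / sigma for x in values]


def outside_ellipse(x, y, center=(0.0, 0.0), radii=(2.0, 2.0)):
    """True if (x, y) lies outside the axis-aligned ellipse.

    After standardization the mean is 0 and one unit equals one standard
    deviation, so radii of 2 correspond to two standard deviations.
    """
    dx = (x - center[0]) / radii[0]
    dy = (y - center[1]) / radii[1]
    return dx * dx + dy * dy > 1.0


# Hypothetical sample values for illustration only.
variance = [3.6, 4.5, 3.9, -1.4, 0.3, -2.5, 2.0, -4.0]
skewness = [8.7, 8.2, -2.6, -4.9, 9.5, -3.0, 1.5, 12.0]

z_var = standardize(variance)
z_skew = standardize(skewness)

# Indices of the points flagged as outliers by the ellipse test.
outliers = [
    i for i, (a, b) in enumerate(zip(z_var, z_skew)) if outside_ellipse(a, b)
]
print(outliers)
```

An axis-aligned ellipse ignores any correlation between the two features; an ellipse oriented along the covariance of the data would be a different (and stricter) test.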

Unfortunately, we realized that this result is not very informative, because the data mix several
groups with different trends in the same graph. That is why we think K-means clustering can make
better use of the data: it separates the data into groups, and because we have more than one
feature, the method becomes more trustworthy and the trend of each group is easier to study.

The red dot shows the mean. It confirms that several groups are mixed in the same graph, and we
cannot say much from the mean alone, which is one more reason to use K-means clustering. With the
respective centroids, whose number depends on how many clusters we choose, it may become clearer
where the outliers are. We decided to use two clusters because the data show two different groups.
After that, we will be able to check whether we need more data for the method we are using.

Using K-means clustering, we tried to find the outliers again, taking into account that we now
have two centroids in addition to the mean. Knowing the locations of the two centroids and the
mean gives us an idea of how many outliers there are (see Fig [3]).

Fig [3]. Variance vs skewness Standardized Using K-means clustering
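
The clustering step with two clusters can be illustrated with a small implementation of Lloyd's algorithm, the standard K-means iteration. This is a sketch under stated assumptions, not the project's code: the function name `kmeans`, its parameters, and the sample points are hypothetical, and in practice a library implementation such as scikit-learn's `KMeans` would normally be used instead.

```python
import math
import random


def kmeans(points, k=2, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Hypothetical standardized (variance, skewness) pairs, for illustration only.
points = [
    (-1.2, -0.9), (-1.0, -1.1), (-0.8, -1.0),
    (1.0, 0.9), (1.1, 1.2), (0.9, 1.0),
]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The cluster numbers returned by K-means are arbitrary: cluster 0 is not guaranteed to be the genuine group, so the labels have to be matched to the real classes before any comparison.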

In Fig [3] we can see three ellipses: the orange ones were drawn around the two cluster centroids
and the red one around the mean. We did this to cover a larger area of the data when looking for
outliers. This graph suggests that, even after we finish the code, we may still have a
considerable error, but the error should decrease if the bank provides more data, which would
make the K-means clustering more robust.

Knowing that there will be some errors in the results, we kept working with the standardized
data, which reduces the margin of error because the distances between the data points are
smaller. With this data we made a prediction, trying to separate the genuine banknotes from the
forged ones using K-means clustering and distinguishing the groups by color (see Fig [4]):

Fig [4]. Variance vs Skewness Standardized Using K-Means Clustering


After this, we labeled the genuine banknotes 0 and the forged ones 1 and compared our prediction
with the real labels provided by the bank. The comparison gave an error of 12.8%, which means
that our prediction is 87.2% correct.

Number of successes: 1197 of 1372

Reliability: 87.2%
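
The comparison between predicted clusters and real labels can be sketched as below, together with a check of the reported counts. The helper name `clustering_accuracy` is hypothetical and the sketch assumes exactly two classes labeled 0 and 1. Because K-means assigns cluster numbers arbitrarily, the function scores both possible mappings and keeps the better one.

```python
def clustering_accuracy(true_labels, cluster_labels):
    """Return (hits, accuracy) for binary labels, allowing swapped cluster ids."""
    n = len(true_labels)
    direct = sum(t == c for t, c in zip(true_labels, cluster_labels))
    swapped = n - direct  # agreement if cluster ids 0 and 1 are exchanged
    hits = max(direct, swapped)
    return hits, hits / n


# Check of the counts reported above: 1197 correct out of 1372.
hits, total = 1197, 1372
accuracy = hits / total
print(f"accuracy = {accuracy:.1%}, error = {1 - accuracy:.1%}")
# prints: accuracy = 87.2%, error = 12.8%
```

With 1197 successes out of 1372, the accuracy is 87.2%, which matches the 12.8% error.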

In conclusion, the prediction does not give us enough reliability: it was only 87.2%. We believe
this is because we need more data from the bank; as more data becomes available to study, the
margin of error should become smaller and the prediction more accurate. Our advice is therefore
not to use this prediction, since a lot of money could be lost due to its errors. It could
improve if the bank keeps providing more information about the variance and the skewness, as well
as other features, to make the prediction more trustworthy.
