Observation: As We Can See, We Have Three Types of Datatypes, I.E. (Int, Float, Object), Which Means We Have Both Categorical and Numerical Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split #(Used for splitting the data)
from sklearn.model_selection import StratifiedKFold # (Used for cross-validation)

from sklearn.metrics import accuracy_score #(Used for accuracy)
from tqdm.notebook import tqdm # (Used for showing progress Bar)
# Models
#import optuna
import xgboost as xgb
In [2]:
data = pd.read_csv(r"C:\Users\Pratik Rathod\Downloads\Fraud.csv")
data.shape
Out[2]: (6362620, 11)
In [3]:
data.head()
Out[3]: step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlagged
0 1 PAYMENT 9839.64 C1231006815 170136.0 160296.36 M1979787155 0.0 0.0 0
1 1 PAYMENT 1864.28 C1666544295 21249.0 19384.72 M2044282225 0.0 0.0 0
2 1 TRANSFER 181.00 C1305486145 181.0 0.00 C553264065 0.0 0.0 1
3 1 CASH_OUT 181.00 C840083671 181.0 0.00 C38997010 21182.0 0.0 1
4 1 PAYMENT 11668.14 C2048537720 41554.0 29885.86 M1230701703 0.0 0.0 0
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 step int64
1 type object
2 amount float64
3 nameOrig object
4 oldbalanceOrg float64
5 newbalanceOrig float64
6 nameDest object
7 oldbalanceDest float64
8 newbalanceDest float64
9 isFraud int64
10 isFlaggedFraud int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
Observation: As we can see, there are three datatypes (int, float, object), which means
the dataset has both categorical and numerical columns.
In [5]:
data.describe()
Out[5]: step amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest isFraud isFlaggedFraud
count 6.362620e+06 6.362620e+06 6.362620e+06 6.362620e+06 6.362620e+06 6.362620e+06 6.362620e+06 6.362620e+06
mean 2.433972e+02 1.798619e+05 8.338831e+05 8.551137e+05 1.100702e+06 1.224996e+06 1.290820e-03 2.514687e-06
std 1.423320e+02 6.038582e+05 2.888243e+06 2.924049e+06 3.399180e+06 3.674129e+06 3.590480e-02 1.585775e-03
min 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.560000e+02 1.338957e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 2.390000e+02 7.487194e+04 1.420800e+04 0.000000e+00 1.327057e+05 2.146614e+05 0.000000e+00 0.000000e+00
75% 3.350000e+02 2.087215e+05 1.073152e+05 1.442584e+05 9.430367e+05 1.111909e+06 0.000000e+00 0.000000e+00
max 7.430000e+02 9.244552e+07 5.958504e+07 4.958504e+07 3.560159e+08 3.561793e+08 1.000000e+00 1.000000e+00
In [6]:
sns.countplot(x = data['isFraud'])
Out[6]: <AxesSubplot:xlabel='isFraud', ylabel='count'>
In [7]:
plt.figure(figsize=(10,6))
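# Correlate numeric columns only; recent pandas versions raise on object columns
sns.heatmap(data.select_dtypes('number').corr(), cmap='PiYG', linewidths=0.2, annot=True)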
Out[7]: <AxesSubplot:>
In [8]:
data['isFraud'].value_counts()
Out[8]: 0 6354407
1 8213
Name: isFraud, dtype: int64
Only 8,213 of the 6,362,620 transactions (about 0.13%) are fraud cases, i.e. the data is highly imbalanced.
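The imbalance can be quantified directly from the counts in Out[8]. The sketch below hard-codes those two counts and derives the fraud rate and the negative-to-positive ratio. Passing that ratio to XGBoost's `scale_pos_weight` parameter is a common suggestion for imbalanced binary problems, not something this notebook does.

```python
# Counts copied from Out[8] above
n_legit = 6354407
n_fraud = 8213

# Share of fraudulent transactions in the whole dataset
fraud_rate = n_fraud / (n_legit + n_fraud)

# Ratio of negatives to positives; a typical starting value for
# XGBoost's `scale_pos_weight` on imbalanced data (assumption: not used in the notebook)
neg_pos_ratio = n_legit / n_fraud

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Negative/positive ratio: {neg_pos_ratio:.1f}")
```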
In [9]:
Fraud_done = data[data['isFraud'] == 1]
In [35]:
fig,ax=plt.subplots(3,2,figsize=(15,30))
sns.boxplot(x=data['isFraud'],y=data['amount'],ax=ax[0][0])
sns.boxplot(x=data['isFraud'],y=data['oldbalanceOrg'],ax=ax[0][1])
sns.boxplot(x=data['isFraud'],y=data['newbalanceOrig'],ax=ax[1][0])
sns.boxplot(x=data['isFraud'],y=data['oldbalanceDest'],ax=ax[1][1])
sns.boxplot(x=data['isFraud'],y=data['newbalanceDest'],ax=ax[2][0])
sns.boxplot(x=data['isFraud'],y=data['isFlaggedFraud'],ax=ax[2][1])
Out[35]: <AxesSubplot:xlabel='isFraud', ylabel='isFlaggedFraud'>
In [33]:
fig,ax=plt.subplots(3,2,figsize=(15,30))
sns.violinplot(x=data['isFraud'],y=data['amount'],ax=ax[0][0])
sns.violinplot(x=data['isFraud'],y=data['oldbalanceOrg'],ax=ax[0][1])
sns.violinplot(x=data['isFraud'],y=data['newbalanceOrig'],ax=ax[1][0])
sns.violinplot(x=data['isFraud'],y=data['oldbalanceDest'],ax=ax[1][1])
sns.violinplot(x=data['isFraud'],y=data['newbalanceDest'],ax=ax[2][0])
sns.violinplot(x=data['isFraud'],y=data['isFlaggedFraud'],ax=ax[2][1])
Out[33]: <AxesSubplot:xlabel='isFraud', ylabel='isFlaggedFraud'>
In [11]:
##lets check number of missing values in dataset
data.isnull().sum()
Out[11]: step 0
type 0
amount 0
nameOrig 0
oldbalanceOrg 0
newbalanceOrig 0
nameDest 0
oldbalanceDest 0
newbalanceDest 0
isFraud 0
isFlaggedFraud 0
dtype: int64
In [12]:
## lets find categorical data
categorical = []
for i in data.drop('isFraud', axis=1).columns:
    if data[i].dtypes == 'O':
        categorical.append(i)
print(categorical)
['type', 'nameOrig', 'nameDest']
In [13]:
##lets find numerical data in one line
numerical = [i for i in data.drop('isFraud', axis=1).columns if data[i].dtypes != 'O']
numerical
Out[13]: ['step',
'amount',
'oldbalanceOrg',
'newbalanceOrig',
'oldbalanceDest',
'newbalanceDest',
'isFlaggedFraud']
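The two cells above can also be written with `select_dtypes`, which avoids comparing dtypes by hand. This is only an alternative sketch on a small made-up frame (`toy` and its values are illustrative, not rows from Fraud.csv); it treats every non-numeric column as categorical.

```python
import pandas as pd

# Tiny stand-in frame with the same mix of dtypes as the fraud data (values are made up)
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0],
    "isFraud": [0, 1, 1],
})

features = toy.drop("isFraud", axis=1)

# Non-numeric columns are treated as categorical, numeric ones as numerical
categorical = features.select_dtypes(exclude="number").columns.tolist()
numerical = features.select_dtypes(include="number").columns.tolist()

print(categorical)
print(numerical)
```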
In [14]:
##lets check number of unique values in categorical data
print(data['type'].nunique())
print(data['nameOrig'].nunique())
print(data['nameDest'].nunique())
5
6353307
2722362
In [15]:
##lets play with categorical data
sns.countplot(x=data['type'], palette='hot');
In [16]:
##lets observe univariate distribution of data
sns.distplot(data['step'])
Out[16]: <AxesSubplot:xlabel='step', ylabel='Density'>
In [17]:
sns.distplot(data['amount'])
Out[17]: <AxesSubplot:xlabel='amount', ylabel='Density'>
In [18]:
sns.distplot(Fraud_done['amount']);
From this distplot we can see that most fraud cases involve smaller amounts; the density of fraud drops as the amount grows.
In [19]:
data.dtypes
Out[19]: step int64
type object
amount float64
nameOrig object
oldbalanceOrg float64
newbalanceOrig float64
nameDest object
oldbalanceDest float64
newbalanceDest float64
isFraud int64
isFlaggedFraud int64
dtype: object
In [20]:
##lets convert categorical to numerical
##lets convert categorical data into indicator variable
dt = pd.get_dummies(data['type'], prefix='num', drop_first=True)
dt.head()
Out[20]: num_CASH_OUT num_DEBIT num_PAYMENT num_TRANSFER
0 0 0 1 0
1 0 0 1 0
2 0 0 0 1
3 1 0 0 0
4 0 0 1 0
In [21]:
label = LabelEncoder()
data['nameOrig'] = label.fit_transform(data['nameOrig'])
data['nameDest'] = label.fit_transform(data['nameDest'])
In [22]:
data[['nameOrig', 'nameDest']]
Out[22]: nameOrig nameDest
0 757869 1662094
1 2188998 1733924
2 1002156 439685
3 5828262 391696
4 3445981 828919
... ... ...
6362615 5651847 505863
6362616 1737278 260949
6362617 533958 108224
6362618 2252932 319713
6362619 919229 534595
6362620 rows × 2 columns
In [23]:
new_data = pd.concat([data, dt], axis=1)
new_data = new_data.drop('type', axis=1)
In [24]:
new_data.head(8)
Out[24]: step amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud num_CASH
0 1 9839.64 757869 170136.00 160296.36 1662094 0.0 0.0 0 0
1 1 1864.28 2188998 21249.00 19384.72 1733924 0.0 0.0 0 0
2 1 181.00 1002156 181.00 0.00 439685 0.0 0.0 1 0
3 1 181.00 5828262 181.00 0.00 391696 21182.0 0.0 1 0
4 1 11668.14 3445981 41554.00 29885.86 828919 0.0 0.0 0 0
5 1 7817.71 6026525 53860.00 46042.29 2247218 0.0 0.0 0 0
6 1 7107.77 1805947 183195.00 176087.23 2063363 0.0 0.0 0 0
7 1 7861.64 2999171 176087.23 168225.59 2314008 0.0 0.0 0 0
In [25]:
new_data.dtypes
Out[25]: step int64
amount float64
nameOrig int32
oldbalanceOrg float64
newbalanceOrig float64
nameDest int32
oldbalanceDest float64
newbalanceDest float64
isFraud int64
isFlaggedFraud int64
num_CASH_OUT uint8
num_DEBIT uint8
num_PAYMENT uint8
num_TRANSFER uint8
dtype: object
As we can see, new_data no longer has any categorical columns, so we can start building the model.
In [26]:
X = new_data.drop('isFraud', axis=1)
y = new_data['isFraud']
In [27]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
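With a target this imbalanced, a plain random split can leave the test set with slightly more or fewer fraud cases than expected. One option, shown here as a sketch on synthetic data rather than the notebook's own split, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The names `X_toy`, `y_toy` and the `*_s` variables are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 1% positives (made-up data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

# stratify=y_toy preserves the 1% positive rate in both train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("positives in train:", y_tr_s.sum())
print("positives in test:", y_te_s.sum())
```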
In [36]:
## lets use xgboost model
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_tr, y_tr)
print(f'Train_Accuracy :- {xgb_model.score(X_tr, y_tr)*100}%')
Train_Accuracy :- 99.98917505681621%
In [38]:
xgb_model.feature_importances_
Out[38]: array([0.02405486, 0.05231239, 0.00364918, 0.11402173, 0.19967377,
       0.00515594, 0.06449613, 0.19102415, 0.        , 0.27907968,
       0.        , 0.02434049, 0.04219166], dtype=float32)
In [39]:
feature_important = xgb_model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
# Use a separate name so the original `data` DataFrame is not overwritten
importance = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
importance.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))
Out[39]: <AxesSubplot:>
From this plot, which ranks features by how often they are used in splits (importance_type='weight'), we can say that
"oldbalanceOrg" is the most important feature, followed by "amount" and "step".
In [30]:
pred = xgb_model.predict(X_te)
In [31]:
from sklearn.metrics import accuracy_score, confusion_matrix
print(f'Test Accuracy:- {accuracy_score(y_te, pred)*100}%')
Test Accuracy:- 99.98090409296799%
In [32]:
sns.heatmap(confusion_matrix(y_te, pred), linewidths=0.5, annot=True);
As we can see, we got a test accuracy of 99.981%, so by this metric our model is performing well.
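Because only about 0.13% of the rows are fraud, a classifier that always predicts "not fraud" would already score roughly 99.87% accuracy (6354407 / 6362620). Precision and recall on the fraud class are therefore worth reading alongside the accuracy. The sketch below uses made-up labels, not the notebook's `y_te` and `pred`, to show how the confusion matrix maps to those two numbers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 2 fraud cases among 10 transactions
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of predicted frauds that are real frauds
precision = precision_score(y_true, y_pred)
# Recall: share of real frauds that the model caught
recall = recall_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the notebook, the same calls could be applied to `y_te` and `pred` from cells In [30] and In [31].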
So basically my model follows this pattern: after cleaning the data, we perform feature engineering and convert all
categorical features to numerical features so that we can build and train our model. After converting the features to numerical form, we
split the data into a train set and a test set. The train set is used for training the model, and the test set is used to check how the trained
model performs on unseen data.
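The notebook imports `StratifiedKFold` in In [1] but evaluates on a single train/test split. As a sketch of how cross-validation could complement that split, the example below runs five stratified folds on synthetic data. `LogisticRegression` is swapped in for the XGBoost model only to keep the example self-contained, and all data and variable names here are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 200 rows, 10% positives, positives shifted so they are learnable
rng = np.random.default_rng(0)
y_cv = np.array([0] * 180 + [1] * 20)
X_cv = rng.normal(size=(200, 2)) + y_cv[:, None] * 2.0

# Each fold keeps the same class proportions as the full data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_recalls = []
fold_positives = []
for train_idx, val_idx in skf.split(X_cv, y_cv):
    model = LogisticRegression()
    model.fit(X_cv[train_idx], y_cv[train_idx])
    val_pred = model.predict(X_cv[val_idx])
    fold_recalls.append(recall_score(y_cv[val_idx], val_pred))
    fold_positives.append(int(y_cv[val_idx].sum()))

print("positives per validation fold:", fold_positives)
print("mean recall across folds:", np.mean(fold_recalls))
```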
Model elaboration: When a transaction is submitted to the model, it looks at the transaction's information, such as its type (payment,
cash_out, cash_in, debit, transfer). All of this data is fed as input to our fraud detection algorithm. The algorithm then
selects the variables from the dataset that help in splitting it up. After checking all the conditions, the model gives a
result of "fraud" or "non-fraud" for the transaction. Based on the combined result, the model marks the transaction as "fraud" or "real".
Variables to be included in the model: We check the feature importances and use them to decide which features are most important.
Key factors that predict a fraudulent customer: Mainly the customer's preferred method of payment, their ID, their previous orders, etc. Based
on these key factors we can prevent fraud.
Fraud Prevention: Fraud prevention and detection is a continuous, ongoing process, and the key to prevention is to detect fraud right at the
stage of origination, in real time. However, this is easier said than done. Machine learning (ML) and artificial intelligence (AI)
algorithms offer an effective tool for fraud detection and prevention. Based on what is learned from historical patterns in the data, current
sets of transactions can be analysed before lending companies decide to proceed with a particular application. Fraudsters also come up
with newer ways to bypass the checks in place. Hence, for any company, improving the algorithms by training them on newer fraud methods
is important to stay ahead in the game. Reinforcement learning can continuously take
feedback from humans and become increasingly accurate over time. However, it can be expensive for small and medium-sized
companies.
How to check whether the model is working: After training, to check that the model is working correctly, we show it some data
which it has never seen before, but for which we know the fraud outcomes. If the model detects the fraud correctly, we can deploy it to be
used against the online business's transactions. We also run some automatic common-sense analysis on recent data for which we do not
have fraud labels, to ensure the model will behave correctly when it is deployed.
There are certain suspicious situations which the model should always pick up on. Some examples are: a customer placing lots of orders of
high-value goods, e.g. luxury alcohol; a high velocity of new payment methods, e.g. a customer adds 10 new payment cards in an hour;
a suspicious email address, e.g. a mismatch between the account name and the name on the card.
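One of the common-sense checks above, the high velocity of new payment methods, can be sketched as a simple sliding-window rule. This is a hypothetical helper, not part of the notebook or of any library: the function name `too_many_new_cards`, its parameters, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta

def too_many_new_cards(card_added_times, max_cards=10, window=timedelta(hours=1)):
    """Return True if `max_cards` or more cards were added within any `window`."""
    times = sorted(card_added_times)
    start = 0
    for end, t in enumerate(times):
        # Move the left edge forward until the span fits inside the window
        while t - times[start] > window:
            start += 1
        # Number of cards added inside the current window
        if end - start + 1 >= max_cards:
            return True
    return False

base = datetime(2024, 1, 1, 12, 0)
# 10 cards added 5 minutes apart (all within 45 minutes)
burst = [base + timedelta(minutes=5 * i) for i in range(10)]
# 10 cards added 2 hours apart (spread over 18 hours)
spread = [base + timedelta(hours=2 * i) for i in range(10)]

print(too_many_new_cards(burst))
print(too_many_new_cards(spread))
```

A rule like this could be run on recent unlabeled data to confirm that the model also flags the transactions the rule catches.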