import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split  # used for the train/test split in In [27]
from sklearn.preprocessing import LabelEncoder
# Models
#import optuna
import xgboost as xgb
In [2]:
data = pd.read_csv(r"C:\Users\Pratik Rathod\Downloads\Fraud.csv")
data.shape
In [3]:
data.head()
Out[3]: step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlagged
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 step int64
1 type object
2 amount float64
3 nameOrig object
4 oldbalanceOrg float64
5 newbalanceOrig float64
6 nameDest object
7 oldbalanceDest float64
8 newbalanceDest float64
9 isFraud int64
10 isFlaggedFraud int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
Observation: The dataset has three datatypes (int, float, object), which means it contains both categorical
and numerical data.
In [5]:
data.describe()
In [6]:
sns.countplot(x = data['isFraud'])
In [7]:
plt.figure(figsize=(10,6))
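# numeric_only=True (pandas >= 1.5) skips the string columns, which cannot be correlated
sns.heatmap(data.corr(numeric_only=True), cmap='PiYG', linewidths=0.2, annot=True)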
Out[7]: <AxesSubplot:>
In [8]:
data['isFraud'].value_counts()
Out[8]: 0 6354407
1 8213
Name: isFraud, dtype: int64
Only 8,213 of the 6,362,620 transactions (about 0.13%) are fraud cases, so the data is highly imbalanced.
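The size of the imbalance can be worked out directly from the counts in Out[8]. The sketch below is only arithmetic on those two numbers; the variable names are illustrative. The last ratio is the kind of value often passed to XGBoost's `scale_pos_weight` parameter, which this notebook does not use.

```python
# Class counts copied from the value_counts() output above.
n_legit = 6354407
n_fraud = 8213
total = n_legit + n_fraud

fraud_share = n_fraud / total          # fraction of rows labelled fraud
baseline_accuracy = n_legit / total    # accuracy of always predicting "not fraud"
neg_pos_ratio = n_legit / n_fraud      # negatives per positive

print(f"fraud share:       {fraud_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"neg/pos ratio:     {neg_pos_ratio:.1f}")
```

Any accuracy figure reported later should be read against this baseline rather than against 0%.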
In [9]:
Fraud_done = data[data['isFraud'] == 1]
In [35]:
fig,ax=plt.subplots(3,2,figsize=(15,30))
sns.boxplot(x=data['isFraud'],y=data['amount'],ax=ax[0][0])
sns.boxplot(x=data['isFraud'],y=data['oldbalanceOrg'],ax=ax[0][1])
sns.boxplot(x=data['isFraud'],y=data['newbalanceOrig'],ax=ax[1][0])
sns.boxplot(x=data['isFraud'],y=data['oldbalanceDest'],ax=ax[1][1])
sns.boxplot(x=data['isFraud'],y=data['newbalanceDest'],ax=ax[2][0])
sns.boxplot(x=data['isFraud'],y=data['isFlaggedFraud'],ax=ax[2][1])
In [33]:
fig,ax=plt.subplots(3,2,figsize=(15,30))
sns.violinplot(x=data['isFraud'],y=data['amount'],ax=ax[0][0])
sns.violinplot(x=data['isFraud'],y=data['oldbalanceOrg'],ax=ax[0][1])
sns.violinplot(x=data['isFraud'],y=data['newbalanceOrig'],ax=ax[1][0])
sns.violinplot(x=data['isFraud'],y=data['oldbalanceDest'],ax=ax[1][1])
sns.violinplot(x=data['isFraud'],y=data['newbalanceDest'],ax=ax[2][0])
sns.violinplot(x=data['isFraud'],y=data['isFlaggedFraud'],ax=ax[2][1])
In [11]:
##lets check number of missing values in dataset
data.isnull().sum()
Out[11]: step 0
type 0
amount 0
nameOrig 0
oldbalanceOrg 0
newbalanceOrig 0
nameDest 0
oldbalanceDest 0
newbalanceDest 0
isFraud 0
isFlaggedFraud 0
dtype: int64
In [12]:
## lets find categorical data
categorical = []
for i in data.drop('isFraud', axis=1).columns:
    if data[i].dtypes == 'O':
        categorical.append(i)
print(categorical)
In [13]:
##lets find numerical data in one line
numerical = [i for i in data.drop('isFraud', axis=1).columns if data[i].dtypes != 'O']
numerical
Out[13]: ['step',
'amount',
'oldbalanceOrg',
'newbalanceOrig',
'oldbalanceDest',
'newbalanceDest',
'isFlaggedFraud']
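The same split into categorical and numerical columns can also be done with `select_dtypes`. This is a minimal sketch on a small made-up frame (the values are not from Fraud.csv), assuming that every non-numeric column should count as categorical.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the fraud data (values are made up).
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [100.0, 250.5, 75.25],
    "nameOrig": ["C1", "C2", "C3"],
    "isFraud": [0, 1, 0],
})

features = toy.drop(columns="isFraud")

# Numeric dtypes (int, float) on one side, everything else on the other.
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```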
In [14]:
##lets check number of unique values in categorical data
print(data['type'].nunique())
print(data['nameOrig'].nunique())
print(data['nameDest'].nunique())
5
6353307
2722362
In [15]:
##lets play with categorical data
sns.countplot(x=data['type'], palette='hot');
In [16]:
##lets observe univariate distribution of data
sns.distplot(data['step'])
In [17]:
sns.distplot(data['amount'])
In [18]:
sns.distplot(Fraud_done['amount']);
This distplot shows that fraud cases are concentrated at lower amounts: the smaller the amount, the more fraud cases there are.
In [19]:
data.dtypes
In [20]:
##lets convert categorical to numerical
##lets convert categorical data into indicator variable
dt = pd.get_dummies(data['type'], prefix='num', drop_first=True)
dt.head()
   num_CASH_OUT  num_DEBIT  num_PAYMENT  num_TRANSFER
0             0          0            1             0
1             0          0            1             0
2             0          0            0             1
3             1          0            0             0
4             0          0            1             0
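To see what `drop_first=True` does, here is a small sketch on a made-up `type` column. The five category names are an assumption based on the transaction types mentioned in the write-up further down. `dtype=int` is added only so the indicators print as 0/1 on recent pandas versions.

```python
import pandas as pd

# One made-up row per assumed transaction type.
toy = pd.DataFrame({"type": ["CASH_IN", "PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT"]})

# Categories are sorted alphabetically; drop_first removes the first one (CASH_IN),
# so a row of all zeros stands for CASH_IN.
dt = pd.get_dummies(toy["type"], prefix="num", drop_first=True, dtype=int)
print(dt)
```

Five categories therefore become four indicator columns, which matches the four columns in the output above.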
In [21]:
label = LabelEncoder()
data['nameOrig'] = label.fit_transform(data['nameOrig'])
data['nameDest'] = label.fit_transform(data['nameDest'])
In [22]:
data[['nameOrig', 'nameDest']]
   nameOrig  nameDest
0    757869   1662094
1   2188998   1733924
2   1002156    439685
3   5828262    391696
4   3445981    828919
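A short sketch of what `LabelEncoder` does to an ID column, using made-up account IDs. Note that refitting one encoder on a second column, as in In [21], overwrites the first mapping; a separate encoder per column (an assumption here, not what the notebook does) keeps `inverse_transform` usable for each.

```python
from sklearn.preprocessing import LabelEncoder

orig_ids = ["C30", "C10", "C20", "C10"]  # made-up account ids

enc_orig = LabelEncoder()
codes = enc_orig.fit_transform(orig_ids)

# Codes follow the sorted order of the unique ids, so they only identify
# an account; their numeric size carries no meaning.
restored = enc_orig.inverse_transform(codes)

print(list(enc_orig.classes_))
print(list(codes))
print(list(restored))
```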
In [23]:
new_data = pd.concat([data, dt], axis=1)
new_data = new_data.drop('type', axis=1)
In [24]:
new_data.head(8)
Out[24]: step amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud num_CASH
In [25]:
new_data.dtypes
As we can see, new_data no longer has any categorical columns, so we can start building the model.
In [26]:
X = new_data.drop('isFraud', axis=1)
y = new_data['isFraud']
In [27]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
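With so few fraud rows, a plain random split can leave the train and test sets with slightly different fraud rates. One option, sketched here on made-up data and not used in the notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, of which 10 (1%) are labelled as fraud.
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000, dtype=int)
y[:10] = 1

# stratify=y keeps the 1% fraud rate in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(y_tr), int(y_tr.sum()))
print(len(y_te), int(y_te.sum()))
```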
In [36]:
## lets use xgboost model
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_tr, y_tr)
Train_Accuracy :- 99.98917505681621%
In [38]:
xgb_model.feature_importances_
In [39]:
feature_important = xgb_model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
Out[39]: <AxesSubplot:>
From this plot we can say that "oldbalanceOrg" is the most important feature, followed by "amount" and "step".
Note that importance_type='weight' counts how many times a feature is used to split the data across all trees.
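The plotting code for this cell did not survive in the export. As a rough sketch of one way to rank the scores, the snippet below sorts a dictionary shaped like the one returned by `get_score`. The numbers are made up for illustration and are not the notebook's results.

```python
import pandas as pd

# Made-up scores in the same shape as get_booster().get_score(...):
# a dict mapping feature name -> score.
feature_important = {
    "step": 310.0,
    "amount": 420.0,
    "oldbalanceOrg": 515.0,
    "nameDest": 120.0,
}

# Sort from most to least important.
importance = pd.Series(feature_important, name="score").sort_values(ascending=False)
top3 = importance.head(3).index.tolist()

print(importance)
# In a notebook, importance.sort_values().plot(kind="barh") draws the bar chart.
```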
In [30]:
pred = xgb_model.predict(X_te)
In [31]:
from sklearn.metrics import accuracy_score, confusion_matrix
print(f'Test Accuracy:- {accuracy_score(y_te, pred)*100}%')
In [32]:
sns.heatmap(confusion_matrix(y_te, pred), linewidths=0.5, annot=True);
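The test accuracy is 99.981%. Because only about 0.13% of the transactions are fraud, a model that always predicts "not fraud" would already reach about 99.87%, so the model does beat that baseline, but the confusion matrix above (how many fraud cases were caught and how many were missed) is the better guide to how well it performs.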
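Accuracy can look high on imbalanced data even when most fraud cases are missed. The sketch below uses made-up labels (not the notebook's predictions) to show how precision and recall for the fraud class can be read alongside accuracy with scikit-learn.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 95 legitimate and 5 fraud transactions.
y_true = [0] * 95 + [1] * 5
# A made-up model that catches only one of the five fraud cases.
y_hat = [0] * 95 + [1, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)   # of the flagged rows, how many are fraud
rec = recall_score(y_true, y_hat)       # of the fraud rows, how many were flagged
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```

Here accuracy is high while recall is low, which is why the fraud-class metrics are worth reporting for `y_te` and `pred` as well.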
So basically my model follows this pattern: after cleaning the data, we perform feature engineering and convert all
categorical features to numerical features so that we can build and train the model. After the conversion, we
split the data into a train set and a test set. The train set is used to train the model, and the test set is used to check how the trained model
performs on unseen data.
Model elaboration: When a transaction is submitted to the model, it looks at the transaction's features, including its type (payment,
cash_out, cash_in, debit, transfer). These features are the input to the fraud detection algorithm, which
selects the variables that best split the dataset. After checking all the split conditions, the model scores the
transaction and, based on the combined result, marks it as "fraud" or "real".
Variables to be included in the model: We check the feature importance, and from this we can decide which features matter most.
Key factors that predict a fraudulent customer: Broadly, the customer's preferred method of payment, their ID, their previous orders, and so on.
Monitoring these key factors can help prevent fraud.
Fraud prevention: Fraud prevention and detection is a continuous, ongoing process, and the key to prevention is to detect fraud at the
point of origination, in real time. That is easier said than done. Machine learning (ML) and artificial intelligence (AI)
algorithms offer an effective way to detect and prevent fraud. Based on what they learn from historical patterns in the data, current
transactions can be analysed before lending companies decide to proceed with a particular application. Fraudsters also keep coming up
with new ways to bypass the checks in place. Hence, for any company, improving the algorithms by training them on newer fraud methods
is important for staying ahead. Reinforcement learning can continuously take
feedback from humans and become increasingly accurate over time, but it can be expensive for small and medium-sized
companies.
How to check whether the model is working: After training, to check that the model works correctly, we show it some data
that it has never seen before but for which we know the fraud outcomes. If the model detects the fraud correctly, we can deploy it
against the online business's transactions. We also run some automatic common-sense checks on recent data for which we do not
have fraud labels, to ensure the model will behave correctly when it is deployed.
There are certain fraud-like situations that the model should always pick up on. Some examples are: a customer placing lots of orders of
high-value goods, e.g. luxury alcohol; a high velocity of new payment methods, e.g. a customer adds 10 new payment cards in an hour;
a suspicious email address, e.g. one that does not match the account name or the name on the card.
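A common-sense check such as the payment-card example can be written as a simple rule and run next to the model. This is a minimal sketch, assuming a hypothetical list of timestamps at which a customer added cards; the function name, the limit of 10 and the one-hour window are illustrative choices, not part of the notebook.

```python
from datetime import datetime, timedelta


def too_many_new_cards(added_at, limit=10, window=timedelta(hours=1)):
    """Return True if `limit` or more cards were added within any `window`."""
    times = sorted(added_at)
    start = 0
    for end, t in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= limit:
            return True
    return False


base = datetime(2024, 1, 1, 12, 0)
burst = [base + timedelta(minutes=5 * i) for i in range(10)]  # 10 cards in 45 minutes
slow = [base + timedelta(hours=2 * i) for i in range(10)]     # 10 cards over 18 hours

print(too_many_new_cards(burst))
print(too_many_new_cards(slow))
```

Transactions from a customer who trips such a rule can then be compared with the model's prediction to confirm that the model also flags them.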