
DATA ANALYTICS AND LEARNING

GROUP PROJECT
CREDIT CARD FRAUD DETECTION – A CASE OF DEALING WITH
IMBALANCED DATA

Submitted by
2016PGP054 ANIMESH KUMAR
2016PGP090 BVS PAVAN KUMAR
2016PGP261 KRUPAL PATEL
2016PGP301 RAJAT MALLICK
2016PGP423 VASANTADA SRIKANTH

Contents

CREDIT CARD FRAUD
    Credit Card Fraud in India's PSU Banks
PROBLEM STATEMENT
    Description of the Data
LITERATURE REVIEW
    Dealing with Imbalanced Data
    Ignoring the Problem
    Undersampling the Majority Class
    Oversampling the Minority Class
METHODOLOGY
    Logistic regression
    Applying Logistic Regression
    Basic theoretical concepts of over- and under-sampling
    Methodology
    Parameters for choosing the Model
    Original Unbalanced Trained Data
        Regression Summary of original trained data
        Train data on original test sample
        Train data on undersampled test sample
        Train data on oversampled test sample
    Oversampled Trained Data
        Regression Summary of oversampled trained data
        Oversampled Train data on test sample
        Oversampled Train data on undersampled test sample
        Oversampled Train data on oversampled test data
    Undersampled Trained Data
        Regression Summary of undersampled trained data
        Undersampled Train data on original test sample
        Undersampled Train data on undersampled test data
        Undersampled Train data on oversampled test data
    Conclusion
    Random Forest Algorithm for Prediction
        How does it work?
        Advantages of Random Forest
        Disadvantages of Random Forest
        R implementation
IMPLEMENTATION OF RANDOM FOREST MACHINE LEARNING TECHNIQUE TO PREDICT FRAUD
    ANALYSIS 1: WITHOUT USING ANY METHODOLOGY TO ADJUST FOR IMBALANCED DATA
        Decrease in Gini Index when the variable is excluded
        Confusion Matrix
    ANALYSIS 2: USING UNDERSAMPLING OF MAJORITY DATA TO TACKLE THE IMBALANCED DATA
        Mean Decrease in Gini Index upon variable Exclusion
Conclusion
BIBLIOGRAPHY

CREDIT CARD FRAUD
Despite the fact that fraud only impacts a fraction of one percent of all purchases made with plastic,
according to data from the Federal Reserve, it represents one of the biggest concerns among
consumers. This can largely be attributed to the catastrophic impact of the worst-case scenarios that run
through people’s minds as well as the notion that regardless of how low the incidence of fraud may be,
no one wants to be the exception to the rule and find their hard-earned money siphoned away by
criminals.

What consumers generally do not know is that they are shielded from liability for unauthorized
transactions made with their credit cards via a combination of federal law and issuer/card network
policy. As a result, financial institutions and merchants assume responsibility for most of the money lost
as a result of fraud. For example, card issuers bore a 72% share of fraudulent losses in 2015 and merchants
and ATM acquirers assumed the other 28% of liability, according to the Nilson Report, October 2016.

The following statistics will give you a better sense of the credit card and debit card fraud landscapes as
well as how both have changed over the years.

• Global credit card and debit card fraud resulted in losses amounting to $21.84 billion during 2015.
Card issuers and merchants incurred 72% and 28% of those losses, respectively, with the following
transaction breakdown:
• Card issuer losses occur mainly at the point of sale from counterfeit cards, while merchant losses
occur mainly on card-not-present (CNP) transactions, such as when customers buy online or pick up in a store.
• During 2015, credit card and debit card gross fraud losses accounted for roughly 6.9¢ per $100 in
total volume, up from 5.7¢ per $100 in 2014.
• In 2015, the US accounted for 38.7% of the worldwide payment card fraud losses but generated only
22.9% of total volume.
• Retailers incur $580.5 million in debit card fraud losses and spend $6.47 billion on credit
and debit card fraud prevention annually.
• In 2011, 59% of the more than 37 billion debit card transactions were verified by
signature, yet 85% of all fraudulent debit card transactions involved signature “verification,” and
$1.15 billion of the total $1.35 billion in debit card fraud losses (85%) stemmed from signature
debit card transactions.
• Fraud against bank deposit accounts cost the industry $1.91 billion in losses in 2014, with debit card
fraud accounting for 66 percent of those losses.
• In addition to the estimated fraud loss amount, banks’ prevention measures stopped another $11
billion in fraudulent transactions.
• In 2015 the U.S. switched to EMV, which reduced existing-card fraud, but it also drove
a 113 percent increase in the incidence of new-account fraud, which now accounts for 20 percent of
all fraud losses.

Identity theft is a form of fraud that often results in unauthorized credit card and debit card transactions.
In 2016, the number of identity fraud victims reached a six-year peak of 15.4 million, and the fraud
losses resulting from identity theft amounted to $16 billion, a 6.7% increase from 2015.

The majority of identity theft victims (86%) experienced the fraudulent use of existing account
information, such as credit card or bank account information.

(Chart omitted: identity fraud victims by year; all figures in millions)

Credit Card Fraud in India's PSU Banks

Bank-wise Status of Fraud Cases in ATM/Credit/Debit Cards and Internet Banking
in Nationalized Banks of India, 2008 to 2011 (amounts in Rs. Lakh)

                                      2008          2009          2010           2011
Bank                               Cases Amount  Cases Amount  Cases Amount   Cases Amount
State Bank of India                   -     -      -     -       -     -        -     -
State Bank of Bikaner and Jaipur      4  12.32     2   6.66      2   0.15       2   3.49
State Bank of Hyderabad               -     -      -     -       -     -        3  53.05
State Bank of Indore                  1   0.48     1   0.80      -     -        -     -
State Bank of Mysore                  -     -      -     -       1   1.01       -     -
State Bank of Patiala                 -     -      -     -       -     -        4  80.45
State Bank of Travancore              1   0.62     -     -       -     -        6  10.30
Allahabad Bank                        -     -      -     -       -     -        1   3.30
Andhra Bank                           -     -      -     -       1  31.85       -     -
Bank of Maharashtra                   3   2.66     4   3.55      4   4.69       1   2.80
Bank of Baroda                       10   9.28     6   6.88      5  12.40       5  31.82
Bank of India                         4   7.93     5   5.21      2  14.61       1  42.23
Canara Bank                           3   4.84     6   1.39      -     -        1   0.60
Corporation Bank                      -     -      2   0.72      2   6.21       4   3.34
Central Bank of India                 1   0.22     2   0.84      2   2.15       -     -
Dena Bank                             1   3.01     -     -       1   2.07       1   0.53
Indian Bank                           -     -      -     -       1   1.41       -     -
Indian Overseas Bank                  2   0.39     2   0.39      3   1.44      10 176.03
IDBI Bank                            11  37.93    24  16.29     13  15.29      29  36.53
Oriental Bank of Commerce             -     -      -     -       1   4.75       -     -
Punjab National Bank                  1   4.77    33  50.15    108 248.64      22 127.63
Punjab and Sind Bank                  -     -      -     -       -     -        -     -
Syndicate Bank                        3   1.47     2   0.53      1   2.32       1   0.56
Union Bank of India                   8  22.30     5  10.45      7  19.22       1   0.38
UCO Bank                              -     -      2   0.58      1   1.60       -     -
United Bank                           -     -      1   1.37      -     -        -     -
Vijaya Bank                           2   9.02     -     -       -     -        -     -
Total                                55 117.24    97 105.81    155 369.81      92 573.04

PROBLEM STATEMENT

Description of the Data:
The dataset contains transactions made by credit cards in September 2013 by European
cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807
transactions. The dataset is highly unbalanced: the positive class (fraud) accounts for only
0.172% of all transactions and forms the minority class, while the negative class (non-fraud)
represents the remaining 99.828% of the population.
It contains only numerical input variables, which are the result of a Principal Component
Analysis (PCA) transformation. Unfortunately, due to confidentiality issues, the providing
organization cannot share the original features or more background information about the data.
Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that
have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds
elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the
transaction amount; this feature can be used for example-dependent cost-sensitive learning.
Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
Class = 1 means a credit card fraud occurred
      = 0 means no fraud occurred
The challenge is to predict fraud in the test samples. Because the dataset contains very few
fraudulent transactions, we have to deal with the imbalance using machine learning techniques
adapted for this purpose.

LITERATURE REVIEW
Dealing with Imbalanced Data:
Before going for feature selection, we need to cross-validate the data. During cross-validation we are
typically trying to understand how well our model can generalize, i.e. how well it can predict our outcome
of interest on unseen samples. If the data is unbalanced, there is a high probability during cross-validation
that the selected features are derived only from the training data rather than from pooled training and test data.

Imbalanced data must be pre-processed before being fed into a classifier, because classifiers are more
sensitive to detecting the majority class (not fraud) and less sensitive to the minority class (fraud).
If we do not take care of this issue, the classification output will be biased, in many cases resulting in
always predicting the majority class.

There are three main ways of dealing with the imbalance problem:

• Ignoring the problem.
• Undersampling the majority class.
• Oversampling the minority class.

Ignoring the Problem:
Building a classifier using the data as-is would in most cases give us a prediction model that always
returns the majority class. The classifier would be biased: sensitivity would be zero and specificity would
be 1 or close to 1, so no (or almost no) fraud cases would be correctly identified.

Undersampling the Majority Class:

One of the most common and simplest strategies for handling imbalanced data is to undersample the
majority class. While more advanced techniques have been proposed in the past, they typically did not
bring any improvement over simply selecting samples at random. So, for the analysis, we simply select
n samples at random from the majority class, where n is the number of samples in the minority class, and
use them during the training phase, after excluding the sample to be used for validation.
The difference from the previous case is that at each iteration we now randomly select n samples
from the majority class and use only those as training data, combined with all samples from the minority
class. Undersampling solves the class imbalance issue and increases the sensitivity of our
models. However, results can still be poor, because the classifiers are trained on few
samples: in general, the more imbalanced the dataset, the more samples are discarded when
undersampling, throwing away potentially useful information.
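
A minimal R sketch of this random undersampling step, assuming a data frame train with the 0/1 response column Class (the name ustraindata mirrors the data name that appears in the regression summary later in this report):

> # keep all fraud rows; draw an equally sized random subset of non-fraud rows
> set.seed(1)
> fraud    <- train[train$Class == 1, ]
> nonfraud <- train[train$Class == 0, ]
> keep     <- sample(seq_len(nrow(nonfraud)), nrow(fraud))
> ustraindata <- rbind(fraud, nonfraud[keep, ])
> table(ustraindata$Class)   # both classes now equal in size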

Oversampling the Minority Class:

Oversampling the minority class can result in overfitting problems if we oversample before cross-
validating. The easiest way to oversample is to re-sample the minority class, i.e. to duplicate the entries,
or to manufacture data that is exactly the same as what we already have. If we do so before cross-
validating, i.e. before we enter the leave-one-participant-out cross-validation loop, we will be training the
classifier using N-1 entries, leaving 1 out, but including among the N-1 one or more instances that are
exactly the same as the one being validated, defeating the purpose of cross-validation altogether.
Consider the issue step by step:

We start with the original dataset, where the minority class has two samples. We duplicate those
samples, and then we do cross-validation. At this point there will be iterations where the training and
validation sets contain the same sample, resulting in overfitting and misleading results. Here is how
this should be done instead:

First, we start cross-validating. This means that at each iteration we first exclude the sample to be used
as the validation set, and only then oversample the remainder of the minority class. In this toy example
there are only two minority samples, so three instances of the same sample are created. The difference
from before is that now we are clearly not using the same data for training and validation, so we obtain
more representative results. The same holds if we use other cross-validation methods, such as k-fold
cross-validation.
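
A minimal R sketch of oversampling inside the cross-validation loop (k-fold shown; dat and Class are illustrative names), so that duplicated minority rows never leak into the validation fold:

> set.seed(1)
> k <- 5
> folds <- sample(rep(1:k, length.out = nrow(dat)))
> for (i in 1:k) {
+   valid_i <- dat[folds == i, ]              # held-out fold, left untouched
+   train_i <- dat[folds != i, ]
+   minr    <- train_i[train_i$Class == 1, ]  # minority rows of this fold only
+   n_extra <- sum(train_i$Class == 0) - nrow(minr)
+   extra   <- minr[sample(nrow(minr), n_extra, replace = TRUE), ]  # duplicate until balanced
+   train_bal <- rbind(train_i, extra)
+   # fit the classifier on train_bal, evaluate on valid_i ...
+ }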

METHODOLOGY
Logistic regression
Logistic regression is a classification algorithm used to predict a binary outcome from a set of
independent variables. It is the analogue of linear regression for a categorical outcome variable: instead
of the outcome itself, we model the log of the odds as the dependent variable. It predicts the probability
of occurrence of an event by fitting data to a logit function.

The regression equation with the dependent variable enclosed in a link function is:

g(Y) = β0 + β1X

In logistic regression we are concerned with the probability of the outcome (success or failure). As
described above, g() is the link function, which is built from two quantities: the probability of
success (p) and the probability of failure (1 - p). p must meet the following criteria:

• It must always be positive (since p >= 0)
• It must always be less than or equal to 1 (since p <= 1)

Since the probability must always be positive, we put the linear equation in exponential form. For any
value of the slope and the independent variable, the exponential is never negative:

p = exp(β0 + β1X)

To make the probability less than 1, we divide by a quantity greater than the numerator:

p = exp(β0 + β1X) / [exp(β0 + β1X) + 1]

Writing Y = β0 + β1X, we can restate the probability as:

p = e^Y / (1 + e^Y)

If p is the probability of success, 1 - p is the probability of failure:

q = 1 - p = 1 / (1 + e^Y)

Therefore,

p / (1 - p) = e^Y

Taking the log of both sides:

log(p / (1 - p)) = Y

and substituting the value of Y:

log(p / (1 - p)) = β0 + β1X

Whenever the log odds ratio is positive, the probability of success is greater than 50%.
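
A quick numerical check of this derivation in R, using purely hypothetical coefficient values (β0 = -2, β1 = 0.5):

> b0 <- -2; b1 <- 0.5            # hypothetical coefficients
> x  <- 1.5                      # hypothetical predictor value
> y  <- b0 + b1 * x              # linear predictor = log-odds, log(p/(1 - p))
> p  <- exp(y) / (1 + exp(y))    # inverse logit maps the log-odds into (0, 1)
> all.equal(p, plogis(y))        # plogis() is base R's built-in inverse logit
[1] TRUE
> log(p / (1 - p))               # recovers the linear predictor y
[1] -1.25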

A typical logistic model plot (omitted here) is an S-shaped curve: the predicted probability never goes below 0 or above 1.

Performance of Logistic Regression Model

• AIC (Akaike Information Criterion) – The analogue of adjusted R² in logistic regression is
AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients.
Therefore, we always prefer the model with the minimum AIC value.
• Null Deviance and Residual Deviance – Null deviance indicates how well the response is predicted
by a model with nothing but an intercept; the lower the value, the better the model. Residual
deviance indicates how well the response is predicted by the model after adding the independent
variables; again, the lower the value, the better the model.

Confusion Matrix: This is simply a tabular representation of actual vs. predicted values, with four cells:
true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN). It helps us measure
the accuracy of the model and avoid overfitting.

From the confusion matrix, the accuracy of the model can be calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Specificity and sensitivity can likewise be derived:

Sensitivity (True Positive Rate) = TP / (TP + FN)
Specificity (True Negative Rate) = TN / (TN + FP)

Specificity and sensitivity play a crucial role in deriving the ROC curve.

ROC Curve: The Receiver Operating Characteristic (ROC) curve summarizes the model’s performance by
evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate
(1 - specificity) across classification thresholds. The area under the curve (AUC), also referred to as the
index of accuracy or concordance index, is a standard performance metric for the ROC curve: the higher
the area under the curve, the better the predictive power of the model. The ROC curve of a perfect
predictive model has TPR equal to 1 and FPR equal to 0, so it touches the top-left corner of the plot.
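
A minimal R sketch of computing the ROC curve and AUC, assuming the pROC package and a vector prob of predicted probabilities for the test set:

> library(pROC)                      # assumed package choice for ROC/AUC
> roc_obj <- roc(test$Class, prob)   # actual 0/1 labels vs. predicted probabilities
> auc(roc_obj)                       # area under the ROC curve
> plot(roc_obj)                      # a good model bows toward the top-left corner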

Applying Logistic Regression

As the required output takes only the values “1” and “0”, we use logistic regression for predicting the outcome.

However, the dataset is imbalanced, which biases the prediction model towards the more common class. A machine
learning model trained on such a dataset could predict the majority class for all samples and still
achieve very high accuracy.

Basic theoretical concepts of over- and under-sampling:

With under-sampling, we randomly select a subset of samples from the class with more instances to match
the number of samples coming from the class with fewer instances.

With oversampling, we randomly duplicate samples from the class with fewer instances, or we generate
additional instances based on the data that we have, so as to match the number of samples in each class.
While we avoid losing information with this approach, we also run the risk of overfitting our model, as
we are more likely to get the same samples in the training and in the test data, i.e. the test data is no
longer independent of the training data. This would lead to an overestimation of our model’s performance
and generalizability. To avoid this problem we have to perform proper cross-validation, as discussed in
the literature review.

Methodology:

• We deleted the “Time” column because it only records timestamps of transactions and is therefore
not relevant for prediction.
• We first divided the data into training and test samples in the ratio of 70% to 30%, in such a way
that both the training and the test dataset contain “1” and “0” in the same proportion as the actual data
(a sketch of such a stratified split follows).
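
A minimal R sketch of such a stratified 70/30 split, assuming the full data frame is called creditcard with the response column Class:

> set.seed(1)
> idx0 <- which(creditcard$Class == 0)
> idx1 <- which(creditcard$Class == 1)
> tr0  <- sample(idx0, round(0.7 * length(idx0)))   # 70% of the non-fraud rows
> tr1  <- sample(idx1, round(0.7 * length(idx1)))   # 70% of the fraud rows
> train <- creditcard[c(tr0, tr1), ]
> test  <- creditcard[-c(tr0, tr1), ]               # both sets keep the original class proportions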

Parameters for choosing the Model

• TPR = True Positive Rate = TP / (TP + FN)
• FPR = False Positive Rate = FP / (FP + TN)
• Precision = TP / (TP + FP)
• Accuracy = (TP + TN) / (TP + TN + FP + FN)

The most important measure for model testing here is the True Positive Rate, since the main objective of
the credit card company is to identify potentially fraudulent transactions. Because the negative class
dominates so heavily, the False Positive Rate stays low in almost all cases, so the area under the ROC
curve is high for every model and AUC does not discriminate well between them. Hence we rely on TPR,
which should be high for a good classification; a sketch computing these indicators follows.
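
A minimal R helper computing the four indicators from a 2x2 confusion matrix, assuming the matrix has actual classes in rows and predictions in columns (an illustration, not necessarily the exact code behind the tables below):

> rates <- function(cm) {
+   TP <- cm["1", "1"]; FN <- cm["1", "0"]
+   FP <- cm["0", "1"]; TN <- cm["0", "0"]
+   c(TPR = TP / (TP + FN), FPR = FP / (FP + TN),
+     Precision = TP / (TP + FP), Accuracy = (TP + TN) / sum(cm))
+ }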

Original Unbalanced Trained Data

The original unbalanced training data was used to fit the regression model, which was then applied to the
original, undersampled and oversampled test data. A threshold probability of 0.5 is used: above 0.5 the
result is classified as positive (1), and at or below 0.5 as negative (0).
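
A minimal sketch of this fit-and-threshold step, which produces the summary and confusion matrices reported below (variable names are illustrative):

> model <- glm(Class ~ ., family = binomial(link = "logit"), data = train)
> prob  <- predict(model, newdata = test, type = "response")
> pred  <- ifelse(prob > 0.5, 1, 0)                  # 0.5 threshold: above = fraud (1)
> cm    <- table(actual = test$Class, predicted = pred)
> rates(cm)                                          # TPR, FPR, Precision, Accuracy (helper above)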

Regression Summary of original trained data

Call:
glm(formula = Class ~ ., family = binomial(link = "logit"), data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-4.6361 -0.0285 -0.0188 -0.0120 4.3308

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.7941200 0.1838908 -47.823 < 2e-16 ***
V1 0.1046755 0.0508054 2.060 0.039368 *
V2 -0.0320574 0.0673303 -0.476 0.633988
V3 0.0331080 0.0574357 0.576 0.564321
V4 0.7165440 0.0865341 8.280 < 2e-16 ***
V5 0.0334783 0.0828204 0.404 0.686045
V6 -0.1054274 0.0982654 -1.073 0.283323
V7 -0.0989703 0.0823774 -1.201 0.229586
V8 -0.1585474 0.0408493 -3.881 0.000104 ***
V9 -0.4021638 0.1319277 -3.048 0.002301 **
V10 -0.7857285 0.1142441 -6.878 6.09e-12 ***
V11 -0.0977455 0.0945554 -1.034 0.301259
V12 0.1431801 0.1108893 1.291 0.196635
V13 -0.3588784 0.1036352 -3.463 0.000534 ***
V14 -0.6124387 0.0796923 -7.685 1.53e-14 ***
V15 0.0252410 0.1057628 0.239 0.811372
V16 -0.2131663 0.1473850 -1.446 0.148087
V17 0.0121342 0.0846862 0.143 0.886065
V18 -0.0513308 0.1525150 -0.337 0.736447
V19 0.1583511 0.1153635 1.373 0.169868
V20 -0.3822775 0.1028700 -3.716 0.000202 ***
V21 0.3510415 0.0791662 4.434 9.24e-06 ***
V22 0.5868520 0.1572519 3.732 0.000190 ***
V23 -0.1147954 0.0791407 -1.451 0.146913
V24 0.0390528 0.1837851 0.212 0.831723
V25 0.0701537 0.1615320 0.434 0.664069
V26 -0.0447095 0.2409267 -0.186 0.852780
V27 -0.8228093 0.1547636 -5.317 1.06e-07 ***
V28 -0.2696558 0.1105601 -2.439 0.014728 *
Amount 0.0006209 0.0004607 1.348 0.177760
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 5064.6 on 199363 degrees of freedom


Residual deviance: 1485.5 on 199334 degrees of freedom
AIC: 1545.5

Number of Fisher Scoring iterations: 12

Train data on original test sample

PREDICTED 0 PREDICTED 1
TRUE 0 85280 15
TRUE 1 69 79
INDICATORS
TPR 0.5337837838
FPR 0.0001758602
PRECISION 0.8404255319
ACCURACY 0.9990168885
AUC ROC = 0.969005977663198

Train data on undersampled test sample

PREDICTED 0 PREDICTED 1
TRUE 0 148 0
TRUE 1 69 79

INDICATORS
TPR 0.5337838
FPR 0.0000000
PRECISION 1.0000000
ACCURACY 0.7668919
AUC ROC = 0.966581446311177

Train data on oversampled test sample

PREDICTED 0 PREDICTED 1
TRUE 0 85280 15
TRUE 1 39764 45531

INDICATORS
TPR 0.5338062020

FPR 0.0001758602
PRECISION 0.9996706626
ACCURACY 0.7668151709
AUC ROC = 0.968985938708829

Oversampled Trained Data

Regression Summary of oversampled trained data


Deviance Residuals:
Min 1Q Median 3Q Max
-8.4904 -0.2663 0.0000 0.0000 2.8728

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3579045 0.0461324 -94.465 < 2e-16 ***
V1 0.8765438 0.0247981 35.347 < 2e-16 ***
V2 0.4441561 0.0458819 9.680 < 2e-16 ***
V3 0.4944401 0.0192758 25.651 < 2e-16 ***
V4 0.8515930 0.0124209 68.561 < 2e-16 ***
V5 0.6662332 0.0315119 21.142 < 2e-16 ***
V6 -0.4436235 0.0192677 -23.024 < 2e-16 ***
V7 -0.6700854 0.0389460 -17.205 < 2e-16 ***
V8 -0.5236369 0.0129832 -40.332 < 2e-16 ***
V9 -0.7576807 0.0161337 -46.963 < 2e-16 ***
V10 -1.1977440 0.0256842 -46.633 < 2e-16 ***
V11 0.5806893 0.0120980 47.999 < 2e-16 ***

V12 -1.3163894 0.0194586 -67.651 < 2e-16 ***
V13 -0.3867944 0.0083382 -46.388 < 2e-16 ***
V14 -1.6272329 0.0222727 -73.060 < 2e-16 ***
V15 -0.0374781 0.0088420 -4.239 2.25e-05 ***
V16 -0.9838441 0.0194873 -50.486 < 2e-16 ***
V17 -1.2489799 0.0285687 -43.719 < 2e-16 ***
V18 -0.4288587 0.0151115 -28.380 < 2e-16 ***
V19 0.4702386 0.0141385 33.260 < 2e-16 ***
V20 -1.0420572 0.0385966 -26.999 < 2e-16 ***
V21 0.1556415 0.0151704 10.260 < 2e-16 ***
V22 0.8409318 0.0173364 48.507 < 2e-16 ***
V23 0.6179437 0.0404265 15.286 < 2e-16 ***
V24 0.0025894 0.0158716 0.163 0.87
V25 0.3197462 0.0215415 14.843 < 2e-16 ***
V26 -0.3233321 0.0185333 -17.446 < 2e-16 ***
V27 -0.5115073 0.0318476 -16.061 < 2e-16 ***
V28 0.8840823 0.0453307 19.503 < 2e-16 ***
Amount 0.0085527 0.0004324 19.781 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 551801 on 398039 degrees of freedom


Residual deviance: 109620 on 398010 degrees of freedom
AIC: 109680

Number of Fisher Scoring iterations: 15

The oversampled training data was used to fit the regression model, which was then applied to the
original, undersampled and oversampled test data (a threshold probability of 0.5 is used).

Oversampling of the minority class is done until both classes are equal in number.
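
A minimal R sketch of this oversample-until-balanced step (illustrative names; the minority rows are drawn with replacement):

> set.seed(1)
> minr  <- train[train$Class == 1, ]
> majr  <- train[train$Class == 0, ]
> extra <- minr[sample(nrow(minr), nrow(majr), replace = TRUE), ]  # resampled minority rows
> ostraindata <- rbind(majr, extra)       # classes now equal in number
> table(ostraindata$Class)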

Oversampled Train data on test sample

PREDICTED 0 PREDICTED 1
TRUE 0 83321 1974
TRUE 1 11 137

INDICATORS
TPR 0.92567568
FPR 0.02314321
PRECISION 0.06489815
ACCURACY 0.97676814
AUC ROC = 0.97916903655462

Oversampled Train data on Undersampled test sample

PREDICTED 0 PREDICTED 1
TRUE 0 143 5
TRUE 1 11 137

INDICATORS
TPR 0.92567568
FPR 0.03378378
PRECISION 0.96478873
ACCURACY 0.94594595
AUC ROC = 0.97685354273192

Oversampled Train data on oversampled test data

PREDICTED 0 PREDICTED 1
TRUE 0 83321 1974
TRUE 1 6338 78957

INDICATORS
TPR 0.92569318
FPR 0.02314321
PRECISION 0.97560885
ACCURACY 0.95127499
AUC ROC = 0.979252549782175

Undersampled Trained Data

The undersampled training data was used to fit the regression model, which was then applied to the
original, undersampled and oversampled test data (a threshold probability of 0.5 is used).

Undersampling of the majority class is done until both classes are equal in number, as in the
undersampling sketch shown in the literature review.

Regression Summary of undersampled trained data

Call:
glm(formula = Class ~ ., family = binomial(link = "logit"), data = ustraindata)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.1917 -0.1497 0.0000 0.0000 3.1459

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.55504 49.74657 -0.011 0.991098
V1 -21.99797 12.83708 -1.714 0.086597 .
V2 25.93131 73.41989 0.353 0.723945
V3 -61.06206 33.17981 -1.840 0.065719 .
V4 41.26333 25.67212 1.607 0.107984
V5 -41.27241 12.44213 -3.317 0.000909 ***
V6 -15.45055 33.49246 -0.461 0.644573
V7 -85.21061 116.74452 -0.730 0.465458
V8 14.57713 19.88579 0.733 0.463532
V9 -45.29711 37.01372 -1.224 0.221031
V10 -104.61492 85.19355 -1.228 0.219459
V11 72.61748 70.99619 1.023 0.306385
V12 -129.75724 127.44428 -1.018 0.308607
V13 0.50353 3.25614 0.155 0.877106
V14 -135.26744 138.60404 -0.976 0.329101
V15 -2.63686 4.82466 -0.547 0.584697
V16 -120.59743 122.40128 -0.985 0.324495
V17 -219.31795 215.41722 -1.018 0.308627
V18 -81.52274 82.16018 -0.992 0.321080
V19 27.89769 33.64851 0.829 0.407053
V20 9.00432 22.60838 0.398 0.690429
V21 18.51552 7.37420 2.511 0.012044 *
V22 2.76856 14.56552 0.190 0.849250
V23 0.03258 43.99923 0.001 0.999409
V24 -2.04239 4.23502 -0.482 0.629620
V25 5.02012 20.07371 0.250 0.802522
V26 1.39778 5.09797 0.274 0.783943
V27 20.66474 17.02533 1.214 0.224838
V28 31.53641 53.96437 0.584 0.558956
Amount 0.02383 0.50833 0.047 0.962617
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 953.77 on 687 degrees of freedom


Residual deviance: 144.26 on 658 degrees of freedom
AIC: 204.26

Number of Fisher Scoring iterations: 25

Undersampled Train data on Original Test sample

PREDICTED 0 PREDICTED 1
TRUE 0 81700 3595
TRUE 1 10 138

INDICATORS
TPR 0.93243243
FPR 0.04214784
PRECISION 0.03696759
ACCURACY 0.95780813
AUC ROC = 0.979094414773012

Undersampled Train data on undersampled test data

PREDICTED 0 PREDICTED 1
TRUE 0 142 6
TRUE 1 10 138

INDICATORS
TPR 0.93243243
FPR 0.04054054
PRECISION 0.95833333
ACCURACY 0.94594595

AUC ROC = 0.978223155588021

Undersampled Train data on oversampled test data

PREDICTED 0 PREDICTED 1
TRUE 0 81700 3595
TRUE 1 5837 79458

INDICATORS
TPR 0.93156691
FPR 0.04214784
PRECISION 0.95671439
ACCURACY 0.94470954
AUC ROC = 0.979092763028719

Randomly Drawn Samples from Test and Actual Data

We took random samples from the test data and ran logistic regression using the oversampled,
undersampled and normal (original) training data.

Randomly Drawn Samples from Test Data

(Size = 10000, 20000, ..., 90000)

True Positive Rates:

Sample Size    Normal    Oversampled    Undersampled
10000          0.816     1.000          1.000
20000          0.741     0.976          0.965
30000          0.691     0.957          0.957
40000          0.702     0.962          0.962
50000          0.757     0.973          0.973
60000          0.693     0.939          0.945
70000          0.680     0.931          0.943
80000          0.673     0.934          0.944
90000          0.664     0.938          0.948

In all cases, the undersampled training data gives a better TPR than the original training data.

Randomly Drawn Samples from Whole Data

(Size = 100000, 110000, ..., 280000)

True Positive Rates:

Sample Size    Normal      Oversampled    Undersampled
100000         0.928251    0.9282511      0.9372197
110000         0.916318    0.916318       0.9288703
120000         0.906883    0.9068826      0.9271255
130000         0.908046    0.908046       0.9310345
140000         0.905303    0.905303       0.9318182
150000         0.911263    0.9112628      0.9351536
160000         0.918539    0.9185393      0.9382022
170000         0.919444    0.9194444      0.9388889
180000         0.917582    0.9175824      0.9395604
190000         0.919138    0.9191375      0.9407008
200000         0.919481    0.9194805      0.9376623
210000         0.918782    0.9187817      0.9365482
220000         0.916256    0.9162562      0.9334975
230000         0.916468    0.9164678      0.9331742
240000         0.919909    0.9199085      0.9359268
250000         0.917031    0.9170306      0.9344978
260000         0.917198    0.9171975      0.9341826
270000         0.918919    0.9189189      0.9355509
280000         0.915984    0.9159836      0.9303279

In all cases, the undersampled training data gives the best TPR.

Conclusion
Running logistic regression on all three types of training data (original, undersampled and
oversampled) and predicting the original, oversampled and undersampled test data with all three models,
we obtain the highest True Positive Rate from the models trained on undersampled data.

We cannot afford to miss fraudulent transactions, while wrongly flagging some non-fraudulent
transactions as fraudulent is not nearly as costly as missing actual fraud.

As our True Positive Rate increases, our False Positive Rate also increases, which means we are
detecting more “true fraud” transactions at the cost of detecting more “false fraud” transactions.

Random Forest Algorithm for Prediction


Random Forest is sometimes jokingly called a panacea for all data science problems: when you
can’t think of any algorithm (irrespective of the situation), use random forest.

Random Forest is a versatile machine learning method capable of performing both regression and
classification tasks. It also undertakes dimensionality reduction, treats missing values and outliers,
and performs other essential steps of data exploration fairly well. It is a type of ensemble
learning method, where a group of weak models combine to form a powerful model.


How does it work?

In Random Forest, we grow multiple trees, as opposed to the single tree of a CART model. To classify a
new object based on its attributes, each tree gives a classification, and we say the tree “votes” for that
class. The forest chooses the classification having the most votes over all the trees in the forest; in the
case of regression, it takes the average of the outputs of the different trees.

Each tree is planted and grown as follows:

1. If the number of cases in the training set is N, a sample of N cases is taken at random, but with
replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m < M is specified such that at each node, m variables are
selected at random out of the M and the best split on these m is used to split the node. The value of
m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible; there is no pruning.
4. New data is predicted by aggregating the predictions of the ntree trees (majority vote for
classification, average for regression).

Advantages of Random Forest

• The algorithm can solve both types of problems, i.e. classification and regression, and does a decent
estimation on both fronts.
• One of the most attractive benefits of Random Forest is its power to handle large datasets with
high dimensionality. It can handle thousands of input variables and identify the most
significant ones, so it is also considered a dimensionality reduction method. Further,
the model outputs variable importance, which can be a very handy feature.
• It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data is missing.
• It has methods for balancing errors in datasets where classes are imbalanced.
• The above capabilities can be extended to unlabeled data, leading to unsupervised
clustering, data views and outlier detection.
• Random Forest samples the input data with replacement, which is called bootstrap
sampling. Roughly one third of the data is not used for training a given tree and can be used for
testing; these are called the out-of-bag samples. The error estimated on these out-of-bag samples is
known as the out-of-bag error. Studies of out-of-bag error estimates give evidence that the
out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore,
using the out-of-bag error estimate removes the need for a set-aside test set.

Disadvantages of Random Forest

• It does a good job at classification but is not as good for regression problems, since it does not
give precise continuous predictions. In the case of regression, it cannot predict beyond the
range of the training data, and it may over-fit datasets that are particularly noisy.
• Random Forest can feel like a black-box approach to statistical modelers: you have very little
control over what the model does. At best, you can try different parameters and random seeds.

R implementation

> library(randomForest)

> # Fitting the model: the response must be a factor for classification
> fit <- randomForest(as.factor(Class) ~ ., data = train, ntree = 500)

> print(fit)   # OOB error estimate and confusion matrix

> # Predict output on the test data
> predicted <- predict(fit, newdata = test)

IMPLEMENTATION OF RANDOM FOREST MACHINE LEARNING TECHNIQUE TO PREDICT FRAUD

ANALYSIS 1: WITHOUT USING ANY METHODOLOGY TO ADJUST FOR IMBALANCED DATA
Decrease in Gini Index when the variable is excluded
Mean Decrease Gini
V1 11.539113
V2 10.02801
V3 10.899461
V4 12.190307
V5 6.662457
V6 11.290116
V7 23.160354
V8 6.558336
V9 35.171088
V10 45.1634
V11 51.916822
V12 104.073197
V13 7.130618
V14 73.627923
V15 11.292658
V16 22.672061

V17 139.495277
V18 19.90186
V19 7.629033
V20 8.736183
V21 6.779229
V22 7.165882
V23 8.632115
V24 5.498
V25 6.150093
V26 12.32182
V27 6.345539
V28 6.400512
Amount 7.580834

The higher the decrease in the Gini Index upon exclusion, the more important the variable is for splitting
the nodes into homogeneous sets.

The variables with the highest Mean Decrease Gini are of the greatest importance here (e.g. V17, V12, V14, V11).

The same pattern is visible in the variable importance plot (omitted here): V17, V12, V14 and V11 stand out.
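
A minimal sketch of how such an importance table and plot can be obtained from a fitted forest fit using randomForest's built-in accessors:

> imp <- importance(fit)     # matrix containing the MeanDecreaseGini column
> imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
> varImpPlot(fit)            # the variable importance plot referred to above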

Confusion Matrix

                 actual
predictions        0      1
          0    85283     37
          1       12    111

We can infer that the accuracy in this case is

Accuracy = (sum of diagonals) / (total sum) = (85283 + 111) / (85283 + 37 + 12 + 111) = 99.94%

The number of frauds detected is the [2,2] element of the confusion matrix = 111.

Hence the percentage of the total number of frauds predicted = 111 / (111 + 37) = 75%.

Area under the curve = 0.9249. The closer the area under the curve is to 1, the better the accuracy of the
model.

ROC Curve: if the ROC curve lies above and to the left of the 45-degree diagonal, it indicates a good model.

ANALYSIS 2: USING UNDERSAMPLING OF MAJORITY DATA TO TACKLE THE IMBALANCED DATA
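
A minimal sketch of this analysis, reusing the balanced set ustraindata built by undersampling earlier; the commented alternative uses randomForest's strata/sampsize arguments to draw a balanced bootstrap sample for every tree (the balanced-forest idea of Chen, Liaw and Breiman cited in the bibliography) instead of discarding majority rows up front:

> fit_us <- randomForest(as.factor(Class) ~ ., data = ustraindata, ntree = 500)
> # alternative: balanced per-tree sampling on the full training data
> n1 <- sum(train$Class == 1)
> fit_bal <- randomForest(as.factor(Class) ~ ., data = train, ntree = 500,
+                         strata = as.factor(train$Class), sampsize = c(n1, n1))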
Mean Decrease in Gini Index upon variable Exclusion
Mean Decrease Gini

V1 1.88242
V2 3.710379
V3 11.445979
V4 37.735666
V5 2.69995
V6 5.057464
V7 9.277644
V8 2.874527
V9 3.845098
V10 37.916979
V11 32.872831
V12 44.372253
V13 3.718404
V14 47.450757
V15 2.471948

V16 10.103382
V17 43.592688
V18 1.727912
V19 4.836451
V20 2.53403
V21 11.662863
V22 3.318836
V23 3.41832
V24 1.9765
V25 2.240291
V26 2.154664
V27 3.014254
V28 1.9038
Amount 3.799505

Confusion Matrix

                 actual
predictions        0      1
          0    83482     19
          1     1813    129

Accuracy = 0.97

Number of Frauds Detected = 129

Percentage of total frauds detected = 87.16%

Area Under the Curve = 0.9782

ROC Curve (plot omitted)

Conclusion:
Here we can relate to what seems to be the underlying philosophy of machine learning: we trade off some
bias to get a better prediction model. Our model is not as robust as the traditional one, but by applying a
deliberate sampling bias we are able to predict better. To summarize and instantiate this statement:

From the logistic regression application

• The models trained on undersampled data achieved the highest True Positive Rate across all three
kinds of test sets.

From the random forest application

• The ROC, AUC and accuracy of the model decreased when we biased our sample by undersampling the
majority class.
• But at the same time, we were able to predict better, as the percentage of frauds predicted out of
the total number of frauds increased:
1. We predicted around 75% of the total frauds without creating any bias.
2. Once we biased our samples, we achieved an 87.16% prediction rate.
• This is essentially the robustness vs. prediction accuracy trade-off and is in line with most
machine learning principles.

BIBLIOGRAPHY

https://www.medianama.com/2016/06/223-india-has-24-5m-credit-cards-661-8m-debit-cards-in-march-2016/

https://www.indiastat.com/Searchresult.aspx

https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation

http://www.creditcards.com/credit-card-news/credit-card-security-id-theft-fraud-statistics-1276.php

https://wallethub.com/edu/credit-debit-card-fraud-statistics/25725/

Chen, Chao; Liaw, Andy; Breiman, Leo. “Using Random Forest to Learn Imbalanced Data.” Biometrics Research, Merck Research Labs. http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
