
Unbalanced Credit Card Fraud Detection

Yazan Obeidi
Dept. of Systems Design Engineering
University of Waterloo
Waterloo, Canada
yazan.obeidi@uwaterloo.ca

Abstract — Credit-card fraud is a growing problem worldwide which costs upwards of billions of dollars per year. Advanced classification methods provide the ability to detect fraudulent transactions without disturbing legitimate transactions or unnecessarily spending resources on fraud forensics for financial institutions. However, some key risks exist, namely imbalance learning and concept drift. In this paper, current state-of-the-art techniques for handling class imbalance at the data level and algorithm level are examined on European credit card transactions collected over a period of two days. The results are compared to baselines for three algorithms which have been suggested in previous research to have high performance on the task of fraud detection: linear support vector machine, random forest, and multi-layer perceptron. Results suggest that sophisticated generative sampling methods can easily fail to generalize the minority class when extreme class imbalance exists, leading to worse performance than more conventional class imbalance techniques such as cost-based methods.

Keywords — unbalanced fraud detection; adaptive synthetic sampling; random forest; support vector machine; multi-layer perceptron

I. INTRODUCTION

In 2015, credit, debit and prepaid cards generated over $31 trillion in total volume worldwide, with fraud losses reaching over $21 billion [1]. In that same year there were over 225 billion purchase transactions, a figure that is projected to surpass 600 billion by 2025 [2]. Fraud associated with credit, debit, and prepaid cards is a significant and growing issue for consumers, businesses, and the financial industry.

Historically, software solutions used by issuers to combat credit card fraud closely followed progress in classification, clustering and pattern recognition [3, 4]. Today, most Fraud Detection Systems (FDS) continue to use increasingly sophisticated machine learning algorithms to learn and detect fraudulent patterns in real time, as well as offline, with minimal disturbance of genuine transactions [5].

Generally, FDS need to address several inherent challenges related to the task: extreme unbalancedness of the dataset, as frauds represent only a fraction of total transactions; distributions that evolve due to changing consumer behaviour; and assessment challenges that come with real-time data processing [6]. For example, difficulties arise when learning from an unbalanced dataset, as many machine intelligence methods are not designed to handle extremely large differences between class sizes [5]. Dynamic trends within the data also require robust algorithms with high tolerance for concept drift in legitimate consumer behaviours [7].

Although specialized techniques exist that may handle large class imbalance, such as outlier detection, fuzzy inference systems and knowledge-based systems [8, 9], current state-of-the-art research suggests that conventional algorithms may in fact be used with success if the data is sampled to produce equivalent class sizes [10, 11, 12]. The newest sampling techniques create synthetic samples using a nearest-neighbour algorithm such as k-Nearest-Neighbours (kNN). This is valuable because it means a wider range of, in some cases off-the-shelf, typical classification algorithms may be used, which mitigates existing algorithmic limitations due to extreme class imbalance. Not only can fraud detection capabilities potentially increase due to a larger scope of potential methods, the cost of development can be decreased due to reduced reliance on highly specialized niche methods, expert systems, and continued research into algorithmic methods which handle class imbalance directly.

II. MATERIAL AND METHODS

A. Dataset

The credit card fraud dataset used in this paper is obtainable from Kaggle.com [13] and contains a subset of online European credit card transactions made in September 2013 over a period of two days, consisting of a highly imbalanced 492 frauds out of 284,807 total transactions [5, 6, 7, 14, 15].

For confidentiality the dataset is simply provided as 28 unlabeled columns resulting from a PCA transformation. Additionally, there are three labeled columns: Class, Time, and Amount.

Note that the dataset is highly imbalanced towards legitimate transactions, which have the label "0", visible in Fig. 1.

Fig. 1: Class imbalance in the Kaggle credit-card fraud dataset
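As a quick illustration of the imbalance shown in Fig. 1, the class counts can be checked directly from the Kaggle CSV. The following is a minimal sketch, assuming the file has been downloaded to data/creditcard.csv as in the Appendix:

import pandas as pd

# Load the Kaggle credit-card fraud dataset (path as used in the Appendix)
data = pd.read_csv("data/creditcard.csv")

# Count legitimate ("0") versus fraudulent ("1") transactions
counts = data["Class"].value_counts()
print(counts)  # expected per [13]: 284,315 legitimate and 492 fraudulent
print("fraud fraction: %.3f%%" % (100.0 * counts[1] / len(data)))  # roughly 0.173%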
B. Methods

1) Imbalance Learning: Standard decision trees such as ID3 and C4.5 use information gain as the splitting criterion for learning, which results in rules biased towards the majority [3]. Research also shows that imbalanced datasets pose a problem for kNN, neural networks (NN), and support vector machines (SVM) [3, 16, 17]. This problem is most pronounced when the two classes overlap, as in the case of the Kaggle dataset; the majority of machine intelligence algorithms are not suited to handle both unbalanced and overlapped class distributions [3, 10, 11].

Fortunately, particular algorithms exist that can take class imbalance into account. In addition, there are techniques at the data level and algorithm level which can reduce the negative effects of these biases.

2) Concept Drift: Credit card fraud is prone to concept drift as consumer trends change due to changing preferences, seasonality and new products, as well as evolving fraud attack strategies [7]. The net effect is that the statistical properties of the underlying data change over time. Recent research has shown that it is possible to overcome this while still using conventional machine intelligence techniques, for example a sliding window approach where a classifier is trained every day on the most recent samples, or an ensemble approach where the oldest component is replaced with a new classifier [7].

However, for the purposes of this paper the challenges associated with concept drift, as well as their resolutions, are not explored, for two reasons. The first is that the dataset used is collected over a period of two days, which may not be sufficient for concept drift to actually take place. The second reason is that, as mentioned, research shows that concept drift in FDS may be appropriately tackled by using conventional methods that are employed to maintain only a local temporal memory of learned attributes [7]. In other words, once a method is found that performs well for short periods of time, i.e. periods sufficiently small that concept drift does not take place, its implementation can then be further improved to account for concept drift. Therefore the fraud detection methods explored in this paper would need further refinement before being applied to a data stream longer than a handful of days. These refinements are discussed in more detail at the end of the paper.

3) Sampling: Sampling methods are used to compensate for the unbalancedness of the dataset by reducing the classes to near equivalence in size. Undersampling and oversampling are two roughly equivalent and opposite techniques which use a bias to achieve this purpose. More complex algorithms such as the synthetic minority oversampling technique (SMOTE) and the state-of-the-art adaptive synthetic sampling approach (ADASYN) actually create new data points based on known samples and their features instead of simply replicating the minority class [3, 18]. However, these algorithms rely on assumptions about the minority class and are generally computationally expensive. In particular, the created data is generally an interpolation of prior data, which may not provide a realistic approximation of how the data would look if the classes were in fact balanced. Despite this, sampling methods can provide a more robust approach to imbalance learning than other methods, for example cost-based techniques, which penalize errors differently depending on class such that the minority class is favoured [12, 19] but raise the question of which costs to use.

In either case, sampling the training data with ADASYN is compared against the classic undersampling used in prior research and against no sampling at all. The package used is "Imbalanced-Learn" in Python 2.7 [20]. The testing and validation data are not sampled, so that reported final accuracies are not distorted. This is representative of real life, where the fraudulent cases would be in the extreme minority. Finally, these results are compared to cost-based balancing methods, where applicable.

4) Classification: Most FDS use supervised classification techniques to discriminate between fraudulent and legitimate transactions. Research on similar credit-card fraud datasets shows Random Forest (RF) having superior performance to NN and SVM when using undersampling to correct for class imbalance [6]. This finding is verified by exploring three general classes of techniques against the Kaggle dataset: linear methods, ensemble methods, and neural networks. Specifically, linear SVM, RF, and multi-layer perceptrons (MLP) with training data subjected to ADASYN are explored in this paper.

a) Support Vector Machine: The aim of an SVM is to fit a hyperplane between data points in space, i.e. support vectors, such that the samples are separated by the largest gap possible. Classification occurs by determining which side of the gap new data points fall on. Typically a binary linear classifier is used, although non-linear SVM methods exist. In this paper, the SVM algorithm used is from "scikit-learn" v0.18.1 with Python 2.7 [21]. This SVM implementation provides a means to apply cost-based balancing, which is briefly explored.

b) Random Forest: Ensemble methods generally use several weak classifiers to obtain better performance than any single algorithm. Notably, RF is a technique which constructs a multitude of decision trees, i.e. a forest. RF has the advantages of being effective on relatively small amounts of data, efficient and robust on large amounts of data, and able to easily apply cost-based balancing [13]. The RF algorithm used is from "scikit-learn" v0.18.1 with Python 2.7 [21].

c) Multi-Layer Perceptron: A multi-layer perceptron (MLP) is a fully connected feedforward neural network. MLPs can classify data which is not linearly separable due to the use of a non-linear activation function at each node in the network. The network is trained using standard backpropagation. The MLP is implemented using "scikit-learn" v0.18.1 with Python 2.7 [21]. "TensorFlow" v1.0 is also used to explore variant MLP architectures [22].
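To make the experimental setup concrete, the following is a minimal sketch of how the ADASYN-resampled training data and the three classifiers might be wired together, assuming the training and testing splits (X_train, y_train, X_test, y_test) prepared as in the Appendix and the imbalanced-learn API of that era; the parameter values are illustrative only, not the tuned values reported in the Results:

from imblearn.over_sampling import ADASYN
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Rebalance only the training split; the test and validation splits are left untouched
ada = ADASYN(n_jobs=3)
X_train_res, y_train_res = ada.fit_sample(X_train, y_train)

# The three classifier families compared in this paper (illustrative hyperparameters)
models = {
    "linear_svm": LinearSVC(C=1, dual=False),
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=3),
    "mlp_sgd": MLPClassifier(hidden_layer_sizes=(50,), solver="sgd"),
}
for name, model in models.items():
    model.fit(X_train_res, y_train_res)
    # Raw accuracy only; see the Validation subsection for the metrics actually used
    print(name, model.score(X_test, y_test))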
5) Validation: It is important to consider that a highly unbalanced dataset will by default report a high baseline accuracy. For example, the Kaggle dataset used in this paper has a baseline accuracy of 99.827%: if a binary classifier always chose the class of "no fraud", then because there are so few fraudulent transactions compared to legitimate ones ((284,807 - 492) / 284,807 ≈ 0.99827), a high accuracy would be reported, yet clearly the goal of flagging fraudulent transactions would not be accomplished. Therefore a more careful approach must be used: a confusion matrix that provides the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts.

TABLE 1: Confusion Matrix

                        True Fraud    True Legitimate
Predicted Fraud         TP            FP
Predicted Legitimate    FN            TN

From these counts, two other key metrics may be formulated: precision, the fraction of instances classified as fraudulent that truly are fraudulent, and recall, the fraction of fraudulent instances that are correctly classified.

    precision = TP / (TP + FP)    (1)

    recall = TP / (TP + FN)    (2)

The confusion matrix, together with the precision and recall metrics, provides a rich description of the performance of an FDS [3, 23]. Namely, FP is acceptable whereas FN is not, as it is preferable for a financial institution to expend some resources on an alert that ends up being legitimate than to miss a fraud altogether. Fraud analytics departments are viewed as a necessary cost of business by banks and other credit card issuers, but the impacts of missed frauds are far greater, including loss of customer trust and legal ramifications. In such a case the recall metric is important, as any FN will decrease the score. Confusion matrices are created using scikit-learn v0.18.1 in Python 2.7 [20].

The classification report provided by scikit-learn v0.18.1 is also used to assess algorithm performance. This report provides precision and recall values with respect to both classes as well as f-scores. This information is valuable for determining the degree to which the algorithm produces FP and FN.

By examining the area under the precision-recall (PR) curve (AUPR), different classification algorithms may be more effectively compared than by simply considering accuracy alone [4, 7, 23]. In this fashion an evaluation of a classifier may be obtained for the full range of thresholds, providing a more complete picture of its performance. The AUPR is formulated again using scikit-learn v0.18.1 in Python 2.7 [21].

Finally, the AUPR is supplemented with the area under the receiver operating characteristic (ROC) curve (AUROC). The ROC is simply the TP rate plotted against the FP rate as the discriminatory threshold is varied for a binary classifier [23]. It is similar to the PR curve in that, generally, the greater the area under the curve the greater the performance, but it provides a different evaluation of an algorithm, namely recall vs. the fall-out, that is, the probability of classifying a legitimate transaction as fraudulent.

III. RESULTS

A. Support Vector Machine

Precision-recall and ROC curves for different values of the "c" threshold parameter in the linear SVM classifier on the training data upsampled using ADASYN are as follows:

Fig. 2: SVM PR & ROC for c threshold, ADASYN

Note the poor performance, visible in the constant negative slopes for all threshold values. This is contrasted by Fig. 3, which demonstrates the same algorithm and threshold values but with the raw unsampled data:

Fig. 3: SVM PR & ROC for c threshold, unsampled

From these plots it is clear that the training data resampled with ADASYN actually resulted in worse performance than the raw unsampled data. Nevertheless, both results suggest that a threshold value of c > 0.1 produces optimal performance. Applying this result to the alternative cost-based method of class reweighting, the following plots are obtained:
Fig. 4: SVM PR & ROC for class_weights, c=1, unsampled

Based on the AUPR in Fig. 4 it is clear that a minority class weighting of 5:1 over the majority class on the unsampled data produces superior performance to both ADASYN sampled data and unsampled data without class reweighting. Considering the AUROC provides a different perspective: increasing the minority class weighting up to 1000 would increase the TP rate for a given FP rate with minimal impact on the FP rate.

Attempting to apply class reweighting to ADASYN upsampled data increases the performance over ADASYN sampled data without class reweighting, but produces AUPR and AUROC lower than the results obtained in Fig. 4, suggesting that ADASYN does not offer a valuable contribution to the classifier:

Fig. 5: SVM PR & ROC for c threshold, class_weight={1:5,0:1}, unsampled

For a closer comparison of ADASYN without class reweighting and the unsampled data with class reweighting, observe the FN rates in the following confusion matrices:

Fig. 6: SVM, c=1, unsampled, class_weight={1:1000,0:1}

Fig. 7: SVM, c=1, ADASYN

At first glance at Fig. 7 it seems that the classifier identifies fraudulent transactions to a greater degree using ADASYN than when trained with unsampled reweighted data, shown in Fig. 6. However, the increased minority class recall comes at the cost of a substantially higher FP rate of 93%. Clearly the classifier trained using ADASYN is immensely biased towards the minority class. The SVM trained using unsampled data with class weighting obtains superior performance: 91% TP and 97% TN. Compared to unsampled training data without class reweighting, the FN rate is reduced by over 37%.

B. Random Forest

Although RF is suggested to have optimal performance over SVM and NN [3], the obtained results indicate otherwise. Figure 8 below demonstrates extremely poor performance for any value of the number-of-estimators parameter when training with ADASYN sampled training data:

Fig. 8: RF PR & ROC for n_estimator, ADASYN

This contrasts substantially with the same algorithm using unsampled and unweighted data:

Fig. 9: RF PR & ROC for n_estimator, unsampled & unweighted
In the case of RF, taking the best performing value for the number of estimators and observing AUPR and AUROC for varying class reweighting does not yield as much performance benefit as in the case of SVM, observable in Fig. 10 below:

Fig. 10: RF PR & ROC for class_weight, 100 estimators, unsampled

Once again, comparing the confusion matrices for RF when using ADASYN sampled training data and when using unsampled data with class reweighting shows that ADASYN produces substantially worse performance: both TP and FP rates are compromised:

Fig. 11: RF confusion matrices, 100 estimators, ADASYN

Clearly the classifier trained using ADASYN has very low fraud detection performance, as it is biased towards the majority class. This is interesting as it is opposite to the results obtained using SVM with ADASYN.

Fig. 12: RF confusion matrices, 100 estimators, class_weight={1:10,0:1}

Figure 12 above shows that RF on unsampled but reweighted classes produces a perfect FP rate, but a TP rate of 74%, much lower than that obtained using SVM.

C. Multi-Layer Perceptron

The same trend is followed when using MLP. Varying the number of layers using ADASYN sampled training data and stochastic gradient descent (SGD) shows poor performance across all parameter values:

Fig. 13: MLP PR & ROC for n_layers, ADASYN using SGD

According to the AUPR and AUROC, a layer size of 50 is optimal, but clearly any layer size will produce either a high TP or a high TN rate, not both.

This is contrasted in Fig. 14 below, created by running the same algorithm but using unsampled, unweighted data.

Fig. 14: MLP PR & ROC for n_layers, unsampled & unweighted using SGD

Fig. 15: MLP confusion matrices, 100 layers, SGD, ADASYN
It is evident that using ADASYN sampled training data in fact cripples the MLP. Observing the confusion matrices confirms this finding:

Fig. 16: MLP confusion matrices, 100 layers, SGD, unweighted & unsampled

Note that in Fig. 15 the TP rate is actually fairly strong, but it comes at the cost of a high FP rate, indicating bias towards the minority class.

Based on these results, an MLP trained on unsampled data has better FDS performance than one trained on ADASYN sampled training data, but compared to the other methods it has a much greater FN rate.

The performance of the MLP trained with SGD on unweighted and unsampled training data is nearly identical to that of RF using reweighted classes trained with unsampled data. The differences are a slightly higher FP rate and a slightly lower FN rate, the latter being preferred by FDS.

IV. DISCUSSION

The results largely show that, compared to the unsampled dataset, the dataset upsampled with ADASYN had much worse performance. Additionally, the unsampled dataset using alternate class weighting to favour the minority class had superior performance to both the unsampled and the ADASYN sampled datasets. This in fact contradicts current research which showed that ADASYN led to increased overall performance in fraud detection versus other techniques [3, 18].

There are three suspected reasons that may explain the discrepancy.

1) The generated data was not representative of true fraudulent transactions. This is possible in three ways: ADASYN failed to create synthetic samples which captured the underlying characteristics of fraudulent transactions; fraudulent transactions, at least in the dataset used, do not have strong enough differentiation from legitimate transactions; or the class imbalance was simply too great to mitigate using ADASYN. However, the positive results obtained using the unsampled and unweighted methods suggest that the classes are in fact separable. Therefore it is suspected that ADASYN did not manage to create new fraudulent samples that were representative of the data. Whether this is due entirely to the extreme class imbalance or to the nature of the data itself is unknown.

2) The specific implementations of the SVM, RF, and MLP algorithms used were not able to generalize to highly imbalanced test and validation data after training with equally balanced data generated by synthetic sampling. This is supported by the high FP rate when ADASYN is used.

3) It is possible that the Python 2.7 package used for the ADASYN algorithm, "Imbalanced-Learn" [20], contained errors in the implementation that led to the poor results observed. The package is flagged as "experimental" and "use at your own risk".

Comparing the performance of the different classifiers used, it is clear that linear SVM produces the best performance over RF and MLP. In that case it was the unsampled training data with class reweighting that obtained optimal classification. MLP trained with unweighted and unsampled data came in second, but with an FN rate three times higher than that of SVM. RF came in a close third using unsampled training data with class reweighting.

V. CONCLUSIONS

The obtained results contradict results from prior research which suggest that upsampling using techniques such as ADASYN increases performance in binary classification of highly imbalanced datasets. It is clear that when compared to conventional methods for dealing with class imbalance, such as cost-based methods or even undersampling, creating synthetic samples can result in much worse performance. In many cases it can actually handicap the classifier and produce results worse than not accounting for class imbalance at all.

It is recommended that further research be done to examine the conditions under which upsampling techniques such as ADASYN may successfully increase performance on extremely unbalanced datasets. In particular, it is suggested that a closer examination of the upsampled data in this paper be done to visualize and understand the characteristics of the synthetic samples and to evaluate to what degree they are representative of actual fraudulent samples.

Second, it is recommended that the results of using ADASYN to upsample the minority class to a lesser degree be examined; that is, rather than upsampling to produce equivalent class sizes, only upsample to reduce the class imbalance. When the class imbalance is 99.8% biased towards the majority class, it is feasible that attempting to upsample too much produces a dataset that, when a classifier is trained on it, results in a classifier that is always biased towards the minority class, indicated by high FP rates.

Third, it is recommended that upsampling techniques such as ADASYN combined with conventional class imbalance mitigation techniques such as class reweighting be further explored. Although in this paper combining the two techniques led to poorer performance compared to class reweighting alone, it is conceivable that by correcting for the performance impacts that ADASYN produced on the results in this paper, applying an additional technique that is shown to increase performance will result in a globally optimal FDS which may use off-the-shelf implementations of conventional classifiers. A sketch of how the second and third recommendations might be set up follows.
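The following minimal sketch illustrates one way the second and third recommendations could be combined, assuming the training split (X_train, y_train) prepared as in the Appendix and an imbalanced-learn release that accepts the sampling_strategy parameter (older releases name it ratio and use fit_sample rather than fit_resample); the values shown are illustrative and are not results from this paper:

from imblearn.over_sampling import ADASYN
from sklearn.svm import LinearSVC

# Recommendation two: oversample the minority class only partially,
# e.g. to 10% of the majority class size, instead of fully balancing the classes
ada = ADASYN(sampling_strategy=0.1, n_jobs=3)
X_train_part, y_train_part = ada.fit_resample(X_train, y_train)

# Recommendation three: combine the partial oversampling with cost-based class reweighting
lsvm = LinearSVC(C=1, dual=False, class_weight={1: 10, 0: 1})
lsvm.fit(X_train_part, y_train_part)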
Finally, although the SVM trained with unsampled data and class reweighting reported higher fraud detection performance than the MLP trained with unsampled data and unweighted classes, it is recommended that applying class reweighting to MLP training be explored. This is because a higher performance is noted for MLP than for SVM on unsampled and unweighted data. It is expected that if class reweighting can be applied to the MLP, despite being an unconventional technique, performance gains similar to those observed with SVM will be obtained.

REFERENCES
[1] "The Nilson Report," HSN Consultants, Carpinteria, CA, Issue 1096, Oct. 2016. [Online] Available: https://nilsonreport.com/publication_newsletter_archive_issue.php?issue=1096
[2] "The Nilson Report," HSN Consultants, Carpinteria, CA, Issue 1101, Jan. 2017. [Online] Available: https://nilsonreport.com/publication_newsletter_archive_issue.php?issue=1101
[3] A. D. Pozzolo, "Adaptive Machine Learning for Credit Card Fraud Detection," Ph.D. dissertation, Dept. Comp. Sci., Univ. Libre de Bruxelles, Brussels, Belgium, 2015.
[4] L. Delamaire, H. Abdou, and J. Pointon, "Credit card fraud and detection techniques: a review," Banks and Bank Systems, 4(2):57–68, 2009.
[5] A. D. Pozzolo, et al., "Calibrating Probability with Undersampling for Unbalanced Classification," in Symposium on Computational Intelligence and Data Mining (CIDM), 2015 © IEEE. doi: 10.1109/SSCI.2015.33
[6] A. D. Pozzolo, et al., "Learned lessons in credit card fraud detection from a practitioner perspective," submitted for publication in Expert Systems with Applications, Feb. 2014.
[7] A. D. Pozzolo, "Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information," in International Joint Conference on Neural Networks (IJCNN), 2015 © IEEE. doi: 10.1109/IJCNN.2015.7280527
[8] M. Krivko, "A hybrid model for plastic card fraud detection systems," Expert Systems with Applications, 37(8):6070–6076, 2010.
[9] Y. Sahin, S. Bulkan, and E. Duman, "A cost-sensitive decision tree approach for fraud detection," Expert Systems with Applications, 40(15):5916–5923, 2013.
[10] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
[11] G. Batista, A. Carvalho, and M. Monard, "Applying one-sided selection to unbalanced datasets," in MICAI 2000: Advances in Artificial Intelligence, pp. 315–325, 2000.
[12] J. Laurikkala, "Improving identification of difficult small classes by balancing class distribution," Artificial Intelligence in Medicine, pp. 63–66, 2001.
[13] Kaggle. (2017, Jan. 12). Credit Card Fraud Detection [Online]. Available: https://www.kaggle.com/dalpozz/creditcardfraud
[14] L. Breiman and A. Cutler, Random Forests [Online]. Available: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[15] A. D. Pozzolo, "When is undersampling effective in unbalanced classification tasks?" in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery, Porto, Portugal, 2015, pp. 200–215.
[16] S. Visa and A. Ralescu, "Issues in mining imbalanced data sets - a review paper," in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67–73, 2005.
[17] I. Mani and I. Zhang, "kNN approach to unbalanced data distributions: a case study involving information extraction," in Proceedings of the Workshop on Learning from Imbalanced Datasets, 2003.
[18] H. He, et al., "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in IEEE International Joint Conference on Neural Networks, pp. 1322–1328, IEEE, 2008.
[19] G. M. Weiss and F. Provost, "The effect of class distribution on classifier learning: an empirical study," Rutgers Univ., 2001.
[20] G. Lemaitre, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning," submitted for publication.
[21] F. Pedregosa, et al., "Scikit-learn: Machine Learning in Python," JMLR 12, pp. 2825–2830, 2011. [Online] Available: http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
[22] M. Abadi, et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online] Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
[23] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240, ACM, 2006.

APPENDIX

### Code Sample
# contact yazan.obeidi@uwaterloo.ca for complete codebase

from collections import Counter
import pickle
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (confusion_matrix, classification_report, auc,
                             precision_recall_curve, roc_curve)
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.utils import shuffle
import matplotlib.gridspec as gridspec
%matplotlib notebook

## Dataset preparation
data = pd.read_csv("data/creditcard.csv")
# Create dataframes of only Fraud and Normal transactions, and shuffle them
fraud = shuffle(data[data.Class == 1])
normal = shuffle(data[data.Class == 0])
# Produce a training set of 80% of fraudulent and 80% of normal transactions
X_train = fraud.sample(frac=0.8)
X_train = pd.concat([X_train, normal.sample(frac=0.8)], axis=0)
# Split the remainder into testing and validation sets
remainder = data.loc[~data.index.isin(X_train.index)]
X_test = remainder.sample(frac=0.7)
X_validation = remainder.loc[~remainder.index.isin(X_test.index)]
# Use ADASYN to rebalance the training set only
ada = ADASYN(n_jobs=3)
X_train_resampled, X_train_labels_resampled = ada.fit_sample(
    np.array(X_train.loc[:, X_train.columns != 'Class']),
    np.array(X_train.Class))
# Pickle for later retrieval
with open('pickle/train_data_resampled.pkl', 'wb+') as f:
    pickle.dump(X_train_resampled, f)
with open('pickle/train_data_labels_resampled.pkl', 'wb+') as f:
    pickle.dump(X_train_labels_resampled, f)
# At this point, printing a Counter of the resampled data labels gives
# Counter({0: 227452, 1: 227324})
# Rebuild the resampled training set as a labelled DataFrame
X_train_resampled = pd.DataFrame(X_train_resampled)
X_train_labels_resampled = pd.DataFrame(X_train_labels_resampled)
X_train_resampled = pd.concat([X_train_resampled, X_train_labels_resampled], axis=1)
X_train_resampled.columns = X_train.columns
X_train_resampled.head()
# Shuffle the datasets once more to ensure random feeding into the classification algorithms
X_train = shuffle(X_train)
X_test = shuffle(X_test)
X_validation = shuffle(X_validation)
X_train_ = shuffle(X_train_resampled)
X_test_ = shuffle(X_test)            # test/validation are never resampled
X_validation_ = shuffle(X_validation)
data_resampled = pd.concat([X_train_, X_test_, X_validation_])
# Normalize the data to an average of 0 and std of 1, but do not touch the Class column
for feature in X_train.columns.values[:-1]:
    mean, std = data[feature].mean(), data[feature].std()
    X_train.loc[:, feature] = (X_train[feature] - mean) / std
    X_test.loc[:, feature] = (X_test[feature] - mean) / std
    X_validation.loc[:, feature] = (X_validation[feature] - mean) / std
for feature in X_train_.columns.values[:-1]:
    mean, std = data_resampled[feature].mean(), data_resampled[feature].std()
    X_train_.loc[:, feature] = (X_train_[feature] - mean) / std
    X_test_.loc[:, feature] = (X_test_[feature] - mean) / std
    X_validation_.loc[:, feature] = (X_validation_[feature] - mean) / std
# Create labels
y_train = X_train.Class
y_test = X_test.Class
y_validation = X_validation.Class
y_train_ = X_train_.Class
y_test_ = X_test_.Class
y_validation_ = X_validation_.Class
# Remove labels from X's
X_train = X_train.drop(['Class'], axis=1)
X_train_ = X_train_.drop(['Class'], axis=1)
X_test = X_test.drop(['Class'], axis=1)
X_test_ = X_test_.drop(['Class'], axis=1)
X_validation = X_validation.drop(['Class'], axis=1)
X_validation_ = X_validation_.drop(['Class'], axis=1)
# Pickle and save the dataset
dataset = {'X_train': X_train,
           'X_train_': X_train_,
           'X_test': X_test,
           'X_test_': X_test_,   # fixed: previously pointed at X_test
           'X_validation': X_validation,
           'X_validation_': X_validation_,
           'y_train': y_train,
           'y_train_': y_train_,
           'y_test': y_test,
           'y_test_': y_test_,
           'y_validation': y_validation,
           'y_validation_': y_validation_}
with open('pickle/data_with_resample.pkl', 'wb+') as f:
    pickle.dump(dataset, f)

## Sample Precision-Recall curve generating code for SVM on ADASYN training data:
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(1,2,1)
ax1.set_xlim([-0.05,1.05])
ax1.set_ylim([-0.05,1.05])
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title('PR Curve')

ax2 = fig.add_subplot(1,2,2)
ax2.set_xlim([-0.05,1.05])
ax2.set_ylim([-0.05,1.05])
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('ROC Curve')

# Sweep the SVM C parameter, one colour per value; curves are computed from
# the hard class predictions of the linear SVM
for c, k in zip([0.0001, 0.001, 0.1, 1, 10, 25, 50, 100], 'bgrcmywk'):
    lsvm_ = svm.LinearSVC(C=c, dual=False, class_weight={1:1, 0:1})
    lsvm_.fit(dataset['X_train_'], dataset['y_train_'])
    y_pred = lsvm_.predict(dataset['X_test_'])

    p, r, _ = precision_recall_curve(dataset['y_test_'], y_pred)
    fpr, tpr, _ = roc_curve(dataset['y_test_'], y_pred)   # roc_curve returns (fpr, tpr, thresholds)

    ax1.plot(r, p, c=k, label=c)
    ax2.plot(fpr, tpr, c=k, label=c)

ax1.legend(loc='lower left')
ax2.legend(loc='lower left')
plt.show()

## Linear SVM on weighted unsampled training data
lsvm = svm.LinearSVC(C=1, dual=False, class_weight={1:1000, 0:1})
lsvm.fit(dataset['X_train'], dataset['y_train'])
y_pred = lsvm.predict(dataset['X_test'])
y_pred_validation = lsvm.predict(dataset['X_validation'])
# plot_confusion_matrix is a plotting helper from the full codebase (not shown here)
plot_confusion_matrix(dataset['y_test'], y_pred)
plot_confusion_matrix(dataset['y_validation'], y_pred_validation)
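As a complement to the plotting code above, the following is a minimal sketch of how the metrics described in the Validation subsection (confusion matrix, classification report, AUPR and AUROC) might be computed for a fitted classifier; it reuses the imports and the dataset dictionary defined earlier in this Appendix, and lsvm refers to the weighted linear SVM fitted above:

## Sample metric computation for a fitted classifier
y_pred = lsvm.predict(dataset['X_test'])

# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(dataset['y_test'], y_pred))

# Per-class precision, recall and f-scores
print(classification_report(dataset['y_test'], y_pred))

# Areas under the precision-recall and ROC curves
p, r, _ = precision_recall_curve(dataset['y_test'], y_pred)
fpr, tpr, _ = roc_curve(dataset['y_test'], y_pred)
print('AUPR:  %.3f' % auc(r, p))
print('AUROC: %.3f' % auc(fpr, tpr))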
