Credit-Card-Fraud-Detection

The dataset provided consists of credit card transactions made by European cardholders in September
2013. The data covers a span of two days, with a total of 284,807 transactions, out of which 492 are
identified as fraudulent.

The dataset primarily comprises numerical input variables resulting from a PCA (Principal Component
Analysis) transformation. Due to confidentiality concerns, the original features and additional
background information about the data are not disclosed. The transformed features, labeled as V1,
V2, ..., V28, are the principal components obtained through PCA. The 'Time' and 'Amount' features are
exceptions and have not undergone PCA transformation. The 'Time' feature represents the time elapsed
in seconds between a transaction and the first transaction in the dataset. The 'Amount' feature denotes
the transaction amount, and it can be used for cost-sensitive learning based on individual cases. The
target variable, 'Class,' indicates whether a transaction is fraudulent (1) or legitimate (0).
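As a quick sanity check on the schema described above, the sketch below builds a tiny frame with the same 31-column layout. The file name `creditcard.csv` is an assumption (it is how this dataset is commonly distributed), and the sample values are made up for illustration:

```python
import pandas as pd

# The real data would be loaded with something like:
#   df = pd.read_csv("creditcard.csv")   # file name is an assumption
# For illustration, a tiny frame with the same 31-column layout:
df = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    **{f"V{i}": [0.1, -0.2, 0.3] for i in range(1, 29)},  # V1..V28 (PCA components)
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 1],  # 1 = fraud, 0 = legitimate
})
print(df.shape)  # (3, 31)
```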

Implications:

While consumers are typically not held liable for fraudulent transactions, the financial burden falls on
the shoulders of financial institutions and merchants. Financial institutions generally bear the cost of
fraud in in-store transactions, whereas merchants are responsible for fraud in online transactions.
Enhancing fraud detection can result in significant cost savings for both financial institutions and
merchants, potentially amounting to millions of dollars annually.

Variables:

The dataset comprises 31 variables. This includes the time of the transaction, the transaction amount,
and the transaction class (1 for fraud and 0 for legitimate transactions). The remaining 28 variables are
the result of PCA transformation on the original data. Unfortunately, there is no information available
regarding the nature of the original variables.

Imbalance in the Dependent Class:
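The imbalance quoted in the dataset description, 492 frauds among 284,807 transactions, works out to well under one percent of all records:

```python
# Headline figures from the dataset description.
total_transactions = 284_807
fraud_transactions = 492

fraud_rate = fraud_transactions / total_transactions
print(f"Fraud rate: {fraud_rate:.4%}")                # roughly 0.17% of transactions
print(f"Imbalance ratio: 1 : {total_transactions // fraud_transactions}")
```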

Distribution of Amount:

Distribution of Time:

Correlation Between the Variables:


Classifier Models:

To evaluate the performance of our classifier models, we divided the dataset into a 70% training set and
a 30% test set.
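A minimal sketch of the 70/30 split with scikit-learn, run here on synthetic stand-in data (the feature count and random seeds are assumptions); stratifying on the class keeps the fraud ratio comparable across the two halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30 feature columns; the real data is not bundled here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = (rng.random(1000) < 0.05).astype(int)  # rare positive class to mimic fraud

# 70% train / 30% test, preserving the class ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (700, 30) (300, 30)
```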

Model Performance - Metrics:

Instead of solely focusing on overall accuracy across the entire dataset, our primary concern is
effectively identifying the majority of fraud cases (recall) while controlling the associated costs
(precision). Typically, this is assessed using the F1 score, which considers both precision and recall.
However, the presence of asymmetric costs for type I and II errors necessitates the use of a customized
cost function.
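The F1 score combines the two metrics as their harmonic mean, and a cost-weighted alternative can encode the asymmetry between missed frauds and false alarms. The per-error costs below are purely illustrative assumptions, not values from this project:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def expected_cost(false_negatives: int, false_positives: int,
                  cost_fn: float = 100.0, cost_fp: float = 1.0) -> float:
    """Asymmetric cost: a missed fraud (type II error) is weighted far more
    heavily than a false alarm (type I error). Cost values are illustrative."""
    return false_negatives * cost_fn + false_positives * cost_fp

# E.g. precision 0.97 and recall 0.79 give an F1 of about 0.87.
print(round(f1_score(0.97, 0.79), 2))  # 0.87
```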

Decision Tree Classifier:

A decision tree is a hierarchical tree structure resembling a flowchart. Each internal node represents a
feature or attribute, the branches represent decision rules, and the leaf nodes indicate outcomes. The
topmost node, called the root node, learns to partition the data based on attribute values. The tree
partitions itself recursively, enabling decision-making in a manner that resembles human thinking.
Decision trees are highly interpretable and easy to understand.

The Decision Tree classifier achieved an accuracy of 99% when using a threshold of 0.5. However, since
our focus is on precision and recall, we examined the Precision-Recall Curve to evaluate the model's
performance.
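A sketch of fitting a tree and tracing the precision-recall curve, again on synthetic stand-in data (the tree depth, seeds, and feature count are assumptions, not the project's actual settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in; in the real data the features are Time, Amount, and V1..V28.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.5).astype(int)  # rare positives

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
scores = tree.predict_proba(X)[:, 1]  # predicted fraud probability per sample

# One precision/recall pair per candidate threshold (plus an endpoint).
precision, recall, thresholds = precision_recall_curve(y, scores)
print(len(thresholds), len(precision))
```

Sweeping the threshold this way is what lets us trade precision against recall instead of defaulting to 0.5.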

On the training data, the model achieved a precision of 97%, a recall of 79%, and an F-measure of 87%. On the test data, it achieved a precision of 83%, a recall of 79%, and an F-measure of 81%.

Random Forest model:

A random forest is an ensemble method (based on the divide-and-conquer approach) of decision trees generated on randomly drawn subsets of the dataset. This collection of decision tree classifiers is also known as the forest. The individual trees are grown using an attribute selection measure such as information gain, gain ratio, or the Gini index, and each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result; in regression, the average of all tree outputs is taken. Random forests are simple to use yet powerful compared with other non-linear classification algorithms.
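A minimal sketch of the voting ensemble described above, with the 100 estimators mentioned in the results; the synthetic data, feature count, and seeds are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same synthetic stand-in pattern as before; the real features are V1..V28 etc.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pred = forest.predict(X)  # majority vote across the 100 trees
print(pred.shape, len(forest.estimators_))
```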

The Random Forest model with 100 estimators achieved an accuracy of 99%.

On the training data, it achieved a precision of 96%, a recall of 73%, and an F-measure of 83%.

Overall, the Random Forest model gave us the best metrics for this classification task.
