
HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT

HAMIDPUR, DELHI - 110036

Affiliated to

GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY

Sector - 16C, Dwarka, Delhi - 110075, India

Summer Training Project Report


CREDIT CARD FRAUD DETECTION

Submitted in partial fulfillment of the requirements for

The award of the degree of

Bachelor of Technology

In

(ELECTRONICS & COMMUNICATION ENGINEERING)

Guide(s): Mr. KULDEEP

Submitted by: VIVEK KUMAR

ECE-5B

Roll No: 48213302817
DECLARATION

We, students of ECE 5th semester, hereby declare that the project
entitled "CREDIT CARD FRAUD DETECTION", which is submitted to the
Department of ECE, HMR Institute of Technology & Management,
Hamidpur, Delhi, affiliated to Guru Gobind Singh Indraprastha
University, Dwarka (New Delhi), in partial fulfilment of the
requirement for the award of the degree of Bachelor of Technology
in ECE, has not previously formed the basis for the award of any
degree, diploma or other similar title or recognition.

Submitted by: Vivek Kumar

This is to certify that the above statement made by the candidate(s) is correct to the best of my knowledge.

New Delhi

Date:

Signature of Guide
Mr. KULDEEP

Mr. AVDESH KUMAR SHARMA

(Head of Department)

HMRITM Hamidpur, New Delhi-110036

ABSTRACT
In this project, we were asked to experiment with a real-world dataset and to explore how machine learning algorithms can be used to find patterns in data. We were expected to gain experience with common data-mining and machine learning libraries (here, Python's scikit-learn) and to submit a report about the dataset and the algorithms used. After performing the required tasks on a dataset of our choice, herein lies our final report.

ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance from many people, and we are extremely privileged to have received it throughout the completion of our project. All that we have done is only due to such supervision and assistance, and we would not forget to thank them.

We respect and thank Prof. Jigyasa Chadha for providing us an opportunity to do the project and giving us all the support and guidance which enabled us to complete the project duly. We are extremely thankful to her for providing such kind support and guidance despite her busy schedule.

We are thankful and fortunate to have received constant encouragement, support and guidance from all the teaching staff of Electronics & Communication, which helped us in successfully completing our project work. We would also like to extend our sincere thanks to all the laboratory staff for their timely support.

At last, we are indebted to our friends and family members who have provided us constant support and motivation throughout the time taken to complete our project.

ABOUT MACHINE LEARNING


Machine learning (ML) is the scientific study of algorithms and statistical
models that computer systems use to perform a specific task without using
explicit instructions, relying on patterns and inference instead. It is seen as a
subset of artificial intelligence. Machine learning algorithms build a
mathematical model based on sample data, known as "training data", in
order to make predictions or decisions without being explicitly programmed
to perform the task. Machine learning algorithms are used in a wide variety
of applications, such as email filtering and computer vision, where it is
difficult or infeasible to develop a conventional algorithm for effectively
performing the task.

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Problem Statement:

The credit card fraud detection problem involves modeling past credit card transactions with the knowledge of which ones turned out to be fraudulent. The model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing incorrect fraud classifications (false positives).

Libraries Used:
NUMPY
NumPy is a very popular Python library for large multi-dimensional array and matrix processing, with the help of a large collection of high-level mathematical functions. It is very useful for fundamental scientific computations in machine learning, particularly for its linear algebra, Fourier transform, and random number capabilities. High-end libraries like TensorFlow use NumPy internally for the manipulation of tensors.
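For illustration, here is a minimal sketch (not from the project code) of the kind of array operations NumPy provides:

import numpy as np

# Build a small matrix and a vector
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([0.5, -1.0])

# Vectorized linear algebra: matrix-vector product and an L2 norm
print(A @ v)
print(np.linalg.norm(v))

# Random number capabilities, seeded for reproducibility
rng = np.random.default_rng(1)
print(rng.normal(size=3))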

SCIKIT-LEARN
Scikit-learn is one of the most popular ML libraries for classical ML algorithms. It is built on top of two basic Python libraries, viz., NumPy and SciPy. Scikit-learn supports most of the supervised and unsupervised learning algorithms. Scikit-learn can also be used for data mining and data analysis, which makes it a great tool for anyone who is starting out with ML.
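As a brief, illustrative sketch (using the toy Iris dataset bundled with scikit-learn, not our fraud data), a typical scikit-learn workflow looks like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit a classifier and evaluate it on held-out data
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))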

PANDAS
Pandas is a popular Python library for data analysis. It is not directly related to machine learning, but since the dataset must be prepared before training, pandas comes in handy, as it was developed specifically for data extraction and preparation. It provides high-level data structures and a wide variety of tools for data analysis, including many inbuilt methods for grouping, combining and filtering data.
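As a small sketch of the kind of preparation pandas makes easy (the toy table below is hypothetical, not the fraud dataset):

import pandas as pd

# A tiny hypothetical transactions table
df = pd.DataFrame({'amount': [12.5, 300.0, 7.2, 89.9],
                   'label': [0, 1, 0, 0]})

# Filtering, grouping and combining in a few lines
print(df[df['amount'] > 10])
print(df.groupby('label')['amount'].mean())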

MATPLOTLIB
Matplotlib is a very popular Python library for data visualization. Like pandas, it is not directly related to machine learning. It particularly comes in handy when a programmer wants to visualize the patterns in the data. It is a 2D plotting library used for creating 2D graphs and plots. A module named pyplot makes plotting easy for programmers, as it provides features to control line styles, font properties, axis formatting, etc. It provides various kinds of graphs and plots for data visualization, viz., histograms, error charts, bar charts, etc.
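A minimal pyplot sketch (with made-up data, purely for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# A basic 2D line plot with controlled line style and labelled axes
plt.plot(x, y, linestyle='--', marker='o')
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A simple pyplot example')
plt.show()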

ALGORITHM USED
Clustering
Clustering is a type of unsupervised learning, very often used when you don't have labeled data. K-Means clustering is one of the most popular clustering algorithms. The goal of this algorithm is to find groups (clusters) in the given data. Below, we outline the K-Means algorithm and sketch how it can be implemented in Python from scratch.

K-Means Clustering
K-Means is a very simple algorithm which clusters the data into K clusters. (The PyPR documentation includes a good visual example of K-Means clustering.)

Use Cases

K-Means is widely used for many applications.

Image Segmentation

Clustering Gene Segmentation Data

News Article Clustering

Clustering Languages

Species Clustering

Anomaly Detection

Algorithm

Our algorithm works as follows, assuming we have inputs x_1, x_2, x_3, …, x_n and a value of K.
Step 1 - Pick K random points as cluster centers called centroids.

Step 2 - Assign each x_i to the nearest cluster by calculating its distance to each centroid.

Step 3 - Find the new cluster centers by taking the average of the points assigned to each cluster.

Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.

This procedure is easy to visualize by running K-Means clustering on two-dimensional data.

Step 1

We randomly pick K cluster centers (centroids). Let's assume these are c_1, c_2, …, c_k, and we can say that

C = {c_1, c_2, …, c_k}

where C is the set of all centroids.

Step 2

In this step, we assign each input value to its closest center. This is done by calculating the Euclidean (L2) distance between the point and each centroid:

arg min_{c_i ∈ C} dist(c_i, x)^2

where dist(·) is the Euclidean distance.

Step 3

In this step, we find the new centroid by taking the average of all the points assigned to that cluster:

c_i = (1 / |S_i|) Σ_{x ∈ S_i} x

where S_i is the set of all points assigned to the i-th cluster.


Step 4

In this step, we repeat Steps 2 and 3 until none of the cluster assignments change, i.e., until the clusters remain stable.
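Putting Steps 1-4 together, here is a minimal from-scratch sketch of K-Means in Python (NumPy only; the function name and variable names are our own, and a robust implementation would also handle empty clusters):

import numpy as np

def kmeans(X, k, max_iter=300, seed=1):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each x_i to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroid = mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence assignments) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels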

Choosing the Value of K

We often know the value of K in advance; in that case we simply use it. Otherwise, we use the elbow method: we run the algorithm for different values of K (say K = 1 to 10), plot the K values against the SSE (sum of squared errors), and select the value of K at the elbow point.
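A short sketch of the elbow method with scikit-learn (illustrative; X is assumed to be a NumPy array of samples, such as the feature matrix prepared in the code below):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for K = 1..10 and record the SSE (inertia_) of each run
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1).fit(X)
    sse.append(km.inertia_)

# Plot SSE against K and pick the K at the elbow of the curve
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()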

FUNCTION USED
sklearn.cluster.KMeans

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

Parameters:

n_clusters : int, optional, default: 8

The number of clusters to form as well as the number of centroids to generate.

init : {'k-means++', 'random' or an ndarray}, method for initialization, defaults to 'k-means++':

'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section in k_init for more details.

'random' : choose k observations (rows) at random from the data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default: 300

Maximum number of iterations of the k-means algorithm for a single run.

tol : float, default: 1e-4

Relative tolerance with regard to inertia to declare convergence.

precompute_distances : {'auto', True, False}

Precompute distances (faster but takes more memory).

'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100 MB overhead per job using double precision.

True : always precompute distances.

False : never precompute distances.

verbose : int, default: 0

Verbosity mode.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

copy_x : boolean, default: True

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.

n_jobs : int

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

algorithm : 'auto', 'full' or 'elkan', default: 'auto'

K-means algorithm to use. The classical EM-style algorithm is 'full'. The 'elkan' variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. 'auto' chooses 'elkan' for dense data and 'full' for sparse data.
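A short usage sketch of this class (illustrative only; X is assumed to be the feature matrix defined in the code below):

from sklearn.cluster import KMeans

# Cluster the data into 2 groups with a fixed seed for reproducibility
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=1)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # coordinates of the centroids
print(kmeans.inertia_)          # SSE of the final clustering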

CODE
import sys
import numpy
import pandas
import matplotlib
import seaborn
import scipy

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('Seaborn: {}'.format(seaborn.__version__))
print('Scipy: {}'.format(scipy.__version__))

# import the necessary packages

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Load the dataset from the csv file using pandas

data = pd.read_csv('creditcard.csv')

#start exploring the dataset

print(data.columns)

# Take a random 10% sample of the data to speed up computation

data = data.sample(frac=0.1, random_state=1)

# Print the shape of the data

print(data.shape)

print(data.describe())
# V1 - V28 are the results of a PCA dimensionality reduction
# applied to protect user identities and sensitive features

# Determine number of fraud cases in dataset

Fraud = data[data['Class'] == 1]

Valid = data[data['Class'] == 0]

outlier_fraction = len(Fraud)/float(len(Valid))

print(outlier_fraction)

print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))

print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))

# Correlation matrix

corrmat = data.corr()

fig = plt.figure(figsize = (12, 9))

sns.heatmap(corrmat, vmax = .8, square = True)


plt.show()

# Get all the columns from the dataFrame

columns = data.columns.tolist()

# Filter the columns to remove data we do not want

columns = [c for c in columns if c not in ["Class"]]

# Store the variable we'll be predicting on

target = "Class"

X = data[columns]

Y = data[target]

# Print shapes

print(X.shape)

print(Y.shape)

from sklearn.metrics import classification_report, accuracy_score

from sklearn.ensemble import IsolationForest


from sklearn.neighbors import LocalOutlierFactor


# define a random state

state = 1

# define outlier detection tools to be compared

classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20,
                                               contamination=outlier_fraction)
}

# Fit the model

plt.figure(figsize=(9, 7))

n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid, 1 for fraud
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))

CONCLUSION AND FUTURE WORK

Fraud detection is a complex issue that requires a substantial amount of planning before throwing machine learning algorithms at it. Nonetheless, it is also an application of data science and machine learning for good, one which helps ensure that the customer's money is safe and not easily tampered with.

Future work will include comprehensive tuning of the models used here, the Isolation Forest and Local Outlier Factor. Having a dataset with non-anonymized features would make this particularly interesting, as examining which features drive the predictions would show what specific factors are most important for detecting fraudulent transactions.
