
HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT

HAMIDPUR, DELHI - 110036

Affiliated to

GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY

Sector - 16C, Dwarka, Delhi - 110075, India

Summer Training Project Report


CREDIT CARD FRAUD DETECTION

Submitted in partial fulfillment of the requirements for

The award of the degree of

Bachelor of Technology

In

(ELECTRONICS & COMMUNICATION ENGINEERING)

Guide(s): Mr. KULDEEP

Submitted by: VIVEK KUMAR

ECE-5B

Roll No: 48213302817
DECLARATION

We, students of ECE 5th semester, hereby declare that the project
entitled "CREDIT CARD FRAUD DETECTION", which is submitted to the
Department of ECE, HMR Institute of Technology & Management,
Hamidpur, Delhi, affiliated to Guru Gobind Singh Indraprastha
University, Dwarka (New Delhi), in partial fulfilment of the
requirement for the award of the degree of Bachelor of Technology
in ECE, has not previously formed the basis for the award of any
degree, diploma or other similar title or recognition.

Submitted by: Vivek Kumar

This is to certify that the above statement made by the candidate(s) is correct to the best of my knowledge.

New Delhi

Date:

Signature of Guide
Mr. KULDEEP

Mr. AVDESH KUMAR SHARMA

(Head of Department)

HMRITM Hamidpur, New Delhi-110036

ABSTRACT
In this project, we were asked to experiment with a real-world dataset and to explore how machine learning algorithms can be used to find patterns in data. We were expected to gain experience with common data-mining and machine learning libraries (here, Python's scikit-learn) and to submit a report about the dataset and the algorithms used. After performing the required tasks on a dataset of our choice, herein lies our final report.

ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance from many people, and we are extremely privileged to have received it throughout the completion of our project. All that we have done is only due to such supervision and assistance, and we would not forget to thank them.

We respect and thank Prof. Jigyasa Chadha for providing us an opportunity to do the project and giving us all the support and guidance which enabled us to complete the project duly. We are extremely thankful to her for providing such kind support and guidance despite her busy schedule.

We are thankful and fortunate to have received constant encouragement, support and guidance from all the teaching staff of Electronics & Communication, which helped us in successfully completing our project work. We would also like to extend our sincere thanks to all the laboratory staff for their timely support.

At last, we are indebted to our friends and family members who have provided us constant support and motivation throughout the time taken to complete our project.

ABOUT MACHINE LEARNING


Machine learning (ML) is the scientific study of algorithms and statistical
models that computer systems use to perform a specific task without using
explicit instructions, relying on patterns and inference instead. It is seen as a
subset of artificial intelligence. Machine learning algorithms build a
mathematical model based on sample data, known as "training data", in
order to make predictions or decisions without being explicitly programmed
to perform the task. Machine learning algorithms are used in a wide variety
of applications, such as email filtering and computer vision, where it is
difficult or infeasible to develop a conventional algorithm for effectively
performing the task.

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Problem Statement:

The credit card fraud detection problem involves modeling past credit card transactions with the knowledge of which ones turned out to be fraudulent. The model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing incorrect fraud classifications (false positives).

Libraries Used:
NUMPY
NumPy is a very popular Python library for large multi-dimensional array and matrix processing, with the help of a large collection of high-level mathematical functions. It is very useful for fundamental scientific computations in machine learning, particularly for its linear algebra, Fourier transform, and random number capabilities. High-end libraries like TensorFlow use NumPy internally for the manipulation of tensors.
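For illustration, here is a minimal sketch (not from the project code) of the kind of array operations NumPy provides:

import numpy as np

# Build a small matrix and a vector
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([0.5, -1.0])

# Vectorized linear algebra: matrix-vector product and an L2 norm
print(A @ v)
print(np.linalg.norm(v))

# Random number capabilities, seeded for reproducibility
rng = np.random.default_rng(1)
print(rng.normal(size=3))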

SCIKIT-LEARN
Scikit-learn is one of the most popular ML libraries for classical ML algorithms. It is built on top of two basic Python libraries, viz., NumPy and SciPy. Scikit-learn supports most of the supervised and unsupervised learning algorithms. Scikit-learn can also be used for data mining and data analysis, which makes it a great tool for anyone who is starting out with ML.
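As a brief, illustrative sketch (using the toy Iris dataset bundled with scikit-learn, not our fraud data), a typical scikit-learn workflow looks like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit a classifier and evaluate it on held-out data
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))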

PANDAS
Pandas is a popular Python library for data analysis. It is not directly related to machine learning, but since the dataset must be prepared before training, pandas comes in handy, as it was developed specifically for data extraction and preparation. It provides high-level data structures and a wide variety of tools for data analysis, including many inbuilt methods for grouping, combining and filtering data.
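As a small sketch of the kind of preparation pandas makes easy (the toy table below is hypothetical, not the fraud dataset):

import pandas as pd

# A tiny hypothetical transactions table
df = pd.DataFrame({'amount': [12.5, 300.0, 7.2, 89.9],
                   'label': [0, 1, 0, 0]})

# Filtering, grouping and combining in a few lines
print(df[df['amount'] > 10])
print(df.groupby('label')['amount'].mean())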

MATPLOTLIB
Matplotlib is a very popular Python library for data visualization. Like pandas, it is not directly related to machine learning. It particularly comes in handy when a programmer wants to visualize the patterns in the data. It is a 2D plotting library used for creating 2D graphs and plots. A module named pyplot makes plotting easy for programmers, as it provides features to control line styles, font properties, axis formatting, etc. It provides various kinds of graphs and plots for data visualization, viz., histograms, error charts, bar charts, etc.
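A minimal pyplot sketch (with made-up data, purely for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# A basic 2D line plot with controlled line style and labelled axes
plt.plot(x, y, linestyle='--', marker='o')
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A simple pyplot example')
plt.show()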

ALGORITHM USED
Clustering
Clustering is a type of unsupervised learning, very often used when you don't have labeled data. K-Means clustering is one of the most popular clustering algorithms. The goal of this algorithm is to find groups (clusters) in the given data. Below, we outline the K-Means algorithm and sketch how it can be implemented in Python from scratch.

K-Means Clustering
K-Means is a very simple algorithm which clusters the data into K clusters. (The PyPR documentation includes a good visual example of K-Means clustering.)

Use Cases

K-Means is widely used for many applications.

Image Segmentation

Clustering Gene Segmentation Data

News Article Clustering

Clustering Languages

Species Clustering

Anomaly Detection

Algorithm

Our algorithm works as follows, assuming we have inputs x_1, x_2, x_3, …, x_n and a value of K.
Step 1 - Pick K random points as cluster centers called centroids.

Step 2 - Assign each x_i to the nearest cluster by calculating its distance to each centroid.

Step 3 - Find the new cluster centers by taking the average of the points assigned to each cluster.

Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.

This procedure is easy to visualize by running K-Means clustering on two-dimensional data.

Step 1

We randomly pick K cluster centers (centroids). Let's assume these are c_1, c_2, …, c_k, and we can say that

C = {c_1, c_2, …, c_k}

where C is the set of all centroids.

Step 2

In this step, we assign each input value to its closest center. This is done by calculating the Euclidean (L2) distance between the point and each centroid:

arg min_{c_i ∈ C} dist(c_i, x)^2

where dist(·) is the Euclidean distance.

Step 3

In this step, we find the new centroid by taking the average of all the points assigned to that cluster:

c_i = (1 / |S_i|) Σ_{x ∈ S_i} x

where S_i is the set of all points assigned to the i-th cluster.


Step 4

In this step, we repeat Steps 2 and 3 until none of the cluster assignments change, i.e., until the clusters remain stable.
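Putting Steps 1-4 together, here is a minimal from-scratch sketch of K-Means in Python (NumPy only; the function name and variable names are our own, and a robust implementation would also handle empty clusters):

import numpy as np

def kmeans(X, k, max_iter=300, seed=1):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each x_i to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroid = mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence assignments) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels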

Choosing the Value of K

We often know the value of K in advance; in that case we simply use it. Otherwise, we use the elbow method: we run the algorithm for different values of K (say K = 1 to 10), plot the K values against the SSE (sum of squared errors), and select the value of K at the elbow point.
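A short sketch of the elbow method with scikit-learn (illustrative; X is assumed to be a NumPy array of samples, such as the feature matrix prepared in the code below):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for K = 1..10 and record the SSE (inertia_) of each run
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1).fit(X)
    sse.append(km.inertia_)

# Plot SSE against K and pick the K at the elbow of the curve
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()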

FUNCTION USED
sklearn.cluster.KMeans

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

Parameters:

n_clusters : int, optional, default: 8

The number of clusters to form as well as the number of centroids to generate.

init : {'k-means++', 'random' or an ndarray}, method for initialization, defaults to 'k-means++':

'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section in k_init for more details.

'random' : choose k observations (rows) at random from the data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default: 300

Maximum number of iterations of the k-means algorithm for a single run.

tol : float, default: 1e-4

Relative tolerance with regard to inertia to declare convergence.

precompute_distances : {'auto', True, False}

Precompute distances (faster but takes more memory).

'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100 MB overhead per job using double precision.

True : always precompute distances.

False : never precompute distances.

verbose : int, default: 0

Verbosity mode.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

copy_x : boolean, default: True

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.

n_jobs : int

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

algorithm : 'auto', 'full' or 'elkan', default: 'auto'

K-means algorithm to use. The classical EM-style algorithm is 'full'. The 'elkan' variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. 'auto' chooses 'elkan' for dense data and 'full' for sparse data.
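A short usage sketch of this class (illustrative only; X is assumed to be the feature matrix defined in the code below):

from sklearn.cluster import KMeans

# Cluster the data into 2 groups with a fixed seed for reproducibility
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=1)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # coordinates of the centroids
print(kmeans.inertia_)          # SSE of the final clustering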

CODE
import sys
import numpy
import pandas
import matplotlib
import seaborn
import scipy

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('Seaborn: {}'.format(seaborn.__version__))
print('Scipy: {}'.format(scipy.__version__))

# import the necessary packages

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Load the dataset from the csv file using pandas

data = pd.read_csv('creditcard.csv')

#start exploring the dataset

print(data.columns)

# Take a random 10% sample of the data to speed up computation

data = data.sample(frac=0.1, random_state=1)

# Print the shape of the data

print(data.shape)

print(data.describe())
# V1 - V28 are the results of a PCA dimensionality reduction
# applied to protect user identities and sensitive features

# Determine number of fraud cases in dataset

Fraud = data[data['Class'] == 1]

Valid = data[data['Class'] == 0]

outlier_fraction = len(Fraud)/float(len(Valid))

print(outlier_fraction)

print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))

print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))

# Correlation matrix

corrmat = data.corr()

fig = plt.figure(figsize = (12, 9))

sns.heatmap(corrmat, vmax = .8, square = True)


plt.show()

# Get all the columns from the dataFrame

columns = data.columns.tolist()

# Filter the columns to remove data we do not want

columns = [c for c in columns if c not in ["Class"]]

# Store the variable we'll be predicting on

target = "Class"

X = data[columns]

Y = data[target]

# Print shapes

print(X.shape)

print(Y.shape)

from sklearn.metrics import classification_report, accuracy_score

from sklearn.ensemble import IsolationForest


from sklearn.neighbors import LocalOutlierFactor


# define a random state

state = 1

# define outlier detection tools to be compared

classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20,
                                               contamination=outlier_fraction)
}

# Fit the model

plt.figure(figsize=(9, 7))

n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid, 1 for fraud
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))

CONCLUSION AND FUTURE WORK

Fraud detection is a complex issue that requires a substantial amount of planning before throwing machine learning algorithms at it. Nonetheless, it is also an application of data science and machine learning for good, one which helps ensure that the customer's money is safe and not easily tampered with.

Future work will include comprehensive tuning of the models used here, the Isolation Forest and Local Outlier Factor. Having a dataset with non-anonymized features would make this particularly interesting, as examining which features drive the predictions would show what specific factors are most important for detecting fraudulent transactions.
