Affiliated to
Bachelor of Technology
in
ECE-5B
Roll No: 48213302817

DECLARATION
New Delhi
Mr. KULDEEP
Date:
Signature of Guides
Mr. AVDESH KUMAR SHARMA
(Head of Department)
ABSTRACT
In this project, we were asked to experiment with a real-world dataset, to explore how the chosen algorithms perform on it, and to submit a report about the dataset and the algorithms used.
ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance from many people, and we are extremely privileged to have had this all along the completion of the project. All that we have done is only due to such supervision and assistance, and we would not forget to thank them. We respect and thank our project guide for giving us the opportunity to do the project and for all the support and guidance which made us complete the project duly. We are extremely thankful to her for providing such nice support and guidance despite her busy schedule managing all the things. We also extend our sincere esteem to all the staff in the laboratory for their timely support. At last, we are indebted to our friends and family members, who have provided us constant support and motivation throughout the time needed to complete our project.
Problem Statement:
The Credit Card Fraud Detection Problem includes modeling past credit card
transactions with the knowledge of the ones that turned out to be fraud.
This model is then used to identify whether a new transaction is fraudulent
or not. Our aim here is to detect 100% of the fraudulent transactions while
minimizing the incorrect fraud classifications.
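The aim stated above is usually quantified with recall (the share of actual frauds that are caught; the 100% target) and precision (the share of fraud flags that are correct). A minimal sketch with hypothetical labels, using scikit-learn's metric functions:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth (1 = fraud) and model predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 0]

# Recall: fraction of actual frauds detected (the 100% goal)
print(recall_score(y_true, y_pred))     # 2 of 3 frauds caught

# Precision: fraction of fraud flags that were correct
print(precision_score(y_true, y_pred))  # 2 of 3 flags correct
```

Maximizing recall while keeping precision high is exactly the trade-off the problem statement describes.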
Libraries Used:
NUMPY
NumPy is a very popular Python library for large multi-dimensional array and matrix processing, backed by a large collection of high-level mathematical functions. It is very useful for fundamental scientific computations in machine learning, particularly for its linear algebra, Fourier transform, and random number capabilities. High-end libraries like TensorFlow use NumPy internally for manipulation of tensors.
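A brief illustration of the linear-algebra and random-number capabilities mentioned above (values chosen for illustration only):

```python
import numpy as np

# Linear algebra: solve the system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)  # [2. 3.]

# Random number capabilities: reproducible samples
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)
print(samples.shape)  # (1000,)
```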
Scikit-learn
Scikit-learn is one of the most popular ML libraries for classical ML algorithms. It is built on top of two basic Python libraries, viz., NumPy and SciPy. Scikit-learn supports most of the supervised and unsupervised learning algorithms. It can also be used for data mining and data analysis, which makes it a great tool for anyone starting out with ML.
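A minimal supervised-learning example of scikit-learn's estimator API (toy data, not the project dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a small synthetic classification problem
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# All scikit-learn estimators share the same fit/predict interface
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on held-out data
```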
PANDAS
Pandas is a popular Python library for data analysis. It is not directly related to machine learning, but a dataset must be prepared before training, and this is where Pandas comes in handy, as it was developed specifically for data extraction and preparation. It provides high-level data structures and a wide variety of tools for data analysis, including many built-in methods for grouping, combining and filtering data.
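For instance, the filtering and grouping operations mentioned above might look like this on a small hypothetical transactions table:

```python
import pandas as pd

# Hypothetical transactions: amount plus a fraud label (1 = fraud)
df = pd.DataFrame({
    "amount": [10.0, 250.0, 5.0, 99.0],
    "cls":    [0, 1, 0, 1],
})

# Filtering: keep only the fraudulent rows
frauds = df[df["cls"] == 1]
print(len(frauds))  # 2

# Grouping: mean amount per class
means = df.groupby("cls")["amount"].mean()
print(means[1])  # 174.5
```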
MATPLOTLIB
Matplotlib is a very popular Python library for data visualization. Like Pandas, it is not directly related to machine learning. It particularly comes in handy when a programmer wants to visualize the patterns in the data. It is a 2D plotting library used for creating 2D graphs and plots. A module named pyplot makes plotting easy for programmers, as it provides features to control line styles, font properties, axis formatting, etc. It provides various kinds of graphs and plots for data visualization, viz., histograms, error charts, bar charts, etc.
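A short sketch of the histogram and bar-chart plot types mentioned above (the Agg backend is selected so the script runs without a display; the data is made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)          # histogram of 500 normal samples
ax1.set_title("Histogram")
ax2.bar(["valid", "fraud"], [980, 20])  # bar chart of class counts
ax2.set_title("Bar chart")
fig.savefig("plots.png")         # write the figure to a file
```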
ALGORITHM USED
Clustering
Clustering is a type of unsupervised learning, very often used when you don't have labeled data. K-means clustering is one of the most popular clustering algorithms. The goal of the algorithm is to find groups (clusters) in the given data. In this report we implement the K-means algorithm in Python from scratch.
K-Means Clustering
K-Means is a very simple algorithm which clusters the data into K clusters. (Figure omitted: an example of K-means clustering, from PyPR.)
Use Cases
Image Segmentation
Clustering Languages
Species Clustering
Anomaly Detection
Algorithm
Step 1 - Randomly initialize K cluster centers (centroids).
Step 2 - Assign each x_i to the nearest cluster by calculating its distance to each centroid.
Step 3 - Find the new cluster center by taking the average of the assigned points.
Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.
Step 1
In this step we randomly pick K points as the initial cluster centers:
C = {c_1, c_2, …, c_k}
Step 2
In this step we assign each input value to the closest center. This is done by calculating the Euclidean (L2) distance between the point and each centroid:
argmin_{c_i ∈ C} dist(c_i, x)²
Step 3
In this step, we find the new centroid by taking the average of all the points assigned to that cluster:
c_i = (1 / |S_i|) ∑_{x_i ∈ S_i} x_i
Step 4
In this step, we repeat Steps 2 and 3 until none of the cluster assignments change; that is, we iterate until the clusters remain stable.
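The four steps above can be sketched from scratch in NumPy. This is a minimal illustrative version (random initialization, with a fixed iteration cap as a safeguard), not a production implementation:

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: each new centroid is the mean of its assigned points
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids (hence assignments) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Two well-separated blobs; the algorithm should recover their centers
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids, labels = kmeans(X, k=2)
print(np.sort(centroids[:, 0]))  # approximately [0. 5.]
```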
We often know the value of K; in that case we simply use it. Otherwise we use the Elbow Method: we run the algorithm for different values of K (say K = 1 to 10), plot the K values against the SSE (Sum of Squared Errors), and select the value of K at the elbow point.
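A possible sketch of the Elbow Method on synthetic data (the three cluster locations and sizes below are illustrative assumptions; scikit-learn's `inertia_` attribute is the SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with 3 real clusters, centered at 0, 5 and 10
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in (0, 5, 10)])

# SSE for K = 1..10; the "elbow" is where the drop flattens out
sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_

# SSE keeps decreasing, but falls sharply only up to the true K (= 3)
print(sse[1] > sse[2] > sse[3])  # True
```

Plotting `sse.keys()` against `sse.values()` gives the elbow curve described above.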
FUNCTION USED
sklearn.cluster.KMeans
Parameters:
n_clusters : int, optional, default: 8
    The number of clusters to form as well as the number of centroids to generate.
init : {'k-means++', 'random' or an ndarray}, default: 'k-means++'
    Method for initialization.
    'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
    'random' : choose k observations (rows) at random from data for the initial centroids.
    If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
n_init : int, default: 10
    Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
max_iter : int, default: 300
    Maximum number of iterations of the k-means algorithm for a single run.
tol : float, default: 1e-4
    Relative tolerance with regards to inertia to declare convergence.
precompute_distances : {'auto', True, False}
    Precompute distances (faster but takes more memory).
    'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.
    True : always precompute distances.
    False : never precompute distances.
verbose : int, default: 0
    Verbosity mode.
random_state : int, RandomState instance or None, optional, default: None
    If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
copy_x : boolean, default: True
    When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.
n_jobs : int
    The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
    If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
algorithm : {'auto', 'full', 'elkan'}, default: 'auto'
    K-means algorithm to use. The classical EM-style algorithm is "full". The "elkan" variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. "auto" chooses "elkan" for dense data and "full" for sparse data.
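As a sketch of how the parameters documented above fit together (toy points and parameter values chosen for illustration, not taken from the report's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups, near x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

km = KMeans(n_clusters=2,      # number of clusters / centroids
            init='k-means++',  # smart initialization (the default)
            n_init=10,         # keep the best of 10 runs by inertia
            max_iter=300,      # iteration cap per run
            random_state=0)    # reproducible results
km.fit(X)

print(km.cluster_centers_)  # one centroid near x=1, one near x=10
print(km.labels_)           # cluster index of each input point
```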
CODE
import sys
import numpy
import pandas
import matplotlib
import seaborn
import scipy

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('Seaborn: {}'.format(seaborn.__version__))
print('Scipy: {}'.format(scipy.__version__))
import pandas as pd
import numpy as np
data = pd.read_csv('creditcard.csv')
print(data.columns)
print(data.shape)
print(data.describe())
# V1 - V28 are the results of a PCA dimensionality reduction
# to protect user identities and sensitive features
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]
outlier_fraction = len(Fraud)/float(len(Valid))
print(outlier_fraction)
# Correlation matrix
corrmat = data.corr()
columns = data.columns.tolist()
target = "Class"
# Use every column except the target label as features
columns = [c for c in columns if c != target]
X = data[columns]
Y = data[target]
# Print shapes
print(X.shape)
print(Y.shape)
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score, classification_report

state = 1

# Two outlier-detection models, identified from the attributes used below
classifiers = {
    "Isolation Forest": IsolationForest(
        contamination=outlier_fraction,
        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)
}

plt.figure(figsize=(9, 7))
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):
    if clf_name == "Local Outlier Factor":
        # LOF scores the training data directly via fit_predict
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Map model output to the dataset labels:
    # 1 (inlier) -> 0 (valid), -1 (outlier) -> 1 (fraud)
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    print(clf_name)
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))