
CREDIT CARD FRAUD DETECTION

A report of summer internship (2019-20)

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

Submitted by

KRITIK BANSAL

1802911018

Supervised by Prof. NAVPREET KAUR

DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

KIET GROUP OF INSTITUTIONS, GHAZIABAD, UTTAR PRADESH

(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow, UP, India)

Session 2019-20


Contents

PART-A
1. Introduction
2. Project Overview
3. Implementation (with snapshots)
4. Conclusion
5. Daily Log

PART-B

1. Summary of Paper-1
2. Summary of Paper-2
3. Summary of Paper-3
4. Summary of Paper-4
5. Summary of Paper-5

PART-C

1. Certificate of MOOC-1
2. Certificate of MOOC-2
CERTIFICATE
PART-A
Introduction
'Fraud' in credit card transactions is the unauthorized and unwanted use of an account by someone
other than the owner of that account. Necessary prevention measures can be taken to stop this abuse,
and the behavior of such fraudulent practices can be studied to minimize it and to protect against
similar occurrences in the future. In other words, credit card fraud can be defined as a case where a
person uses someone else's credit card for personal reasons while the owner and the card-issuing
authorities are unaware that the card is being used. Fraud detection involves monitoring the
activities of populations of users in order to estimate, perceive or avoid objectionable behavior,
which consists of fraud, intrusion and defaulting. This is a very relevant problem that demands the
attention of communities such as machine learning and data science, where the solution to this
problem can be automated. The problem is particularly challenging from the perspective of learning,
as it is characterized by factors such as class imbalance: valid transactions far outnumber
fraudulent ones. Moreover, transaction patterns often change their statistical properties over the
course of time.

These are not the only challenges in the implementation of a real-world fraud detection system,
however. In real-world settings, the massive stream of payment requests is quickly scanned by
automatic tools that determine which transactions to authorize. Machine learning algorithms are
employed to analyze all the authorized transactions and report the suspicious ones. These reports are
investigated by professionals who contact the cardholders to confirm whether the transaction was
genuine or fraudulent. The investigators provide feedback to the automated system, which is used to
train and update the algorithm and eventually improve the fraud-detection performance over time.
Some of the approaches currently used to detect such fraud are:

• Artificial Neural Network


• Fuzzy Logic
• Genetic Algorithm
• Logistic Regression
• Decision tree
• Support Vector Machines
• Bayesian Networks
• Hidden Markov Model
• K-Nearest Neighbor
PROJECT OVERVIEW
OBJECTIVE-

The objective of this project is to correctly identify the fraudulent transactions among the
roughly 2.85 lakh (284,807) transactions in the dataset available on Kaggle.

SOFTWARE USED – R

ALGORITHMS USED- DECISION TREE AND SAMPLING

ABOUT THE DATASET-

The dataset contains transactions made with credit cards in September 2013 by European
cardholders. It presents transactions that occurred over two days, with 492 frauds out of
284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts
for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, the original features and more background
information about the data cannot be provided. Features V1, V2, …, V28 are the principal
components obtained with PCA; the only features which have not been transformed with PCA are
'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction
and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this
feature can be used for example-dependent cost-sensitive learning. The feature 'Class' is the
response variable and takes value 1 in case of fraud and 0 otherwise.
IMPLEMENTATION

Reading the dataset from the current working directory
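A minimal sketch of this step is shown below; the file name creditcard.csv is an assumption, since the report does not state how the Kaggle file was saved.

# Read the Kaggle credit card dataset from the current working directory
# ("creditcard.csv" is an assumed file name)
credit_card <- read.csv("creditcard.csv")
head(credit_card)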

Structure of the dataset
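One way to inspect the structure, assuming the data frame from the previous sketch is named credit_card:

# Compactly display the structure of the data frame:
# 284,807 observations of 31 variables (Time, V1-V28, Amount, Class)
str(credit_card)
dim(credit_card)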

Converting Class integer into factor and printing the summary of the
dataset
Counting the missing values
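These steps could look like the following sketch (the variable name credit_card carries over from the earlier sketches):

# Convert the integer response 'Class' (0 = legit, 1 = fraud) into a factor
credit_card$Class <- factor(credit_card$Class, levels = c(0, 1))

# Summary statistics for every column (and class counts for the factor 'Class')
summary(credit_card)

# Count missing values per column and in total
colSums(is.na(credit_card))
sum(is.na(credit_card))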

Distribution and Probability of Fraud and Legit transactions

Pie Chart for credit transactions
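A sketch of how the counts, proportions and pie chart might be produced with base R:

# Counts and proportions of legit (0) and fraud (1) transactions
table(credit_card$Class)
prop.table(table(credit_card$Class))

# Pie chart of the two classes, labelled with their percentages
labels <- c("legit", "fraud")
labels <- paste(labels, round(100 * prop.table(table(credit_card$Class)), 2))
labels <- paste0(labels, "%")
pie(table(credit_card$Class), labels,
    col = c("orange", "red"),
    main = "Pie chart of credit card transactions")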


No model predictions

Confusion matrix of No Model Predictions
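The "no model" baseline simply predicts every transaction as legitimate. A sketch using the caret package for the confusion matrix (the package choice is an assumption):

# Baseline "no model": predict every transaction as legitimate (class 0)
predictions <- rep.int(0, nrow(credit_card))
predictions <- factor(predictions, levels = c(0, 1))

# Confusion matrix of the baseline against the true classes;
# accuracy is high only because frauds are so rare
library(caret)
confusionMatrix(data = predictions, reference = credit_card$Class)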


Creating Fraction part of dataset

Creating a scatterplot of fraction of data
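A sketch of these two steps, assuming the 20% fraction mentioned in the Conclusion and the dplyr and ggplot2 packages:

# Take a smaller fraction of the data for faster experimentation
library(dplyr)
set.seed(1)
credit_card_frac <- credit_card %>% sample_frac(0.20)
table(credit_card_frac$Class)

# Scatterplot of the fraction (V1 vs V2), coloured by class
library(ggplot2)
ggplot(credit_card_frac, aes(x = V1, y = V2, col = Class)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(values = c("dodgerblue2", "red"))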


Creating Training and Test sets
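One common way to split the data is caTools::sample.split; the package choice and the 70/30 ratio are assumptions, as the report does not state them:

# Split the sampled data into training (70%) and test (30%) sets,
# preserving the legit/fraud ratio in both parts
library(caTools)
set.seed(123)
data_sample <- sample.split(credit_card_frac$Class, SplitRatio = 0.70)
train_data  <- subset(credit_card_frac, data_sample == TRUE)
test_data   <- subset(credit_card_frac, data_sample == FALSE)
dim(train_data)
dim(test_data)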

Random Over Sampling

Random Over Sampling Scatterplot


Random Under Sampling and its Scatterplot
Random Over and Random Under Sampling “Both” with Scatterplot
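The three sampling schemes can be sketched with ovun.sample from the ROSE package; the package choice and the target sample sizes N are assumptions:

library(ROSE)

n_legit <- sum(train_data$Class == 0)
n_fraud <- sum(train_data$Class == 1)

# Random over-sampling: replicate fraud rows until both classes are equal
oversampled <- ovun.sample(Class ~ ., data = train_data,
                           method = "over", N = 2 * n_legit)$data
table(oversampled$Class)

# Random under-sampling: discard legit rows until both classes are equal
undersampled <- ovun.sample(Class ~ ., data = train_data,
                            method = "under", N = 2 * n_fraud)$data
table(undersampled$Class)

# "Both": over-sample frauds and under-sample legits, keeping the original size
sampled_both <- ovun.sample(Class ~ ., data = train_data,
                            method = "both", p = 0.5,
                            N = nrow(train_data), seed = 1)$data
table(sampled_both$Class)

# The same kind of scatterplot as before can be redrawn on each result
ggplot(sampled_both, aes(x = V1, y = V2, col = Class)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(values = c("dodgerblue2", "red"))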

Balancing the dataset using SMOTE


Class Distribution Scatterplot of Original Data
Class Distribution Scatterplot of Balanced Data
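SMOTE synthesises new minority (fraud) examples instead of merely duplicating them. The sketch below uses the smotefamily package; the package choice, the value of K and the way dup_size is chosen are assumptions:

library(smotefamily)

# Number of synthetic copies per original fraud row, chosen so that the
# synthetic + original fraud rows roughly match the legit count (assumption)
n_legit  <- sum(train_data$Class == 0)
n_fraud  <- sum(train_data$Class == 1)
dup_size <- ceiling(n_legit / n_fraud) - 1

smote_output <- SMOTE(X = train_data[, names(train_data) != "Class"],
                      target = train_data$Class,
                      K = 5, dup_size = dup_size)
credit_smote <- smote_output$data

# The class column returned by SMOTE is the last one; restore its name and type
colnames(credit_smote)[ncol(credit_smote)] <- "Class"
credit_smote$Class <- factor(credit_smote$Class, levels = c(0, 1))
prop.table(table(credit_smote$Class))

# Class distribution of the original versus the SMOTE-balanced training data
ggplot(train_data, aes(x = V1, y = V2, col = Class)) +
  geom_point() + scale_color_manual(values = c("dodgerblue2", "red"))
ggplot(credit_smote, aes(x = V1, y = V2, col = Class)) +
  geom_point() + scale_color_manual(values = c("dodgerblue2", "red"))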

Decision Tree using SMOTE data
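A classification tree on the SMOTE-balanced data, sketched with rpart and drawn with rpart.plot (the specific plotting options are assumptions):

library(rpart)
library(rpart.plot)

# Fit a decision tree on the SMOTE-balanced training data
CART_model <- rpart(Class ~ ., data = credit_smote, method = "class")

# Draw the fitted tree
rpart.plot(CART_model, extra = 0, type = 5, tweak = 1.2)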


Prediction of Fraud Transactions on Test data and its Confusion Matrix for
Accuracy
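A sketch of the prediction and evaluation step, reusing caret::confusionMatrix:

# Predict the class of the test transactions with the SMOTE-based tree
predicted_val <- predict(CART_model, test_data, type = "class")

# Confusion matrix against the true test labels
# (reports accuracy, sensitivity and specificity)
confusionMatrix(predicted_val, test_data$Class)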

Decision Tree without SMOTE data


Prediction of Fraud Transactions on Test data and its Confusion Matrix for
Accuracy
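The same steps without SMOTE, i.e. training directly on the imbalanced training set:

# Fit a decision tree on the original (imbalanced) training data
CART_model_raw <- rpart(Class ~ ., data = train_data, method = "class")
rpart.plot(CART_model_raw, extra = 0, type = 5, tweak = 1.2)

# Predict on the test set and evaluate
predicted_val_raw <- predict(CART_model_raw, test_data, type = "class")
confusionMatrix(predicted_val_raw, test_data$Class)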

HENCE, THE MODEL WITHOUT SMOTE SHOWS HIGHER ACCURACY
THAN THE MODEL WITH SMOTE

Applying the Model on the complete dataset and its Confusion Matrix
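A sketch of applying the chosen model to all 284,807 transactions; the report does not state which of the two trees was used, so the tree trained without SMOTE is assumed here:

# Apply the model to the complete dataset and evaluate it
predicted_full <- predict(CART_model_raw, credit_card, type = "class")
confusionMatrix(predicted_full, credit_card$Class)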

FINAL ACCURACY = 99.5%


CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This project has also
explained in detail how machine learning can be applied to obtain better results in fraud
detection, along with the algorithm, pseudocode, an explanation of its implementation and the
experimental results.

The code prints out the number of false positives it detected and compares it with the actual
values. This is used to calculate the accuracy score and precision of the algorithms. The
fraction of data used for faster testing is 20% of the entire dataset; the complete dataset is
also used at the end, and both results are printed. These results, along with the
classification report for each algorithm, are given in the output, where class 0 means the
transaction was determined to be valid and class 1 means it was determined to be fraudulent.
This result is matched against the class values to check for false positives. Since the entire
dataset consists of only two days' transaction records, it is only a fraction of the data that
could be made available if this project were used on a commercial scale. Being based on machine
learning algorithms, the program will only increase its efficiency over time as more data is
fed into it.
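For reference, a minimal sketch of how the accuracy and precision mentioned above can be derived from the confusion matrix counts (variable names follow the earlier sketches):

# Confusion matrix of predicted vs. actual classes on the test set
cm <- table(Predicted = predicted_val_raw, Actual = test_data$Class)

accuracy  <- sum(diag(cm)) / sum(cm)          # (TP + TN) / all predictions
precision <- cm["1", "1"] / sum(cm["1", ])    # TP / (TP + FP)
accuracy
precision
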
DAILY LOG
Name of Student Kritik Bansal

Roll No. 1802911018

Name of Course Data Science in R

Date of Commencement 18th May, 2020

Date of Completion 29th July, 2020

S. No  Learning of the Day  Date

1  Introduction to Data Science  18.05.2020
2  Basics of Data Science, Big Data and the Data Analytics Life Cycle  19.05.2020
3  Data Types, Variables and Operators  21.05.2020
4  Decision Making  22.05.2020
5  Loops in R  25.05.2020
6  Continued, Loops in R  26.05.2020
7  Functions and Strings  28.05.2020
8  Vectors, Lists and Arrays  29.05.2020
9  Data Visualization and Exploration  01.06.2020
10  Importing External CSV and XML  02.06.2020
11  Handling missing values and detection of outliers  04.06.2020
12  Apply function  05.06.2020
13  WhatsApp Text Analysis  15.06.2020
14  Natural Language Processing (Twitter Sentiment Analysis)  16.06.2020
15  Time series algorithm and decision trees  18.06.2020
16  Apriori Algorithm (Market Basket Analysis)  20.06.2020
17  Decision Trees and ID3 Algorithm  21.06.2020
18  K-means clustering and feature selection using Random Forest  22.06.2020
19  Logistic Regression  23.06.2020
20  Multiple linear regression with ANOVA test and predictions on this dataset  08.07.2020
21  One-way ANOVA, two-way ANOVA, ANOVA with replicates and ANOVA without replicates  09.07.2020
22  Extreme Gradient Boosting Algorithm  10.07.2020
23  Continued, Extreme Gradient Boosting Algorithm  13.07.2020
24  Naive Bayes Classifier  14.07.2020
25  Density-based Cluster Visualization  20.07.2020
26  Support Vector Machines  21.07.2020
27  Basics of RCloud  23.07.2020
28  Goodness of fit and other tests using the Chi-square test  24.07.2020
29  ARIMA model in time series and plotting ACF and PACF plots  27.07.2020
30  Capturing the annualized market volatility using Apple stock price data  28.07.2020
PART-B

Research Paper-1

Title- Pivot Property in Weighted Least Regression Based on Single Repeated Observations

Summary-

How genuine repeated observations influence the result of linear regression could be an
interesting topic in regression analysis. In this paper, we discuss the pivot behavior in
weighted linear regression based on a certain data pattern and give a plausible explanation for
it. Repetition of identical observations is an important phenomenon in a variety of domains
such as clinical trials, categorical surveys and traffic data. Data with genuine repeats can
easily be found in categorical data analysis, and obtaining estimators for unknown parameters
within the framework of statistics always comes down to an optimization problem solved with the
least squares method, which is the origin of this study. As we can see, the weighting scheme
plays a vital role throughout the whole study. The dependency between the weighting scheme and
the pivot behavior would be a future research topic; it could be a helpful tool for determining
corresponding weights for observations, and the efficiency loss for estimators in regression
analysis under repeated observations could be estimated by simulation using the Monte Carlo
method. Meanwhile, the explanation of the pivot behavior could be applied in different
scientific areas to give further explanations of the impact of repeated data points according
to the knowledge of their specific domain. Whether longitudinal data would inherit this pivot
behavior is another possible research topic.
Research Paper-2

Title- A Mixed-Integer Programming Model to Configure a Post Supply Chain Network

Summary-

Points of distribution, sales or service are important elements of the supply chain. These are
the final elements responsible for the proper functioning of the whole cargo distribution
process. Proper location of these points in the transport network is essential to ensure the
effectiveness and reliability of the supply chain, and their location is also very important
from the consumer's point of view. In this paper, a mathematical model is proposed to design a
post supply chain network that minimizes transportation cost, facility location cost and
holding cost. The proposed supply chain network consists of four echelons: supplier, post
office, distribution centre and recipient. The notable point of this study is that, as the post
supply chain is examined, the demand of the recipient's point is determined at the supplier
point rather than at the delivery point. Finally, the proposed model is solved with the LINGO
17 software and the results are analysed.

Example- Post-fire debris flows are natural disasters capable of destroying structures and
endangering human lives. These events are prevalent in certain
geographic regions and are expected to increase in frequency. Motivated by the
literature on operations research in disaster relief operations, we present several
novel formulations for hazard management of post-fire debris flows. The
deterministic model allocates a budget towards various mitigation options
including preventative efforts that reduce the probability of debris flow
initiation and reduction efforts that reduce volume conditional on initiation. The
objective minimizes expected damage to structures while weighting different
storm scenarios according to Poisson process probabilities. A two-stage
multiperiod decision-dependent stochastic programming model is then
developed to address the prevention of loss of life through emergency vehicle
routing. This stochastic program considers different storm scenarios and allows
mitigation actions taken in the first stage to affect second stage parameters and
scenario probabilities. The program routes emergency vehicles to pick up
injured people at damaged residences and then delivers them to a hospital. Case
study results are presented using real data based on Santa Barbara after the 2009
Jesusita wildfire. In the case study, the optimal mitigation measures focus primarily on
three of the 17 basins. The deterministic results focus mostly on check dams
given our parameter values, while the stochastic results incorporate
prepositioned emergency vehicles. Smaller budgets have a large marginal
benefit from mitigation. We also generate a larger simulated data set based on
this case study to test the computational tractability of our formulations.

Research Paper -3
Title- Performance of Some Factor Analysis Techniques

Summary-

Thousands of variables have been proposed to explain or describe the complex variety and
interconnections of social and international relations. Perhaps an equal number of hypotheses
and theories linking these variables have been suggested.

The few basic variables and propositions central to understanding remain to be determined. The
systematic dependencies and correlations among these variables have been charted only roughly,
if at all, and many, if not most, can be measured only on presence-absence or rank order
scales. And to take the data on any one variable at face value is to beg questions of validity,
reliability, and comparability.

Confronted with entangled behaviour, unknown interdependencies, masses of qualitative and
quantitative variables, and bad data, many social scientists are turning toward factor analysis
to uncover major social and international patterns. Factor analysis can simultaneously manage
over a hundred variables, compensate for random error and invalidity, and disentangle complex
interrelationships into their major and distinct regularities.

Factor analysis is not without cost, however. It is mathematically complicated and entails
diverse and numerous considerations in application. Its technical vocabulary includes strange
terms such as eigenvalues, rotate, simple structure, orthogonal, loadings, and communality. Its
results usually absorb a dozen or so pages in a given report, leaving little room for a
methodological introduction or explanation of terms. Add to this the fact that students do not
ordinarily learn factor analysis in their formal training, and the sum is the major cost of
factor analysis: most laymen, social scientists, and policy-makers find the nature and
significance of the results incomprehensible.
The problem of communicating factor analysis is especially crucial for peace
research. Scholars in this field are drawn from many disciplines and
professions, and few of them are acquainted with the method. As our empirical
knowledge of conflict processes, behavior, conditions, and patterns becomes
increasingly expressed in factor analytic terms, those who need this knowledge
most in order to make informed policy decisions may be those who are most
deterred by the packaging. Indeed, they are unlikely to know that this
knowledge exists.

A conceptual map, therefore, is needed to guide the consumers of findings in conflict and
international relations through the terminological obstacles and
quantitative obstructions presented by factor studies. The aim of this paper is to
help draw such a map. Specifically, the aim is to enhance the understanding and
utilization of the results of factor analysis. Instead of describing how to apply
factor analysis or discussing the mathematical model involved, I shall try to
clarify the technical paraphernalia which may conceal important substantive
data, propositions, or scientific laws.

By way of orientation, the first section of this paper will present a brief
conceptual review of factor analysis. In the second section the scientific context
of the method will be discussed. The major uses of factor analysis will be listed
and its relation to induction and deduction, description and inference, causation
and explanation, and classification and theory will be considered. To aid
understanding, the third section will outline the geometrical and algebraic factor
models, and the fourth section will define the factor matrices and their
elements--the vehicles for presenting factor results. Since comprehending factor
rotation is important for interpreting the findings, the fifth and final section is
devoted to clarifying its significance.

Research Paper -4
Title- Dependent Ranked Set Sampling Designs for Parametric Estimation
with Applications

Summary-

We derive the likelihood function of the neoteric ranked set sampling (NRSS) design, a
dependent sampling method, and of the double neoteric ranked set sampling (DNRSS) design, which
combines an independent sampling method in the first stage with a dependent sampling method in
the second stage, and compare them for the estimation of the parameters of the inverse Weibull
(IW) distribution. An intensive simulation has been carried out to compare the one-stage and
two-stage designs. The results show that likelihood estimation based on ranked set sampling
(RSS), an independent sampling method, as well as on the NRSS and DNRSS designs, provides more
efficient estimators than the usual simple random sampling design. Moreover, DNRSS is slightly
more efficient than the NRSS and RSS designs for estimating the IW distribution parameters.
Ranked set sampling (RSS) is an advanced data collection method used when the exact measurement
of an observation is difficult and/or expensive; it is applied in a number of research areas,
e.g., environment, bioinformatics and ecology. In this method, random sets are drawn from a
population and the units in the sets are ranked with a ranking mechanism based on visual
inspection or a concomitant variable. Because of the importance of working with a good design
and easy analysis, there is a need for a software tool which provides sampling designs and
statistical inferences based on RSS and its modifications.

Research Paper-5

Title- A New Family of Lifetime Distributions: Theory, Application and Characterizations

Summary-

A new class of distributions with increasing, decreasing, bathtub-shaped and unimodal hazard
rate forms, called the generalized quadratic hazard rate-power series distribution, is
proposed. The new distribution is obtained by compounding the generalized quadratic hazard rate
and power series distributions. This class contains several important distributions that have
appeared in the literature, such as the generalized quadratic hazard rate-geometric, -Poisson,
-logarithmic, -binomial and -negative binomial distributions, as special cases. We provide
comprehensive mathematical properties of the new distribution. We obtain closed-form
expressions for the density function, cumulative distribution function, survival and hazard
rate functions, moments, mean residual life, mean past lifetime, order statistics and moments
of order statistics; certain characterizations of the proposed distribution are presented as
well. The special distributions are studied in some detail. The maximum likelihood method is
used to estimate the unknown parameters, and we propose to use the EM algorithm to compute the
maximum likelihood estimators. It is observed that the proposed EM algorithm can be implemented
very easily in practice. One data set has been analyzed for illustrative purposes, and it is
observed that the proposed model and the EM algorithm work quite well in practice.
PART-C
Certificate of MOOC – 1
Certificate of MOOC – 2
