Unearthing Fraudulence
by Adopting Data
Analytics
09-04-2017
____________
Abhishek Dangi
UTS Student ID - 12736625
Contents
Overview, Goals
Specification
Milestone, Data Mining scenario
Data Exploration
Proposed Methodology
Project Plan
Reference
Overview
“As of 2012, about 2.5 exabytes of data is created each day, and that number is
doubling every 40 months or so.” – Harvard Business Review, Big Data: The
Management Revolution
We have always had “Big Data”; now we also have fast processing. Technological
innovation keeps shrinking the world we live in by making it easy to do so much.
Fraud is a billion-dollar business that grows every year, and it now affects almost
every part of the financial industry. Large banks like yours generate and contribute
to financial, taxation, transaction, credit, investment and many other databases. This
data should be used to provide a smooth and secure environment for clients,
businesses and financial institutions.
This data mining proposal therefore aims to help you better understand the patterns
that reveal bank fraud. As our economy moves towards cashless payments, better
security measures in the fast-paced online world are a must for everyone: the bank,
its customers and businesses. For that reason, this project focuses mainly on carding
(credit card fraud).
Goals
❖ Generating lists of potentially fraudulent profiles based on various databases.
❖ Predicting fraud using predictive analysis.
❖ Recognizing patterns to narrow down on repeated fraudulent transactions.
❖ Creating an AI solution for the bank that detects and prevents online transaction
fraud in real time using the client’s previous activity.
❖ Auditing various databases to reveal potential security flaws in the system,
determining which sectors have been the target of the most fraudulent activity, and
recommending solid security measures.
❖ Minimizing the number of false positives.
Specifications
● Data Mining
○ Collecting and sorting data from various sources such as ATM and POS
transactions, online transactions, geolocation, credit card services,
mortgages, criminal records etc.
○ Applying data preprocessing techniques such as validation, error
correction, and handling of missing or incorrect data.
○ Creating client profiles using the resulting data.
○ Clustering and classification to find patterns and group related data.
○ Applying algorithms to find abnormalities in transaction behavior
relative to previously built models and profiles.
○ Estimating risk and predicting future transactions.
Milestones
Financial institutions have used data analysis to detect fraud for many years. Fraud
detection requires complex and time-consuming investigations that span different
domains such as finance, economics and business practice. Fraud often consists of
repeated transactions using the same method.
The first industries to implement data analytics techniques were telephone companies
and banks (Decker 1998). One early example of data analysis in the banking industry
is the FICO Falcon fraud assessment system, which is based on a neural network shell.
Retailers are also victims of fraud at the point of sale (POS). As a result, some
supermarkets have started to combine digitized CCTV with POS data for the
transactions most susceptible to fraud.
Internet transactions have recently raised big concerns, with some research showing
that internet transaction fraud is 12 times higher than in-store fraud.
Problem Definition
Business problem: Strengthen the fraud detection strategy by using historical
data to identify transaction profiles that are likely to be involved in fraudulent
activities, so that they can be examined in further detail.
Analytics problem: Consider a target variable ‘T’ and use historical data to
build a predictive model for ‘T’,
i.e. T = 0 for good observations and T = 1 for potential fraud observations.
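As a rough illustration of the analytics problem, the sketch below labels the five rows of the sample Transaction Table in this proposal with T and fits a one-feature logistic model on the transaction amount. The cut-off `RISK_THRESHOLD = 7`, the choice of log-amount as the only feature, and all function names are assumptions made for the example; the proposal does not specify how T is derived or which model will be used.

```python
import math

# (amount in $, risk value 0-10), copied from the sample Transaction Table.
SAMPLE = [(99.95, 0), (875.0, 3), (466.0, 2), (2500.0, 8), (6500.0, 9)]

# Hypothetical cut-off: the proposal does not say how risk maps to T.
RISK_THRESHOLD = 7


def label(risk):
    """Return T = 1 (potential fraud) or T = 0 (good observation)."""
    return 1 if risk >= RISK_THRESHOLD else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ts, lr=0.5, epochs=2000):
    """Fit P(T = 1 | x) = sigmoid(w * x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - t for x, t in zip(xs, ts)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Feature: log10 of the amount, centred on the sample mean.
raw = [math.log10(amount) for amount, _ in SAMPLE]
mean = sum(raw) / len(raw)
xs = [x - mean for x in raw]
ts = [label(risk) for _, risk in SAMPLE]
w, b = fit(xs, ts)


def fraud_probability(amount):
    """Estimated probability that a transaction of this amount has T = 1."""
    return sigmoid(w * (math.log10(amount) - mean) + b)
```

With only five rows this is a toy fit, not a usable model; a real model would be trained on the bank's historical data with many more attributes.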
Data Exploration
As the name suggests, in this phase the data is explored without a fixed idea of what
is being looked for.
In this part of the project the data will be collected and prepared for the mining tasks
planned for the next phase.
First and foremost, the data must be collected correctly. We aim to use the data that
your bank has already collected in its databases, for example customer profiles, credit
card statements and customer track records. Altogether we will extract the following
attributes and format them into specific data types:
[Diagram. Inputs: (…) of card holder; number of large purchases on the card;
frequency of large purchases; location where large purchases took place.
Outputs: T = 0, transaction OK; T = 1, transaction probably fraud.]
Data Preparation
The way data is currently collected and organized does not need to change, so we
will concentrate on smoothing the gathered data to remove inconsistencies. To
maintain accuracy, we will remove or replace any missing values in the database.
Almost no products nowadays have a round-number price (for example, items are
generally priced at $599 rather than $600), so keeping an eye out for round-number
transactions could be useful.
The same applies to transaction amounts that end in a run of zeroes, such as
$1200.00, which could also be categorized as suspicious.
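The round-number check described above could be sketched as follows. The function name `is_round_amount` and the default step of $100 are assumptions for illustration; the proposal does not fix what counts as "round". `Decimal` is used so that amounts such as 99.95 are compared exactly rather than as binary floats.

```python
from decimal import Decimal


def is_round_amount(amount, multiple=Decimal("100")):
    """Return True if amount is a positive exact multiple of `multiple`."""
    value = Decimal(str(amount))
    return value > 0 and value % multiple == 0


# Amounts mentioned in the text above.
flags = {a: is_round_amount(a) for a in ("599", "600", "1200.00", "99.95")}
```

A round amount on its own is weak evidence, so in practice this flag would be one input among several rather than a verdict.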
We will use the data gathered from your databases to populate the tables below. The
first, the customer table, will hold the customer details collected by the bank during
account opening and from public records. The second, the transaction table, will
summarize the historical and real-time data fed to the bank's databases. It will be
updated with the details of the transactions carried out by customers.
Customer Table
Account no.  DOB         Address     Post code  Acc balance ($)  Transaction limit ($)  Card type
2256633      1-1-1996    1st st      2144       1299             500                    debit
2256622      1-1-1994    susan st    2000       5699             700                    credit
2256611      1-5-1969    broadway    2032       13982            3000                   debit
2256603      1-4-1992    george st.  2546       23695            5000                   debit
2256637      2-5-1963    Mack RD.    2554       10233            2000                   credit
2256665      22-11-1969  CBD         2596       36955            6000                   debit
Transaction Table
Transaction ID  Transaction amount ($)  Transaction type  IP address of transaction  Time of transaction  Risk value (0-10)
223545          99.95                   POS               192.168.1.1                12:50                0
223543          875                     Transfer          192.168.1.1                11:25                3
223541          466                     POS               192.168.1.1                                     2
223542          2500                    ATM               192.168.1.1                23:30                8
223549          6500                    ATM               192.168.1.1                22:21                9
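One way to hold the transaction table in code, and to find rows with missing values before they are removed or replaced, is sketched below. The class name `Transaction`, its field names and the helper `split_complete` are hypothetical; the rows are the sample rows from the table above, with the one missing time kept as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transaction:
    transaction_id: int
    amount: float          # dollars
    kind: str              # "POS", "Transfer" or "ATM"
    ip_address: str
    time: Optional[str]    # "HH:MM", or None when the value is missing
    risk: int              # 0-10


ROWS = [
    Transaction(223545, 99.95, "POS", "192.168.1.1", "12:50", 0),
    Transaction(223543, 875.0, "Transfer", "192.168.1.1", "11:25", 3),
    Transaction(223541, 466.0, "POS", "192.168.1.1", None, 2),
    Transaction(223542, 2500.0, "ATM", "192.168.1.1", "23:30", 8),
    Transaction(223549, 6500.0, "ATM", "192.168.1.1", "22:21", 9),
]


def split_complete(rows):
    """Separate rows with a recorded time from rows where it is missing."""
    complete = [r for r in rows if r.time is not None]
    incomplete = [r for r in rows if r.time is None]
    return complete, incomplete


complete, incomplete = split_complete(ROWS)
```

Whether an incomplete row is dropped or repaired (for example from server logs) is a decision for the data preparation phase; the sketch only isolates such rows.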
Proposed Methodology
To classify a transaction as genuine or fraudulent, an effective data mining
solution is essential.
Clustering
Clustering is the task of grouping a set of objects so that objects in the same cluster
are more similar to each other than to objects in other clusters. It is a main task of
exploratory data mining. Our suggested solution tackles the credit card fraud problem
described above in two ways, both based on pattern recognition. First, each customer’s
profile data and historical transaction data are clustered into one common profile based
on the average large transaction amount (the amount of large transactions per specific
interval of time). Second, a model is formed based on the geolocation of the transactions.
We will use fuzzy clustering (also known as soft clustering), in which each object,
described by attributes such as geolocation, transaction frequency and transaction type,
is assigned a likelihood of belonging to each cluster. Objects that do not clearly belong
to any cluster can then be treated as outliers and flagged as potentially fraudulent
transactions.
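The idea can be sketched with a plain fuzzy c-means loop. Everything specific here is an assumption for illustration: the two-dimensional toy points are made up (they are not bank data), the starting centers are chosen by hand, and `OUTLIER_DISTANCE` is an arbitrary threshold. Note that in standard fuzzy c-means the memberships of each point sum to 1, so a distant point can still receive a high membership in its nearest cluster; this sketch therefore flags outliers by their distance to the nearest center rather than by membership alone.

```python
import math


def memberships_for(points, centers, m=2.0):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    result = []
    for p in points:
        # Clamp to avoid division by zero when a point sits on a center.
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        row = [
            1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
            for dj in dists
        ]
        result.append(row)
    return result


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    dim = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        new_centers = []
        for j in range(len(centers)):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        centers = new_centers
    return centers, memberships_for(points, centers, m)


# Made-up toy points: two tight groups plus one isolated point.
POINTS = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2),
    (9.0, 1.0),
]
OUTLIER_DISTANCE = 2.0  # hypothetical threshold

centers, u = fuzzy_c_means(POINTS, centers=[(1.0, 1.0), (5.0, 5.0)])
flagged = [
    p for p in POINTS
    if min(math.dist(p, c) for c in centers) > OUTLIER_DISTANCE
]
```

On real data the attributes would first need to be scaled to comparable ranges, and the threshold would have to be tuned against known fraud cases to keep false positives low.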
Evaluation of Results
From the diagram, we can see that the cluster shown in red groups POS transactions
of similar amount, frequency and geographical area, while the cluster shown in blue
groups ATM transactions by the same criteria. A third cluster can be formed in the
same way. For our main purpose, the data points that do not fall into any cluster
appear as isolated points in the diagram. These indicate potentially fraudulent
transactions and will be flagged for further investigation.
Project Plan
For successful deployment of this model, the proposal will follow the CRISP-DM
standard methodology for data mining.
The timeline and budget of this proposal are broken down in the table below.
Please note that a contingency of 20% is already included to cover any unexpected
expenditure. The total estimated budget is $27,404.00, which is realistic considering
the total time the project deployment would take and the contingency already
included. We will also offer full support for any technical difficulties for a period
of 4 months after deployment.
Subtotal            19,920
Hardware costs       3,500
Contingency (20%)    3,984
Total               27,404
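The budget figures can be cross-checked with a few lines of arithmetic. The sketch assumes that the 20% contingency applies to the subtotal only and not to hardware costs, which is the reading that reproduces the stated numbers; the variable names are illustrative.

```python
SUBTOTAL = 19_920          # dollars
HARDWARE = 3_500           # dollars
CONTINGENCY_PERCENT = 20   # applied to the subtotal only (assumption)

# Integer arithmetic avoids floating-point rounding on currency amounts.
contingency = SUBTOTAL * CONTINGENCY_PERCENT // 100
total = SUBTOTAL + HARDWARE + contingency
```

Under that assumption the contingency comes to $3,984 and the total to $27,404, matching the table.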
We hope that you will consider our approach to this data mining solution for
countering the fraud problem and give us the opportunity to work with your esteemed
bank.
Reference
John, S.N., Anele, C., Kennedy, O.O., Olajide, F. & Kennedy, C.G. 2016, 'Realtime
Fraud Detection in the Banking Sector Using Data Mining Techniques/Algorithm', 2016
International Conference on Computational Science and Computational Intelligence
(CSCI), pp. 1186-91.