
32130 - Fundamentals of Data Analytics

Assignment 1 - The Data Analytics Consultant

Unearthing Fraudulence by Adopting Data Analytics
09-04-2017

Abhishek Dangi
UTS Student ID - 12736625

Contents
Overview, Goals
Specifications
Milestones, Data Mining Scenario
Data Exploration
Proposed Methodology
Project Plan
References

Overview
“As of 2012, about 2.5 exabytes of data is created each day, and that number is
doubling every 40 months or so.” – Harvard Business Review, Big Data: The
Management Revolution
We have always had "Big Data"; now we also have fast processing. The world we live in continues to shrink thanks to technology innovations that make it easy to do so much. Fraud is a billion-dollar business that grows every year and now affects almost every financial industry. Since big banks like yours generate and contribute to financial, taxation, transaction, credit, investment and many other databases, this big data should be used to provide a smooth and secure environment for clients, businesses and financial institutions.
Thus, this proposal for a data mining project aims to help you better understand the patterns that emerge in bank fraud detection. As our economy moves toward cashless payments, better security measures in this fast-paced online world are a must for everyone: the bank, its customers and businesses. This project therefore focuses mainly on carding (credit card fraud).

Goals
❖ Generating lists of potentially fraudulent profiles from various databases.
❖ Predicting fraud using predictive analytics.
❖ Recognizing patterns to narrow down repeated fraudulent transactions.
❖ Creating an AI solution for the bank that detects and prevents online transaction fraud in real time using a client's previous activity.
❖ Auditing various databases to reveal potential security flaws, determining which sectors have been targeted most by fraudulent activities, and providing solid security measures.
❖ Minimizing the possibility of false positives.

Specifications
● Data Mining
○ Collecting and sorting data from various databases, such as ATM and POS transactions, online transactions, geolocation, credit card services, mortgages and criminal records.
○ Applying data preprocessing techniques such as validation, error correction and imputing missing or incorrect values.
○ Creating client profiles using the resulting data.
○ Clustering and classification to find patterns and associate data into groups.
○ Applying algorithms to find abnormalities in transaction behavior based on previously gathered models and profiles.
○ Estimating risk and predicting future transactions.

● Implementation of machine learning for automated fraud detection.


○ Data mining for classification, clustering and segmentation of the data.
○ Automatically finding associations and rules in the data that may signify interesting patterns, including those related to fraud.
○ Preparing a system to detect fraud in the form of profiles.
○ Pattern recognition to detect classes, clusters or patterns of suspicious behavior automatically, or to match specified rules.
○ Machine learning to automate the process of fraud detection.
○ Declining or reporting a transaction in advance when known fraud behaviors are observed in the system.

Milestones

Financial institutions have been using data analysis to detect fraud for many years. Fraud detection requires complex and time-consuming investigations that deal with different domains such as finance, economics and business practices. Fraud often consists of repeated transactions using the same method.

The first industries to implement data analytics techniques were telephone companies and banks (Decker 1998). One early example of data analysis in the banking industry is the FICO Falcon fraud assessment system, which is based on a neural network shell.

Retail industries are also victims of fraud at the point of sale (POS). As a result, some supermarkets have started to use digitized CCTV together with POS data for the transactions most susceptible to fraud.
Internet transactions have recently raised major concerns, with some research showing that internet transaction fraud is 12 times higher than in-store fraud.

Data Mining Scenario and Methodology Using CRISP-DM

Problem Definition
 Business problem: strengthen the fraud detection strategy by using historical data to identify transaction profiles that are likely to be involved in fraudulent activities, for further detailed examination.
 Analytics problem: define a target variable T and use historical data to build a predictive model for T, i.e. T = 0 for good observations; T = 1 for potential fraud observations.
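The analytics problem above can be sketched with a small logistic regression fitted by gradient descent; the two toy features (scaled transaction amount and distance from the home address) are illustrative assumptions, not the final feature set.

```python
# A minimal sketch of a predictive model for the binary target T
# (T = 0 good, T = 1 potential fraud): logistic regression fitted by
# gradient descent on hypothetical toy features.
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    X1 = np.hstack([np.ones((len(X), 1)), X])   # add intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))       # predicted P(T = 1)
        w -= lr * X1.T @ (p - y) / len(y)       # gradient step
    return w

def predict_proba(w, X):
    X1 = np.hstack([np.ones((len(X), 1)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ w))

# Toy history: (scaled amount, scaled distance from home address).
X = np.array([[0.1, 0.0], [0.2, 0.1], [0.15, 0.05],   # T = 0
              [0.9, 0.8], [0.95, 0.9], [0.85, 0.7]])  # T = 1
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
print(predict_proba(w, np.array([[0.9, 0.85]])))  # high -> likely fraud
```

In practice the model would be trained on the bank's labelled historical transactions and the predicted probability compared against a business-chosen threshold.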

Data Exploration

As the name suggests, in this phase data is explored without a fixed idea of what is being looked for. The data will be collected and prepared for the mining tasks planned for the next phase.

First and foremost, the data must be collected correctly. We aim to use the data your bank has already collected, for example customer profiles, credit card statements and customer track records from your database. Altogether we will extract the following attributes and format them into specific data types:

 Date of birth – a datetime type
 Income of customer – as a number
 Occupation of card holder – as categories (engineer, government employee, bank employee)
 Transaction amount – as a number
 Number of large transactions – as a number
 Frequency of large transactions – as a number
 Location of transaction – post code, country code, POS machine code
 Time of transaction – as a datetime type
 IP address of the online transaction – as a number
 Daily transaction limit – as a number set by the client
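The casting of the extracted attributes to the data types listed above could be sketched with pandas; the column names and sample values are assumptions for illustration.

```python
# A minimal sketch of formatting extracted attributes into specific
# data types with pandas; column names and values are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "date_of_birth": ["1-1-1996", "22-11-1969"],
    "income": ["52000", "87000"],
    "occupation": ["engineer", "govt. employee"],
    "transaction_amount": ["99.95", "2500"],
    "time_of_transaction": ["2017-04-09 12:50", "2017-04-09 23:30"],
})

raw["date_of_birth"] = pd.to_datetime(raw["date_of_birth"], format="%d-%m-%Y")
raw["income"] = pd.to_numeric(raw["income"])
raw["occupation"] = raw["occupation"].astype("category")
raw["transaction_amount"] = pd.to_numeric(raw["transaction_amount"])
raw["time_of_transaction"] = pd.to_datetime(raw["time_of_transaction"])
print(raw.dtypes)
```

Enforcing types up front means the later validation step can reject malformed rows instead of silently miscomparing strings and numbers.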

Other Interesting Values

 Round-number transactions
 Even dollar amounts
 Consumer-disputed transactions
 Declined authorization amounts
 Risk score
 Foreign card amounts

[Diagram: a decision model that uses the age, income and occupation of the card holder, the number of large purchases on the card, the frequency of large purchases and the location where large purchases took place to classify each transaction as T = 0 (transaction OK) or T = 1 (transaction probably fraud).]
Data Preparation

Since there is no need to change the current collection or organization of the data, we will concentrate on smoothing the gathered data to remove inconsistencies. We will therefore remove or replace any missing values in the database to maintain accuracy.

As almost all products nowadays do not have round-number prices (for example, items are generally priced at $599 rather than $600), keeping an eye out for round-number transactions could be useful.

The same can be said for even-dollar transaction amounts: an amount with two zeroes after the decimal point, such as $1,200.00, could also be categorized as a suspicious transaction.
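The round-number and even-dollar checks described above could be sketched as follows; the sample amounts and the choice of "multiple of 100" as the round-number rule are illustrative assumptions.

```python
# A minimal sketch of the round-number and even-dollar checks;
# thresholds and sample amounts are illustrative.

def is_round_amount(amount):
    """Flag round-number amounts such as 600 rather than 599."""
    return amount % 100 == 0

def is_even_dollar(amount):
    """Flag whole-dollar amounts with .00 cents, such as 1250.00."""
    return round(amount % 1, 2) == 0.0

suspicious = [a for a in [599.95, 600.0, 1250.00, 99.95]
              if is_round_amount(a) or is_even_dollar(a)]
print(suspicious)  # -> [600.0, 1250.0]
```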

We will use the data gathered from your databases to organize the tables below. The first, the customer table, organizes the customer details collected by the bank during account opening and from public records. The second table summarizes the historical and real-time data fed into the bank's databases; it will be updated with the details of transactions carried out by customers.

Customer Table
Account no. | DOB        | Address   | Post code | Acc balance ($) | Transaction limit ($) | Card type
2256633     | 1-1-1996   | 1st St    | 2144      | 1299            | 500                   | debit
2256622     | 1-1-1994   | Susan St  | 2000      | 5699            | 700                   | credit
2256611     | 1-5-1969   | Broadway  | 2032      | 13982           | 3000                  | debit
2256603     | 1-4-1992   | George St | 2546      | 23695           | 5000                  | debit
2256637     | 2-5-1963   | Mack Rd   | 2554      | 10233           | 2000                  | credit
2256665     | 22-11-1969 | CBD       | 2596      | 36955           | 6000                  | debit

Transaction Table
Transaction ID | Transaction amount ($) | Transaction type | IP address of transaction | Time of transaction | Risk value (0-10)
223545         | 99.95                  | POS              | 192.168.1.1               | 12:50               | 0
223543         | 875                    | Transfer         | 192.168.1.1               | 11:25               | 3
223541         | 466                    | POS              | 192.168.1.1               |                     | 2
223542         | 2500                   | ATM              | 192.168.1.1               | 23:30               | 8
223549         | 6500                   | ATM              | 192.168.1.1               | 22:21               | 9

Proposed Methodology
To predict whether a transaction is genuine or fraudulent, an effective data mining solution is essential.

Clustering
Clustering is the task of grouping a set of objects so that objects in the same group, or cluster, are more similar to each other than to objects in other clusters. It is the main task of exploratory data mining. Our suggested solution tackles the credit card fraud problem described above in two ways, both based on pattern recognition. Each customer's profile data and historical transaction data are clustered into one common profile based on the average large-transaction amount (the number of large transactions per specific interval of time). In addition, a model is formed based on the geolocation of the transactions.

We will use the technique of fuzzy clustering (also known as soft clustering), in which each object, such as the geolocation, frequency and nature of a transaction, is assigned a degree of membership in each cluster. Objects or values that do not belong strongly to any cluster can then be ruled out as outliers and flagged as potentially fraudulent transactions.
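The fuzzy clustering idea can be sketched with a small fuzzy c-means implementation in numpy; a point whose strongest membership is still weak is treated as a potential outlier. The two scaled features, the sample points and the 0.9 membership threshold are assumptions for illustration only.

```python
# A minimal numpy sketch of fuzzy c-means on two hypothetical scaled
# features; weak membership in every cluster marks a potential outlier.
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)           # memberships sum to 1
    for _ in range(steps):
        W = U ** m                              # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))          # inverse-distance update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[0.1, 0.1], [0.12, 0.09],         # cluster of small txns
              [0.9, 0.9], [0.88, 0.92],         # cluster of large txns
              [0.5, 0.5]])                      # fits neither cluster well
centers, U = fuzzy_cmeans(X)
outliers = np.where(U.max(axis=1) < 0.9)[0]     # weak membership everywhere
print(outliers)  # point 4 belongs strongly to neither cluster
```

The membership threshold would be tuned on the bank's own data so that genuinely ambiguous transactions, not merely unusual ones, are escalated.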

Distribution-based clustering

In this model, clusters are defined as objects belonging to the same distribution. For example, we will take the geolocation of a transaction and try to cluster it with other transactions of a similar kind (such as transactions of amounts similar to previous transactions in that geolocational area). A convenient property of this approach is that it closely resembles the way artificial data sets are generated, by sampling random elements from a distribution. This clustering will create models that can capture correlations and dependencies between attributes.

We propose the following steps to create such a model:

1. First, we will smooth the data, as described above, to handle any incomplete records.
2. Second, we will cluster the data stored in the database by grouping similar data together. For example, the generalized large transactions are confined within a specific frequency and geolocational area; the transaction limit is also taken into consideration.
3. These distributed values will then be used in conjunction with other attributes from the customer and transaction tables.
4. The most common types of transactions will then be ruled out, and other algorithms will be applied to the remaining unclustered data for further processing.
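The steps above could be sketched by modelling transaction amounts within each geolocational group as normally distributed and flagging amounts far outside that distribution; the groups, amounts and z-score threshold below are illustrative assumptions.

```python
# A minimal sketch of the distribution-based step: amounts in each
# geolocational group are modelled as normal, and amounts with a large
# standard score are flagged. Data and threshold are hypothetical.
import numpy as np

def flag_outliers(amounts_by_area, z_threshold=3.0):
    flagged = {}
    for area, amounts in amounts_by_area.items():
        a = np.asarray(amounts, dtype=float)
        mu, sigma = a.mean(), a.std() + 1e-9    # avoid division by zero
        z = np.abs(a - mu) / sigma              # standard score per txn
        flagged[area] = a[z > z_threshold].tolist()
    return flagged

amounts_by_area = {
    "2144": [55, 60, 58, 62, 57, 59, 61, 950],  # 950 is anomalous here
    "2000": [400, 420, 410, 415, 405],
}
print(flag_outliers(amounts_by_area, z_threshold=2.0))
```

Note that the same $950 amount would be unremarkable in an area where large transactions are routine; the per-area distributions are what make the flag meaningful.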
The output of the above process can be summarized by the diagram below:

[Diagram: clusters of transactions; a red cluster of POS transactions, a blue cluster of ATM transactions, and unclustered points indicating potentially fraudulent transactions.]
Evaluation of Results

From the diagram, we can see that the red cluster represents similar POS transactions of similar amounts in a similar geographical area, while the blue cluster represents ATM transactions under the same criteria; a third cluster could be formed in the same way. Most importantly for our purposes, the unfamiliar data points that do not fall into any cluster appear as abandoned points in the diagram; these are indications of fraudulent transactions and will be flagged for further investigation.

Deployment to your company


First, we will deploy the model described above in your bank and track down repeated patterns of flagged activity. We will also take customer feedback into consideration. For example, if clients flag transactions made from their accounts as fraudulent, we will look for patterns such as a specific online shopping site whose transactions are reported most often. We will then feed that data back into our model to make it more foolproof.
In the meantime, we will help your bank strengthen its databases and develop a model that can track down fraudulent transactions in real time.
In the future, we will derive and deliver more intelligent and faster models, which will also narrow down the possibility of false positives.

Project Plan
For successful deployment of this model, the proposal will follow the CRISP-DM standard methodology for data mining.
The timeline and budget of this proposal are summarized in the table below. Please note that a contingency of 20% is already included to cover any unexpected expenditure. The total estimated budget is $27,404.00, which is realistic considering the total time the project deployment would take and the contingency already included. We will also offer full support for any technical difficulties for a period of four months after deployment.

CRISP-DM Phase         | Task                           | Duration (days) | Developer hours | Cost per dev. hour ($) | Cost ($)
Business understanding | Prepare business understanding | 1               | 8               | 90                     | 720
Business understanding | Determine data mining goal     | 1               | 8               | 100                    | 800
Business understanding | Production of model            | 1               | 8               | 100                    | 800
Data understanding     | Collect initial data           | 1               | 8               | 100                    | 800
Data understanding     | Verify data quality            | 1               | 8               | 100                    | 800
Data understanding     | Smoothing of data              | 1               | 8               | 100                    | 800
Data preparation       | Select data                    | 1               | 8               | 100                    | 800
Data preparation       | Integrate data                 | 1               | 8               | 100                    | 800
Data preparation       | Format data                    | 1               | 8               | 100                    | 800
Modelling              | Select modelling technique     | 1               | 8               | 100                    | 800
Modelling              | Build the model                | 5               | 40              | 100                    | 4,000
Modelling              | Assess the model               | 1               | 8               | 100                    | 800
Evaluation             | Evaluate result                | 2               | 16              | 100                    | 1,600
Deployment             | Plan deployment                | 1               | 8               | 100                    | 800
Deployment             | Monitoring and maintenance     | 1               | 8               | 100                    | 800
Deployment             | Produce final report           | 1               | 8               | 100                    | 800
Deployment             | Train employees                | 3               | 24              | 100                    | 2,400
Deployment             | Review project                 | 1               | 8               | 100                    | 800

Subtotal: $19,920
Hardware costs: $3,500
Contingency (20%): $3,984
Total: $27,404

We hope you will consider our approach to this data mining solution for countering the fraud problem and give us the opportunity to work with your esteemed bank.

References

IOSR Journal of Business and Management (IOSR-JBM), e-ISSN: 2278-487X, p-ISSN: 2319-7668, Volume 18, Issue 1, Ver. II (Jan. 2016), pp. 09-14.

John, S.N., Anele, C., Kennedy, O.O., Olajide, F. & Kennedy, C.G. 2016, 'Realtime Fraud Detection in the Banking Sector Using Data Mining Techniques/Algorithm', 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1186-91.

Ekizoglu, B. & Demiriz, A. 2015, 'Fuzzy rule-based analysis of spatio-temporal ATM usage data for fraud detection and prevention', 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1009-14.
