
IEEE CIS Fraud Detection

Team: Seekers
Kaveri Biswas (DT2019003), Keerthana P Girijan (DT2019004),
Shefali Bedarkar (DT2019008)
Agenda

• Project Description
• Features
• Exploratory Data Analysis
• Missingness and Imputation
• PCA and Feature Engineering
• Balancing Techniques
• Modeling and Score
• Future Work
Project Description

• The dataset of credit card transactions is provided by Vesta Corporation, described as the world’s
leading payment service company
• The dataset is divided into two files, transaction and identity, for both train and test
• Train dataset: 354324 x 434; Test dataset: 236216 x 433
• ‘isFraud’ is the binary target variable
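Combining the transaction and identity files into one table per split can be sketched as a left join. This is a minimal illustration, not the team's actual code: it assumes both files share a `TransactionID` key column (the key name is not stated above), and it uses tiny made-up frames in place of the real files.

```python
import pandas as pd

# Tiny stand-ins for the transaction and identity files (all values are made up).
# "TransactionID" is an assumed join key; the slides do not name it.
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAMT": [50.0, 120.5, 9.99],
    "isFraud": [0, 1, 0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 3],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction, including those with no identity record;
# their identity columns are filled with NaN.
train = transaction.merge(identity, on="TransactionID", how="left")
print(train)
```

A left join is used here so that transactions without identity information are kept rather than dropped, at the cost of missing values in the identity columns.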
Features
• Transaction features:
• TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
• TransactionAMT: transaction amount paid in USD
• ProductCD: product code, the product for each transaction
• card1 – card6: payment card information, such as card type, card category, issuing bank, country, etc.
• addr: address
• dist: distance
• P_ and R_ emaildomain: purchaser and recipient email domain
• C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual
meaning is masked
• D1-D15: timedelta, such as days between previous transaction, etc.
• M1-M9: match, such as names on card and address, etc.
• Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
• Identity Features:
• Categorical Features: DeviceType, DeviceInfo, id_12 – id_38
Exploratory Data Analysis (EDA)
• While conducting EDA, we found that the data was sparse
• Only 3.52% of all transactions are labeled as fraud (‘isFraud’ = 1), so the classes are heavily imbalanced
• V and id features in train data have more than 70% missing values
• Another observation concerned ‘TransactionDT’: the train and test data appear to have been collected over
the same period, but the train data contains more values than the test data
• We created new columns such as hour, day, week, and month to take a closer look at the relationship
between time and the target
• The fraction of fraudulent transactions appears noticeably higher between 4am and 12pm than at other
hours, peaking from 7am to 10am, and is lowest from 2pm to 4pm. This suggests a new feature that
classifies time periods into warning levels according to their fraud fraction.
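The time-based columns and the warning-level feature described above could be derived roughly as follows. This is a sketch under stated assumptions, not the team's implementation: it assumes `TransactionDT` is measured in seconds from the reference point, the helper names `add_time_features` and `add_warning_level` are hypothetical, the choice of three levels is arbitrary, and the toy data is made up.

```python
import pandas as pd

SECONDS_PER_HOUR = 3600          # assumes TransactionDT is in seconds
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def add_time_features(df):
    """Return a copy of df with hour-of-day, day index, and week index columns."""
    out = df.copy()
    out["hour"] = (out["TransactionDT"] // SECONDS_PER_HOUR) % 24
    out["day"] = out["TransactionDT"] // SECONDS_PER_DAY
    out["week"] = out["day"] // 7
    return out


def add_warning_level(df, n_levels=3):
    """Bin each hour into n_levels ordered levels by its fraud fraction."""
    out = df.copy()
    fraud_by_hour = out.groupby("hour")["isFraud"].mean()
    # Ranking with method="first" breaks ties, so qcut always sees distinct values.
    ranks = fraud_by_hour.rank(method="first")
    levels = pd.qcut(ranks, q=n_levels, labels=False)  # 0 = lowest fraud fraction
    out["warning_level"] = out["hour"].map(levels)
    return out, fraud_by_hour


# Toy data (made up): two transactions each at 8am, 3pm, and 8pm.
toy = pd.DataFrame({
    "TransactionDT": [8 * 3600, 8 * 3600 + 10, 15 * 3600, 15 * 3600 + 5,
                      20 * 3600 + SECONDS_PER_DAY, 20 * 3600],
    "isFraud": [1, 1, 0, 0, 1, 0],
})
features = add_time_features(toy)
labeled, fraud_by_hour = add_warning_level(features)
print(fraud_by_hour)
print(labeled[["hour", "day", "warning_level"]])
```

On real data the fraud fraction per hour should be computed on the training split only and then mapped onto the test split, so that the feature does not leak target information.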
