
Fraud Detection in Banking Data

using Machine Learning

SHIVANESH A/L SIVAKUMAR


MCS231014
DR HABIBOLLAH BIN HARON
Problem Background
• Banking industry sees a surge in fraudulent
activities
• Traditional rule-based systems inadequate for
evolving fraud tactics
• Legitimate transactions vastly outnumber
fraudulent ones, complicating detection
• Machine learning offers promise but faces
challenges in large, imbalanced datasets
• Need for a dynamic fraud detection system that
adapts to evolving threats in real-time
Problem Statement
• What machine learning techniques can be used to detect fraudulent activities in banking data?
• How can we improve the performance of fraud detection systems for unbalanced datasets?
• What are the most predictive features of fraudulent transactions?

Objective
• To apply Bayesian optimization to fine-tune hyperparameters, improving the overall performance
of fraud detection algorithms and increasing computational efficiency.
• To develop and compare the performance of XGBoost, Random Forest, and Artificial Neural
Networks in detecting fraudulent transactions in banking data.
• To investigate the feature importance provided by each algorithm, giving insights into which
features are most predictive of fraudulent transactions.
Literature Review
Fraud in Banking Industry
• Cressey’s (1953) fraud triangle: Financial pressure, perceived opportunity, and rationalization drive fraud.
• Financial pressure, from diverse sources, motivates fraudulent behavior.
• Hollow’s (2014) study: Financial pressure is a key factor for bank employees, varying by position.
• Hidajat’s (2020) research: Greed is a chief non-financial pressure for higher-ranking individuals in Indonesian rural
banks.
• Weak internal controls provide fraud opportunities (Ilter 2014, Hollow 2014, Asmah 2020).
• Contributing factors to opportunities: Poorly defined duties, lack of documentation, delayed transactions, and
inadequate controls (Kazemian et al. 2019).
• Weak internal control systems in banks facilitate fraud (Sanusi et al. 2015).
• Individuals with low self-control are more prone to fraud (Holtfreter et al. 2010).
Bayesian Optimization

• Invented by Jonas Mockus in the 1970s and 1980s.


• Optimizes algorithm performance through Bayesian statistical modeling.
• Components: Bayesian model for objective function, acquisition function for sampling decisions.
• Begins with space-filling experimental design, often random points.
• Iteratively allocates remaining budget for function evaluations.
• Enhances algorithm performance by optimizing hyperparameters.
• Applicable beyond machine learning: used in robotics, sensor placement, drug discovery, and
engineering design.
• Versatile and adaptable, making it valuable across diverse domains.
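The loop described above (surrogate model + acquisition function, random space-filling start, then iterative spending of the evaluation budget) can be sketched on a made-up 1-D objective. The "surrogate" below is a deliberately crude nearest-neighbour stand-in for a Gaussian process, chosen only to keep the sketch self-contained; real Bayesian optimization libraries fit a proper probabilistic model.

```python
import random

# Toy sketch of the Bayesian-optimization loop: a surrogate model of the
# objective plus an acquisition function deciding where to sample next.
# The surrogate is a crude nearest-neighbour stand-in for a Gaussian
# process; the objective f is a made-up example, not a real tuning task.

def f(x):                       # expensive black-box objective (illustrative)
    return (x - 2.0) ** 2

def surrogate(x, observed):
    # Predicted mean = value at the nearest evaluated point;
    # "uncertainty" grows with distance from it (GP-like intuition only).
    nearest_x, nearest_y = min(observed, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(x - nearest_x)

def acquisition(x, observed, kappa=1.0):
    mean, sigma = surrogate(x, observed)
    return mean - kappa * sigma       # lower confidence bound (minimization)

random.seed(0)
# 1) space-filling initial design (random points)
observed = [(x, f(x)) for x in (random.uniform(0.0, 5.0) for _ in range(3))]
# 2) iteratively allocate the remaining evaluation budget
for _ in range(15):
    candidates = [random.uniform(0.0, 5.0) for _ in range(200)]
    x_next = min(candidates, key=lambda x: acquisition(x, observed))
    observed.append((x_next, f(x_next)))

best_x, best_y = min(observed, key=lambda p: p[1])
print(best_x, best_y)   # best_x lands near the true minimum at x = 2
```

In the actual study, x would be a vector of hyperparameters (e.g. tree depth, learning rate) and f would be a cross-validated score of the fraud-detection model.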
XGBoost

• Regularized gradient boosting library with interfaces for C++, Java, Python, R, Julia, Perl, and Scala.
• Developed by Tianqi Chen as part of the DMLC (Distributed Machine Learning Community) research project.
• Initial version: a terminal application configured via a LIBSVM (Library for Support Vector Machines)
configuration file.
• Boosting algorithm based on gradient boosted trees.
• Avoids overfitting with a regularization term.
• Utilizes parallel and distributed computing for faster model creation (Huang, 2014).
• Employs a sparsity-aware algorithm to handle missing values in split gain computation.
• Applied in finance, healthcare, and e-commerce for fraud detection, churn prediction, and credit risk modeling.
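The core boosting idea behind XGBoost (each round fits a small tree to the current residuals, with a shrinkage factor acting as a simple regularizer) can be shown in a minimal pure-Python sketch. This is an illustration of the principle only, not XGBoost's actual regularized objective or split-finding code; the data are made up.

```python
# Minimal sketch of gradient boosted trees: each round fits a depth-1 tree
# (a "stump") to the current residuals, and a shrinkage factor (learning
# rate) plays a simple regularizing role. Illustration only -- XGBoost adds
# an explicit regularization term, sparsity handling, and parallelism.

def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error of a two-mean fit."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.3):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]   # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]       # made-up data, roughly y = x
model = boost(xs, ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse, 4))   # training error shrinks toward zero as rounds grow
```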
Random Forest

• Built on decision tree algorithm for regression and classification.


• Effective for high-accuracy predictions, especially with large datasets.
• Combines multiple classifiers to solve complex problems.
• Predicts the average output of the trees in the forest (majority vote for classification) for enhanced precision.
• Overcomes single decision tree limitations such as overfitting.
• Each tree is a weak learner, but together they form a strong learner.
• Fast and effective for large and unbalanced datasets.
• Limited performance on regression problems across different datasets.
• Olena et al. (2020) proposed a system using random forest and isolation tree techniques, validated on
identifying users' location during transactions; however, the study does not focus enough on keeping
data confidential and private.
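The recipe above (many weak learners trained on bootstrap samples, combined by majority vote into a strong learner) can be sketched with one-feature threshold "stumps" on made-up data. Real random forests use full decision trees and random feature subsets; this toy keeps only the bagging-plus-voting core.

```python
import random

# Toy sketch of the random-forest recipe: many weak learners (one-feature
# threshold stumps) trained on bootstrap samples, combined by majority
# vote. Illustration only -- real forests grow full trees and also sample
# random feature subsets at each split.

def fit_stump(sample):
    # pick the threshold with the fewest training errors on this sample
    best = None
    for t, _ in sample:
        errs = sum((x > t) != y for x, y in sample)
        if best is None or errs < best[0]:
            best = (errs, t)
    return best[1]

def forest_predict(thresholds, x):
    votes = sum(x > t for t in thresholds)
    return votes * 2 > len(thresholds)        # majority vote

random.seed(1)
# made-up labelled data: class 1 when x > 2.5, plus a little label noise
data = [(x / 10.0, x / 10.0 > 2.5) for x in range(50)]
data[5] = (0.5, True)                          # noisy example
data[40] = (4.0, False)                        # noisy example

# bootstrap-sample the training set and fit one stump per sample
thresholds = [fit_stump([random.choice(data) for _ in range(len(data))])
              for _ in range(25)]

acc = sum(forest_predict(thresholds, x) == (x > 2.5)
          for x, _ in data) / len(data)
print(acc)    # the ensemble recovers the true rule despite the noisy labels
```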
Artificial Neural Networks (ANN)
• Inspiration from the human brain's structure led to the creation of Artificial Neural
Networks (ANNs).
• Warren McCulloch and Walter Pitts proposed the first mathematical model of a neuron in
1943.
• ANNs perform brain-like tasks and are categorized as supervised or unsupervised.
• Unsupervised neural networks have achieved 95% accuracy in fraud detection by identifying
patterns in credit card transactions.
• Resilient to errors; can generate output with corrupted cells.
• Effective for Credit Card Fraud Detection (CCFD) due to high speed and processing
capabilities.
• According to Mahji, a combination of ANN and clustering excels in detecting fraudulent
transactions.
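The building block ANNs stack is the artificial neuron: a weighted sum of inputs passed through an activation function. McCulloch and Pitts's 1943 model used a hard threshold; the sketch below uses a smooth sigmoid instead, with hand-picked (not learned) weights wiring one neuron to behave like logical AND.

```python
import math

# Minimal sketch of the artificial neuron that ANNs are built from:
# a weighted sum of inputs passed through an activation function.
# Weights and bias below are hand-picked for illustration, not learned.

def neuron(inputs, weights, bias):
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))       # sigmoid activation

# a single neuron wired to behave like logical AND on {0,1} inputs
weights, bias = [10.0, 10.0], -15.0
for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(neuron([a, b], weights, bias)))
# prints the AND truth table: outputs 0, 0, 0, 1
```

A fraud-detection ANN layers many such units and learns the weights from labelled transactions instead of fixing them by hand.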
Related Works
• Halvaiee & Akbari (2014): AIRS-based Fraud Detection Model (AFDM)
⚬ Proposed an AIRS-based model (an immune-system-inspired algorithm).
⚬ Improved fraud detection by up to 25%.
⚬ Reduced costs by up to 85%, system response time by up to 40%.
• Bahnsen et al. (2016): Transaction Aggregation Strategy
⚬ Developed strategy with von Mises distribution.
⚬ Introduced cost-based criterion.
⚬ Extended strategy for new features.
• Randhawa et al. (2018): Machine Learning Algorithms
⚬ Studied various models.
⚬ Proposed hybrid method with AdaBoost, majority voting for effective fraud detection.
• Porwal and Mukund (2018): Outlier Detection using Clustering
⚬ Proposed clustering for outlier detection.
⚬ Resistant to changing patterns.
⚬ Preferred precision-recall curve over ROC (receiver operating characteristic).
Comparison Between Algorithms
Performance Evaluation
• Accuracy:
⚬ Ratio of correct predictions to total predictions; suitable for balanced classes.
• Precision:
⚬ Ratio of correctly predicted positive observations to total predicted positives; penalizes
false positives.
• Recall (Sensitivity):
⚬ Ratio of correctly predicted positive observations to all actual positives; measures capture
ability.
• F1 Score:
⚬ Harmonic mean of Precision and Recall; balances precision and recall in imbalanced
classes.
• ROC Curve (Receiver Operating Characteristics):
⚬ Plot of true positive rate against false positive rate
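All four metrics above follow directly from the confusion-matrix counts. A short sketch with made-up counts also shows why accuracy alone misleads on imbalanced fraud data:

```python
# The metrics above, computed from a small hypothetical confusion matrix
# (counts are made up for illustration). With heavy class imbalance,
# accuracy looks excellent even when recall on fraud is poor -- which is
# why precision, recall, and F1 matter for this problem.

tp, fp, fn, tn = 40, 10, 60, 9890     # fraud = positive class (made-up counts)

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy is 0.993 while recall is only 0.40: imbalance hides the misses
```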
Research Framework
Phase 1 : Data Acquisition

• Proposed cooperation with Maybank's data security and compliance team.


• Suggested signing a non-disclosure agreement for formalizing data use and security
conditions.
• Stressed positive impact on academic goals and banking industry's credit card security.
• Approached Maybank transparently, aiming for a collaborative relationship.
• Demonstrated commitment to responsible and ethical research practices.
Phase 2 : Data Cleaning and Data Exploration
• Python is used for crucial data cleaning, removing or modifying incorrect, incomplete,
irrelevant, or duplicated data.
• Data quality directly affects machine learning model effectiveness.
• Address missing values through imputation or removal.
• Remove duplicate rows to prevent model bias.
• Manage outliers using techniques like scaling and normalization.
• Data exploration is essential for understanding patterns and characteristics.
• Begins with variable identification, recognizing input and target variables, data types, and
categories
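The cleaning steps listed above can be sketched on a made-up transaction table. In practice this phase would use pandas on the real dataset; the standard-library version below only illustrates the same three operations: dropping duplicates, imputing missing values, and scaling.

```python
# Sketch of the cleaning steps on a made-up transaction table, using only
# the standard library (in practice: pandas on the real banking data).
# Steps: drop duplicate rows, impute missing amounts with the column
# mean, then min-max scale the amount column.

rows = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},      # missing value
    {"id": 3, "amount": 80.0},
    {"id": 3, "amount": 80.0},      # duplicate row
    {"id": 4, "amount": 200.0},
]

# 1) remove exact duplicates (keep the first occurrence)
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2) impute missing amounts with the column mean
known = [r["amount"] for r in deduped if r["amount"] is not None]
mean = sum(known) / len(known)
for r in deduped:
    if r["amount"] is None:
        r["amount"] = mean

# 3) min-max scale amounts into [0, 1]
lo = min(r["amount"] for r in deduped)
hi = max(r["amount"] for r in deduped)
for r in deduped:
    r["scaled"] = (r["amount"] - lo) / (hi - lo)

print(deduped)   # 4 rows; the missing amount is filled with the mean
```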
Phase 3 : Bayesian Optimization, Machine Learning and Data
Visualization
• Machine learning phase involves Bayesian optimization for hyperparameter tuning.
• Aims for optimal hyperparameters, enhancing fraud detection algorithm performance.
• Results in accurate, efficient models, saving computational resources and time.
• Utilizes XGBoost, Random Forest, and Artificial Neural Network for feature importance
metrics.
• Data visualization using Python libraries presents machine learning results effectively.
Phase 4 : Performance Evaluation
• Precision: Gauges accuracy of positive predictions (True Positives / Total Predicted Positives).
• Accuracy: Measures how often the model predicts the outcome correctly (Correct Predictions /
Total Predictions).
• Recall: Assesses model's ability to identify actual positive instances (True Positives / Total Actual
Positives).
• F1 Score: Harmonic mean of precision and recall, useful for imbalanced classes.
• ROC-AUC Metric: Measures model's ability to differentiate between positive and negative instances.
• Data Visualization: Visualize precision, recall, and F1 scores with bar charts or line graphs for model
comparison; ROC curve illustrates diagnostic ability; Confusion matrices visualized in heatmap or table
format for interpretation and presentation.
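The ROC-AUC metric above has a useful rank-based reading: it equals the probability that a randomly chosen positive (fraud) case receives a higher model score than a randomly chosen negative one. A sketch on made-up scores (real models would use a library implementation such as scikit-learn's):

```python
# Hedged sketch of ROC-AUC on made-up model scores: AUC equals the
# probability that a random positive (fraud) case outscores a random
# negative case, with ties counted as half.

def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]   # hypothetical fraud scores
labels = [1,   1,   0,   1,   0,   0,   0]
print(roc_auc(scores, labels))   # → 11/12 ≈ 0.917
```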
