CAPSTONE CREDIT CARD DEFAULT DETECTION
BONAFIDE CERTIFICATE
Certified that this project report on “Capstone Credit Card Default Detection” is the bonafide work of “GAUTAM, VARDAAN AGARWAL”, who carried out the project work under my supervision.
SIGNATURE
Assistant Professor
Department of Computer Science
Maharaja Surajmal Institute
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgement has been made in the text.
Date-
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the BCA project undertaken during the BCA final year. We owe a special debt of gratitude to Dr. Pooja Singh, Assistant Professor, Department of Computer Science, for her constant support and guidance throughout the course of our work. Her sincerity, thoroughness and perseverance have been a constant source of inspiration for us. It is only because of her cognizant efforts that our endeavour has seen the light of day. We would also not like to miss the opportunity to acknowledge the contribution of all faculty members of the department for their kind assistance and cooperation during the development of our project. Last but not the least, we acknowledge our friends for their support during the completion of this project.
Date-
ABSTRACT
Credit card default detection is presently one of the most frequently occurring problems in the modern world. Credit risk plays a vital role in the banking sector. The main activities of banks involve granting loans, credit cards, investments, mortgages, and others.
Credit cards have been one of the most booming financial services offered by banks over the past few years. However, with the growing number of credit card users, banks have also been facing a rising number of defaults. As such, data analytics can provide solutions to tackle this phenomenon and reduce the resulting losses.
This project provides a performance evaluation of credit card default prediction with machine learning techniques. Classification algorithms like logistic regression, decision tree, and random forest are used to test the variables in predicting credit default, and random forest proved to have the highest accuracy, sensitivity, specificity and area under the curve.
This result shows that random forest best describes which factors should be considered when assessing the credit risk of a customer.
Software Requirements
Python
Google Colab
Numpy
Pandas
Matplotlib
Seaborn
Sklearn
Hardware Requirements
ER Diagram
Data Flow Diagram
The banking sector is one of the most volatile and vulnerable sectors in the world, with its increasing risk factors.
Credit card default is a serious concern for the banking sector. Banks are suffering huge losses due to their inability to recover the money granted to customers.
A large number of people apply for credit cards every year, due to which assessing whether an individual will be able to repay the loan has become very crucial in the banking sector.
So the aim of this project is to distinguish potential defaulters from non-defaulters and thereby reduce bad debt.
We can achieve this by applying machine learning classification models like Logistic Regression, Decision Tree, Random Forest Classifier, Naïve Bayes, etc.
This project aims to bridge this gap of uncertainty by utilizing a data-driven approach, using past data of credit card customers in conjunction with machine learning to predict whether or not a customer will default.
1. Python
2. Google Colab/ Jupyter Notebook
3. Numpy
4. Pandas
5. Matplotlib
6. Seaborn
7. Scikit Learn
Python
It provides a large standard library that covers areas like string operations, the Internet, web service tools, operating system interfaces and protocols. Most of the highly used programming tasks are already scripted into it, which limits the length of the code that needs to be written in Python.
Integration Feature
Python offers Enterprise Application Integration, which makes it easy to develop web services by invoking COM or CORBA components. It has powerful control capabilities as it can call directly into C, C++ or Java code. Python also processes XML and other markup languages, and it can run on all modern operating systems through the same byte code.
The language has extensive support libraries and clean object-oriented designs that increase a programmer’s productivity two to tenfold compared with languages like Java, VB, Perl, C, C++ and C#.
Productivity
Its strong process integration features, unit testing framework and enhanced control capabilities contribute to increased speed and productivity for most applications. It is a great option for building scalable multi-protocol network applications.
Google Colab
Colab is basically a free Jupyter notebook environment that runs entirely in the cloud. Most importantly, Colab does not require any setup, and the notebooks that you create can be simultaneously edited by your team members, in much the same way you edit documents in Google Docs. The greatest advantage is that Colab supports the most popular machine learning libraries, which can be easily loaded into your notebook.
We can perform the following using Google Colab;
1. Write and execute code in Python
2. Create/Upload/Share notebooks
3. Import/Save notebooks from/to Google Drive
4. Import/Publish notebooks from GitHub
5. Import external datasets
6. Integrate PyTorch, TensorFlow, Keras, OpenCV
7. Free Cloud service with free GPU
NUMPY
NumPy is a Python library used for working with arrays. NumPy stands for Numerical Python.
NumPy is a Python library and is written partially in Python, but most of the parts that require
fast computation are written in C or C++.
In Python we have lists that serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray, it provides a lot of supporting functions that make
working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
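As a brief illustration, here is a minimal sketch of creating and using an ndarray; the numeric values are invented purely for demonstration.

import numpy as np

# Create an ndarray from a Python list (values are illustrative only)
balances = np.array([1200.0, 540.5, 0.0, 9800.25])

# Vectorised operations run in compiled C code, far faster than Python loops
print(balances.mean())         # average of all elements
print(balances * 1.05)         # multiply every element by 1.05
print(balances[balances > 0])  # boolean indexing to filter values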
PANDAS
Pandas is an open-source library that is built on top of the NumPy library. It is a Python package that offers various data structures and operations for manipulating numerical data and time series. It is mainly popular because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for users.
Pandas makes it simple to do many of the time consuming, repetitive tasks associated with
working with data, including:
1. Data cleansing
2. Data fill
3. Data normalization
4. Merges and joins
5. Data visualization
6. Statistical analysis
7. Data inspection
8. Loading and saving data
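A brief, hedged sketch of a few of these tasks with pandas is given below; the file name credit_data.csv and the column names are illustrative assumptions, not the actual project files.

import pandas as pd

# Load a dataset (file name and column names are hypothetical)
df = pd.read_csv("credit_data.csv")

# Data inspection
print(df.head())
print(df.info())

# Data cleansing: drop rows where the target column is missing
df = df.dropna(subset=["Performance Tag"])

# Data fill: replace missing incomes with the median income
df["Income"] = df["Income"].fillna(df["Income"].median())

# Simple statistical analysis
print(df["Income"].describe())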
Matplotlib
One of the greatest benefits of visualization is that it allows us visual access to huge amounts of
data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter,
histogram etc.
Matplotlib is not part of the standard libraries installed by default with Python. There are several toolkits available that extend matplotlib’s functionality. Some of them are separate downloads, while others ship with the matplotlib source code but have external dependencies.
1. Basemap: It is a map plotting toolkit with various map projections, coastlines and political
boundaries.
2. Cartopy: It is a mapping library featuring object-oriented map projection definitions, and arbitrary
point, line, polygon and image transformation capabilities.
3. Excel tools: Matplotlib provides utilities for exchanging data with Microsoft Excel.
4. Mplot3d: It is used for 3-D plots.
5. Natgrid: It is an interface to the natgrid library for gridding irregularly spaced data.
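Returning to the core plotting API, the following is a minimal sketch of a bar chart and a histogram; the numbers are invented purely for illustration.

import matplotlib.pyplot as plt

# Hypothetical counts of non-defaulters vs defaulters
labels = ["Non-defaulter", "Defaulter"]
counts = [27000, 3000]

plt.figure()
plt.bar(labels, counts)
plt.title("Class distribution (illustrative data)")
plt.ylabel("Number of customers")
plt.show()

# Histogram of a hypothetical numeric column
incomes = [12, 18, 25, 31, 40, 22, 35, 28, 15, 45]
plt.figure()
plt.hist(incomes, bins=5)
plt.xlabel("Income (lakhs per annum)")
plt.ylabel("Frequency")
plt.show()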
Seaborn
Seaborn is an amazing visualization library for statistical graphics plotting in Python. It provides
beautiful default styles and color palettes to make statistical plots more attractive. It is built on
the top of matplotlib library and also closely integrated to the data structures from pandas.
Seaborn aims to make visualization the central part of exploring and understanding data. It
provides dataset-oriented APIs, so that we can switch between different visual representations
for same variables for better understanding of dataset.
Plots are basically used for visualizing the relationship between variables. Those variables can either be completely numerical or categorical, like a group, class or division. Seaborn divides plots into the categories below:
1. Relational plots: This plot is used to understand the relation between two variables.
2. Categorical plots: This plot deals with categorical variables and how they can be visualized.
3. Distribution plots: This plot is used for examining univariate and bivariate distributions
4. Regression plots: The regression plots in seaborn are primarily intended to add a visual guide that
helps to emphasize patterns in a dataset during exploratory data analyses.
5. Matrix plots: A matrix plot is an array of scatterplots.
6. Multi-plot grids: A useful approach is to draw multiple instances of the same plot on different subsets of the dataset.
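A small sketch of a few of these plot types is shown below; it uses seaborn’s bundled “tips” example dataset (downloaded on first use) purely for illustration.

import seaborn as sns
import matplotlib.pyplot as plt

# Load one of seaborn's example datasets
tips = sns.load_dataset("tips")

# Relational plot: relationship between two numeric variables
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()

# Categorical plot: distribution of a numeric variable per category
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Distribution plot: univariate distribution with a KDE curve
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()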
Scikit-learn
Scikit-learn is a free software machine learning library for the Python programming language. It
features various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is
a NumFOCUS fiscally sponsored project.
It is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
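As a hedged sketch of that consistent fit/predict interface, the snippet below trains a Naïve Bayes classifier (one of the models mentioned for this project) on synthetic data generated only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Small synthetic binary-classification dataset (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Every scikit-learn estimator exposes the same fit / predict interface
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))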
The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and
pioneer in the field of computer gaming and artificial intelligence.
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in
the machine learning field: "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P if its performance at tasks in T, as measured
by P, improves with experience E."
This definition of the tasks in which machine learning is concerned offers a fundamentally
operational definition rather than defining the field in cognitive terms. This follows Alan
Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question
"Can machines think?" is replaced with the question "Can machines do what we (as thinking
entities) can do?".
TYPES OF MACHINE LEARNING
1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning
Supervised Learning
Supervised learning is a type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Algorithms used: Support Vector Machine, Neural Network, Linear and Logistic Regression, Random Forest, and Classification Trees.
The main types of supervised learning problems include regression and classification problems:-
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two or more classes such as Yes-No, Male-Female, True-False, etc.
1. Random Forest
A random forest is a machine learning technique that’s used to solve regression and classification
problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide
solutions to complex problems.
A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest
algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm
that improves the accuracy of machine learning algorithms.
The (random forest) algorithm establishes the outcome based on the predictions of the decision trees. It
predicts by taking the average or mean of the output from various trees. Increasing the number of trees
increases the precision of the outcome.
A random forest eradicates the limitations of a decision tree algorithm. It reduces the overfitting of
datasets and increases precision. It generates predictions without requiring many configurations in
packages (like scikit-learn).
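A hedged sketch of how a random forest could be applied to default prediction with scikit-learn is shown below; the synthetic data only stands in for the project's prepared feature matrix X and default flag y.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; in the project, X would hold customer attributes and
# y the default flag (1 = defaulter, 0 = non-defaulter)
X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Bagging over many decision trees; more trees usually stabilises the result
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("Feature importances:", rf.feature_importances_)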
Decision trees are the building blocks of a random forest algorithm. A decision tree is a decision
support technique that forms a tree-like structure. An overview of decision trees will help us
understand how random forest algorithms work.
A decision tree consists of three components: decision nodes, leaf nodes, and a root node. A
decision tree algorithm divides a training dataset into branches, which further segregate into
other branches. This sequence continues until a leaf node is attained. The leaf node cannot be
segregated further.
The nodes in the decision tree represent attributes that are used for predicting the outcome.
Decision nodes provide a link to the leaves. The following diagram shows the three types of
nodes in a decision tree.
2. Decision Trees
Decision tree is the most powerful and popular tool for classification and prediction. A Decision
tree is a flowchart like tree structure, where each internal node denotes a test on an attribute,
each branch represents an outcome of the test, and each leaf node (terminal node) holds a class
label.
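A brief sketch with scikit-learn's DecisionTreeClassifier on stand-in data is shown below; the dataset and the max_depth value are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in binary-classification data
X, y = make_classification(n_samples=1000, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Limit the depth to keep the tree readable and reduce overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Print the learned rules: internal nodes test a feature, leaves hold a class
print(export_text(tree))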
Strengths and Weakness of Decision Tree approach
3. Logistic Regression
Logistic regression is named for the function used at the core of the method, the logistic
function.
The logistic function, also called the sigmoid function was developed by statisticians to describe
properties of population growth in ecology, rising quickly and maxing out at the carrying
capacity of the environment. It’s an S-shaped curve that can take any real-valued number and
map it into a value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform. Below is a plot
of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic function.
LOGISTIC FUNCTION
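The transformation described above can be reproduced in a few lines; this sketch simply evaluates and plots the logistic function over the range -5 to 5.

import numpy as np
import matplotlib.pyplot as plt

# Logistic (sigmoid) function: 1 / (1 + e^-value)
values = np.linspace(-5, 5, 100)
sigmoid = 1 / (1 + np.exp(-values))

plt.plot(values, sigmoid)
plt.xlabel("value")
plt.ylabel("1 / (1 + e^-value)")
plt.title("Logistic function")
plt.show()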
Representation Used for Logistic Regression
Logistic regression uses an equation as the representation, very much like linear regression.
Input values (x) are combined linearly using weights or coefficient values (referred to as the
Greek capital letter Beta) to predict an output value (y). A key difference from linear regression
is that the output value being modeled is a binary value (0 or 1) rather than a numeric value. An example logistic regression equation with a single input is:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
The actual representation of the model that you would store in memory or in a file are the
coefficients in the equation (the beta value or b’s).
Logistic regression models the probability of the default class (e.g. the first class).
For example, if we are modeling people’s sex as male or female from their height, then the first
class could be male and the logistic regression model could be written as the probability of male
given a person’s height, or more formally:
P(sex=male|height)
Written another way, we are modeling the probability that an input (X) belongs to the default
class (Y=1), we can write this formally as:
P(X) = P(Y=1|X)
Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a crisp class prediction. More on this later when we talk about making predictions.
Logistic regression is a linear method, but the predictions are transformed using the logistic
function. The impact of this is that we can no longer understand the predictions as a linear
combination of the inputs as we can with linear regression, for example, continuing on from
above, the model can be stated as:
p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))
I don’t want to dive into the math too much, but we can turn around the above equation as
follows (remember we can remove the e from one side by adding a natural logarithm (ln) to the
other):
ln(p(X) / (1 - p(X))) = b0 + b1 * X
This is useful because we can see that the calculation of the output on the right is linear again
(just like linear regression), and the input on the left is a log of the probability of the default
class.
This ratio on the left is called the odds of the default class (it’s historical that we use odds, for
example, odds are used in horse racing rather than probabilities). Odds are calculated as a ratio
of the probability of the event divided by the probability of not the event, e.g. 0.8/(1-0.8) which
has the odds of 4. So we could instead write:
ln(odds) = b0 + b1 * X
Because the odds are log transformed, we call this left-hand side the log-odds or the logit. It is possible to use other types of functions for the transform (which is out of scope here), but as such it is common to refer to the transform that relates the linear regression equation to the probabilities as the link function, e.g. the logit link function.
We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
All of this helps us understand that indeed the model is still a linear combination of the inputs, but that
this linear combination relates to the log-odds of the default class.
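As a hedged sketch, the same relationship can be inspected with scikit-learn: the learned coefficients are the b values, and exponentiating them gives the multiplicative change in the odds of the default class. The synthetic data is for illustration only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X, y = make_classification(n_samples=1000, n_features=4, random_state=7)

model = LogisticRegression()
model.fit(X, y)

# b0 (intercept) and b1..bn (one coefficient per input column)
print("Intercept (b0):", model.intercept_)
print("Coefficients (b):", model.coef_)

# e^b gives the change in odds for a one-unit increase in each input
print("Odds ratios:", np.exp(model.coef_))

# predict_proba returns P(Y=1|X); predict applies a 0.5 threshold
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]))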
Advantages of Supervised Learning:
1. With the help of supervised learning, the model can predict the output on the basis of prior experience.
2. In supervised learning, we can have an exact idea about the classes of objects.
3. Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.
Disadvantages of Supervised Learning:
1. Supervised learning models are not suitable for handling very complex tasks.
2. Supervised learning cannot predict the correct output if the test data is different from the training dataset.
3. Training requires a lot of computation time.
4. In supervised learning, we need enough knowledge about the classes of objects.
2) Unsupervised Learning
Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabelled datasets. These algorithms discover hidden patterns
or data groupings without the need for human intervention. Its ability to discover similarities and
differences in information make it the ideal solution for exploratory data analysis, cross-selling
strategies, customer segmentation, and image recognition.
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering:
Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
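As a brief sketch of clustering in code, the snippet below groups synthetic data with scikit-learn's KMeans; the data and the choice of three clusters are purely illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Group the points into 3 clusters without using any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the discovered cluster centres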
Association:
An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is Market Basket Analysis.
Advantages of Unsupervised Learning:
1. Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labelled input data.
2. Unsupervised learning is preferable as it is easier to get unlabelled data than labelled data.
Disadvantages of Unsupervised Learning:
1. Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding output.
2. The result of an unsupervised learning algorithm might be less accurate as the input data is not labelled, and the algorithm does not know the exact output in advance.
3) Reinforcement Learning
In reinforcement learning, an agent learns by receiving feedback (rewards or penalties) for its actions, and it is commonly divided into two types:
1. Positive
2. Negative
4) Semi-supervised Learning
In this type of learning, the algorithm is trained upon a combination of labeled and unlabelled
data. Typically, this combination will contain a very small amount of labeled data and a very
large amount of unlabelled data. The basic procedure involved is that first, the programmer will
cluster similar data using an unsupervised learning algorithm and then use the existing labeled
data to label the rest of the unlabelled data. The typical use cases of such type of algorithm have
a common property among them – The acquisition of unlabelled data is relatively cheap while
labeling the said data is very expensive.
1. Continuity Assumption: The algorithm assumes that the points which are closer to
each other are more likely to have the same output label.
2. Cluster Assumption: The data can be divided into discrete clusters and points in the
same cluster are more likely to share an output label.
3. Manifold Assumption: The data lie approximately on a manifold of much lower
dimension than the input space. This assumption allows the use of distances and
densities which are defined on a manifold.
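A minimal sketch of this idea uses scikit-learn's LabelPropagation, where unlabelled points are marked with -1; the synthetic data and the 90% unlabelled split are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Synthetic data: pretend only about 10% of the labels are known
X, y = make_classification(n_samples=500, n_features=5, random_state=3)
y_partial = y.copy()
rng = np.random.RandomState(3)
unlabelled = rng.rand(len(y)) < 0.9
y_partial[unlabelled] = -1  # -1 marks an unlabelled sample

# Propagate the few known labels to the unlabelled points
model = LabelPropagation()
model.fit(X, y_partial)

# Compare the propagated labels against the held-back true labels
print("Accuracy on the originally unlabelled points:",
      (model.transduction_[unlabelled] == y[unlabelled]).mean())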
Model Evaluation Metrics
A true positive is an outcome where the model correctly predicts the positive class.
Similarly, a true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class.
And a false negative is an outcome where the model incorrectly predicts the negative class.
Sensitivity: Sensitivity is known as the True Positive Rate. Essentially, it informs us about the
proportion of actual positive cases that have gotten predicted as positive by our model.
Therefore, when the value of sensitivity is high, it means our model is good at predicting the true positives correctly. It is the ratio of true positives to all positives.
Specificity: Specificity is known as the True Negative Rate. It informs us about the proportion of actual
negative cases that have gotten predicted as negative by our model. It is the ratio of true negatives to all
negatives.
Therefore, when the value of specificity is high, it means our model is good at predicting the true negatives correctly.
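These quantities can be read directly off a confusion matrix. A minimal sketch with scikit-learn is shown below; the true and predicted labels are invented purely for illustration (1 = defaulter).

from sklearn.metrics import confusion_matrix

# Illustrative true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Accuracy:", accuracy)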
FUNCTIONING OF THE APPLICATION
Step 1: Import the libraries
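A hedged sketch of the imports such a notebook would typically begin with is given below; the exact list used in the project may differ slightly.

# Core data handling
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling and evaluation (scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score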
Since Performance Tag is the target variable, rows with NAs in it are removed.
There are no NA values in the credit data.
There are no NA values in the demographic data.
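A minimal sketch of this step is shown below; the file names and DataFrame names are assumptions for illustration, and both files are assumed to contain a Performance Tag column.

import pandas as pd

# Hypothetical file names standing in for the project's data files
credit = pd.read_csv("credit_bureau_data.csv")
demographic = pd.read_csv("demographic_data.csv")

# Since Performance Tag is the target variable, rows where it is NA are dropped
credit = credit.dropna(subset=["Performance Tag"])
demographic = demographic.dropna(subset=["Performance Tag"])

# Confirm there are no remaining missing values
print(credit.isna().sum())
print(demographic.isna().sum())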
AGE
Inference: The middle line of the box plot is the median. Here the median age is 45.
GENDER
Inference: Most of the loan applicants are male.
Marital Status (at the time of application)
Inference: Married people apply for loans more than unmarried ones.
No of dependents
Inference: Although the counts for 1, 2 and 3 dependents look similar, 3 appears slightly more often.
Income
Inference: The Income column contains some invalid values, such as negative and zero values, which are treated as missing.
Inference: People with an annual income of 10-40 lakhs applied for loans more than others.
Education
Inference: People with Professional and Masters education applied for loans more than others.
Profession
No of times 90 DPD or worse in last 6 months
Inference: Most applicants have not been 90 days past due (DPD) or worse in the last 6 months.
No of times 60 DPD or worse in last 6 months
Inference: Most applicants have not been 60 days past due or worse in the last 6 months.
No of times 30 DPD or worse in last 6 months
Inference: Most applicants have not been 30 days past due or worse in the last 6 months.
No of times 90 DPD or worse in last 12 months
Inference: Most applicants have not been 90 days past due or worse in the last 12 months.
No of times 60 DPD or worse in last 12 months
Inference: Most applicants have not been 60 days past due or worse in the last 12 months.
No of times 30 DPD or worse in last 12 months
Inference: Most applicants have not been 30 days past due or worse in the last 12 months.
Avg CC Utilization in last 12 months
No of Inquiries in last 12 months (excluding home & auto loans)
Inference: Most applicants had 0 inquiries in the last 12 months (excluding home & auto loans).
Presence of open home loan
Total No of Trades
Inference: The most common number of trades is 3.
CONCLUSION
Decision Tree Classification Model
Important Variables:-
TotalWorkingYears
total
Age
DistanceFromHome
MonthlyIncome
MaritalStatus_Single
YearsAtCompany
PercentSalaryHike
BusinessTravel_Travel_Frequently
NumCompaniesWorked_7-9
EnvironmentSatisfaction_Low
NumCompaniesWorked_4-6
MODEL RESULTS(%):-
Accuracy - 94
Sensitivity - 94
Specificity - 93
Logistic Regression
Important Variables:-
No of dependents,
Outstanding Balance,
Total No of Trades,
MODEL RESULTS(%):-
Accuracy - 63
Sensitivity - 69
Specificity - 55
Random Forest Classifier
Important Variables:-
MODEL RESULTS(%):-
Accuracy - 97
Sensitivity - 100
Specificity - 94
REFERENCES
https://towardsdatascience.com/machine-learning/home
https://www.javatpoint.com/machine-learning
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
https://www.geeksforgeeks.org/decision-tree/