
A MINOR PROJECT REPORT

ON
CAPSTONE CREDIT CARD DEFAULT DETECTION

In partial fulfillment for the award of the degree

Of

BACHELOR OF COMPUTER APPLICATIONS

MAHARAJA SURAJMAL INSTITUTE

C-4, JANAK PURI, NEW DELHI - 110058

Submitted By:
Vinay Ghildiyal (09414902019)
Ritik Tandon (07514902019)
Tarun Gautam (02514902019)
Vardaan Agarwal (35614902019)

Submitted To:
Dr. Pooja Singh
Assistant Professor
Department of Computer Science
Maharaja Surajmal Institute
MAHARAJA SURAJMAL INSTITUTE

BONAFIDE CERTIFICATE

Certified that this project report on “Capstone Credit Card Default Detection” is the bonafide work of “VINAY GHILDIYAL, RITIK TANDON, TARUN GAUTAM, VARDAAN AGARWAL”, who carried out the project work under my supervision.

SIGNATURE

Dr. Pooja Singh


SUPERVISOR

Assistant Professor
Department of Computer Science
Maharaja Surajmal Institute
DECLARATION

We hereby declare that this submission is our own work and that, to the best of our knowledge and belief, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or any other institute of higher learning, except where due acknowledgment has been made in the text.

Vinay Ghildiyal (09414902019)

Ritik Tandon (07514902019)

Tarun Gautam (02514902019)

Vardaan Agarwal (35614902019)

Date-
ACKNOWLEDGEMENT

It gives us a great sense of pleasure to present the report of the BCA project undertaken during the final year of the BCA programme. We owe a special debt of gratitude to Dr. Pooja Singh, Assistant Professor, Department of Computer Science, Maharaja Surajmal Institute, Delhi, for her constant support and guidance throughout the course of our work. Her sincerity, thoroughness and perseverance have been a constant source of inspiration for us. It is only because of her cognizant efforts that our endeavour has seen the light of day. We would also like to take this opportunity to acknowledge the contribution of all the faculty members of the department for their kind assistance and cooperation during the development of our project. Last but not least, we acknowledge our friends for their contribution to the completion of the project.

Vinay Ghildiyal (09414902019)

Ritik Tandon (07514902019)

Tarun Gautam (02514902019)

Vardaan Agarwal (35614902019)

Date-
ABSTRACT

Credit card default detection is one of the most pressing problems facing banks today. Credit risk plays a vital role in the banking sector, whose main activities include granting loans, issuing credit cards, investments, mortgages and other financial services.

Credit cards have been one of the fastest-growing financial services offered by banks over the past few years. However, with the growing number of credit card users, banks have been facing an escalating credit card default rate. Data analytics can therefore provide solutions to tackle this problem and help manage credit risk.

This project provides a performance evaluation of credit card default prediction using machine learning classification algorithms. Classification algorithms such as logistic regression, decision tree and random forest are used to test the variables for predicting credit default, and random forest proved to have the highest accuracy, sensitivity, specificity and area under the curve.

This result shows that random forest, with an accuracy of 97% and an area under the curve of 97.3%, best describes which factors should be considered when assessing the credit risk of credit card customers.


Table of Contents

1. INTRODUCTION
   1.1 Reason for Choosing this Project
   1.2 Reason to Choose Machine Learning in Python
   1.3 Objectives of the Project

2. METHODS AND MATERIALS USED
   - Software Requirements
     - Python
     - Google Colab
     - NumPy
     - Pandas
     - Matplotlib
     - Seaborn
     - Scikit-learn
   - Hardware Requirements
   - ER Diagram
   - Data Flow Diagram

3. INTRODUCTION TO MACHINE LEARNING
   - History of Machine Learning
   - Applications of Machine Learning
   - Types of Machine Learning

4. FUNCTIONING OF THE APPLICATION
   - Main Code & Screenshots

CONCLUSION

REFERENCES
INTRODUCTION
Reason for choosing this project

The banking sector is one of the most volatile and vulnerable sectors in the world, with ever-increasing risk factors. Credit card default is a serious concern for banks, which suffer heavy losses due to their inability to recover the money granted to customers. A large number of people apply for credit cards every year, so assessing whether an individual will be able to repay the credit extended to them has become crucial in the banking sector.

The aim of this project is therefore to distinguish potential defaulters from non-defaulters and thereby reduce bad debt. We achieve this by applying machine learning classification models such as Logistic Regression, Decision Tree, Random Forest Classifier and Naïve Bayes.

Reason to Choose Machine Learning in Python

Python is a programming language that supports the creation of a wide range of applications. Developers regard it as a great choice for Artificial Intelligence (AI), Machine Learning and Deep Learning projects.

It has a huge number of libraries and frameworks: the Python language comes with many libraries and frameworks that make coding easy and save a significant amount of time. The most popular libraries are NumPy, which is used for scientific calculations; SciPy, for more advanced computations; and scikit-learn, for data mining and data analysis.

Fast development: Python has a syntax that is easy to understand and friendly to work with.
Objectives of the Project

This project aims to bridge this gap of uncertainty through a data-driven approach, using past data of credit card customers in conjunction with machine learning to predict whether or not a consumer will default on their credit card.

The goal behind using this model is to achieve two things:

- Bring more consistency to the lending process
- Investigate the key drivers behind a potential defaulter

In addition to answering these pressing questions, we also wanted to focus on the learning process itself: to become better and more efficient with a data science project workflow, and to enter a domain we were previously unfamiliar with in order to step out of our comfort zone. With that said, let us delve into how one may embark on a classifier data science project.
METHODS AND MATERIALS USED
Software Requirements

The following are the software requirements for the project:

1. Python
2. Google Colab / Jupyter Notebook
3. NumPy
4. Pandas
5. Matplotlib
6. Seaborn
7. Scikit-learn

Python

Python is a high-level, interpreted, general-purpose dynamic programming language that focuses on code readability. Its syntax helps programmers code in fewer steps as compared to Java or C++.

The Python language has diverse applications in software development companies, such as gaming, web frameworks and applications, language development, prototyping and graphic design applications. This gives the language an edge over other programming languages used in the industry.
Some of its advantages are:

Extensive Support Libraries

It provides large standard libraries that cover areas like string operations, the Internet, web service tools, operating system interfaces and protocols. Most of the commonly needed programming tasks are already scripted into it, which limits the amount of code that has to be written in Python.

Integration Feature

Python supports Enterprise Application Integration, which makes it easy to develop web services by invoking COM or CORBA components. It has powerful control capabilities, as it can call directly into C, C++ or Java code. Python also processes XML and other mark-up languages, and it can run on all modern operating systems through the same byte code.

Improved Programmer’s Productivity

The language has extensive support libraries and clean object-oriented designs that can increase a programmer's productivity two- to tenfold compared with languages like Java, VB, Perl, C, C++ and C#.

Productivity

Its strong process integration features, unit testing framework and enhanced control capabilities contribute towards increased speed and productivity for most applications. It is a great option for building scalable multi-protocol network applications.

Google Colab/Jupyter Notebook

Colab is basically a free Jupyter notebook environment running wholly in the cloud. Most importantly, Colab does not require any setup, and the notebooks you create can be edited simultaneously by your team members, in a similar manner to how you edit documents in Google Docs. The greatest advantage is that Colab supports the most popular machine learning libraries, which can be easily loaded in your notebook.

We can perform the following using Google Colab:
1. Write and execute code in Python
2. Create/Upload/Share notebooks
3. Import/Save notebooks from/to Google Drive
4. Import/Publish notebooks from GitHub
5. Import external datasets
6. Integrate PyTorch, TensorFlow, Keras, OpenCV
7. Free cloud service with a free GPU

NUMPY

NumPy is a Python library used for working with arrays. NumPy stands for Numerical Python.
NumPy is a Python library and is written partially in Python, but most of the parts that require
fast computation are written in C or C++.
In Python we have lists that serve the purpose of arrays, but they are slow to process.

NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarrays very easy.

Arrays are very frequently used in data science, where speed and resources are very important.
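As a brief illustration (not taken from the project's own code), the following snippet shows how an ndarray is created and how vectorised operations avoid explicit Python loops:

import numpy as np

# Create an ndarray from a Python list
balances = np.array([1200.0, 540.5, 9800.0, 0.0])

# Vectorised operations work element-wise, with no explicit loop
scaled = balances / balances.max()   # normalise to the range 0-1
print(scaled.mean(), scaled.shape, scaled.dtype)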

PANDAS

Pandas is an open-source library that is built on top of the NumPy library. It is a Python package that offers various data structures and operations for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for its users.

Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working with data, including:

1. Data cleansing
2. Data fill
3. Data normalization
4. Merges and joins
5. Data visualization
6. Statistical analysis
7. Data inspection
8. Loading and saving data
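As a small, hedged sketch of such a workflow (the file and column names below are assumed for illustration and are not the project's actual files), typical Pandas calls look like this:

import pandas as pd

# Hypothetical file name, used only for illustration
df = pd.read_csv("credit_data.csv")

df.info()                                     # data inspection: column types and missing counts
df = df.drop_duplicates()                     # remove duplicate rows
df = df.dropna(subset=["Performance Tag"])    # drop rows where the target column is missing
print(df.describe())                          # quick statistical summary of the numeric columns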

Matplotlib

Matplotlib is an amazing visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. It was introduced by John Hunter in 2002.

One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib offers several kinds of plots, such as line, bar, scatter and histogram plots.

Matplotlib is not part of the standard library that is installed by default with Python. There are several toolkits available that extend Matplotlib's functionality. Some of them are separate downloads; others ship with the Matplotlib source code but have external dependencies.
1. Basemap: It is a map plotting toolkit with various map projections, coastlines and political
boundaries.
2. Cartopy: It is a mapping library featuring object-oriented map projection definitions, and arbitrary
point, line, polygon and image transformation capabilities.
3. Excel tools: Matplotlib provides utilities for exchanging data with Microsoft Excel.
4. Mplot3d: It is used for 3-D plots.
5. Natgrid: It is an interface to the natgrid library for irregular gridding of the spaced data.
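A minimal, self-contained sketch of a Matplotlib plot (illustrative only, not taken from the project notebook):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)

plt.figure(figsize=(6, 4))
plt.plot(x, x ** 2, label="y = x^2")       # a simple line plot
plt.scatter(x[::10], x[::10] ** 2)         # a few scatter points on the same curve
plt.xlabel("x")
plt.ylabel("y")
plt.title("A minimal Matplotlib example")
plt.legend()
plt.show()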

Seaborn

Seaborn is an amazing visualization library for statistical graphics plotting in Python. It provides beautiful default styles and color palettes to make statistical plots more attractive. It is built on top of the Matplotlib library and is closely integrated with the data structures from Pandas. Seaborn aims to make visualization a central part of exploring and understanding data. It provides dataset-oriented APIs, so that we can switch between different visual representations of the same variables for a better understanding of the dataset.

Different categories of plot in Seaborn

Plots are used for visualizing the relationships between variables. Those variables can be either completely numerical or categorical, such as a group, class or division. Seaborn divides plots into the categories below:
1. Relational plots: This plot is used to understand the relation between two variables.
2. Categorical plots: This plot deals with categorical variables and how they can be visualized.
3. Distribution plots: This plot is used for examining univariate and bivariate distributions
4. Regression plots: The regression plots in seaborn are primarily intended to add a visual guide that
helps to emphasize patterns in a dataset during exploratory data analyses.
5. Matrix plots: A matrix plot is an array of scatterplots.
6. Multi-plot grids: A useful approach is to draw multiple instances of the same plot on different subsets of the dataset.
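As a hedged illustration (the tiny DataFrame below is made up and not part of the project data), two common Seaborn plot types can be produced like this:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# A tiny, made-up DataFrame used only for illustration
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "M", "F", "F"],
    "Income": [30, 45, 28, 52, 39, 41],
})

sns.countplot(x="Gender", data=df)              # a categorical (count) plot
plt.show()

sns.boxplot(x="Gender", y="Income", data=df)    # distribution of a numeric variable per category
plt.show()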

Scikit-learn

Scikit-learn is a free software machine learning library for the Python programming language. It
features various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is
a NumFOCUS fiscally sponsored project.
It is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
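A minimal sketch of the typical scikit-learn workflow, using synthetic data as a stand-in for the real credit card dataset (illustrative only):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data stands in for the real credit card dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                     # learn from the training split
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))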

Hardware Requirements

The following are the hardware requirements for the project:

512 MB RAM + 1 GB of disk + 0.5 CPU core. Server overhead: 2-4 GB or 10% system overhead (whichever is larger), 0.5 CPU cores, 10 GB disk space.
DATA FLOW DIAGRAM
ER DIAGRAM
INTRODUCTION TO MACHINE LEARNING

INTRODUCTION

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.

HISTORY OF MACHINE LEARNING

The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and
pioneer in the field of computer gaming and artificial intelligence.
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in
the machine learning field: "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P if its performance at tasks in T, as measured
by P, improves with experience E."
This definition of the tasks in which machine learning is concerned offers a fundamentally
operational definition rather than defining the field in cognitive terms. This follows Alan
Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question
"Can machines think?" is replaced with the question "Can machines do what we (as thinking
entities) can do?".
APPLICATIONS OF MACHINE LEARNING

1. Smartphones detect faces while taking photos or unlocking themselves.
2. Amazon recommends products based on your browsing history.
3. Banks use machine learning to detect fraudulent transactions in real time.
4. Incoming mails are classified as spam or ham.

TYPES OF MACHINE LEARNING ALGORITHMS

1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning

Supervised Learning

Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output. Labelled data means input data that is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Algorithms used: support vector machines, neural networks, linear and logistic regression, random forests, and classification trees.

The main types of supervised learning problems include regression and classification problems:
1.Regression

Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:

1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression

2.Classification

Classification algorithms are used when the output variable is categorical, which means it falls into classes such as Yes-No, Male-Female, True-False, etc.

1. Random Forest

A random forest is a machine learning technique that’s used to solve regression and classification
problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide
solutions to complex problems.

A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest
algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm
that improves the accuracy of machine learning algorithms.

The random forest algorithm establishes the outcome based on the predictions of the decision trees. It predicts by taking the average or mean of the output from the various trees (or, for classification, the majority vote). Increasing the number of trees improves the precision of the outcome.

A random forest overcomes the limitations of a decision tree algorithm. It reduces the overfitting of datasets and increases precision. It generates predictions without requiring much configuration in packages (like scikit-learn).

Features of a Random Forest Algorithm

1. It’s more accurate than the decision tree algorithm.


2. It provides an effective way of handling missing data.
3. It can produce a reasonable prediction without hyper-parameter tuning.
4. It solves the issue of overfitting in decision trees.
5. In every random forest tree, a subset of features is selected randomly at the node’s
splitting point.

How random forest algorithm works

Understanding decision trees

Decision trees are the building blocks of a random forest algorithm. A decision tree is a decision
support technique that forms a tree-like structure. An overview of decision trees will help us
understand how random forest algorithms work.

A decision tree consists of three components: decision nodes, leaf nodes, and a root node. A
decision tree algorithm divides a training dataset into branches, which further segregate into
other branches. This sequence continues until a leaf node is attained. The leaf node cannot be
segregated further.

The nodes in the decision tree represent attributes that are used for predicting the outcome.
Decision nodes provide a link to the leaves. The following diagram shows the three types of
nodes in a decision tree.
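As an illustrative sketch (not the project's actual notebook code), a random forest classifier can be trained with scikit-learn on synthetic stand-in data like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real project uses the credit card dataset
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 bagged decision trees vote on the final class
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("Feature importances:", rf.feature_importances_)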

2. Decision Trees

The decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Strengths and Weaknesses of the Decision Tree Approach

The strengths of decision tree methods are:

- Decision trees are able to generate understandable rules.
- Decision trees perform classification without requiring much computation.
- Decision trees are able to handle both continuous and categorical variables.
- Decision trees provide a clear indication of which fields are most important for prediction or classification.

The weaknesses of decision tree methods are:

- Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
- Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
- Decision trees can be computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
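A minimal, hedged sketch of fitting a decision tree with scikit-learn on synthetic data (the project's real features and settings may differ):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Limiting the depth keeps the tree readable and reduces overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)

print(export_text(tree))                        # the learned rules as readable text
print("Test accuracy:", tree.score(X_test, y_test))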

3. Logistic Regression

Logistic regression is named for the function used at the core of the method, the logistic
function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform. Below is a plot
of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic function.

LOGISTIC FUNCTION
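The plot referred to above can be reproduced with a few lines of Python (a sketch, not the report's original figure code):

import numpy as np
import matplotlib.pyplot as plt

def logistic(value):
    # 1 / (1 + e^-value): maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-value))

x = np.linspace(-5, 5, 200)
plt.plot(x, logistic(x))
plt.xlabel("value")
plt.ylabel("logistic(value)")
plt.title("The logistic (sigmoid) function")
plt.show()

print(logistic(0))   # exactly 0.5 at zero
print(logistic(4))   # close to 1 for large positive inputs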
Representation Used for Logistic Regression

Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights or coefficient values (referred to as the Greek capital letter Beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.

Below is an example logistic regression equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for the
single input value (x). Each column in your input data has an associated b coefficient (a constant
real value) that must be learned from your training data.

The actual representation of the model that you would store in memory or in a file are the
coefficients in the equation (the beta value or b’s).

Logistic Regression Predicts Probabilities

Logistic regression models the probability of the default class (e.g. the first class).

For example, if we are modeling people’s sex as male or female from their height, then the first
class could be male and the logistic regression model could be written as the probability of male
given a person’s height, or more formally:

P(sex=male|height)

Written another way, we are modeling the probability that an input (X) belongs to the default
class (Y=1), we can write this formally as:

P(X) = P(Y=1|X)

Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a class prediction. More on this later, when we talk about making predictions.

Logistic regression is a linear method, but the predictions are transformed using the logistic
function. The impact of this is that we can no longer understand the predictions as a linear
combination of the inputs as we can with linear regression, for example, continuing on from
above, the model can be stated as:

p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

I don’t want to dive into the math too much, but we can turn around the above equation as
follows (remember we can remove the e from one side by adding a natural logarithm (ln) to the
other):

ln(p(X) / (1 - p(X))) = b0 + b1 * X

This is useful because we can see that the calculation of the output on the right is linear again
(just like linear regression), and the input on the left is a log of the probability of the default
class.

This ratio on the left is called the odds of the default class (it’s historical that we use odds, for
example, odds are used in horse racing rather than probabilities). Odds are calculated as a ratio
of the probability of the event divided by the probability of not the event, e.g. 0.8/(1-0.8) which
has the odds of 4. So we could instead write:
ln(odds) = b0 + b1 * X

Because the odds are log-transformed, we call this left-hand side the log-odds or the logit. It is possible to use other types of functions for the transform (which is out of scope here), but it is common to refer to the transform that relates the linear regression equation to the probabilities as the link function, e.g. the logit link function.

We can move the exponent back to the right and write it as:

odds = e^(b0 + b1 * X)

All of this helps us understand that indeed the model is still a linear combination of the inputs, but that
this linear combination relates to the log-odds of the default class.
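A short, hedged sketch (synthetic data, not the project's dataset) showing how these quantities appear when logistic regression is fitted with scikit-learn:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=4, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The stored model is simply the intercept (b0) and one coefficient per input column
print("b0:", model.intercept_, "b:", model.coef_)

# predict_proba returns P(Y=1|X); predict applies a 0.5 threshold to give a 0/1 label
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]))

# Exponentiating a coefficient gives the multiplicative change in the odds per unit increase
print("Odds ratios:", np.exp(model.coef_))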

Advantages of Supervised learning:

1. With the help of supervised learning, the model can predict the output on the basis of prior
experiences.
2. In supervised learning, we can have an exact idea about the classes of objects.
3. Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

1. Supervised learning models are not suitable for handling very complex tasks.
2. Supervised learning cannot predict the correct output if the test data is very different from the training dataset.
3. Training requires a lot of computation time.
4. In supervised learning, we need enough knowledge about the classes of objects.
2) Unsupervised Learning

Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabelled datasets. These algorithms discover hidden patterns
or data groupings without the need for human intervention. Its ability to discover similarities and
differences in information make it the ideal solution for exploratory data analysis, cross-selling
strategies, customer segmentation, and image recognition.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

Clustering:

Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them according to the presence and absence of those commonalities.
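As a small illustration of clustering (not part of the project itself), k-means can be run on synthetic, unlabelled points with scikit-learn:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabelled points used only for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])        # cluster assigned to each of the first ten points
print(kmeans.cluster_centers_)    # the three discovered cluster centres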

Association:
An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.

Advantages of Unsupervised Learning

1. Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labelled input data.
2. Unsupervised learning is preferable as it is easy to get unlabelled data in comparison to
labelled data.

Disadvantages of Unsupervised Learning

1. Unsupervised learning is intrinsically more difficult than supervised learning as it does not
have corresponding output.
2. The result of the unsupervised learning algorithm might be less accurate as input data is
not labelled, and algorithms do not know the exact output in advance.

3) Reinforcement Learning

Reinforcement learning is an area of machine learning. It is about taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data comes with the answer key, so the model is trained with the correct answer itself; in reinforcement learning there is no answer key, and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.

Main points in reinforcement learning:

- Input: The input should be an initial state from which the model will start.
- Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
- Training: The training is based on the input; the model returns a state, and the user decides whether to reward or punish the model based on its output.
- The model continues to learn.
- The best solution is decided based on the maximum reward.

Types of Reinforcement: There are two types of reinforcement:
1. Positive

Positive reinforcement occurs when an event, occurring due to a particular behaviour, increases the strength and frequency of that behaviour. In other words, it has a positive effect on behaviour.

Advantages of positive reinforcement:

- Maximizes performance
- Sustains change for a long period of time

A drawback is that too much reinforcement can lead to an overload of states, which can diminish the results.

2. Negative

Negative reinforcement is defined as the strengthening of a behaviour because a negative condition is stopped or avoided.

Advantages of negative reinforcement:

- Increases behaviour
- Provides defiance to a minimum standard of performance

A drawback is that it only provides enough to meet the minimum behaviour.

4) Semi Supervised Learning Method

In this type of learning, the algorithm is trained upon a combination of labeled and unlabelled
data. Typically, this combination will contain a very small amount of labeled data and a very
large amount of unlabelled data. The basic procedure involved is that first, the programmer will
cluster similar data using an unsupervised learning algorithm and then use the existing labeled
data to label the rest of the unlabelled data. The typical use cases of such type of algorithm have
a common property among them – The acquisition of unlabelled data is relatively cheap while
labeling the said data is very expensive. 

A Semi-Supervised algorithm assumes the following about the data

1. Continuity Assumption: The algorithm assumes that the points which are closer to
each other are more likely to have the same output label.
2. Cluster Assumption: The data can be divided into discrete clusters and points in the
same cluster are more likely to share an output label.
3. Manifold Assumption: The data lie approximately on a manifold of much lower
dimension than the input space. This assumption allows the use of distances and
densities which are defined on a manifold.

Practical applications of Semi-Supervised Learning:
1. Speech Analysis: Since labeling of audio files is a very intensive task, Semi-
Supervised learning is a very natural approach to solve this problem.
2. Internet Content Classification: Labeling each webpage is an impractical and infeasible process, so Semi-Supervised learning algorithms are used instead. Even the Google search algorithm uses a variant of Semi-Supervised learning to rank the relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands are typically very large in size,
the rise of Semi-Supervised learning has been imminent in this field.

IMPORTANT TERMS TO UNDERSTAND

A true positive is an outcome where the model correctly predicts the positive class.

Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class.

And a false negative is an outcome where the model incorrectly predicts the negative class.

Sensitivity: Sensitivity is also known as the True Positive Rate. Essentially, it tells us the proportion of actual positive cases that were predicted as positive by our model.

Formula: Sensitivity = True Positives / (True Positives + False Negatives)

Therefore, when the value of sensitivity is high, it means our model is good at predicting the true positives correctly. It is the ratio of true positives to all positives.

Specificity: Specificity is also known as the True Negative Rate. It tells us the proportion of actual negative cases that were predicted as negative by our model. It is the ratio of true negatives to all negatives.

Formula: Specificity = True Negatives / (True Negatives + False Positives)

Therefore, when the value of specificity is high, it means our model is good at predicting the true negatives.
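These two metrics can be computed directly from a confusion matrix; a small sketch with hypothetical labels (not the project's actual predictions):

from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels the matrix unravels as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # True Positive Rate
specificity = tn / (tn + fp)   # True Negative Rate
print("Sensitivity:", sensitivity, "Specificity:", specificity)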
FUNCTIONING OF THE APPLICATION
Step 1: - Import the libraries
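A minimal sketch of the kind of imports this step refers to (the exact list in the project notebook may differ):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score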

Step 2: - Import the dataset


Step 3: - Data cleaning
There are 1425 records with no Performance Tag in both datasets, which indicates that the applicant was not issued a credit card.
Inference - Almost 2% of the observations have NA values in 'Performance Tag'.

Since Performance Tag is the target variable, rows with NAs are removed.
No NA values remain in the credit data.
No NAs remain in the demographic data.

Since both dataframes contain the same Application IDs, they can be merged on Application ID to form a new dataframe comprising the two.
No more duplicate values remain in the data.
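A hedged sketch of what this cleaning and merging step might look like in Pandas (file and column names are assumed for illustration and may differ from the notebook):

import pandas as pd

# Hypothetical file names, used only for illustration
demographic = pd.read_csv("Demographic_data.csv")
credit = pd.read_csv("Credit_Bureau_data.csv")

# Rows without a Performance Tag (applicants not issued a card) are dropped
demographic = demographic.dropna(subset=["Performance Tag"])
credit = credit.dropna(subset=["Performance Tag"])

# Both dataframes share Application ID, so they can be merged into a single dataframe
data = pd.merge(demographic, credit, on="Application ID", suffixes=("", "_credit"))
data = data.drop_duplicates(subset=["Application ID"])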
Step 4: - Data Visualisation
(i) Univariate Analysis:-
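The plots summarised below were produced in the notebook; as a rough, hedged sketch, such univariate plots can be drawn like this (continuing from the merged 'data' dataframe above; column names are assumed):

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(y=data["Age"])             # box plot: the middle line marks the median age
plt.show()

sns.countplot(x="Gender", data=data)   # bar chart of a categorical column
plt.show()

data["Income"].hist(bins=30)           # histogram of a numeric column
plt.show()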

AGE

Inference : The middle line of the box plot is the median; here the median age is 45.

GENDER

Inference : Most of the loan applicants are male.

Marital Status (at the time of application)

Inference : Married people apply for loans more than unmarried people.

No of dependents

Inference : Although applicants with 1, 2 and 3 dependents appear in similar numbers, those with 3 dependents are slightly more likely to apply for a loan.


Income

Inference : The Income column contains some invalid values, such as negative and zero values.

Inference : People with an annual income of 10-40 lakhs applied for loans more than others.

Education

Inference : People with professional and master's education applied for loans the most.

Profession

Inference : Most loan applicants are salaried (SAL) professionals.


Type of Residence

Inference : Most loan applicants live in rented houses.

No of months in current residence

Inference : Most applicants have been at their current residence for 1-10 months.

No of months in current company

Inference : Most applicants have been with their current company for 1-10 months.


Performance Tag

Inference : Most applicants did not default.

No of times 90 DPD or worse in last 6 months

Inference : Most applicants were not 90 days past due or worse in the last 6 months.

No of times 60 DPD or worse in last 6 months

Inference : Most applicants were not 60 days past due or worse in the last 6 months.

No of times 30 DPD or worse in last 6 months

Inference : Most applicants were not 30 days past due or worse in the last 6 months.

No of times 90 DPD or worse in last 12 months

Inference : Most applicants were not 90 days past due or worse in the last 12 months.

No of times 60 DPD or worse in last 12 months

Inference : Most applicants were not 60 days past due or worse in the last 12 months.

No of times 30 DPD or worse in last 12 months

Inference : Most applicants were not 30 days past due or worse in the last 12 months.

Avgas CC Utilization in last 12 months

Inference : The average credit card utilization of most customers over the last 12 months lies between 1 and 20,000.


No of trades opened in last 6 months

Inference : The most common number of trades opened in the last 6 months is 1.

No of trades opened in last 12 months

Inference : The most common number of trades opened in the last 12 months is 1 or 2.

No of PL trades opened in last 6 months

Inference : Many applicants did not open any PL (personal loan) trades in the last 6 months.

No of PL trades opened in last 12 months

Inference : Many applicants did not open any PL trades in the last 12 months.

No of Inquiries in last 6 months (excluding home & auto loans)

Inference : The most common number of inquiries in the last 6 months (excluding home & auto loans) is 0.

No of Inquiries in last 12 months (excluding home & auto loans)

Inference : The most common number of inquiries in the last 12 months (excluding home & auto loans) is 0.
Presence of open home loan

Inference : Most customers do not have an open home loan.

Outstanding Balance

Inference : The Outstanding Balance distribution is skewed to the left.

Outstanding Balance

Inference : Most people have an Outstanding Balance of up to 10 lakh.

Presence of open auto loan

Inference : Most people do not have an open auto loan.

Total No of Trades

Inference : The most common total number of trades is 3.
CONCLUSION
Decision Tree Classification Model

Important variables in the Decision Tree classification model are as follows:

- TotalWorkingYears
- total
- Age
- DistanceFromHome
- MonthlyIncome
- MaritalStatus_Single
- YearsAtCompany
- PercentSalaryHike
- BusinessTravel_Travel_Frequently
- NumCompaniesWorked_7-9
- EnvironmentSatisfaction_Low
- NumCompaniesWorked_4-6

MODEL RESULTS (%):-

Accuracy - 94

Sensitivity - 94

Specificity - 93

No of Inquiries in last 6 months (excluding home & auto loans),

No of times 30 DPD or worse in last 12 months,


No of PL trades opened in last 6 months,

No of Inquiries in last 12 months (excluding home & auto loans),

No of trades opened in last 6 months,

No of months in current residence,

No of dependents,

No of times 30 DPD or worse in last 6 months,

Outstanding Balance,

Total No of Trades,

Avgas CC Utilization in last 12 months

Logistic Regression

Important Variables:

No of times 30 DPD or worse in last 12 months,

No of PL trades opened in last 12 months,

Presence of open home loan,

Presence of open auto loan,

Profession_SE,

residence_Living with Parents,

residence_Others

MODEL RESULTS(%):-

Accuracy - 63

Sensitivity - 69

Specificity - 55
Random Forest Classifier

Important Variables:

No of Inquiries in last 6 months (excluding home & auto loans),

No of times 30 DPD or worse in last 12 months,

No of trades opened in last 6 months,

No of PL trades opened in last 6 months,

No of Inquiries in last 12 months (excluding home & auto loans),

No of times 90 DPD or worse in last 12 months,

No of times 30 DPD or worse in last 6 months,

No of PL trades opened in last 12 months,

No of times 60 DPD or worse in last 12 months,

No of trades opened in last 12 months,

No of times 60 DPD or worse in last 6 months

MODEL RESULTS(%):-

Accuracy - 97

Sensitivity - 100

Specificity - 94
REFERENCES
https://towardsdatascience.com/machine-learning/home

https://www.javatpoint.com/machine-learning

https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/

https://www.geeksforgeeks.org/decision-tree/
