
To create a model for predicting the delinquency status of Residential Mortgage loans from the month of June 2020 to December 2020

SIP project report
Submitted in partial fulfillment of the requirements for the PGDM Program
Batch 2019-21

By Sahana Nagarajan
19PGDM244

Supervisors:
1. Mr. Madhawendra Kumar Jha
2. Prof. Faisal Nazir Zargar

International Management Institute, New Delhi
2020

ACKNOWLEDGEMENT

The completion of this internship and this project would not have been possible without the
kind support and help of many individuals and the organization. I would like to extend my
sincere thanks to all of them.

I would like to express my heartfelt gratitude to my faculty mentor Prof. Faisal Nazir Zargar
and corporate mentors Mr. Madhawendra Kumar Jha and Ms. Ruchika Singhal for their
guidance, constant supervision and support in patiently clarifying all my queries. I was always
encouraged to raise questions and try different approaches to addressing issues, and they
connected with me at regular intervals to ensure I had a smooth journey.

I would also like to thank my institute – International Management Institute, New Delhi, for
giving me an opportunity to foray into this field and complete this project. I would also like to
express my sincere gratitude to all the other members of the organization Deloitte and my
fellow interns for their support and time.

SELF-DECLARATION

I do hereby declare that this SIP Report titled “------------------------------” submitted to the
International Management Institute New Delhi, in partial fulfillment of the Summer Internship
Project, is an original work done by me under the auspices of the Placement Office of
International Management Institute, New Delhi. This work has not been submitted to any other
University/Institution for any purpose.

Signature
Name:
Roll Number:
Program and Batch details:
Date:

Signature
Professor
Faculty Mentor:
International Management Institute New Delhi
Date:

Signature
Professor
Dean/Chair Career Development:
International Management Institute New Delhi
Date:

TABLE OF CONTENTS
S No.  Particulars
1  Executive Summary
2  Introduction
3  Objective of the study
4  Brief description of the concepts in the study
4a  Delinquency rate
4b  Residential Mortgage loans
4c  Fannie Mae
4d  Economic factors
4e  Machine Learning
4f  Supervised Learning
5  Methodology followed
5a  Stage 1:- Sampling
5b  Stage 2:- Exploring
5c  Stage 3:- Modifying
5d  Stage 4:- Modelling
5e  Stage 5:- Assessing
5f  Stage 6:- Deployment
6  Results
7  Conclusion
8  Limitation of the study & scope for future improvement
9  Personal learning from SIP
10  Bibliography
11  Appendices
11a  Appendix 1:- All the variables in acquisition dataset and its type
11b  Appendix 2:- The variables that were eliminated after the exploring stage
11c  Appendix 3:- The variables that were eliminated after feature selection
11d  Appendix 4:- The final features and target variable used in the study
11e  Appendix 5:- The assessment results of K Nearest Neighbour

EXECUTIVE SUMMARY
For the Summer Internship Program 2020, I got the opportunity to intern at Deloitte USI from
20th April to 12th June 2020. It was a virtual internship in which the interns worked from
home. During this time, I attended weekly spotlight sessions with various Deloitte leaders,
where I learnt about the different domains the company works in, its approach and how it
deals with its clients. The weekly informal virtual connect with the Analytics team in Gurgaon
helped me gain insights into the Analytics profile of Deloitte Advisory. Based on the project
requirements, I gained comprehensive learning in Python, Machine Learning and Banking &
Finance concepts.

I worked on a project to develop a model that predicts the delinquency status of active loans
from June 2020 to December 2020. Quarterly loan data for the past 11 years was collected
from the Fannie Mae website for the project. The data was cleaned and modified to make it
suitable for this objective. The prediction model was created by applying four Machine
Learning algorithms (Naïve Bayes, Logistic Regression, Random Forest and K Nearest
Neighbour) to the modified dataset. The model was then used to predict the delinquency
status of all the known active loans. It was also deployed so that end users can predict the
delinquency status by keying in the necessary inputs. The project’s main aim was to help
lenders anticipate a borrower’s status so that they can manage the risk effectively.

INTRODUCTION

Deloitte Touche Tohmatsu Limited is a multinational professional services network. Deloitte,
one of the "Big Four" accounting organizations, is the largest professional services network
in the world based on revenue generated and number of professionals employed.

Deloitte specializes in services such as consulting, tax, audit, financial advisory and enterprise
risk. It has more than 286,200 professionals spread across the globe. According to the reports
of 2018, Deloitte is the 4th largest privately-owned company in the United States.

The domain I was associated with is broadly referred to as Risk Advisory. Risk advisory
provides services in the areas of data quality and integrity, enterprise risk management,
information security and privacy, business continuity management and sustainability and
project risk and cyber risk.

OBJECTIVE OF THE STUDY

During the tenure of my internship, I worked on a project that predicts the delinquency status
of a loan and identifies the loan-, borrower- and economy-related factors that have the greatest
impact on the delinquency rate. This study will help lenders to reduce non-performing loans
and improve delinquency management.

BRIEF DESCRIPTION OF THE CONCEPTS IN THE STUDY

Delinquency rate

The delinquency rate is used to determine the quality of the loan portfolio of lending companies
or banks and is commonly used by analysts. It expresses the number of loans with overdue
payments as a percentage of the total number of loans. A lower rate indicates that fewer loans
in the lender’s portfolio are being repaid late, and is therefore more desirable.

Figure 1:- Formula for delinquency rate
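
As a minimal sketch, assuming the usual definition given in the text (delinquent loans as a percentage of all loans in the portfolio), the rate can be computed as below. The function name and the portfolio figures are illustrative and are not taken from the project.

```python
def delinquency_rate(delinquent_loans: int, total_loans: int) -> float:
    """Return delinquent loans as a percentage of all loans in the portfolio."""
    if total_loans <= 0:
        raise ValueError("total_loans must be positive")
    if not 0 <= delinquent_loans <= total_loans:
        raise ValueError("delinquent_loans must lie between 0 and total_loans")
    return 100.0 * delinquent_loans / total_loans


# Hypothetical portfolio: 25 delinquent loans out of 1,000.
print(delinquency_rate(25, 1000))  # 2.5
```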

Residential Mortgage loans

A residential mortgage loan is a loan that one or more persons take in order to buy a house or
any other residential property in which they will live. The borrowers repay it over a specified
period of time. The security for the loan is a lien on the property.

In a residential mortgage, a homebuyer pledges their house to the bank or some other type of
lender, which has a claim on the house should the homebuyer default on paying the mortgage.

Fannie Mae

The Federal National Mortgage Association (FNMA), typically known as Fannie Mae, was
founded in 1938 and is a government-sponsored enterprise (GSE). It was established to
stimulate the housing market by making more mortgages available to moderate- to low-income
borrowers. It purchases and guarantees mortgages through the secondary mortgage market.
Fannie Mae does not provide or originate mortgages to borrowers.

Fannie Mae supports Americans who want to buy a house through single-family mortgage
products and solutions. Its Single-Family business helps lenders use innovative digital
processes to originate quality affordable mortgages. Its funding makes products such as the
30-year, fixed-rate mortgage possible, providing homeowners with predictable mortgage
payments over the life of the loan. This product remains the most desirable choice among
homeowners.

Economic factors

Six economic factors that might affect the delinquency rate were considered in the project.

i. Unemployment rate – It is the share of the labor force that is jobless, expressed as a
percentage. Increasing unemployment rates result in negative shocks to income that
affect the borrower's ability to pay a mortgage.
ii. Interest rate - It is the amount of interest due for a period, as a percentage of the
amount lent. Increases in prime interest rates lead to increases in mortgage payments
and could thus constrain the liquidity of borrowers.
iii. Inflation - It is an increase in the prices of services and goods in an economy over
some period of time. As the prices of consumables, including transport costs and food
prices, increase, consumers tend to fall into arrears on their mortgage repayments in
order to meet basic needs.
iv. House Price Index - It measures the changes in price of residential housing as a
percentage change from a specific start date. Decreasing house prices impair the
borrower's equity position, which in turn increases delinquency risk because
homeowners are in a worse position to refinance or sell if a trigger event occurs.
v. GDP - It is the monetary value of all finished services and goods manufactured within
a country during a specific period. When GDP decreases, it impacts the earning
potential of people, thus increasing the risk of delinquency.
vi. Delinquency rate – Within a financial institution's loan portfolio, it is the percentage of
loans whose payments are delinquent.

Machine Learning

Machine learning is an application of artificial intelligence (AI) that gives systems the
capability to learn automatically and improve from previous experience without being explicitly
programmed. The process of learning starts with observing data, such as examples,
instructions or direct experience, in order to look for patterns in the data and make better
decisions in the future based on the examples provided. The primary goal is to allow computers
to learn automatically, without human assistance or intervention, and adjust their actions
accordingly.

Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input
to an output based on example input-output pairs. When training a supervised learning
algorithm, the training data has inputs paired with the correct outputs. During training, the
algorithm searches for patterns in the input data that correspond to the desired outputs. After
training, the algorithm takes in new, unseen inputs and determines which class they belong to
on the basis of the prior training data. The objective of a supervised learning model is to
predict the correct class for newly presented input data.

METHODOLOGY FOLLOWED

The six different stages that were followed in the project are as follows:

Figure 2:- Stages of the project

Stage 1:- Sampling

This stage is where all the data required to meet the objective is collected. Two types of
data were collected.

Type 1 :- Data related to loans and borrowers, collected from the Fannie Mae website.

The data is published on a quarterly basis. The first 1,000 loan records of each quarter from
2008 Q1 to 2019 Q1 were used for this project. It consists of two datasets:

1. Acquisition dataset – This contains one entry per loan that was given out during that
particular quarter. It contains loan-related details such as the interest rate, loan term and
loan amount, and borrower-related details such as the borrower credit score and
debt-to-income ratio.

Figure 3:- Sample of acquisition dataset

The layout of this dataset, which contains 25 variables, is shown in appendix 1.

2. Performance dataset – This is a dynamic dataset that captures the performance of the
loan during its entire tenure i.e., from the day it was issued till the day it was closed.
Only two columns were chosen from this dataset - Loan id to map it with the acquisition
dataset and Delinquency status.

Figure 4:- Sample of performance dataset

Type 2 :- Data related to the six economic factors that were being considered in the project
were collected from public sources.

Figure 5:- Sample of economic factors dataset

Stage 2 :- Exploring

In order to gain understanding and ideas, the relationships and trends among the variables in
the data were explored using various graphs. The variables were divided into two sets.

1. Numerical variables - These are variables whose values are measurements or numbers with
a numerical meaning. A correlation heat map and scatter plots were used to find the
relationships between all the numerical variables.

Figure 6:- Correlation heat map

Figure 7:- Scatter plot between LTV and CLTV

2. Categorical variables – These are variables that represent categories or types. Bar
charts were used to find how these variables are distributed.

Figure 8:- Bar chart for occupancy type
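
The two kinds of exploration described above can be sketched without any plotting library: a Pearson correlation coefficient is the number behind each cell of a correlation heat map, and category counts are what a bar chart displays. The values and the occupancy codes below are hypothetical and are not taken from the Fannie Mae data.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five loans.
ltv = [60, 70, 80, 90, 95]
cltv = [60, 72, 80, 92, 95]
occupancy = ["P", "P", "I", "P", "S"]  # assumed category codes

print(round(pearson(ltv, cltv), 3))      # close to 1: LTV and CLTV move together
print(Counter(occupancy).most_common())  # counts that a bar chart would display
```

A coefficient close to 1 between two variables, as with LTV and CLTV here, suggests that one of them carries little extra information, which is the kind of evidence used when variables are dropped in the next stage.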

Stage 3 :- Modifying

Based on the results from the previous stage, certain modifications were made to the dataset to
make it more efficient and appropriate for the project’s use. Broadly, six modifications were
made.

1. Selecting - According to the trends derived in the previous stage, the variables that
contribute to the project were retained and the rest were eliminated. The variables
that were eliminated are given in appendix 2.
2. Handling missing values – The variables that had missing values were identified
and the gaps were filled with 0 or with the mean of all the values in the column.

3. Creating - Categories were created for all economic variables. Each category
covers a range of values and is assigned a value of 0, 1, 2 and so on. The
economic factors were mapped to the loans according to the month of the loan
origination date. Columns containing the economic factors of the previous month were
also added to the dataset to include historical data, which helps in better
determination of the delinquency status.
4. Transforming – Categorical variables with string categories cannot be fed to the
machine. Therefore, all the string categories were converted to numerical codes.
5. Scaling – The numeric variables are not all on the same scale, and some algorithms are
very sensitive to magnitude. A variable on a larger scale tends to dominate the other
variables and distort the output. So, the numerical variables were normalised to a
common range.
6. Combining – The modified acquisition, performance and economic factors datasets
were combined to form the final dataset with 443372 rows and 30 columns.
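
Steps 2 to 5 above can be illustrated with a small dependency-free sketch. The records, field names, bin edges and category codes are all assumptions made for illustration; the project's actual columns and cut-offs are not reproduced here, and min-max scaling is assumed as the normalisation method.

```python
from statistics import mean

# Hypothetical records; field names and values are illustrative only.
loans = [
    {"loan_id": 1, "orig_month": "2019-02", "credit_score": 720, "occupancy": "P"},
    {"loan_id": 2, "orig_month": "2019-03", "credit_score": None, "occupancy": "I"},
    {"loan_id": 3, "orig_month": "2019-03", "credit_score": 680, "occupancy": "P"},
]
unemployment = {"2019-01": 4.0, "2019-02": 3.8, "2019-03": 3.8}  # assumed monthly rates


def previous_month(month: str) -> str:
    """Return the 'YYYY-MM' label of the month before `month`."""
    year, mon = map(int, month.split("-"))
    return f"{year - 1}-12" if mon == 1 else f"{year}-{mon - 1:02d}"


def to_category(value: float, edges: list) -> int:
    """Map a value to 0, 1, 2, ... according to the ascending bin edges."""
    return sum(value >= edge for edge in edges)


# Handling missing values: fill gaps with the column mean.
known = [row["credit_score"] for row in loans if row["credit_score"] is not None]
for row in loans:
    if row["credit_score"] is None:
        row["credit_score"] = mean(known)

# Creating: attach binned current- and previous-month economic factors.
edges = [3.9, 5.0]  # assumed cut-offs giving categories 0, 1 and 2
for row in loans:
    row["unemp_cur"] = to_category(unemployment[row["orig_month"]], edges)
    row["unemp_prev"] = to_category(unemployment[previous_month(row["orig_month"])], edges)

# Transforming: encode string categories as integers.
codes = {label: i for i, label in enumerate(sorted({r["occupancy"] for r in loans}))}
for row in loans:
    row["occupancy"] = codes[row["occupancy"]]

# Scaling: min-max normalisation of a numeric column to the range [0, 1].
low = min(r["credit_score"] for r in loans)
high = max(r["credit_score"] for r in loans)
for row in loans:
    row["credit_score"] = (row["credit_score"] - low) / (high - low)

print(loans)
```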

Figure 9:- Sample of the final dataset

Stage 4 :- Modelling

In this stage, the data was tweaked further and the Machine Learning algorithms were then
applied to the final dataset. Input-output pairs that can be used to train an algorithm were
available in this study. Therefore, Supervised Machine Learning algorithms were used: the
input and output variables were fed to the algorithm, it was trained on them and a model was
created. This model is used to predict the output for new inputs. The input variables fed to
the machine are called FEATURES and the output variable is called the TARGET VARIABLE.
In this study, the features are CLTV, borrower credit score, GDP, unemployment rate, etc. and
the target variable is the delinquency status.

The data had to be enhanced further to make it more suitable for modelling and to give accurate
results. These enhancements were done in 2 steps – Feature selection and Dataset
Enhancement.

1. Feature selection – Earlier, in stage 2, the relationships of the variables with one another
were determined and used to eliminate the variables that contribute less to this
study. At this stage, the relationship between the features and the target variable was
determined in order to eliminate the features that have less impact on the target variable.
This also helps in reducing the dimensionality of the entire dataset. Feature selection was
done using two methods.

Method 1 :- Correlation heat map – The correlation of each feature with the target
variable was determined. The lower the absolute correlation, the lower the impact of that
feature on the target variable, so such features can be eliminated.

Figure 10:- Correlation heat map for feature selection

Method 2 :- Relationship scores – Relationship scores between the features and the target
variable were determined using the chi-square test for categorical variables and the
ANOVA F-value for numerical variables. The lower the score, the lower the impact of the
feature on the target variable.

Figure 11:- Relationship score between features and the target variable

From the results of the above two methods, a few variables were eliminated. These
variables are specified in appendix 3.

Before feature selection there were 25 features; after feature selection, the number of
features reduced to 16.

2. Dataset enhancement

One very prevalent problem in the area of defaults was faced in this study too: an
unbalanced dataset. An unbalanced dataset is one in which the target variable has many
more observations in one class than in the others. The challenge appears when machine
learning algorithms try to identify the rare cases in rather big datasets. Because of the
disparity between the classes, the algorithm tends to assign observations to the class with
more instances (the majority class) while giving a false sense of a highly accurate model.
Both the inability to predict rare events (the minority class) and the misleading accuracy
detract from the predictive models that are built.

To reduce the effect of the unbalanced dataset, it was enhanced by adding more data on
delinquent loans. As mentioned earlier, only the first 1,000 loan records from each quarter
were considered. From the remaining loan data, only the delinquent loans were filtered
and appended to the final dataset. This increased the number of rows from 443372 to
522465.

The features and target variable in the final dataset are given in appendix 4.
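
The misleading accuracy caused by an unbalanced dataset can be seen in a few lines. This is only a sketch with made-up counts (95 performing and 5 delinquent loans, then 20 extra delinquent records), not the project's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual_pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in actual_pos) / len(actual_pos)


# Hypothetical sample: 95 performing loans (0) and 5 delinquent loans (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95, although no delinquent loan is found
print(recall(y_true, y_pred))    # 0.0

# Enhancement as described above: append extra delinquent records only.
extra_delinquent = [1] * 20  # assumed number of additional delinquent loans
y_enhanced = y_true + extra_delinquent
print(sum(y_enhanced) / len(y_enhanced))  # share of delinquent loans rises from 0.05
```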

Machine learning algorithms were then applied to the final dataset. The target variable under
consideration is the delinquency status, which has two outcomes – 1 when the loan is delinquent
and 0 when the loan is not delinquent. This makes it a binary variable, so classification
algorithms were applied. Four classification algorithms were applied in this study: K Nearest
Neighbour, Logistic regression, Random forest classifier and Naïve Bayes.

Logistic Regression

Logistic regression is a classification algorithm used to assign observations to a discrete set
of labels. The output is transformed using the logistic sigmoid function, which returns a
probability value that is then mapped to one of the discrete labels.
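
A minimal sketch of this mapping for the binary case is shown below. The weights, bias and the 0.5 threshold are assumptions chosen by hand for illustration; they are not fitted values from the project.

```python
from math import exp


def sigmoid(z: float) -> float:
    """Logistic function mapping any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Assumed, hand-picked coefficients for two scaled features
# (for example CLTV and credit score); they are not fitted values.
weights, bias = [2.0, -3.0], 0.0
print(predict([0.9, 0.2], weights, bias))  # z is about 1.2, so the label is 1
print(predict([0.3, 0.8], weights, bias))  # z is about -1.8, so the label is 0
```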

K Nearest Neighbour

K-Nearest Neighbour is one of the most essential and basic classification algorithms in
Machine Learning. It is commonly used in real-life circumstances as it is non-parametric, i.e.,
it does not make any underlying assumptions about the way the data is distributed.

Algorithm
Let p be an unknown point. Let m be the number of training data samples.
1. Store the training samples in an array of data points arr[]. Each element of this array
represents a tuple (x, y), where x is the feature vector and y is the class label.
2. For i = 0 to m-1: calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an
already classified data point.
4. Return the majority label among S.
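
The four steps above translate almost directly into Python. The sketch below assumes already scaled features and uses hypothetical training points; the function name `knn_predict` and the choice k = 3 are illustrative.

```python
from collections import Counter
from math import dist


def knn_predict(training, point, k=3):
    """Classify `point` by majority vote among its k nearest training samples.

    `training` is a list of (features, label) tuples.
    """
    # Step 2: Euclidean distance from the unknown point to every training sample.
    distances = [(dist(features, point), label) for features, label in training]
    # Step 3: keep the k samples with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: return the most common label among those neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Step 1: hypothetical, already scaled training samples (features, label).
training = [
    ((0.1, 0.9), 0), ((0.2, 0.8), 0), ((0.3, 0.7), 0),
    ((0.8, 0.2), 1), ((0.9, 0.1), 1), ((0.7, 0.3), 1),
]
print(knn_predict(training, (0.85, 0.15)))  # 1
print(knn_predict(training, (0.15, 0.85)))  # 0
```

Because every prediction compares the new point with all stored samples, the distances depend directly on the feature scales, which is why the scaling step in stage 3 matters for this algorithm.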

Random forest classifier


A Random Forest is an ensemble technique capable of performing both classification and
regression tasks using multiple decision trees. Many subsets are drawn from the entire dataset
and a decision tree is generated for each subset. The final prediction is the aggregate of the
results of all the decision trees, i.e. the majority vote for classification.
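
The idea of training one tree per subset and aggregating the votes can be sketched as follows. This is a deliberately simplified illustration, not a full random forest: it assumes bootstrap samples (drawn with replacement), a single feature and one-split trees (decision stumps), and it omits the random feature selection used by real implementations. All names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a one-split decision tree on (x, label) pairs with a single feature."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    best = (len(sample) + 1, None, majority, majority)  # (errors, threshold, left, right)
    values = sorted({x for x, _ in sample})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = Counter(lab for x, lab in sample if x <= threshold).most_common(1)[0][0]
        right = Counter(lab for x, lab in sample if x > threshold).most_common(1)[0][0]
        errors = sum(lab != (left if x <= threshold else right) for x, lab in sample)
        if errors < best[0]:
            best = (errors, threshold, left, right)
    _, threshold, left, right = best
    if threshold is None:  # only one distinct value: always predict the majority
        return lambda x: majority
    return lambda x: left if x <= threshold else right


def fit_forest(data, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]


def forest_predict(forest, x):
    """Aggregate the individual trees by majority vote."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]


# Hypothetical one-feature data: low values are class 0, high values are class 1.
data = [(x, 0) for x in range(1, 6)] + [(x, 1) for x in range(11, 16)]
forest = fit_forest(data)
print(forest_predict(forest, 2), forest_predict(forest, 14))
```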

Naïve Bayes classifier


Naïve Bayes classifiers are a family of classification algorithms based on Bayes’ Theorem.
All of them share a common principle: every pair of features being classified is assumed to
be independent of each other, given the class.
Bayes’ theorem is as given below:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

Here y is the class and X is the set of features. Applying the naïve assumption, i.e.
independence among the features, to the above-mentioned Bayes’ theorem gives the
fundamental structure of the Naïve Bayes classifier.
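
Written out for a feature vector X = (x_1, ..., x_n), the independence assumption lets the likelihood factorise into one term per feature. The sketch below states the resulting posterior and decision rule; the symbols x_i, n and \hat{y} are notation introduced here for illustration.

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
\qquad\Longrightarrow\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator does not depend on y, so it can be ignored when choosing the most probable class.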

Stage 5 :- Assessing

Three different techniques were used to assess the results of all the algorithms that were applied
to the final dataset.

1. Confusion matrix - The numbers of correct and incorrect predictions are summarized
as counts, broken down by class.
2. Classification report - The report shows the following main classification metrics,
mostly on a per-class basis –
• Accuracy – The fraction of all predictions that are correct (reported for the model
as a whole)
• Precision – The fraction of relevant instances among the total retrieved
instances
• Recall – Also known as sensitivity, it is the fraction of relevant instances
retrieved over the total number of relevant instances
• f1-score – The harmonic mean of precision and recall, computed per class

Figure 12:- Confusion matrix and Classification report

3. Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) - The ROC
curve is a graph that captures the performance of a classification model at different
classification thresholds. This curve plots two parameters:

True Positive Rate = True Positive / (True Positive + False Negative)

False Positive Rate = False Positive / (False Positive + True Negative)

AUC is the area under this curve. It tells how capable the model is of differentiating between
labels. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
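
The assessment measures above can be computed from scratch as a check on their definitions. The sketch below treats class 1 (delinquent) as the positive class and uses hypothetical labels and scores; AUC is computed through its rank interpretation, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, rather than by integrating the plotted curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with class 1 (delinquent) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    """Metrics for the positive class; assumes each denominator is non-zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, predicted probabilities and thresholded predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 4)
print(report(y_true, y_pred))
print(auc(y_true, scores))
```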

On assessing the results of all the algorithms, the K Nearest Neighbour algorithm gave the most
accurate predictions. Its results are shown in appendix 5.

Stage 6 :- Deploying

K Nearest Neighbour gave the most accurate results. Therefore, the model was created by
applying K Nearest Neighbour to the final dataset.

The model was applied to all the active loans (23778 loans) to predict their delinquency status
from June 2020 to December 2020.
The model was also deployed to make it available to end users. A webpage was created using
HTML where users can enter the required data. The page is connected to the model in the
backend, which processes the input and returns the delinquency status, which is then displayed
to the user on the webpage. The layout of the webpage is shown below.

Figure 13:- Layout of webpage where the model was deployed
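
The flow from the web form to the displayed status can be sketched with the standard library alone. Everything here is hypothetical: the field names, the stand-in `model_predict` rule and the messages are assumptions for illustration, and the backend actually used in the project is not described in this report.

```python
from urllib.parse import parse_qs

FIELDS = ("cltv", "credit_score", "unemployment_rate")  # assumed input names


def model_predict(features):
    """Stand-in for the trained classifier; this rule is purely illustrative."""
    cltv, credit_score, unemployment_rate = features
    return int(cltv > 90 and credit_score < 650)


def handle_submission(query_string: str) -> str:
    """Turn the submitted form data into the message shown on the page."""
    form = parse_qs(query_string)
    missing = [name for name in FIELDS if name not in form]
    if missing:
        return "Missing input: " + ", ".join(missing)
    features = [float(form[name][0]) for name in FIELDS]
    status = model_predict(features)
    return "Delinquent" if status == 1 else "Not delinquent"


print(handle_submission("cltv=95&credit_score=600&unemployment_rate=8.1"))
print(handle_submission("cltv=70&credit_score=760&unemployment_rate=3.9"))
print(handle_submission("cltv=70"))
```

In a real deployment the stand-in rule would be replaced by a call to the trained model, and the inputs would need the same encoding and scaling that were applied to the training data.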

RESULTS

The graph below shows the number of delinquent loans per month from January 2020 to
December 2020, as predicted by the K Nearest Neighbour model.

Figure 14:- Number of delinquent loans per month predicted by the model

The model predicts that around 400 loans will become delinquent from June to December 2020.
It can be seen from the graph that the number of delinquent loans also increases gradually.
This increase might be due to changing economic conditions, such as the increasing
unemployment rate and decreasing GDP caused by the COVID-19 pandemic.

CONCLUSIONS

The observed results show that there might be an increase in the delinquency rate as
the year progresses. The main reason attributed to this surge is the COVID-19 pandemic, which
is affecting the global economy. The increasing unemployment rate, decreasing GDP and
changes in other economic factors might lead to a money crunch for many borrowers,
because of which the delinquency rate might surge.

Measures that the government has brought in, such as mortgage forbearance, might reduce
the rate, but this factor is not a part of this study as data on the borrowers who availed
themselves of these measures was not available.

This study is very relevant, especially at a time when the world is going through a
pandemic, as it will help lenders to know their borrowers’ status beforehand so that they
can come up with efficient ways to confront risks.

LIMITATION OF THE STUDY & SCOPE FOR FUTURE IMPROVEMENT

As the internship was held virtually and the entire project was done on the intern’s own laptop,
access to data was restricted. All the data was collected from public sources. Wider borrower
details such as annual income, gender and marital status, which tend to affect the delinquency
rate, were difficult to obtain. The model was built and deployed using only the limited data
that was openly available, which is a limitation of the study.

There is scope for improvement on two fronts. Firstly, access to more information about the
borrowers and lenders can make the prediction more accurate. Secondly, more advanced and
sophisticated machine learning algorithms can be used to create the models, which might
improve their efficiency and accuracy.

PERSONAL LEARNINGS FROM THE SIP

The internship with Deloitte was a complete learning process. I gained deeper knowledge
of Python, machine learning and the banking and finance domain, specifically the lending
process. As I had no prior experience in any of these fields, this internship gave me a chance
to start from scratch and build my knowledge base in all of them. Apart from the
project, the various other activities conducted by the team gave me an idea of the
current situation in the corporate world and helped me pick up good presentation skills. Being
part of the first batch to have a virtual internship at Deloitte was a whole new experience and
moulded me to adapt to the new normal.

BIBLIOGRAPHY

• https://fred.stlouisfed.org/
• http://erepository.uonbi.ac.ke/bitstream/handle/11295/75052/Agao_The%20effect%20of%20macroeconomic%20variables%20on%20the%20mortgage%20uptake.pdf?sequence=4
• http://wiredspace.wits.ac.za/handle/10539/12760
• https://www.newyorkfed.org/medialibrary/media/research/staff_reports/sr732.pdf
• file:///C:/Users/sahan/Desktop/sahana/Deloitte/Research%20data/Factors_Driving_Demand_and_Default_Risk_in_Residen.pdf
• file:///C:/Users/sahan/Downloads/0066_16CTN08-8313%20(1).pdf

APPENDICES

Appendix 1 – All the variables in acquisition dataset and its type

File Position   Field Name   Type
1 LOAN IDENTIFIER NUMERICAL
2 ORIGINATION CHANNEL CATEGORICAL
3 SELLER NAME CATEGORICAL
4 ORIGINAL INTEREST RATE NUMERICAL
5 ORIGINAL UPB NUMERICAL
6 ORIGINAL LOAN TERM NUMERICAL
7 ORIGINATION DATE DATE
8 FIRST PAYMENT DATE DATE
9 ORIGINAL LOAN-TO-VALUE (LTV) NUMERICAL
10 ORIGINAL COMBINED LOAN-TO-VALUE (CLTV) NUMERICAL
11 NUMBER OF BORROWERS NUMERICAL
12 ORIGINAL DEBT TO INCOME RATIO NUMERICAL
13 BORROWER CREDIT SCORE AT ORIGINATION NUMERICAL
14 FIRST TIME HOME BUYER INDICATOR CATEGORICAL
15 LOAN PURPOSE CATEGORICAL
16 PROPERTY TYPE CATEGORICAL
17 NUMBER OF UNITS NUMERICAL
18 OCCUPANCY TYPE CATEGORICAL
19 PROPERTY STATE CATEGORICAL
20 ZIP CODE SHORT CATEGORICAL
21 PRIMARY MORTGAGE INSURANCE PERCENT NUMERICAL
22 PRODUCT TYPE CATEGORICAL
23 CO-BORROWER CREDIT SCORE AT ORIGINATION NUMERICAL
24 MORTGAGE INSURANCE TYPE CATEGORICAL
25 RELOCATION MORTGAGE INDICATOR CATEGORICAL

Appendix 2 – The variables that were eliminated after the exploring stage
Variable name: Reason
ORIGINAL LOAN-TO-VALUE (LTV): Another variable CLTV is a better measure than LTV
NUMBER OF BORROWERS: This project focuses on only single borrower
PROPERTY TYPE: This project focuses on only one type – Single family
NUMBER OF UNITS: It had lower correlation with all the variables in the dataset
PROPERTY STATE: Another variable ZIP CODE is a better indicator for location
PRODUCT TYPE: All the loans are of the same type – Fixed rate mortgage loans
CO-BORROWER CREDIT SCORE AT ORIGINATION: This project focuses only on single borrowers, therefore the co-borrower credit score is not required
RELOCATION MORTGAGE INDICATOR: The variability is less. Almost all the loans are not relocations mortgage loans

Appendix 3 – The variables that were eliminated after feature selection

Variables that were eliminated after feature selection
ORIGINAL UPB
ORIGINAL LOAN TERM
FIRST TIME HOME BUYER INDICATOR
CURRENT MONTH’S INFLATION
PREVIOUS MONTH’S INFLATION
CURRENT MONTH’S INTEREST RATE
PREVIOUS MONTH’S INTEREST RATE
LOAN PURPOSE

Appendix 4 – The final features and target variable used in the study

Target variable: DELINQUENCY STATUS

Features:
ORIGINAL INTEREST RATE
ORIGINAL COMBINED LOAN-TO-VALUE (CLTV)
ORIGINAL DEBT TO INCOME RATIO
BORROWER CREDIT SCORE AT ORIGINATION
OCCUPANCY TYPE
ZIP CODE SHORT
PRIMARY MORTGAGE INSURANCE %
MORTGAGE INSURANCE TYPE
CURRENT MONTH’S GDP
CURRENT MONTH’S UNEMPLOYMENT RATE
CURRENT MONTH’S HPI
CURRENT MONTH’S DELINQUENCY RATE
PREVIOUS MONTH’S GDP
PREVIOUS MONTH’S UNEMPLOYMENT RATE
PREVIOUS MONTH’S HPI
PREVIOUS MONTH’S DELINQUENCY RATE

Appendix 5 – The assessment results of K Nearest Neighbour
