
Finance and Risk Analytics

Report By: Surabhi Sood

1 Project Objective
The objective of this report is to create an India credit risk (default) model using the data provided
in the spreadsheet raw-data.xlsx, and to validate it on validation_data.xlsx.
This report will consist of the following:

a. Importing the dataset in R
b. Understanding the structure of the dataset
c. Statistical exploration
d. Insights from the dataset

The logistic regression framework is used to develop the credit default model.

2 Assumptions

Following are the assumptions of logistic regression that we have taken into consideration:

 Binary logistic regression requires the dependent variable to be binary/dichotomous.
 The predictor variables should be independent; there should be little or no multicollinearity
among the predictor variables.
 Logistic regression assumes a linear relationship between each of the independent variables
and the logit of the outcome.
 There should be no influential values (extreme values or outliers) in the continuous
predictors.

3 Statistical Data Analysis – Step by step approach

The following steps were followed to carry out the analysis of the given raw dataset:

1. Data exploration and data preparation
2. Logistic regression model building
3. Performance measurement and decile division
4. Inferences

3.1 Environment Set up and Data Import


3.1.1 Install necessary Packages and Invoke Libraries

library(ipred)
library(ROCR)
library(ggplot2)
library(dplyr)
library(corrplot)
library(StatMeasures)

The packages were installed, and the libraries invoked to create the model.
Various other libraries will be used and installed as and when needed while developing the
model.

3.1.2 Set up working Directory


Setting a working directory at the start of the R session makes importing and exporting data
files and code files easier. The working directory is simply the location/folder on the PC where
the data, code, etc. related to the project are kept.

The working directory is set using the setwd command as shown below:

setwd("C:/Users/surabhi1.arora/Downloads")
3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the
file.

raw_data <- read.csv("C:/Users/surabhi1.arora/Downloads/raw_data.csv" , header = T)


validation_data <- read.csv("C:/Users/surabhi1.arora/Downloads/validation_data.csv" , header = T)

Solution with Statistical Calculations

Question 1:

EDA of the data available.

raw_data<- read.csv("C:/Users/surabhi1.arora/Downloads/raw-data.csv" , header = T)


attach(raw_data)
dim(raw_data)
names(raw_data)
summary(raw_data)
str(raw_data)

Key observations about the dataset:

• Number of observations in the dataset: 3541
• Number of variables: 52
• The data consists of numerical/integer variables, except for "Deposits accepted by Commercial Banks"
• For the dependent variable, "Net worth next year", the 100th percentile (maximum) value observed
is 805773.4, while the 0th percentile (minimum) value is -74265.6
• The median (50th percentile) is 116.3 while the mean is 1616.3, showing skewness in the data
• Thus we conclude that the given dataset is not normally distributed.

Missing Values:

Another important observation from the output of the summary command is that several variables have
missing values: "Total income" has 198 NA's, "Change in stock" has 458 NA's, and "Other income" has
1295 missing values. A possible reason for the latter is that many companies declare no other income.
Other variables may have missing values because those columns are not applicable to certain sectors,
industries or companies. Hence, treatment of these missing values is essential for an accurate model.

We also need to analyze and select which independent variables should be considered in the analysis
for their effect on the dependent variable.

To get a better handle on the dependent variable, "Net worth next year", we create a default variable
coded 0 or 1 depending on whether that value is positive or negative; if net worth next year is
positive, the default value is 0, as no default is expected in that case.
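A minimal sketch of this derivation (the column name Networth.Next.Year is an assumption; it should be
replaced with the actual name of the "Net worth next year" column in raw_data):

# Default_value = 1 when next year's net worth is negative, 0 otherwise
# (Networth.Next.Year is an assumed column name)
raw_data$Default_value <- ifelse(raw_data$Networth.Next.Year < 0, 1, 0)
Default_value <- raw_data$Default_value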
> summary(as.factor(Default_value))
0 1
3298 243

This shows that the new variable “Default_value” has 3298 0’s and 243 1’s.
Hence we conclude that there are 243 defaulting companies in the dataset given.

This gives the default rate as 243/(3298+243) = 6.86%

A bar plot of the number of defaulters and non-defaulters (the figure is not reproduced here) makes
this class imbalance visible; a sketch of the call used to produce it is given below.
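A minimal sketch using base R, assuming the Default_value column created earlier:

# Bar plot of non-defaulters (0) vs defaulters (1)
barplot(table(raw_data$Default_value),
        names.arg = c("Non-defaulter", "Defaulter"),
        ylab = "Number of companies",
        main = "Defaulters vs Non-defaulters")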

To find out the variables which have missing values in the cells, we use the below query :

> sapply(raw_data,function(x) (sum(is.na(x))) )

We then replace the missing values with the median of each variable to impute the dataset, as sketched below.
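A minimal sketch of the median imputation, looping over the numeric columns of raw_data:

# Impute each numeric column's NA values with that column's median
for (col in names(raw_data)) {
  if (is.numeric(raw_data[[col]])) {
    raw_data[[col]][is.na(raw_data[[col]])] <- median(raw_data[[col]], na.rm = TRUE)
  }
}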

Checking again for any remaining missing values, we run the query once more and see that all the
values returned are 0:

sapply(raw_data , function(x) sum(is.na(x)))

Checking for Outliers :

An outlier is an observation that is numerically distant from the rest of the data. When reviewing a
boxplot, an outlier is defined as a data point that is located outside the fences (“whiskers”) of the
boxplot.
We examine the boxplot of the data:

boxplot(raw_data)

The above boxplot shows that one variable, Shares Outstanding, stands out with extreme outliers.

This gives the boxplot below (the figure itself is not reproduced here).
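Since the scale of Shares Outstanding swamps every other variable, one plausible way to obtain the
second boxplot is to exclude that column; a sketch under that assumption (the column name
Shares.outstanding is taken from the later coefficient listing and may differ in the actual file):

# Boxplot of the data excluding the dominant Shares.outstanding column (assumed name)
boxplot(raw_data[, setdiff(names(raw_data), "Shares.outstanding")])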

Multicollinearity :
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression
model are highly linearly related.
To check for collinearity between the independent variables, we use the corrplot function from the
corrplot library:
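A minimal sketch of the correlation plot (the plot image itself is not reproduced in this document):

# Correlation matrix of the numeric variables, visualised with corrplot
num_vars <- raw_data[, sapply(raw_data, is.numeric)]
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "circle", tl.cex = 0.5)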

From the correlation plot we see that several variables are correlated among themselves, indicating
the presence of multicollinearity in the dataset. For example:

 Total expenses show a relatively strong correlation with net worth.
 Current assets show a strong correlation with PBT as well as with cash profit. Since several
independent variables are correlated among themselves, this could reduce the accuracy of the model
and hence needs to be addressed.
 The dependent variable, Net worth next year, shows a relatively strong correlation with total
income, total expenses, current liabilities, current assets, etc.

Given the above, we cannot use these independent variables for modelling without further analysis,
due to multicollinearity.
Question 2:

Modelling

As we saw in the EDA of the given dataset, the dependent variable "Net worth next year" has been
converted into a new variable, "Default_value", which contains binary responses of 0 and 1 for
non-defaulters and defaulters respectively.

Because of the binary response variable we can use logistic regression. Rather than
modelling the response Y directly, logistic regression models the probability that Y belongs
to a particular category, in our case the probability of a non-performing loan. This probability
can be computed by the logistic function,

P = exp(b0 + b1X1 + … + bNXN) / [ 1 + exp(b0 + b1X1 + … + bNXN) ]

where

 P is the probability of default
 b0, b1, …, bN are the coefficient estimates
 N is the number of independent variables (predictors)
 X1, …, XN are the independent variables
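As a small illustration, this function can be written directly in R (p_default below is a hypothetical
helper used only for illustration, not part of the report's model code):

# Probability of default for one observation, given coefficients b (intercept first)
# and that observation's predictor values x
p_default <- function(b, x) {
  eta <- b[1] + sum(b[-1] * x)   # linear predictor b0 + b1*X1 + ... + bN*XN
  exp(eta) / (1 + exp(eta))      # equivalently plogis(eta)
}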

To use the glm function to run the logistic regression in R, we first remove a few variables from the
dataset which are not useful for modelling (a sketch of this preparation follows the list):

 Column 1 is just a serial number identifying the company and has no bearing on whether the
company is a credit defaulter, so it is removed from the modelling dataset.
 Column 2 is the dependent variable, which has already been recoded into the new variable
"Default_value", and hence is not used in the model dataset.
 Column 52, "PE on BSE", has a lot of NA values, meaning this variable is not available for most
companies, so it is also not considered for modelling.
 The variable "Deposits (accepted by commercial banks)" is read in as a logical column with no
values available (all NA); it has no impact on the dependent variable and is removed as well.
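A minimal sketch of this preparation, assuming the column positions described above and an assumed
name for the deposits column (both may differ in the actual file; model1 is just a working name):

# Drop the company number (col 1), the raw dependent variable (col 2) and "PE on BSE" (col 52)
raw_data_model <- raw_data[, -c(1, 2, 52)]
# Drop the all-NA deposits column (assumed name) and make sure the binary response is attached
raw_data_model$Deposits..accepted.by.commercial.banks. <- NULL
raw_data_model$Default_value <- raw_data$Default_value
model1 <- glm(Default_value ~ ., family = binomial(), data = raw_data_model)
model1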

With this, we fit a model using all the remaining variables, which gives the output below:

Call:  glm(formula = Default_value ~ ., family = binomial(), data = raw_data_model)

Coefficients:
(Intercept)                                    -2.164e+00
Total.assets                                    2.095e-03
Net.worth                                      -1.499e-03
Total.income                                    1.243e-03
Change.in.stock                                 2.895e-03
Total.expenses                                 -1.217e-03
Profit.after.tax                               -5.154e-03
PBDITA                                         -5.316e-03
PBT                                             1.019e-02
Cash.profit                                    -1.176e-02
PBDITA.as...of.total.income                     1.000e-03
PBT.as...of.total.income                        2.529e-03
PAT.as...of.total.income                       -2.922e-03
Cash.profit.as...of.total.income               -5.176e-04
PAT.as...of.net.worth                          -1.506e-02
Sales                                          -1.278e-04
Income.from.financial.services                  1.015e-02
Other.income                                    2.069e-02
Total.capital                                  -1.370e-04
Reserves.and.funds                             -3.707e-03
Borrowings                                      2.950e-03
Current.liabilities...provisions               -7.167e-04
Deferred.tax.liability                          3.642e-03
Shareholders.funds                              3.361e-03
Cumulative.retained.profits                    -1.868e-03
Capital.employed                               -5.268e-03
TOL.TNW                                         2.005e-02
Total.term.liabilities...tangible.net.worth    -1.730e-02
Contingent.liabilities...Net.worth....          2.518e-04
Contingent.liabilities                         -2.306e-03
Net.fixed.assets                                1.428e-03
Investments                                     1.183e-03
Current.assets                                 -1.623e-03
Net.working.capital                             1.119e-03
Quick.ratio..times.                            -1.925e-01
Current.ratio..times.                          -1.916e-02
Debt.to.equity.ratio..times.                    1.949e-02
Cash.to.current.liabilities..times.             2.013e-01
Cash.to.average.cost.of.sales.per.day           6.333e-05
Creditors.turnover                             -5.067e-03
Debtors.turnover                                8.009e-04
Finished.goods.turnover                         2.379e-04
WIP.turnover                                   -6.295e-03
Raw.material.turnover                          -1.565e-03
Shares.outstanding                              2.135e-09
Equity.face.value                              -6.320e-06
EPS                                             1.588e-03
Adjusted.EPS                                   -1.586e-03
Total.liabilities                                      NA

Degrees of Freedom: 3540 Total (i.e. Null); 3493 Residual
Null Deviance:      1771
Residual Deviance:  1152    AIC: 1248

As this model is very difficult to interpret, we will reduce the number of independent variables and
run the model again. From the multicollinearity check above, we know that many of these variables are
correlated with one another, so the accuracy of a model built using all of them will not be very high.

Going through all the variables and dividing them into 4 broad buckets of size, profitability,
leverage and liquidity, we have the following observations:

 SIZE: The total assets, net worth and sales of a company cannot individually be used as
determining factors for credit risk, since large companies will naturally have higher values of
these variables; a ratio is needed for a fair understanding of their relationship with net worth
next year (the dependent variable).
 PROFITABILITY: Profit as a percentage of total income can be used to determine whether a company
is at risk of defaulting on its loan.
 LEVERAGE: The current ratio, which is current assets divided by current liabilities.
 LIQUIDITY: The quick ratio, calculated by subtracting inventory from current assets and dividing
the result by current liabilities.

Dividing the variables into these buckets, we choose one variable from each bucket to build the model:

For size we have taken net worth; for profitability, profit before tax as a percentage of total
income has been selected; and for leverage and liquidity we have used the current ratio and the
quick ratio.
model2 <- glm(Default_value ~ Net.worth + PBT.as...of.total.income +
              Current.ratio..times. + Quick.ratio..times.,
              data = raw_data, family = binomial)

model2
Call:
glm(formula = Default ~ Net.worth + PBT.as...of.total.income +
Quick.ratio..times. + Current.ratio..times., family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.2610 -0.3704 -0.3605 -0.3301 4.9838

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)               -2.5744617  0.0979907 -26.273  < 2e-16 ***
Net.worth                 -0.0003367  0.0001222  -2.755  0.00587 **
PBT.as...of.total.income  -0.0022553  0.0004484  -5.030  4.9e-07 ***
Quick.ratio..times.        0.0838619  0.0865449   0.969  0.33255
Current.ratio..times.     -0.0842405  0.0858996  -0.981  0.32675
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1586.0 on 3391 degrees of freedom


Residual deviance: 1507.6 on 3387 degrees of freedom
(149 observations deleted due to missingness)
AIC: 1517.6

Number of Fisher Scoring iterations: 9


From the above model output, we see that net worth and PBT as a percentage of total income are
statistically significant, while the quick ratio and current ratio are not individually significant
at the 5% level; we nevertheless retain one variable from each bucket.
The equation representing the dependency of default on the selected independent variables (net worth,
PBT as a percentage of total income, current ratio and quick ratio) is:

P = exp(-2.5745 - 0.0003*NetWorth - 0.0023*PBT%ofIncome - 0.0842*CurrentRatio + 0.0839*QuickRatio) /
    [ 1 + exp(-2.5745 - 0.0003*NetWorth - 0.0023*PBT%ofIncome - 0.0842*CurrentRatio + 0.0839*QuickRatio) ]

where P is the predicted probability of default (derived from the dependent variable, net worth next
year). For ease of understanding, the ratio names are used in the equation instead of the actual
variable names.
It is evident from the equation that the intercept is negative, as are the coefficients of net worth,
PBT as a percentage of total income and the current ratio; an increase in any of these therefore
lowers the predicted probability of default.

The next step is to evaluate the precision of the model, which is done using the fitted values
generated from the model.

prediction=ifelse(model2$fitted.values>0.07,1,0)
table(model2$y,prediction)

   prediction
       0    1
  0 3014  166
  1  115   97
(Plot of the fitted values not reproduced here.)

From the above table we see that 3014 companies have been correctly identified as non-defaulters,
while 166 have been marked as defaulters although they are not. Similarly, 115 defaulting companies
have been wrongly categorized as low-risk non-defaulters, and 97 defaulters have been correctly
identified.
The performance measurement statistics show:

 Sensitivity of the model can be calculated as TP/(TP+FN), which in this case is
3014/(3014+166) = 0.9477987, or 94.8%.
 Similarly, specificity can be calculated as TN/(TN+FP), which is 97/(97+115) = 0.4575472, or 45.8%.
 Lastly, the accuracy of the model can be calculated as (TP+TN)/(TP+TN+FP+FN) =
(3014+97)/(3014+166+97+115) = 0.917158, or 91.7%.

Ideally, the threshold value could be tuned further to improve the sensitivity and accuracy of the
model. But since sensitivity and accuracy are already above 90%, further tuning is not necessary.
Question 3

Performance Measurement

To check the acceptability of the model, we calculate the performance indices of the model output
against the validation (test) dataset.

We use the "predict" function to generate the model output on the validation data. Before running
predict, we impute the missing values in the test dataset with the median value of each column for a
better prediction output.

Since the predicted values are very low, we use a very low threshold of 0.069 to convert them to 0 and
1 for comparison with the "Default" column of the validation dataset, as sketched below.
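A minimal sketch of this step, assuming the validation data uses the same column names as the training
data (the object is called validation_data here, matching the earlier import; the report later refers
to it as validation):

# Impute each numeric column's NA values in the validation set with that column's median
for (col in names(validation_data)) {
  if (is.numeric(validation_data[[col]])) {
    validation_data[[col]][is.na(validation_data[[col]])] <- median(validation_data[[col]], na.rm = TRUE)
  }
}
# Predicted default probabilities from model2, converted to 0/1 with the 0.069 cut-off
pred_prob <- predict(model2, newdata = validation_data, type = "response")
test <- ifelse(pred_prob > 0.069, 1, 0)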

This gives the following comparison with the default column of the validation dataset:

value= validation$Default
table_LR = table(value,test)
table_LR

      test
value    0    1
    0  604   57
    1   24   30

This is the confusion matrix output, which is one of the performance indices used to measure the
model. The other key parameters are:

 Sensitivity: TP/(TP+FN) = 604/(604+57) = 0.913767, or 91.4%
 Specificity: TN/(TN+FP) = 30/(30+24) = 0.5555556, or 55.6%
 Accuracy: (TP+TN)/(TP+TN+FP+FN) = (604+30)/(604+30+57+24) = 0.8867133, or 88.7%

Calculating the area under the curve (AUC) gives:

> AUC
[1] 0.7346613

This gives the Area under the curve as 73.5%.

Plotting the ROC curve, we have:
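Both the AUC above and the ROC plot can be obtained with the ROCR package invoked earlier; a minimal
sketch, assuming pred_prob and the validation Default column from the previous step:

# ROC curve and AUC on the validation data
pred_obj <- prediction(pred_prob, validation_data$Default)
roc_perf <- performance(pred_obj, "tpr", "fpr")
plot(roc_perf, main = "ROC Curve - Validation Data")
AUC <- performance(pred_obj, "auc")@y.values[[1]]
AUC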


Deciles and Hosmer-Lemeshow Test

We have validated the model using several KPIs (confusion matrix, sensitivity, specificity and AUC);
all indicate that the model is a good one. Another statistical test of performance is the goodness-of-
fit test, or Hosmer-Lemeshow test.

Prior to implementing the test, the data is sorted in descending order based on the probability of
default (generated by the prediction model: pred_mod) and divided into deciles with that probability
as the reference point. For this purpose the dplyr library was used, with the mutate and ntile
functions, and the output was stored in a separate variable 'dec'. The data was then sorted in
descending order of the decile ranks, as sketched below.
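A minimal sketch of this decile construction with dplyr (pred_mod is the name used in the text for the
predicted probabilities; pred_prob from the earlier sketch is used to fill it):

library(dplyr)
# Attach the predicted probabilities, assign decile ranks and sort by decile
validation_data$pred_mod <- pred_prob
validation_data <- validation_data %>%
  mutate(dec = ntile(pred_mod, 10)) %>%   # decile 10 = highest predicted probability of default
  arrange(desc(dec))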

For the goodness-of-fit test of the generated logistic regression model, a new library,
"generalhoslem", is invoked; its 'logitgof' function performs the Hosmer-Lemeshow test.
The Hosmer-Lemeshow goodness-of-fit test is based on dividing the sample up according to the
predicted probabilities, or risks. Specifically, using the estimated parameter values, a predicted
probability of default is calculated for each observation from its covariate values via the logistic
function given earlier.

The observations in the sample are then split into g groups according to their predicted
probabilities. Here g = 10, continuing with the concept of deciles (10 is also the default value of g
for this test). The first group consists of the observations with the lowest 10% of predicted
probabilities, the second group consists of the 10% of the sample whose predicted probabilities are
next smallest, and so on (Bartlett, 2020).
The output generated after running the test is:

> logitgof(model2$y,fitted(model2),g=10)

Hosmer and Lemeshow test (binary model)

data:  model2$y, fitted(model2)
X-squared = 214.47, df = 8, p-value < 2.2e-16

Since the p-value is far below the 0.05 significance level, we reject the null hypothesis that the
observed and expected proportions agree, i.e. that the model fits the data well. This indicates a
poor fit of the model. Despite the earlier performance metrics pointing towards a good model,
statistically the generated model (model2) does not pass the goodness-of-fit test.
Although the goodness-of-fit test rules out the current model for prediction, the Hosmer-Lemeshow
test has a limitation: a significant p-value indicates a poor fit but gives no indication of which
aspect(s) of the model are fitting poorly, and hence demands further analysis. So while this
decile-based test is statistically appropriate and in our case gives a result that demands further
analysis, the other KPIs (confusion matrix, sensitivity, specificity and accuracy) show that the
model performs well.
