
Mini Project – Default Risk Prediction

Sravanthi.M

Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step-by-step Approach
   3.1. Environment Set-up and Data Import
      3.1.1. Install Necessary Packages and Invoke Libraries
      3.1.2. Cleaning up Data
      3.1.3. Reading the Data and Visualization
   3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings
6. Source Code

1. EDA
• Outlier treatment
• Missing value treatment
• New variable creation (one ratio each for profitability, leverage, liquidity and company size)
• Check for multicollinearity
• Univariate & bivariate analysis

2. Modeling
• Build a logistic regression model on the most important variables
• Analyze the coefficients & their signs

3. Model Performance Measures
• Predict the accuracy of the model on the development and validation datasets
• Sort the data in descending order of probability of default, divide it into 10 deciles based on that probability, and check how well the model has performed
1 Project Objective
The objective of the project is to build an India credit risk (default) model using the given training dataset and to validate it. The logistic regression framework is to be used to develop the credit default model.

2 Assumptions
• The raw data provided comprises financial data.
• The major financial data points serve as the model variables.

3 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Load and visualize the data
2. Preprocess the data
3. Check and clean the data
4. Test hypotheses; if a time series is non-stationary, stationarize it

We shall follow these steps in exploring the provided dataset.

3.1 Reading Data


3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping all the packages in the same place improves code readability. For installation we use install.packages("Package name").
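For example, a few of the packages invoked later in this report can be installed once and then loaded in each session:

install.packages(c("car", "scales", "DataExplorer"))  # one-time installation
library(car)          # regression diagnostics
library(scales)       # squish() for outlier capping
library(DataExplorer) # plot_intro(), plot_missing(), plot_histogram()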

3.2 Variable Identification


We use:
• summary(): a generic function that produces result summaries of various model-fitting functions; it invokes particular methods depending on the class of its first argument.
• hist(): to plot a histogram.
• dim(): to find the dimensions of the data.
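A minimal sketch of these calls on the imported data (rawdata as read in the Source Code section):

dim(rawdata)                      # number of observations and variables
summary(rawdata)                  # per-variable summary statistics
hist(rawdata$Networth.Next.Year)  # distribution of a single variable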

4 Conclusion
Major data points or variables are Net worth next year, Total assets, Net worth, Total income, Total
expenses, Profit after tax, PBDITA, PBT (Profit Before Tax), Cash profit, PBDITA as % of total income,
PBT as % of total income, Cash profit as % of total income, PAT as % of net worth, Sales, Total
capital, Reserves and funds, Borrowings, Current liabilities & provisions, Capital employed, Net fixed
assets, Investments, Net working capital, Debt to equity ratio (times), Cash to current liabilities
(times), Total liabilities.

In addition to the above variables, there are other financial parameters which define the financial strength of the organization, taking the total tally of variables to 51.
5 Detailed Explanation of Findings

5.1 EDA
Ans: There are two datasets, a training and a testing dataset, with similar variables. The datasets consist of organisation details such as Net worth next year, Total assets, Net worth, Total income, Total expenses, Profit after tax, PBDITA, PBT (Profit Before Tax), Cash profit, PBDITA as % of total income, PBT as % of total income, Cash profit as % of total income, PAT as % of net worth, Sales, Total capital, Reserves and funds, Borrowings, Current liabilities & provisions, Capital employed, Net fixed assets, Investments, Net working capital, Debt to equity ratio (times), Cash to current liabilities (times) and Total liabilities.

The dataset is imported for further analysis

Output:

• The raw dataset contains 3541 observations and 52 variables.
• The validation dataset contains 715 observations and 52 variables.

The training dataset does not have a default variable, so one is created from the 'Net worth Next Year' variable. Firms expected to have a negative net worth next year are likely to default, so negative observations of 'Net worth Next Year' are coded as 1 in the new Default variable and positive observations as 0.
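In the source code this is done as:

train$Default <- ifelse(train$Networth.Next.Year < 0, 1, 0)  # 1 = expected default
summary(as.factor(train$Default))                            # class balance of the flag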
Output:

Missing Values: There are 17,188 missing values in the training dataset.


Missing Value Treatment:
The dataset should contain only numeric values for performing logistic regression, but the code below shows that a few variables are of class 'character'.

Output:

The above plot shows that the training dataset has 9.5% missing observations. The missing observations are replaced with the median of the corresponding column, and fully missing columns are removed from the dataset. On running the plot again, it shows that the training dataset no longer has any missing observations or columns.
Output:
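The treatment itself, as implemented in the source code:

for (i in 1:ncol(train)) {
  train[, i] <- as.numeric(train[, i])                             # coerce character columns to numeric
  train[is.na(train[, i]), i] <- median(train[, i], na.rm = TRUE)  # median imputation per column
}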

Similarly, the testing dataset also has variables of type character.

In the following code, the variables of type character are converted to numeric, and the missing observations in each column are replaced with the median of that column.

Output:
The above plot shows that the testing dataset has 9.4% missing observations and 1.9% missing columns.

The fully missing columns are then removed from the dataset.

Output:

The above plot shows that the testing dataset no longer contains any missing observations or columns.

Outlier Treatment:
Outliers are treated by replacing observations below the 1st percentile with the value of the 1st percentile and observations above the 99th percentile with the value of the 99th percentile. This treatment is applied to every column in the dataset.

The quantile() function identifies the 1st- and 99th-percentile cut-offs, and the squish() function replaces the identified outliers with those cut-off values. Redundant variables are then removed from the training and testing datasets.
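In code, applied to every column of the training dataset:

for (i in 2:ncol(train)) {
  q <- quantile(train[, i], c(0.01, 0.99))  # 1st and 99th percentile cut-offs
  train[, i] <- squish(train[, i], q)       # cap values outside the cut-offs
}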

Output:

Univariate and Bivariate Analysis:

The variables can be explored further and analyzed using univariate and bivariate analysis.

Variable Creation:

New variables are created as per the requirement: the problem statement asks for one ratio each for profitability, liquidity and leverage.

• The profitability ratio is derived by dividing Profit after tax by Sales.
• The liquidity ratio is derived by dividing Net working capital by Total assets.
• The leverage ratio is derived by dividing Total assets by Total equity.

Other ratios are also created by dividing various variables by Total assets; the contribution of these ratios to the model is assessed later.
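From the source code (Total equity is itself derived from Total liabilities and the Debt-to-equity ratio):

train$Profitability <- train$`Profit after tax`/train$Sales       # profitability
train$NWC2TA <- train$`Net working capital`/train$`Total assets`  # liquidity
train$TotalEquity <- train$`Total liabilities`/train$`Debt to equity ratio (times)`
train$EquityMultiplier <- train$`Total assets`/train$TotalEquity  # leverage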
5.2 Modelling: Logistic Regression

Ans: A logistic regression model is fitted to this dataset. Initially, all the variables are used as predictors, with the Default variable as the response:

glm(formula = Default ~ ., family = binomial, data = train)

[Coefficient table from summary() of the full model, reproduced as a screenshot in the original report. Among the legible entries, the predictors significant at the 5% level include Total assets, PBDITA, Cash profit, PAT as % of net worth, Borrowings, Capital employed, Contingent liabilities and Current ratio (times).]
The result of the logistic regression shows that only a few variables are important and contribute meaningfully to the model. The most important variables identified from this first model are then used as predictors in a refitted model, again with the Default variable as the response.
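A shortened sketch of the refit (the full term list appears in the Source Code section):

trainLOGIT <- glm(Default ~ `Total assets` + `Cash profit` + `PAT as % of net worth` +
                    `Reserves and funds` + `Current liabilities & provisions` +
                    `Capital employed` + NWC2TA + Networth2Totalassets,
                  data = train, family = binomial)
summary(trainLOGIT)  # coefficients, significance and AIC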

Analysis:
The refitted model has an AIC value of 875.4 and predicts the training and testing datasets with almost 95% accuracy (shown later in this document).

Among the most important variables are Total assets, Cash profit, PAT as % of net worth, Reserves and funds, Current liabilities and provisions, Capital employed, Net working capital/Total assets and Net worth/Total assets; these have very small Pr(>|z|) values.

Of these, the variables with positive estimates are Total assets, Current ratio and Sales/Total assets, while the variables with negative estimates are Cash profit, PAT as % of net worth, Current liabilities and provisions, and Capital employed.

5.3 Model Performance and Measures

The fitted logistic regression model is first used to predict on the training dataset.
        obs
pred       0    1
   0    3227  107
   1      35  109
attr(,"class")
[1] "confusion.matrix"

The confusion matrix shows 35 Type 1 errors (false positives) and 107 Type 2 errors (false negatives).
[1] 0.9591719

The accuracy of the model on the training dataset is 95.92%.
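(Accuracy = correct predictions / total observations = (3227 + 109) / 3478 = 3336 / 3478 ≈ 0.9592.)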


Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = train$Default, predictor = PredLOGIT)
Data: PredLOGIT in 3262 controls (train$Default 0) < 216 cases (train$Default 1)
Area under the curve: 0.9423

The same logistic regression model is then used to predict on the testing dataset.
        obs
pred       0    1
   0     639   20
   1      22   34
attr(,"class")
[1] "confusion.matrix"
The confusion matrix shows 22 Type 1 errors and 20 Type 2 errors.
[1] 0.9412587
The accuracy of the model on the testing dataset is 94.13%.
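(Accuracy = (639 + 34) / 715 = 673 / 715 ≈ 0.9413.)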
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Call:
roc.default(response = test$`Default - 1`, predictor = PredLOGIT)
Data: PredLOGIT in 661 controls (test$`Default - 1` 0) < 54 cases (test$`Default - 1` 1).
Area under the curve: 0.941

Deciling:

The training dataset is divided into 10 deciles based on the predicted probability of default. After the deciles are created, they are then ranked, as sketched below.
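From the source code, each observation is assigned the decile of its predicted probability, and default counts are aggregated per decile with data.table:

train$deciles <- decile(train$pred)  # decile() as defined in the Source Code section
tmp_DT <- data.table(train)
rank <- tmp_DT[, list(cnt = length(Default),
                      cnt_resp = sum(Default == 1),       # defaults per decile
                      cnt_non_resp = sum(Default == 0)),  # non-defaults per decile
               by = deciles][order(-deciles)]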

The rank table (reproduced as a screenshot in the original report) shows the deciles sorted in descending order; the 10th decile has the maximum number of defaults (cnt_resp).

The testing dataset is likewise divided into 10 deciles based on the predicted probability of default, and the deciles are then ranked.

The rank table again shows the deciles sorted in descending order, with the 10th decile containing the maximum number of defaults (cnt_resp).

The mean of the observed and predicted values is then taken per decile for both the training and testing datasets so the two can be compared. The resulting plot shows that the model predicts both the training and testing datasets closely, with an accuracy of almost 95%.
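A sketch of this comparison for the training dataset (from the source code):

mean.obs.train <- aggregate(Default ~ deciles, data = train, mean)  # observed default rate per decile
mean.pred.train <- aggregate(pred ~ deciles, data = train, mean)    # mean predicted probability per decile
plot(mean.obs.train[, 2], type = "b", col = "black", xlab = "Decile", ylab = "Prob")
lines(mean.pred.train[, 2], type = "b", col = "red", lty = 2)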

6 Source Code
# Loading relevant libraries for the current session

library(caTools)
install.packages("car")
library(car)
install.packages("lattice")
library(caret)
library(ROCR)
library(corrplot)
install.packages("ipred")
library(ipred)
library(ggplot2)
install.packages("dplyr")
library(dplyr)
library(StatMeasures)
install.packages("scales")
library(scales)
install.packages("DataExplorer")
library(DataExplorer)
library(pROC)       # roc() used below
library(SDMTools)   # confusion.matrix() used below
library(data.table) # data.table() used in the deciling step

##Set working Directory##


setwd("D:/College Data/FRA")
getwd()
rawdata = read.csv("raw-data.csv", header = TRUE)
dim(rawdata)
summary(rawdata)
validationdata = read.csv("validation_data.csv", header = TRUE)
dim(validationdata)
names(rawdata)
names(validationdata)

##Working copies of the data are made for further analysis

train <- rawdata


test <- validationdata
# Create the Default flag before tabulating it: 1 if net worth next year is negative
train$Default <- ifelse(train$Networth.Next.Year < 0, 1, 0)
summary(as.factor(train$Default))
summary(train$Networth.Next.Year)
respons_rate <- round(prop.table(table(train$Default)), 2)
respons_rate

##Companies with Total assets of 3 or less are removed from further analysis.
train <- train[!train$`Total assets` <= 3, ]

##Missing Values
sum(is.na(train))

##Missing values Treatment


train<-as.data.frame(train)

# Print each column's name and class to find character variables
for (i in 1:length(train)) {
  print(paste(colnames(train[i]), class(train[, i])))
}
plot_intro(train)

# Coerce every column to numeric and impute missing values with the column median
for (i in 1:ncol(train)) {
  train[, i] <- as.numeric(train[, i])
  train[is.na(train[, i]), i] <- median(train[, i], na.rm = TRUE)
}
train <- train[, -22]  # drop the fully missing column
sum(is.na(train))
plot_intro(train)

##Missing values
sum(is.na(test))

## Testing dataset also has variables of the type character.

test<-as.data.frame(test)
for(i in 1:length(test)){
print(paste(colnames(test[i]),class(test[,i])))
}
plot_intro(test)

for(i in 1:ncol(test)){
test[,i] <- as.numeric(test[,i])
test[is.na(test[,i]), i] <- median(test[,i], na.rm = TRUE)
}

test <- test[, -22]  # drop the fully missing column


plot_intro(test)

##Outlier Treatment

boxplot(rawdata)

for (i in 2:ncol(train)) {
  q <- quantile(train[, i], c(0.01, 0.99))  # 1st and 99th percentile cut-offs
  train[, i] <- squish(train[, i], q)       # cap values outside the cut-offs
}

##Redundant variables are removed from the Training and Testing dataset

train <- train[,-c(1,2)]


test <- test[,-1]

##Univariate and Bivariate analysis

plot_str(train)
plot_intro(train)
plot_missing(train)
plot_histogram(train)
plot_qq(train)
plot_bar(train)
plot_correlation(train)

##Variable Creation

train$Profitability <- train$`Profit after tax`/train$Sales


train$PriceperShare <- train$EPS*train$`PE on BSE`

##The liquidity ratio is derived by dividing Net working capital by Total assets

train$NWC2TA <- train$`Net working capital`/train$`Total assets`

train$TotalEquity <- train$`Total liabilities`/train$`Debt to equity ratio (times)`

##The leverage ratio is derived by dividing Total Assets by TotalEquity

train$EquityMultiplier <- train$`Total assets`/train$TotalEquity

train$Networth2Totalassets <- train$`Net worth`/train$`Total assets`

train$Totalincome2Totalassets<- train$`Total income`/train$`Total assets`


train$Totalexpenses2Totalassets <- train$`Total expenses`/train$`Total assets`
train$Profitaftertax2Totalassets <- train$`Profit after tax`/train$`Total assets`
train$PBT2Totalassets <- train$PBT/train$`Total assets`
train$Sales2Totalassets <- train$Sales/train$`Total assets`
train$Currentliabilitiesprovisions2Totalassets <- train$`Current liabilities & provisions`/train$`Total assets`
train$Capitalemployed2Totalassets <- train$`Capital employed`/train$`Total assets`
train$Netfixedassets2Totalassets <- train$`Net fixed assets`/train$`Total assets`
train$Investments2Totalassets <- train$Investments/train$`Total assets`
train$Totalliabilities2Totalassets <- train$`Total liabilities`/train$`Total assets`

##Similar variables are created for the test dataset as well (dot-style column
##names assumed from read.csv defaults). The Profitability, PriceperShare and
##NWC2TA variables used in the final model formula are also created here.

test$Profitability <- test$Profit.after.tax/test$Sales
test$PriceperShare <- test$EPS*test$PE.on.BSE
test$NWC2TA <- test$Net.working.capital/test$Total.assets

test$TotalEquity <- test$Total.liabilities/test$Debt.to.equity.ratio..times.
test$EquityMultiplier <- test$Total.assets/test$TotalEquity
test$Networth2Totalassets <- test$Net.worth/test$Total.assets
test$Totalincome2Totalassets <- test$Total.income/test$Total.assets
test$Totalexpenses2Totalassets <- test$Total.expenses/test$Total.assets
test$Profitaftertax2Totalassets <- test$Profit.after.tax/test$Total.assets
test$PBT2Totalassets <- test$PBT/test$Total.assets
test$Sales2Totalassets <- test$Sales/test$Total.assets
test$Currentliabilitiesprovisions2Totalassets <- test$Current.liabilities...provisions/test$Total.assets
test$Capitalemployed2Totalassets <- test$Capital.employed/test$Total.assets
test$Netfixedassets2Totalassets <- test$Net.fixed.assets/test$Total.assets
test$Investments2Totalassets <- test$Investments/test$Total.assets
test$Totalliabilities2Totalassets <- test$Total.liabilities/test$Total.assets

##Logistic Regression

# Full model: all variables as predictors
trainLOGIT <- glm(Default ~ ., data = train, family = binomial)
summary(trainLOGIT)

# Refit on the most important variables
trainLOGIT <- glm(Default ~ `Total assets` + `Total income` + `Change in stock` +
                    `Total expenses` + `Profit after tax` + PBDITA + `Cash profit` +
                    `PBDITA as % of total income` + `PBT as % of total income` +
                    `PAT as % of total income` + `Cash profit as % of total income` +
                    `PAT as % of net worth` + `Total capital` + `Reserves and funds` +
                    Borrowings + `Current liabilities & provisions` + `Capital employed` +
                    `Total term liabilities / tangible net worth` + `Contingent liabilities` +
                    `Current ratio (times)` + Investments + `Finished goods turnover` +
                    `TOL/TNW` + `PE on BSE` + `Net fixed assets` +
                    `Debt to equity ratio (times)` + `Cash to average cost of sales per day` +
                    PriceperShare + NWC2TA + Networth2Totalassets + Sales2Totalassets +
                    Capitalemployed2Totalassets + Investments2Totalassets,
                  data = train, family = binomial)
summary(trainLOGIT)

PredLOGIT <- predict.glm(trainLOGIT, newdata = train, type = "response")

tab.logit <- confusion.matrix(train$Default, PredLOGIT, threshold = 0.5)  # from SDMTools
tab.logit
accuracy.logit <- sum(diag(tab.logit))/sum(tab.logit)  # overall accuracy
accuracy.logit
roc.logit <- roc(train$Default, PredLOGIT)  # from pROC
roc.logit
plot(roc.logit)
# Note: predict() requires the test columns to match the model term names
PredLOGIT <- predict.glm(trainLOGIT, newdata = test, type = "response")
tab.logit <- confusion.matrix(test$`Default - 1`, PredLOGIT, threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
roc.logit<-roc(test$`Default - 1`,PredLOGIT )
roc.logit
plot(roc.logit)
train$pred = predict(trainLOGIT, train, type="response")

# Assign each value to a decile (1 = lowest tenth, 10 = highest)
decile <- function(x) {
  deciles <- vector(length = 10)
  for (i in seq(0.1, 1, 0.1)) {
    deciles[i * 10] <- quantile(x, i, na.rm = TRUE)
  }
  return(
    ifelse(x < deciles[1], 1,
    ifelse(x < deciles[2], 2,
    ifelse(x < deciles[3], 3,
    ifelse(x < deciles[4], 4,
    ifelse(x < deciles[5], 5,
    ifelse(x < deciles[6], 6,
    ifelse(x < deciles[7], 7,
    ifelse(x < deciles[8], 8,
    ifelse(x < deciles[9], 9, 10))))))))))
}
train$deciles <- decile(train$pred)

tmp_DT = data.table(train)

rank <- tmp_DT[, list(cnt = length(Default),
                      cnt_resp = sum(Default == 1),
                      cnt_non_resp = sum(Default == 0)
                      ), by = deciles][order(-deciles)]

rank$rrate <- round(rank$cnt_resp / rank$cnt,4);


rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
trainRank <- rank
View(rank)
test$pred = predict(trainLOGIT, test, type="response")

# Reuse the decile() function defined above
test$deciles <- decile(test$pred)

tmp_DT = data.table(test)

rank <- tmp_DT[, list(cnt = length(`Default - 1`),
                      cnt_resp = sum(`Default - 1` == 1),
                      cnt_non_resp = sum(`Default - 1` == 0)
                      ), by = deciles][order(-deciles)]

rank$rrate <- round(rank$cnt_resp / rank$cnt,4);


rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp /
sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
testRank<-rank

View(rank)

# Mean observed and predicted default rate per decile (grouping by the 'deciles' column)
mean.obs.train = aggregate(Default ~ deciles, data = train, mean)
mean.pred.train = aggregate(pred ~ deciles, data = train, mean)

mean.obs.val = aggregate(`Default - 1` ~ deciles, data = test, mean)
mean.pred.val = aggregate(pred ~ deciles, data = test, mean)

# plot the mean vs deciles


par(mfrow=c(1,2))
plot(mean.obs.train[,2], type="b", col="black", ylim=c(0,0.8),
xlab="Decile", ylab="Prob")
lines(mean.pred.train[,2], type="b", col="red", lty=2)
title(main="Training Sample")

plot(mean.obs.val[,2], type="b", col="black", ylim=c(0,0.8), xlab="Decile",
     ylab="Prob")
lines(mean.pred.val[,2], type="b", col="red", lty=2)
title(main="Validation Sample")
