BUSINESS REPORT

Financial Risk Analysis Project


By Surabhi Chaughule
Milestone 1

FRA - Part A: Credit Risk


Problem Statement

Businesses or companies can fall prey to default if they are unable to keep up with their debt
obligations. A default lowers the company's credit rating, which in turn reduces its chances of
obtaining credit in the future, and the company may have to pay higher interest on existing
debts as well as on any new obligations. From an investor's point of view, a company is worth
investing in if it is capable of handling its financial obligations, can grow quickly, and can
manage that growth.

A balance sheet is a financial statement that provides a snapshot of what a company owns,
what it owes, and the amount invested by its shareholders. It is therefore an important tool
for evaluating the performance of a business.

Data that is available includes information from the financial statement of the companies for the
previous year.

Dependent variable - No new variable needs to be created; the 'Default' variable is already
provided in the dataset and can be taken as the dependent variable.

Test Train Split - Split the data into train and test datasets in the ratio of 67:33 and use a
random state of 42 (random_state=42). Model building is to be done on the train dataset and
model validation is to be done on the test dataset.

Dataset: Credit Risk Dataset

Data Dictionary: Data Dictionary

Preliminary exploration of the Dataset:


All messy column names are cleaned: spaces, commas, and brackets are replaced with '_'.
The first 5 rows of the dataset are as follows:

This dataset contains 2058 rows and 58 columns
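The column-name cleanup described above can be sketched as follows. The sample column names below are hypothetical stand-ins (the actual dataset has 58 columns); the cleaning logic is the point:

```python
import pandas as pd

# Hypothetical sample with messy column names (the real dataset's columns differ)
df = pd.DataFrame({
    "Net Worth": [1, 2],
    "PBDITA (%)": [3, 4],
    "Total income, net": [5, 6],
})

# Replace runs of spaces, commas, brackets and similar punctuation with a single '_'
df.columns = (
    df.columns.str.strip()
    .str.replace(r"[ ,()%/\-]+", "_", regex=True)
    .str.strip("_")
)
print(list(df.columns))  # ['Net_Worth', 'PBDITA', 'Total_income_net']
```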

Check for Duplicate Records


No duplicate records exist
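The duplicate check can be done with pandas as below. The toy frame is a hypothetical stand-in for the 2058-row dataset:

```python
import pandas as pd

# Hypothetical toy frame; the actual dataset has 2058 rows and 58 columns
df = pd.DataFrame({"a": [1, 2, 2], "b": [3, 4, 4]})

# duplicated() marks rows identical to an earlier row
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates()
print(n_dupes, len(df))  # 1 2
```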

Number of variables per data type:


A description of the first few columns of the dataset is given below:
From the descriptive stats, we can see that most of the variables have outliers.
1.1 Outlier Treatment
Detecting Outliers
Conventionally, outliers are identified based on the inter-quartile range (IQR) as follows:
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1
Lower outlier: value < Q1 - 1.5 * IQR
Upper outlier: value > Q3 + 1.5 * IQR
Based on this definition, the number of outliers in the dataset is as follows:
A view for each of the individual variables is as follows:
Observations
Many variables have outliers as defined by the above criteria.
- If all the records with outliers are dropped, only 112 records remain. Modeling
cannot be done with so few records, so dropping records is not an option.
- The next option is to cap the outliers at both ends.
- After capping the values at the 10th and 90th percentiles, the number of remaining
outliers is as follows:
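The IQR-based detection and percentile capping can be sketched as below. The sample series is a hypothetical single column with one extreme value; the real analysis loops over all numeric columns:

```python
import pandas as pd

def count_outliers(s: pd.Series) -> int:
    # Count values outside the 1.5*IQR fences
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

def cap_outliers(s: pd.Series, lo: float = 0.10, hi: float = 0.90) -> pd.Series:
    # Cap values at the 10th and 90th percentiles
    return s.clip(lower=s.quantile(lo), upper=s.quantile(hi))

s = pd.Series([1, 2, 3, 4, 5, 100])  # hypothetical column with one extreme value
print(count_outliers(s))  # 1
print(cap_outliers(s).max())
```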

1.2 Missing Value Treatment


The dataset with the treated outliers is used for further analysis. There is a
total of 220 missing values, as seen below:
Accordingly, the missing values have been treated using the KNNImputer.
After imputation, it is confirmed that there are no missing values remaining.
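The KNNImputer step can be sketched as follows. The two-column frame is a hypothetical stand-in for the 58-column dataset; `n_neighbors=2` is an illustrative choice, not necessarily the value used in the report:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})

# Each NaN is filled with the mean of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(int(df_imputed.isna().sum().sum()))  # 0 missing values remain
```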

1.4 Univariate & Bivariate analysis with proper interpretation

Univariate Analysis:
Univariate analysis has been carried out for the 'Default' variable.
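A univariate view of the target can be produced with value counts, as sketched below. The sample values are hypothetical; the real column comes from the dataset:

```python
import pandas as pd

# Hypothetical 'Default' values standing in for the target column
df = pd.DataFrame({"Default": [0, 0, 0, 1, 0, 1, 0, 0]})

counts = df["Default"].value_counts()                # class counts
props = df["Default"].value_counts(normalize=True)  # class proportions
print(counts.to_dict())  # {0: 6, 1: 2}
print(props.to_dict())
```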

Bivariate Analysis:
Bivariate analysis of the top 6 variables against the 'Default' variable is shown.
Multivariate Analysis:

1.5 Train-Test Split


As stated in the problem, the dataset has to be split into train and test sets in the ratio of
67:33 using random_state=42.
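The split can be sketched as below. The synthetic frame is a hypothetical stand-in for the dataset; the `test_size=0.33` and `random_state=42` settings come from the problem statement:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 100 rows with a binary 'Default' target
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
df["Default"] = rng.integers(0, 2, size=100)

X = df.drop(columns="Default")
y = df["Default"]

# 67:33 split with the required random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(len(X_train), len(X_test))  # 67 33
```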

1.6 Build Logistic Regression Model (using the statsmodels library) on the most important
variables on the Train Dataset and choose the optimum cutoff

Selecting top 14 features for model building


Training result:

Test result:
Since precision is low for this model, we will apply SMOTE. Using random_state=42 again,
the model is re-run.
Training Set result:

Test set result:
