
import pandas as pd

df = pd.read_csv('work.txt')

# Access and manipulate the data in the DataFrame


# For example, you can print the first few rows
df.head()

   Unnamed: 0       id  loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment  grade  sub_grade  ...  hardship_start_date  hardship
0           0  1077501       5000         5000           4975.0  36 months    10.65%       162.87      B         B2  ...                  NaN
1           1  1077430       2500         2500           2500.0  60 months    15.27%        59.83      C         C4  ...                  NaN
2           2  1077175       2400         2400           2400.0  36 months    15.96%        84.33      C         C5  ...                  NaN
3           3  1076863      10000        10000          10000.0  36 months    13.49%       339.31      C         C1  ...                  NaN
4           4  1075358       3000         3000           3000.0  60 months    12.69%        67.79      B         B5  ...                  NaN

5 rows × 142 columns

df.drop('Unnamed: 0', axis=1, inplace=True)


df.head()

        id  loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment  grade  sub_grade                 emp_title  ...  hardship_start_date  hards
0  1077501       5000         5000           4975.0  36 months    10.65%       162.87      B         B2                       NaN  ...                  NaN
1  1077430       2500         2500           2500.0  60 months    15.27%        59.83      C         C4                     Ryder  ...                  NaN
2  1077175       2400         2400           2400.0  36 months    15.96%        84.33      C         C5                       NaN  ...                  NaN
3  1076863      10000        10000          10000.0  36 months    13.49%       339.31      C         C1       AIR RESOURCES BOARD  ...                  NaN
4  1075358       3000         3000           3000.0  60 months    12.69%        67.79      B         B5  University Medical Group  ...                  NaN

5 rows × 141 columns

df_filtered = df.dropna(axis=1, how='all')

df_filtered
          id  loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment  grade  sub_grade                 emp_title  ...  collections_12_mths_ex_
  0  1077501       5000         5000           4975.0  36 months    10.65%       162.87      B         B2                       NaN  ...
  1  1077430       2500         2500           2500.0  60 months    15.27%        59.83      C         C4                     Ryder  ...
  2  1077175       2400         2400           2400.0  36 months    15.96%        84.33      C         C5                       NaN  ...
  3  1076863      10000        10000          10000.0  36 months    13.49%       339.31      C         C1       AIR RESOURCES BOARD  ...
  4  1075358       3000         3000           3000.0  60 months    12.69%        67.79      B         B5  University Medical Group  ...
...      ...        ...          ...              ...        ...       ...          ...    ...        ...                       ...  ...
197  1065350       9000         9000           9000.0  36 months    12.69%       301.91      B         B5           Gilbert Express  ...
198  1067028      13250        13250          13250.0  36 months    10.65%       431.60      B         B2             Talbert House  ...
199  1061877      20000        20000          20000.0  36 months    13.49%       678.61      C         C1                       NaN  ...
200  1067018       3000         3000           3000.0  36 months    14.65%       103.49      C         C3         Kohls Corporation  ...
201  1067223       7350         7350           7350.0  36 months    10.65%       239.42      B         B2    Hospice Peachtree, LLC  ...

202 rows × 60 columns

df_filtered.isnull().sum()
id 0
loan_amnt 0
funded_amnt 0
funded_amnt_inv 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_title 10
emp_length 1
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
pymnt_plan 0
url 0
purpose 1
title 1
zip_code 1
addr_state 1
dti 1
delinq_2yrs 1
earliest_cr_line 1
fico_range_low 1
fico_range_high 1
inq_last_6mths 1
mths_since_last_delinq 155
mths_since_last_record 197
open_acc 1
pub_rec 1
revol_bal 1
revol_util 1
total_acc 1
initial_list_status 1
out_prncp 1
out_prncp_inv 1
total_pymnt 1
total_pymnt_inv 1
total_rec_prncp 1
total_rec_int 1
total_rec_late_fee 1
recoveries 1
collection_recovery_fee 1
last_pymnt_d 2
last_pymnt_amnt 1
last_credit_pull_d 1
last_fico_range_high 1
last_fico_range_low 1
collections_12_mths_ex_med 1
policy_code 1
application_type 1
acc_now_delinq 1
chargeoff_within_12_mths 1
delinq_amnt 1
pub_rec_bankruptcies 1
tax_liens 1
hardship_flag 1
debt_settlement_flag 1
dtype: int64

df_filtered.drop(['mths_since_last_delinq', 'mths_since_last_record'],
                 axis=1, inplace=True)

C:\Users\Simeon\AppData\Local\Temp\ipykernel_9788\2937646126.py:1: SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

df_filtered.shape

(202, 58)
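
The SettingWithCopyWarning above appears when an in-place operation is applied to a DataFrame that pandas regards as possibly derived from another one. A minimal sketch of one common way to avoid it, assuming the intent is to work on an independent DataFrame: take an explicit copy and reassign the result instead of passing inplace=True. The frame `raw` and its `all_missing` column are made-up stand-ins for illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the loan data; 'all_missing' is an illustrative column.
raw = pd.DataFrame({
    'id': [1, 2, 3],
    'loan_amnt': [5000, 2500, 2400],
    'mths_since_last_record': [None, None, 12.0],
    'all_missing': [None, None, None],
})

# Drop columns that are entirely missing, then take an explicit copy
# so that later edits clearly target a separate object.
filtered = raw.dropna(axis=1, how='all').copy()

# Reassign the result instead of using inplace=True.
filtered = filtered.drop(columns=['mths_since_last_record'])

print(filtered.shape)
print(list(filtered.columns))
```

The original `raw` frame is left unchanged, which also makes it easier to re-run cells in any order.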

df_filtered.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 58 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 202 non-null int64
1 loan_amnt 202 non-null int64
2 funded_amnt 202 non-null int64
3 funded_amnt_inv 202 non-null float64
4 term 202 non-null object
5 int_rate 202 non-null object
6 installment 202 non-null float64
7 grade 202 non-null object
8 sub_grade 202 non-null object
9 emp_title 192 non-null object
10 emp_length 201 non-null object
11 home_ownership 202 non-null object
12 annual_inc 202 non-null float64
13 verification_status 202 non-null object
14 issue_d 202 non-null object
15 loan_status 202 non-null object
16 pymnt_plan 202 non-null object
17 url 202 non-null object
18 purpose 201 non-null object
19 title 201 non-null object
20 zip_code 201 non-null object
21 addr_state 201 non-null object
22 dti 201 non-null float64
23 delinq_2yrs 201 non-null float64
24 earliest_cr_line 201 non-null object
25 fico_range_low 201 non-null float64
26 fico_range_high 201 non-null float64
27 inq_last_6mths 201 non-null float64
28 open_acc 201 non-null float64
29 pub_rec 201 non-null float64
30 revol_bal 201 non-null float64
31 revol_util 201 non-null object
32 total_acc 201 non-null float64
33 initial_list_status 201 non-null object
34 out_prncp 201 non-null float64
35 out_prncp_inv 201 non-null float64
36 total_pymnt 201 non-null float64
37 total_pymnt_inv 201 non-null float64
38 total_rec_prncp 201 non-null float64
39 total_rec_int 201 non-null float64
40 total_rec_late_fee 201 non-null float64
41 recoveries 201 non-null float64
42 collection_recovery_fee 201 non-null float64
43 last_pymnt_d 200 non-null object
44 last_pymnt_amnt 201 non-null float64
45 last_credit_pull_d 201 non-null object
46 last_fico_range_high 201 non-null float64
47 last_fico_range_low 201 non-null float64
48 collections_12_mths_ex_med 201 non-null float64
49 policy_code 201 non-null float64
50 application_type 201 non-null object
51 acc_now_delinq 201 non-null float64
52 chargeoff_within_12_mths 201 non-null float64
53 delinq_amnt 201 non-null float64
54 pub_rec_bankruptcies 201 non-null float64
55 tax_liens 201 non-null float64
56 hardship_flag 201 non-null object
57 debt_settlement_flag 201 non-null object
dtypes: float64(31), int64(3), object(24)
memory usage: 91.7+ KB
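
The info() output lists int_rate as object even though its values (for example 10.65%) are numeric with a percent suffix. A possible sketch for converting such a column to floats before modelling is shown here; `percent_to_float` is a hypothetical helper, and applying it to revol_util as well rests on the assumption that the column uses the same percent format.

```python
import pandas as pd


def percent_to_float(series: pd.Series) -> pd.Series:
    """Convert strings such as '10.65%' to floats such as 10.65."""
    cleaned = series.astype(str).str.strip().str.rstrip('%')
    # Values that cannot be parsed become NaN instead of raising an error.
    return pd.to_numeric(cleaned, errors='coerce')


# Small illustrative frame; the values mimic the int_rate column shown above.
demo = pd.DataFrame({'int_rate': ['10.65%', '15.27%', ' 7.90%']})
demo['int_rate'] = percent_to_float(demo['int_rate'])

print(demo['int_rate'].tolist())
print(demo['int_rate'].dtype)
```

Keeping the rate as a number preserves its ordering and spacing, which an arbitrary integer code per distinct string would not.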

df_filtered.dropna(inplace =True)

C:\Users\Simeon\AppData\Local\Temp\ipykernel_9788\3027548963.py:1: SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

#df_filtered.dropna(inplace =True)
df_filtered.shape

(190, 58)

df3 = df_filtered.describe().transpose()
df3
count mean std min 25% 50% 75% max

id 190.0 1.064882e+06 18785.828330 822464.00 1.067020e+06 1068100.500 1.069054e+06 1.077430e+06

loan_amnt 190.0 1.176421e+04 6624.218685 1000.00 7.000000e+03 10000.000 1.500000e+04 3.500000e+04

funded_amnt 190.0 1.151500e+04 6275.863313 1000.00 7.000000e+03 10000.000 1.400000e+04 3.500000e+04

funded_amnt_inv 190.0 1.143271e+04 6214.836906 1000.00 7.000000e+03 10000.000 1.400000e+04 3.500000e+04

installment 190.0 3.509846e+02 189.330581 35.31 2.231175e+02 325.740 4.341800e+02 1.140070e+03

annual_inc 190.0 5.968847e+04 30780.693678 15000.00 4.000000e+04 50702.000 7.500000e+04 2.250000e+05

dti 190.0 1.465958e+01 6.293200 1.00 1.005750e+01 14.830 1.999000e+01 2.985000e+01

delinq_2yrs 190.0 8.947368e-02 0.367164 0.00 0.000000e+00 0.000 0.000000e+00 3.000000e+00

fico_range_low 190.0 7.047895e+02 28.875905 660.00 6.812500e+02 700.000 7.250000e+02 7.900000e+02

fico_range_high 190.0 7.087895e+02 28.875905 664.00 6.852500e+02 704.000 7.290000e+02 7.940000e+02

inq_last_6mths 190.0 8.368421e-01 1.002489 0.00 0.000000e+00 1.000 1.000000e+00 5.000000e+00

open_acc 190.0 8.926316e+00 3.290976 2.00 7.000000e+00 8.000 1.100000e+01 2.000000e+01

pub_rec 190.0 2.631579e-02 0.160496 0.00 0.000000e+00 0.000 0.000000e+00 1.000000e+00

revol_bal 190.0 1.324842e+04 10248.618405 0.00 7.018000e+03 11132.500 1.661425e+04 7.435100e+04

total_acc 190.0 1.978421e+01 9.229053 3.00 1.200000e+01 18.000 2.600000e+01 5.100000e+01

out_prncp 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

out_prncp_inv 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

total_pymnt 190.0 1.251527e+04 7674.778085 1014.53 7.179692e+03 11253.485 1.555261e+04 4.000901e+04

total_pymnt_inv 190.0 1.237771e+04 7516.898974 1014.53 7.179690e+03 11253.485 1.551584e+04 4.000901e+04

total_rec_prncp 190.0 1.011161e+04 6418.660523 456.46 6.000000e+03 9489.735 1.318750e+04 3.500000e+04

total_rec_int 190.0 2.284095e+03 1990.225162 29.90 9.116775e+02 1644.715 2.983508e+03 1.008508e+04

total_rec_late_fee 190.0 1.225171e+00 5.438710 0.00 0.000000e+00 0.000 0.000000e+00 3.624700e+01

recoveries 190.0 1.183412e+02 485.821223 0.00 0.000000e+00 0.000 0.000000e+00 3.874790e+03

collection_recovery_fee 190.0 1.362000e+01 82.454367 0.00 0.000000e+00 0.000 0.000000e+00 6.708193e+02

last_pymnt_amnt 190.0 2.893200e+03 4463.872216 1.00 2.624275e+02 574.440 4.180885e+03 2.841243e+04

last_fico_range_high 190.0 6.694211e+02 75.202243 499.00 6.152500e+02 679.000 7.227500e+02 8.290000e+02

last_fico_range_low 190.0 6.523947e+02 128.180911 0.00 6.112500e+02 675.000 7.187500e+02 8.250000e+02

collections_12_mths_ex_med 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

policy_code 190.0 1.000000e+00 0.000000 1.00 1.000000e+00 1.000 1.000000e+00 1.000000e+00

acc_now_delinq 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

chargeoff_within_12_mths 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

delinq_amnt 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

pub_rec_bankruptcies 190.0 2.105263e-02 0.143939 0.00 0.000000e+00 0.000 0.000000e+00 1.000000e+00

tax_liens 190.0 0.000000e+00 0.000000 0.00 0.000000e+00 0.000 0.000000e+00 0.000000e+00

categorical_columns = df_filtered.select_dtypes(include='object')
categorical_columns
          term  int_rate  grade  sub_grade                  emp_title  emp_length  home_ownership  verification_status   issue_d  loan_status  ...  zip_code  addr_s
  1  60 months    15.27%      C         C4                      Ryder    < 1 year            RENT      Source Verified  Dec-2011  Charged Off  ...     309xx
  3  36 months    13.49%      C         C1        AIR RESOURCES BOARD   10+ years            RENT      Source Verified  Dec-2011   Fully Paid  ...     917xx
  4  60 months    12.69%      B         B5   University Medical Group      1 year            RENT      Source Verified  Dec-2011   Fully Paid  ...     972xx
  5  36 months     7.90%      A         A4       Veolia Transportaton     3 years            RENT      Source Verified  Dec-2011   Fully Paid  ...     852xx
  6  60 months    15.96%      C         C5  Southern Star Photography     8 years            RENT         Not Verified  Dec-2011   Fully Paid  ...     280xx
...        ...       ...    ...        ...                        ...         ...             ...                  ...       ...          ...  ...       ...
195  36 months    14.27%      C         C2               Corning Inc.     8 years        MORTGAGE         Not Verified  Dec-2011  Charged Off  ...     148xx
196  36 months    11.71%      B         B3                        UPS   10+ years        MORTGAGE             Verified  Dec-2011   Fully Paid  ...     028xx
197  36 months    12.69%      B         B5            Gilbert Express     5 years            RENT      Source Verified  Dec-2011   Fully Paid  ...     115xx
198  36 months    10.65%      B         B2              Talbert House     4 years            RENT         Not Verified  Dec-2011   Fully Paid  ...     450xx
200  36 months    14.65%      C         C3          Kohls Corporation     5 years            RENT             Verified  Dec-2011   Fully Paid  ...     532xx

190 rows × 24 columns

df5 = categorical_columns.describe().transpose()

import plotly.express as px

# Plot a histogram for every column of the filtered DataFrame
for i in df_filtered:
    fig = px.histogram(df_filtered, x=i)
    fig.show()

categorical_columns.columns

Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'url', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
       'last_credit_pull_d', 'application_type', 'hardship_flag',
       'debt_settlement_flag'],
      dtype='object')

# Categorical feature columns to encode; 'url' and the target 'loan_status' are left out
categorical = ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
               'home_ownership', 'verification_status', 'issue_d',
               'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
               'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
               'last_credit_pull_d', 'application_type', 'hardship_flag',
               'debt_settlement_flag']

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Replace each categorical column with integer codes (the encoder is refit per column)
for i in categorical:
    df_filtered[i] = enc.fit_transform(df_filtered[i])
df_filtered

C:\Users\Simeon\AppData\Local\Temp\ipykernel_9788\3546400269.py:10: SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

          id  loan_amnt  funded_amnt  funded_amnt_inv  term  int_rate  installment  grade  sub_grade  emp_title  ...  collections_12_mths_ex_med
  1  1077430       2500         2500           2500.0     1        13        59.83      2         13        119  ...                         0.0
  3  1076863      10000        10000          10000.0     0        10       339.31      2         10          2  ...                         0.0
  4  1075358       3000         3000           3000.0     1         9        67.79      1          9        159  ...                         0.0
  5  1075269       5000         5000           5000.0     0         3       156.46      0          3        163  ...                         0.0
  6  1069639       7000         7000           7000.0     1        14       170.08      2         14        133  ...                         0.0
...      ...        ...          ...              ...   ...       ...          ...    ...        ...        ...  ...                         ...
195  1067038      12000        12000          12000.0     0        11       411.71      2         11         39  ...                         0.0
196  1067030      25000        25000          25000.0     0         7       826.90      1          7        150  ...                         0.0
197  1065350       9000         9000           9000.0     0         9       301.91      1          9         63  ...                         0.0
198  1067028      13250        13250          13250.0     0         6       431.60      1          6        143  ...                         0.0
200  1067018       3000         3000           3000.0     0        12       103.49      2         12         79  ...                         0.0

190 rows × 58 columns
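
Because the loop refits a single LabelEncoder on every column, only the mapping for the last column remains available afterwards. As an alternative sketch (not the notebook's method), scikit-learn's OrdinalEncoder encodes several feature columns at once and keeps one category list per column. The small `demo` frame is a made-up example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data only; column names follow the notebook, values are made up.
demo = pd.DataFrame({
    'grade': ['B', 'C', 'C', 'A'],
    'home_ownership': ['RENT', 'RENT', 'MORTGAGE', 'OWN'],
})
cols = ['grade', 'home_ownership']

encoder = OrdinalEncoder()
# Build a new frame of codes instead of overwriting the original columns.
encoded = pd.DataFrame(encoder.fit_transform(demo[cols]),
                       columns=cols, index=demo.index)

print(encoded)
# One sorted category array per column, in the same order as cols.
print(encoder.categories_)

# The stored mappings allow the codes to be turned back into labels.
restored = encoder.inverse_transform(encoded[cols])
print(restored.tolist())
```

Either encoder assigns codes in sorted order of the labels, so the integers carry no meaning beyond identity unless the categories are naturally ordered, as with grade.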


x = df_filtered.drop(['id','url','loan_status'],axis = 1)
y = df_filtered['loan_status']

from sklearn.model_selection import train_test_split


xtrain,xtest,ytrain,ytest = train_test_split(x,y, test_size = 0.2)
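
The split above is random on every run and does not control the class balance, which matters with only 190 rows and far fewer Charged Off than Fully Paid loans. A hedged sketch of a reproducible, stratified split follows; the seed value 42 and the toy `demo_x`/`demo_y` data are arbitrary assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 5 'Charged Off' and 15 'Fully Paid' rows (made-up values).
demo_x = pd.DataFrame({'loan_amnt': range(1000, 1020)})
demo_y = pd.Series(['Charged Off'] * 5 + ['Fully Paid'] * 15, name='loan_status')

# stratify keeps the class proportions the same in both parts;
# random_state makes the split repeatable between runs.
xtrain, xtest, ytrain, ytest = train_test_split(
    demo_x, demo_y, test_size=0.2, stratify=demo_y, random_state=42)

print(ytrain.value_counts().to_dict())
print(ytest.value_counts().to_dict())
```

With a fixed seed, later changes in the reported metrics can be attributed to the model or the features rather than to a different split.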

from sklearn.tree import DecisionTreeClassifier


model = DecisionTreeClassifier()

model.fit(xtrain,ytrain)

DecisionTreeClassifier()

pred = model.predict(xtest)

from sklearn.metrics import classification_report, confusion_matrix


print(classification_report(ytest,pred))

              precision    recall  f1-score   support

 Charged Off       1.00      0.71      0.83         7
  Fully Paid       0.94      1.00      0.97        31

    accuracy                           0.95        38
   macro avg       0.97      0.86      0.90        38
weighted avg       0.95      0.95      0.94        38

print(confusion_matrix(ytest,pred))

[[ 5  2]
 [ 0 31]]
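
The figures in the classification report can be recomputed from this confusion matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with labels in sorted order (Charged Off first, then Fully Paid).

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class.
cm = [[5, 2],
      [0, 31]]
labels = ['Charged Off', 'Fully Paid']

total = sum(sum(row) for row in cm)
# Correct predictions lie on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, label in enumerate(labels):
    true_positive = cm[i][i]
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum
    precision = true_positive / predicted_as_label
    recall = true_positive / actually_label
    metrics[label] = (precision, recall)
    print(f'{label}: precision={precision:.2f} recall={recall:.2f}')

print(f'accuracy={accuracy:.2f}')
```

Two of the seven charged-off loans in the test set were predicted as fully paid, which is where the 0.71 recall comes from; with a test set this small, each misclassified loan shifts that figure by about 0.14.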

