
dpdzero-assessment

March 31, 2024

1 DPDZero Assessment
The dataset contains loan data for various borrowers in a loan portfolio.
The columns are as follows:
Amount Pending - This is the EMI amount
State - The borrower’s state
Tenure - Total tenure of the borrower
Interest Rate - Interest rate
City - The city of the borrower
Bounce String
• A string that describes the customer's bounce behaviour since the disbursal of the loan.
A bounce means that the customer did not make the payment.
• S or H - No bounce in that month
• B or L - Bounce in that month
• FEMI - First EMI - no known behaviour
• The first character denotes the first month on book and the last character denotes the
most recent month. For example, SSB means that the customer has been on book for 3 months
and bounced in the last month.
Disbursed Amount - the total disbursed amount
Loan Number - The loan number
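
The bounce string encoding described above can be sketched in plain Python. This is a minimal illustration, not part of the original notebook: the helper name `summarize_bounce_string` is hypothetical, and reporting zero observed months for FEMI is an assumption based on "no known behaviour".

```python
def summarize_bounce_string(bounce_string):
    """Summarize one bounce string (expects S, H, B, L characters or the literal 'FEMI')."""
    if bounce_string == "FEMI":
        # Assumption: a first-EMI loan has no observed repayment months yet.
        return {"months_observed": 0, "bounces": 0, "bounced_last_month": None}
    # H is treated like S (no bounce) and L like B (bounce), as in the column description.
    normalized = bounce_string.replace("H", "S").replace("L", "B")
    return {
        "months_observed": len(normalized),
        "bounces": normalized.count("B"),
        "bounced_last_month": normalized.endswith("B"),
    }


print(summarize_bounce_string("SSB"))
# {'months_observed': 3, 'bounces': 1, 'bounced_last_month': True}
```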
[166]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

[167]: df = pd.read_csv(r"C:\Users\Mohit\Downloads\Data_Analyst_Assignment_Dataset.csv")

df.head()

[167]:    Amount Pending      State  Tenure  Interest Rate       City Bounce String  \
       0             963  Karnataka      11           7.69  Bangalore           SSS
       1            1194  Karnataka      11           6.16  Bangalore           SSB
       2            1807  Karnataka      14           4.24     Hassan           BBS
       3            2451  Karnataka      10           4.70  Bangalore           SSS
       4            2611  Karnataka      10           4.41     Mysore           SSB

          Disbursed Amount Loan Number
       0             10197       JZ6FS
       1             12738       RDIOY
       2             24640       WNW4L
       3             23990       6LBJS
       4             25590       ZFZUA

[168]: df.shape

[168]: (24582, 8)

[169]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24582 entries, 0 to 24581
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Amount Pending 24582 non-null int64
1 State 24582 non-null object
2 Tenure 24582 non-null int64
3 Interest Rate 24582 non-null float64
4 City 24582 non-null object
5 Bounce String 24582 non-null object
6 Disbursed Amount 24582 non-null int64
7 Loan Number 24582 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 1.5+ MB

[170]: df.columns

[170]: Index(['Amount Pending', 'State', 'Tenure', 'Interest Rate', 'City',
              'Bounce String', 'Disbursed Amount', 'Loan Number'],
             dtype='object')

[171]: df.describe(include='all')

[171]:         Amount Pending        State        Tenure  Interest Rate   City  \
       count     24582.000000        24582  24582.000000   24582.000000  24582
       unique             NaN            7           NaN            NaN    186
       top                NaN  Maharashtra           NaN            NaN   Pune
       freq               NaN         6793           NaN            NaN   1780
       mean       1791.172687          NaN      9.415263       0.934960    NaN
       std         937.565507          NaN      3.238904       3.114732    NaN
       min         423.000000          NaN      7.000000       0.000000    NaN
       25%        1199.000000          NaN      8.000000       0.000000    NaN
       50%        1593.000000          NaN      8.000000       0.000000    NaN
       75%        2083.000000          NaN     11.000000       0.000000    NaN
       max       13349.000000          NaN     24.000000      37.920000    NaN

              Bounce String  Disbursed Amount Loan Number
       count          24582      24582.000000       24582
       unique           413               NaN       24579
       top                S               NaN       T7WLO
       freq            3615               NaN           2
       mean             NaN      17705.195468         NaN
       std              NaN      14192.671509         NaN
       min              NaN       2793.000000         NaN
       25%              NaN       9857.750000         NaN
       50%              NaN      13592.000000         NaN
       75%              NaN      19968.000000         NaN
       max              NaN     141072.000000         NaN

[172]: df.isnull().sum()

[172]: Amount Pending      0
       State               0
       Tenure              0
       Interest Rate       0
       City                0
       Bounce String       0
       Disbursed Amount    0
       Loan Number         0
       dtype: int64

[173]: df.skew(numeric_only=True)

[173]: Amount Pending      3.077099
       Tenure              3.199011
       Interest Rate       4.910253
       Disbursed Amount    3.282953
       dtype: float64

[174]: df.duplicated().sum()

[174]: 0
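
`df.duplicated()` only flags rows that repeat across every column. The `describe` output above reports 24579 unique Loan Number values in 24582 rows with a top frequency of 2, so three loan numbers each appear on two rows that differ in some other column. A minimal sketch of isolating repeated identifiers follows; the helper name and the sample loan numbers are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def repeated_values(values):
    """Return the values that occur more than once, mapped to their counts."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical loan numbers, for illustration only.
loan_numbers = ["AAA11", "BBB22", "AAA11", "CCC33"]
print(repeated_values(loan_numbers))  # {'AAA11': 2}
```

On the DataFrame itself, `df[df['Loan Number'].duplicated(keep=False)]` should list the affected rows for inspection.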

[175]: df['Bounce String'] = df['Bounce String'].str.replace('L','B')
df['Bounce String'] = df['Bounce String'].str.replace('H','S')

1.0.1 Task 1 - Calculating the risk labels for all the borrowers.

[176]: condition2 = df[
    ((df['Bounce String'] != 'FEMI')
     & (df['Bounce String'].str.count('B') == 0)
     & (df['Bounce String'].str.len() <= 6))
    | ((df['Bounce String'].str[-6:].str.count('B') == 0)
       & (df['Bounce String'].str.len() > 6))
]

[177]: condition2['Bounce String'].unique()

[177]: array(['SSS', 'SS', 'S', 'SSSSS', 'SSSS', 'BSSSSSSS', 'SSSSSSSS',
              'BSSSSSS', 'SSSSSSS', 'SSSSSS', 'SSSSSSSSS', 'BBSSSSSS',
              'BSSSSSSSSS', 'SSSSSSSSSS'], dtype=object)

[178]: condition3 = df[
    ((df['Bounce String'] != 'FEMI')
     & (df['Bounce String'].str[-1] != 'B')
     & (df['Bounce String'].str.count('B') == 1)
     & (df['Bounce String'].str.len() <= 6))
    | ((df['Bounce String'].str[-6:].str.count('B') == 1)
       & (df['Bounce String'].str.len() > 6))
]

[179]: condition3['Bounce String'].unique()

[179]: array(['BS', 'BSSSSS', 'SBSSS', 'SSSBS', 'BSSSS', 'SSBSS', 'BSSS', 'SSBS',
              'SBSS', 'SBS', 'SSSSSSSB', 'SSBSSSSS', 'BBBSSSSS', 'BSBSSSSS',
              'BSSSSSBS', 'BBSSSSS', 'BSSSSBS', 'BSSSSSB', 'SBSSSSS', 'SSBSSSS',
              'SSSSSBS', 'SSSBSSS', 'SSSSBSS', 'SSSSBS', 'SSSBSS', 'SSBSSS',
              'BSSBSSSS', 'SBBSSSSS', 'SSSSSSBS', 'SSSSSSB', 'BSSSBSS', 'SBSSSS',
              'BBBBBSSSSS', 'SBSSBSSSSS', 'BBBBSSSSS', 'SSSBSSSSS', 'BSSBSSS'],
             dtype=object)

[180]: # First EMI: no repayment behaviour to judge yet
condition1 = (df['Bounce String'] == 'FEMI')

# No bounce in the last 6 months
condition2 = (
    ((df['Bounce String'] != 'FEMI')
     & (df['Bounce String'].str.count('B') == 0)
     & (df['Bounce String'].str.len() <= 6))
    | ((df['Bounce String'].str[-6:].str.count('B') == 0)
       & (df['Bounce String'].str.len() > 6))
)

# Exactly one bounce in the last 6 months.
# Note: & binds tighter than |, so the last-month check (str[-1] != 'B')
# applies only to the first branch (strings of up to 6 months).
condition3 = (
    ((df['Bounce String'] != 'FEMI')
     & (df['Bounce String'].str[-1] != 'B')
     & (df['Bounce String'].str.count('B') == 1)
     & (df['Bounce String'].str.len() <= 6))
    | ((df['Bounce String'].str[-6:].str.count('B') == 1)
       & (df['Bounce String'].str.len() > 6))
)

# Every other borrower
condition4 = ~(condition1 | condition2 | condition3)

conditions = [
    condition1,
    condition2,
    condition3,
    condition4
]

values = ['Unknown risk', 'Low risk', 'Medium Risk', 'High Risk']

df['Risk Label'] = np.select(conditions, values, default='Unknown')
df.head(2)

[180]:    Amount Pending      State  Tenure  Interest Rate       City Bounce String  \
       0             963  Karnataka      11           7.69  Bangalore           SSS
       1            1194  Karnataka      11           6.16  Bangalore           SSB

          Disbursed Amount Loan Number Risk Label
       0             10197       JZ6FS   Low risk
       1             12738       RDIOY  High Risk
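
The same labelling can be written as a small per-string function, which may be easier to test than the combined boolean masks. This is a sketch under one stated assumption, not the notebook's method: a bounce in the most recent month is never treated as Medium Risk, whatever the string length. The function name `risk_label` and the `window` parameter are hypothetical.

```python
def risk_label(bounce_string, window=6):
    """Label one borrower from a bounce string already normalized to S and B."""
    if bounce_string == "FEMI":
        return "Unknown risk"
    recent = bounce_string[-window:]  # behaviour in the last `window` months
    bounces = recent.count("B")
    if bounces == 0:
        return "Low risk"
    # Assumption: a bounce in the latest month rules out Medium Risk for any length.
    if bounces == 1 and not bounce_string.endswith("B"):
        return "Medium Risk"
    return "High Risk"


for example in ["SSS", "SSB", "BBS", "BSSSSSSS", "SBSSS", "SSSSSSB", "FEMI"]:
    print(example, "->", risk_label(example))
```

Under this assumption a string such as SSSSSSB comes out as High Risk, whereas the masks in the notebook place it in the Medium Risk group (it appears in the `condition3` output). It could be applied with `df['Bounce String'].map(risk_label)`.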

[203]: df['Risk Label'].value_counts().plot(kind= 'bar')

[203]: <Axes: >

[181]: # Note: str.len() counts the literal 'FEMI' as 4 months on book
df['Months on Book'] = df['Bounce String'].str.len()

[182]: df

[182]:        Amount Pending           State  Tenure  Interest Rate       City  \
       0                 963       Karnataka      11           7.69  Bangalore
       1                1194       Karnataka      11           6.16  Bangalore
       2                1807       Karnataka      14           4.24     Hassan
       3                2451       Karnataka      10           4.70  Bangalore
       4                2611       Karnataka      10           4.41     Mysore
       …                   …               …       …              …          …
       24577             899  Andhra Pradesh       8           0.00   Chittoor
       24578            2699  Andhra Pradesh       8           0.00    Krishna
       24579            1540  Andhra Pradesh       8           0.00    Krishna
       24580             824  Andhra Pradesh       8           0.00     Guntur
       24581            2254  Andhra Pradesh      11           0.00    Kurnool

             Bounce String  Disbursed Amount Loan Number    Risk Label  \
       0               SSS             10197       JZ6FS      Low risk
       1               SSB             12738       RDIOY     High Risk
       2               BBS             24640       WNW4L     High Risk
       3               SSS             23990       6LBJS      Low risk
       4               SSB             25590       ZFZUA     High Risk
       …                 …                 …           …             …
       24577          FEMI              7192       EAX5C  Unknown risk
       24578          FEMI             21592       5MCE9  Unknown risk
       24579          FEMI             12320       9HO4Q  Unknown risk
       24580          FEMI              6592       3VV72  Unknown risk
       24581          FEMI             24794       18XBC  Unknown risk

              Months on Book
       0                   3
       1                   3
       2                   3
       3                   3
       4                   3
       …                   …
       24577               4
       24578               4
       24579               4
       24580               4
       24581               4

       [24582 rows x 10 columns]

1.0.2 Task 2: Labeling customers based on where they are in their tenure.

[183]: # np.select returns the first matching label, so Early Tenure takes priority over Late Tenure
Cond1 = df['Months on Book'] <= 3
Cond2 = (df['Tenure'] - df['Months on Book']) <= 3
Cond3 = ~(Cond1 | Cond2)

conditions = [Cond1, Cond2, Cond3]
values = ['Early Tenure', 'Late Tenure', 'Mid Tenure']

df['Tenure Label'] = np.select(conditions, values, default='Unknown')
df.head(2)

[183]:    Amount Pending      State  Tenure  Interest Rate       City Bounce String  \
       0             963  Karnataka      11           7.69  Bangalore           SSS
       1            1194  Karnataka      11           6.16  Bangalore           SSB

          Disbursed Amount Loan Number Risk Label  Months on Book  Tenure Label
       0             10197       JZ6FS   Low risk               3  Early Tenure
       1             12738       RDIOY  High Risk               3  Early Tenure
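
Because `np.select` evaluates its conditions in order, the tenure rule behaves like an if/elif chain. A minimal plain-Python sketch of that ordering follows; the function name `tenure_label` and the `edge` parameter are hypothetical, with `edge=3` mirroring the 3-month thresholds used in the cell above.

```python
def tenure_label(months_on_book, tenure, edge=3):
    """Return the tenure stage; the early check runs first, as in the np.select order."""
    if months_on_book <= edge:
        return "Early Tenure"
    if tenure - months_on_book <= edge:
        return "Late Tenure"
    return "Mid Tenure"


print(tenure_label(3, 11))  # Early Tenure
print(tenure_label(6, 7))   # Late Tenure
print(tenure_label(4, 11))  # Mid Tenure
print(tenure_label(3, 5))   # Early Tenure (both checks hold; the first one wins)
```

The last call shows the design consequence: a short loan that is both within 3 months of its start and within 3 months of its end is labelled Early Tenure.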

[204]: df['Tenure Label'].value_counts().plot(kind= 'bar')

[204]: <Axes: >

1.0.3 Task 3: Segmenting borrowers based on ticket size

[184]: # Sort the DataFrame by 'Amount Pending'
df = df.sort_values(by='Amount Pending').reset_index(drop=True)

# Calculate the cumulative sum of 'Amount Pending'
df['Cumulative Sum'] = df['Amount Pending'].cumsum()

# Calculate the quantiles for the cumulative sum
df['Ticket Size Quantile'] = pd.qcut(df['Cumulative Sum'], q=3,
                                     labels=['Low Ticket Size', 'Medium Ticket Size', 'High Ticket Size'])

# Display the distribution of borrowers in each cohort
print(df['Ticket Size Quantile'].value_counts())

Low Ticket Size       8194
Medium Ticket Size    8194
High Ticket Size      8194
Name: Ticket Size Quantile, dtype: int64
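
Since every Amount Pending is positive, the cumulative sum is strictly increasing, so `pd.qcut` cuts it at rank terciles. That is why each cohort holds exactly 8194 borrowers. If the goal were cohorts holding roughly equal shares of the total Amount Pending instead (an assumption; the notebook does not state the goal), the running total could be banded by thirds of the grand total. A plain-Python sketch with a hypothetical helper name and made-up amounts:

```python
from itertools import accumulate

LABELS = ("Low Ticket Size", "Medium Ticket Size", "High Ticket Size")


def cohorts_by_amount_share(amounts, labels=LABELS):
    """Pair each amount (ascending) with a cohort so cohorts hold similar totals.

    Assumes all amounts are positive.
    """
    ordered = sorted(amounts)
    total = sum(ordered)
    n = len(labels)
    result = []
    band = 0
    for amount, running in zip(ordered, accumulate(ordered)):
        # Move up a band once the running total passes that band's share of the total.
        while band < n - 1 and running * n > total * (band + 1):
            band += 1
        result.append((amount, labels[band]))
    return result


# Made-up amounts, for illustration only.
for amount, label in cohorts_by_amount_share([600, 100, 300, 100, 600, 100]):
    print(amount, label)
```

In this toy input each cohort sums to 600, but the cohort sizes are 4, 1 and 1 borrowers. In pandas, `pd.cut(df['Cumulative Sum'], bins=3, labels=...)` should give a close equivalent, since it uses equal-width bins over the cumulative range.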

[185]: df.head()

[185]:    Amount Pending        State  Tenure  Interest Rate          City  \
       0             423  Maharashtra      11          11.84        Sangli
       1             444   Tamil Nadu      11          12.23  VIRUDHUNAGAR
       2             451  Maharashtra       7          37.92          Pune
       3             522    Karnataka      11          12.83      Bagalkot
       4             522  Maharashtra      11          12.83          Pune

         Bounce String  Disbursed Amount Loan Number    Risk Label  Months on Book  \
       0          FEMI              4389       HEMS0  Unknown risk               4
       1          FEMI              4598       1BYJD  Unknown risk               4
       2        BSSSSB              2793       7COLC     High Risk               6
       3          FEMI              5390       587TX  Unknown risk               4
       4             S              5390       5QJN0      Low risk               1

          Tenure Label  Cumulative Sum Ticket Size Quantile
       0    Mid Tenure             423      Low Ticket Size
       1    Mid Tenure             867      Low Ticket Size
       2   Late Tenure            1318      Low Ticket Size
       3    Mid Tenure            1840      Low Ticket Size
       4  Early Tenure            2362      Low Ticket Size

1.0.4 Task 4: Channel spend recommendations


[199]: # Low risk customers
Low_risk_df = df[df['Risk Label'] == 'Low risk']

# First EMI customers
First_EMI_df = df[df['Bounce String'] == 'FEMI']

# Low EMI customers
Low_EMI_df = df[df['Ticket Size Quantile'] == 'Low Ticket Size']

# Low or medium EMI customers
# (isin tests membership; == ('a' or 'b') would compare against 'a' only)
Low_Med_EMI_df = df[df['Ticket Size Quantile'].isin(['Low Ticket Size', 'Medium Ticket Size'])]

# English speaking customers
Eng_df = df[df['City'].isin(["Mumbai", "Pune", "Delhi", "Ahmedabad", "Surat",
                             "Chennai", "Kolkata", "Bangalore", "Hyderabad"])]

# Hindi speaking customers
Hindi_df = df[df['State'].isin(['Maharashtra', 'Madhya Pradesh'])]

# Hindi or English speaking customers
Hindi_Eng_df = pd.concat([Eng_df, Hindi_df]).drop_duplicates()

# Low bounce behaviour customers
Low_bounce_df = df[df['Risk Label'] == 'Medium Risk']
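
A note on filters that should accept any of several values: Python's `or` returns its first truthy operand, so an expression such as `x == ('a' or 'b')` only ever compares against `'a'`. The short sketch below shows the difference with the ticket-size labels; the variable names are illustrative.

```python
# `or` between two non-empty strings evaluates to the first one.
target = "Low Ticket Size" or "Medium Ticket Size"
print(target)  # Low Ticket Size

sizes = ["Low Ticket Size", "Medium Ticket Size", "High Ticket Size"]
wanted = {"Low Ticket Size", "Medium Ticket Size"}

# Equality against the `or` expression keeps only the first label.
equality_matches = [s for s in sizes if s == target]
# A membership test keeps both labels.
membership_matches = [s for s in sizes if s in wanted]

print(equality_matches)    # ['Low Ticket Size']
print(membership_matches)  # ['Low Ticket Size', 'Medium Ticket Size']
```

On a pandas column the membership test is spelled `Series.isin([...])`.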

[200]: # Define the cost of each channel
channel_costs = {'Whatsapp bot': 5, 'Voice bot': 10, 'Human calling': 50}

# Assign each borrower to a channel based on the defined criteria
def assign_channel(row):
    if (row['Risk Label'] == 'Low risk' or row['Bounce String'] == 'FEMI'
            or row['Ticket Size Quantile'] == 'Low Ticket Size'):
        return 'Whatsapp bot'
    elif (row['City'] in ["Mumbai", "Pune", "Delhi", "Ahmedabad", "Surat",
                          "Chennai", "Kolkata", "Bangalore", "Hyderabad"]
            or row['State'] in ['Maharashtra', 'Madhya Pradesh']
            or row['Risk Label'] == 'Medium Risk'):
        return 'Voice bot'
    else:
        return 'Human calling'

# Apply the function to assign channels
df['Channel'] = df.apply(assign_channel, axis=1)
df

[200]:        Amount Pending        State  Tenure  Interest Rate          City  \
       0                 423  Maharashtra      11          11.84        Sangli
       1                 444   Tamil Nadu      11          12.23  VIRUDHUNAGAR
       2                 451  Maharashtra       7          37.92          Pune
       3                 522    Karnataka      11          12.83      Bagalkot
       4                 522  Maharashtra      11          12.83          Pune
       …                   …            …       …              …             …
       24577           12500  Maharashtra       8           0.00      Kolhapur
       24578           12500  Maharashtra       8           0.00          Pune
       24579           12500       Kerala       8           0.00    MALAPPURAM
       24580           12500  Maharashtra       8           0.00        Sangli
       24581           13349  Maharashtra       8           0.00        Nagpur

             Bounce String  Disbursed Amount Loan Number    Risk Label  \
       0              FEMI              4389       HEMS0  Unknown risk
       1              FEMI              4598       1BYJD  Unknown risk
       2            BSSSSB              2793       7COLC     High Risk
       3              FEMI              5390       587TX  Unknown risk
       4                 S              5390       5QJN0      Low risk
       …                 …                 …           …             …
       24577       BBSSSSS            100000       8MQRY   Medium Risk
       24578             S            100000       1R840      Low risk
       24579             S            100000       QUV9D      Low risk
       24580             S            100000       66HA4      Low risk
       24581             S            106792       HZ6XJ      Low risk

              Months on Book  Tenure Label  Cumulative Sum Ticket Size Quantile  \
       0                   4    Mid Tenure             423      Low Ticket Size
       1                   4    Mid Tenure             867      Low Ticket Size
       2                   6   Late Tenure            1318      Low Ticket Size
       3                   4    Mid Tenure            1840      Low Ticket Size
       4                   1  Early Tenure            2362      Low Ticket Size
       …                   …             …               …                    …
       24577               7   Late Tenure        43979758     High Ticket Size
       24578               1  Early Tenure        43992258     High Ticket Size
       24579               1  Early Tenure        44004758     High Ticket Size
       24580               1  Early Tenure        44017258     High Ticket Size
       24581               1  Early Tenure        44030607     High Ticket Size

                   Channel
       0      Whatsapp bot
       1      Whatsapp bot
       2      Whatsapp bot
       3      Whatsapp bot
       4      Whatsapp bot
       …                 …
       24577     Voice bot
       24578  Whatsapp bot
       24579  Whatsapp bot
       24580  Whatsapp bot
       24581  Whatsapp bot

       [24582 rows x 14 columns]

[201]: # Total cost for each channel
whatsapp_cost = df[df['Channel'] == 'Whatsapp bot'].shape[0] * 5
voice_cost = df[df['Channel'] == 'Voice bot'].shape[0] * 10
human_cost = df[df['Channel'] == 'Human calling'].shape[0] * 50
total_cost = whatsapp_cost + voice_cost + human_cost

print(f"Total cost for Whatsapp Bot: {whatsapp_cost} rupees")
print(f"Total cost for Voice Bot: {voice_cost} rupees")
print(f"Total cost for Human Calling: {human_cost} rupees")
print(f"Total overall cost: {total_cost} rupees")

Total cost for Whatsapp Bot: 96640 rupees
Total cost for Voice Bot: 39610 rupees
Total cost for Human Calling: 64650 rupees
Total overall cost: 200900 rupees
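
The cost arithmetic can also be driven by the `channel_costs` dictionary rather than hard-coded unit prices. The sketch below is illustrative: the borrower counts are not printed in the notebook but are implied by the totals above (96640 / 5, 39610 / 10, 64650 / 50), and `cost_breakdown` is a hypothetical helper.

```python
channel_costs = {"Whatsapp bot": 5, "Voice bot": 10, "Human calling": 50}

# Borrower counts implied by the printed totals; they sum to the 24582 rows.
channel_counts = {"Whatsapp bot": 19328, "Voice bot": 3961, "Human calling": 1293}


def cost_breakdown(counts, costs):
    """Cost per channel = number of borrowers on the channel times its unit cost."""
    return {channel: counts[channel] * costs[channel] for channel in costs}


breakdown = cost_breakdown(channel_counts, channel_costs)
total_cost = sum(breakdown.values())

# Baseline for comparison: every borrower contacted by human calling.
all_human_cost = sum(channel_counts.values()) * channel_costs["Human calling"]

print(breakdown)       # {'Whatsapp bot': 96640, 'Voice bot': 39610, 'Human calling': 64650}
print(total_cost)      # 200900
print(all_human_cost)  # 1229100
```

With the DataFrame in hand, `df['Channel'].map(channel_costs).sum()` should give the same overall total in one step.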

[ ]:
