EndTerm - MLBA - Group 7 Draft

The project analyzes a bank's dataset to identify potential customers for personal loan acceptance, focusing on data preprocessing and model development using Logistic Regression and Decision Tree. Key insights reveal significant class imbalance and the importance of features like Income and CD Account in predicting loan acceptance. The Decision Tree model outperforms Logistic Regression with higher accuracy and recall, making it more effective for the bank's objectives.


Project Report: Predicting Personal Loan Acceptance

Project Title: Identifying Potential Customers for Personal Loan Uptake

Introduction & Dataset Name: This project aims to analyze a dataset from a bank that offers
personal loans to its customers. The primary objective is to identify customers who are most
likely to accept a personal loan offer, thereby helping the bank's management increase the
uptake of these loans. The dataset used for this analysis is bankloan.csv.

1. Data Preprocessing & Key Insights

Outline of Data Preprocessing Steps:

The initial bankloan.csv dataset comprised 5000 entries and 14 columns. The preprocessing
steps were crucial for cleaning the data and preparing it for effective model training.

● Initial Data Inspection:

○ The dataset was loaded into a pandas DataFrame.
○ Initial checks using .head(), .info(), and .describe() revealed the
dataset structure, data types (all numerical), and basic statistical summaries.
○ Crucially, no missing values were found across any of the columns.
● Handling Negative 'Experience' Values:

○ A significant data anomaly was identified in the Experience column: 52 rows
contained negative values (e.g., -3), which is illogical for "experience."
○ These negative values were replaced with the median of all positive
Experience values in the dataset, which was 20.0. This correction ensures
that the Experience feature accurately reflects a customer's professional
background. The minimum Experience value became 0.0 after this step.
● Dropping Irrelevant Columns:

○ The ID column, being a unique identifier, has no predictive power for loan
acceptance; therefore, it was removed.
○ The ZIP Code column, while numerical, represents highly granular geographical
information. Treating it as a numerical feature is inappropriate, and one-hot
encoding it would create too many features (high cardinality), potentially leading
to computational inefficiency and overfitting. For simplicity and focus on direct
customer attributes, ZIP Code was also dropped.
● Handling 'Education' Column:
○ The Education column contained discrete numerical values (1, 2, 3), which
clearly represent ordered categories (e.g., Undergraduate, Graduate,
Advanced/Professional). Given its ordinal nature, this column was retained as is,
without one-hot encoding, as its numerical representation implicitly captures the
hierarchical relationship.
● Feature Scaling:

○ To ensure that features with larger numerical ranges (like Income or Mortgage)
do not disproportionately influence the Logistic Regression model, numerical
features (Age, Experience, Income, CCAvg, Mortgage) were scaled using
StandardScaler. This transforms the data to have a mean of 0 and a standard
deviation of 1. Decision Tree models are generally insensitive to feature scaling,
but it was applied for consistency and Logistic Regression's benefit.
● Data Splitting:

○ The processed dataset was divided into training (70%) and testing (30%) sets.
This separation is vital to evaluate the models' generalization ability on unseen
data.
○ stratify=y was applied during the split to maintain the exact proportion of loan
acceptors (the target variable Personal Loan) in both the training and testing
sets, addressing the inherent class imbalance.
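The steps above can be sketched in pandas and scikit-learn. The frame below is a hypothetical four-row stand-in for bankloan.csv (the real file has 5000 rows and 14 columns); the column names match those used in the report, but the values are invented:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-frame standing in for bankloan.csv; values are invented.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [25.0, 45.0, 39.0, 35.0],
    "Experience": [1.0, -3.0, 15.0, 9.0],   # includes an illogical negative value
    "Income": [49.0, 100.0, 11.0, 80.0],
    "ZIP Code": [91107, 90089, 94720, 94112],
    "Personal Loan": [0, 1, 0, 0],
})

# Replace negative Experience values with the median of the positive ones.
pos_median = df.loc[df["Experience"] > 0, "Experience"].median()
df.loc[df["Experience"] < 0, "Experience"] = pos_median

# Drop the identifier and the high-cardinality ZIP Code column.
df = df.drop(columns=["ID", "ZIP Code"])

# Scale the numerical features to zero mean and unit variance.
num_cols = ["Age", "Experience", "Income"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```

On the real dataset the computed median would be 20.0, as reported above; here it is just the median of the toy column.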

Key Insights from the Preprocessing and Exploratory Data Analysis (EDA):

The EDA revealed crucial characteristics of the dataset and provided insights into factors
influencing personal loan acceptance:

● Significant Class Imbalance: The target variable, Personal Loan, showed a notable
imbalance: only 9.6% of customers accepted a loan (Class 1), while 90.4% did not
(Class 0). This highlights the importance of evaluating models beyond simple accuracy,
focusing on metrics like recall and precision for the minority class.
● Impact of Key Numerical Features:

○ Income and CCAvg (Credit Card Average Spending): These were identified
as the most influential numerical features. Customers with significantly higher
incomes and higher average monthly credit card spending (CCAvg) were
found to be substantially more likely to accept personal loans. The distributions
for loan acceptors were clearly skewed towards higher values for these features.
○ Mortgage: While there was some correlation, the relationship between
Mortgage amount and loan acceptance was not as strong as with Income or
CCAvg.
○ Age and Experience: These features showed very little differentiation in their
distributions between loan acceptors and non-acceptors, indicating a minimal
direct impact on loan acceptance.

● Influence of Key Categorical/Binary Features:

○ CD Account: The presence of a CD Account was a remarkably strong indicator:
customers who already held a CD Account were significantly more likely to
accept a personal loan.
○ Education: Higher Education levels (specifically levels 2 and 3) showed a
clear positive association with personal loan acceptance, suggesting that higher
education might correlate with better financial understanding or different financial
needs.
○ Family: Customers with larger Family sizes (e.g., 3 or 4 members) also
exhibited a slightly higher propensity for loan acceptance.
○ Securities Account, Online, CreditCard: These features showed very little
discernible impact on loan acceptance rates.
● Correlation Analysis:

○ A correlation matrix confirmed the strong positive relationships between
Personal Loan and Income, CCAvg, and CD Account.
○ Education and Mortgage had moderate positive correlations.
○ Features like Age, Experience, Securities Account, Online, and
CreditCard showed very weak or negligible correlations with loan acceptance.
○ A high correlation was noted between Age and Experience (0.99), indicating
multicollinearity, but given their low individual correlation with the target, it was
not a primary concern for the model's predictive power.
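The 9.6% imbalance figure comes from a simple class-share computation. A minimal sketch, using a toy target column constructed to mirror the reported proportions rather than the actual data:

```python
import pandas as pd

# Toy target column mirroring the reported ~9.6% acceptance rate.
loans = pd.Series([1] * 96 + [0] * 904, name="Personal Loan")

# normalize=True returns class proportions instead of raw counts.
shares = loans.value_counts(normalize=True)
print(shares[0], shares[1])  # 0.904 0.096
```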

2. Predictive Model Development: Methodology, Model Performance, and Interpretation

To identify potential customers, two predictive models were developed: Logistic Regression and
Decision Tree. Both were trained and evaluated to assess their effectiveness.

Methodology:

1. Data Preparation: The preprocessed dataset was used, with Personal Loan as the
target variable (y) and all other relevant features (excluding 'ID' and 'ZIP Code') as
predictors (X).
2. Data Splitting: The data was split into training (70%) and testing (30%) sets using
train_test_split with random_state=42 for reproducibility and stratify=y to
preserve the class distribution of Personal Loan in both sets.
3. Feature Scaling (for Logistic Regression): Numerical features (Age, Experience,
Income, CCAvg, Mortgage) were scaled using StandardScaler on the training data,
and this scaler was then applied to the test data. This step standardizes the feature
values, which is beneficial for Logistic Regression's optimization algorithm.
4. Model Instantiation and Training:
○ Logistic Regression: An instance of LogisticRegression from
sklearn.linear_model was created with random_state=42 and
solver='liblinear' (suitable for smaller datasets and regularization). The
model was then fit on X_train and y_train.
○ Decision Tree: An instance of DecisionTreeClassifier from
sklearn.tree was created with random_state=42 and a max_depth=5.
The max_depth parameter was set to control the complexity of the tree, helping
to prevent overfitting and improve interpretability. This model was also fit on
X_train and y_train.
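A sketch of steps 2 to 4 under stated assumptions: the feature matrix here is synthetic noise standing in for the bank data, but the split ratio, random_state, stratification, and model hyperparameters follow the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data: 500 rows, 5 features, imbalanced target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 1.2).astype(int)

# Step 2: 70/30 stratified split for reproducibility and preserved class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Step 3: fit the scaler on the training data only, then apply it to the test data.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 4: the two models with the hyperparameters named above.
logreg = LogisticRegression(random_state=42, solver="liblinear").fit(X_train_s, y_train)
tree = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_train, y_train)
```

Fitting the scaler only on the training fold avoids leaking test-set statistics into training, which is why the scaler is fit once and then applied to both splits.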

Model Performance and Interpretation:

The performance of both models was evaluated using classification_report,
confusion_matrix, accuracy_score, and roc_auc_score. Visualizations like
confusion matrices and ROC curves were also generated.
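A minimal sketch of that evaluation loop; make_classification serves as a stand-in for the bank data, so the printed numbers will differ from the figures reported below:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Synthetic imbalanced data standing in for the bank dataset.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]   # probability scores for ROC AUC

acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)       # rows are true classes, columns predictions
auc = roc_auc_score(y_te, proba)
print(classification_report(y_te, pred))
```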

Logistic Regression Model

● Overall Accuracy: 95.13%
● ROC AUC Score: 0.9646 (indicates excellent discriminatory power)

Classification Report:

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      1356
           1       0.83      0.62      0.71       144

Confusion Matrix:

              Predicted 0   Predicted 1
Actual 0         1338            18
Actual 1           55            89

True Negatives (TN): 1338 (correctly predicted no loan)
False Positives (FP): 18 (predicted loan, but the customer did not take one)
False Negatives (FN): 55 (predicted no loan, but the customer did take one)
True Positives (TP): 89 (correctly predicted loan)

Interpretation: The Logistic Regression model performs strongly on non-loan
acceptors (Class 0), with high precision and recall. For loan acceptors (Class 1),
however, precision is good (83%) while recall is only moderate (62%): of the
customers it predicts will take a loan, 83% actually do, but it captures only 62%
of the customers who actually take one, missing 38% of the positive cases.

Feature               Coefficient   Absolute Coefficient
CD Account                  3.24            3.24
Income                      2.08            2.08
Education                   1.31            1.31
CreditCard                 -0.92            0.92
Securities Account         -0.82            0.82
Online                     -0.64            0.64
Family                      0.54            0.54
Age                         0.48            0.48
Experience                 -0.48            0.48
CCAvg                       0.20            0.20
Mortgage                    0.06            0.06

Interpretation of Coefficients:

● Strong Positive Impact: CD Account, Income, and Education have the largest
positive coefficients, indicating that the presence of a CD account, higher income, and
higher education levels significantly increase the likelihood of a customer accepting a
personal loan.
● Negative Impact: CreditCard, Securities Account, and Online have negative
coefficients, suggesting that customers having these features might be slightly less likely
to accept a personal loan compared to others.
● Moderate/Minor Impact: Family, Age, Experience, CCAvg, and Mortgage have
smaller coefficients, implying a less pronounced influence on loan acceptance in this
model.
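A ranking like the one above can be read off a fitted model's coef_ attribute. The snippet below is only an illustration: it fits a tiny synthetic dataset and borrows three of the feature names, so the numbers will not match the table:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Three illustrative feature names borrowed from the table; data is synthetic.
features = ["CD Account", "Income", "Education"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(solver="liblinear", random_state=42).fit(X, y)

# Rank features by the absolute value of their coefficients.
coefs = pd.Series(model.coef_[0], index=features)
ranking = coefs.abs().sort_values(ascending=False)
```

Sorting by absolute value is what the "Absolute Coefficient" column above does: it ranks influence regardless of direction, while the sign of the raw coefficient tells whether the feature pushes toward or away from acceptance.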

Decision Tree Model

● Overall Accuracy: 98.47%
● ROC AUC Score: 0.9951 (indicates superior discriminatory power)

Classification Report:

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1356
           1       0.90      0.95      0.92       144

Confusion Matrix:

              Predicted 0   Predicted 1
Actual 0         1340            16
Actual 1            7           137

True Negatives (TN): 1340
False Positives (FP): 16
False Negatives (FN): 7
True Positives (TP): 137

Interpretation:

The Decision Tree model significantly outperforms the Logistic Regression model, particularly in
identifying the minority class. Its recall for Class 1 (Personal Loan) is 95%, meaning it correctly
identifies 95% of actual loan acceptors, missing only 7 cases (False Negatives). This makes it a
much more effective tool for the bank's goal of increasing loan uptake.

Feature Importances of Decision Tree:

Feature               Importance
Income                   0.458
Education                0.326
Family                   0.148
CCAvg                    0.045
CD Account               0.011
Age                      0.006
Online                   0.004
Mortgage                 0.003
Experience               0.000
Securities Account       0.000
CreditCard               0.000

Interpretation of Feature Importances:


● Income is by far the most important feature, followed by Education and Family. These
features are used at the top levels of the decision tree to make crucial splits.
● CCAvg and CD Account also contribute, though their individual importance in this
particular tree structure (with max_depth=5) is lower compared to the top three.
● Features like Experience, Securities Account, and CreditCard have negligible
or zero importance in this model, meaning they were not used in the key decision splits.
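These values come directly from the fitted tree's feature_importances_ attribute, which sums to 1 and assigns exactly 0 to features never used in a split. A minimal sketch on synthetic data where, by construction, only the first feature carries signal (the feature names are borrowed for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first feature determines the label.
features = ["Income", "Education", "Family"]
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# Importances sum to 1; features never used in a split get exactly 0.
importances = pd.Series(tree.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
```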

3. Model Comparison: Logistic Regression vs. Decision Tree

Metric                  Logistic Regression   Decision Tree

Accuracy                      0.9513             0.9847
ROC AUC Score                 0.9646             0.9951
Recall (Class 1)              0.62               0.95
Precision (Class 1)           0.83               0.90
False Negatives (FN)          55                 7

Conclusion on Model Comparison:


As the ROC Curve Comparison plot illustrates, the Decision Tree model (green curve)
demonstrates a superior performance with an Area Under the Curve (AUC) of 0.99, significantly
higher than the Logistic Regression model (blue curve) which has an AUC of 0.96. This
indicates that the Decision Tree model is much better at distinguishing between customers who
will accept a personal loan and those who will not, across various classification thresholds. The
closer the curve is to the top-left corner, the better the model's performance.
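Each curve in such a plot is just the (false positive rate, true positive rate) pairs returned by scikit-learn's roc_curve, swept over thresholds. A toy sketch with invented labels and scores for two hypothetical models (not the reported ones):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Invented labels and scores for two hypothetical models.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores_a = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])  # positives scored highest
scores_b = np.array([0.3, 0.2, 0.6, 0.4, 0.5, 0.9])  # weaker separation

fpr_a, tpr_a, _ = roc_curve(y_true, scores_a)
fpr_b, tpr_b, _ = roc_curve(y_true, scores_b)
auc_a, auc_b = auc(fpr_a, tpr_a), auc(fpr_b, tpr_b)

# Passing each (fpr, tpr) pair to matplotlib's plot() reproduces the comparison
# figure; the curve hugging the top-left corner has the larger AUC.
```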

The Decision Tree model clearly outperforms the Logistic Regression model for this task,
especially in identifying potential loan acceptors. Its significantly higher recall for the positive
class (0.95 vs. 0.62) means it is much better at identifying customers who will actually take a
loan, leading to fewer missed opportunities for the bank.

4. Assumptions & Limitations

Assumptions:

● Data Representativeness: It is assumed that the provided bankloan dataset is
representative of the bank's overall customer base and reflects real-world customer
behavior regarding personal loan acceptance.
● Feature Relevance: The available features are assumed to contain sufficient
information to predict loan acceptance.
● Data Integrity: It is assumed that, after preprocessing, the data is accurate and free
from major unhandled errors (e.g., that median imputation for 'Experience' is an
acceptable approach).
● Ordinality of Education: It is assumed that the numerical representation of Education
(1, 2, 3) correctly reflects an ordinal relationship and that treating it as such is
appropriate for the models.
● Model Linearity (Logistic Regression): Logistic Regression assumes a linear
relationship between the independent variables and the log-odds of the dependent
variable.
● Independence of Observations: Both models assume that observations (customer
records) are independent of each other.

Limitations:

● Class Imbalance: Despite using stratify during splitting, the significant class
imbalance (9.6% loan acceptors) can still pose a challenge. Models might struggle with
the minority class, and performance metrics (like accuracy) can be misleading if not
considered alongside precision and recall.
● Generalizability: The models are trained on a specific dataset. Their performance on
entirely new customer segments or data from a different time period might vary.
● Feature Engineering Scope: Due to the scope of this project, extensive feature
engineering (e.g., creating interaction terms, more sophisticated handling of ZIP Code
data for geographical segmentation) was not performed, which could potentially further
improve model performance.
● Interpretability vs. Performance Trade-off: While Logistic Regression offers clear
coefficient interpretations, Decision Trees can become complex and less interpretable if
max_depth is not limited. Conversely, simpler Decision Trees might not capture all
complex relationships.
● Decision Tree Instability: Decision Trees can be sensitive to small changes in the
training data (high variance). While random_state ensures reproducibility for a given
run, this inherent instability can be a limitation for deployment. Ensemble methods (like
Random Forest or Gradient Boosting) could address this but were beyond the immediate
scope.
● No Causal Inference: The models identify correlations and predictive patterns, but they
do not establish causal relationships (e.g., high income does not cause loan
acceptance; it is merely strongly associated with it).

5. Conclusion & Key Takeaways

The analysis of the bank loan dataset and the development of predictive models have provided
valuable insights for identifying customers likely to accept personal loan offers.

Key Findings and Model Comparison:

● The dataset exhibited a notable class imbalance, with only 9.6% of customers accepting
personal loans.
● Income, CD Account, and Education Level emerged as the strongest positive
predictors of personal loan acceptance across both models. Customers with higher
incomes, those who possess a CD Account, and those with higher education levels are
significantly more likely to accept a loan.
● The Decision Tree model proved to be superior for this task. It achieved a higher
overall accuracy (98.47%) and, more importantly, a substantially better recall for the
positive class (95% vs. 62% for Logistic Regression). This indicates the Decision Tree is
much more effective at identifying true loan acceptors and minimizing missed
opportunities for the bank.

Key Takeaways for Increasing Loan Uptake:

Based on the insights derived from the models, the bank should primarily focus its marketing
and outreach efforts on customer segments characterized by:

1. High Income: This is the most critical factor. Targeting affluent customers will yield the
highest success rate.
2. Higher Education Levels: Customers with graduate or advanced degrees show a
strong propensity to accept personal loans.
3. Existing CD Account Holders: Customers who already have a Certificate of Deposit
(CD) account with the bank are significantly more likely to accept personal loans. This
group represents a highly promising target due to their existing relationship and likely
financial stability.
4. Larger Families: Customers with more family members also show a higher likelihood of
accepting loans.
5. Higher Credit Card Average Spending (CCAvg): Customers with higher monthly
credit card expenditures are also good candidates.

By prioritizing these characteristics, the bank can optimize its strategies to identify and engage
with the most receptive customers, thereby increasing the uptake of personal loans efficiently.
Further improvements could involve exploring more advanced modeling techniques or delving
deeper into feature engineering.
