Final Group Project
Group #1
Price College of Business, University of Oklahoma
FIN 4433-001 FinTech & Applications
Mandy Chan, MBA
Started on April 12th, 2025
Last Modified April 27th, 2025
Introduction
Credit risk, the probability that a borrower fails to meet contractual obligations, remains a
core concern for lenders. Recent advances in machine learning, particularly gradient-boosting
methods, allow institutions to model non-linear feature interactions and improve early detection
of high-risk applicants. Replicating peer-reviewed code from public repositories aligns with best
practices in open science and accelerates innovation in FinTech education.
The Challenges of Credit Risk Analysis before FinTech
Before automated scoring systems and networked databases became commonplace, credit-
risk assessment was hampered by a chronic lack of reliable information. Borrower data were
scattered across individual bank branches and local credit bureaus, often locked away in paper
files. Because lenders could not instantly access a customer’s full payment history, they made
decisions with glaring information gaps, increasing the odds of both approving high-risk applicants
and turning away creditworthy ones.
In the absence of large, consistent datasets, underwriting leaned heavily on human
judgment. Loan officers relied on “character” interviews, letters from employers, or the applicant’s
reputation in the community. Although personal insight sometimes revealed nuances that numbers
missed, it also introduced subjectivity and bias. Lending standards varied from branch to branch,
and discriminatory practices could creep in unnoticed, exposing institutions to compliance and
reputational risks.
Quantitative tools were rudimentary. Basic ratio analysis, such as comparing debt to income or assets to liabilities, was calculated by hand or on simple spreadsheets. Multivariate
statistical models were rare outside the largest banks, largely because the computing power and
skilled analysts needed to build them were expensive. As a result, lenders struggled to price the
risk accurately: interest rates and reserve cushions were either set too high, discouraging good
borrowers, or too low, encouraging future defaults.
Operationally, the entire process was slow and costly. Collecting pay stubs, tax returns, and bank references required physical mail or in-person visits, so underwriting a single file could take
days. High manual workloads limited a lender’s ability to scale and made it difficult to handle
spikes in application volume. Customers often waited weeks for funding decisions, eroding
satisfaction and driving some to faster competitors.
Even after a loan was approved, monitoring remained largely static. Accounts were
reviewed only at fixed intervals or once payments became delinquent. Because portfolio data were
not linked in real time to economic indicators such as unemployment or regional downturns, credit
quality could deteriorate rapidly before management noticed. By the time remedial action was
taken, losses were often much larger than they would have been with earlier warning signals.
Some Applications of FinTech in Credit Risk Analysis
Modern credit-scoring engines
The most visible use of data analysis in credit risk is building credit-scoring models that
predict the likelihood a new or existing borrower will repay on time. By feeding statistical or
machine-learning algorithms with thousands of historical loan records—payment histories,
utilization ratios, income stability, and even alternative data such as utility-bill payment or mobile-
phone top-ups—lenders translate raw attributes into a single score or probability of default (PD).
Because the score is automated and continuously recalibrated, it lets banks offer instant credit
decisions, set credit-line limits, and comply with “fair-lending” regulations that demand objective,
data-driven rules rather than subjective judgment.
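To make this concrete, the sketch below fits a minimal probability-of-default model on synthetic data. The feature names, coefficients, and data are hypothetical illustrations of the technique, not the notebook's actual model.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "historical loan records" with hypothetical feature names
rng = np.random.default_rng(42)
n = 5000
loans = pd.DataFrame({
    "utilization_ratio": rng.uniform(0, 1, n),        # balance / limit
    "payment_delinquencies": rng.poisson(0.3, n),     # past late payments
    "income_stability": rng.uniform(0, 1, n),         # 1 = very stable
})
# Synthetic default flag loosely tied to the features
logit = (-3 + 2.5 * loans["utilization_ratio"]
         + 0.8 * loans["payment_delinquencies"]
         - 1.5 * loans["income_stability"])
loans["default"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    loans.drop(columns="default"), loans["default"],
    test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
pd_scores = model.predict_proba(X_test)[:, 1]         # probability of default per applicant
print(f"Mean predicted PD: {pd_scores.mean():.3f}")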
Limit assignment and dynamic line management
Once an account is opened, lenders still need to decide how much exposure to grant. Data
analysis powers optimization frameworks that set initial credit limits and periodically adjust them
up or down. By linking historical utilization patterns, macroeconomic indicators, and forward-
looking loss forecasts, banks can expand limits for customers who show rising incomes and
responsible usage while trimming or freezing limits for those who exhibit early warning signs such
as maxing out their lines, higher minimum payments, or deteriorating external scores. This fine-tuning balances revenue growth against loss containment more effectively than static “one-size-fits-all”
policies.
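In its simplest form, the decision logic is a rules layer on top of model scores. A minimal sketch follows; every threshold is a hypothetical placeholder, not an actual lender policy.

def adjust_credit_limit(current_limit, utilization, pd_score, income_trend):
    """Toy line-management rule; all thresholds are hypothetical placeholders.

    utilization  -- average balance / limit over recent billing cycles
    pd_score     -- model-estimated probability of default
    income_trend -- fractional year-over-year income change
    """
    if pd_score > 0.10 or utilization > 0.95:
        return current_limit * 0.8            # trim exposure on early warning signs
    if pd_score < 0.02 and income_trend > 0.05 and utilization < 0.50:
        return current_limit * 1.2            # expand the line for low-risk growth
    return current_limit                      # otherwise leave the limit unchanged

print(adjust_credit_limit(5000, utilization=0.30, pd_score=0.01, income_trend=0.08))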
Responsible-lending and model-explainability analytics
Modern credit-risk teams pair machine-learning power with fairness and explainability
techniques to meet regulatory and ethical standards. Tools such as Shapley values and
counterfactual analysis quantify how each input feature influences an individual prediction,
enabling clear adverse-action notices to consumers who are declined. Bias-detection algorithms
scan for disparate impact across protected classes, prompting data scientists to retrain models or
add constraints that preserve accuracy while reducing unfair outcomes. This analytical layer turns
raw predictive power into transparent, compliant, and socially responsible credit decisions.
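As an illustration, per-feature Shapley attributions can be computed with the open-source shap library; the model and applicant data below are hypothetical stand-ins for a production scoring system.

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-ins for a production scoring model and applicant data
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)         # Shapley values for tree ensembles
shap_values = explainer.shap_values(X[:1])    # attribution for a single applicant
print(shap_values)                            # per-feature contribution to the score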
Stress testing and portfolio loss forecasting
Regulators and risk committees require lenders to show how their consumer portfolios will
perform under adverse economic scenarios. Analysts build panel data sets that marry internal loan-
level performance with unemployment rates, interest-rate paths, and regional house-price indices.
Econometric or machine-learning models project charge-offs and net losses under baseline, adverse, and severely adverse scenarios. These forecasts drive capital-adequacy planning and loan-loss-reserve calculations (CECL/IFRS 9), and they inform management on whether to tighten underwriting standards or raise pricing before a downturn hits.
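Conceptually, each scenario scales the expected-loss identity EL = PD × LGD × EAD. The sketch below shows the mechanics with a hypothetical three-loan book and made-up scenario multipliers; none of the figures are regulatory values.

import pandas as pd

# Hypothetical loan-level book: PD, loss given default (LGD), exposure at default (EAD)
book = pd.DataFrame({
    "pd":  [0.02, 0.05, 0.10],
    "lgd": [0.45, 0.45, 0.60],
    "ead": [10_000, 25_000, 5_000],
})

# Hypothetical multipliers on baseline PD -- not regulatory scenario figures
scenarios = {"baseline": 1.0, "adverse": 1.8, "severely adverse": 3.0}

for name, mult in scenarios.items():
    stressed_pd = (book["pd"] * mult).clip(upper=1.0)
    expected_loss = (stressed_pd * book["lgd"] * book["ead"]).sum()
    print(f"{name:>17}: expected loss = ${expected_loss:,.0f}")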
Example Notebook of a FinTech Application in Credit Risk Analysis (with Walkthrough)
Business Problem
This app predicts whether an applicant will be approved for a credit card. Each hard inquiry negatively affects a consumer's credit score, so the app estimates the probability of approval without triggering one. Applicants can therefore learn whether they are likely to be approved before formally applying, with no impact on their credit score.
Method
The methodology used for this project includes (1) exploratory data analysis, (2) bivariate analysis, and (3) multivariate correlation analysis.
Summary of the Notebook
This notebook serves as the exploratory-data-analysis (EDA) stage for a credit-card approval
project. Its goal is to answer two early-stage questions:
1. What does each variable look like on its own?
Univariate analysis cells profile every feature (categorical and numeric) through value-
frequency tables, histograms, box-plots, pie-charts, and summary statistics. These visuals
expose skew, outliers, dominant categories, and missing-value patterns so that later
cleaning and encoding choices are evidence-based.
2. How does a single feature interact with the target or with another feature?
In the bivariate analysis section, the notebook contrasts each variable against the binary
label Is high risk (default flag). Side-by-side box-plots, risk-segmented bar charts, and
grouped means reveal which characteristics—such as short employment history or certain
dwelling types—are disproportionately represented among bad applicants. These insights
help shortlist promising predictors and highlight relationships (e.g., high-risk applicants
having older accounts yet shorter job tenure) that modeling should capture.
Taken together, the univariate and bivariate explorations give stakeholders an intuitive, data-
driven picture of applicant demographics, financial attributes, and early risk signals before the
project moves on to multivariate modeling and machine-learning steps.
Notebook Walkthrough
0. Import required package
The first code cell, titled “0. import the necessary packages,” is the notebook’s staging
area: it loads every library required for the credit-card-approval project so that later sections
can focus entirely on data cleaning, modeling, and evaluation. General-purpose data
wrangling is handled by NumPy and pandas, while exploratory assistants such as
missingno and pandas-profiling make it easy to visualize gaps or anomalous values in the
dataset. Matplotlib and Seaborn, reinforced by scikit-plot and Yellowbrick, give the author
a full palette of plotting utilities—from quick histograms to ROC curves and feature-
importance bars—rendered inline through the %matplotlib inline magic.
Statistical testing and path management come next. SciPy's statistical tools (for
example, chi-square tests) support hypothesis checks on categorical variables, and
pathlib.Path offers OS-agnostic file handling. A comprehensive slice of scikit-learn
components then enters: splitters such as train_test_split and cross_val_score,
preprocessing helpers like ColumnTransformer, OneHotEncoder, and MinMaxScaler, plus
virtually every mainstream classification algorithm—from logistic regression and support-
vector machines through decision-tree ensembles and neural networks. Calibration, cross-
validation, permutation-based feature importance, and rich reporting utilities
(classification_report, ConfusionMatrixDisplay, ROC functions) are also pulled in so the
notebook can judge model quality on balanced grounds.
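A condensed sketch of what such an import cell might look like follows; the notebook's actual list is longer and may differ in detail.

import numpy as np
import pandas as pd
import missingno as msno                      # missing-value visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import scikitplot as skplt                    # ROC curves and other model plots
from yellowbrick.classifier import ROCAUC     # model-diagnostic visualizers
from scipy import stats                       # e.g., chi-square tests
from pathlib import Path                      # OS-agnostic file handling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (classification_report, ConfusionMatrixDisplay,
                             roc_curve, roc_auc_score)
%matplotlib inline                            # notebook magic: render plots inline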
1. Import and process data
In [2] – Load the two raw data files
The cell reads application_record.csv into cc_data_full_data, which holds the applicant-level
features, and credit_record.csv into credit_status, which contains month-by-month repayment
information for those same customers. Bringing both tables into memory is the foundation for
every transformation that follows.
In [3] – Engineer risk labels and merge account age
First, it determines each borrower’s oldest account by grouping credit_status and taking the
minimum MONTHS_BALANCE, then merges that “Account age” back onto the main
application data. Next, it flags any serious delinquency: statuses “2”, “3”, “4”, or “5” are marked
“Yes” in a temporary dep_value column. By re-aggregating credit_status at the customer level,
the cell collapses multiple monthly rows into a single “Yes/No” high-risk indicator, merges it
into cc_data_full_data, converts the text label to numeric (1 = high risk, 0 = low risk), and drops
the helper column. A chained-assignment warning is also suppressed for neatness.
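The logic of cells In [2] and In [3] might be approximated as follows. Variable and column names are taken from the walkthrough above; the boolean flag is a simplification of the notebook's "Yes"/"No" helper column, and other details may differ.

import pandas as pd

pd.options.mode.chained_assignment = None     # silence the chained-assignment warning

# In [2] -- load the two raw files
cc_data_full_data = pd.read_csv("application_record.csv")
credit_status = pd.read_csv("credit_record.csv")

# In [3] -- oldest account per customer: the minimum (most negative) MONTHS_BALANCE
account_age = credit_status.groupby("ID")["MONTHS_BALANCE"].min().reset_index()
cc_data_full_data = cc_data_full_data.merge(account_age, on="ID", how="left")

# Flag serious delinquency: STATUS values "2" through "5"
credit_status["dep_value"] = credit_status["STATUS"].isin(["2", "3", "4", "5"])
high_risk = credit_status.groupby("ID")["dep_value"].any().reset_index()
cc_data_full_data = cc_data_full_data.merge(high_risk, on="ID", how="left")

# Collapse to a numeric target: 1 = high risk, 0 = low risk; drop the helper column
cc_data_full_data["Is high risk"] = cc_data_full_data["dep_value"].fillna(False).astype(int)
cc_data_full_data = cc_data_full_data.drop(columns="dep_value")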
In [4] – Make column names human-readable
To improve clarity, this cell renames cryptic bureau codes such as CODE_GENDER or
DAYS_BIRTH into plain English equivalents like “Gender” and “Age.” It also renames the
previously added “Account age,” so the entire DataFrame now reads like a business-friendly
table.
In [5] – Define a reusable train/test split function
A small helper called data_split wraps train_test_split, taking the DataFrame and a test-size
fraction (here 0.2) and returning reset-index copies of the train and test subsets. Encapsulating
the logic keeps later code tidy and reproducible.
In [6] – Create the working train and test sets
Using the function above, the full application data is split 80 %/20 % into cc_train_original and
cc_test_original. From this point onward, modeling work proceeds on the training set while
performance will later be checked on the held-out test set.
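A plausible reconstruction of In [5] and In [6] is shown below; the random seed is an assumption, and cc_data_full_data comes from the earlier cells.

from sklearn.model_selection import train_test_split

# In [5] -- reusable split helper
def data_split(df, test_size):
    """Split df and return reset-index copies of the train and test subsets."""
    train, test = train_test_split(df, test_size=test_size, random_state=42)
    return train.reset_index(drop=True), test.reset_index(drop=True)

# In [6] -- 80%/20% split of the full application data
cc_train_original, cc_test_original = data_split(cc_data_full_data, test_size=0.2)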
In [7] – Quick sanity check: training-set shape
Simply prints cc_train_original.shape, letting the analyst verify that roughly 80 % of the original
rows landed in the training partition.
In [8] – Quick sanity check: test-set shape
Likewise prints cc_test_original.shape, confirming that the remaining 20 % of records are in the
test split.
In [9] – Persist the training data to disk
Saves cc_train_original as dataset/train.csv so that downstream notebooks or production
pipelines can load the identical training sample without rerunning the earlier preprocessing steps.
In [10] – Persist the test data to disk
Does the same for cc_test_original, writing it to dataset/test.csv for future out-of-sample
evaluation or model-comparison experiments.
In [11] – Protect the raw splits with working copies
Creates cc_train_copy and cc_test_copy, duplicates of the saved splits. Subsequent cleaning,
encoding, or feature-engineering operations can now proceed on the copies, while the pristine
originals remain untouched as a reference point.
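Cells In [9] through In [11] might look roughly like this; the mkdir call is an added safeguard rather than something confirmed in the notebook.

from pathlib import Path

# In [9]-[10] -- persist the splits for downstream notebooks
Path("dataset").mkdir(exist_ok=True)          # assumed safeguard: ensure the folder exists
cc_train_original.to_csv("dataset/train.csv", index=False)
cc_test_original.to_csv("dataset/test.csv", index=False)

# In [11] -- work on copies so the pristine splits stay untouched
cc_train_copy = cc_train_original.copy()
cc_test_copy = cc_test_original.copy()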
2. Basic analysis of the dataset
In [12] — Build and save an automated EDA report
This cell runs Pandas Profiling on the cleaned training set (cc_train_copy). The ProfileReport
object scans every column, computes descriptive statistics, correlation matrices, and missing-
value charts, and then renders them into a self-contained HTML file. A Path check ensures the
report isn’t regenerated if it already exists; otherwise it is written to
pandas_profile_file/income_class_profile.html. The result is a point-and-click exploratory
dashboard that can be opened in any browser for a deep dive into feature distributions and data-
quality issues.
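A sketch of the profiling cell, assuming the pandas-profiling API described above (the report title is a hypothetical placeholder):

from pathlib import Path
from pandas_profiling import ProfileReport    # newer releases ship as ydata_profiling

report_path = Path("pandas_profile_file/income_class_profile.html")
if not report_path.exists():                  # skip regeneration if the file exists
    report_path.parent.mkdir(parents=True, exist_ok=True)
    profile = ProfileReport(cc_train_copy, title="Income class profile")
    profile.to_file(report_path)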
In [13] — Quick visual peek at the data
Calling cc_data_full_data.head() displays the first five rows of the full, feature-engineered
application table. This gives the analyst a sanity check that the earlier merges and renaming
produced sensible, human-readable columns and that the new “Is high risk” target is present.
In [14] — Structural overview of the DataFrame
cc_data_full_data.info() prints each column’s dtype, the number of non-null observations, and
overall memory usage. It confirms that missing values have been handled as expected and shows
which variables are numeric vs. object (categorical), guiding later encoding and scaling
decisions.
In [15] / Out [15] — Numeric summary statistics
cc_data_full_data.describe() returns a table (shown as Out [15]) of count, mean, standard
deviation, and the 25th, 50th, and 75th percentiles for every numeric feature—including income,
age, employment length, and account age. Analysts use this snapshot to spot unreasonable
ranges, skewed distributions, or potential outliers before moving on to modeling.
3. Input the functions used to explore each feature/pillar
In [18] – Value-count helper
This cell adds a utility called value_cnt_norm_cal. Given a DataFrame and a column name, the
function returns a tidy two-column table that shows the absolute Count of each distinct value and
its Frequency (%) expressed as a percentage. It is the workhorse behind most categorical plots
and summary printouts that follow, sparing the author from rewriting the same value_counts() /
normalization logic over and over.
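A plausible reconstruction of the helper, based on the description above:

import pandas as pd

def value_cnt_norm_cal(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Return the count and percentage frequency of each distinct value."""
    counts = df[feature].value_counts()
    frequencies = df[feature].value_counts(normalize=True) * 100
    result = pd.concat([counts, frequencies], axis=1)
    result.columns = ["Count", "Frequency (%)"]
    return result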
In [19] – Quick feature profiler
gen_info_feat is a Swiss-army knife for on-the-fly exploration of a single feature. Using Python
3.10’s match … case syntax, the function tailors its behaviour to each variable: for Age it
converts the raw negative “days” into positive years before printing descriptive stats and plotting
a histogram; for categorical fields such as Education level or Dwelling it prints the value-
frequency table from In [18] and shows a bar chart; for numeric money variables it draws box-
and-histograms with scientific notation turned off. In short, one call delivers a concise textual
and visual profile of whichever column is under investigation.
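A simplified sketch of the dispatch pattern (Python 3.10+), reusing the value_cnt_norm_cal helper from the previous cell; the branches shown are illustrative, not the notebook's full logic.

import matplotlib.pyplot as plt
import seaborn as sns

def gen_info_feat(df, feature):
    """Profile one feature; dispatch on its name with Python 3.10 match/case."""
    match feature:
        case "Age":
            age_years = -df["Age"] / 365.25   # raw values are negative day counts
            print(age_years.describe())
            sns.histplot(age_years, kde=True)
        case "Education level" | "Dwelling":
            print(value_cnt_norm_cal(df, feature))        # helper from In [18]
            value_cnt_norm_cal(df, feature)["Count"].plot(kind="bar")
        case "Income":
            plt.ticklabel_format(style="plain", axis="x") # no scientific notation
            sns.histplot(df[feature], kde=True)
        case _:
            print(df[feature].describe())
    plt.title(feature)
    plt.show()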
In [20] – Pie-chart generator
create_pie_plot builds an “at-a-glance” pie chart for selected categorical attributes (e.g.,
Dwelling, Education level). It first grabs the percentage distribution via the helper in [18], then
feeds those percentages into plt.pie, formats the legend, enforces equal aspect so the circle isn’t
distorted, and titles the figure. Because credit datasets often have imbalanced classes, seeing the
relative share of, say, “Rented” vs “Owned” housing in one picture can be more intuitive than a
bar chart.
In [21] – Bar-chart generator
Complementing the pie routine, create_bar_plot produces vertical bar charts that show raw
counts for high-cardinality or business-critical categoricals—marital status, dwelling type, job
title, employment status, education level, and others. Tick labels are rotated and right-justified
for readability, and the same function falls back to a generic branch (case _:) so it will sensibly
plot any categorical column passed to it.
In [22] – Box-plot generator
create_box_plot focuses on the spread and outliers of numeric features. It again branches on the
feature name so that each variable is rendered with units users understand (e.g., Age converted to
years, Employment length converted from negative days to positive years, and incomes shown
with thousands separators). For discrete counts such as number of children it sets integer y-ticks,
while for money values it disables scientific notation. The result is a clean, vertically oriented
boxplot that highlights skewness and extreme observations.
In [23] – Histogram generator
create_hist_plot provides a complementary look at distribution shape. Like the box-plot helper, it
converts and formats special variables (Age, Income, Employment length) before calling
sns.histplot, overlays a kernel-density estimate, and allows the caller to specify the number of
bins. This is the go-to tool for diagnosing normality, skew, or multimodality in any numeric
column.
In [24] – High-risk vs low-risk boxplot
low_high_risk_box_plot dives deeper by splitting the numeric variable of interest (currently Age
or Income) into two groups—borrowers flagged Is high risk = 1 vs those flagged 0. It prints the
mean of each group for quick reference, then draws a side-by-side boxplot so analysts can see if,
for example, higher incomes coincide with fewer defaults or if older applicants have a different
risk profile.
In [25] – High-risk vs low-risk bar chart
Finally, low_high_risk_bar_plot serves the categorical analogue: it groups the data by a chosen
categorical feature, sums the Is high risk indicator to count how many risky customers sit in each
category, sorts those counts in descending order, prints the underlying dictionary, and renders a
bar chart. This immediately spotlights, say, which employment statuses or dwelling types
harbour the largest share of delinquent applicants, guiding subsequent feature engineering or
policy rules.
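A simplified reconstruction of this function, based on the behavior described above:

import matplotlib.pyplot as plt

def low_high_risk_bar_plot(df, feature):
    """Count high-risk customers per category and plot them in descending order."""
    risky_counts = (df.groupby(feature)["Is high risk"]
                      .sum()
                      .sort_values(ascending=False))
    print(risky_counts.to_dict())             # underlying counts for quick reference
    risky_counts.plot(kind="bar")
    plt.ylabel("Number of high-risk customers")
    plt.xticks(rotation=45, ha="right")       # rotated, right-justified tick labels
    plt.show()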
Together, cells 18-25 equip the notebook with a full exploratory-data-analysis toolkit—tables,
pies, bars, histograms, and risk-segmented boxplots—that can be invoked repeatedly without
cluttering the main narrative of the credit-risk project.
4. Run Univariate Analysis
Core variables
• Gender
• Age
• Marital status
• Children count
• Income
• Employment status
• Employment length
• Education level
• Has a property (yes/no)
• Account age
5. Bivariate Analysis (Correlation Test)
• Correlation of Age vs Features
• Correlation of Income vs Features
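These checks could be run with SciPy's point-biserial statistic, which is appropriate for correlating a numeric feature with a binary flag. The sketch below assumes the column names from the walkthrough and the cc_train_copy DataFrame from the earlier cells.

from scipy import stats

# Point-biserial correlation of each numeric feature with the binary default flag
for feature in ["Age", "Income"]:
    r, p_value = stats.pointbiserialr(cc_train_copy["Is high risk"],
                                      cc_train_copy[feature])
    print(f"{feature}: r = {r:.3f}, p = {p_value:.3f}")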
Key Findings of this Notebook
A representative customer in this dataset is a woman around 40 years old who is married or cohabiting and has no children. She has worked for roughly five years, earns about $157k annually, and finished secondary school. While she does not own a car, she does
possess residential real estate (such as a house or flat) and her credit account has been open for
about 26 months.
Statistical tests indicate that neither age nor income shows a meaningful correlation with
the default flag. Borrowers classified as high-risk generally have shorter job tenures and longer-
standing accounts, yet they make up less than two percent of all observations. In contrast, the bulk
of applicants are aged 20–45 and hold accounts that have been active for 25 months or less.
Implication for the Future
The disciplined, pillar-by-pillar EDA framework used in this notebook lays a foundation
for explainable, regulator-ready credit-risk models. Because every variable is first profiled on its
own, then contrasted with the default flag, analysts can trace exactly why a feature is included,
how it behaves across segments, and whether it introduces bias. Embedding those checks as
reusable functions means the same diagnostics can run automatically when new data arrive or
when a model drifts, turning what is often a one-off exploratory step into a living set of controls
that satisfy both internal risk committees and external auditors.
Looking ahead, the modular design also accelerates feature expansion and alternative-data
experimentation. New signals (for example, utility-payment history or mobile-device metadata)
can be dropped into the univariate/bivariate template and instantly subjected to the same scrutiny
as traditional bureau fields. That consistency shortens the path from raw idea to production model
while guarding against “black-box” pitfalls. As lenders move toward real-time approvals and
dynamic credit limits, a methodology that couples rapid exploration with transparent governance
will be crucial for scaling machine-learning risk engines without sacrificing trust or compliance.
Acknowledgement
The notebook used for this report is a derived and simplified version of the project called
“Credit-card-approval-prediction-classification” by Stern Semasuka (username: @semasuka)
from GitHub.