
System for Alternative Credit Scoring using

Unstructured Non-Banking Data and


Deep Learning Architectures

Shivam Bansal
National University of Singapore

Abstract

Conventional methods for credit risk assessment of SMEs (small and medium enterprises) use historical credit information along with banking and transactional data. Most financial institutions and banks rely solely on financial information to evaluate the creditworthiness of business entities. A major downfall of this approach is the cold start problem: new business entities are almost impossible to evaluate because of a lack of data. Relying only on static banking data also means that a comprehensive profile of a business entity is never obtained, and highly valuable non-banking data, particularly data in unstructured form, is missed. In this paper, we present a novel system that exploits non-banking unstructured data available on the open web to improve conventional credit scoring models. The system uses a generic data extraction module that mines unstructured data at scale, together with state-of-the-art deep learning and language processing architectures such as Context Free Grammar parsers and Bidirectional Encoder Representations from Transformers (BERT), to derive structured entities and sentiments from text. All the derived information is fed into a statistical modelling architecture. We show that the use of additional non-banking data improves the credit profiles of business entities and yields significant improvements in credit scoring model performance. The results also show an increased relative feature importance of the derived features.

1. Introduction

As part of business opportunity identification and lead generation for loans, banks and financial institutions need to deeply understand and evaluate companies. This evaluation is especially critical for SMEs (small and medium enterprises). In banking terminology, this evaluation is called creditworthiness evaluation and the process is called credit risk scoring [Abdou, HAH and Pointon, J, 2011]. The goal of credit risk scoring is to quantify the amount of risk associated with a business entity using statistical and data mining techniques. The risk relates to loan repayment: a company with a high credit risk score has a lower chance of repaying the loan, and vice versa.

Credit scoring has been a standard problem in data mining, data science and financial analytics for the past few decades. It has been regarded as a core evaluation tool used by different institutions and has been widely investigated in areas such as banking, accounting, and insurance. Various scoring techniques are used for classification and prediction, where statistical techniques have conventionally dominated. A number of studies have aimed to solve this problem. Promising results were shown by Atiya, whose team used neural networks to derive accurate credit scores for consumers, training the models on customers' banking data [Atiya, A. F., 2001]. In 2009, Bellotti and Crook demonstrated the use of support vector machines for credit scoring and for the discovery of significant features [Bellotti, T., Crook, J., 2009]. Cramer (2004) presented a case study identifying cases where scoring bank loans may go wrong [Cramer, 2004]. The use of data mining techniques such as classification, clustering, and ensembling has since increased and become dominant in the field.

While techniques and methodologies of credit scoring have progressively advanced over the past few decades, one theme has remained the same: the use of banking data for credit scoring. In one interesting study, Sustersic, Mramor and Zupan performed credit scoring with limited available credit data, but they did not explore non-banking data to supplement the dataset [Sustersic, M., Mramor, D., Zupan, J., 2009].

Banking data comprises features of financial importance. It can be divided primarily into two categories: financial static factors and previous financial interactions. Financial static factors describe companies and business units in terms of their bank balances and financial ratios, for example: Equity Share Capital, Current Liabilities, Cash Equivalents, Current Assets, Total Assets, Net Profit Margin, Current Ratio, Quick Ratio, and Total Debt / Equity. These financial heads are obtained from the company's balance sheets and annual, quarterly, or monthly reports. The second type of banking data is defined only for existing customers and concerns their past interactions with the bank or financial institution. Examples include past payments, amounts owed, credit history, and the duration since availing new credit. These factors give information about regular payments on past loans, past problems, delays in payments, or defaults. They also describe the financial behaviours of customers that are relevant to explaining performance on the amount owed. For example, customers who keep high outstanding amounts get lower credit scores, whereas customers with a longer credit history (i.e. they have been taking and repaying loans for many years) tend to get a better score because they provide richer data for assessing creditworthiness.

In the paper "All data is credit data: Constituting the unbanked", Rob Aitken suggested that credit scoring should not be limited to banking data, and that a variety of alternative datasets can be used to supplement the scoring models [Rob Aitken, 2017]. Relying only on banking data creates several issues and challenges. One of them is that credit assessment becomes almost impossible for businesses that do not have comprehensive banking data, which means a number of potential customers are ignored for loans. Beyond new customers, there are also cases in which existing customers have relatively few past transactions, so the credit scoring models give less accurate results.

For SMEs, financial accounts are often not reliable, and it is up to the owner whether to withdraw or retain cash. There are other issues as well: small companies are affected by the good or bad financial status of their partners, so monitoring an SME's counterparts is another way of scoring it. Additionally, small businesses have a major and growing share of the world economy, so accurate SME scoring is a major concern. In a whitepaper from Moody's, the authors discuss seven major challenges in the assessment of SMEs [Moody's, 2016].

In modern practice, banks do use online information before making a final decision. However, the existing approaches are heavily manual, with humans relying on desktop research. These processes are not only time-consuming but also add heavy human bias to the final score. In this paper, we explore alternative data, also called non-banking data. Alternative credit scoring is a powerful way to introduce new customers to the credit and financing landscape, and the broader inclusion opportunities it provides explain why it is such an important concept, especially in developing nations.

Past work has used alternative data to improve credit scoring models. Alain Shema (2019) presented a paper on effective credit scoring using limited mobile phone data [Alain Shema, 2019]. In other work, Bhattacharya and colleagues used network science and analytics for predicting loan defaults in microfinance using behavioral sequences [Bhattacharya, P., Mehrotra, R., Tan, T., and Phan, T.Q., 2018]. In 2016, the same authors also used mobile data for credit-worthiness prediction in microfinance, applying a spatio-network approach to generate the alternative data [Bhattacharya, P., and Phan, T.Q., 2016].

In our proposed system, we focus on unstructured data, mainly the digital footprints of companies, as the main source of non-banking data. This data includes meta information about the companies on different webpages, the companies' own websites, and common information providers such as wiki pages, Mashable, Bloomberg, Reuters, etc. Historical news data from multiple known sources is also added as a mainstream data point, along with any other information obtained from social media and common websites. In similar work, researchers from the Central University of Finance and Economics, Beijing used social media data to improve credit scoring, applying limited statistical machine learning and natural language processing techniques [Yuejin Zhang, 2016].

Our hypothesis for using this particular unstructured dataset is that it surfaces negative or pessimistic information about companies that is not reflected in the banking data. For example, a company may have good banking metrics but may have been in the news for toxic reasons. Another example is negative news in the past about the directors or management personnel. Similarly, there could be an overall negative sentiment about the company on social media websites or personal blogs, and some bloggers may have described toxic practices carried out by the company in the past.

All of these examples suggest that online information about a company serves as a proxy for its digital profile. Even though a company may have a good bank balance or past financial interactions, it may not have a good digital reflection. Hence, adding such information to credit scoring models can significantly improve the credit scoring of these companies. Another hypothesis is that any relevant digital information that can be found by a human searcher can also be generated by a web crawling engine, processed by an NLP engine to extract the core concepts, and then fed into a machine learning algorithm for risk scoring. This pipeline can help banks understand a company, as well as its whole industry, from multiple angles.

Since most of this data is unstructured in nature and cannot be used directly in statistical machine learning models, we used state-of-the-art natural language processing and deep learning architectures to perform entity extraction and text classification. To enrich the dataset, we derived two types of sentiment associated with companies: entity-specific sentiment and sentence-level overall sentiment. The details of these architectures are shared in section 3. In the last part, a machine learning modelling architecture is explained in which data from the banking side and the non-banking side is combined and different statistical models are trained to give a final risk score. The key intuition is that this risk score is an improved reflection of the creditworthiness of the company.

2. Literature Review

The major factors in traditional credit scoring models include basic information, repayment ability, life stability, credit record, guarantees and other factors. During the 2008 recession, Herzenstein et al. found that borrowers' financial strength and their effort when listing and publicizing a loan are more important than demographic attributes for funding success; they shared the empirical results in their paper "The democratization of personal consumer loans? Determinants of success in online peer-to-peer lending communities" [Herzenstein et al., 2008]. Qiu et al. showed that personal information, social capital, loan amount, acceptable maximum interest rate and loan period set by borrowers are all significant factors of funding success in their paper "Effects of borrower defined conditions in the online peer-to-peer lending market" [Qiu et al., 2012]. Emekter et al. used Cox proportional hazards regression to evaluate credit risk and measure loan performance, finding that credit grade, debt-to-income ratio, FICO score and revolving line utilization play an important role in loan defaults, in their paper "Evaluating credit risk and loan performance in online Peer-to-Peer (P2P) lending" [Riza Emekter et al., 2015]. Everett analyzed the relationship between social relationships, default risk and interest rates in online P2P lending, and concluded that the default rate is low among group members who have an actual social relationship on a mutual financing platform; this work was presented in the paper "Group membership, relationship banking and loan default risk: the case of online social lending" [Everett, 2010]. Collier et al. made an empirical analysis of the financing behavior between members of community groups on a P2P lending platform, and confirmed that by combining individual reputation with the reputation of the group, group members can supervise each other, which effectively reduces adverse selection [Collier et al., 2010].

Modern data mining techniques, which have made a significant contribution to the field of information science [Chen & Liu, 2004], can be adopted to construct credit scoring models. Banks and financial institutions have developed a variety of traditional statistical models and data mining tools for credit scoring, including linear discriminant models [Reichert, Cho, & Wagner, 1983], logistic regression models [Henley, 1995], k-nearest neighbor models [Henley & Hand, 1996], decision tree models [Davis, Edelman, & Gammerman, 1992], neural network models [Desai, Crook, & Overstreet, 1996; Malhotra & Malhotra, 2002; West, 2000], and genetic programming models [Ong, Huang, & Tzeng, 2005]. According to the computational results of Tam and Kiang [Tam and Kiang, 1992], the neural network is the most accurate in bank failure prediction, followed by linear discriminant analysis, logistic regression, decision trees, and k-nearest neighbors.

Recently, researchers have proposed hybrid data mining approaches in the design of effective credit scoring models. Hsieh proposed a hybrid system based on clustering and neural network techniques [Hsieh, 2005]; Lee and Chen proposed a two-stage hybrid modeling procedure with artificial neural networks and multivariate adaptive regression splines [Lee and Chen, 2005]; Lee, Chiu, Lu, and Chen integrated backpropagation neural networks with a traditional discriminant analysis approach [Lee, Chiu, Lu, and Chen, 2002]; Chen and Huang presented work on two interesting credit analysis problems and resolved them by applying neural network and genetic algorithm techniques [Chen and Huang, 2003].

Lipika Dey and Ishan Verma (2011) suggested a framework to combine unstructured data with structured data in order to train machine learning models on a single dataset. The state-of-the-art deep learning and natural language processing architecture called BERT (Bidirectional Encoder Representations from Transformers) was released by Google AI in 2018; the BERT architecture achieved state-of-the-art results on 11 major NLP tasks [Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018].

3. Approach : Data and Methodology

In this paper, we propose a robust system to perform alternative credit scoring. The system takes company information, such as name and country, as input and returns a credit score based on deep analysis of both structured and unstructured data. The proposed system comprises three interconnected components: the Data Extractor Module, the Natural Language Processing engine, and the Scoring Model Pipeline.

Figure 1: The three-component workflow of the scoring engine

The first component mines as much data as possible about SMEs from the open web. The second component derives structured information from the data. The third component provides a scoring model in which multiple predictive models are trained and then combined in a stacking architecture, which gives slightly improved accuracy. The model results are evaluated using a validation dataset of unseen data.

3.1 Dataset Preparation

For this project, an anonymised and masked dataset of companies, their metadata, and their banking data was obtained from a local bank. The dataset covered 10,000 companies. For the purpose of this paper, the names of the companies have been masked to ensure confidentiality.

1. The total size of the dataset was 10,000 rows and 16 columns, including one ID column and one target column.
2. The dataset contains 9 numerical features related to the bank and financial data of the companies. These features mainly relate to past amounts, current balances, average transactions, median spends, etc. A statistical snapshot of these columns is shown in Figure 2:

Figure 2: Statistical summary of the numerical columns present in the dataset

3. The dataset also contained 5 categorical features related to company metadata such as industry or sector. One feature was the country, which had 10 distinct values, and another was the industry, which had 9 distinct values: Mining, Solar, Consumer Goods, Oil and Gas, Logistics, Shipping, Pharmaceuticals, Transporters, and Construction.

Figure 3: Distribution of Categorical Variables in the dataset

4. The target variable follows a slightly imbalanced distribution: more than 70% of the companies were tagged as 0 and the rest were tagged as 1. The target variable is an indication of creditworthiness; if a company failed to pay its loan installment in any of the first 12 months, it was tagged as 1, otherwise it was tagged as 0.

Figure 4: Distribution of Target Variable

The dataset was relatively clean and did not require extensive cleaning, because it was the bank's production-ready data, already used for other services such as personalization, recommendations, insights and dashboards. To store the entire data, a data lake architecture was also developed as part of the system.

5. The dataset was split into two sets: a training set and a validation set. The training set was used to train the credit scoring model and the validation set was used to evaluate the model's performance on unseen data.

6. As the next step, we enriched the dataset by mining unstructured data about the companies from the open web. The first step was to prepare input queries. Every input query contained the name of the company and its operating country, separated by a space. Many company names contained noisy tokens, so a preprocessing snippet was developed to clean the company names and fix issues such as HTML tags, special characters, and different text encodings.
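A minimal sketch of this query-preparation step is given below; the function name and the exact cleaning rules are illustrative assumptions rather than the production snippet.

import html
import re
import unicodedata

def build_query(company_name: str, country: str) -> str:
    # Clean a raw company name (HTML entities/tags, special characters,
    # mixed encodings) and join it with the operating country.
    text = html.unescape(company_name)                # decode HTML entities
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags
    text = unicodedata.normalize("NFKC", text)        # standardize text encodings
    text = re.sub(r"[^A-Za-z0-9&.\- ]+", " ", text)   # drop special characters
    text = re.sub(r"\s+", " ", text).strip()          # collapse whitespace
    return f"{text} {country.strip()}"

print(build_query("ACME &amp; Sons <b>Pte Ltd</b>", "Singapore"))
# -> "ACME & Sons Pte Ltd Singapore"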

The data mining engine made programmatic search requests to different search engines and obtained a set of URLs. These links were stored in the local file system as temporary intermediate files. Next, every link was analyzed and the main company website was identified; the identification algorithm applied systematic rules to check patterns in the links. After this, the module performed link branching, meaning that from one page of the company's website the system also extracted other relevant pages such as the about, contact, news, and history pages, to enrich the information in a comprehensive manner. Along with this, search engines were queried to obtain news links about the company from the past 20 years, and every news link was scraped to extract the news headline, body, author and date of publication. All of this text data was stored in individual text files on local disk storage.
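The link-branching step can be sketched as follows; the keyword list and function name are illustrative assumptions, not the exact rules of the production module.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

BRANCH_KEYWORDS = ("about", "contact", "news", "history", "media", "press")

def branch_links(homepage_url: str) -> list:
    # Collect same-domain subpages whose URLs look like about/contact/news/history pages.
    html_text = requests.get(homepage_url, timeout=10).text
    soup = BeautifulSoup(html_text, "html.parser")
    domain = urlparse(homepage_url).netloc
    found = set()
    for anchor in soup.find_all("a", href=True):
        url = urljoin(homepage_url, anchor["href"])
        if urlparse(url).netloc != domain:
            continue  # keep only pages on the company's own site
        if any(kw in url.lower() for kw in BRANCH_KEYWORDS):
            found.add(url)
    return sorted(found)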

From the unstructured text data, the system then extracted named entities, specifically mentions of company names, using the NLP engine. Additionally, all text objects in which the input company name was mentioned were tagged separately. For these sentences, entity-wise sentiment analysis was performed along with sentence-level sentiment using state-of-the-art deep learning architectures.

In the end, for the 10,000 companies a total of 56,341 webpages were extracted from the open web. Of these, 19,653 were news links obtained from sources such as Bloomberg, Yahoo Finance, Reuters, Mashable, etc. From these news links, there were 4,943 instances where the input company name was mentioned; these were the relevant sentences used to derive negative sentiment and entity-specific sentiment. Entity-specific sentiment was classified as one of four types of risk, which became a single feature. Figure 5 shows the data lake architecture:

Figure 5: Data Lake Architecture
3.2 Data Extractor Module

The banking data consists of only static details and is enriched with non-banking attributes. The data extractor module is responsible for performing automated searching on the open web and obtaining all the relevant information. It works as a pipeline in which the first step is to define an input query, prepared by combining the company name and the operating country into a single string. The linkfinder module performs automated searches on multiple search engines such as Google, Bing, and Yahoo. These searches are triggered programmatically using the Python programming language. In the linkfinder module, different methods are incorporated to avoid inaccuracies and discrepancies in the results. Figure 6 shows the detailed pipeline that forms part of the data extractor module.

Figure 6: Workflow of the Data Extraction Engine

First, the linkfinder module performs a Google search using the Custom Search API, which has a free usage tier. If relevant results are obtained they are saved; otherwise the engine uses the google-search package from Python to search Google programmatically. It then also tries scraping Google carefully and systematically using the requests library. Finally, if none of the above steps generate the desired output, possibly due to rate limits or blocking, a Selenium wrapper (a web automation tool) is used to mimic the human searching process and obtain the results. The output is a list of URLs, which are saved in local file storage.
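A simplified sketch of this fallback chain is shown below. The first call uses Google's documented Custom Search JSON API endpoint; the later fallbacks (the third-party googlesearch package and a Selenium wrapper) are assumptions about the packages the text only names, so treat them as placeholders.

import requests

def find_links(query: str, api_key: str, cx: str, num: int = 10) -> list:
    # 1) Google Custom Search JSON API
    try:
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": api_key, "cx": cx, "q": query, "num": num},
            timeout=10,
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if items:
            return [item["link"] for item in items]
    except requests.RequestException:
        pass

    # 2) Fallback: programmatic search via the third-party googlesearch package
    try:
        from googlesearch import search
        return list(search(query, num_results=num))
    except Exception:
        pass

    # 3) Last resort: a Selenium-based browser automation wrapper
    #    (omitted here; it mimics a human search and scrapes the result links).
    return []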

The next step in the data extractor pipeline is HTML extraction. In this step, all the links obtained in the previous part are passed to a link_request_module which makes a programmatic request to each link, one by one, and obtains its HTML text. Depending on the URL, this module makes a GET or POST request and stores the response in files. Finally, the HTML is parsed using Python's beautifulsoup library in a systematic, step-by-step manner. This module does not extract all the text from an HTML page; rather it extracts the main portion of the page by checking specific HTML tags. Some tags are given high priority, such as body, div and p, while others such as script, style and title are ignored.
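A minimal sketch of this extraction step, assuming the requests and beautifulsoup libraries mentioned above (the function name and exact tag rules are illustrative simplifications of the module's priority rules):

import requests
from bs4 import BeautifulSoup

def extract_main_text(url: str) -> str:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove low-priority tags entirely.
    for tag in soup(["script", "style", "title", "noscript"]):
        tag.decompose()

    # Prefer paragraph text; fall back to the whole body if no <p> tags exist.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    if not paragraphs and soup.body is not None:
        paragraphs = [soup.body.get_text(" ", strip=True)]
    return "\n".join(t for t in paragraphs if t)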

3.3 State of the Art NLP Engine

Most of the non-banking data obtained from the web, documents, and databases is raw and unstructured in nature. Before using it for analysis or modelling purposes it is necessary to convert it into a structured form. In this component of the system, a natural language processing engine is developed. The NLP engine is used for text cleaning, entity extraction, and text classification for sentence sentiment and entity sentiment.

a. Text Cleaning: The first step in the NLP engine is text cleaning. A pipeline is developed that takes raw text as input, applies several cleaning techniques, and produces cleaned text as output. The key techniques are: removal of HTML entities, removal of special characters, standardization of text encodings, and removal of unwanted spaces, delimiters, and slang. All of this noise is captured in raw form from the web during the mining stage; the text cleaning task removes it and produces a cleaned dataset for analysis and modelling.
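A minimal sketch of such a cleaning pipeline is shown below; the slang list and exact regular expressions are illustrative assumptions.

import html
import re
import unicodedata

SLANG = {"u": "you", "plz": "please", "thx": "thanks"}

def clean_text(raw: str) -> str:
    text = html.unescape(raw)                              # HTML entities -> characters
    text = re.sub(r"<[^>]+>", " ", text)                   # drop leftover HTML tags
    text = unicodedata.normalize("NFKC", text)             # standardize text encodings
    text = re.sub(r"[^A-Za-z0-9.,;:!?'\- ]+", " ", text)   # remove special characters
    words = [SLANG.get(w.lower(), w) for w in text.split()]  # expand simple slang
    return re.sub(r"\s+", " ", " ".join(words)).strip()    # collapse spaces and delimiters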

b. Entity Extraction: Entity extraction, also known as entity name extraction or named entity recognition, is an information extraction technique that refers to the process of identifying and classifying key elements of text into predefined categories. In this way, it helps transform unstructured data into structured, machine-readable data available for standard processing such as information retrieval, fact extraction and question answering. In our use case, we were primarily interested in identifying company names as the entities mentioned in texts from social media, news, articles, and blogs. For this purpose, we incorporated the popular spaCy package for Python in our NLP engine. SpaCy is an open-source library for advanced Natural Language Processing in Python. It is designed specifically for production use and helps build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. Some of the features provided by spaCy are tokenization, part-of-speech (PoS) tagging, text classification and named entity recognition. SpaCy provides an exceptionally efficient statistical system for NER in Python, which can assign labels to contiguous groups of tokens. It provides a default model which can recognize a wide range of named or numerical entities, including person, organization, language, event, etc. Apart from the default entities, spaCy also gives a flexible option to add arbitrary classes to the model.
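A minimal sketch of this step with spaCy's default English model is shown below; the example sentence is made up.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Company X, based in Singapore, was fined after an audit by Regulator Y."
doc = nlp(text)

# Keep organization mentions, which serve as candidate company names.
company_mentions = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(company_mentions)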

c. Entity and Sentence Sentiment

The primary goal of performing entity-wise sentiment analysis in our project is to accurately identify whether the company (the entity extracted from the text) is associated with any negative sentiment or risk terms. We used two different approaches for entity-wise and sentence-level sentiment scores: one uses pure natural language processing and the other uses novel deep learning architectures. The output of this module is not only the polarity score of an overall sentence but also an entity-wise sentiment, i.e. the attitude expressed towards the target entity with respect to a context (a check for negative sentiment) within a sentence.

d. Context Free Grammar based Entity Sentiment

The text data obtained from news, articles and blogs is tokenized into paragraphs and then into sentences. We filtered the sentences which contained at least one company name obtained from entity extraction, and iterated over them one by one. First, every input sentence is analysed and the "subjective information" such as subject, object, and verb is identified and extracted from it. This information is obtained by generating grammar trees based on dependency grammar for every sentence; we used a mix of Stanford CoreNLP and spaCy for this purpose. The Stanford typed dependencies representation is designed to provide a simple description of the grammatical relationships in a sentence that can easily be understood and effectively used by people without linguistic expertise who want to extract textual relations. In particular, rather than the phrase structure representations that have long dominated the computational linguistics community, it represents all sentence relationships uniformly as typed dependency relations.

The grammar relations, or dependencies, are mapped onto a directed graph with words as the nodes and the relations as the edges. The graph is recursively parsed in a bottom-up manner.

For example: “Bell, based in Los Angeles, makes and distributes electronic,
computer and building products”. For this sentence, the Stanford
Dependencies (SD) representation is:

Figure 7: Graphical representation of the Stanford Dependencies for the example sentence, Ref [Stanford, 30]

While parsing the graph, each word (leaf and root nodes) is analysed and, based on the word itself, its part of speech and its grammatical dependency relation, it is categorised as part of the source actor, part of the target actor, or part of the context/verb. The remaining words, which are not part of the actor groups, are looked up in a large bag of words and labelled positive, negative or neutral, and a corresponding value (+1, -1, 0) is assigned to them. The grammatical dependency relations are also used to decide a "sentiment factor" for each word, which is used to intensify or negate the calculated scores. For a subtree, the word values are added and the factors are multiplied to give an overall score. When all the subtrees have been parsed, the output is a subject, object, verb triplet with a sentiment score.
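The sketch below is a heavily simplified, illustrative version of this scoring idea using spaCy's dependency parse; the lexicon, negation list and scoring rules are assumptions and far smaller than the engine's actual bag of words and sentiment factors.

import spacy

nlp = spacy.load("en_core_web_sm")

LEXICON = {"improvement": 1, "growth": 1, "allege": -1, "crime": -1, "fraud": -1}
NEGATORS = {"not", "no", "never"}

def entity_sentiment(sentence: str, company: str) -> float:
    # Locate the company mention, walk the subtree of its governing head,
    # add lexicon values and flip the sign when the head is negated.
    doc = nlp(sentence)
    parts = set(company.lower().split())
    mention = next((tok for tok in doc if tok.lower_ in parts), None)
    if mention is None:
        return 0.0
    clause_head = mention.head
    factor = -1.0 if any(t.lower_ in NEGATORS for t in clause_head.children) else 1.0
    return sum(factor * LEXICON.get(tok.lemma_.lower(), 0) for tok in clause_head.subtree)

print(entity_sentiment("One of Company X's directors was alleged of a crime.", "Company X"))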

Example: "Company X has shown improvements in customer acquisition, however one of their directors was alleged of a crime." In this example, there are two chunks: "Company X has shown improvements in customer acquisition", which receives a moderately positive sentiment, and "one of their directors was alleged of a crime", which receives a negative sentiment. This engine produces the following features: the number of positive news items, the number of negative news items, the number of times a company was mentioned in news and Twitter posts, and the overall sentiment associated with the company. In the next part, these features are further enhanced with sentence-level sentiment along with aggregated values.
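These aggregated features can be derived with a simple group-by over the per-sentence scores; the column and feature names below are illustrative assumptions.

import pandas as pd

scores = pd.DataFrame({
    "company_id": ["C1", "C1", "C1", "C2"],
    "sentiment":  [0.6, -0.8, -0.2, 0.4],
})

features = scores.groupby("company_id")["sentiment"].agg(
    n_mentions="count",
    n_positive_news=lambda s: (s > 0).sum(),
    n_negative_news=lambda s: (s < 0).sum(),
    overall_sentiment="mean",
).reset_index()

print(features)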
e. State of the Art Deep Learning for Sentiment Classification

To improve the sentiment analyzer for different chunks or entities in a sentence, we also trained and fine-tuned state-of-the-art deep learning architectures for text classification. Specifically, we focused on Bidirectional Encoder Representations from Transformers, a.k.a. the BERT architecture. The BERT framework, a language representation model from Google AI, uses pre-training and fine-tuning to create state-of-the-art NLP models. BERT uses the concepts of Transformers and self-attention, which are explained below.

Figure 8: Role of the BERT-based architecture for sentiment classification

Transformers: The Transformer is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies. The Transformer was proposed in the paper "Attention Is All You Need" [A Vaswani, 2017]. The idea is to handle the dependencies between input and output entirely with attention, dispensing with recurrence. The Transformer architecture consists of an encoder and a decoder.

Figure 9: Encoder-decoder architecture for attention


The encoder and decoder blocks are actually multiple identical encoder and decoder layers stacked on top of each other; both stacks have the same number of units. The word embeddings of the input sequence are passed to the first encoder, transformed, and propagated to the next encoder. The output from the last encoder in the encoder stack is passed to all the decoders in the decoder stack.

Self-Attention: The Transformer uses an important concept called attention, specifically self-attention. "Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence." It helps the model focus on the appropriate parts of the input sequence.

The inner workings of self-attention are as follows. First, three vectors are created for each input token: a query vector, a key vector, and a value vector; the projections that produce these vectors are learned during training. Next, the self-attention for every word in the input is calculated: for a phrase, scores are computed for all the words in the phrase with respect to a particular word, and these scores determine the importance of the other words when encoding that word. The score for the first word is calculated by taking the dot product of its query vector (q1) with the key vectors (k1, k2, k3) of all the words. These scores are divided by the square root of the dimension of the key vector and normalized with a softmax. The normalized scores are then multiplied by the value vectors (v1, v2, v3) and the resulting vectors are summed to arrive at the final vector (z1). This is the output of the self-attention layer, which is passed to the feed-forward network as input. So z1 is the self-attention output for the first word of the input sequence, and the vectors for the remaining words are obtained in the same way.
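The following small numerical sketch reproduces this computation for a toy three-word sequence with random values (illustrative only).

import numpy as np

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))            # token embeddings (3 words)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # query, key, value vectors

scores = Q @ K.T / np.sqrt(d)          # dot products scaled by sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

Z = weights @ V                        # weighted sum of value vectors
print(Z[0])                            # z1: self-attention output for word 1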

An important property of this architecture is that self-attention is computed not once but multiple times, in parallel and independently; it is therefore referred to as multi-head attention. "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions."

BERT: Google released two variants of the model: BERT Base, with 12 Transformer layers and 110M total parameters, and BERT Large, with 24 Transformer layers and 340M total parameters. In our project, we used BERT Base. BERT uses a multi-layer bidirectional Transformer encoder, as its self-attention layers attend in both directions. BERT achieves bidirectionality by pre-training on two important tasks: Masked Language Modeling and Next Sentence Prediction.

Masked Language Modeling (MLM): The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of each masked word based only on its context. "Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer." The task is thus to predict the masked words.
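For illustration, the masked-word prediction behaviour of a pre-trained BERT model can be inspected with the Hugging Face transformers library (not part of the original system); the example sentence is made up.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The company failed to [MASK] its loan on time."):
    print(prediction["token_str"], round(prediction["score"], 3))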

Next Sentence Prediction: Standard language models do not capture the relationship between consecutive sentences. BERT is pre-trained on this task using pairs of sentences as training data; the task is to predict whether the second sentence actually follows the first.

BERT demonstrates that unsupervised pre-training followed by fine-tuning is going to be a key element in many language understanding systems. Low-resource tasks especially can reap huge benefits from these deep bidirectional architectures. Figure 10 shows a snapshot of a few NLP tasks where BERT plays an important role:

Figure 10: Performance of BERT on different NLP tasks

We used BERT Base for sentiment classification of the sentences in which entities were mentioned. Finally, we merged it with the NLP-based sentiment module to give an improved sentiment score along with the entity-wise sentiment score. The output of the NLP engine was a set of features which were added alongside the raw features.
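A minimal sketch of applying a BERT Base classifier to such sentences with the Hugging Face transformers API is shown below; the generic pre-trained checkpoint stands in for the fine-tuned model actually used, and the example sentence is made up.

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

sentences = ["One of Company X's directors was alleged of a crime."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
probs = torch.softmax(logits, dim=-1)   # class probabilities per sentence
print(probs)                            # (label order is set at fine-tuning time)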

3.4 Credit Scoring Modelling Pipeline

In the last part of the system, a scoring pipeline is developed. Figure 11 shows the credit risk scoring architecture, which is built on top of the data lake architecture, the NLP engine, and the data mining module. The scoring model uses an ensemble modelling technique in which multiple base and meta machine learning models are stacked together to give a final output.

Figure 11: Complete workflow and process diagram for credit scoring

The training dataset is used to train the statistical machine learning models, and the validation set is used to evaluate model performance on unseen data. In the first part, the dataset with raw and derived features is used for descriptive and exploratory data analysis, which provides business insights. In the preprocessing step, categorical features are encoded as labels using Python's label encoder, and numerical columns are normalized using a min-max normalizer. In the feature engineering step, interactions between features are created to increase the number of features; these interactions increased the feature set size to 600. Predictive modelling is then performed in which the independent features are the raw and derived features and the target variable is the categorical feature describing the credit performance of the companies.
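A minimal sketch of these preprocessing and feature-engineering steps with scikit-learn is shown below; the column names and toy values are illustrative assumptions.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, PolynomialFeatures

df = pd.DataFrame({
    "industry": ["Mining", "Solar", "Shipping"],
    "country": ["SG", "IN", "SG"],
    "avg_transactions": [120.0, 40.0, 75.0],
    "current_balance": [5.2e5, 1.1e5, 9.8e4],
})

# Label-encode categorical columns.
for col in ["industry", "country"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Min-max normalize numerical columns.
num_cols = ["avg_transactions", "current_balance"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Pairwise interaction features (products of feature pairs).
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X = interactions.fit_transform(df)
print(X.shape)   # original columns plus their pairwise interactions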

3.5 Modeling Architecture

We used a stacking modelling architecture for predictive modelling. All the base features and NLP-based features are used in the classification model. The models used to predict the credit risk target are Logistic Regression, K-Nearest Neighbour Classifier, Random Forest Classifier and Extreme Gradient Boosting (XGBoost).

Logistic regression is used to predict the value of a categorical outcome variable Y on the basis of one or more input predictor variables X. The aim is to establish a linear relationship between the predictors and the log-odds of the response; logistic regression is chosen because it captures this linear relationship and remains interpretable. K-nearest neighbours predicts the value of the outcome variable Y based on the Y values of its k nearest neighbours, where closeness is measured by the Euclidean distance over the other variables; the predicted value of Y is the majority vote of the neighbours' Y values. We chose the KNN model because it gives good prediction performance and an easy-to-interpret output. A Random Forest classifier builds multiple decision trees and merges them to get a more accurate and stable prediction; it generally has much better predictive accuracy than a single decision tree and works well with default parameters. The Random Forest classifier is applied to the baseline data. XGBoost is an implementation of the well-known gradient boosting algorithm: it builds a series of models in which each new model is fitted to the errors of the previous models, and the final prediction is the sum of the predictions of all the models. XGBoost is known for its speed and accurate predictive power. We trained the model by tuning the tunable parameters such as max_depth, min_child_weight, gamma, subsample, colsample_bytree, nrounds, etc., and 5-fold cross-validation was applied to find the best parameter set.
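A sketch of this tuning step is shown below; the parameter grid and scoring metric are illustrative assumptions, and X, y stand for the combined training features and target.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 5],
    "gamma": [0, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "n_estimators": [200, 500],     # rough equivalent of `nrounds`
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",
    cv=5,            # 5-fold cross-validation
    n_jobs=-1,
)
# search.fit(X, y)               # X, y: training features and target
# print(search.best_params_)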

Ensemble Modelling: Stacking Strategy

Ensemble modelling represents a family of techniques that help reduce generalization error in machine learning tasks. An ensemble method uses multiple learning models to obtain better predictive performance than could be obtained from any of the constituent models alone.

The stacking technique is employed as the method of ensemble modelling. The basic idea is to use a pool of base predictors and then use another predictor to combine the base predictions. The base predictors in our stacking model are the prediction values of the 12 baseline models trained on the training sets. There are two levels in the complete process. The level-1 meta-learner models are trained on the baseline predictor set; the level-1 models trained are Linear Regression, Random Forest, and XGBoost. The predictions of the level-1 models then become features for level 2, where XGBoost is used as the meta-learner.
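A simplified sketch of this two-level stacking with scikit-learn's StackingClassifier is shown below; three base models stand in for the paper's twelve baseline predictors.

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

base_estimators = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
]

stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=XGBClassifier(eval_metric="logloss"),
    stack_method="predict_proba",   # pass class probabilities to the meta-learner
    cv=5,                           # out-of-fold predictions via 5-fold CV
)
# stack.fit(X_train, y_train)
# risk_scores = stack.predict_proba(X_valid)[:, 1]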

After modelling, there is a prediction layer in which all the preprocessing steps remain the same, but instead of learning, the trained model weights are used to make predictions. The final model predictions are generated and delivered in the form of files, visualizations, and reports, along with their interpretations. This is because ML models may act as black boxes, and in problems like credit risk scoring we need to explain why a model is making certain predictions and which features are most important. Hence, partial dependence plots, feature importance, and permutation importance are used to provide interpretability.
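A small, self-contained sketch of this interpretability step on synthetic data is shown below (a stand-in for the fitted stacking model and the real feature set).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance: how much the validation AUC drops when a feature is shuffled.
result = permutation_importance(model, X_valid, y_valid, scoring="roc_auc",
                                n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f}")

# Partial dependence of the predicted risk on one feature.
PartialDependenceDisplay.from_estimator(model, X_valid, features=[0])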

Final credit risk scoring also makes use of business rules and domain
knowledge. This is because a purely data-driven approach may not give
desired and relevant results. Hence, an additional rule-based engine is also
developed in which hard-coded business rules are implemented.
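A purely illustrative sketch of such a rule-based overlay is shown below; the thresholds and feature names are made-up examples of the kind of hard-coded rules the engine might contain.

def apply_business_rules(model_score: float, features: dict) -> float:
    score = model_score
    if features.get("n_negative_news", 0) >= 5:
        score = min(1.0, score + 0.10)       # penalise heavy negative coverage
    if features.get("years_in_operation", 0) < 1:
        score = min(1.0, score + 0.05)       # cold-start companies carry extra risk
    if features.get("total_debt_to_equity", 0) > 3.0:
        score = max(score, 0.70)             # floor the risk for highly leveraged firms
    return score

print(apply_business_rules(0.42, {"n_negative_news": 6, "years_in_operation": 3}))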

4. Discussion

The Federal Reserve estimated that there were at least 55 million unbanked or underbanked adult Americans in 2018, accounting for 22 percent of U.S. households. The percentage is much higher in developing countries, and the numbers are similar for SMEs. These figures reflect an important point about banking: if these companies or individuals apply for loans, there is a high chance of their application being rejected, simply because their credit history does not contain enough data to establish creditworthiness for the new loan. Many hopeful applicants and SMEs face rejection due to a lack of credit history data, and this becomes a vicious cycle: with a low or no credit score they cannot get loans, and since they cannot take out and service loans, they cannot improve their credit scores. Hence, there is a need to utilize untouched datasets such as news and social media in the absence of data from credit scoring agencies in order to overcome this issue.

That is where alternative credit scoring becomes a viable solution to improve the chances of a successful loan application and subsequent disbursement. Even with a low or no credit score, applicants can use an alternative credit score to bolster their chances of procuring loans from banks and financial institutions. The biggest beneficiaries of alternative credit scoring are SMEs that are new to the credit and financing ecosystem. For such new SMEs, sufficient centralized data is not available, but that does not mean they cannot avail of credit. New-age alternative credit scoring companies use other tangible factors, such as the digital footprint, to determine the creditworthiness of a new customer.

This provides benefits at both ends. By extending access to credit, SMEs that are new to the credit and loan system can still avail of loans despite the lack of credit scoring data in traditional channels. Banks and financial institutions, in turn, can utilize alternative credit scoring to boost their penetration in previously unexplored geographies such as semi-urban and rural areas while still keeping their risk minimal and checking fraud. At the core of alternative credit scoring companies' competencies are three key factors: the ability, intent, and stability of the customer in repaying the loans advanced on the basis of these innovative scoring systems.

Most of the data from which useful information can be obtained is present in unstructured form. With such a heavy focus on structured databases, some would view the traditional method of calculating a credit score as an outdated tool in our technologically evolving world. The emergence of social media has brought about a data revolution, and with it a trove of new information for businesses to tap into. It is thought that by simply overlaying social data onto traditional data, the context around financial decisions that on the surface appear risky or irresponsible will give lenders more confidence when agreeing to offer loans to borrowers. The technology already exists. News or social media data can be incorporated to enrich reliable credit scores by delivering deeper insight into each customer as an individual instead of a number. Unlike traditional credit scoring, it reveals recorded events and phrases that businesses can analyse to discover what is going on "behind the scenes" financially. It echoes the original methods employed by the merchant associations and small credit bureaus, where part of the credit score looked at personality and what was actually going on in a borrower's life. Social data offers that same insight, minus the one-to-one visit, and brings personalisation into the equation. Affordability assessments will become more detailed and bespoke than ever before.

For example, lenders would be able to see if a borrower had recently needed
to invest in a new boiler, whether they currently shop at Lidl or Waitrose, if
they have just been on holiday, got engaged, or had a new baby. Social data
reveals these snapshots so businesses can draw their own, well-informed,
financial conclusions. For lenders, there has never been a better time to join
the data revolution and utilise thousands of data points, extracted from
digital footprints, to supplement traditional credit scoring.

Natural language processing techniques such as named entity extraction are very useful for obtaining relevant information from unstructured data. Entity extraction, which focuses on extracting semantically meaningful named entities and their semantic classes from text, serves as an indispensable component of several downstream natural language processing tasks such as relation extraction and event extraction. Dependency trees also convey crucial semantic-level information. These techniques can be applied to gain more information about consumers before giving them a loan or before generating their credit scores. We discussed how to utilize the structured information conveyed by dependency trees and entity extraction, along with state-of-the-art deep learning models. The use of such deep learning models can also help in gaining better insights from unstructured data; for example, one can quickly identify the themes and sentiments associated with a company from news or social media data. These additional features can significantly improve the machine learning models and at the same time give a more credible view of the company's credit score. Machine learning techniques such as logistic regression, tree-based boosting, and stacking architectures can be used to model the relationship between credit scores and the features of the SMEs obtained from banking data and unstructured non-banking data. Through extensive experiments, we show that our proposed system is a useful choice that can be adopted by banks and financial institutions for credit scoring.

5. Conclusions and Future Work

We discussed three key ideas in this project: the use of alternative (non-banking) unstructured data for credit scoring, the use of state-of-the-art natural language processing and deep learning architectures to uncover insights from unstructured data, and the use of a credit scoring pipeline with banking and non-banking features. The novel work in this project is the combination of two different types of sentiment techniques: one for entity-specific sentiment using natural language processing, and a second for sentence-level sentiment using Transformer and BERT-based architectures.

The use of alternative unstructured data in credit modeling presents both opportunities and challenges for stakeholders, including consumers, lenders, credit reporting agencies and a new wave of data furnishers. In this paper, we show that unstructured data can be very valuable for assessing the profile of a company, specifically SMEs for which banking data is not reliable or accessible. Thus, the use of alternative data in the credit space can help in making an educated decision when extending credit to customers or business units.

We discussed how valuable information can be obtained from unstructured data and can heavily impact the performance of credit scoring models, and how state-of-the-art deep learning and natural language processing architectures can help in improving credit scoring profiles. Unstructured data is not only useful in credit scoring; there are several other implications of the greater use of alternative data and modeling, such as fair lending and discrimination, minimising default rates, setting new legal expectations, and privacy.

This work can be extended in multiple directions. A greater variety of alternative data, such as telecommunication data, geographic data, or psychometric data, can be fused with news data and banking data. This incorporation is likely to uncover deeper insights about SMEs and at the same time offers opportunities to improve the risk models further; even a 1% improvement may save millions of dollars for the banks. Additionally, apart from sentiment analysis and entity extraction, a wider variety of features can be extracted from the unstructured data, for example word embeddings, which convert text data into vectors while preserving context, as well as tone analysis and theme identification.

The entire theme presented in the paper, i.e. the use of alternative unstructured datasets for credit scoring of SMEs, can in fact be extended to other use cases as well, for instance risk assessment of customers on online ecommerce websites, risk scoring of potential customers by an insurance company before giving them loans, or measuring the healthcare risk associated with patients in a hospital. All these examples share a similar theme and core problem statement, and the system presented in this paper can be fine-tuned and customized for each specific use case.

Bibliography

1. Abdou, HAH and Pointon, J; Credit scoring, statistical techniques and


evaluation criteria: A review of the literature
2. Alain Shema; Effective credit scoring using limited mobile phone data; 2019
3. Allen N.Berger, W.Scott Frame, Nathan H.Miller, Credit Scoring and the
Availability, Price, and Risk of Small Business Credit; 2005
4. Atiya, A. F.; Bankruptcy prediction for credit risk using neural networks: a survey and new results; 2001
5. B. Collier, Hampshire R., Sending Mixed Signals: Multilevel Reputation
Effects in Peer-to-Peer Lending Markets; 2010

6. Bellotti, T., Crook, J.; Support vector machines for credit scoring and
discovery of significant features; 2009
7. Bhattacharya, P., Mehrotra, R., Tan, T., and Phan, T.Q.; Predicting Loan Defaults in Microfinance using Behavioral Sequences; 2018
8. C. Everett, Group membership, relationship banking and loan default
risk: the case of online social lending
9. Chen, M. C., & Huang, S. H.; Credit scoring and rejected instances
reassigning through evolutionary computation techniques. Expert; 2003
10. Chen, S. Y., & Liu, X.; The contribution of data mining to
information science; 2004
11. Cramer, J. S.; Scoring bank loans that may go wrong: A case
study. Statistica Neerlandica; 2004
12. Desai, V. S., Crook, J. N., & Overstreet, G. A.; A comparison of
neural networks and linear scoring models in the credit union
environment; 1996
13. Henley, W. E., & Hand, D. J.; A k-nearest neighbor classifier for
assessing consumer credit risk; 1996
14. Henley, W. E.; Statistical aspects of credit scoring. Dissertation;
1995
15. Hsieh, N.-C.; Hybrid mining approach in the design of credit
scoring models; 2005
16. J. Qiu, Z. Lin, and B. Luo; Effects of borrower defined conditions in the online peer-to-peer lending market; 2012
17. J.E. Stiglitz, A. Weiss; Credit Rationing in Market with Imperfect
Information;
18. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova,
BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding; 2018
19. Lipika Dey, Ishan Verma, Arpit Khurdiya, Sameera Bharadwaja
H.; A framework to integrate unstructured and structured data for
enterprise analytics

20. M. Herzenstein, R. Andrews, U. Dholakia, et al; The
democratization of personal consumer loans? Determinants of success in
online peer-to-peer lending communities, Working Paper; 2008
21. Malhotra, R., & Malhotra, D. K.; Differentiating between good
credits and bad credits using neuro-fuzzy systems; 2002
22. Moody’s; Seven key challenges in assessing SME credit risk; 2016
23. Ong, C.-S., Huang, J.-J., & Tzeng, G.-H.; Building credit scoring
models using genetic programming. Expert Systems with Applications;
2005
24. Riza Emekter, Yanbin Tu, Benjamas Jirasakuldech & Min Lu,
Evaluating credit risk and loan performance in online Peer-to-Peer (P2P)
lending; 2015
25. Rob Aitken; ‘All data is credit data’: Constituting the unbanked; 2017
26. Sustersic, M., Mramor, D., Zupan J.; Consumer credit scoring
models with limited data; 2009
27. Tam, K. Y., & Kiang, M. Y.; Managerial applications of neural
networks: the case of bank failure prediction; 1992
28. Tan, T., Bhattacharya, P., and Phan, T.Q.; Credit-worthiness
Prediction in Microfinance using Mobile Data: A Spatio-network
Approach; 2016
29. Tobias Berg, Valentin Burg, Ana Gombović, Manju Puri; On the
Rise of the FinTechs—Credit Scoring using Digital Footprints
30. A. Vaswani et al.; Attention Is All You Need; 2017

