
University Paris-Dauphine

Master 2 ISI

Predicting late payment of an


invoice

Author: Jean-Loup Ezvan
Supervisor: Fabien Girard

September 17, 2018


1 Abstract
The purpose of this work was to provide a tool that predicts the delay of
payment for any invoice handled by a company specialized in invoice collection.
External data had to be gathered to build a relevant model. A Gradient Boosting
Decision Tree implementation called Light Gradient Boosting Machine was chosen
for a multi-class classification assigning an estimated delay of payment to
every invoice. An approach was presented to tailor the prediction to each
customer's priorities, whether that is to focus on the latest invoices or to
achieve the best global accuracy. A misclassification cost matrix was also
introduced, so that a prediction can be shaped according to the importance
allocated to the cost of each invoice.

Keywords: Prediction, Multi-class Classification, LightGBM

Contents

1 Abstract
2 Introduction
3 Literature reviews
3.1 Predicting and Improving Invoice-to-Cash Collection Through Machine Learning
3.2 Using Predictive Analysis to Improve Invoice-to-Cash Collection
3.3 LightGBM: A Highly Efficient Gradient Boosting Decision Tree
4 First analyses
4.1 Getting started with our dataset
4.2 Exploration of models
4.3 A solution to the problem
5 Implementation of our solution
5.1 Customization of the objective function
5.2 3 Dimensions Cost Sensitive Learning
5.3 Alternative to shape the prediction
6 Conclusion

List of Figures

1 The prediction result of binary case with various machine learning algorithms
2 The prediction result of multiple output case with various machine learning algorithms
3 Class prediction accuracy of Random Forests in Multiple Outcome Case
4 A comparison of prediction accuracy of Random Forests and Balanced Random Forests
5 Misclassification cost matrix for cost-sensitive Random Forests
6 A comparison of prediction accuracy of Random Forests and cost-sensitive Random Forests
7 Summary features
8 Prediction accuracy of unified model vs. firm-specific models
9 Cost sensitive learning for Firm A
10 Datasets used in the experiments
11 Overall training time comparison
12 Overall accuracy comparison on test datasets
13 Classes statistics
14 Comparison Regression/Classification
15 Samples distribution
16 ROC score
17 GBDT ROC score
18 Confusion matrix
19 True/False Negatives/Positives
20 Confusion matrices for different β
21 Distribution of payment classes (split by costs)
22 Comparison F1 score (weighted)
23 Comparison F1 score (binary)
24 Confusion matrices specific to a cost class
25 Relative performance for each score (by weights)
26 Confusion matrix for the best Custom Score 6
27 Confusion matrix for the last cost class
28 Relative performance for each score (by objective function)
29 Probabilities for each class
30 Predicted Classes
31 Error distribution
32 Confusion matrix outcome

2 Introduction
Dunforce provides a tool that helps improve cash collection from invoices.
The goal is to create a model that predicts the delay of payment for any
invoice, so that every Dunforce customer gets additional information on
their outstanding invoices. Knowing when every invoice should be paid helps
considerably with portfolio management and also makes it possible to act on
invoices predicted to be paid very late, in order to limit their delay. Some
customers prefer a very accurate prediction for the invoices with the highest
delays, even at the expense of global accuracy, while other customers merely
want the best global accuracy. The aim of this research is thus to provide a
prediction model tailored to every client. That means that many different
models have to be trained, which makes it necessary to use a prediction
algorithm that trains quickly while remaining accurate.
We can then wonder which algorithm should be used and how those prediction
models could be tailored to each customer. To tackle those issues, we will
start by reviewing two papers dedicated to the prediction of payment delays
for invoices, to understand the state-of-the-art methods in this area. Then
we will look into a third paper introducing an algorithm that fits the needs
evoked earlier, in terms of both scalability and accuracy. Next, we will
describe our first approach to the problem by studying the data we deal with
and the relevant features to be added, and by exploring different models,
before introducing a way to tailor the model to any kind of customer.
Finally, we will study the implementation of that solution by discussing its
results, its advantages and its limits.

3 Literature reviews
There is little literature in the domain of invoice collection prediction,
but there are still some very interesting papers that we are going to
introduce now and then use during our analyses.

3.1 Predicting and Improving Invoice-to-Cash Collection Through Machine Learning
The first paper on which we are going to spend time is a thesis written by Hu
Peiguang [4] that combines a financial analysis with a predictive analysis.
Combining those two aspects is very interesting, as it puts the emphasis on
the most important matters regarding delay prediction. Indeed, research in
supply chain finance shows that effective invoice collection is positively
correlated with the overall financial performance of companies. The paper
consequently aims at identifying payments that are going to be late and
customers with a bad payment behavior. The analysis starts with the creation
of many features; this is also the case in the second paper we will look at,
whose features are believed to be more relevant in our case, which is why we
will focus on those features later. Then, most of the analysis consists in
trying out many supervised learning models, all of them classifiers. The
algorithms compared are a Decision Tree Classifier [7], Random Forest
Classifier [8], Adaptive Boosting [9], Logistic Regression [10] and Support
Vector Machine (SVM) [11]. First, a comparison is made between binary output
cases. The best results are obtained with the Random Forest Classifier, as
shown in Figure 1.

Figure 1: The prediction result of binary case with various machine learning
algorithms

However, by observing the confusion matrix corresponding to this prediction,
it appears that only the accuracy is compared. This may be an issue, as this
evaluation gives the same importance to False Positives and False Negatives.
The same kind of comparison is then made between multiple output cases; the
same metric is used and it is still the Random Forest Classifier that gives
the best results, as we can see in Figure 2.

Figure 2: The prediction result of multiple output case with various machine
learning algorithms

It then becomes clear that this analysis was kept light in order to focus on
the algorithm that stands out, the Random Forest Classifier. The most
interesting case is the multiple outcomes one, because the binary one is too
simple and does not provide enough information from a financial point of
view. In this case, 4 classes have been chosen: on time, short delay (1-30
days), medium delay (31-90 days) and high delay (90+ days). Even though the
global accuracy looked very good, with an 81.6% rate, a deeper look into the
results shows that the accuracy is not homogeneous. Indeed, Figure 3 shows
that results are that good only for the first and second classes.

Figure 3: Class prediction accuracy of Random Forests in Multiple Outcome Case

That is due to the imbalance of the data analyzed, an issue the paper deals
with right after. Two methods are used to tackle it. The first one consists
in assigning a custom weight to each class; the exact weights used are not
made very clear, but they can be interpreted as being inversely proportional
to class frequency, so that the product of frequency and weight is equal for
every class. By doing so, the results are very interesting, because the
obtained accuracy rates change radically, as shown in Figure 4 below.

Figure 4: A comparison of prediction accuracy of Random Forests and Balanced Random Forests

By taking the distribution of each class into account, that new prediction
means that the global accuracy diminishes, but that does not necessarily mean
that the prediction is worse; that interpretation depends on the priorities
of the business using those predictions. The second method used in the paper
consists in implementing cost-sensitive learning, which is the equivalent of
incorporating a cost matrix directly into the random forest algorithm. By
applying the cost matrix of Figure 5, the results of Figure 6 are obtained.

Figure 5: Misclassification cost matrix for cost-sensitive Random Forests

Figure 6: A comparison of prediction accuracy of Random Forests and cost-sensitive Random Forests

The results obtained with that second method look like a good compromise: the
high accuracy observed at the beginning for the first classes is preserved,
while the accuracy for the last 2 classes is higher. It is a shame that there
are no deeper analyses of those methods, but it is a very good preview of
tracks to explore. Going further on the topic of cost-sensitive learning is,
moreover, a direction suggested by the author, Hu Peiguang. Let us move on to
another paper which deals with the same issues while also going deeper into
that cost-sensitive approach.
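As an aside, the first method above (class weights inversely proportional to class frequency) is available out of the box in scikit-learn's Random Forest. This is a minimal sketch on synthetic data, not the paper's actual setup:

```python
# Sketch of the balanced-weights idea: each class receives a weight inversely
# proportional to its frequency, so rare (very late) classes count as much as
# frequent (on-time) ones during training. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Imbalanced labels: class 0 (on time) dominates, class 3 (90+ days) is rare.
y = rng.choice(4, size=1000, p=[0.6, 0.25, 0.1, 0.05])

# class_weight="balanced" uses n_samples / (n_classes * class_count)
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)

weights = {c: len(y) / (4 * np.sum(y == c)) for c in range(4)}
print(weights)  # the rarest classes get the largest weights
```

The second method (a full misclassification cost matrix) has no direct scikit-learn switch; it is usually emulated through per-sample weights or a custom loss.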

3.2 Using Predictive Analysis to Improve Invoice-to-Cash
Collection
This article [1] focuses on the invoice-to-cash process and how to improve
it. This fits perfectly with the main aim of the service that Dunforce
provides. The analysis is made using invoice records from 4 firms, totaling
nearly 170 000 invoices. That difference with the first paper, where the data
came from only one business, makes it more interesting because there is a new
angle to study: whether those datasets can be merged to create a better
performing model. It is explained that in order to predict delays there is a
first step which consists in differentiating historical customers from new
ones, because there are obviously many features that can be added only in the
first case. The classification approach was directly chosen and the best
results were obtained with the C4.5 algorithm, which is therefore the
algorithm used for the rest of that paper. This algorithm [2] builds decision
trees by using information entropy to create nodes. In this case, the choice
was made to use 5 classes: on time, 1-30 days late, 31-60 days late, 61-90
days late and 90+ days late. In this paper, the authors create and use a list
of interesting features, shown in Figure 7 below, that we will use almost
entirely and that contains most features introduced in the first paper.

Since most of those features are built upon historical data, they are not
available for new customers who have no history. That issue is tackled very
briefly, but it is explained that the point of additional features would be
to quantify the ability and willingness of a new customer to pay, which is
why features describing the region or the market of the business are created.
As there is not much detail, it seems like a good track to follow.
By comparing the results obtained by training the model either on only one
firm or on all of them, it is shown that there is more to gain when training
on all datasets, as summed up in the table of Figure 8.
Another very interesting aspect of the approach, which we have introduced
before, is the use of cost-sensitive learning, which in their case aims at
increasing the quality of prediction for the 90+ days late invoices. By doing
so, however, the global accuracy is decreased. Custom weights are also added
at some points to counteract the imbalance of the data and give more
importance to the latest payments, which are under-represented. The effect is
the same: it reduces the global accuracy of the prediction but improves the
prediction for high-risk invoices. In the cost-sensitive matrix they build,
very late invoices that are not predicted as such are penalized very harshly,
while invoices wrongfully predicted as very late matter much less. Indeed, we
can see it on Figure 9, also extracted from that paper.

The end of the paper studies the financial benefits of being able to predict
the delays efficiently. A very representative example concerns the invoices
predicted as being more than 90 days late, for which an average saving of
49.5 days is estimated. This article is thus very interesting

Figure 7: Summary features

because it tackles many issues we are also confronted with, and we will
shortly come back to the cost-sensitive learning that we have repeatedly
evoked. Before that, we are going to study a last paper, which introduces a
very interesting algorithm.

Figure 8: Prediction accuracy of unified model vs. firm-specific models

Figure 9: Cost sensitive learning for Firm A

3.3 LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This paper [5] presents an algorithm called LightGBM, which stands for Light
Gradient Boosting Machine. It is an implementation of a popular machine
learning algorithm called Gradient Boosting Decision Tree (GBDT) [12]. GBDT
achieves state-of-the-art performance in many machine learning tasks, such as
multi-class classification, which is exactly why we wanted to learn more on
the subject. There are other implementations of GBDT, such as XGBoost [13]
and pGBRT [14]. XGBoost is the most renowned GBDT implementation, as it is
the one which has proven the most successful in recent years: more than half
of the winning solutions in Kaggle competitions used XGBoost. XGBoost and
LightGBM are both gradient tree boosting methods. That method consists in
successively adding the decision tree that brings the best improvement to the
model, creating a tree ensemble model. The final prediction given by the
model is then the sum of the predictions from each tree. As described in
[13], XGBoost owes its success to both its scalability and its accuracy,
which is why it will be our point of comparison when speaking of LightGBM.
Although XGBoost can be praised for its scalability, it is explained in [5]

that its efficiency and scalability are still unsatisfactory when the feature
dimension is high and the data size is large, which can be an issue in our
case, since we intend to add many features and deal with many invoices. A
major reason is that, for each feature, it needs to scan all the data
instances to estimate the information gain of all possible split points,
which is very time-consuming. Indeed, to choose the best split, XGBoost uses
either a pre-sorted algorithm or a histogram-based algorithm [16]. The
pre-sorted splitting algorithm consists in enumerating all features for each
node; then, for each feature, sorting the instances by feature value, finding
the best split for that feature in terms of information gain, and finally
taking the best split among all the features. The histogram-based algorithm,
on the other hand, splits all the data points of a feature into discrete bins
and uses these bins to find the best split value of the histogram. To tackle
these time-consuming issues, LightGBM uses two novel techniques:
Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
The first one, GOSS [15], excludes a fairly high proportion of data instances
with small gradients and uses the rest to compute the information gain. Thus,
GOSS provides an estimation of the information gain with less computational
time. The second one, EFB, is a way to reduce the dimension by regrouping
mutually exclusive features (i.e., features that rarely take nonzero values
simultaneously). Besides, another big difference between LightGBM and XGBoost
is how the decision trees are built. Like most decision-tree-based
algorithms, XGBoost grows trees level by level, while LightGBM grows trees
leaf-wise, meaning that it chooses the leaf whose split will have the best
impact on the loss. That configuration makes the choice of the parameters
corresponding to the maximum depth and the maximum number of leaves very
important, but we will come back to that later. We are now going to take a
look at some computational time comparisons displayed in the paper, obtained
by trying out several models on several datasets. Figure 10 describes the
datasets upon which the comparison is realized.

Figure 10: Datasets used in the experiments

Those are mostly binary classification cases, but binary classification is
very similar to multi-class classification, so those results are relevant to
us. The numbers of data instances and features vary but are very high in each
case, so high that those dimensions cannot be handled by an ordinary
computer. For those datasets, several implementations of LightGBM and XGBoost
are compared. For XGBoost, xgb-exa is XGBoost using the pre-sorted algorithm,
while xgb-his is the one using the histogram-based one. For LightGBM,
lgb-baseline is LightGBM

without GOSS or EFB, EFB-only is lgb-baseline with EFB, and finally LightGBM
is the version using both GOSS and EFB. The numbers displayed in Figure 11
are the average times, in seconds, for one training iteration.

Figure 11: Overall training time comparison

The OOM values mean "out of memory". The results shown in Figure 11 are very
promising, as LightGBM appears very fast compared to the other models. The
efficiency of GOSS and EFB in terms of scalability is highlighted by
comparing the times between the 3 LightGBM implementations. But that does not
mean much without taking into account the corresponding accuracy of each
model, as displayed in Figure 12.

Figure 12: Overall accuracy comparison on test datasets

It is interesting to notice that for every dataset, LightGBM is as accurate
as, or even more accurate than, the other models.
To conclude, GOSS and EFB improve the computational time without hurting the
accuracy much, which makes LightGBM a first-choice algorithm for our analyses
to come.

4 First analyses
4.1 Getting started with our dataset
The data dealt with consists of invoices. Thus, the basic information we have
for each invoice is the provider, the customer, the amount, the date of
creation, the due date and the actual date of payment. The provider is a
Dunforce customer and is the business to which the invoice is to be paid,
while the customer is the business which has to pay the invoice. The invoices
on which the following analyses are made differ from those in the Dunforce
database. Indeed, most Dunforce customers use the service precisely because
they would struggle to be paid on time without it, which is why the
distribution of payment delays for Dunforce customers diverges strongly from
the distributions observed in [1] and [4]. Consequently, another dataset is
picked, containing nearly 200 000 invoices. The most important feature, our
Y, is the actual date of payment minus the due date. After cleaning the data,
some features have to be added to make the prediction as relevant as
possible. A first step is to take inspiration from the features evoked in the
literature review. Those features qualify every unique provider/customer
relation; they are very effective at identifying historical behaviors, but
for businesses with no or very few invoices they are not enough. This is why
another kind of feature, describing both the provider and the customer, is
created as well. In France, more than 10 million businesses are referenced in
open data from INSEE, where many characteristics are detailed. To find a
specific company, a business identification number is required, which is
called 'SIRET' in France. The exhaustive list of the added features is given
in Table 1.

That makes a list of 23 features used to describe both the provider and the
customer, for a total of 46 additional features. But a very important one is
missing: the turnover of the company, which is unfortunately not provided by
the INSEE. Still, it can be recovered by a bot which looks up every SIRET on
www.societe.com and returns the turnover whenever it is provided. Finally, we
add a feature, still to be improved, that aims to measure the link between
the names of the types of activity of the two companies. It uses Natural
Language Processing algorithms to measure the similarity between the two
lexical fields, because it is believed that this may have an impact on the
relation between two companies, and therefore on the delay of payment of any
invoice between them.
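The target described above (actual payment date minus due date) is straightforward to compute with pandas; the column names here are hypothetical, not Dunforce's actual schema:

```python
import pandas as pd

# Hypothetical invoice records; column names are illustrative only.
invoices = pd.DataFrame({
    "due_date":     pd.to_datetime(["2018-01-31", "2018-02-15", "2018-03-01"]),
    "payment_date": pd.to_datetime(["2018-01-31", "2018-03-20", "2018-02-25"]),
})

# Y = actual payment date minus due date, in days (negative = paid early).
invoices["delay_days"] = (invoices["payment_date"] - invoices["due_date"]).dt.days
print(invoices["delay_days"].tolist())  # [0, 33, -4]
```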

No. Feature Description
1. ACTISURF Type of shop (size)
2. SAISONAT Whether activity of establishment is seasonal
3. PRODET Productive nature of the establishment
4. PRODPART Particular participation in production by establishment
5. AUXILT Whether activity of establishment is of auxiliary nature
6. NATETAB Nature of establishment of a self-employed worker
7. ORIGINE Details of creation of establishment
8. LIEUACT Place of activity
9. MODET Precision about main activity of establishment
10. IND-PUBLIPO Whether the establishment can be contacted by mailing
11. ESS Belonging to the field of the social and solidarity economy
12. CATEGORIE Company category
13. LIBAPET Name of principal activity
14. TEFET Number of employees in establishment (by bracket)
15. ACTIVNAT Nature of activity
16. DDEBACT Year and month establishment added to dissemination database
17. DAPET Year principal activity of establishment validated
18. EFETCENT Number of employees in establishment to nearest hundred
19. DCRET Date establishment created
20. TU Size of urban unit
21. RPET Region where establishment located
22. APET700 Main activity
23. SIEGE Whether establishment is the headquarters of the company

Table 1: Features describing a company

4.2 Exploration of models


Now that the data is cleaned and has a number of features which seems
reasonably sufficient, it is time to try out different kinds of models to see
which would best answer the need of predicting payment delays for invoices.
In [1] and [4], multi-class classification models were the most used, as
seems natural in this context. But it may still be interesting to quickly
compare other approaches to find the best model to predict payment delays. To
do so, the first question was whether to choose a regression model or a
classification one. In the first case, for each invoice, the aim would be to
predict the exact day the invoice will be paid. To do so, we choose to use
Random Forest Regressor and Neural Network Regression models. In the second
case, the classification one, we may split the problem into two categories: a
binary one, where we have to predict whether the invoice will be paid on time
or late, and a multi-class one, where there are more than two classes and
each class consists in an interval of days (15-30 days late, for example).
For the classification problem, we choose to use Random Forest Classifier and
Decision Tree Classifier. Our idea for picking the best model consists in two
steps: first pick the seemingly best model for each kind of prediction, then
compare those 'elite' models. We choose these models for the comparison
because their efficiency has been proven and because they do not need much
parameter tuning to be effective, so we may quickly confirm the idea that a
multi-class classification is the best way to go. That does not mean we will
keep using them for the rest of the analyses. But how does one compare two
results when they cannot be measured with the same metrics?

It is indeed a tough situation, but we have to take into account the context
of the mission: our prediction has to go hand in hand with our tool. Dunforce
offers different recovery plans according to the behavior of a customer.
Therefore, a binary classification does not seem optimal in this case,
because it would then not be necessary to have several recovery plans, since
every bad client would be in the same 'LATE' box. But then again, if it is
the only model that gives decent results, then it should still be the
preferred one. The cases of regression or classification with more than 2
classes, however, offer more to work with. The quickest part is to compare
the regression algorithms between themselves, and the classifiers between
themselves, as we can use any metric to do so. For the regression problem, we
compare the Root Mean Square Error (RMSE), which is the square root of the
mean of the squared differences between the actual and predicted values: the
Random Forest Regressor gets the best result with 74.9, while the multi-layer
perceptron regressor scores almost twice that, with an RMSE of 146.8, both
having been trained and tested on the same samples of data. Even though not
much time was spent tuning the multi-layer perceptron, the margin between the
two results seemed significant enough. For the multi-class classification
problem, we choose the F1 score to decide which model is best. F1 is usually
calculated as follows:
F1 = (2 × Precision × Recall) / (Precision + Recall)

where

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Consequently, the higher the F1 score, the better the prediction; its value
ranges from 0 to 1. That formula is directly usable in binary cases, but what
about in our case? First, let us define those variables more precisely in a
multi-class classification case.

TP is the number of True Positives, i.e., the number of elements, for a
class, that have been correctly labeled as belonging to that class. FP is the
number of False Positives, i.e., the number of elements, for a class, that
have been wrongfully labeled as belonging to that class. FN is the number of
False Negatives, i.e., the number of elements, for a class, that should have
been attributed to that class but were not.

F1 is the most used F-measure, but the general formula is as follows:

Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

β is used to give more or less importance to False Negatives, since by
simplifying the previous equation we get:

Fβ = (1 + β²) × TP / ((1 + β²) × TP + β² × FN + FP)
So it might be useful to use that in our case, but we will come back to that
later. In the meantime, we keep deciding on models with basic metrics.
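Since the two expressions for Fβ above are algebraically equivalent, a quick sketch can check them numerically from raw counts (toy counts, not thesis results):

```python
# Verify that the two F-beta formulas above agree, from raw TP/FP/FN counts.
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def fbeta_simplified(tp, fp, fn, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

tp, fp, fn = 80, 10, 30  # illustrative counts
for beta in (0.5, 1.0, 2.0):
    assert abs(fbeta_from_counts(tp, fp, fn, beta)
               - fbeta_simplified(tp, fp, fn, beta)) < 1e-12

# beta > 1 emphasizes recall, i.e. it penalizes False Negatives more.
print(round(fbeta_from_counts(tp, fp, fn, beta=2.0), 3))  # 0.755
```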
We can calculate the F1 score in a binary fashion for each of the 5 classes
and then take the average as our score; this is called the 'macro' average.
By doing so, we get an F1 score of 68.8% for the Decision Tree Classifier and
67.8% for the Random Forest Classifier. If we use a weighted average instead
(the weights depending on the true number of samples belonging to each
class), then we get an F1 score of 81% for the Decision Tree Classifier and
80.9% for the Random Forest Classifier. Since both algorithms give pretty
similar results, with the Decision Tree Classifier always ahead, we do not
feel the need to make other comparisons. However, we may notice the gap
between the weighted and unweighted averages, which may indicate that the
prediction is less accurate for smaller classes.
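The gap between the macro and weighted averages described above is easy to reproduce with scikit-learn on a small imbalanced example (toy labels, not the thesis data):

```python
from sklearn.metrics import f1_score

# Toy labels: class 0 dominates, class 2 is rare and never predicted correctly.
y_true = [0]*8 + [1]*3 + [2]*2
y_pred = [0]*8 + [1, 1, 0] + [0, 1]

macro    = f1_score(y_true, y_pred, average="macro")     # plain mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support

# A badly predicted rare class drags the macro score down much more than the
# weighted one, mirroring the gap between the two scores reported above.
print(round(macro, 3), round(weighted, 3))
assert weighted > macro
```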

Figure 13: Classes statistics

We can see in Figure 13 that the data is imbalanced, and so is the quality of
the prediction. Those results were obtained after training on nearly 160 000
data points and testing on nearly 40 000 data points.

We then choose to compare the Decision Tree Classifier and the Random Forest
Regressor to decide which kind of model we will use. As said before, we
cannot use the same metric on both their predictions, so we need to find
another way. The first step consists in plotting the predictions to get an
idea. To do so, we plot the error of prediction of the delay according to the
actual delay. For regression, it is merely the average of errors for each
actual day of delay; for classification, we plot the average error for each
day by taking into account only the class that day belongs to. We then
consider the deviation between the classes observed in the training set and
the distribution of the predictions to establish an average error; this
deviation is given by a function we call Deviation(). As there are only 5
classes, that makes only 5 values to plot. Denoting by j the number of a
class, we have:

∀j ∈ [1, 5], Error[j] = ( Σ_{i=1, i≠j}^{5} Deviation(i, j) × Card(True Class = j ∧ Predicted Class = i) ) / Card(Class = j)

By using that formula we get Figure 14, but that does not take into account
the number of samples for each day, so it is to be studied carefully.
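As an illustration, the per-class error above can be computed from a confusion matrix, assuming a Deviation(i, j) matrix; the deviation values below (distances between hypothetical class midpoints) and the confusion counts are purely illustrative:

```python
import numpy as np

# Deviation(i, j) is assumed to give the expected day error when class i is
# predicted instead of true class j; here it is sketched as the distance
# between hypothetical class midpoints (values are illustrative only).
midpoints = [0, 8, 23, 45, 90]
deviation = np.abs(np.subtract.outer(midpoints, midpoints))  # 5x5 matrix

# confusion[j, i] = number of samples of true class j predicted as class i
# (illustrative counts, not the thesis confusion matrix).
confusion = np.array([
    [50,  5,  2,  1,  0],
    [ 6, 30,  4,  2,  1],
    [ 2,  5, 20,  3,  1],
    [ 1,  2,  4, 15,  2],
    [ 0,  1,  2,  3, 10],
])

n_classes = 5
error = np.empty(n_classes)
for j in range(n_classes):
    # Sum over predicted classes i != j, normalized by the class size.
    off_diag = sum(deviation[i, j] * confusion[j, i]
                   for i in range(n_classes) if i != j)
    error[j] = off_diag / confusion[j].sum()
print(np.round(error, 1))  # one average error (in days) per true class
```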

Figure 14: Comparison Regression/Classification

The error appears on average lower with the classification method, and higher
only in a small interval, which corresponds to class 3 (16-30 days late),
where a sudden decrease in prediction quality was already noticed in Figure
13. The results seem clear enough to conclude that a multi-class
classification model is what should be used. We choose to use 5 classes as
well, but with intervals slightly different from those observed in [1] and
[4], as we want them to fit the steps used in the invoice collection process.
The classes on time, 1-15 days late, 16-30 days late, 31-60 days late and 60+
days late are picked; their distribution is displayed in Figure 15.

Figure 15: Samples distribution
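As a sketch, the delay in days can be binned into those five classes with pandas (hypothetical delay values):

```python
import pandas as pd

# Binning the delay (in days) into the five classes retained above.
delays = pd.Series([-2, 0, 3, 20, 45, 120])
classes = pd.cut(
    delays,
    bins=[-float("inf"), 0, 15, 30, 60, float("inf")],  # right-inclusive bins
    labels=["On time", "1-15 days", "16-30 days", "31-60 days", "60+ days"],
)
print(classes.tolist())
# ['On time', 'On time', '1-15 days', '16-30 days', '31-60 days', '60+ days']
```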

It is interesting to notice that on-time payments consist almost exclusively
of invoices paid exactly on the due date; close to none are paid before. It
is now time to find the best algorithm to predict delays using that
configuration. We set up XGBoost and LightGBM to compare with what is
explained in the literature review. After having tuned their parameters, we
compare their results. First, the XGBoost implementation trained in 30
minutes and 30 seconds, whereas LightGBM trained in only 19 minutes and 20
seconds. The difference is not the same as the one observed in [5], but there
are also far fewer features, which may explain it. To compare the results, we
get the predicted probability for each class and study the AUC ROC curves
[17] for each class to determine which model is the most relevant. The curves
are displayed for the two classification algorithms used earlier, Decision
Tree Classifier and Random Forest Classifier, as well as for the two gradient
boosting decision tree algorithms we keep confronting, on Figures 16 and 17.

The results are better by a wide margin for the two GBDT algorithms.

(a) Decision Tree Classifier (b) Random Forest Classifier

Figure 16: ROC score

(a) XGBoost (b) LightGBM

Figure 17: GBDT ROC score

XGBoost shows slightly better results than LightGBM, but the difference is so
small that, once training time is taken into account, LightGBM seems like the
best algorithm to use for the rest of the study.

4.3 A solution to the problem


Having at our disposal the algorithm with the best accuracy/scalability balance
that could be found, it is now necessary to find a way to tailor the prediction
this algorithm gives us for that same dataset. The elements found while reviewing
the literature on the subject lead us to believe that cost-sensitive learning can
really have an impact on the quality of the prediction, overall but also class
by class. Besides, with LightGBM it is possible to customize many interesting
parameters to try and obtain specific results, such as the maximum number of
leaves, the maximum depth, or other kinds of parameters, like the objective
function and the evaluation function.

The first confusion matrix obtained after using LightGBM is shown
on Figure 18.

Figure 18: Confusion matrix

That kind of classification gives a good multi-logloss score, but there are
obvious flaws that can be a real issue. Indeed, among the samples predicted as
belonging to the second payment class (1-15 days late), 982 actually belong to
the fifth and 1648 to the fourth. Considering that there are 40011 samples in
this confusion matrix, that is a large portion of serious misclassifications.
Thus, it is believed that those results can be improved.
For a majority of clients, False Negatives are far more disturbing than False
Positives when it comes to late payments. Furthermore, False Negatives regarding
the last payment class (60+ days late) attract a lot more attention than the
other ones, because in France the law states that a payment delay cannot exceed
60 days. False and True Negatives are not necessarily an obvious concept for
multi-class problems: they change according to which payment class is chosen,
as shown for the two extreme payment classes on Figure 19.
The main issue is usually the False Negatives corresponding to the last payment
class, as it means that those invoices will prove to be problematic and will not
be identified until it is too late. Indeed, false alerts (False Positives) are
less important than failures to notice true problems (False Negatives), because
in the case of invoices that are predicted to be paid late, Dunforce will ensure
that they are paid sooner. But the invoices which we predict as paid on time,
and which do not end up being so, are where the essence of the problem lies.
In order to avoid being forced to lose time and money on litigation and

(a) First Payment Class (b) Last Payment Class

Figure 19: True/False Negatives/Positives

trial, it is almost imperative to forecast the invoices which will belong to that
class so that steps can be taken to accelerate the payment, even if it means
wrongfully attributing some invoices to that class. It really depends on the
will of the customer. Besides, as invoices with the highest amounts matter more
than others, a new order of priorities might be put in place as well. The goal
is then to set cost classification matrices that best depict whatever priorities
a customer wishes for, and to find the combination of parameters that optimizes
the error accordingly. The objective function seems like the main parameter to
tune to shape the prediction at our convenience, by prioritizing invoices with
higher costs or higher payment classes, or even by reducing large differences
between true and predicted classes.

5 Implementation of our solution
5.1 Customization of the objective function
While reviewing the literature regarding the prediction of invoices' delay of
payment, we highlighted cost-sensitive learning. Some algorithms can directly
incorporate cost matrices into the learning, as was done in papers [1] and [4],
but LightGBM cannot.
The objective function takes as variables the predictions and the labels (true
values). The loss used at first is multi-logloss, which is merely the sum of log
loss values for each class, and which is also the evaluation function we used
initially. The evaluation function is the metric computed at every iteration
and used for early stopping: the user sets a number of iterations, and if the
metric has not improved during that many iterations, training is stopped.
Here is the initial objective function.
Loss = − Σ_{i=1}^{M} (FN + FP)

Where

FN = y * log(ŷ)
FP = (1 − y) * log(1 − ŷ)

With

p = predicted probability
S(x) = 1 / (1 + e^{−x})
ε = 10^{−6}
ŷ = min(max(S(p), ε), 1 − ε)
y = true label (integers from 0 to 4)
M = number of classes

Wishing to penalize False Negatives more harshly, we choose to modify the Loss
function. Take β ∈ ]1, ∞[; the new Loss function is:

Loss = − Σ_{i=1}^{M} (β * FN + FP)

With that new Loss function, we calculate the underlying gradient and hes-
sian that the objective function has to return in the LGBM algorithm.

As

dS/dx = S * (1 − S)

then

d(β * FN)/dx = β * y * S * (1 − S) * (1/S) = β * y * (1 − S)

dFP/dx = −(1 − y) * S * (1 − S) * (1/(1 − S)) = −(1 − y) * S

Which gives us this gradient:

d(−(β * FN + FP))/dx = S * [1 + y * (β − 1)] − β * y

And which gives us this hessian:

d²(−(β * FN + FP))/dx² = S * (1 − S) * [1 + y * (β − 1)]

To implement this in our custom objective function, we have to take into
account the fact that the labels form an (N, 1) shaped matrix, where N is the
number of data points, whereas the predictions, the gradient and the hessian
are (N * M, 1) shaped matrices. All three are sorted by class id, then by row
id, so in order to access the probability predicted for the i-th row and the
j-th class, we have to consider entry [j * N + i] of the matrix. Labels take
values from 0 to 4, but that is a problem when we calculate both the gradient
and the hessian, because the labels should only take the two values 0 and 1 in
our function. This is why, ∀i ∈ [1, N], ∀j ∈ [1, M], with idx = j * N + i, we
use while calculating the gradient and hessian:

y(i) = 1 if label[i] = j, 0 otherwise

This explains the following expressions in our custom objective function:

∀i ∈ [1, N], ∀j ∈ [1, M], idx = j * N + i ⟹

grad[idx] = β * (ŷ − 1) if label[i] = j, ŷ otherwise

hess[idx] = β * ŷ * (1 − ŷ) if label[i] = j, ŷ * (1 − ŷ) otherwise
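A minimal Python sketch of such a custom objective could look as follows. It assumes the flat class-major layout described above (entry [j * N + i] holds the score of row i for class j); recent LightGBM versions may instead pass predictions as a 2-D array, and the function and constant names here are illustrative:

```python
import numpy as np

NUM_CLASS = 5
BETA = 2.0  # penalty multiplier for False Negatives (illustrative value)

def weighted_logloss_objective(preds, train_data):
    """Custom multi-class objective penalizing False Negatives by BETA.

    preds is a flat array sorted by class id then row id, so entry
    [j * N + i] is the raw score of row i for class j.
    """
    labels = train_data.get_label().astype(int)
    n = len(labels)
    eps = 1e-6
    # raw scores -> probabilities through a sigmoid, clipped for stability
    y_hat = np.clip(1.0 / (1.0 + np.exp(-np.asarray(preds))), eps, 1.0 - eps)
    grad = np.empty_like(y_hat)
    hess = np.empty_like(y_hat)
    for j in range(NUM_CLASS):
        idx = slice(j * n, (j + 1) * n)
        is_true = labels == j
        p = y_hat[idx]
        # gradient/hessian of -(beta * FN + FP) from the derivation above
        grad[idx] = np.where(is_true, BETA * (p - 1.0), p)
        hess[idx] = np.where(is_true, BETA * p * (1.0 - p), p * (1.0 - p))
    return grad, hess
```

With the Python API, a function of this shape can be passed as the custom objective to `lgb.train`; the exact hook name depends on the LightGBM version.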

The analysis starts off by comparing results for several values of β to try
and understand the impact of this change. With a higher β, fewer False Negatives
are expected; in return, the number of False Positives should increase. This is
acceptable up to a certain point, beyond which too many invoices are predicted
in a higher class than the truth, making the prediction almost useless. Once
again, there is a balance to find, and it depends on the priorities initially
set by a customer. The first comparisons are made with the values 0.5, 2, 10
and 100 for β. Even though β is supposed to be higher than 1, as stated before
the derivation, a value below 1 is also picked to check whether it would, on
the contrary, improve the results for False Positives.

(a) β = 0.5 (b) β = 2

(c) β = 10 (d) β = 100

Figure 20: Confusion matrices for different β

As shown on Figure 20, there is indeed an improvement in the bottom-left cells
when increasing β, except with the value 100, which may be considered too high.
The 0.5 value for β does indeed have a negative impact on those False Negatives
that matter so much. There might still be room for improvement since the data
is imbalanced: there are fewer invoices in the fifth class, while those are the
most important to predict correctly. This is why weights are going to be
introduced into the data, to give more or less importance to a data sample
depending, for example, on its payment class. But it may be relevant to also
have it depend on another parameter, the amount of the invoice.

5.2 3 Dimensions Cost Sensitive Learning
It is reasonable to say that invoices with higher costs matter more than other
invoices. This is why we have the idea of creating 5 cost classes to rank the
invoices and give those with the highest amounts a higher importance.
Consequently, we create a weight that is to be used in the objective function,
once again, to stress the invoices on which we really do not want the prediction
to be wrong. But in order to do so, the evaluation function has to be changed
too, because multi-logloss does not take those weights into account. Therefore,
we create a first weight column calculated as follows:

∀i ∈ [1, N], weight[i] = classcost[i]²

with classcost taking values from 1 to 5. This may be a little extreme, as it
means that invoices belonging to the fifth cost class weigh 25 times more than
those belonging to the first cost class. By splitting according to the
delimiting values [100, 500, 1500, 5000], here is the distribution for the
dataset we have been using for most of the study.

Classes                  Class 1   Class 2   Class 3    Class 4     Class 5
Interval                 0-100     100-500   500-1500   1500-5000   5000+
Number of occurrences    56193     62108     38761      26642       16140

Table 2: Cost class samples distribution
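As a hedged illustration, the binning of Table 2 and the squared weight column could be derived with pandas as follows; the column names and the helper itself are assumptions, not taken from the actual pipeline:

```python
import pandas as pd

def add_cost_class_weight(df, amount_col="amount"):
    """Bin invoice amounts into the 5 cost classes delimited by
    [100, 500, 1500, 5000], then derive the squared weight column."""
    bins = [0, 100, 500, 1500, 5000, float("inf")]
    out = df.copy()
    out["cost_class"] = pd.cut(out[amount_col], bins=bins,
                               labels=[1, 2, 3, 4, 5],
                               include_lowest=True).astype(int)
    out["weight"] = out["cost_class"] ** 2  # fifth class weighs 25x the first
    return out
```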

Consequently, we have new values for the gradient and hessian:

∀i ∈ [1, N], ∀j ∈ [1, M], idx = j * N + i ⟹

grad[idx] = β * (ŷ − 1) * weight[i] if label[i] = j, ŷ * weight[i] otherwise

hess[idx] = β * ŷ * (1 − ŷ) if label[i] = j, ŷ * (1 − ŷ) otherwise

We also have a new visualization of the distribution of the data with that
new dimension, as shown on Figure 21.

Figure 21: Distribution of payment classes (split by costs)

This new dimension shows that the highest amounts are a minority in the first
payment classes, but for the two last payment classes the distribution is fairly
balanced, as every cost class consists of nearly 2000 samples. By comparing the
results obtained with and without this weight, with the weighted-average F1
score as a metric, there is a slight improvement for the last cost classes, but
not as large as could have been expected, as shown on Figure 22.

(a) Without cost class weight (b) With cost class weight

Figure 22: Comparison F1 score (weighted)

The highlighted row (Cost Class 5) on Figure 22 shows a noticeably worse score
without the cost class weight than with it, whereas there is not much of a
difference for any of the other cost classes. Those are consequently promising
results, but there are other tracks to follow. Another one is to apply a weight
in order to balance the data. To do so, the distribution of each payment class
is taken into account, and a coefficient γ, which only depends on the payment
class, is created to balance the samples:
∀i ∈ [1, M], γ[i] = 1 / (npartial[i] / ntotal)

Where

npartial[i] is the number of occurrences of the ith class,
ntotal is the sum of the numbers of occurrences of all classes.
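A minimal sketch of this inverse-frequency coefficient, assuming 0-based class labels (the function name and layout are illustrative):

```python
import numpy as np

def balancing_weights(labels, num_class=5):
    """Inverse relative frequency per payment class: rare classes get
    larger coefficients, which can then be mapped onto samples as weights."""
    labels = np.asarray(labels)
    n_total = len(labels)
    gamma = np.empty(num_class)
    for c in range(num_class):
        n_partial = np.count_nonzero(labels == c)
        gamma[c] = 1.0 / (n_partial / n_total)
    return gamma
```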

To compare the results obtained with and without this weight, we split the
results by payment class and use the F1 score with a binary average, since the
labels can be split into two groups: those belonging to the payment class in
question and the rest. By doing so, it is noticed that adding weights to
balance the data has not worked as expected, since results are worse for every
class, as shown on Figure 23.

(a) Without balancing weight (b) With balancing weight

Figure 23: Comparison F1 score (binary)

Those are just ideas to try out to see if there is any positive impact on the
prediction. Once those tests are done, F1 scores are not necessarily enough to
decide on the relevance of a change; some interesting results can indeed be
noticed when looking into the confusion matrices.
As has been highlighted before, the value of an invoice is a very important
matter, and we really want to put more emphasis on the invoices worth the most,
because that is what any company would do as well. That is why we create a 3D
cost matrix. A 2D cost classification matrix would give the penalties
corresponding to predicting a class j when class i should have been predicted,
as already shown on Figure 18.

But that representation does not quite fit our goal, which includes taking the
cost into account. That is why we add a dimension, the cost class: instead of
a 5*5 matrix, we then deal with a 5*5*5 matrix, since we choose to use the
5 cost classes displayed on Table 2.

Thereby, we have a misclassification matrix for each of the 5 cost classes.
This is why we also prefer visualizing the error by displaying a confusion
matrix for each cost class, which allows us, once again, to put the emphasis
on the highest cost classes. The confusion matrices of the first and last cost
classes are displayed on Figure 24.

(a) First cost class (b) Last cost class

Figure 24: Confusion matrices specific to a cost class

Cost matrices and confusion matrices go hand in hand. After having gener-
ated that kind of matrix, we can compute the associated score. To do so, we
simply use the following equation.

Score_custom = Σ_{i=1}^{5} Σ_{j=1}^{5} Σ_{k=1}^{5} matrix_cost[i][j][k] * matrix_confusion[i][j][k]

The first coordinate is the cost class, the second is the actual class and the
third is the predicted class. This allows creating a custom metric for any
specific need, as we may choose precise values in the cost matrix to set a
specific objective.
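The triple sum above reduces to an element-wise product of the two 5*5*5 matrices; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def custom_score(cost_matrix, confusion_matrix):
    """Sum over all (cost class, actual class, predicted class) cells of
    the element-wise product of the cost and confusion matrices."""
    return float(np.sum(np.asarray(cost_matrix) * np.asarray(confusion_matrix)))
```

Note that a uniform cost matrix of ones simply returns the total number of classified samples, which is the idea behind the Global Score introduced later.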
Now that we have a tool to decide which prediction is best for which situation,
we decide to go further by exploring the results generated by combinations of
slight parameter changes, starting with the objective function, where the
algorithm can be run with many different β values, as shown before with the 4
simulations on Figure 20.

But an objective function works hand in hand with an evaluation one; indeed,
the evaluation function decides when the training is to be stopped. That is why
we also customize it, taking into account the same kinds of parameters we
consider in the objective function, because up until this point we had only
been using a classic multi-logloss evaluation function. On top of that comes
the use of weights, whether to counteract the lack of balance, or to give more
importance to higher payment classes, to high cost classes, or even to both at
once. Trying out several values for that many parameters makes for a large
number of models, but it may be worth the computing time to find patterns about
the efficiency of each of those changes.

To try out those combinations, we follow the exact same procedure as when
applying a grid search to tune parameters, because in the end that is what we
are actually doing, since those functions can be considered parameters of the
boosting algorithm. By executing a simulation with 40 different weights, 4
different objective functions, 2 different evaluation functions, and with or
without balancing the data, we get 640 prediction results.
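The enumeration of this grid can be sketched as follows; the identifiers are purely illustrative placeholders for the actual weights and functions used:

```python
from itertools import product

# 40 weights x 4 objective functions x 2 evaluation functions x balanced-or-not
weights = [f"weight_{i}" for i in range(40)]
objectives = [f"objective_beta_{b}" for b in (0.5, 2, 10, 100)]
eval_functions = ["multi_logloss", "custom_eval"]
balancing = [False, True]

configurations = list(product(weights, objectives, eval_functions, balancing))
print(len(configurations))  # 40 * 4 * 2 * 2 = 640
```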
If we want to find the best prediction in a problem where we believe False
Negatives are more annoying than False Positives, where above all we want to
avoid 60+ days late invoices going undetected, while taking into account their
amount and penalizing larger prediction errors, a suitable cost matrix can be
calculated as follows:

∀i ∈ [1, 5], ∀j ∈ [1, 5], ∀k ∈ [1, 5], matrix_cost[i][j][k] = α * γ * |j − k| * i

where α = 2 if j > k, 1 otherwise

and γ = 10 if j = 5, 1 otherwise
We create the coefficient α to give more importance to False Negatives than to
False Positives, as discussed earlier: if the actual class is higher than the
predicted class, we want a higher penalty than in the opposite case. The
coefficient γ is used to highlight the invoices belonging to the fifth payment
class. The absolute value of the difference between the actual and predicted
classes puts a higher cost on larger prediction errors (for example, predicting
the first class when the actual class is the fifth). At last, the cost class
index is used to emphasize the most expensive invoices: the point is to
penalize a wrong delay prediction more harshly for an expensive invoice than
for a cheap one.
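Such a matrix can be built directly from its definition; a minimal sketch, with i the cost class, j the actual payment class and k the predicted one, all running from 1 to 5:

```python
import numpy as np

def build_cost_matrix():
    """Cost matrix defined above: alpha * gamma * |j - k| * i."""
    m = np.zeros((5, 5, 5))
    for i in range(1, 6):
        for j in range(1, 6):
            for k in range(1, 6):
                alpha = 2 if j > k else 1    # harsher on False Negatives
                gamma = 10 if j == 5 else 1  # stress the 60+ days class
                m[i - 1, j - 1, k - 1] = alpha * gamma * abs(j - k) * i
    return m
```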
It is a pretty complex example, but it shows how each parameter can be used to
express the kind of prediction looked for. To compare first results, however,
we start off with some more basic cost matrices. The results are displayed on
Figure 25.
In order to display several graphs together, the lowest point of each curve is
100, since the y-axis is the ratio score/best-score (in percent) and the lowest
score is the best score in every case. Here, we compare the weight parameter.
As we used 40

Figure 25: Relative performance for each score (by weights)

of them during the simulation, there are 40 data points for each score. A first
cost matrix is created that penalizes every mistake with the same value 1; the
associated score is called Global Score.
Another one is created that penalizes more or less according to the error
margin and the cost class, and that is harsher with predictions lower than the
actual delays, as follows:

∀i ∈ [1, 5], ∀j ∈ [1, 5], ∀k ∈ [1, 5], matrix_cost[i][j][k] = α * |j − k| * i

where α = 2 if j > k, 1 otherwise

It is at the origin of Custom Score 2.

On the same model, we created one which mostly penalizes the last payment
classes going undetected, associated with Custom Score 4.
Finally, we build one that mostly penalizes large errors, gives a lot of
influence to the cost class as well, and penalizes more the predictions below
the actual payment classes; it is calculated as follows:

∀i ∈ [1, 5], ∀j ∈ [1, 5], ∀k ∈ [1, 5], matrix_cost[i][j][k] = α * |j − k|³ * i²

where α = 2 if j > k, 1 otherwise

It is at the origin of Custom Score 6. It is interesting to notice that
Global Score and Custom Score 2 look alike, except for the first weight values,
where there are still significant differences. Custom Score 4 is at its lowest
with customweight23, which depends on both the cost class and the payment class
of a sample: it takes the square root of each of them as factors, and multiplies
the product by 128 if the sample belongs to the last payment class. Indeed,
some custom weights have been somewhat randomly designed by taking different
exponent values. Unfortunately, we could not run that many simulations, as they
are very time-consuming, so we only used 40 weights. Custom Score 6 is at its
lowest with customweight6, which returns as weight for a sample the product of
its squared cost class and its squared payment class.
Now we want to see their confusion matrices to really understand the impact of
each combination, but the values displayed on Figure 25 are the mean scores
obtained, which means that it is not necessarily with those custom weights that
we got the best scores; it simply shows tendencies. By sorting according to
Custom Score 6, the best score is obtained with an objective function using
0.2 for β, an evaluation function using 5 for β, a balancing weight and
customweight10, which consists in the addition of the payment class and the
cost class. A representation of the classification is shown on Figure 26.

Figure 26: Confusion matrix for the best Custom Score 6

If we compare that classification to the first one we had, it really looks like
some progress has been made. And that is not all: that custom score aims at
focusing on the invoices of the highest cost class, so let us take a look at
the confusion matrix for the last cost class alone, displayed on Figure 27 side
by side with the first confusion matrix we had for that cost class.
It is pretty clear that the number of samples with predicted classes below the
actual ones has been strongly reduced, which is exactly the point of that custom
(a) New classification (b) Old one

Figure 27: Confusion matrix for the last cost class

score. However, the parameters allowing us to get those results are unexpected.
From our previous equations, it was not forecast that an improvement would be
seen with an objective function using a β inferior to 1. It is also really
surprising because, by generating the same figure as before but comparing the
objective functions instead of the weights, with the same custom scores,
Figure 28 shows that the Custom Score 6 discussed at length earlier has its
worst results with β = 0.5.

Figure 28: Relative performance for each score (by objective function)

It is also interesting to notice that from this point of view β = 0.25 looks
like the best value for every score.

5.3 Alternative to shape the prediction
Another method could be to change the prediction without touching the algorithm
parameters. The idea is to take the initial probability predictions and,
instead of necessarily designating the highest-probability class as the
predicted class, to use a cost classification matrix we build in order to pick
the class with the best expected value.
The aim is to minimize a score which stems directly from a cost matrix we
generate. Instead of designating the predicted class as the class with the
highest probability, as we have been doing until now, we choose to designate
the predicted class as the class with the lowest expected cost. To do so, we
use the misclassification cost matrix and the probability given for each class
to compute the expected cost of each class. At first, our model returned
probabilities as shown below on Figure 29.

Figure 29: Probabilities for each class

Each line contains 5 values, one for each payment class; consequently, the sum
of the terms of each line is equal to 1. However, in practice, those results do
not satisfy a company. They would prefer something more explicit, which is why
the traditional way to answer that problem is to pick the highest probability,
as we said before, which, for the same example, would return the prediction on
Figure 30:

Figure 30: Predicted Classes

But by considering the expected cost of each class, we might choose to assign
an invoice to a class with a lower probability. Those expected costs are
calculated with the misclassification cost matrix. We still use a 3-dimensional
one, as we still want to take the amount of an invoice into consideration. The
expected cost of each payment class is calculated by considering all the
penalty costs relative to that payment class. To show the effects this can have
on the prediction, we are going to describe a use case introduced before, where
a customer would like to minimize the number of cases where the difference
between the actual payment class and the predicted one is too high, with a
predicted payment class below the actual one, while still taking into account
the cost class of any invoice. To model this wish into a cost matrix, we create
the following one:

∀i ∈ [1, 5], ∀j ∈ [1, 5], ∀k ∈ [1, 5], matrix_cost[i][j][k] = α * |j − k|⁴ * i

where α = 3 if j > k, 1 otherwise
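The minimum-expected-cost rule described above can be sketched as follows. The function names, the probability layout (one row of 5 class probabilities per invoice) and the cost-class argument are illustrative assumptions:

```python
import numpy as np

def build_usecase_cost_matrix():
    """Cost matrix of the use case above: alpha * |j - k|^4 * i."""
    m = np.zeros((5, 5, 5))
    for i in range(1, 6):
        for j in range(1, 6):
            for k in range(1, 6):
                alpha = 3 if j > k else 1
                m[i - 1, j - 1, k - 1] = alpha * abs(j - k) ** 4 * i
    return m

def predict_min_expected_cost(proba, cost_classes, cost_matrix):
    """For each invoice, pick the payment class with the lowest expected cost.

    proba: (N, 5) array of predicted class probabilities;
    cost_classes: N cost classes in [1, 5].
    """
    preds = np.empty(len(proba), dtype=int)
    for n, (p, c) in enumerate(zip(proba, cost_classes)):
        # expected cost of predicting class k: sum over true classes j
        # of p[j] * cost[c][j][k]
        expected = cost_matrix[c - 1].T @ p
        preds[n] = int(np.argmin(expected))  # 0-based class index
    return preds
```

With a sufficiently skewed cost matrix, this rule can assign an invoice to a class that is not the most probable one, which is exactly the behavior described above.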

Since we want to focus mostly on cases where the predicted class is much lower
than the actual one, we are going to use a new graphic displaying the number of
occurrences of each error amplitude, split by cost class. It is used on
Figure 31, considering only the invoices with a predicted class lower than or
equal to the actual class.
It indeed allows us to eradicate the cases where the first payment class is
predicted when the actual payment class is the fifth and the invoice belongs to
the fifth cost class. However, displaying the global confusion matrix shows
that this improvement had a backlash, as the global accuracy is worse, mostly
for cases where the prediction is higher than the actual payment class, as
shown on Figure 32.
Once again, the issue is to find a good balance between the quality of the
prediction and the priorities stated by a customer. This method might seem
Figure 31: Error distribution

Figure 32: Confusion matrix outcome

better than the previous one, as in this case the algorithm needs to be trained
only once to create different shapes from the same prediction, simply by
changing the misclassification cost matrix. However, by comparing the best
results obtained in the previous simulations according to a specific
misclassification cost matrix with the results obtained by applying that same
cost matrix to this method, the latter proves to be less efficient.

6 Conclusion
In order to tackle the issue of predicting the payment period of invoices, the
writing of this memoir has relied on literature on the subject of invoice
collection, which detailed the best models and features to get relevant
results. Adding to that a paper presenting the LightGBM algorithm, which
happened to be perfect for our study as it is very fast and accurate, it has
been possible to simulate many models and to introduce a method to tailor a
model according to the will of any customer. The ability to provide a
prediction tool for a company regarding its outstanding invoices is already a
very good thing, but adding to that a means to control the prediction wished
for is a real added value. Besides, the cost dimension also adds some value to
the experience. The results we have are promising and may indicate the extent
of what is achievable on the subject. However, it is not perfect yet and there
are some results we cannot explain, but the work done here has proved to be
very interesting, and it has shown that it is indeed possible to tailor the
prediction according to the will of a customer.

A lot of future prospects can be envisaged as a result of this study. It would
be good to have more computing power to try out even more simulations. It would
also be a good thing to combine the tuning of the algorithm parameters with our
different approaches to tailoring a prediction, as for most of our simulations
we have been using the same parameters for the LightGBM algorithm. Besides,
working on improving the features and comparing their contribution to each
prediction may be interesting to understand some of the results that could not
be explained. It would also be interesting to work specifically on the feature
we briefly introduced before, which aims to describe the relation between two
businesses by using Natural Language Processing techniques, as it is not on
point yet.

References
[1] Sai Zeng, Prem Melville, Christan A. Lang, Iona Boier-Martin and Conrad
Murphy. Using Predictive Analysis to Improve Invoice-to-Cash Collection
http://www.prem-melville.com/publications/equitant-kdd08.pdf
[2] S.Vijayarani and M.Divya. An Efficient Algorithm for Classification Rule
Hiding
https://pdfs.semanticscholar.org/84e4/1a7f69a99f1e5a5d2e8046d1f0c82519357a.pdf
[3] Pedro Domingos. MetaCost: A General Method for Making Classifiers Cost-
Sensitive
https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf
[4] Hu Peiguang. Predicting and Improving Invoice-to-Cash Collection Through
Machine Learning
https://dspace.mit.edu/bitstream/handle/1721.1/99584/925473704-MIT.pdf?sequence=1
[5] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong
Ma, Qiwei Ye, Tie-Yan Liu
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
[6] Scott M. Lundberg and Su-In Lee
A Unified Approach to Interpreting Model Predictions
http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
[7] S. R. Safavian and D. Landgrebe
A survey of decision tree classifier methodology
https://engineering.purdue.edu/~landgreb/SMC91.pdf
[8] A. Liaw and M. Wiener
Classification and Regression by Random Forest
http://www.webchem.science.ru.nl/PRiNS/rF.pdf
[9] Yoav Freund, Robert E. Schapire
A short introduction to boosting
https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf
[10] David W. Hosmer and Stanley Lemeshow
Applied Logistic Regression
http://resource.heartonline.cn/20150528/13 kOQST g.pdf
[11] Johan Suykens
Least Squares Support Vector Machines
https://www.esat.kuleuven.be/sista/natoasi/suykens.pdf
[12] Jerome H. Friedman
Greedy Function Approximation: A Gradient Boosting Machine
https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

[13] Tianqi Chen and Carlos Guestrin
XGBoost: A Scalable Tree Boosting System
http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf
[14] Stephen Tyree, Kilian Q. Weinberger and Kunal Agrawal
Parallel Boosted Regression Trees for Web Search Ranking
http://www.cs.cornell.edu/~kilian/papers/fr819-tyreeA.pdf
[15] Rong Zhu
Gradient-based Sampling: An Adaptive Importance Sampling for Least-
squares
https://papers.nips.cc/paper/6579-gradient-based-sampling-an-adaptive-importance-sampling-for-least-squ

[16] Alvira Swalin
CatBoost vs. Light GBM vs. XGBoost
https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db
[17] Jocelyn D’Souza
Let’s learn about AUC ROC Curve!
https://medium.com/greyatom/lets-learn-about-auc-roc-curve-4a94b4d88152

