
Uncertainty Quantification (UQ)

1
An important note on the nomenclature
▪ Just to be precise: uncertainty and quantiles are not the same thing.
▪ At the end of the day, though, most of the time you care about quantiles and not uncertainty.
▪ If you really do want uncertainty with deep nets, check out
http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html

2
This section deals with
▪ “Classical” confidence interval (classical because it is the most widely known)
▪ Prediction interval
▪ PI coverage probability and mean PI width
▪ Known methods to obtain PIs

3
Prediction vs. confidence interval: #1
▪ An aspect that is important but often overlooked in applied machine learning is intervals, be
it confidence or prediction intervals. Confidence intervals, prediction intervals, and
tolerance intervals are three distinct approaches to quantifying uncertainty in a statistical
analysis.

▪ A confidence interval quantifies the uncertainty on an estimated population variable, such


as the mean or standard deviation. It can be used to quantify the uncertainty of the
estimated skill of a model. The confidence interval is fairly robust due to the Central Limit
Theorem and in the case of a random forest, the bootstrapping helps as well.
▪ A prediction interval quantifies the uncertainty on a single observation estimated from the
population. It can be used to quantify the uncertainty of a single forecast. The prediction
interval is completely dependent on the assumptions about how the data is distributed
given the predictor variables; the Central Limit Theorem and bootstrapping have no effect on
that part.

https://blog.datadive.net/prediction-intervals-for-random-forests/
https://stats.stackexchange.com/questions/56895/do-the-predictions-of-a-random-forest-model-have-a-prediction-interval
BEST: https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
4
Prediction vs. confidence interval: #2
▪ The CI looks symmetric because the error
is normally distributed, but that is not always
the case. This case is specific!
▪ You are interested in determining the mean (a
property of the whole population) and its CI.
You see that 95 out of 100 intervals capture the true
mean and only 5 do not.
▪ If you increase the sample size, you will see a
noticeable decrease in the width of the CI.
This happens only for the CI. As the sample
size (n) approaches infinity, the right side of
the equation goes to 0 and the average will
converge to the true population mean.
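The equation itself is not reproduced in the extracted slide; as a reminder, the usual normal-approximation CI for the mean is

CI = x̄ ± z_{α/2} · s / √n

so the half-width (the "right side of the equation") shrinks like 1/√n and goes to 0 as n goes to infinity.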

BEST: https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
5
Prediction vs. confidence interval: #3
▪ Prediction intervals must account for both the uncertainty in estimating the population
mean, plus the random variation of the individual values. So a prediction interval is always
wider than a confidence interval. Also, the prediction interval will not converge to a single
value as the sample size increases.
▪ The key point is that the prediction interval tells you about the distribution of individual
values, as opposed to the uncertainty in estimating the population mean and will not
converge to a single value as the sample size increases.

BEST: https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
6
Prediction vs. confidence interval: #4
▪ Before moving on to tolerance intervals, let's define that word 'expect' used in defining a 95%
prediction interval. If you were to simulate many prediction intervals, some would capture more
than 95% of the individual values and some would capture less, but on average, they would capture
95% of the individual values.

▪ What if you want to be 95% sure that the interval captures at least 95% of the population? Or 90%
sure that the interval captures at least 99% of the population? These questions are answered by a
tolerance interval. To compute, or understand, a tolerance interval you have to specify two
different percentages. One expresses how sure you want to be (confidence level), and the other
expresses what fraction of the population the interval will contain (population coverage).

▪ If you set the first value (confidence level) to 50%, then a tolerance interval is essentially the same
as a prediction interval. If you set the confidence level to a higher value (say 90% or 99%) then the
tolerance interval is wider than a prediction interval. As with prediction intervals, tolerance
intervals will not converge to a single value as the sample size increases. The formula for a
tolerance interval is Average ± k*StDev, where k is a tabled value based on the sample size and
confidence level.

BEST: https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
7
Prediction vs. confidence interval: #5
▪ Prediction and tolerance intervals are more affected by departures from the Gaussian
distribution than confidence intervals.
▪ This is because prediction and tolerance intervals predict where individual values will fall.
▪ Confidence intervals are based on the distribution of statistics, such as average or standard
deviation, which are typically well approximated by a Gaussian distribution (the
approximation gets better as the sample size increases).

BEST: https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
8
Prediction vs. confidence interval: #6
These intervals all have a confidence
level (NOT interval) of 95%.

BEST: https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
9
Prediction vs. confidence interval: #7
▪ I do not particularly like this definition, but
I’ll report it here for completeness
▪ Confidence interval: predicts the distribution
of estimates of the true population mean or
other quantity of interest that cannot be
observed.
▪ Prediction interval: predicts the distribution
of individual future points. Prediction interval
takes both the uncertainty of the point
estimate and the data scatter into account.
So a prediction interval is always wider than
a confidence interval.

10
https://medium.com/the-artificial-impostor/quantile-regression-part-1-e25bdd8d9d43
Prediction vs. confidence interval: #7.1
▪ Using simple linear regression as an example, its confidence interval is

▪ And its prediction interval is:

▪ We can see that the variance of the prediction interval is just the variance of the confidence
interval plus the mean square error, which is an estimate of the data scatter.
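The equation images are not reproduced in the extracted slide; for reference, the standard simple-linear-regression formulas at a new point x_0 (with s the residual standard error and t the relevant t-quantile) are

CI for the mean response:  ŷ_0 ± t · s · √( 1/n + (x_0 − x̄)² / Σ_i (x_i − x̄)² )
PI for a single observation: ŷ_0 ± t · s · √( 1 + 1/n + (x_0 − x̄)² / Σ_i (x_i − x̄)² )

The extra “1” under the square root contributes s² (the mean square error) to the variance, which is the data-scatter term mentioned above.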

11
https://medium.com/the-artificial-impostor/quantile-regression-part-1-e25bdd8d9d43
▪ For both of them, intervals are characterised by 2 elements:
▪ An interval [x_l, x_u] (not necessarily symmetric!)
▪ The level C that ensures that C% of the time, the value that we want to predict will lie in
this interval.

▪ For instance, we can say that the 99% confidence interval of average temperature on earth
is [-80, 60].

▪ An alternative way to see it is: say we wish to know the range of possible predictions with
a probability of certainty of 90%; then we need to provide two values Q_alpha and Q_{1-alpha},
such that the probability of the true value lying within the interval [Q_alpha, Q_{1-alpha}] is 90%.

12
A political pollster plans to ask a random sample of 500 voters whether or not they support
the incumbent candidate. The pollster will take the results of the sample and construct a 90%
confidence interval for the true proportion of all voters who support the candidate. Which of
the following is a correct interpretation of the 90% confidence level?

https://www.khanacademy.org/math/ap-statistics/estimating-confidence-ap/introduction-confidence-intervals/a/interpreting-confidence-levels-and-confidence-intervals
13
A baseball coach was curious about the true mean speed of fastball pitches in his league. The
coach recorded the speed in kilometers per hour of each fastball in a random sample of 100
pitches and constructed a confidence interval for the mean speed. The resulting interval was
(110, 120).
Which of the following is a correct interpretation of the interval (110,120)?

https://www.khanacademy.org/math/ap-statistics/estimating-confidence-ap/introduction-confidence-intervals/a/interpreting-confidence-levels-and-confidence-intervals
14
Suppose that the coach from the previous example decides they want to be more
confident. The coach uses the same sample data as before, but recalculates the
confidence interval using a 99% confidence level. How will increasing the
confidence level from 95% to 99% affect the confidence interval?

https://www.khanacademy.org/math/ap-statistics/estimating-confidence-ap/introduction-confidence-intervals/a/interpreting-confidence-levels-and-confidence-intervals
15
Prediction + confidence and Bayesian confidence
interval
▪ A prediction interval is an interval associated with a random variable yet to be observed,
with a specified probability of the random variable lying within the interval. For example, I
might give an 80% interval for the forecast of GDP in 2014. The actual GDP in 2014 should
lie within the interval with probability 0.8. Prediction intervals can arise in Bayesian or
frequentist statistics.
▪ A confidence interval is an interval associated with a parameter and is a frequentist
concept. The parameter is assumed to be non-random but unknown, and the confidence
interval is computed from data. Because the data are random, the interval is random. A
95% confidence interval will contain the true parameter with probability 0.95. That is, with
a large number of repeated samples, 95% of the intervals would contain the true
parameter.
▪ A Bayesian confidence interval, also known as a “credible interval”, is an interval
associated with the posterior distribution of the parameter. In the Bayesian perspective,
parameters are treated as random variables, and so have probability distributions. Thus a
Bayesian confidence interval is like a prediction interval, but associated with a parameter
rather than an observation.

https://robjhyndman.com/hyndsight/intervals/
16
Parametric intervals
▪ Parametric intervals: the computation is done assuming a normal distribution.
▪ Non-parametric intervals: the computation is done without assuming normally
distributed data; one method to do this is bootstrapping.

https://saattrupdan.github.io/2020-02-26-parametric-prediction/
17
Coverage probability, standard deviation and
normal distribution
• A PI gives an interval within which
we expect the single prediction to
lie with a specified probability.
Assuming normally distributed
errors, a 95% PI on the
prediction is:
prediction +/- 1.96*sigma
▪ More generally, the PI can be
written in terms of a multiplier c,
which depends on the coverage
probability as shown in the table:
prediction +/- c*sigma
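The table itself is not reproduced in the extracted slide; the standard normal multipliers (as tabulated in the fpp2 reference) are approximately:

Coverage probability -> multiplier c: 50% -> 0.67; 80% -> 1.28; 90% -> 1.64; 95% -> 1.96; 99% -> 2.58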

https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b
https://otexts.com/fpp2/prediction-intervals.html
18
Link between prediction and confidence
interval
▪ Each target (measured value) t_i can be modelled as the sum of the noise epsilon_i and the
true mean y_i: t_i = y_i + epsilon_i.

▪ It is assumed that the errors are iid (independently and identically distributed), but
in practice an estimate of the true mean is obtained via a model y_hat_i.

The PI deals with the uncertainty of the term t_i − y_hat_i (target vs. model prediction); the CI deals with the variance of the term y_i − y_hat_i (true mean vs. model prediction).

Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on neural networks 22.9 (2011): 1341-1356.
19
Wide or narrow prediction intervals (PIs)?
▪ A PI is comprised of upper and lower bounds that bracket a future unknown value with a
prescribed probability called the confidence level [(1−α)%], where α is the error rate.

▪ Wide PIs are an indication of presence of a high level of uncertainty in the operation of the
underlying system.

▪ Narrow PIs mean that decisions can be made more confidently with less chance of
confronting an unexpected condition in the future.

Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on neural networks 22.9 (2011): 1341-1356.
20
How do you establish how good a prediction interval is? #1

▪ PIs need to be evaluated from two perspectives, width and coverage probability. Frequently, PIs are assessed from their coverage
probability perspective without any discussion about how wide they are. The two metrics can be combined together, see
reference.

▪ [more important] PI coverage probability (PICP) is measured by counting the number of target values covered by the
constructed PIs: PICP = (1/n_test) * sum_{i=1..n_test} c_i, where c_i = 1 if the i-th target lies within its PI and 0
otherwise, and n_test is the number of samples in the test set.
▪ [less important] PICP has a direct relationship with the width of PIs. A satisfactorily large PICP can be
easily achieved by widening PIs from either side. However, such PIs are too conservative and less useful
in practice, as they do not show the variation of the targets. Therefore, a measure is required to check
how wide the PIs are. Mean PI width (MPIW) quantifies this aspect:
MPIW = (1/n_test) * sum_{i=1..n_test} (U_i − L_i), where L_i and U_i are the lower and upper bounds of the i-th PI.

Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on neural networks 22.9 (2011): 1341-1356.
21
Which value of PICP is good?
▪ In general, we want the model to have a high PICP (typically reaching some threshold
such as 80%) as reported in this reference:
▪ https://www.godaddy.com/engineering/2020/01/10/better-prediction-interval-with-neural-network/

▪ But this will leave roughly 20% of the predictions outside, and this is not preferable.

• Remember that in both cases we want our model to learn, hence we evaluate this
value on the hold-out set (essentially the test set).

22
How do you establish how good a prediction interval is? #2

▪ Both PICP and NMPIW evaluate the quality of PIs from one aspect. A combined index is
required for the comprehensive assessment.
▪ The new measure should give a higher priority to PICP, as it is the key feature of PIs
determining whether constructed PIs are theoretically correct or not, but at the same time
consider the MPIW.
▪ The new metric is called CWC (Coverage Width-based Criterion)

▪ μ corresponds to the nominal confidence level associated with PIs and can be set to 1−α
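The CWC formula itself is not reproduced in the extracted slide; as given in the Khosravi et al. reference, it reads

CWC = NMPIW * (1 + γ(PICP) * exp(−η * (PICP − μ)))

where γ(PICP) = 1 if PICP < μ and 0 otherwise, and η is a hyperparameter controlling how strongly a coverage shortfall is penalised.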

Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on neural networks 22.9 (2011): 1341-1356.
23
How do you establish how good a prediction interval is? #2.1
Normalising MPIW by the range R of the underlying targets allows us to compare PIs
constructed for different datasets (NMPIW = MPIW / R).
▪ Worked-out example.
▪ PIs are constructed with an associated 90% confidence level
▪ (α = 0.1 = 1 − 0.9)
▪ η = 50 why?
▪ μ = 0.9 -> corresponds to the nominal confidence level
Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on Neural Networks 22.9 (2011): 1341-1356.
24
How do you establish how good a prediction interval is? #3

▪ If PICP is less than the nominal confidence level (1−α)%, CWC should be large regardless of
the width of PIs (measured by MPIW).
▪ If PICP is greater than or equal to its corresponding confidence level, then NMPIW should
be the influential factor: γ(PICP) eliminates the exponential term of CWC when PICP is
greater than or equal to the nominal confidence level.

▪ In the CWC measure, the exponential term penalizes the violation of the coverage
probabilities. This is, however, a smooth penalty rather than a hard one, for the following
reasons. It is appropriate to penalize the degree of violation, rather than just an abrupt
binary penalty. Also, it allows for statistical errors due to the finite samples.

Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on neural networks 22.9 (2011): 1341-1356.
25
How do you establish how good a prediction interval is? #4

▪ So here is a nice tradeoff!


▪ Therefore, PIs should be as narrow as possible from the informativeness perspective.
▪ However, as discussed above, the narrowness may lead to not bracketing some targets and
result in a low coverage probability (low-quality PIs).
▪ CWC evaluates PIs from the two conflicting perspectives: informativeness (being narrow)
and correctness (having an acceptable coverage probability).

Khosravi, Abbas, et al. "Comprehensive review of neural network-based prediction intervals and new
advances." IEEE Transactions on neural networks 22.9 (2011): 1341-1356.
26
Confidence interval: awkward language

▪ Correct interpretation: We are XX% confident that the population parameter is between
the interval bounds. The confidence level only quantifies how plausible it is that the
parameter is in the interval.

▪ Incorrect language might try to describe the confidence interval as capturing the
population parameter with a certain probability.

Diez, David M., Christopher D. Barr, and Mine Cetinkaya-Rundel. OpenIntro statistics. OpenIntro, 2012.
27
Each one of these bars is the result of drawing "n" samples
upon which the CI was computed.

Diez, David M., Christopher D. Barr, and Mine Cetinkaya-Rundel. OpenIntro statistics. OpenIntro, 2012.
28
Communicating interval plus changes in
scenarios
• In communicating forecasting result, a useful summary can be
something like:

• "If nothing unexpected happens we expect to be within ±x %, but if


assumptions a, b, or c perform differently than expected, we might be
as much as ±y % off.”
• ±x % = prediction intervals
• ±y % = use Monte Carlo simulation ??

29
Prediction intervals

30
Available methods
▪ Bootstrapping
▪ Quantile regression
▪ Dropout for ANNs
▪ Distribution estimator
▪ Methods that consider the prediction error distributed normally

31
Prediction interval methods
Method: Quantile loss
Pros:
▪ Easy to implement if not provided as default
▪ Accounts for heteroscedasticity (the variances of the individual residuals differ)
▪ Found applications for DL as well
▪ Can be applied to trees: quantile regression forests
▪ Can be applied to gradient boosting
▪ Can be applied to XGBoost?
Cons:
▪ You have to run 3 times: lower quantile, LS or 50% quantile, and upper quantile. Still it is much less expensive than bootstrapping
▪ Smooth vs. hard quantile loss function

Method: Bootstrapping
Pros:
▪ Flexible
▪ Works with unknown distributions
Cons:
▪ Infeasible if training takes many hours, because of the large number of bootstrapped samples (100s or 1000s) required to train the model
▪ Cannot account for heteroscedasticity (the variances of the individual residuals differ)
The bottom line? Use quantile regression when dealing with heteroscedastic data (with confidence intervals
included if bootstrapping is feasible), or when dealing with an accurate predictive model that takes a long time
to train, such as neural nets.
32
Quantile regression forests

33
Quantile regression forests: part #1
▪ A general method for finding confidence intervals for decision tree based methods is Quantile
Regression Forests.
▪ The idea: instead of recording the mean value of response variables in each tree leaf in the forest,
record all observed responses in the leaf.
▪ The prediction can then return not just the mean of the response variables, but the full
conditional distribution P(Y≤y∣X=x) of response values for every x.
▪ Using the distribution, it is trivial to create prediction intervals for new instances simply by using
the appropriate percentiles of the distribution.
▪ For example, the 95% prediction intervals would be the range between the 2.5 and 97.5 percentiles of
the distribution of the response variables in the leaves. If you are looking for a 95% confidence
interval, you want cutoffs at 2.5% from one end and 2.5% from the other end.

https://blog.datadive.net/prediction-intervals-for-random-forests/
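A minimal sketch of this idea in Python (assuming a fitted scikit-learn RandomForestRegressor with deep trees; illustrative helper, not the blog post's exact code): collect the per-tree predictions for each input and read off the desired percentiles.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_prediction_interval(forest, X, lower=2.5, upper=97.5):
    # Per-tree predictions; with fully grown trees each tree returns (roughly)
    # a single training response, so the columns approximate the conditional
    # distribution of the target at each input.
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return (np.percentile(per_tree, lower, axis=0),
            np.percentile(per_tree, upper, axis=0))

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=500)
forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=1).fit(X, y)
lo, hi = rf_prediction_interval(forest, X[:5])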
Quantile regression forests: part #2
▪ One can use a random forest as quantile regression forest simply by expanding the tree fully so
that each leaf has exactly one value. (And expanding the trees fully is in fact what Breiman
suggested in his original random forest paper.)
▪ Then a prediction trivially returns individual response variables from which the distribution can be
built if the forest is large enough.
▪ One caveat is that expanding the tree fully can overfit (we’ll check for it in this presentation): if
that does happen, the intervals will be useless, just as the predictions. The nice thing is that just
like accuracy and precision, the intervals can be cross-validated.

https://blog.datadive.net/prediction-intervals-for-random-forests/
Quantile regression forests: part #3
▪ Prediction interval is an ESTIMATE of an interval into
which the future observations will fall with a given
probability.

▪ For example, the 95% prediction interval would be


the range between 2.5 and 97.5 percentiles of the
distribution.

▪ Given 100 draws, we’ll expect 95 of them to fall
within the prediction interval, and the other 5
outside of it.

▪ It is clear that what every model strives to achieve is


to have a 95% prediction interval as small as
possible and as close as possible to the true value.

https://blog.datadive.net/prediction-intervals-for-random-forests/
▪ If we now want our random forests to also output their uncertainty, it would seem that we
are forced to go down the bootstrapping route, as the quantile approach relied on the
models learning through gradient descent, which random forests don’t do.
▪ Choice No #1 bootstrapping: Say I have a random forest consisting of 1,000 trees and I’d
like to make 1,000 bootstrapped predictions to form a reasonable prediction interval.
Naively, to be able to do that we’d have to make a million decision tree predictions for
every prediction we’d like from our model, which can cause a delay that the users of the
model might not be too happy about.
▪ Choice No #2: there is a simple way of tweaking a random forest to enable it to make
quantile predictions, which eliminates the need for bootstrapping. It was introduced in
2006 by Meinshausen.

https://saattrupdan.github.io/2020-04-05-quantile-regression-forests/
https://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf
37
More details: #1
▪ The rough idea is that we choose the feature and threshold that best
separates the target values of the data. From such a tree, we can now
easily figure out which leaf a given input belongs to, by simply
answering the yes/no questions all the way down.

▪ A quantile decision tree is different when we focus on what happens


after having found the correct leaf. During training, there might have
been several training instances ending up in a given leaf, each with
their own associated target value. So what value do we assign to this
new instance? In a normal decision tree we simply take the mean of
those target values from the training set, so that every leaf has a
single target value associated with it. However, the values in the
leaves giving rise to the above tree might have looked like this. We can
use all of the information in the leaves to estimate the true
distribution of the target values, rather than simply getting a point
value. Namely, given a new input variable x, we traverse the tree to
find the leaf node it belongs to, and then simply look at the
distribution of target values present in that leaf, which will be our
estimate for the predictive distribution.

https://saattrupdan.github.io/2020-04-05-quantile-regression-forests/
38
More details: #2
▪ Given such an estimate we can now also output quantiles rather than the mean: we simply
compute the given quantile out of the target values in the leaf. A Quantile Regression
Forest (QRF) is then simply an ensemble of quantile decision trees, each one trained on a
bootstrapped resample of the data set, exactly like with random forests.
▪ Note one crucial difference between these Quantile Regression Forests and the quantile
regression models is that by only training a QRF once, we have access to all the quantiles at
inference time, where in the previous case we would have to train our model separately for
every quantile.
▪ Also, as we also noted last time, the quantile model is able to deal with heteroscedasticity,
which bootstrapping can’t really deal with.

https://saattrupdan.github.io/2020-04-05-quantile-regression-forests/
39
Gradient Boosting Regressor with quantile loss
(quantile regression)

40
Quantile explained with an example: #1
▪ Given a random variable Y (such as the predicted parking time) and a value p in [0, 1], the
associated quantile q is the value such that P(Y <= q) = p. As an example, the median is the
0.5 quantile.
▪ For instance, let say that a predictive model that computes the parking time for tomorrow
morning in downtown Bordeaux estimates that 22 minutes is the 0.95 quantile.
▪ This means that the probability that the parking time will be less or equal to 22 minutes is
95%

https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb
41
Quantile explained with an example: #2
▪ The following figure (Fig 2) illustrates how the 0.05 and 0.95 quantiles are used to
compute the 0.9 prediction interval.
▪ Using the predictions of a 0.05 quantile regressor as a lower boundary and the
predictions of a 0.95 quantile regressor as an upper one, by construction the
probability that a value belongs to the interval between the upper and lower
boundary is:
▪ P(l <= X <= u) = P(X <= u) − P(X <= l) = 0.95 − 0.05 = 0.9
▪ Roughly speaking, this means the prediction interval will contain approximately 90%
of samples.

https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb
42
Quantile explained with an example: #3

▪ Example of a 90% prediction
interval: the probability that the
actual function’s observations
(blue dots) belong to the
prediction interval (blue filled
area) is 90%.
▪ As you can see, some blue dots are
not in that interval at all.

https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb
43
Quantile vs. L2 loss function

▪ The go-to loss function for regression is the MSE (L2). That’s
well and good, but it says nothing about how varied (<- that is
the keyword) the residuals are.

▪ There is an alternative, and that is using a different loss function:
the quantile loss, suggested in 1978 by Koenker & Bassett. It is
indexed by the quantile q and takes the error x (the difference
between the target and the model prediction) as its argument.
We can then proceed to define the mean quantile loss.
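For reference (the formula image from the slide is not reproduced here), the quantile loss for a single error x = y − ŷ, and the mean quantile loss over n samples, are usually written as

ρ_q(x) = q·x if x ≥ 0, and (q − 1)·x if x < 0
MQL = (1/n) Σ_i ρ_q(y_i − ŷ_i)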

▪ If you set alpha to 0.95, 95% of the observations are below the
predicted value. Conversely, if you set alpha to 0.05, only 5% of
the observations are below the prediction. And 90% of real
values lie between these two predictions. This 90% is called the
confidence level (not interval!).

https://saattrupdan.github.io/2020-03-09-quantile-regression/
https://www.jstor.org/stable/1913643?seq=1
44
Quantile vs. L2 loss function
▪ The method of least squares estimates the conditional mean of the response variable
across values of the predictor variables
▪ Quantile regression estimates the conditional median (or other quantiles) of the response
variable.
▪ When do you use it? Quantile regression is an extension of linear regression used when the
conditions of linear regression are not met.

▪ Advantages? One advantage of quantile regression relative to ordinary least squares


regression is that the quantile regression estimates are more robust against outliers in the
response measurements. However, the main attraction of quantile regression goes beyond
this and is advantageous when conditional quantile functions are of interest.

https://www.wikiwand.com/en/Quantile_regression
45
Quantile vs. L2 loss function: how to read the
formula
Y-hat is the model prediction. The weighting (=q) of the errors appears as the slope.

Case #1, q = 0.5: Remember that q works as a weight. AE = absolute error. As you can see, the two sides of the graph are weighted in the same way; after all, 0.5 is the median.

Case #2, q = 0.1: As you can see, the two sides of the graph are weighted asymmetrically. In particular, negative errors (overpredictions) are given more weight than underpredictions. This means we are penalizing the x < 0 case more.

Case #3, q = 0.9: As you can see, the two sides of the graph are weighted asymmetrically. In particular, negative errors (overpredictions) are given less weight than underpredictions. This means we are penalizing the x > 0 case more.
46
Another way to understand quantile loss
• Slope = weighting of the errors (q or q-1)
• True value = 10.
• Error = true - prediction
• There are four possibilities for the predictions:
• Prediction = 15 with Quantile = 0.1. Actual < Predicted; Loss = (0.1 - 1) * (10 - 15) =
4.5
• Prediction = 5 with Quantile = 0.1. Actual > Predicted; Loss = 0.1 * (10 - 5) = 0.5

• Predicted = 15 with Quantile = 0.9. Actual < Predicted; Loss = (0.9 - 1) * (10 - 15) =
0.5
• Predicted = 5 with Quantile = 0.9. Actual > Predicted; Loss = 0.9 * (10 - 5) = 4.5
• For cases where the quantile > 0.5 we penalize underpredictions more
heavily.
• For cases where the quantile < 0.5 we penalize overpredictions more heavily.
• Remember that the quantile gives you the scalar below which you’d like to
find the points. Q = 0.8 means you are given a scalar value below which
80% of the points lie.
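A tiny Python sketch (a hypothetical helper, not from the slides) that reproduces the four cases above:

def quantile_loss(q, actual, predicted):
    # error = actual - predicted; weight q on underpredictions, (q - 1) on overpredictions
    error = actual - predicted
    return q * error if error >= 0 else (q - 1) * error

# true value = 10
print(quantile_loss(0.1, 10, 15))  # -> 4.5  overprediction, heavily penalised at low quantiles
print(quantile_loss(0.1, 10, 5))   # -> 0.5
print(quantile_loss(0.9, 10, 15))  # -> 0.5
print(quantile_loss(0.9, 10, 5))   # -> 4.5  underprediction, heavily penalised at high quantiles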

47
Quantile & asymmetric Laplace distribution

▪ If we were to take the negative of the
individual loss and exponentiate it, we get the
distribution known as the Asymmetric Laplace
distribution, shown below.

▪ The reason that this loss function works is
that if we were to find the area under the
graph to the left of zero, it would be alpha,
the required quantile.

https://towardsdatascience.com/deep-quantile-regression-c85481548b5a
48
Quantile crossover
▪ Risk? Because each quantile model is a simple rerun, there is a risk of quantile crossover,
i.e. the 49th quantile may go above the 50th quantile at some stage.

▪ The article suggests a possible fix: to make sure that we avoid this, a non-median quantile,
for instance 0.95, can be modelled as (the 0.5 node + sigma * sigmoid(0.95 node)), where
sigma is the maximum we would expect the 0.95 quantile to deviate away from the median.
A similar idea can be enforced on the 0.05 quantile, with a negative sign on sigma * sigmoid(0.05
node) instead.

49
https://towardsdatascience.com/deep-quantile-regression-c85481548b5a
Linear regression vs. quantile regression

Rodriguez, Robert N., and Yonggang Yao. "Five things you should know about quantile regression."
Proceedings of the SAS Global Forum 2017 Conference, Orlando. 2017.
50
Le Cook, Benjamin, and Willard G. Manning. "Thinking beyond the mean: a practical guide for using quantile
regression methods for health services research." Shanghai archives of psychiatry 25.1 (2013): 55.
What is quantile regression?
▪ The regression line from an ordinary least squares (OLS) regression
model is essentially flat, suggesting that there is no relationship
between number of psychotherapy session-hours and mental
health at follow-up.
▪ Although the median line is flat as before, the 90th quantile
prediction line is significantly increasing whereas the 10th quantile
prediction line is significantly decreasing.
▪ In OLS regression, the goal is to minimize the distances between
the values predicted by the regression line and the observed
values.
▪ In contrast, quantile regression differentially weights the distances
between the values predicted by the regression line and the
observed values, then tries to minimize the weighted distances.
▪ This weighting ensures that minimization occurs when xx percent of
the residuals are negative.

51
Differentially weights and penalisation, how do they
come together in the quantile loss function?

▪ Consider three points on a vertical line at different distances from


each other: upper point, middle point and lower point. In this
one-dimensional example, the absolute error is the same as the
distance.
▪ We will start at the middle point and move upward getting closer to
the upper point but further, by the same distance, to both the middle
and lower points. This will obviously increase the mean absolute error
(i.e. mean distance to the three points). The same applies if moving
downwards.
▪ If instead of having a single upper point and a single lower point, we
had one hundred points above and below, or any other arbitrary
number, the result still stands.

https://www.evergreeninnovations.co/blog-quantile-loss-function-for-machine-learning/
52
Smooth and non-smooth quantile
regression objective function: #1
▪ Both the scikit-learn GradientBoostingRegressor and
CatBoost use the non-smooth standard definition of
Quantile Regression.
▪ There is a singularity at (0, 0), so it is not C1, because the
right and left derivatives at that point are different in
sign.
▪ This is an issue, as gradient boosting methods require
an objective function of class C2, i.e. one that can be
differentiated twice to compute the gradient and
hessian matrices.

https://towardsdatascience.com/confidence-intervals-for-xgboost-cac2955a8fde
53
Smooth and non-smooth quantile
regression objective function: #2
▪ These quantile regression functions are simply the MAE
(Mean Absolute Error), scaled and rotated.
▪ The figure above also shows a regularized version of the
MAE, the logcosh objective (log of the hyperbolic cosine).
▪ As you can see, this objective is very close to the MAE, but
is smooth, i.e. its derivative is continuous and
differentiable.

https://towardsdatascience.com/confidence-intervals-for-xgboost-cac2955a8fde
54
Smooth and non-smooth quantile
regression objective function: #3
▪ All we need to do now is to find a way to rotate and
scale this objective so that it becomes a good
approximation of the quantile regression objective.
Nothing complex here.

▪ As logcosh is similar to the MAE, we apply the same
kind of change as for the Quantile Regression, i.e. we
scale it using alpha.
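A hedged sketch of this idea as a custom XGBoost objective (an illustration of the rotate-and-scale trick, not the article's exact code; the asymmetric alpha weighting below is an assumption):

import numpy as np

def logcosh_quantile(alpha):
    # Smooth, twice-differentiable surrogate of the quantile loss:
    # loss(err) = w * log(cosh(err)), with w = alpha for under-predictions
    # (err >= 0) and w = 1 - alpha for over-predictions (err < 0).
    def objective(y_pred, dtrain):
        err = dtrain.get_label() - y_pred
        w = np.where(err >= 0, alpha, 1.0 - alpha)
        grad = -w * np.tanh(err)        # d loss / d y_pred; tanh saturates for large errors
        hess = w / np.cosh(err) ** 2    # second derivative, strictly positive
        return grad, hess
    return objective

# usage sketch (dtrain is an xgboost.DMatrix built from your data):
# booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
#                     num_boost_round=200, obj=logcosh_quantile(0.9))

Note that the tanh saturation for very large errors is exactly the downside discussed on the next slide.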

https://towardsdatascience.com/confidence-intervals-for-xgboost-cac2955a8fde
55
Smooth and non-smooth quantile regression
objective function: #3.1
▪ A little more on the relationship between the quantile loss and the MAE.

▪ We have seen how the quantile regression functions are simply the MAE (Mean Absolute
Error), scaled and rotated.
▪ The most commonly used quantile is the median.
▪ If q is substituted with 0.5, the mean absolute error function is obtained, which predicts the
median.
▪ This is equivalent to saying that the mean absolute error loss function has its minimum at
the median.

https://www.evergreeninnovations.co/blog-quantile-loss-function-for-machine-learning/
56
Smooth and non-smooth quantile regression
objective function: #4
▪ Why does combining two non-linear functions like log and cosh result in such a simple, near
linear curve?

https://towardsdatascience.com/confidence-intervals-for-xgboost-cac2955a8fde
57
Log cosh loss: any downside?
▪ It still suffers from the problem of the gradient and hessian for very large off-target predictions
being constant, therefore resulting in the absence of splits for XGBoost.
▪ This absence of splits was actually captured in another article here:
https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b

58
https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
Quantile regression: pros & cons #1
▪ We only have to fit two extra models.
▪ It accounts for heteroscedasticity, simply because our models
are not just outputting constants.
▪ One notable weakness of the quantile prediction intervals is
that the model is quantifying its own uncertainty. This means
that we are reliant on the model being able to correctly fit the
data, so in a case where the conditional means of the data
follow a linear trend but the quantiles don’t, we would then
have to choose a non-linear model to get correct prediction
intervals.
▪ Further, if we’re overfitting the training data then the
prediction intervals will also become overfitted. (That is
obviously not really a weakness; how could it be otherwise?)

https://saattrupdan.github.io/2020-03-09-quantile-regression/
59
Quantile regression: pros & cons #2
▪ We can remedy the latter by creating confidence intervals around the quantile predictions,
but then we’re back at either the homoscedasticity scenario if we choose to create
parametric confidence intervals, or otherwise we have to bootstrap again, losing what I
think is the primary benefit of the quantile approach for prediction intervals.

▪ In short, the article suggests using quantile regression when dealing with heteroscedastic
data (with confidence intervals included if bootstrapping is feasible), or when dealing with
an accurate predictive model that takes a long time to train, such as neural nets.

https://saattrupdan.github.io/2020-03-09-quantile-regression/
60
Gradient Boosting Regressor with quantile loss: #1
▪ Predicting Intervals with the Gradient Boosting Regressor. GB builds an additive model in a forward
stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.
▪ In each stage a regression tree is fit on the negative gradient of the given loss function. From sklearn
userguide we know these are the options: loss{‘ls’, ‘lad’, ‘huber’, ‘quantile’}.

▪ For the lower prediction: loss='quantile' and alpha=lower_quantile (for example, 0.1 for the
10th percentile)
▪ For the upper prediction: loss='quantile' and alpha=upper_quantile (for example, 0.9 for the
90th percentile)
▪ For the mid prediction: (loss="quantile", alpha=0.5), which predicts the median, or the default
loss="ls" (for least squares), which predicts the mean and is what we are going to use
▪ When we change the loss to quantile and choose alpha (the quantile), we’re able to get predictions
corresponding to percentiles. If we use lower and upper quantiles, we can produce an estimated range,
which is exactly what we want (see the sketch below).
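A minimal sketch of the three-model setup with scikit-learn (synthetic data and illustrative hyper-parameters; note that recent scikit-learn versions renamed loss='ls' to loss='squared_error'):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)

# lower and upper quantile models bracket the prediction; the mid model is the point estimate
lower = GradientBoostingRegressor(loss="quantile", alpha=0.1, n_estimators=200)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9, n_estimators=200)
mid = GradientBoostingRegressor(loss="squared_error", n_estimators=200)  # 'ls' in older versions

for model in (lower, mid, upper):
    model.fit(X, y)

lo, pred, hi = (m.predict(X[:5]) for m in (lower, mid, upper))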
Gradient Boosting Regressor with quantile loss: #2

Mid -> obtained using the LS (least squares) loss function. CP = 90%.

https://towardsdatascience.com/how-to-generate-prediction-intervals-with-scikit-learn-and-python-ab3899f992ed
https://nbviewer.jupyter.org/github/WillKoehrsen/Data-Analysis/blob/master/prediction-intervals/prediction_intervals.ipynb
Issues
▪ The quantile approach relies on the models learning through gradient descent. If that is not
available we have to change method!
▪ Because of the nature of the Gradient and Hessian of the quantile regression cost-function,
XGBoost is known to heavily underperform.
▪ The article shows how by adding a randomized component to a smoothed Gradient,
quantile regression can be applied successfully.

https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b
63
Issues: is quantile regression statistics?
▪ It must be kept in mind that the resulting confidence (from quantile regression) intervals
are a model approximation rather than true statistics.
▪ It is also important to note that the learned quantile models do not have to be necessarily
consistent with each other, as they are all learned separately.
▪ The article states “do not have to be”, but if they are then it is even better, isn’t it?

Natekin, Alexey, and Alois Knoll. "Gradient boosting machines, a tutorial." Frontiers in Neurorobotics 7 (2013): 21.
64
There is more than one median and this causes issues!

▪ When using extreme quantiles, it can be
the case that the intervals are very distant.
This does not mean they are wrong,
though, but in practice they are useless.

▪ Both lines (red and orange) separate the
observations in half, i.e. the probability that
an observation is greater than the
predicted value on the whole data set is
50%. However, in practice, a static median
value is less useful. In fact, it has a higher
variance and underfits the data.

65
https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb
Other useful links on quantile loss function
▪ https://stats.stackexchange.com/questions/213050/scoring-quantile-regressor [discussion
on how to assess how good your quantile model is]
▪ http://jmarkhou.com/lgbqr/#mjx-eqn-quantileloss [how LightGBM implements quantile
regression]

66
How do you tune a quantile loss function? Which
metrics shall we use? Is MSE still appropriate?
▪ Contrary to classic forecasts, where the goal is to have the forecast as close as possible
to the observed values, the situation is biased (on purpose) when it comes to quantile
forecasts.
▪ The pinball loss function returns a value that can be interpreted as the accuracy of a
quantile forecasting model. The pinball loss function (in red) has been named after its shape,
which looks like the trajectory of a ball on a pinball machine. The lower the pinball loss, the more
accurate the quantile forecast.
▪ Let τ be the target quantile, y the real value and z the quantile forecast; then Lτ, the pinball
loss function, can be written:

Lτ(y, z) = (y − z)·τ if y ≥ z, and (z − y)·(1 − τ) if y < z

https://www.lokad.com/pinball-loss-function-definition

67
Expectile loss
(quadratic quantile regression really!)

68
Expectile regression

▪ MAE = mean absolute error

▪ MQL = mean quantile loss = weighted L1 sum (sometimes referred to as the check loss)

▪ MSE = mean squared error

▪ The quadratic (squared loss) analog of quantile regression
is expectile regression.

https://stats.stackexchange.com/questions/37955/how-to-design-and-implement-an-asymmetric-loss-function-for-regression
69
Quantile and its generalization -> k-th power
expectile regression
▪ Quantile regression -> L1 -> k = 1 -> robust -> requires no
moment condition. The check loss is singular at r = 0.
▪ Expectile regression -> L2 -> k = 2 -> sensitive to tails and more
effective for Gaussian distributions. Imposes the condition that
the mean of the true distribution underlying the data exists.
Differentiable everywhere.

▪ In order to balance robustness and effectiveness, we adopt a
loss function which falls in between the above two loss
functions, to introduce a new kind of expectiles and develop
an asymmetric least k-th power estimation method that we
call the k-th power expectile regression, 1 < k < 2.
▪ k = 1.5 seems to be a good compromise

Jiang, Yingying, Fuming Lin, and Yong Zhou. "The k-th power expectile regression." Annals of the Institute of Statistical Mathematics (2019): 1-31.
70
Link btw hinge loss and L1, L2 and Huber loss
▪ Hinge loss aka L1: such a loss function penalizes errors linearly, expressing an error when the
example is classified on the wrong side of the margin, proportional to its distance from the margin
itself. Though convex, it has the disadvantage of not being differentiable everywhere.
▪ Squared hinge loss aka L2: an always-differentiable variant.
▪ Another variant is the Huber loss, which is a quadratic function when the error is equal to or below a
certain threshold value h, but a linear function otherwise. Such an approach mixes the L1 and L2
variants of the hinge loss based on the error, and it is an alternative quite resistant to outliers, as
larger error values are not squared, thus requiring fewer adjustments by the learning SVM. Huber
loss is also an alternative to log loss (linear models) because it is faster to calculate and able to
provide estimates of class probabilities (the hinge loss does not have such a capability).
▪ So which one do I choose? There is no evidence that the Huber loss or the L2 hinge loss can consistently
perform better than the hinge loss.

Sjardin, Bastiaan, Luca Massaron, and Alberto Boschetti. Large Scale Machine Learning with Python. Packt Publishing Ltd, 2016.
71
References
▪ https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.832.103&rep=rep1&type=pdf
[expectiles lack an intuitive interpretation. Numerically, quantiles ‘live’ in the L1 world,
while expectiles are rooted in the L2 world.]
▪ http://www.sp.unipg.it/surwey/images/workshop/nikos/Session-1_Pisa_July-2016.pdf
[quantile, expectile and Huber M-quantile]
▪ https://conservancy.umn.edu/bitstream/handle/11299/163935/Yuan_umn_0130E_14824.
pdf?sequence=1 [phd thesis]
▪ https://sci-hub.se/https://link.springer.com/article/10.1007/s10463-019-00738-y [best
explanation between quantile and expectile and M-quantile?]

72
Bootstrapping
[for uncertainty quantification]

73
Bootstrapping for both confidence and
prediction interval
▪ Bootstrapping means sampling with replacement. We simply start with a sample,
and then resample from it a number of times with replacement (to ensure
independence), each bootstrapped sample having the same size as our original sample (so the bootstrapped
samples will contain duplicates). See the sketch at the end of this slide.

▪ Features:
▪ Bootstrapping works even if we are dealing with an unconventional statistic with an
unknown distribution. That makes it very general.
▪ We can then compute our statistic for every bootstrapped sample and look at the
distribution of these. As simple as that!
▪ The remarkable thing about bootstrapping is that the distribution of the bootstrapped
statistics approximates the distribution of the sample statistics.

https://saattrupdan.github.io/2020-02-20-confidence/
74
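A minimal Python sketch of a percentile-bootstrap confidence interval (an illustrative helper, not taken from the blog post):

import numpy as np

def bootstrap_ci(sample, statistic=np.mean, n_boot=10000, level=0.95, seed=0):
    # Resample with replacement, compute the statistic on every resample,
    # and read the CI off the percentiles of the bootstrapped statistics.
    rng = np.random.default_rng(seed)
    stats = np.array([statistic(rng.choice(sample, size=len(sample), replace=True))
                      for _ in range(n_boot)])
    return np.percentile(stats, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])

skewed = np.random.default_rng(1).exponential(scale=2.0, size=200)
print(bootstrap_ci(skewed, np.median))  # works even though the median here is not Gaussian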
.632+ bootstrap estimate
▪ As a general observation (more details are provided later), the question “Which error
shall we use?” is important.

▪ Training errors will usually be too small as we tend to overfit, so we have to rely on the
validation errors somewhat.
▪ The validation errors will tend to be slightly too large however, as a bootstrap sample only
contains roughly 2/3 of the training data on average, meaning that the predictions will be
artificially worsened.

▪ As a compromise between the training- and validation errors there is a method called
“.632+ bootstrap estimate”.

https://saattrupdan.github.io/2020-02-20-confidence/
75
https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping
Cons
▪ The bootstrap prediction intervals requires us to train the model on a large number of
bootstrapped samples, which is unfeasible if training the model takes many hours or days.
An alternative to this is quantile regression.
▪ The bootstrap approach fails to account for heteroscedasticity. Attempts have been made to
make the bootstrap approach account for this, e.g. using the wild bootstrap, but this is
unsuitable for new predictions as it requires knowledge of the residuals.

https://saattrupdan.github.io/2020-03-09-quantile-regression/
76
How the bootstrap fails with heteroscedasticity
▪ To see an example of how the quantile approach deals
with heteroscedasticity, let’s multiply our noise terms
ε∼N(0,1) with our independent X, so that the
observations become more noisy over time.

▪ We see that the quantile approaches really shine when


compared to the bootstrap approaches.

https://saattrupdan.github.io/2020-03-09-quantile-regression/
77
Where is the .632+ bootstrap estimate coming from? #1

▪ A naïve estimate of prediction error is the average loss over the
training set, err = (1/N) * sum_i L(y_i, f_hat(x_i)), where L is some loss
function (MSE -> L2). This is often called the training error, and
some call it the apparent error rate or resubstitution rate. It's not
very good, since we use our data to fit the model. This results in the
estimate being downward biased (read: optimistic). What you really want to
know is how well your model does in predicting new values.

▪ Often we use cross-validation (via k-fold, for instance) as a simple
way to estimate the expected extra-sample prediction error. Our
cross-validated extra-sample prediction error is just the average
of the fold errors. This estimator is only approximately unbiased for the true
prediction error when K=N, and it has larger variance and is more
computationally expensive for larger K. So once again we see the
bias–variance trade-off at play.

https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping
78
Where is the .632+ bootstrap estimate coming from? #2

▪ Instead of cross-validation we could use the bootstrap
to estimate the extra-sample prediction error.
Unfortunately, this is not a particularly good estimator,
because a bootstrap sample may contain the very
observation being predicted.

▪ The leave-one-out bootstrap estimator offers an
improvement by mimicking cross-validation: each
observation is predicted only by bootstrap samples that
do not contain it. This solves
the overfitting problem, but it is still biased (this one is
upward biased — more bias, remember the tradeoff?). The
bias is due to non-distinct observations in the
bootstrap samples that result from sampling with
replacement. The average number of distinct
observations in each sample is about 0.632*N. This is
where the ~2/3 factor comes in.
79
https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping
Where is the .632+ bootstrap estimate coming from? #3

▪ To solve the bias problem, the 0.632 estimator was proposed:

Err(.632) = 0.368 * err_train + 0.632 * Err(1)

where err_train is the naïve estimate of prediction error, often called the training error (downward biased),
and Err(1) is the leave-one-out bootstrap error (upward biased).

80
https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping
Where is the .632+ bootstrap estimate coming from? #4

▪ However, if we have a highly overfit prediction function then even the .632 estimator will be
downward biased.
▪ The .632+ estimator is designed to be a less-biased compromise between the naïve and the
leave-one-out error.
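For reference (not shown on the slide), the .632+ estimator as given in Efron & Tibshirani (1997) takes the form

Err(.632+) = (1 − w) * err_train + w * Err(1), with w = 0.632 / (1 − 0.368 * R)

where R = (Err(1) − err_train) / (gamma − err_train) is the relative overfitting rate and gamma is the no-information error rate (the error obtained when predictions and targets are paired at random).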

https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping
81
Dropout for ANN uncertainty

82
Dropout for heteroscedasticity
▪ Why do the uncertainty estimates we get from dropout
networks not increase to cover the
points at the far right-hand side of the plane?
▪ To understand what's going on, we need to talk about
homoscedasticity versus heteroscedasticity.
▪ Homoscedastic regression assumes identical observation
noise for every input point x. Heteroscedastic regression,
on the other hand, assumes that the observation noise can
vary with the input x. Heteroscedastic models are useful in
cases where parts of the observation space might have
higher noise levels than others.
▪ Using dropout we get homoscedastic model uncertainty.
We can easily adapt the model to obtain data-dependent
noise.

https://github.com/yaringal/HeteroscedasticDropoutUncertainty
83
Methods that consider the prediction error
distributed normally

84
Intro
▪ Let's assume the predictions follow a normal distribution.
▪ This is a strong assumption, but it seems to be good enough, so it is still commonly used.
▪ We'll have two models: one for the prediction and one for its error.
▪ So your prediction is the mean of the Gaussian distribution, whereas the error estimate is its
standard deviation.

https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb
85
Option #1 - Training a machine learning model to predict values, and using its RMSE to compute the error.
This model assumes that the standard deviation of the normal distribution is constant. Essentially, the error is
constant for all the predictions.

Option #2 - Training a machine learning model to predict values and errors (see the sketch below).
In practice, the error is not always constant (it depends on the features). Bear in mind that we are addressing the
fact that it may not be constant, but what we get is still a symmetric estimate of the variability. Therefore, as an
improvement, we can fit a model to learn the error itself. As before, the base model is learnt from the training data.
Then, a second model (the error model) is trained on a validation set to predict the squared difference between the
predictions and the real values. The standard deviation of the distribution is computed by taking the root of the
error predicted by the error model.
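A minimal Python sketch of Option #2 (synthetic heteroscedastic data and illustrative model choices):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1 + 0.05 * X[:, 0])  # noise grows with x

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

base = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
# error model: learns the squared residuals on held-out data
sq_resid = (y_val - base.predict(X_val)) ** 2
error_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_val, sq_resid)

def predict_with_interval(X_new, z=1.96):  # 95% interval under the normality assumption
    mean = base.predict(X_new)
    std = np.sqrt(np.clip(error_model.predict(X_new), 0.0, None))  # clip is a safety net
    return mean - z * std, mean, mean + z * std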

86
https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb
Where is all this uncertainty coming from?
(the focus is more on the theory than on anything else)

87
Aleatoric and epistemic uncertainty
• Aleatoric uncertainty (aka statistical uncertainty), and is representative of
unknowns that differ each time we run the same experiment. Aleatoric
uncertainty refers to the inherent uncertainty due to probabilistic variability.
This type of uncertainty is irreducible, in that there will always be variability in
the underlying variables. These uncertainties are characterized by a probability
distribution. The quantification for the aleatoric uncertainties can be relatively
straightforward, where traditional (frequentist) probability is the most basic
form. Techniques such as the Monte Carlo method are frequently used.
• Epistemic uncertainty (aka systematic uncertainty), and is due to things one
could in principle know but does not in practice. Epistemic uncertainty is the
scientific uncertainty in the model of the process. It is due to limited data and
knowledge. Epistemic uncertainty is generally understood through the lens of
Bayesian probability.
• Aleatoric and epistemic uncertainty can also occur simultaneously in a single
term.

https://www.kdnuggets.com/2022/04/uncertainty-quantification-artificial-intelligencebased-systems.html
