Prediction interval for neural network models using weighted asymmetric loss functions

Milo Grillo∗1 and Agnieszka Werpachowska†2

1 Humboldt University of Berlin, Germany
2 Savvie AS, Oslo, Norway

arXiv:2210.04318v2 [stat.ML] 11 Oct 2022

October 13, 2022

Abstract
We develop a novel and simple method to produce prediction intervals
(PIs) for fitting and forecasting exercises. It finds the lower and upper
bound of the intervals by minimising a weighted asymmetric loss function,
where the weight depends on the width of the interval. We give a short
mathematical proof. As a corollary of our proof, we find PIs for values
restricted to a parameterised function and argue why the method works for
predicting PIs of dependent variables. The results of applying the method
on a neural network deployed in a real-world forecasting task prove the
validity of its practical implementation in complex machine learning set-ups.

1 Introduction
Neural network models are increasingly often used in prediction tasks, for exam-
ple in weather [1], water level [2], price [3], electricity grid load [4], ecology [5],
demographics [6] or sales forecasting. However, their often cited weakness is
that—in their vanilla form—they provide only point predictions. Meanwhile,
their users are very often interested also in prediction intervals (PIs), that is,
ranges [l, u] containing the forecasted value with a given probability (e.g. 95%).
Several approaches have been proposed to facilitate the estimation of PIs
(see [1, 2, 4, 5, 7–14] and references therein):
1. the delta method, which assumes that prediction errors are homogeneous
and normally distributed [10, 12];
∗ milo.grillo@hu-berlin.de
† a.m.werpachowska@gmail.com

2. Bayesian inference [8], which requires a detailed model of sources of uncertainty, and is extremely expensive computationally for realistic forecasting scenarios [14];
3. Generalized Likelihood Uncertainty Estimation (GLUE) [7], which requires multiple runs of the model with parameters sampled from a distribution specified by the modeller;
4. bootstrap [11], which generates multiple training datasets, leading to high computational cost for large datasets [14];
5. Mean-Variance Estimation (MVE) [9], which is less computationally demanding than the methods mentioned above but also assumes a normal distribution of errors and gives poor results [14];
6. Lower Upper Bound Estimation (LUBE), which trains the neural network model to directly generate estimations of the lower and upper bounds of the prediction interval using a specially designed training procedure with tunable parameters [1, 2, 4, 14].
All the proposed methods are either overly restrictive (the delta method, MVE)
or too expensive. Our proposed method is closest in spirit to LUBE—we train a
model to predict either the lower or the upper bound of the PI—but simpler and
less computationally expensive, because it does not require any parameter tuning.

2 Problem statement
We consider a prediction problem X ∋ x ↦ y ∈ R, where X is a space of
features (e.g. R^d) and y is the predicted variable. We assume that the observed data
D := {(x, y)}_N ⊂ X × R are N realisations of a pair of random variables (X, Y)
with an unknown joint distribution P. We also consider a model g_θ which,
given x ∈ X, produces a prediction g_θ(x), where θ denotes the model parameters in a
parameter space Θ. When forecasting, the prediction is also a function of an
independent “time” variable t, which is simply included in X.

The standard model training procedure aims to find such θ that, given x ∈ X,
g_θ(·) is a good point estimate of Y|X, e.g. g_θ(x) ≈ E[Y|X = x]. This is achieved
by minimising a loss function of the form l(y, y′) = l(y − y′) with a minimum
at y = y′ and increasing sufficiently fast for |y − y′| → ∞, where y is the observed
target value and y′ is the model prediction. More precisely, we minimise the
sample average of the loss function l over the parameters θ:

    θ̂ = arg min_θ Σ_{i=1}^N l(y_i, g_θ(x_i)) .

The above procedure can be given a simple probabilistic interpretation by
assuming that the target value y is a realisation of a random variable Y with
the distribution µ + Z, where µ is an unknown “true value” and Z is an i.i.d.
error term with a probability density function ρ(z) ∼ exp(−l(z)). Two well-known
loss functions, Mean Squared Error (MSE) and Mean Absolute Error (MAE),
correspond to assuming a Gaussian or Laplace distribution for Z, respectively.
Minimising the loss function l then corresponds to the maximum log-likelihood
estimation of the unknown parameter µ, since ln P(y|µ) ∼ −l(y − µ).

In this paper we focus on the MAE, in which case the average loss function
(i.e. the negative log-likelihood) given an i.i.d. sample [y_i]_{i=1}^N can be written as

    (1/N) Σ_{i=1}^N |y′ − y_i| ,

which for N → ∞ equals E[|y′ − Y|]. The optimal value of y′, namely ŷ, equals

    ŷ = arg min_{y′ ∈ R} E[|Y − y′|]  for  Y = µ + L ,

where L has the Laplace distribution with density ρ_L(z) = e^{−|z|}/2. The minimum
fulfils the condition ∂E[|Y − y′|]/∂y′ = 0. Since

    E[|Y − y′|] = E[|µ + L − y′|]
                = (1/2) ∫_{y′−µ}^{∞} e^{−|z|} (z + µ − y′) dz − (1/2) ∫_{−∞}^{y′−µ} e^{−|z|} (z + µ − y′) dz ,

we have

    ∂E[|Y − y′|]/∂y′ = (1/2) ∫_{−∞}^{y′−µ} e^{−|z|} dz − (1/2) ∫_{y′−µ}^{∞} e^{−|z|} dz ,

which is zero iff y′ − µ = 0, hence ŷ = µ. For a finite sample [y_i]_{i=1}^N, ŷ is the
sample median, which approaches µ as N → ∞.
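This sample-median property is easy to check numerically. A small illustrative sketch (ours, not from the paper), minimising the sample MAE over a grid of constant predictions:

```python
import numpy as np

# Sample MAE as a function of a constant prediction y'.
y = np.array([1.0, 2.0, 10.0])
grid = np.linspace(0.0, 12.0, 1201)            # candidate values of y'
mae = [np.abs(y - v).mean() for v in grid]
best = grid[np.argmin(mae)]                    # minimiser of the sample MAE
# best coincides with np.median(y) = 2.0
```

Note that the minimiser is the median 2.0, not the mean 13/3: the MAE is insensitive to how far the outlier 10.0 lies above it.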
We are going to generalise the above result and train the model to predict
a given percentile of the distribution Y|X, corresponding to the lower or upper
bound of the PI. We show semi-analytically that this can be achieved in an
efficient way by training the model minimising a weighted MAE loss function,
i.e. one that is not symmetric around 0:

    l(y − y′) = 2 [ α (y − y′)_+ + (1 − α) (y − y′)_− ] ,

for α ∈ (0, 1).
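For concreteness, the weighted loss above can be written in a few lines of code (a sketch; the function name and the NumPy formulation are ours, not the paper's):

```python
import numpy as np

def weighted_mae(y_true, y_pred, alpha):
    """Weighted asymmetric MAE: errors where y > y' (under-prediction)
    are weighted by alpha, errors where y < y' by 1 - alpha.
    For alpha = 0.5 this reduces to the ordinary MAE."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return 2.0 * np.mean(alpha * np.maximum(e, 0.0)
                         + (1.0 - alpha) * np.maximum(-e, 0.0))
```

Large α punishes under-prediction more, pushing the optimal prediction up towards a high percentile; small α does the opposite.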

3 Main result
The main mathematical result is based on the insight formalised in Theorem 1.
This theorem relates the choice of loss function to the resulting
arg min, for a specific class of loss functions.

Theorem 1. Let l be a loss function defined by

    l(y, ŷ) = { (1 − α)|y − ŷ|   for y − ŷ < 0,
              { α|y − ŷ|         otherwise,                                  (1)

with α ∈ (0, 1). Let Y be a random variable on a probability space (Ω, F, P)
with distribution µ_Y := Y#P. The value ŷ minimising E[l(Y, ŷ)] is the α-th
percentile of Y.
Proof. The goal is to find the ŷ which minimises E[l(Y, ŷ)]. For this ŷ, we have

    0 = ∂/∂ŷ E[l(Y, ŷ)]
      = ∂/∂ŷ [ ∫_{−∞}^{ŷ} −(1 − α)(y − ŷ) dµ_Y(y) + ∫_{ŷ}^{∞} α(y − ŷ) dµ_Y(y) ]
      = (1 − α) ∫_{−∞}^{ŷ} dµ_Y(y) − α ∫_{ŷ}^{∞} dµ_Y(y)                    (2)
      = (1 − α)F_Y(ŷ) − α(1 − F_Y(ŷ)) ,

where F_Y(ŷ) = ∫_{−∞}^{ŷ} dµ_Y(y) is the CDF of Y. Hence, ŷ = F_Y^{−1}(α).
Remark 1. While the derivation works for any distribution µ_Y which admits an
α-th percentile, the interpretation of MAE minimisation as maximum likelihood
estimation of ŷ (for α = 1/2) is valid only when Y has a density function f,
which is the PDF of the Laplace distribution. Bear in mind, though, that our
result does not require the assumption of an independent Laplace distribution for
the error term.

Remark 2. An analogous calculation shows that minimising Σ_i l(y_i, ŷ) over ŷ
leads to ŷ equal to the α-th percentile of the sample {y_i}.
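Remark 2 can be verified numerically. The following sketch (ours, with an arbitrary Gaussian sample) minimises the sample loss over a grid and compares the minimiser against the empirical percentile:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
alpha = 0.95

def sample_loss(yhat):
    # Sample average of the loss (1)
    e = y - yhat
    return np.mean(alpha * np.maximum(e, 0.0) + (1 - alpha) * np.maximum(-e, 0.0))

grid = np.linspace(-4.0, 4.0, 4001)
yhat_star = grid[np.argmin([sample_loss(v) for v in grid])]
# yhat_star is close to the empirical 95th percentile, np.quantile(y, 0.95)
```

The agreement is limited only by the grid resolution, since the minimiser of the sample loss is exactly the sample α-percentile.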
Corollary 2. If ŷ is restricted to ŷ = g(θ) for some differentiable function g :
R^d → R, the theorem still holds assuming that ∂g(θ)/∂θ ≠ 0 for all θ. In
the proof, the derivative w.r.t. ŷ is replaced by a derivative w.r.t. θ, and ŷ by
g(θ). In the last two lines of the proof, ∂g(θ)/∂θ appears in front of both integrals.
Assuming that ∂g(θ)/∂θ ≠ 0 (mind that this is a gradient, not a scalar derivative,
and thus only one of its components must be non-zero), the proof is still valid.
Proof.

    0 = ∂/∂θ E[l(Y, g(θ))]
      = ∂/∂θ [ ∫_{−∞}^{g(θ)} −(1 − α)(y − g(θ)) dµ_Y(y) + ∫_{g(θ)}^{∞} α(y − g(θ)) dµ_Y(y) ]
      = ∂g(θ)/∂θ [ (1 − α) ∫_{−∞}^{g(θ)} dµ_Y(y) − α ∫_{g(θ)}^{∞} dµ_Y(y) ]    (3)
      = ∂g(θ)/∂θ [ (1 − α)F_Y(g(θ)) − α(1 − F_Y(g(θ))) ] .

Dividing both sides by ∂g(θ)/∂θ ≠ 0, we obtain g(θ) = F_Y^{−1}(α).

Remark 3. Since the only property of a minimum we have used when deriving
Theorem 1 and Corollary 2 is that the first derivative is zero at the minimum,
ŷ (or θ) could also be a local minimum, local maximum or even a saddle point.

4 Connection to problem statement


The result of the previous section can be connected with the problem stated in
Sec. 2 by considering the dependence of the predicted variable Y on the features X
(which include also the time variable). Assume that we have found such a model
parameter set θ that the expected value of the loss function l is minimised for
every value of X,

    ∂E[l(Y, g_θ(X))|X]/∂θ = 0

for any X. Then, the results of Section 3 can be easily modified to show that
g_θ(X) = F_{Y|X}^{−1}(α), by replacing g(θ) with g_θ(X), µ_Y with µ_{Y|X} (the distribution of
Y conditional on X) and F_Y with F_{Y|X} (the CDF of Y conditional on X), and the
assumption ∂g(θ)/∂θ ≠ 0 for all θ with the assumption ∂g_θ(X)/∂θ ≠ 0 for all
θ and X.
Standard model training, however, does not minimise E[l(Y, g_θ(X))|X] but
E[l(Y, g_θ(X))] (as mentioned above, averaging over all X in a large training set),
which is the quantity considered by its proofs of convergence [15]. The condition
∂E[l(Y, ŷ)]/∂θ = 0 then leads to

    0 = ∂/∂θ E[l(Y, ŷ)]
      = ∫_X ∂g_θ(x)/∂θ [ (1 − α)F_{Y|X=x}(g_θ(x)) − α(1 − F_{Y|X=x}(g_θ(x))) ] dµ_X(x) ,    (4)

where µ_X is the distribution of feature vectors X.


Consider now a model with a separate constant term, g_θ(x) = θ_0 + f_θ(x).
Most neural network prediction models are indeed of such a form. The component
of Eq. (4) for θ_0 has the form

    0 = ∫_X [ (1 − α)F_{Y|X=x}(g_θ(x)) − α(1 − F_{Y|X=x}(g_θ(x))) ] dµ_X(x)
      = (1 − α)E[F_{Y|X}(g_θ(X))] − α(1 − E[F_{Y|X}(g_θ(X))]) .              (5)

Hence, we have E[F_{Y|X}(g_θ(X))] = α.


We would like to make a stronger statement, namely that F_{Y|X}(g_θ(X)) = α
(under the assumption that ∂g_θ(X)/∂θ ≠ 0 for all θ and X), but the integration
over X precludes it. To support this conjecture we offer the following
heuristic arguments:

1. Recall that stochastic gradient descent (SGD) uses mini-batches to compute
the loss gradient. By training long enough with small enough mini-batches
sampled with replacement (to generate a greater variance of mini-batches),
we can hope to at least approximately ensure that E[l(Y, g_θ(X))] is
minimised “point-wise” w.r.t. X. This is supported by the fact that
E[l(Y, g_θ(X))] = ∫_X E[l(Y, g_θ(X))|X = x] ρ_X(dx). On top of that, section
3 in [16] and Theorem 1 in [17] suggest that SGD should still work for loss
functions of the form (1) for reasonable network choices.
2. If the model g_θ is flexible enough to be capable of globally minimising
E[l(Y, g_θ(X))|X] for every X (for example, it is produced by a very large
deep neural network), then this global pointwise minimum has to coincide
with the global “average” minimum of E[l(Y, g_θ(X))] and can be found by
training the model long enough. Formally, if there exists such a θ̂ that θ̂ =
arg min_θ E[l(Y, g_θ(X))|X] for all X, then also θ̂ = arg min_θ E[l(Y, g_θ(X))].

3. Recall that ∂g_θ(X)/∂θ is a vector, not a scalar. Therefore, Eq. (4) is actually
a set of conditions of the form E_X[A_i(X)B(X)] = 0, where A_i(X) =
∂g_θ(X)/∂θ_i and B(X) = (1 − α)F_{Y|X}(g_θ(X)) − α(1 − F_{Y|X}(g_θ(X))). The
larger the neural network, the more conditions we have on the pairs A_i B,
but they can all be automatically satisfied if B(X) ≡ 0. Moreover, the
value of B depends only on the predictions g_θ(X) produced by the model
and the distribution of Y given X (which is dictated by the data distribution D).
On the other hand, the derivatives ∂g_θ(X)/∂θ_i depend also on
the internal architecture of the neural network (for example, the type of
activation function used, number of layers, etc.). If, as is likely, there
exist multiple architectures equally capable of minimising the expected
loss, Eq. (4) must be true for all corresponding functions ∂g_θ(X)/∂θ_i at the
same time, which also suggests B(X) = 0.
4. Consider now the limit of a very large, extremely expressive neural network,
which achieves a global minimum of E[l(Y, g_θ(X))]. We can describe
it using the following parameterisation: let θ = [θ_i]_{i=1}^N and g_θ(x) ≈
Σ_{i=1}^N θ_i k_N(x − x_i), where x_i belongs to the support of the distribution of
X, ∫ k_N(x − x_i) ρ_X(dx) = 1, and k_N(d) > 0 approaches the Dirac delta
distribution in the limit N → ∞. We have ∂g_θ(x)/∂θ_i = k_N(x − x_i), and,
for each i,

    0 = ∫_X ∂g_θ(x)/∂θ_i [ (1 − α)F_{Y|X=x}(g_θ(x)) − α(1 − F_{Y|X=x}(g_θ(x))) ] ρ_X(dx)
      ≈ ∫ k_N(x − x_i) [ (1 − α)F_{Y|X=x}(g_θ(x)) − α(1 − F_{Y|X=x}(g_θ(x))) ] ρ_X(dx) .

In the limit of N → ∞, k_N(d) → δ(d) and the above integral becomes

    0 = (1 − α)F_{Y|X=x_i}(g_θ(x_i)) − α(1 − F_{Y|X=x_i}(g_θ(x_i)))  for all i.

At the same time, as N → ∞, the feature vectors x_i have to progressively
fill the entire support of the distribution of X, hence the above condition
is asymptotically equivalent to

    0 = (1 − α)F_{Y|X=x}(g_θ(x)) − α(1 − F_{Y|X=x}(g_θ(x)))  for all x ∈ X ,

or simply F_{Y|X}(g_θ(X)) = α.


5. As the opposite case to the above, consider a naive model g_θ(X) = θ_0 +
θ_1 · X describing a linear process, Y = y_0 + w · X + Z, where the error
term Z ⊥ X is i.i.d. and y_0, w are unknown constants. From Eq. (5) we
have E[F_{Y|X}(g_θ(X))] = α. Writing Eq. (4) for θ_1, we obtain

    0 = ∫ x [ (1 − α)F_{Y|X=x}(θ_0 + θ_1 · x) − α(1 − F_{Y|X=x}(θ_0 + θ_1 · x)) ] ρ_X(dx)
      = ∫ x [ (1 − α)F_Z(θ_0 − y_0 + (θ_1 − w) · x) − α(1 − F_Z(θ_0 − y_0 + (θ_1 − w) · x)) ] ρ_X(dx) .

It is easy to see that both conditions are satisfied by θ_1 = w and θ_0 =
y_0 + F_Z^{−1}(α). Then, we have F_{Y|X}(g_θ(X)) = α for all X. Hence, our method
works perfectly in the linear case with i.i.d. error terms.
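The linear case is also easy to reproduce empirically. The sketch below is our construction (full-batch subgradient descent rather than the paper's training set-up): it fits θ_0, θ_1 under the loss (1) with α = 0.9 on synthetic data with Laplace noise, and should recover θ_1 ≈ w and a constant term satisfying F_Z(θ_0 − y_0) ≈ α.

```python
import numpy as np

rng = np.random.default_rng(1)
n, y0, w, alpha = 20_000, 1.0, 2.0, 0.9
x = rng.uniform(-1.0, 1.0, n)
y = y0 + w * x + rng.laplace(size=n)        # Y = y0 + w*X + Z, Z ~ Laplace

theta = np.zeros(2)                          # [theta0, theta1]
lr = 0.1
for _ in range(2000):
    pred = theta[0] + theta[1] * x
    # subgradient of the weighted asymmetric MAE w.r.t. the prediction
    g = np.where(y > pred, -alpha, 1.0 - alpha)
    theta -= lr * np.array([g.mean(), (g * x).mean()])

# theta[1] should approach w = 2 and theta[0] should approach
# y0 + F_Z^{-1}(0.9) = 1 + ln 5 ≈ 2.61 for the standard Laplace CDF
```

This matches the analytic solution: for the standard Laplace distribution, F_Z(z) = 1 − e^{−z}/2 for z ≥ 0, so F_Z^{−1}(0.9) = ln 5.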

5 From theory to implementation


For any predictive data task, there are three neural networks: the
predicting neural network N_p, trained using a standard symmetric loss function,
and the predictors of the lower and upper bounds of the PI, N_l and N_u respectively.
If these models are trained separately, even if trained on the same data
set, one cannot guarantee that N_u(x) ≥ N_p(x) ≥ N_l(x) for all x ∈ X. At best,
we have it for most x ∈ X. In particular, for α close to 50%, this is a common issue.
This is due to the non-deterministic aspects of state-of-the-art training schemes.

1. To ensure that model parameters θ in the models are all similar, and thus
the inner logic of the neural networks is similar, we should first train the
models using a fixed training schedule and a fixed random seed.
2. We add a component to the loss function, which is 0 if the PI upper-
(lower-) bound is above (below) the predicted value, and ∞ otherwise.
3. To help with the previous trick, one could add the predicted value gener-
ated by Np as an extra input to Nl and Nu .
The first point is the easiest modification to justify, but also the least powerful
of the tricks. The third point changes the model structure of the PI neural
networks, which hurts the interpretation drawn in Section 4.
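Trick 2 can be sketched as an extra loss term. Below is a minimal NumPy illustration (our naming; a large finite penalty stands in for the infinite component, since a literal infinity would break gradient-based training):

```python
import numpy as np

def upper_bound_loss(y_true, y_upper, y_point, alpha, penalty=1e3):
    """Loss for the upper-bound model N_u: the weighted asymmetric MAE
    plus a penalty that is zero when the upper bound stays above the
    point prediction N_p, and grows steeply when it drops below."""
    e = np.asarray(y_true) - np.asarray(y_upper)
    pinball = np.mean(alpha * np.maximum(e, 0.0)
                      + (1.0 - alpha) * np.maximum(-e, 0.0))
    crossing = penalty * np.mean(
        np.maximum(np.asarray(y_point) - np.asarray(y_upper), 0.0))
    return pinball + crossing
```

The lower-bound loss is symmetric: penalise max(y_lower − y_point, 0) instead.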

6 Experiments and results
In this section, we describe and discuss our results of forecasting daily demand
for 10 food products. The forecasting model is an adaptation of N-BEATS
architecture [18], whose initial version was developed by Molander [19]. We train
the model simultaneously on all products, each having ca 2-year-long history
of daily sales, including as features (usually one-hot-encoded) their categories,
providers, weather, and a calendar with indicated special days and periods. The
total size of the dataset is 17373 numerical vectors of size 160 each. During the
training, mini-batches are drawn with replacement, following point 1 of Sec. 4.
The Tensorflow implementation of our model uses the Adam optimiser [20] with
early stopping and learning rate decay (starting from 10−3 ). The algorithm
usually converges in 100 epochs.
The model predicts the next day’s sales based on a 14-day window of recent
sales data, current weather conditions and approaching holidays. The forecast
horizon in this task is shortened to 1 day to match the theoretical assumptions;
hence the presented results for the period from 2022-04-02 to 2022-06-16 consist
of 76 forecasts for each of the median prediction (i.e. the 50th percentile) and the
two PI bounds. We perform 90 individual predictions, i.e. three months of forecasts,
per product and backtest them against historical data.

It is important to note that we use MAE to train the model for forecasting.
This is a special case of our weighted loss function (1) with α = 0.5. For a PI of
width β, we train to find the lower bound using α = 0.5 − β/2 and the upper
bound using α = 0.5 + β/2; e.g. for β = 0.9, we train with α = 0.05 and 0.95 for
the lower and upper bound respectively.
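The mapping from PI width to training quantile levels is then a one-liner (a trivial helper, named by us):

```python
def pi_alphas(beta):
    """Quantile levels for a central prediction interval of width beta:
    the lower-bound model is trained with alpha = 0.5 - beta/2 and the
    upper-bound model with alpha = 0.5 + beta/2."""
    return 0.5 - beta / 2.0, 0.5 + beta / 2.0

# e.g. pi_alphas(0.9) gives quantile levels of (approximately) 0.05 and 0.95
```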
In Fig. 1, the prediction and PI of two randomly chosen products are
plotted. It can be seen that the width of the PIs remains relatively consistent over
time; however, there are some clear outliers in this regard, for example the
spike upwards around 2022-04-30 for product 2. The (weekly) periodic sales
behaviour, which can also be seen in the median forecast, combined with an
unusually high sale the day before, caused the model to widen the prediction
interval. As a result, the historic sales data is indeed captured within the PI.
Note that in the week before, this had failed.

Based on the tested 70%, 80% and 90% PIs, on average, the width of the
interval grows with increasing probability level. Under the assumption of either
a Laplacian or Gaussian distribution of the errors, the last percentage points
before reaching 100% should correspond to the largest widening of the prediction
interval.
By naively training the models separately, the integrity of the PIs is not always
maintained. Figure 2 shows that the median forecast does not always lie within
the predicted PI, for example at the peaks around 2022-04-23 or 2022-06-03 for
product 3. Such situations are more likely for low-demand products, whose
trends contain more outliers and are thus difficult to model. Fixing the seed
in the computations of the median and PIs, as we suggested in Sec. 5, reduces the
number of such rogue points by merely 11% (see an example in Fig. 3). Nevertheless,
their proportion is rather low, amounting to ca 3% of predictions. On
(a) Forecast of sales for product 1 with 80% and 90% PIs.

(b) Forecast of sales for product 2 with 70% and 90% PIs.

Figure 1: Sales forecasts with prediction intervals trained by MAE, and historical data.

top of that, the lower-probability PI can be wider than that of the higher-probability
one at some points, as remarkably shown in the tail of the forecast for product 4.
Finally, Table 1 shows how well the prediction interval width and the prediction
success rate overlap. With only a minimal difference, the results averaged
over 10 products suggest that our method indeed accurately finds the lower and
upper bounds of PIs.

Table 1: The prediction interval width (from 0.5 − width/2 to 0.5 + width/2) with
the corresponding success rate, i.e. the rate at which the historical data lies
within the predicted PI.

    PI width    success rate
    0.9         0.89
    0.8         0.8
    0.75        0.749
    0.7         0.695

We additionally experiment with the longer forecast horizons required for
practical applications of our model. Figure 4 shows examples of forecasts with
7-day and 14-day horizons. We perform the prediction every 7 or 14 days,
respectively. Quite surprisingly, the success rate on our set of tested products
(a) Forecast of sales for product 3 with 75% and 90% PIs.

(b) Forecast of sales for product 4 with 70% and 80% PIs.

Figure 2: Sales forecasts with prediction intervals trained by MAE, and historical data.

remains high, namely 76% of historical points lie within the analysed 7-day
75% PI and 74% of historical points lie within the 14-day 75% PI. The starting
and final days of the forecasts are marked by the grid, where they overlap. As
expected, at these points we can see that the PI widens towards the end of the
predicted period and narrows erratically as the new prediction period begins.

7 Final thoughts and future work


7.1 Final thoughts
The standard interpretation of regression models appeals to the maximum
likelihood—their estimate is the best they can give based on the data. Without
additional ad hoc assumptions about the data distribution (as reflected in Remark 1),
it does not even define what statistics they actually estimate. Neither
does it endow us with an intuition about the quality of the estimate or the models’
predictive power. In fact, it can be misleading, as increasing the likelihood
does not necessarily mean increasing the model accuracy, vide overfitting.
Statistical learning theory provides a more general answer to the question
“what does a nonlinear regression model predict?”. Minimising the expected
loss over a sufficiently large i.i.d. training set, we obtain with high probability a
model producing, on previously unseen test data, predictions with the expected

Figure 3: Sales forecasts trained by MAE with 75% PIs, with and without fixing
the seed.

loss not much larger than the minimal expected loss possible for given model
architecture [15]. The only condition is that the test data are drawn from the
same distribution as the training set.
However, the actual task of a statistical regression model is to provide the
range of values that is likely to contain the true outcome (given the trained
parameters and the data). We presented a simple method for training a neural
network to predict the lower or upper bounds of the prediction interval given a vector
of features. It does so by minimising a weighted version of the MAE loss. Its only
assumption is that the model is expressive enough to attain an (approximate)
global minimum of the expected loss. Without needing to change our set-up to
guarantee consistent results, we have shown by numerical experiments that the
mathematical result holds true in real-world implementations. This suggests
that the methods from this paper can be used to create prediction intervals of
neural networks in similar set-ups.

7.2 Future work


Although we have shown that the method presented here works for linear models
and for an expressive model, it cannot be ruled out that the method is less
successful for a specific class of models. Finding the weaknesses would create a better
understanding of the possible applications of the method.

The relationship between the optimal model parameters θ and the hyperparameters
of the model is generally complex and not straightforward. The relation
between the percentile α and the resulting model parameters θ, in particular
Figure 4: Sales forecasts with 7-day and 14-day horizons trained by MAE with
75% PIs, performed every 7 or 14 days, respectively.

with regards to continuity and regularity, is one worth exploring.


Different asymmetric loss functions may result in other interesting minimisations.
We conjecture that a similar result exists for a weighted MSE minimisation,
where the predicted value is related to a number of variations away from the
prediction found by minimising the MSE. This should be directly related to the chosen
weight α.
Finally, our method works well in practice even though its theoretical re-
quirements are not guaranteed to be satisfied by the standard model training
algorithm (see Sec. 4). This surprising fact deserves further investigation, which
could lead to interesting discoveries about how neural networks learn during
training.

References
[1] Fangjie Liu, Yanhe Xu, Geng Tang, and Yuying Xie. A new lower and
upper bound estimation model using gradient descend training method for
wind speed interval prediction. Wind Energy, 24, 03 2021.

[2] Hairong Zhang, Jianzhong Zhou, Lei Ye, Xiaofan Zeng, and Yufan Chen.
Lower upper bound estimation method considering symmetry for construc-
tion of prediction intervals in flood forecasting. Water Resources Manage-
ment, 29, 12 2015.

[3] Guoao Duan, Mengyao Lin, Hui Wang, and Zhuofan Xu. Deep neural
networks for stock price prediction. In 2022 14th International Conference
on Computer Research and Development (ICCRD), pages 65–68, 2022.
[4] Mashud Rana, Irena Koprinska, Abbas Khosravi, and Vassilios G. Agelidis.
Prediction intervals for electricity load forecasting using neural networks.
In The 2013 International Joint Conference on Neural Networks (IJCNN),
pages 1–8, 2013.
[5] Kristian Miok. Estimation of prediction intervals in neural network-based
regression models. In 2018 20th International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing (SYNASC), pages 463–468,
2018.
[6] A. Werpachowska. Forecasting the impact of state pension reforms in post-
Brexit England and Wales using microsimulation and deep learning. In
Proceedings of PenCon 2018 Pensions Conference, page 120, 2018.
[7] Keith J. Beven and Andrew M. Binley. The future of distributed mod-
els: model calibration and uncertainty prediction. Hydrological Processes,
6(3):279–298, 1992.
[8] David J. C. MacKay. The evidence framework applied to classification
networks. Neural Computation, 4(5):720–736, 1992.
[9] D.A. Nix and A.S. Weigend. Estimating the mean and variance of the
target probability distribution. In Proceedings of 1994 IEEE International
Conference on Neural Networks (ICNN’94), volume 1, pages 55–60 vol.1,
1994.
[10] G. Chryssolouris, M. Lee, and A. Ramsey. Confidence interval prediction for
neural network models. IEEE Transactions on Neural Networks, 7(1):229–
232, 1996.
[11] Tom Heskes. Practical confidence and prediction intervals. In M.C. Mozer,
M. Jordan, and T. Petsche, editors, Advances in Neural Information Pro-
cessing Systems, volume 9. MIT Press, 1996.
[12] J. T. Gene Hwang and A. Adam Ding. Prediction intervals for artificial neu-
ral networks. Journal of the American Statistical Association, 92(438):748–
757, 1997.
[13] Achilleas Zapranis and Efstratios Livanis. Prediction intervals for neural
network models. In Proceedings of the 9th WSEAS International Confer-
ence on Computers, ICCOMP’05, Stevens Point, Wisconsin, USA, 2005.
World Scientific and Engineering Academy and Society (WSEAS).
[14] Abbas Khosravi, Saeid Nahavandi, Doug Creighton, and Amir F. Atiya.
Lower upper bound estimation method for construction of neural network-
based prediction intervals. IEEE Transactions on Neural Networks,
22(3):337–346, 2011.

[15] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learn-
ing - From Theory to Algorithms. Cambridge University Press, 2014.
[16] Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh.
SGD converges to global minimum in deep learning via star-convex path.
ICLR 2019, 2019.

[17] Gabriel Turinici. The convergence of the stochastic gradient descent (SGD):
a self-contained proof. Technical report, Ceremade - CNRS, 2021.
[18] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Ben-
gio. N-beats: Neural basis expansion analysis for interpretable time series
forecasting. arXiv preprint arXiv:1905.10437, 2019.
[19] I.C. Molander. Time series demand forecasting—Reducing food waste with
deep learning. Master’s thesis, Imperial College London, Department of
Computing, 2021.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic opti-
mization. CoRR, abs/1412.6980, 2015.
