
Advances in Applied Energy 14 (2024) 100165


Probabilistic load forecasting for integrated energy systems using attentive quantile regression temporal convolutional network

Han Guo, Bin Huang*, Jianhui Wang

Electrical and Computer Engineering Department, Southern Methodist University, Dallas, TX 75205, USA

Keywords: Attention mechanism; Load forecasting; Multi-task learning; Probabilistic forecasting; Temporal convolutional network

Abstract: The burgeoning proliferation of integrated energy systems has fostered an unprecedented degree of coupling among various energy streams, thereby elevating the necessity for unified multi-energy forecasting (MEF). Prior approaches predominantly relied on independent predictions for heterogeneous load demands, overlooking the synergy embedded within the dataset. The two principal challenges in MEF are extracting the intricate coupling correlations among diverse loads and accurately capturing the inherent uncertainties associated with each type of load. This study proposes an attentive quantile regression temporal convolutional network (QTCN) as a probabilistic framework for MEF, featuring an end-to-end predictor for the probabilistic intervals of electrical, thermal, and cooling loads. This study leverages an attention layer to extract correlations between diverse loads. Subsequently, a QTCN is implemented to retain the temporal characteristics of load data and gauge the uncertainties and temporal correlations of each load type. The multi-task learning framework is deployed to facilitate simultaneous regression of various quantiles, thereby expediting the training progression of the forecasting model. The proposed model is validated using realistic load data and meteorological data from the Arizona State University metabolic system and the National Oceanic and Atmospheric Administration, respectively, and the results indicate superior performance and greater economic benefits compared to the baselines in the existing literature.

1. Introduction

Integrated energy systems (IES) can integrate multiple energy sources and conversion technologies to provide a more efficient, sustainable, and reliable energy supply [1–3]. In IES, various forms of energy are coupled through energy conversion devices, enabling not only the coordinated operation of diverse energy sources but also the transition from separate production and supply to the joint operation of multiple energy sources. As the development of IES continues to progress, the need for accurate multi-energy probabilistic forecasting (MEPF) has grown significantly. Accurate MEPF has now become an indispensable requirement for ensuring the reliable, stable, and economically efficient operation of IES, encompassing various aspects such as energy production, storage, and dispatch. According to [4], a rough estimate indicates that a 1% reduction in the mean absolute percentage error (MAPE) for a utility with a 1 GW peak load could result in savings of $500,000 per year from long-term load forecasting and $300,000 per year from short-term load forecasting. Due to the substantial economic and environmental advantages, short-term forecasting is extensively applied in energy sectors, including electric load forecasting [5], heating load forecasting [6], and more.

In IES, each energy source carries its inherent uncertainty and exhibits intricate coupling correlations with other diverse energy sources [7]. On the one hand, uncovering these complex correlations has emerged as a pivotal objective in MEPF. On the other hand, effectively accounting for the uncertainty of each load and the uncertainty between different loads poses another challenge. Addressing these two challenges lies at the heart of MEPF.

Depending on the load sources, load forecasting methods can be classified into two types: single-energy load forecasting (SEF) and multi-energy load forecasting (MEF). Traditional forecasting usually involves SEF. For single-load forecasting, the earliest models used for this purpose are statistical models, such as the multiple linear regression (MLR) model [8] and the autoregressive moving average (ARMA) model [9,10]. These models are all linear and therefore inaccurate in capturing nonlinear relationships. To address this issue, researchers have paid more attention to machine learning architectures, such as support vector regression (SVR) [11] and random forest (RF) [12].

* Corresponding author.
E-mail address: bhuang1@smu.edu (B. Huang).

https://doi.org/10.1016/j.adapen.2024.100165
Received 10 November 2023; Received in revised form 9 February 2024; Accepted 9 February 2024
Available online 19 February 2024
2666-7924/© 2024 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

However, traditional machine learning algorithms are not well suited to mapping highly complex nonlinear relationships and may not handle large datasets efficiently [13,14]. Subsequently, neural networks (NNs) have garnered much wider attention due to their ability to excel at complex nonlinear mapping relationships [15–18]. Long short-term memory (LSTM), deep belief network (DBN), and transformer deep learning models were adopted for SEF in [19–21]. All works mentioned above are point forecasting. In order to achieve probabilistic forecasting, uncertainties should be taken into account. Y. Wang et al. [22] extended LSTM-based point forecasting to the quantile regression LSTM (QLSTM) to handle the non-stationary and stochastic features of individual consumers. Taking the uncertainty of low-voltage load data into account to improve the accuracy of probabilistic forecasting, Z. Cao et al. [23] proposed a hybrid ensemble learning model based on the DBN. X. Liu et al. [24] proposed an ordinary differential equation network combined with quantile regression to capture the uncertainties, and Z. Zheng et al. [25] employed a stochastic recurrent encoder and decoder network for probabilistic forecasting. M. Sun et al. [26] proposed a probabilistic day-ahead net load forecasting method that combined Bayesian theory and LSTM to capture the epistemic and aleatoric uncertainty of load data, thus resulting in probabilistic forecasting.

With the development of IES, an increasing number of scholars have focused on MEF. As in independent forecasting, NNs are still the primary forecasting models. A DBN and a hybrid model combining a convolutional neural network (CNN) and LSTM are proposed in [27] and [28], respectively, to forecast electrical, heat, and gas loads. Both of these models are based on single-task learning (STL) NNs. In addition to STL-based NNs, NNs for MEF often incorporate a multi-task learning (MTL) structure to accommodate multi-output tasks. An MTL-based transformer and an MTL-based LSTM model are proposed in [29] and [30] to forecast electrical, cold, and heat loads simultaneously. All the works mentioned above regarding MEF are based on point forecasting. For probabilistic forecasting, C. Wang et al. [31] proposed a Bayesian transformer network (BTN), which incorporates both a Bayesian neural network (BNN) and a transformer structure, to forecast prediction intervals (PIs) of electrical, cold, and heat loads. First, the three types of load data are simultaneously processed through a BTN-based encoder. After that, three BTN-based decoders are utilized to separately process the three different loads, one per load type, and during the decoding process, all decoders incorporate the outputs from the encoder. Finally, the forecasting results for the three types of loads are output by the three decoders. Although this work has made substantial progress in the MEF field, there are still certain research gaps to be tackled. On the one hand, compared to other types of NNs, BNN and transformer structures introduce more parameters, leading to overfitting problems when training data is limited, and the increase in parameters also escalates computational complexity. On the other hand, the BNN is not explicitly tailored for processing time series data. Its capacity to extract temporal correlations is comparatively less robust than that of networks explicitly designed for this purpose, like LSTM. In order to address these two issues, new models need to be introduced. The typical references for load forecasting are summarized in Table 1.

As mentioned above, the challenge in MEF lies in accurately capturing the correlations between different loads at each time step. A better relationship-capturing method can lead to better forecasting accuracy. In addition to capturing correlations between different loads, another critical aspect of achieving MEPF is how to analyze the uncertainty of each load. In general, probabilistic forecasting needs to consider two types of uncertainty: data uncertainty and model uncertainty. Data uncertainty may result from inaccurate measurements or human errors, while model uncertainty is rooted in the misalignment and inefficiencies of training methods. Additionally, there is no assurance that the weight values correspond to the global minimum of the error function, contributing to model uncertainty. Furthermore, the finite training dataset may not comprehensively capture the true data-generating mechanism, exacerbating model uncertainty.

To this end, this paper proposes a multi-task attentive quantile regression temporal convolutional network (MTL-based attentive QTCN) to address the two challenges mentioned above. Firstly, the proposed model utilizes the attention mechanism to capture the coupling correlations at each time step to solve the correlation-extracting challenge. Attention is a neural network architecture capable of extracting global correlations and dynamically focusing on correlations between different locations or features. Attention achieves this by calculating correlations among different inputs and generating respective attention scores, thereby accomplishing the task of correlation extraction. To address both uncertainties, the proposed model utilizes the quantile regression temporal convolutional network (QTCN) for the following reasons. First, QTCN falls under the CNNs specialized for time series analysis. Second, introducing quantile regression into the temporal convolutional network (TCN) empowers the TCN to effectively capture both data and model uncertainties, thereby facilitating probabilistic forecasting. Additionally, quantile regression offers a non-parametric approach to probabilistic forecasting, meaning that the forecasting model does not assume a pre-defined distribution for the load data, which could deviate from the actual distribution. Lastly, QTCN demonstrates faster convergence and sidesteps the issue of gradient vanishing, a common challenge encountered in quantile regression recurrent neural networks (QRNN) like QLSTM [32,33]. Finally, by utilizing MTL-based quantile regression, QTCN can simultaneously forecast several quantiles of multiple loads given several nominal proportions. In this way, the proposed architecture can extract coupling relationships between different loads without altering the temporal properties of load data and simultaneously consider both data uncertainty and model uncertainty.

In summary, the main contributions of this paper are listed as follows.

1. A multi-task learning (MTL) based attentive quantile regression temporal convolutional network (QTCN) architecture for multi-energy probabilistic load forecasting is devised. The proposed model can simultaneously forecast multiple loads using a single network instead of using three different decoders. Additionally, the proposed model extends point forecasting to probabilistic forecasting by introducing QTCN, allowing for the direct estimation of prediction intervals. Through rigorous validation on a realistic dataset, the model demonstrably surpasses existing methodologies in the domain of probabilistic forecasting.

2. The proposed probabilistic predictor leverages an attention mechanism to capture the complex coupling correlations between different types of loads at each time point in order to improve forecasting accuracy, and employs a QTCN structure to handle data and model uncertainties and address the gradient vanishing issue found in recurrent neural networks (RNN), simultaneously improving the convergence speed. In addition, the proposed model expedites the training process by regressing multiple quantiles through an MTL structure, thereby enhancing the overall computational efficiency of the proposed model.

The rest of this paper is organized as follows. Section 2 introduces the technical details of the proposed forecasting model and the evaluation metrics of the PIs. Section 3 presents the dataset description and the case studies to verify the proposed methodology. Finally, Section 4 presents the conclusion of this paper. The source code related to this paper is available at https://github.com/BinHuangScut/QTCN.

2. Forecasting model

This paper proposes an MTL-based attentive QTCN for the MEF task. The description of the whole task is shown in Fig. 1. Three types of load data, including electrical, cold, and heat load, together with the weather data, are the inputs of the forecasting model.


Fig. 1. Multi-energy probabilistic load forecasting framework.

Table 1
Summary of load forecasting characteristics across references.

Reference        Load type              Forecasting output
                 Single    Multiple     Point    Probabilistic
[19–21]          ✓                      ✓
[22–26]          ✓                               ✓
[27–30]                    ✓            ✓
[31]                       ✓                     ✓
Proposed Model             ✓                     ✓

The outputs of the forecasting model are PIs of the three types of loads. This section introduces the essential components of the proposed model, namely the attention mechanism, the TCN, and the MTL structure, as well as the evaluation criteria for probabilistic forecasting.

2.1. Attention mechanism

The attention mechanism is a neural network architecture that captures correlations or relationships between elements, even without clear topological structures. Because there is no clear topological relationship between the various loads in the same area, yet there are many correlations among them, attention is employed in this paper to extract correlations between different loads.

The attention mechanism can be viewed as a computational process that takes a query and a set of key-value pairs as inputs, ultimately yielding an output [34]. Throughout this process, queries, keys, values, and outputs are all represented as vectors. The resulting outputs are determined by computing a weighted sum of the values, with these weights being generated by a compatibility function that assesses the correspondence between the query and each key.

Fig. 2. Structure of scaled dot-product attention mechanism.

2.1.1. Scaled dot-product attention

The two most commonly used attention mechanisms are additive attention and scaled dot-product attention. In this paper, we utilize scaled dot-product attention, which exhibits superior speed and space efficiency for real-world applications [35]. Refer to Appendix A.1 for the detailed description of scaled dot-product attention. The schematic representation of the scaled dot-product attention function is illustrated in Fig. 2.

Fig. 3. Structure of the multi-head attention mechanism.

2.1.2. Multi-head attention

Instead of employing a single attention function with keys, values, and queries, this paper employs a multi-head attention architecture. This approach allows the model to concurrently attend to multiple representation subspaces across different positions, in contrast to a single attention head. The configuration of the multi-head attention mechanism is depicted in Fig. 3. Refer to Appendix A.2 for the detailed description of multi-head attention.
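To make the mechanism concrete, the following is a minimal PyTorch sketch of the scaled dot-product attention of Appendix A.1, applied to the three load tokens at one time step. It is an illustration under stated assumptions, not the authors' released implementation; the module name and the dimensions d_in, d_k, and d_v are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    """Single-head attention over the load dimension, following (A.1)-(A.2)."""
    def __init__(self, d_in: int, d_k: int, d_v: int):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_k, bias=False)  # Q = X W_q
        self.W_k = nn.Linear(d_in, d_k, bias=False)  # K = X W_k
        self.W_v = nn.Linear(d_in, d_v, bias=False)  # V = X W_v
        self.d_k = d_k

    def forward(self, x):
        # x: (batch, tokens, d_in); here tokens = 3 load types at one time step
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # (batch, 3, 3)
        weights = F.softmax(scores, dim=-1)  # attention scores between loads
        return weights @ v, weights
```

A multi-head variant, as in (A.3)-(A.4), would run several such heads in parallel and concatenate their outputs before a final linear projection.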


2.2. Temporal convolutional network

TCN is a neural network architecture based on the CNN, but it is designed to process time series data [36]. Like other time-serial neural networks such as LSTM, TCN can also capture the long-term dependencies in historical data to perform forecasting tasks. In addition, introducing residual blocks in TCN addresses the issue of gradient vanishing, and the TCN model can further be enhanced into QTCN to capture the uncertainty in multi-energy load data. Therefore, such a time-series model was employed in this paper. TCN consists of three components: causal convolution, dilated convolution, and the residual block.

Fig. 4. Causal convolution structure.

2.2.1. Causal convolutions

Fig. 4 shows the structure of the causal convolution layers. Causal convolution is the most crucial structure within TCN. For a one-dimensional sequence input $X = (x_0, x_1, \dots, x_t)$, the output at time $t$, denoted as $y_t$, relies solely on the current input $x_t$ and a portion of the previous time steps (i.e., $x_{t-1}$) without any consideration of future inputs (i.e., $x_{t+1}$).

This structure guarantees that the information flows only in the forward direction along the temporal dimension, preserving the causal relationships in time-serial data and meaning that there is no information leakage from future time steps to past ones.

2.2.2. Dilated convolutions

Standard causal convolutions have limitations when examining past information, primarily due to the fixed receptive field of the convolutional kernel. In tasks that demand significant historical data, networks consisting solely of standard causal convolutions encounter substantial challenges, including the potential for increased network depth, which can lead to slow training. To effectively address the issue of handling extensive historical data, TCN introduces a dilated causal convolution structure, which incorporates dilation within the convolutional kernel, effectively expanding the receptive field. Dilated causal convolutions enable the network to capture patterns across various time scales and capture long-range dependencies. For a one-dimensional time series input $X = (x_0, x_1, \dots, x_t)$ and a convolutional kernel $f : \{0, 1, \dots, n-1\} \to \mathbb{R}$, the dilated convolution operation $F(\cdot)$ on sequence element $t$ is defined as below.

$F(t) = (X *_d f)(t) = \sum_{i=0}^{n-1} f(i) \cdot x_{t - d \cdot i}$   (1)

where $n$ is the kernel size, $d$ is the dilation factor, and $t - d \cdot i$ accounts for the direction of the past.

To equip TCN with the ability to process input sequences of varying lengths and convert them into output sequences of uniform length, TCN introduces padding, defined as follows.

$p = (n - 1) \cdot d$   (2)

where $p$ is the padding size, $n$ is the kernel size, and $d$ is the dilation factor. In this way, TCN possesses the same capability as LSTM for handling time-serial data.

The structure of dilated causal convolutions is shown in Fig. 5, in which the kernel size is $n = 2$ and the dilation factors are $d = [1, 2, 4]$.

2.2.3. Residual blocks

As the number of hidden layers in TCN increases, along with the enlargement of the kernel size and dilation factor, the receptive field of TCN grows larger, capturing a more significant amount of information. However, the increase in the number of layers in TCN can lead to training instability and the issue of vanishing gradients. In order to address these two issues, residual blocks are adopted in this study. The details of the residual block structure are shown in Fig. 6.

One residual block includes two branches. One branch performs a transformation operation $f(\cdot)$, which contains dilated causal convolutional layers, weight normalization layers, ReLU activation functions, and dropout layers. The dilated causal convolutional layers are constructed by combining the causal and dilated convolutions mentioned earlier. These layers are utilized for extracting concealed features from the inputs. WeightNorm layers are employed to accelerate training by constraining the weight range. Activations are carried out using ReLU, chosen for effective convergence. In order to relieve overfitting issues in the deep network, dropout layers are applied as regularization techniques.

The other branch is a $1 \times 1$ Conv layer, which ensures a consistent number of feature maps alongside the first branch. For the input of the $h$-th residual block $X^{(h-1)}$, the output of the $h$-th residual block $X^{(h)}$ can be expressed as follows.

$X^{(h)} = \sigma(X^{(h-1)} + f(X^{(h-1)}))$   (3)

where $\sigma(\cdot)$ is an activation function.

To be more specific, if the entire TCN consists of $M$ residual blocks, disregarding $\sigma(\cdot)$, the forward propagation of the whole TCN can be expressed as follows.

$X^{(M)} = X^{(0)} + \sum_{i=1}^{M} f(X^{(i-1)})$   (4)

where $X^{(0)}$ is the input of the TCN and $X^{(M)}$ is the output of the TCN. In this way, the backpropagation of the overall loss of the TCN to $X^{(0)}$ is as follows.

$\dfrac{\partial L}{\partial X^{(0)}} = \dfrac{\partial L}{\partial X^{(M)}} \left( 1 + \dfrac{\partial}{\partial X^{(0)}} \sum_{i=1}^{M} f(X^{(i-1)}) \right)$   (5)

where $L$ is the overall loss of the whole TCN. The presence of the "1" within (5) means that gradients at the TCN's output can be directly propagated backward to the TCN's input, decreasing the probability of encountering the problem of gradient vanishing. In fact, (5) holds for any pair $(x_i, x_j)$, where $x_i$ and $x_j$ are the $i$-th and $j$-th residual blocks of the TCN respectively, thus addressing the gradient vanishing problem.
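The following is a minimal PyTorch sketch of one residual block in the spirit of Fig. 6 and eq. (3), where causality is enforced by left-padding with $p = (n-1) \cdot d$ as in (2). The class names and hyperparameter defaults are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class CausalConv1d(nn.Module):
    """Dilated causal convolution: pad p = (n-1)*d on the left only, as in (2)."""
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = weight_norm(nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation))

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # left-pad so y_t sees only x_<=t
        return self.conv(x)

class ResidualBlock(nn.Module):
    """One TCN residual block, X^(h) = sigma(X^(h-1) + f(X^(h-1))), as in (3)."""
    def __init__(self, c_in, c_out, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.f = nn.Sequential(
            CausalConv1d(c_in, c_out, kernel_size, dilation), nn.ReLU(), nn.Dropout(dropout),
            CausalConv1d(c_out, c_out, kernel_size, dilation), nn.ReLU(), nn.Dropout(dropout),
        )
        # 1x1 conv keeps the number of feature maps consistent on the skip path
        self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return torch.relu(self.skip(x) + self.f(x))
```

Stacking such blocks with dilations 1, 2, 4, ... yields the receptive field shown in Fig. 5.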


Fig. 5. Structure of dilated causal convolution.

Fig. 6. Structure of residual block.

Fig. 8. Structure of MTL-based attentive QTCN model.

2.3. Multi-task learning based quantile regression

2.3.1. Quantile regression

There are various forms of probabilistic forecasting, and this paper adopts quantile regression as the forecasting method. On the one hand, quantile regression is a non-parametric method, allowing probabilistic forecasting to be performed directly without assuming the distribution of the forecasting target. On the other hand, by utilizing quantiles, it is possible to derive the complete probability distribution function of the forecasting target.

Quantile regression, as introduced in [37], seeks to approximate the conditional distribution of a random variable through the estimation of quantiles and is expressed as an optimization task that aims to minimize the pinball loss function. The pinball loss is formulated for any quantile $q \in (0, 1)$, utilizing a weighted absolute error.

$L_q(\hat{y}, y) = \begin{cases} (y - \hat{y})\, q, & y \geq \hat{y} \\ (y - \hat{y})(q - 1), & y < \hat{y} \end{cases}$   (6)

where $y$ is the target value and $\hat{y}$ is the forecast of quantile $q$. The pinball loss function penalizes the forecasting bias of the model at a specific $q$: when $y$ is greater than $\hat{y}$, the penalty slope is $q$; conversely, when $\hat{y}$ is greater than $y$, the penalty slope is $q - 1$. The pinball loss function is illustrated in Fig. 7. It is worth noting that, to keep the forecast quantiles from crossing, we have chosen naive rearrangement in this paper because it exhibits a low redundancy rate and favorable asymptotic properties [38].

Fig. 7. Pinball loss function.

2.3.2. Multi-task learning

A disadvantage of quantile regression is that a specific model needs to be set up and trained for each quantile of the predictive distribution to be issued [39]. For a quantile set $Q = \{q_1, q_2, \dots, q_s\}$, to make forecasts of the whole quantile set, $s$ models need to be trained, which may lead to a large number of models for building the whole predictive distribution, thus raising computational costs.

To overcome this disadvantage, this paper adopts the MTL structure. MTL allows the forecasting model to employ multiple pinball loss functions simultaneously, thereby enabling simultaneous forecasting of multiple quantiles. Moreover, in MTL, the various subtasks utilize a common feature-sharing layer with identical feature parameters. In this way, during the backpropagation process of the neural network, these shared layers can be optimized simultaneously. Therefore, the MTL structure is utilized to reduce the computational costs.
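As a concrete illustration of (6) and the MTL head, the sketch below implements the pinball loss and averages it over a set of quantile levels emitted by a single shared network. The quantile list and tensor shapes are assumptions for illustration, and the sorting step is one simple way to realize the naive rearrangement of [38]; none of this is presented as the authors' exact implementation.

```python
import torch

def pinball_loss(y_hat, y, q):
    """Pinball loss of eq. (6) for quantile level q, averaged over the batch."""
    err = y - y_hat
    return torch.mean(torch.maximum(q * err, (q - 1.0) * err))

# Assumed quantile levels, e.g. to form 80/90/95% central prediction intervals
quantiles = [0.025, 0.05, 0.10, 0.90, 0.95, 0.975]

def multi_task_loss(y_hat, y):
    """y_hat: (batch, n_quantiles, 3 loads); y: (batch, 3). One shared network
    emits all quantiles at once; the MTL objective averages the pinball losses."""
    losses = [pinball_loss(y_hat[:, i, :], y, q) for i, q in enumerate(quantiles)]
    return torch.stack(losses).mean()

def rearrange(y_hat):
    """Naive rearrangement: sort predicted quantiles so the PIs cannot cross."""
    return torch.sort(y_hat, dim=1).values
```

Because all quantile heads share the same backbone, one backward pass through `multi_task_loss` updates the shared layers for every subtask simultaneously, which is exactly the computational saving the MTL structure targets.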


2.4. Proposed model

The proposed model is shown in Fig. 8, and the pseudo-code for training and deployment of the proposed model is given in Algorithm 1 and Algorithm 2. The input data consists of two components: historical load data $(l_1, l_2, \dots, l_t)$ and historical weather data $(w_1, w_2, \dots, w_t)$. Firstly, the proposed model employs the attention layer to extract the correlations between the different load data at each time step. Then, the load and weather data at each time step are concatenated as one of the inputs of QTCN. After that, QTCN is employed to capture the temporal dependencies in the time-serial data, as well as the data and model uncertainties, to achieve probabilistic load forecasting. Finally, taking the final output $h_t$ of the QTCN, the proposed model extends single-quantile regression to multi-quantile regression by introducing the MTL structure to accelerate the training process. The output dimension of the proposed model is $n \times 3$, meaning $n$ quantiles need to be forecast for each load. Regarding hyperparameter tuning, we defined a class named Options where all hyperparameters are specified, and then conducted a hyperparameter grid search to tune these parameters. As for the model training process, gradient descent was employed to determine the optimal parameters of the model.

Algorithm 1 Training of the multi-task learning based attentive QTCN.
Input: Multi-energy load and weather training dataset X, number of quantiles to be regressed Q, epochs M, batch size B, number of elements in the data loader N.
Output: Parameters of the multi-task learning based attentive QTCN.
1: Initialization: all parameters θ of the proposed model.
2: for i = 1 to M do
3:   for n = 1 to N do
4:     for b = 1 to B do
5:       Compute the final hidden output of the attentive TCN, h_t.
6:       for q = 1 to Q do
7:         Compute one output of the MTL structure, q̂.
8:         Compute the pinball loss function from (6).
9:       end for
10:    end for
11:    Take the average of the Q pinball loss values.
12:    Compute the gradient of the loss with respect to the parameters of the proposed model.
13:    Update the parameters θ of the proposed model.
14:  end for
15: end for

Algorithm 2 Deployment of the multi-task learning based attentive QTCN.
Input: Multi-energy load and weather testing dataset X, number of quantiles to be regressed Q, trained model.
Output: Prediction intervals of the testing dataset.
1: Compute the final hidden output of the attentive TCN, h_t, of X.
2: Compute all outputs of the MTL structure, q̂_1, ..., q̂_Q.
3: Generate prediction intervals of X based on q̂_1, ..., q̂_Q.

2.5. Evaluation criteria

Choosing appropriate metrics is crucial for assessing the performance of the proposed model. Reliability, sharpness, and the overall score are utilized as evaluation metrics in this paper.

2.5.1. Reliability

Reliability is the primary metric for evaluating the quality of probabilistic forecasting models. Reliability implies that, given a specified prediction interval nominal confidence $100(1 - \alpha)\%$ (PINC), the PI coverage probability (PICP) should be as close to the PINC as possible. The difference between PICP and PINC, which is termed the average coverage error (ACE), can be used to assess the quality of prediction intervals (PIs). Refer to Appendix B.1 for the detailed calculation of PINC, PICP, and ACE.

2.5.2. Sharpness

Sharpness refers to the extent to which the predicted distribution closely matches the actual one. Sharpness can be measured by the average width (AW) of the PIs. Refer to Appendix B.2 for the detailed calculation of AW.

2.5.3. Overall score

The overall score is also an important criterion used to assess the quality of PIs. Refer to Appendix B.3 for the detailed calculation of the overall score.

3. Case study

3.1. Data description and analysis

3.1.1. Data description

The multi-energy load data utilized in this paper is sourced from the Arizona State University (ASU) metabolic system. This interactive online tool provides users access to real-time resource usage information across all ASU campuses. The system encompasses a wide range of load data, including residential load, academic building load, administrative building load, and more. Over several years of development, this system has accumulated extensive energy consumption data for the entire ASU campuses, encompassing electrical, cold, and heat loads. In this system, these three types of loads are measured using different units, namely kW, Ton/h, and mmBtu/h. For data analysis convenience, this paper adopts kW as the standardized unit for all three types of loads. The relationship between these three units is as follows.

1 kW = 0.284 Ton/h = 0.0034 mmBtu/h   (7)

Weather data is sourced from the website of the National Oceanic and Atmospheric Administration (NOAA), which is a scientific agency operating within the U.S. Department of Commerce. This website provides access to a wealth of meteorological data, including wind speed, temperature, humidity, and more.

The historical data spans from January 1, 2022, to July 21, 2023, with a resolution of one hour. 80% of the data is used to train the proposed model, while the remaining 20% is used to evaluate the model's performance. The look-ahead time is one hour.

3.1.2. Data analysis

As mentioned earlier, extracting the correlations between the data is crucial. Therefore, this paper employs the Pearson correlation coefficient (PCC), which is an essential criterion for assessing the correlation between different variables, to quantify the correlations between the different types of loads and the meteorological data. The formula for calculating the PCC is as follows.

$PCC(X, Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$   (8)

where $X$ and $Y$ are two sequences, $\bar{x}$ and $\bar{y}$ are the mean values of the two sequences, and $x_i$ and $y_i$ represent the $i$-th elements of the two sequences.

When the samples satisfy the Pearson correlation condition, the PCC's significance check (Sc) can be performed. For a random variable with $n$ samples, a statistic $t$ can be constructed according to the formula below.

$t = PCC(X, Y) \cdot \sqrt{\dfrac{n - 2}{1 - (PCC(X, Y))^2}}$   (9)

This statistic follows a t-distribution with $n - 2$ degrees of freedom, so it is possible to use the t-distribution to test the correlation and obtain the p-value of the significance test. The degree of significance is divided into three levels: when p-value < 0.01, the correlation is very significant; when 0.01 < p-value < 0.05, the correlation is relatively significant; otherwise, the correlation is not significant.
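The screening in (8)-(9) can be reproduced with scipy.stats.pearsonr, which returns both the coefficient and the two-sided p-value derived from the same t-statistic. The arrays below are synthetic stand-ins for the actual load and weather series, used only to make the sketch self-contained.

```python
import numpy as np
from scipy import stats

def correlation_report(x, y):
    """PCC of eq. (8) and the t-test of eq. (9); pearsonr computes both."""
    r, p = stats.pearsonr(x, y)
    if p < 0.01:
        level = "very significant"
    elif p < 0.05:
        level = "relatively significant"
    else:
        level = "not significant"
    return r, p, level

# Example with assumed arrays (e.g., hourly electrical load vs. temperature)
rng = np.random.default_rng(0)
temp = rng.normal(25, 5, 1000)
load = 50 + 2.0 * temp + rng.normal(0, 5, 1000)
print(correlation_report(temp, load))
```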


Fig. 9. Correlation analysis between load and weather variables.

The results of the correlation analysis for the load and weather variables are shown in Fig. 9. Plots on the diagonal show the distributions of the variables themselves, the PCC and Sc between the variables are shown in the upper triangle, and the regression relationship between the variables is shown in the lower triangle. As for the three types of load, it can be seen that there is a positive correlation between electrical load and cold load, while both electrical load and cold load exhibit an explicit negative correlation with heat load. In addition, it can be observed that electrical load and cold load are positively correlated with all three temperatures and with wind speed, and negatively correlated with relative humidity and station pressure, whereas heat load exhibits the opposite pattern.

3.2. Benchmarks

To assess the performance of the proposed model, this paper conducts comparative experiments with seven other probabilistic forecasting models, including quantile regression random forest (QRF), QLSTM, QTCN, quantile regression transformer (QTransformer), and BNN. In addition to these five models, this paper also applies two modified versions of the proposed model, for multi-quantile forecasting of a single type of load and single-quantile forecasting of multiple types of loads, to validate the superiority of the proposed model. These two modified models are denoted as PM_SL_MQ and PM_ML_SQ, respectively. The first denotes single-load, multi-quantile forecasting based on the proposed model, and the second denotes multi-load, single-quantile forecasting based on the proposed model.

QRF is a non-parametric machine learning method that is well-suited for modeling high-dimensional data and nonlinear relationships. It combines the original RF with quantile regression, enabling it to estimate conditional quantiles rather than conditional mean values. QRF is a representative regression model among traditional machine learning algorithms, so it is chosen as a benchmark in this paper.
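QRF, as introduced by Meinshausen, estimates conditional quantiles from the empirical distributions retained in the forest's leaves. As a rough, illustrative stand-in for that idea (not the exact algorithm, and not this paper's implementation), one can take quantiles over the per-tree predictions of a standard random forest, as sketched below with synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_predict(forest, X, q):
    """Approximate QRF: collect per-tree predictions and take their q-quantile,
    instead of the forest's usual mean aggregation."""
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return np.quantile(per_tree, q, axis=0)

# Assumed toy data: X_train holds lagged load/weather features, y_train one load series
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = X_train @ rng.normal(size=4) + rng.normal(size=500)
forest = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
lo = qrf_predict(forest, X_train[:5], 0.05)   # lower bound of a 90% PI
hi = qrf_predict(forest, X_train[:5], 0.95)   # upper bound of a 90% PI
```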


Table 2
Parameters of the proposed model and all benchmarks.

Forecasting Model   Model Parameters
QRF                 N-estimators: 100/100/200
QLSTM               Number of residual block: 1
                    Number of fully-connected layer: 3
                    Size of LSTM hidden: 10
QTCN                Number of residual block: 1
                    Number of fully-connected layer: 3
                    Size of TCN hidden: 10
                    TCN channel: 25
QTransformer        Number of attention head: 8
                    Number of encoder layer: 12
                    Number of decoder layer: 6
                    Size of model: 8
BNN                 Number of fully-connected layer: 2
                    Number of first hidden neurons: 20
                    Number of second hidden neurons: 40
Proposed Model      Number of residual block: 1
                    Number of attention layer: 1
                    Number of attention head: 4
                    Number of fully-connected layer: 3
                    Size of attention embedding: 100
                    Size of TCN hidden: 10
                    TCN channel: 25

Table 3
Descriptions of the proposed model and all benchmarks.

Forecasting Model   Single-Energy   Multi-Energy   Single-Quantile   Multi-Quantile
QRF                 ✓                                                ✓
QLSTM               ✓                                                ✓
QTCN                ✓                                                ✓
QTransformer        ✓                                                ✓
BNN                 ✓
PM_SL_MQ            ✓                                                ✓
PM_ML_SQ                            ✓              ✓
Proposed Model                      ✓                                ✓

QLSTM is adopted by introducing the pinball loss function, transforming the point forecasting model LSTM into a conditional quantile regression model, thus enabling probabilistic forecasting. As QLSTM is regarded as a classical model for time-serial forecasting and analysis, it is used as one of the benchmarks. Like QLSTM, QTCN achieves quantile regression by introducing the pinball loss function. To contrast the advantages of introducing the attention mechanism in the forecasting results, this paper employs QTCN as a benchmark. Since the transformer structure is currently a widespread neural network model widely used in time-serial analysis, it is also considered a benchmark. As BNNs are inherently classical probability models, they can be directly utilized for probabilistic forecasting [40]. Consequently, BNN also serves as a benchmark. What's more, to demonstrate that forecasting multiple energies simultaneously yields better results than forecasting them individually, the PM_SL_MQ model also serves as a benchmark. Finally, PM_ML_SQ is used as a benchmark to demonstrate the efficiency of the MTL structure. The parameters and descriptions of all models are shown in Table 2 and Table 3, respectively. It is worth noting that since BNNs are probability models themselves, there is no need for quantile regression to determine the PIs. Therefore, in Table 3, the columns related to the quantiles are not annotated for BNN.

3.3. Experimental results and analysis

The results of the proposed model are compared with the benchmarks in this section. Fig. 10 displays the PI results of the proposed model for the electrical, cold, and heat loads, with each load having three forecasting intervals for three different PINCs, namely 80%, 90%, and 95%.

From Fig. 10(a), it can be observed that the electrical load curve exhibits numerous minor fluctuations, with relatively indistinct peaks and valleys and a certain degree of uncertainty. During the fluctuations from peak to valley, the electrical load curve deviates from the intervals covered by the 80% and 90% PINCs. However, the majority of the load curve remains within the range of the 95% PINC. Furthermore, the upper bounds of the three different confidence intervals for the electrical load curve have significant margins, while the lower bounds have smaller margins. Compared to the electrical load curve, the cold load curve appears more regular, displaying distinct peaks and valleys and lower uncertainty, as can be seen in Fig. 10(b). The cold load curve tends to deviate from the 80% and 90% PINCs during the transition from valley to peak. However, like the electrical load, most of the cold load curve is enveloped by the 95% PINC. Additionally, unlike the electrical load curve, the lower bounds of the three confidence intervals for the cold load curve have more significant margins. Fig. 10(c) presents the heat load results within the three PINCs. The heat load curve exhibits fluctuations but generally distinguishable peaks and valleys. Unlike the cold load curve, the heat load curve tends to deviate from the 80% and 90% PINCs during the transition from peak to valley. Similar to the electrical load curve, the peak of the heat load curve has a significant margin from the upper bounds of the three confidence intervals. Similarly, the heat load curve rarely exceeds the range of the 95% PINC.

3.3.1. Results analysis of evaluation criteria and economic benefits

Table 4 and Table 5 present the ACE, AW, and overall score $\bar{S}_t^{(\alpha)}$ results of the different models at three different PINCs. Table 4 presents a comparison between the different models, while Table 5 reports ablation experiments. It can be observed that at an 80% PINC, the results obtained by the proposed model are generally better than most of the results, except for the $\bar{S}_t^{(\alpha)}$ result for the electrical load obtained by the BNN, and the AW and $\bar{S}_t^{(\alpha)}$ results for the cold load obtained by the PM_ML_SQ model. Regarding the ACE results, the proposed model achieves slightly better results for the electrical and heat loads than the other comparative models, while the ACE results for the cold load are significantly better than those of the other models, with a difference of up to 2.25% compared to QLSTM. In terms of the AW and $\bar{S}_t^{(\alpha)}$ results, the proposed model excels for the heat load. The results for the cold load are slightly worse than those of the PM_ML_SQ model, with differences of 0.0040 and 0.0093, respectively. In addition, the $\bar{S}_t^{(\alpha)}$ result for the electrical load is slightly worse than that of BNN. Based on the 80% PINC results, it can be concluded that the proposed model generally outperforms the other comparative models.

Similar to the results at the 80% PINC, when the PINC is set at 90%, the proposed model outperforms the comparative models in terms of ACE and AW. For the ACE results, the absolute difference between the proposed model's result and QTransformer's result for the electrical load is 1.41%, while the absolute differences between the results of the proposed model and the PM_ML_SQ model for the cold load and heat load are 0.80% and 1.25%, respectively. Regarding the AW results, the differences between the three loads and the second-best results are 0.0162, 0.0084, and 0.0175, respectively. For the $\bar{S}_t^{(\alpha)}$ results, the proposed model performs better than the other models for the electrical load and heat load but slightly worse than the results obtained by PM_SL_MQ and PM_ML_SQ for the cold load, with differences of 0.0013 and 0.0010, respectively. Therefore, it can be concluded that at 90% PINC, the proposed model is generally superior to the comparative models.


Fig. 10. The results of the proposed model for the three loads under three different PINCs: (a) Electrical Load, (b) Cold Load, (c) Heat Load.

Table 4
The evaluation criteria results for different models.

PINC  Methods          ACE                              AW                             S̄_t^(α)
                       Electricity  Cold     Heat       Electricity  Cold    Heat     Electricity  Cold     Heat
80%   QLSTM            −1.65%       −2.61%   −3.28%     0.2547       0.2014  0.1283   −0.1352      −0.1022  −0.0637
      QRF              −4.02%       3.62%    5.25%      0.3005       0.4022  0.1081   −0.1465      −0.1805  −0.0534
      QTransformer     −7.43%       0.65%    −3.20%     0.1879       0.1294  0.0840   −0.1063      −0.0645  −0.0508
      BNN              10.19%       11.49%   16.68%     0.1896       0.1542  0.1311   −0.0703      −0.0628  −0.0556
      Proposed Model   0.80%        0.36%    1.32%      0.1835       0.1072  0.0653   −0.0959      −0.0543  −0.0358
90%   QLSTM            5.26%        4.89%    −3.34%     0.3089       0.2723  0.2822   −0.0668      −0.0591  −0.0626
      QRF              −4.46%       3.62%    −2.82%     0.3738       0.4611  0.1207   −0.0883      −0.0972  −0.0322
      QTransformer     2.96%        −3.34%   −4.31%     0.2256       0.1589  0.1054   −0.0520      −0.0393  −0.0297
      BNN              7.64%        9.63%    8.97%      0.2342       0.1973  0.1686   −0.0527      −0.0396  −0.0350
      Proposed Model   1.55%        −1.27%   −0.60%     0.2094       0.1417  0.0879   −0.0516      −0.0339  −0.0225
95%   QLSTM            4.56%        −4.34%   −1.52%     0.3571       0.2755  0.3607   −0.0366      −0.0348  −0.0384
      QRF              −6.27%       3.07%    −3.23%     0.4160       0.4909  0.1413   −0.0524      −0.0497  −0.0194
      QTransformer     1.37%        3.00%    −8.34%     0.2673       0.1977  0.1217   −0.0307      −0.0215  −0.0185
      BNN              4.34%        4.63%    4.34%      0.2572       0.2361  0.2028   −0.0275      −0.0236  −0.0211
      Proposed Model   0.63%        2.78%    −2.86%     0.2309       0.1955  0.1158   −0.0272      −0.0206  −0.0149


Table 5
Ablation experiment results.

PINC  Methods          ACE                              AW                             S̄_t^(α)
                       Electricity  Cold     Heat       Electricity  Cold    Heat     Electricity  Cold     Heat
80%   QTCN             6.51%        −4.09%   −2.02%     0.3323       0.2048  0.0869   −0.1488      −0.1041  −0.0462
      PM_SL_MQ         5.25%        −4.31%   1.54%      0.2327       0.1132  0.1389   −0.1008      −0.0573  −0.0628
      PM_ML_SQ         −2.23%       −3.28%   1.69%      0.2255       0.0979  0.0825   −0.1093      −0.0503  −0.0428
      Proposed Model   0.80%        0.36%    1.32%      0.1835       0.1072  0.0653   −0.0959      −0.0543  −0.0358
90%   QTCN             5.03%        4.66%    −2.60%     0.3706       0.2712  0.2135   −0.0782      −0.0578  −0.0480
      PM_SL_MQ         6.81%        8.07%    −3.05%     0.2527       0.1607  0.2024   −0.0538      −0.0326  −0.0453
      PM_ML_SQ         6.89%        2.07%    1.85%      0.2696       0.1501  0.1167   −0.0573      −0.0329  −0.0283
      Proposed Model   1.55%        −1.27%   −0.60%     0.2094       0.1417  0.0879   −0.0516      −0.0339  −0.0225
95%   QTCN             4.41%        1.81%    −1.38%     0.4121       0.3137  0.2218   −0.0420      −0.0331  −0.0244
      PM_SL_MQ         3.44%        4.70%    −3.01%     0.2705       0.3050  0.2452   −0.0290      −0.0306  −0.0276
      PM_ML_SQ         4.48%        3.81%    1.89%      0.3108       0.2161  0.1483   −0.0315      −0.0218  −0.0172
      Proposed Model   0.63%        2.78%    −2.86%     0.2309       0.1955  0.1158   −0.0272      −0.0206  −0.0149

For the PINC of 95%, the proposed model outperforms the other comparative models in terms of the $\bar{S}_t^{(\alpha)}$ and AW results. Regarding the $\bar{S}_t^{(\alpha)}$ results, the differences between the proposed model and the second-best results are 0.0003, 0.0009, and 0.0023, respectively. The differences between the proposed model and the second-best AW results are 0.0263, 0.0022, and 0.0059, respectively. Regarding the ACE results, the proposed model achieves the best result for the electrical load, while it slightly lags behind the results obtained by QTCN for the cold load, and by QLSTM, QTCN, and PM_ML_SQ for the heat load. Overall, at the PINC of 95%, the proposed model exhibits the best results compared to the other models.

Regarding the economic benefits brought about by the proposed model, [4] mentions that for a utility with a peak load of 1 GW, improving the short-term forecasting accuracy by 1.0% can result in annual savings of $300,000. Based on the experimental results, the proposed algorithm's accuracy is significantly higher than that of the other benchmarks. When the PINC is 80%, the average accuracy of the proposed method is 1.69% higher than the second most accurate algorithm, theoretically saving $507,000 annually. When the PINC is 90%, the proposed algorithm's average accuracy is 2.40% higher than the second-best algorithm, resulting in annual savings of $720,000. When the PINC is 95%, the proposed algorithm's accuracy is 1.21% higher than the second-best algorithm, saving $363,000 annually.

In summary, the proposed model outperforms the comparative models at the three different PINCs, particularly excelling in the results for the electrical load. From the above results, three conclusions can be drawn: 1) The proposed model performs better in joint forecasts than traditional machine learning models represented by QRF, time-serial models represented by QLSTM, QTCN, and QTransformer, and probability models represented by BNN. 2) Introducing the attention mechanism in joint forecasting, instead of independently forecasting the three loads, allows for extracting correlations between the loads, leading to improved forecasting accuracy. 3) Incorporating the MTL structure can also improve the results.

3.3.2. Attention score results

Fig. 11 displays the attention scores of the proposed model for the three types of loads at a specific time step. There are four separate plots, each representing one attention head. The size of each plot is 3 × 3, reflecting the presence of three different types of loads: electrical, cold, and heat. In each plot, the rows correspond to the Q values for the three load types, and the columns correspond to the K values for the same three load types. Therefore, the first element in the first row represents the attention score obtained by multiplying the electrical load's Q value with the electrical load's K value. Similarly, the second element in the first row represents multiplying the electrical load's Q value with the cold load's K value. These scores indicate the degree of association between each Q and K. Higher scores imply that the Q focuses more on the information corresponding to the K, while lower scores indicate less emphasis. From Fig. 11, it can be observed that, at this specific moment, all four attention heads primarily emphasize the K values associated with the cold load, followed by the electrical load, and finally, the heat load.

Fig. 11. Attention scores of the proposed model at a certain time step: (a) Head 1, (b) Head 2, (c) Head 3, (d) Head 4.

3.3.3. Comparison of convergence speed

In addition to the three probabilistic evaluation metrics mentioned above, the convergence speed of the models is also an important metric for evaluating the models. Fig. 12 shows the training and testing loss curves for QLSTM, QTCN, QTransformer, and the proposed model. It can be observed that QLSTM converges more slowly than QTCN and the proposed model. Therefore, regarding convergence speed, choosing QTCN as the time-serial model for the multi-energy forecasting task is better than QLSTM. Furthermore, the convergence speed of QTCN is almost consistent with that of the proposed model. Even though the proposed model introduces the attention layer, the convergence speed remains relatively high. Therefore, using the attention layer to extract the correlation between loads is reasonable in terms of convergence speed. Moreover, while the QTransformer converges faster than the proposed model, it is worth noting that as the number of epochs increases, by epoch 150, the proposed model achieves a lower loss than the QTransformer. Therefore, from a convergence perspective, the results of the proposed model are also better than those of the other benchmarks.


Fig. 12. Training and testing loss for the proposed, QLSTM, QTCN and QTransformer models.

3.3.4. Comparison of training time

To illustrate the effectiveness of the MTL structure in accelerating the training process, Table 6 compares the training times of the single-task and multi-task models. In this table, the MTL model simultaneously outputs six quantiles, while the PM_ML_SQ model requires training six separate models, each corresponding to one quantile. When the model introduces multi-task learning, the training time is reduced from about 10 minutes to about 3 minutes, because the MTL-based proposed model incorporates a hard parameter-sharing mechanism, meaning that the shared parts of the network only need to be trained once. In contrast, when a model is trained in PM_ML_SQ, the entire neural network must be retrained for each quantile. Therefore, using MTL results in shorter training times. Combined with the results in Table 5, it is evident that by introducing MTL, the forecasting results improve, and the training time of the proposed model is significantly reduced. Therefore, introducing MTL in this forecasting task can enhance the model's performance. It is worth noting that the proposed model outperforms PM_ML_SQ for several reasons: 1) Compared to STL, the MTL structure helps the model to better capture underlying features and knowledge, thereby improving its generalization ability to new data. By sharing underlying representations, the model can adapt more effectively to different related tasks [41]. 2) MTL has the potential to decrease the likelihood of overfitting. Viewing the sharing of underlying representations as a regularization mechanism aids in mitigating the model's tendency to learn specific noise or outlier patterns on a particular task, resulting in improved prediction accuracy with a lower rate of redundancy [42]. 3) In situations where data is limited, MTL proves advantageous in optimizing the effective utilization of data by exchanging information across multiple tasks. The network can capitalize on knowledge gained from one task to enhance the predictive performance of the model with a lower rate of redundancy [43].

Table 6
Training and testing time of the PM_ML_SQ and proposed model.

Method            Training Time   Testing Time
PM_ML_SQ          10.5496 min     0.1773 s
Proposed Model    3.0467 min      0.1305 s

4. Conclusion

This paper proposes an MTL-based attentive QTCN for multi-energy probabilistic load forecasting. The proposed model is verified based on the electrical, cold, and heat loads from the ASU metabolic system. First, compared to the other benchmarks, the proposed model performs better on the three evaluation criteria for PIs. The proposed method demonstrates superior average accuracy compared to the second-best algorithm by 1.69%, 2.40%, and 1.21%, with average interval widths outperforming the second-best algorithm by 0.0151, 0.017, and 0.0149, and average overall scores surpassing the second-best algorithm by 0.0009, 0.0043, and 0.0027, respectively. Furthermore, the estimated economic benefits that the proposed algorithm can bring are higher than those of the second-best algorithm by $50,700, $72,000, and $36,300, respectively.

Second, the experimental results indicate that the attention mechanism can effectively extract correlations between different loads, thereby improving the accuracy of predictions. After introducing the attention mechanism, the average prediction accuracy increases by 3.38%, 2.96%, and 0.44% under the three PINCs, respectively. In addition, QTCN exhibits faster convergence than QLSTM, and combining the attention mechanism with TCN does not significantly degrade QTCN's convergence speed.

Finally, the MTL structure can enhance the model training speed while maintaining the accuracy of the results. After introducing the MTL structure, the training time of the model is reduced roughly threefold on the given dataset. All the results indicate that the proposed model has great potential for practical MEPF tasks in IES.

The experimental results indicate that the proposed model can effectively enhance prediction accuracy, leading to improved economic benefits. Therefore, the proposed model holds application value in the prediction tasks of IES.

CRediT authorship contribution statement

Han Guo: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation. Bin Huang: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Jianhui Wang: Writing – review & editing, Writing – original draft, Supervision, Resources, Project administration, Funding acquisition, Conceptualization.


Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Appendix A. Attention mechanism

A.1. Scaled dot-product attention

The scaled dot-product attention function is a computational process that accepts a query and a set of key-value pairs, producing an output. In this procedure, the query, keys, values, and output are all depicted as matrices, each with the definitions outlined below.

$Q = X W_q, \quad K = X W_k, \quad V = X W_v$   (A.1)

$O = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{Q K^T}{\sqrt{d_k}}\right) V$   (A.2)

where $W_q \in \mathbb{R}^{D \times d_k}$, $W_k \in \mathbb{R}^{D \times d_k}$, and $W_v \in \mathbb{R}^{D \times d_v}$ are the linear transformation matrices of the input sequence $X \in \mathbb{R}^{T \times D}$, and $Q \in \mathbb{R}^{T \times d_k}$, $K \in \mathbb{R}^{T \times d_k}$, $V \in \mathbb{R}^{T \times d_v}$, and $O \in \mathbb{R}^{T \times d_v}$ are the matrices of the queries, keys, values, and outputs, respectively.

The dimension of the input sequence $X$ in this paper is $1 \times 3$, meaning three different loads need to be forecast. The query matrix $Q$ serves the purpose of evaluating the relationships between the current position and other positions, while the key matrix $K$ is employed to gauge the significance of other positions concerning the current one. The value matrix $V$ is crucial in generating the final weighted representation.

A.2. Multi-head attention

The output of the multi-head attention is defined below.

$\mathrm{Multihead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$   (A.3)

$\mathrm{head}_i = \mathrm{Attention}_i(Q_i, K_i, V_i)$   (A.4)

where $h$ is the number of heads and $W^O \in \mathbb{R}^{h d_v \times d_{out}}$ is the parameter matrix that projects the concatenation of the $h$ head outputs to the size of the multi-head attention model.

Appendix B. Detailed calculation of evaluation criteria

B.1. Reliability

The formulas of PINC and PICP can be expressed as follows.

$PINC = 100(1 - \alpha)\%, \quad \alpha \in (0, 1)$   (B.1)

$PICP = \dfrac{1}{|N|} \sum_{t \in N} \mathbb{1}\{y_i \in \hat{I}_t^{(\alpha)}\}$   (B.2)

where $N$ represents the test dataset, $|N|$ is its cardinality, $\mathbb{1}(\cdot)$ is the indicator function, $\alpha$ is the nominal proportion, and $\hat{I}_t^{(\alpha)}$ is the prediction interval (PI) for test sample $t$ given a certain PINC $100(1 - \alpha)\%$.

The difference between PICP and PINC, which is termed the average coverage error (ACE), can be used to assess the quality of PIs.

$ACE = PICP - (1 - \alpha)$   (B.3)

More specifically, a given nominal proportion $\alpha$, for example 0.05, is also referred to as the significance level in statistics. Then, we can derive the PINC using (B.1); according to this definition, PINC is also called the confidence level in statistics. Assume there are 1000 samples and a nominal proportion $\alpha$ of, for example, 0.10, so that the PINC equals 90%. In this regard, PINC refers to the probability that the actual value falls within the prediction interval; so, in this example, theoretically, 900 samples would fall within their respective prediction intervals. PICP refers to the empirical proportion of the 1000 samples that fall within their respective prediction intervals: if 800 samples fall within their respective prediction intervals, the PICP equals 0.8. Finally, the ACE can be calculated using (B.3). According to the definitions of PINC and PICP, the smaller the absolute value of ACE, the higher the forecasting accuracy.

B.2. Sharpness

Sharpness can be measured by the AW of the PIs and can be represented as follows.

$AW = \dfrac{1}{|N|} \sum_{t \in N} \left( \hat{U}_t^{(\alpha)} - \hat{L}_t^{(\alpha)} \right)$   (B.4)

where $\hat{U}_t^{(\alpha)}$ is the upper bound of the PI for test sample $t$, and $\hat{L}_t^{(\alpha)}$ is the lower bound of the PI given a nominal proportion $\alpha$. A lower value of AW is preferable when the ACE value is promising.

B.3. Overall score

The overall score for PIs can be calculated as follows.

$W_t^{(\alpha)} = \hat{U}_t^{(\alpha)} - \hat{L}_t^{(\alpha)}$   (B.5)

$S_t^{(\alpha)} = \begin{cases} -2\alpha W_t^{(\alpha)} - 4(\hat{L}_t^{(\alpha)} - y_i), & y_i < \hat{L}_t^{(\alpha)} \\ -2\alpha W_t^{(\alpha)}, & y_i \in \hat{I}_t^{(\alpha)} \\ -2\alpha W_t^{(\alpha)} - 4(y_i - \hat{U}_t^{(\alpha)}), & y_i > \hat{U}_t^{(\alpha)} \end{cases}$   (B.6)

$\bar{S}_t^{(\alpha)} = \dfrac{1}{|N|} \sum_{t \in N} S_t^{(\alpha)}$   (B.7)

where $W_t^{(\alpha)}$ and $S_t^{(\alpha)}$ are the PI's width and overall score for test sample $t$, respectively, and $\bar{S}_t^{(\alpha)}$ is the average overall score over the test dataset $N$. More specifically, when the actual values fall within the PIs, the absolute value of $S_t^{(\alpha)}$ is minimized, and a penalty term enters $S_t^{(\alpha)}$ when the actual values fall outside the PIs. Therefore, according to the definition, the smaller the absolute value of $S_t^{(\alpha)}$, the better the forecasting performance.

Take an example. Let $\alpha$ equal 0.10 and the PI equal [0.2, 0.6] for a certain test sample. If the true value is 0.5, within the PI, then $S_t^{(\alpha)}$ can be calculated using (B.6) and equals $-2 \times 0.1 \times (0.6 - 0.2)$. If the true value is 0.7, outside the PI, $S_t^{(\alpha)}$ equals $-2 \times 0.1 \times (0.6 - 0.2) - 4 \times (0.7 - 0.6)$, where $-4 \times (0.7 - 0.6)$ is the penalty term. Suppose there are 1000 test samples, each with its PI given a specific $\alpha$. We can calculate the corresponding $S_t^{(\alpha)}$ for each PI; by taking the average of these $S_t^{(\alpha)}$ values, we obtain the average overall score $\bar{S}_t^{(\alpha)}$.
𝐴𝐶𝐸 = 𝑃 𝐼𝐶𝑃 − (1 − 𝛼) (B.3) 2022;309:118345.

[4] Hong T. Crystal ball lessons in predictive analytics. EnergyBiz Mag 2015;12(2):35–7.
[5] Chen K, Chen K, Wang Q, He Z, Hu J, He J. Short-term load forecasting with deep residual networks. IEEE Trans Smart Grid 2018;10(4):3943–52.
[6] Dotzauer E. Simple model for prediction of loads in district-heating systems. Appl Energy 2002;73(3–4):277–84.
[7] Amabile L, Bresch-Pietri D, El Hajje G, Labbé S, Petit N. Optimizing the self-consumption of residential photovoltaic energy and quantification of the impact of production forecast uncertainties. Adv Appl Energy 2021;2:100020.
[8] Ramanathan R, Engle R, Granger CW, Vahid-Araghi F, Brace C. Short-run forecasts of electricity loads and peaks. Int J Forecast 1997;13(2):161–74.
[9] Chen J-F, Wang W-M, Huang C-M. Analysis of an adaptive time-series autoregressive moving-average (ARMA) model for short-term load forecasting. Electr Power Syst Res 1995;34(3):187–96.
[10] Wang Z, Hong T, Li H, Piette MA. Predicting city-scale daily electricity consumption using data-driven models. Adv Appl Energy 2021;2:100025.
[11] Chen Y, Xu P, Chu Y, Li W, Wu Y, Ni L, et al. Short-term electrical load forecasting using the support vector regression (SVR) model to calculate the demand response baseline for office buildings. Appl Energy 2017;195:659–70.
[12] Lindberg O, Lingfors D, Arnqvist J, van der Meer D, Munkhammar J. Day-ahead probabilistic forecasting at a co-located wind and solar power park in Sweden: trading and forecast verification. Adv Appl Energy 2023;9:100120.
[13] Huang B, Zhao T, Yue M, Wang J. Bi-level adaptive storage expansion strategy for microgrids using deep reinforcement learning. IEEE Trans Smart Grid 2024;15(2):1362–75.
[14] Huang B, Wang J. Adaptive static equivalences for active distribution networks with massive renewable energy integration: a distributed deep reinforcement learning approach. IEEE Trans Netw Sci Eng 2023.
[15] Huang B, Wang J. Applications of physics-informed neural networks in power systems - a review. IEEE Trans Power Syst 2022;38(1):572–88.
[16] Paletta Q, Terrén-Serrano G, Nie Y, Li B, Bieker J, Zhang W, et al. Advances in solar forecasting: computer vision with deep learning. Adv Appl Energy 2023:100150.
[17] Zhan C, Ghaderibaneh M, Sahu P, Gupta H. DeepMTL Pro: deep learning based multiple transmitter localization and power estimation. Pervasive Mob Comput 2022;82:101582.
[18] Zhan C, Ghaderibaneh M, Sahu P, Gupta H. DeepMTL: deep learning based multiple transmitter localization. In: 2021 IEEE 22nd international symposium on a world of wireless, mobile and multimedia networks (WoWMoM). IEEE; 2021. p. 41–50.
[19] Chen Y, Zhang D. Theory-guided deep-learning for electrical load forecasting (TgDLF) via ensemble long short-term memory. Adv Appl Energy 2021;1:100004.
[20] Fu G. Deep belief network based ensemble approach for cooling load forecasting of air-conditioning system. Energy 2018;148:269–82.
[21] Gao J, Chen Y, Hu W, Zhang D. An adaptive deep-learning load forecasting framework by integrating transformer and domain knowledge. Adv Appl Energy 2023;10:100142.
[22] Wang Y, Gan D, Sun M, Zhang N, Lu Z, Kang C. Probabilistic individual load forecasting using pinball loss guided LSTM. Appl Energy 2019;235:10–20.
[23] Cao Z, Wan C, Zhang Z, Li F, Song Y. Hybrid ensemble deep learning for deterministic and probabilistic low-voltage load forecasting. IEEE Trans Power Syst 2019;35(3):1881–97.
[24] Liu X, Yang L, Zhang Z. The attention-assisted ordinary differential equation networks for short-term probabilistic wind power predictions. Appl Energy 2022;324:119794.
[25] Zheng Z, Zhang Z. A stochastic recurrent encoder decoder network for multistep probabilistic wind power predictions. IEEE Trans Neural Netw Learn Syst 2023.
[26] Sun M, Zhang T, Wang Y, Strbac G, Kang C. Using Bayesian deep learning to capture uncertainty for residential net load forecasting. IEEE Trans Power Syst 2019;35(1):188–201.
[27] Zhou B, Meng Y, Huang W, Wang H, Deng L, Huang S, et al. Multi-energy net load forecasting for integrated local energy systems with heterogeneous prosumers. Int J Electr Power Energy Syst 2021;126:106542.
[28] Zhu R, Guo W, Gong X. Short-term load forecasting for CCHP systems considering the correlation between heating, gas and electrical loads based on deep learning. Energies 2019;12(17):3308.
[29] Wang C, Wang Y, Ding Z, Zheng T, Hu J, Zhang K. A transformer-based method of multienergy load forecasting in integrated energy system. IEEE Trans Smart Grid 2022;13(4):2703–14.
[30] Tan M, Liao C, Chen J, Cao Y, Wang R, Su Y. A multi-task learning method for multi-energy load forecasting based on synthesis correlation analysis and load participation factor. Appl Energy 2023;343:121177.
[31] Wang C, Wang Y, Ding Z, Zhang K. Probabilistic multi-energy load forecasting for integrated energy system based on Bayesian transformer network. IEEE Trans Smart Grid 2024;15(2):1495–508.
[32] Zheng Z, Zhang Z. A temporal convolutional recurrent autoencoder based framework for compressing time series data. Appl Soft Comput 2023;147:110797.
[33] Fei Z, Zhang Z, Yang F, Tsui K-L. A deep attention-assisted and memory-augmented temporal convolutional network based model for rapid lithium-ion battery remaining useful life predictions with limited data. J Energy Storage 2023;62:106903.
[34] Ye Y, Wang H, Cui T, Yang X, Yang S, Zhang M-L. Identifying generalizable equilibrium pricing strategies for charging service providers in coupled power and transportation networks. Adv Appl Energy 2023;12:100151.
[35] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30.
[36] Bai S, Kolter JZ, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint, arXiv:1803.01271, 2018.
[37] Koenker R, Hallock KF. Quantile regression. J Econ Perspect 2001;15(4):143–56.
[38] Wang Y, Zhang N, Tan Y, Hong T, Kirschen DS, Kang C. Combining probabilistic load forecasts. IEEE Trans Smart Grid 2018;10(4):3664–74.
[39] Zhang Y, Wang J, Wang X. Review on probabilistic forecasting of wind power generation. Renew Sustain Energy Rev 2014;32:255–70.
[40] Shen Z, Wei K, Zang H, Li L, Wang G. The application of artificial intelligence to the Bayesian model algorithm for combining genome data. Acad J Sci Technol 2023;8(3):132–5.
[41] Crawshaw M. Multi-task learning with deep neural networks: a survey. arXiv preprint, arXiv:2009.09796, 2020.
[42] Zhang Y, Yang Q. A survey on multi-task learning. IEEE Trans Knowl Data Eng 2022;34(12):5586–609. https://doi.org/10.1109/TKDE.2021.3070203.
[43] Vandenhende S, Georgoulis S, Van Gansbeke W, Proesmans M, Dai D, Van Gool L. Multi-task learning for dense prediction tasks: a survey. IEEE Trans Pattern Anal Mach Intell 2021;44(7):3614–33.
