Abstract—In the field of trend forecasting in time series, many popular Deep Learning (DL) methods, such as Long Short-Term Memory (LSTM) models, have long been the gold standard. However, depending on the domain and application, it has been shown that a newer approach, the Transformer deep neural network, can be implemented and may be more beneficial. Moreover, one can incorporate Federated Learning (FL) to further enhance the prospective utility of the models, enabling multiple data providers to jointly train a common model while maintaining the privacy of their data. In this paper, we use an experimental Federated Learning System that employs both Transformer and LSTM models on a variety of datasets. The system receives data from multiple clients and uses federation to create an optimized global model. The potential of Federated Learning in real-time forecasting is explored by comparing the federated approach with conventional local training. Furthermore, a comparison is made between the performance of the Transformer and its equivalent LSTM in order to determine which one is more effective in each given domain; this comparison shows that the Transformer model can produce better results, especially when optimized by the FL process.

Index Terms—Federated Learning, Deep Learning, LSTM, Transformers, Forecasting, Federated Dataset

I. INTRODUCTION

The development of reliable AI models in the realm of Machine Learning (ML) often requires the gathering of data from multiple sources. This strategy of using varied data can enhance the model's predictive capabilities, as it is trained on a broader range of patterns. Nevertheless, this approach has several drawbacks and risks, the primary concern being privacy. All datasets from each source must be uploaded to a single server, creating a strictly centralized ML technique that may intentionally or unintentionally expose sensitive information. Federated Learning is a technique aimed at addressing this difficulty, among others. Unfortunately, a federated system usually has to process non-Independent and Identically Distributed (non-IID) data [1]. Such data is highly volatile and usually leads to lower model quality. This work investigates the effect of volatile federated data on benchmark Deep Neural Architectures in order to assess their efficacy in the federated environment on everyday tasks.

Under the broader category of Deep Learning, Transformers [2] are a particular type of Artificial Intelligence (AI) model that excels in Natural Language Processing (NLP) tasks [3] in particular. These models are unrelated to other popular types, such as Recurrent Neural Networks (RNNs) [4].

Federated Learning is a technique designed to overcome various challenges associated with traditional Machine Learning methods, such as the privacy dangers that come from the need to collect data from multiple sources [5].

This paper introduces a Centralized Federated Learning System (CFLS) that employs state-of-the-art DL techniques to address the challenges posed by each provided use case and also helps in deploying and evaluating the aforementioned DNN architectures. For our purposes, we focus on Transformer models and how they compare to the popular Long Short-Term Memory (LSTM) [6] approach. In summary, this work offers the following contributions:
• Creates an experimentation system to evaluate the performance of federated data on benchmark DNN architectures
• Compares the performance of two architectures (LSTM, Transformer) on volatile (non-IID) federated data

The rest of this paper is organized as follows: Section II features the related work. Section III outlines the methodology, while Section IV presents a series of quantitative results on the available data. The paper concludes with Section V.

∗ I. Siniosoglou and P. Sarigiannidis are with the Department of Electrical and Computer Engineering, University of Western Macedonia, Kozani, Greece - E-Mail: {isiniosoglou, psarigiannidis}@uowm.gr
† K. Xouveroudis is with the R&D Department, MetaMind Innovations P.C., Kozani, Greece - E-Mail: kxouveroudis@metamind.gr
‡ V. Argyriou is with the Department of Networks and Digital Media, Kingston University, Kingston upon Thames, United Kingdom - E-Mail: vasileios.argyriou@kingston.ac.uk
§ T. Lagkas is with the Department of Computer Science, International Hellenic University, Kavala Campus, Greece - E-Mail: tlagkas@cs.ihu.gr
¶ S. K. Goudos is with the Physics Department, Aristotle University of Thessaloniki, Thessaloniki, Greece - E-Mail: sgoudo@physics.auth.gr
∥ K. E. Psannis is with the Department of Applied Informatics, School of Information Sciences, University of Macedonia, Thessaloniki, Greece - E-Mail: kpsannis@uom.edu.gr

II. LITERATURE REVIEW

Transformer models are widely used in the field, with notable examples including BERT [7], a method of pre-training Transformer models that has achieved top-level performance across various NLP tasks. In addition, [8] proposed RoBERTa, which made several optimizations to the BERT pre-training process, resulting in improved performance. Other noteworthy publications include XLM-R [9], a cross-lingual variant of RoBERTa that can attain high-quality representations for languages it was not explicitly trained on, as well as GPT-2 [10], a large-scale Transformer-based language model that has demonstrated remarkable performance overall.
Authorized licensed use limited to: Aristotle University of Thessaloniki. Downloaded on July 18,2023 at 07:48:04 UTC from IEEE Xplore. Restrictions apply.
979-8-3503-2107-4/23/$31.00 ©2023 IEEE
2023 12th International Conference on Modern Circuits and Systems Technologies (MOCAST)
PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)    (2)

where pos represents the position of the element within the sequence, i denotes the index of the dimension in the embedding vector, and d_{model} refers to the embedding's dimensionality. The sin formula (Eq. 1) is used to build a vector for each even index and the cos formula (Eq. 2) for each odd index. Then, those vectors are added to the parts of the input that correspond to them.

We can see the Transformer model's complete structure in the diagram in Fig. 1. The Encoder layer has two sub-modules: multi-headed attention and a fully connected network. There are also residual connections around each of the two sub-layers, each followed by a normalization layer. Self-attention is applied in the Encoder via the multi-headed attention module. This feature allows each input component to be connected to the other components. To achieve self-attention, the input is fed into three distinct, fully connected layers, producing (i) the Query Q, (ii) the Key K, and (iii) the Values V.

The overall process can be seen in Fig. 2. Once the Q, K, and V vectors have each been passed through a linear layer, we compute the attention score, which determines the degree of importance that one component, such as a word, should assign to the other components. To calculate the attention score, the Q vector is dot-multiplied with all the K vectors within the sequence. To achieve greater gradient stability, the scores are scaled down by dividing them by the square root of the dimension of the query and key. Finally, the attention weights are derived by applying the softmax function to the scaled scores, resulting in probability values in [0, 1], with a higher value indicating a greater level of significance for the component, as seen in Eq. 3:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V    (3)

Additionally, multi-headed attention can be enabled by parallelizing the attention mechanism, enabling multiple processes to utilize it in conjunction. This is achieved by splitting the query, key, and value into multiple vectors prior to applying self-attention. Each self-attention process is referred to as a head. Every head generates an output vector, and these vectors are concatenated into a single vector before entering the final linear layer. The goal is for each head to learn information at least partially exclusive to it, further expanding the Encoder's capabilities. A representation of multi-head attention is given in Fig. 3.

A residual connection is created by adding the multi-headed attention output vector to the initial positional input embedding. Subsequently, the residual connection's output is normalized via layer normalization and is then projected through a pointwise feed-forward network to be further processed. This network, also seen in Fig. 4, consists of linear layers with a ReLU activation function between them. The output of this network is likewise added via a residual connection and normalized.

The Decoder, like the Encoder, has two multi-headed attention layers, a feed-forward layer, residual connections, and layer normalization. It generates tokens sequentially using a start token and is auto-regressive. The first attention layer calculates scores for the input using positional embeddings. The final layer is a classifier, with the Encoder providing attention-related information.

To prevent conditioning on future tokens in the first attention layer, a masking process is implemented. It modifies the attention scores of future tokens using look-ahead masking, which involves setting them to -inf. This is displayed in Eq. 4, where a) the first matrix represents the scaled scores, b) the second matrix represents the look-ahead mask added to it, and c) the last matrix represents the new, masked scores:

\begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nn} \end{bmatrix} + \begin{bmatrix} 0 & -\infty & \cdots & -\infty \\ 0 & 0 & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} = \begin{bmatrix} s_{11} & -\infty & \cdots & -\infty \\ s_{21} & s_{22} & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nn} \end{bmatrix}    (4)

When utilizing attention, the masking process is executed between the scaling and the softmax layer (Fig. 2). The new scores are then normalized through softmax, making all future token values zero and eliminating their impact. The Decoder's second attention layer aligns the input of the Encoder with that of the Decoder, emphasizing the most significant information between them. The output of this layer undergoes further processing before being fed into a linear and a softmax layer, and the highest-score index is chosen as the prediction. The Decoder can be stacked in multiple layers, adding diversity and enhancing predictive capabilities.

C. Federated Learning

Federated Learning is a modern technique for securely training decentralized Machine Learning and Deep Learning models. FL [16] manages the distribution, training, and aggregation of DL models across multiple devices at the edge.
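The aggregation step of FL can be illustrated with the Federated Averaging rule of [16]: each client trains locally, and the server combines the returned parameters weighted by each client's share of the data. The following is a minimal sketch under that assumption; the function name and data layout are illustrative and do not reflect the internals of the CFLS.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style [16]).

    client_weights: one list of np.ndarray layers per client
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    global_weights = []
    for layer in range(n_layers):
        # Each client's layer contributes in proportion to its data share.
        agg = sum((n / total) * w[layer]
                  for w, n in zip(client_weights, client_sizes))
        global_weights.append(agg)
    return global_weights
```

For example, two clients holding 1 and 3 samples respectively contribute with weights 1/4 and 3/4 to the global model in each round.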
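Referring back to Eqs. 1-4, the sinusoidal positional encoding and the masked scaled dot-product attention can be sketched in NumPy as follows. This is a minimal illustration, assuming an even d_model and a single attention head; the function names are not taken from the paper's implementation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even indices (Eq. 1), cos on odd (Eq. 2).

    Assumes d_model is even.
    """
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def masked_attention(Q, K, V):
    """Scaled dot-product attention (Eq. 3) with a look-ahead mask (Eq. 4)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled scores
    mask = np.triu(np.ones_like(scores), k=1)       # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)   # future tokens -> -inf
    # Row-wise softmax; exp(-inf) = 0, so masked positions get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

After the softmax, every row of the weight matrix sums to 1 while all entries above the diagonal are exactly zero, which is precisely the effect described for Eq. 4.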
B. Experimental Results

To validate the performance of the tested models, three benchmark metrics were used, outlined below in Eqs. 6, 7, and 8. Here, T_i represents the true values, P_i represents the model predictions, and n denotes the total number of data points.

1) Mean Square Error (MSE): This metric calculates the average of the squared differences between the predicted and true values.
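The MSE of Eq. 6 can be computed directly; the remaining two metrics (Eqs. 7 and 8) are not reproduced in this excerpt, so only MSE is sketched here, using the T_i / P_i notation defined above.

```python
import numpy as np

def mean_squared_error(T, P):
    """MSE = (1/n) * sum_i (T_i - P_i)^2, with T the true values
    and P the model predictions (Eq. 6)."""
    T = np.asarray(T, dtype=float)
    P = np.asarray(P, dtype=float)
    return np.mean((T - P) ** 2)
```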
knowledge from the distributed data nodes. As can be seen in the results, the Federated Transformer models are superior to both the local and the Federated LSTM models.

A more pronounced distinction is observed in the stable sensor data. In the local training, the LSTM and Transformer models exhibit similar performance. However, when subjected to FL, the Transformer displays substantially lower loss and produces more consistent results across its nodes overall; moreover, unlike the LSTM, the Transformer's FL models outperform their local counterparts.

Lastly, in the forecasting of Daily Product Sales data, we observe another scenario where the performance of the two models is comparable. Yet, in this instance, the Transformer model also outperforms the LSTM. This pattern holds for both local and Federated Learning of the models.

V. CONCLUSIONS

Federated Learning is a valuable technique that aids in safeguarding data confidentiality, as well as offering model-quality enhancement and quicker processing times for large datasets and real-time predictions. In this paper, we compare the performance of two benchmark DNN architectures, namely a) LSTMs and b) Transformers, for time-series forecasting on real-world federated datasets containing non-IID distributions. We further employed FL to evaluate the performance of the local models against the federated global model that is trained on knowledge from distributed data nodes. The experimental results showed that the performance of both the local and Federated Transformer models was comparable to that of the LSTM models. The results indicate that Transformers consistently have the potential to compete with and even surpass LSTMs in predicting on federated data intricately affected by external factors not always included in the data features.

ACKNOWLEDGEMENT

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 957406 (TERMINET).

REFERENCES

[1] M. Tiwari, I. Maity, and S. Misra, "Fedserv: Federated task service in fog-enabled internet of vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 20943-20952, 2022.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman, "Natural language processing: an introduction," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 544-551, 2011.
[4] S. Grossberg, "Recurrent neural networks," Scholarpedia, vol. 8, no. 2, p. 1888, 2013.
[5] I. Siniosoglou, V. Argyriou, S. Bibi, T. Lagkas, and P. Sarigiannidis, "Unsupervised ethical equity evaluation of adversarial federated networks," in Proceedings of the 16th International Conference on Availability, Reliability and Security, ser. ARES 21. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3465481.3470478
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[9] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," arXiv preprint arXiv:1911.02116, 2019.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[11] J. G. De Gooijer and R. J. Hyndman, "25 years of time series forecasting," International Journal of Forecasting, vol. 22, no. 3, pp. 443-473, 2006.
[12] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106-11115.
[13] N. Wu, B. Green, X. Ben, and S. O'Banion, "Deep transformer models for time series forecasting: The influenza prevalence case," arXiv preprint arXiv:2001.08317, 2020.
[14] B. Tang and D. S. Matteson, "Probabilistic transformer for time series analysis," Advances in Neural Information Processing Systems, vol. 34, pp. 23592-23608, 2021.
[15] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273-1282.
[17] A. Taïk and S. Cherkaoui, "Electrical load forecasting using edge computing and federated learning," in ICC 2020 - 2020 IEEE International Conference on Communications (ICC). IEEE, 2020, pp. 1-6.
[18] T. Yu, G. Hua, H. Wang, J. Yang, and J. Hu, "Federated-LSTM based network intrusion detection method for intelligent connected vehicles," in ICC 2022 - IEEE International Conference on Communications. IEEE, 2022, pp. 4324-4329.
[19] Y. Liu, S. Garg, J. Nie, Y. Zhang, Z. Xiong, J. Kang, and M. S. Hossain, "Deep anomaly detection for time-series data in industrial IoT: A communication-efficient on-device federated learning approach," IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6348-6358, 2020.
[20] H. Li, Z. Cai, J. Wang, J. Tang, W. Ding, C.-T. Lin, and Y. Shi, "FedTP: Federated learning by transformer personalization," arXiv preprint arXiv:2211.01572, 2022.
[21] A. Raza, K. P. Tran, L. Koehl, and S. Li, "AnoFed: Adaptive anomaly detection for digital health using transformer-based federated learning and support vector data description," Engineering Applications of Artificial Intelligence, vol. 121, p. 106051, 2023.
[22] J. Huang, Y.-F. Li, and M. Xie, "An empirical analysis of data preprocessing for machine learning-based software cost estimation," Information and Software Technology, vol. 67, pp. 108-127, 2015.
[23] W. McKinney et al., "pandas: a foundational python library for data analysis and statistics," Python for High Performance and Scientific Computing, vol. 14, no. 9, pp. 1-9, 2011.
[24] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[25] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, "Federated learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 13, no. 3, pp. 1-207, 2019.
[26] C. Chaschatzis, A. Lytos, S. Bibi, T. Lagkas, C. Petaloti, S. Goudos, I. Moscholios, and P. Sarigiannidis, "Integration of information and communication technologies in agriculture for farm management and knowledge exchange," in 2022 11th International Conference on Modern Circuits and Systems Technologies (MOCAST), 2022, pp. 1-4.
[27] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," 2017. [Online]. Available: https://arxiv.org/abs/1712.04621