Abstract—In the field of trend forecasting in time series, many popular Deep Learning (DL) methods, such as Long Short-Term Memory (LSTM) models, have long been the gold standard. However, depending on the domain and application, it has been shown that a newer approach, the Transformer deep neural network, can be implemented and may be more beneficial. Moreover, one can incorporate Federated Learning (FL) to further enhance the prospective utility of the models, enabling multiple data providers to jointly train a common model while maintaining the privacy of their data. In this paper, we use an experimental Federated Learning System that employs both Transformer and LSTM models on a variety of datasets. The system receives data from multiple clients and uses federation to create an optimized global model. The potential of Federated Learning in real-time forecasting is explored by comparing the federated approach with conventional local training. Furthermore, a comparison is made between the performance of the Transformer and its equivalent LSTM in order to determine which one is more effective in each given domain; this comparison shows that the Transformer model can produce better results, especially when optimized by the FL process.

Index Terms—Federated Learning, Deep Learning, LSTM, Transformers, Forecasting, Federated Dataset

I. INTRODUCTION

The development of reliable AI models in the realm of Machine Learning (ML) often requires the gathering of data from multiple sources. This strategy of using varied data can enhance the model's predictive capabilities, as it is trained on a broader range of patterns. Nevertheless, this approach has several drawbacks and risks, the primary concern being privacy. All datasets from each source must be uploaded to a single server, creating a strictly centralized ML technique that may intentionally or unintentionally expose sensitive information. Federated Learning is a technique aimed at addressing this difficulty, among others. Unfortunately, a federated system usually has to process non-Independent and Identically Distributed (non-IID) data [1]. Such data is highly volatile and usually leads to lower model quality. This work investigates the effect of volatile federated data on benchmark Deep Neural Architectures in order to assess their efficacy in the federated environment on everyday tasks.

Under the broader category of Deep Learning, Transformers [2] are a particular type of Artificial Intelligence (AI) model that excels in Natural Language Processing (NLP) tasks [3] in particular. These models are unrelated to other popular types, such as Recurrent Neural Networks (RNNs) [4].

Federated Learning is a technique designed to overcome various challenges associated with traditional Machine Learning methods, such as the privacy dangers that come from the need to collect data from multiple sources [5].

This paper introduces a Centralized Federated Learning System (CFLS) that employs state-of-the-art DL techniques to address the challenges posed by each provided use case and also helps in deploying and evaluating the aforementioned DNN architectures. For our purposes, we focus on Transformer models and how they compare to the popular Long Short-Term Memory (LSTM) [6] approach. In summary, this work offers the following contributions:
• Creates an experimentation system to evaluate the performance of federated data on benchmark DNN architectures
• Compares the performance of two architectures (LSTM, Transformer) on volatile (non-IID) federated data

The rest of this paper is organized as follows: Section II features the related work. Section III outlines the methodology, while Section IV presents a series of quantitative results on the available data. The paper concludes with Section V.

∗ I. Siniosoglou and P. Sarigiannidis are with the Department of Electrical and Computer Engineering, University of Western Macedonia, Kozani, Greece - E-Mail: {isiniosoglou, psarigiannidis}@uowm.gr
† K. Xouveroudis is with the R&D Department, MetaMind Innovations P.C., Kozani, Greece - E-Mail: kxouveroudis@metamind.gr
‡ V. Argyriou is with the Department of Networks and Digital Media, Kingston University, Kingston upon Thames, United Kingdom - E-Mail: vasileios.argyriou@kingston.ac.uk
§ T. Lagkas is with the Department of Computer Science, International Hellenic University, Kavala Campus, Greece - E-Mail: tlagkas@cs.ihu.gr
¶ S. K. Goudos is with the Physics Department, Aristotle University of Thessaloniki, Thessaloniki, Greece - E-Mail: sgoudo@physics.auth.gr
∥ K. E. Psannis is with the Department of Applied Informatics, School of Information Sciences, University of Macedonia, Thessaloniki, Greece - E-Mail: kpsannis@uom.edu.gr

II. LITERATURE REVIEW

Transformer models are widely used in the field, with notable examples including BERT [7], a method of pre-training Transformer models that has achieved top-level performance across various NLP tasks. In addition, [8] proposed RoBERTa, which made several optimizations to the BERT pre-training process, resulting in improved performance. Other noteworthy publications include XLM-R [9], a cross-lingual variant of RoBERTa that can attain high-quality representations for languages it was not explicitly trained on, as well as GPT-2 [10], a large-scale Transformer-based language model that has demonstrated remarkable performance overall.
Authorized licensed use limited to: Aristotle University of Thessaloniki. Downloaded on July 18,2023 at 07:48:04 UTC from IEEE Xplore. Restrictions apply.
979-8-3503-2107-4/23/$31.00 ©2023 IEEE
2023 12th International Conference on Modern Circuits and Systems Technologies (MOCAST)
PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)    (2)

where pos represents the position of the element within the sequence, i denotes the index of the dimension in the embedding vector, and d_{model} refers to the embedding's dimensionality. The sin formula (Eq. 1) is used to build a vector for each even index and the cos formula (Eq. 2) for each odd index. Then, those vectors are added to the parts of the input that correspond to them.

We can see the Transformer model's complete structure in the diagram in Fig. 1. The Encoder layer has two sub-modules: multi-headed attention and a fully connected network. There are also residual connections around each of the two sub-layers, each followed by a normalization layer. Self-attention is applied in the Encoder via the multi-headed attention module. This feature allows each input component to be connected to the other components. To achieve self-attention, the input is fed into three distinct, fully connected layers, producing (i) the Query Q, (ii) the Key K, and (iii) the Values V.

The overall process can be seen in Fig. 2. Once the Q, K, and V vectors have each been passed through a linear layer, we compute the attention score, which determines the degree of importance that one component, such as a word, should assign to the other components. To calculate the attention score, the Q vector is dot-multiplied with all the K vectors within the sequence. To achieve greater gradient stability, the scores are scaled down by dividing them by the square root of the dimension of the query and key. Finally, the attention weights are derived by applying the softmax function to the scaled scores, resulting in probability values in [0, 1], with a higher value indicating a greater level of significance for the component, as seen in Eq. 3:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V    (3)

Additionally, multi-headed attention can be enabled by parallelizing the attention mechanism, enabling multiple processes to utilize it in conjunction. This is achieved by splitting the query, key, and value into multiple vectors prior to applying self-attention. Each self-attention process is referred to as a head. Every head generates an output vector, and these vectors are concatenated into a single vector before entering the final linear layer. The goal is for each head to learn information at least partially exclusive to it, further expanding the Encoder's capabilities. A representation of multi-head attention is given in Fig. 3.

A residual connection is created by adding the multi-headed attention output vector to the initial positional input embedding. Subsequently, the residual connection's output is normalized via layer normalization and is then projected through a pointwise feed-forward network to be further processed. This network, also seen in Fig. 4, consists of linear layers with a ReLU activation function between them. The output of this network is likewise added via a residual connection and normalized.

The Decoder, like the Encoder, has two multi-headed attention layers, a feed-forward layer, residual connections, and layer normalization. It generates tokens sequentially using a start token and is auto-regressive. The first attention layer calculates scores for the input using positional embeddings. The final layer is a classifier, with the Encoder providing attention-related information.

To prevent conditioning on future tokens in the first attention layer, a masking process is implemented. It modifies the attention scores of future tokens using look-ahead masking, which involves setting them to -inf. This is displayed in Eq. 4, where a) the first matrix represents the scaled scores, b) the second matrix represents the look-ahead mask added to it, and c) the last matrix represents the new, masked scores:

\begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nn} \end{bmatrix} + \begin{bmatrix} 0 & -\infty & \cdots & -\infty \\ 0 & 0 & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} = \begin{bmatrix} s_{11} & -\infty & \cdots & -\infty \\ s_{21} & s_{22} & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nn} \end{bmatrix}    (4)

When utilizing attention, the masking process is executed between the scaling and the softmax layer (Fig. 2). The new scores are then normalized through softmax, making all future token values zero and eliminating their impact. The Decoder's second attention layer aligns the input of the Encoder with that of the Decoder, emphasizing the most significant information between them. The output of this layer undergoes further processing before being fed into a linear and a softmax layer, and the highest-score index is chosen as the prediction. The Decoder can be stacked in multiple layers, adding diversity and enhancing predictive capabilities.

C. Federated Learning

Federated Learning is a modern technique for securely training decentralized Machine Learning and Deep Learning models. FL [16] manages the distribution, training, and aggregation of DL models across multiple devices at the edge.
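The aggregation step of FL can be illustrated with the Federated Averaging rule of [16]: each client trains locally, and the server combines the returned parameters weighted by each client's share of the data. The following is a minimal sketch under that assumption; the function name and data layout are illustrative and do not reflect the internals of the CFLS.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style [16]).

    client_weights: one list of np.ndarray layers per client
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    global_weights = []
    for layer in range(n_layers):
        # Each client's layer contributes in proportion to its data share.
        agg = sum((n / total) * w[layer]
                  for w, n in zip(client_weights, client_sizes))
        global_weights.append(agg)
    return global_weights
```

For example, two clients holding 1 and 3 samples respectively contribute with weights 1/4 and 3/4 to the global model in each round.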
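Referring back to Eqs. 1-4, the sinusoidal positional encoding and the masked scaled dot-product attention can be sketched in NumPy as follows. This is a minimal illustration, assuming an even d_model and a single attention head; the function names are not taken from the paper's implementation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even indices (Eq. 1), cos on odd (Eq. 2).

    Assumes d_model is even.
    """
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def masked_attention(Q, K, V):
    """Scaled dot-product attention (Eq. 3) with a look-ahead mask (Eq. 4)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled scores
    mask = np.triu(np.ones_like(scores), k=1)       # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)   # future tokens -> -inf
    # Row-wise softmax; exp(-inf) = 0, so masked positions get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

After the softmax, every row of the weight matrix sums to 1 while all entries above the diagonal are exactly zero, which is precisely the effect described for Eq. 4.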
B. Experimental Results

To validate the performance of the tested models, three benchmark metrics were used, outlined below in Eqs. 6, 7, and 8. Here, T_i represents the true values, P_i represents the model predictions, and n denotes the total number of data points.

1) Mean Square Error (MSE): This metric calculates the average of the squared differences between the predicted and true values.
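The MSE of Eq. 6 can be computed directly; the remaining two metrics (Eqs. 7 and 8) are not reproduced in this excerpt, so only MSE is sketched here, using the T_i / P_i notation defined above.

```python
import numpy as np

def mean_squared_error(T, P):
    """MSE = (1/n) * sum_i (T_i - P_i)^2, with T the true values
    and P the model predictions (Eq. 6)."""
    T = np.asarray(T, dtype=float)
    P = np.asarray(P, dtype=float)
    return np.mean((T - P) ** 2)
```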
knowledge from the distributed data nodes. As can be seen in the results, the Federated Transformer models are superior to both the local and the Federated LSTM models.

A more pronounced distinction is observed in the stable sensor data. In the local training, the LSTM and Transformer models exhibit similar performance. However, when subjected to FL, the Transformer displays substantially lower loss and produces more consistent results across its nodes overall; moreover, unlike the LSTM, the Transformer's FL models outperform their local counterparts.

Lastly, in the forecasting of Daily Product Sales data, we observe another scenario where the performance of the two models is comparable. Yet, in this instance, the Transformer model also outperforms the LSTM. This pattern holds for both local and Federated Learning of the models.

V. CONCLUSIONS

Federated Learning is a valuable technique that aids in safeguarding data confidentiality, as well as offering model-quality enhancement and quicker processing times for large datasets and real-time predictions. In this paper, we compare the performance of two benchmark DNN architectures, namely a) LSTMs and b) Transformers, for time-series forecasting on real-world federated datasets containing non-IID distributions. We further employed FL to evaluate the performance of the local models against the federated global model that is trained on knowledge from distributed data nodes. The experimental results showed that the performance of both the local and Federated Transformer models was comparable to that of the LSTM models. The results indicate that Transformers consistently have the potential to compete with and even surpass LSTMs in predicting on federated data intricately affected by external factors not always included in the data features.

ACKNOWLEDGEMENT

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 957406 (TERMINET).

REFERENCES

[1] M. Tiwari, I. Maity, and S. Misra, "Fedserv: Federated task service in fog-enabled internet of vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 20943-20952, 2022.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman, "Natural language processing: an introduction," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 544-551, 2011.
[4] S. Grossberg, "Recurrent neural networks," Scholarpedia, vol. 8, no. 2, p. 1888, 2013.
[5] I. Siniosoglou, V. Argyriou, S. Bibi, T. Lagkas, and P. Sarigiannidis, "Unsupervised ethical equity evaluation of adversarial federated networks," in Proceedings of the 16th International Conference on Availability, Reliability and Security, ser. ARES 21. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3465481.3470478
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[9] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," arXiv preprint arXiv:1911.02116, 2019.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[11] J. G. De Gooijer and R. J. Hyndman, "25 years of time series forecasting," International Journal of Forecasting, vol. 22, no. 3, pp. 443-473, 2006.
[12] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106-11115.
[13] N. Wu, B. Green, X. Ben, and S. O'Banion, "Deep transformer models for time series forecasting: The influenza prevalence case," arXiv preprint arXiv:2001.08317, 2020.
[14] B. Tang and D. S. Matteson, "Probabilistic transformer for time series analysis," Advances in Neural Information Processing Systems, vol. 34, pp. 23592-23608, 2021.
[15] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273-1282.
[17] A. Taïk and S. Cherkaoui, "Electrical load forecasting using edge computing and federated learning," in ICC 2020 - 2020 IEEE International Conference on Communications (ICC). IEEE, 2020, pp. 1-6.
[18] T. Yu, G. Hua, H. Wang, J. Yang, and J. Hu, "Federated-LSTM based network intrusion detection method for intelligent connected vehicles," in ICC 2022 - IEEE International Conference on Communications. IEEE, 2022, pp. 4324-4329.
[19] Y. Liu, S. Garg, J. Nie, Y. Zhang, Z. Xiong, J. Kang, and M. S. Hossain, "Deep anomaly detection for time-series data in industrial IoT: A communication-efficient on-device federated learning approach," IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6348-6358, 2020.
[20] H. Li, Z. Cai, J. Wang, J. Tang, W. Ding, C.-T. Lin, and Y. Shi, "FedTP: Federated learning by transformer personalization," arXiv preprint arXiv:2211.01572, 2022.
[21] A. Raza, K. P. Tran, L. Koehl, and S. Li, "AnoFed: Adaptive anomaly detection for digital health using transformer-based federated learning and support vector data description," Engineering Applications of Artificial Intelligence, vol. 121, p. 106051, 2023.
[22] J. Huang, Y.-F. Li, and M. Xie, "An empirical analysis of data preprocessing for machine learning-based software cost estimation," Information and Software Technology, vol. 67, pp. 108-127, 2015.
[23] W. McKinney et al., "pandas: a foundational python library for data analysis and statistics," Python for High Performance and Scientific Computing, vol. 14, no. 9, pp. 1-9, 2011.
[24] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[25] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, "Federated learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 13, no. 3, pp. 1-207, 2019.
[26] C. Chaschatzis, A. Lytos, S. Bibi, T. Lagkas, C. Petaloti, S. Goudos, I. Moscholios, and P. Sarigiannidis, "Integration of information and communication technologies in agriculture for farm management and knowledge exchange," in 2022 11th International Conference on Modern Circuits and Systems Technologies (MOCAST), 2022, pp. 1-4.
[27] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," 2017. [Online]. Available: https://arxiv.org/abs/1712.04621