
DEGREE PROJECT IN TECHNOLOGY,

FIRST CYCLE, 15 CREDITS


STOCKHOLM, SWEDEN 2021

Predicting a business application's


cloud server CPU utilization using
the machine learning model LSTM

FILIP NÄÄS STARBERG

AXEL ROOTH

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

PREDICTING A BUSINESS APPLICATION'S CLOUD SERVER CPU UTILIZATION USING THE MACHINE LEARNING MODEL LSTM
F. Nääs Starberg and A. Rooth, Students, KTH

Abstract—Cloud computing sees increased adoption as companies seek to increase flexibility and reduce cost. Although the large cloud service providers employ a pay-as-you-go pricing model and enable customers to scale up and down quickly, there is still room for improvement. Workload in the form of CPU utilization often fluctuates, which leads to unnecessary cost and environmental impact for companies. To help mitigate this issue, the aim of this paper is to predict future CPU utilization using a long short-term memory (LSTM) machine learning model. By predicting utilization up to 30 minutes into the future, companies are able to scale their capacity just in time and avoid unnecessary cost and damage to the environment. The study is divided into two parts. The first part analyses how well the LSTM model performs when predicting one step at a time compared with a state-of-the-art model. The second part analyses the accuracy of the LSTM when making predictions up to 30 minutes into the future. To allow for an objective analysis of results, the LSTM is compared with a standard RNN, which is similar to the LSTM in its inherent algorithmic structure. To conclude, the results suggest that LSTM may be a useful tool for reducing cost and unnecessary environmental impact for business applications hosted on a public cloud.

Sammanfattning—The use of cloud services is increasing among companies that want improved flexibility and reduced costs. The large cloud service providers use a pricing model where the cost is directly tied to usage, and they let customers quickly adjust their capacity, but there is still room for improvement. CPU needs often fluctuate, which leads to pointless costs and unnecessary climate impact when capacity goes unused. To alleviate this problem, this report uses an LSTM machine learning model to predict future CPU utilization. By predicting utilization up to 30 minutes into the future, companies have time to adjust their capacity and avoid unnecessary cost and climate impact. The work is divided into two parts: first a part where the LSTM model predicts one time step at a time, then a part that analyses the accuracy of the LSTM several time steps into the future, up to 30 time steps. To enable an objective evaluation, the LSTM model was compared with a standard recurrent neural network (RNN), which is similar to the LSTM in its structure. The results of this study show that the LSTM appears to be superior to the RNN, both when predicting one time step into the future and when predicting several time steps into the future. The LSTM model was capable of predicting CPU utilization 30 minutes into the future with largely retained accuracy, which was also the goal of the study. In summary, the results indicate that this LSTM model, and possibly similar LSTM models, has the potential to be used with business applications where one wishes to reduce unnecessary cost and climate impact.

Index Terms—Cloud computing; CPU prediction; Infrastructure as a service; LSTM; Machine learning; Neural networks; Recurrent neural networks; Time series prediction

I. INTRODUCTION

THE adoption of cloud computing (CC) for enterprise applications is growing at a fast pace and it is expected to grow even more in the coming years [1]. Some of the advantages of CC are easy implementation, allowing focus on companies' core competencies, facilitating quick adaptation to current computational needs, and in many cases reduced cost.
Even though there are different pricing models, a fundamental concept within CC is usage based pricing, often referred to as "pay-as-you-go". Comparing usage based pricing to owning private servers, which must accommodate maximum load at all times, it is easy to grasp how CC can reduce cost in theory.
However, while cloud usage is expected to grow by 47 percent, 30 percent of cloud spend is wasted and 23 percent of cloud usage is over budget [2]. At the same time, data centers' energy consumption in 2018 corresponded to 1% of global energy consumption [3]. This makes the argument that there are cloud inefficiencies which need to be addressed by company executives in order for companies to adapt to a cloud based, sustainable future.
The company Afry, which this research is done in collaboration with, seeks to improve upon these environmental and budgetary issues for themselves. The company has experienced a quick and large-scale adoption of CC and now seeks to get a better understanding of its usage and spending. To achieve this, Afry is looking for new analytical tools.
Computational power is often overprovisioned to sustain workload during peak hours [4], which generates idle capacity outside peak hours. By optimizing the computational capacity occupied, and avoiding underutilization, it would be possible for a company to reduce both its cost and its ecological footprint.
One way to enhance companies' capabilities of making the necessary decisions in this area is to reduce the uncertainty about what CPU capacity will be required in the near future. In turn, this means that the company must be able to reliably make accurate predictions about load surges and ebbs. In this paper, the focus will be on CPU utilization in databases hosting internal company applications used by employees at Afry.
Previous research on predicting CPU utilization in the cloud has shown that standard recurrent neural networks (RNN) are able to make accurate predictions up to 15 minutes into the future [5]. However, there is demand for predictions up to as much as 30 minutes into the future; this would allow enough time to adjust CPU capacity preemptively.

An advanced version of a recurrent neural network is a model called long short-term memory (LSTM), which will be the basis of this research. The LSTM model has been proven to be state-of-the-art for time series predictions with its ability to adapt continuously combined with the ability to remember and differentiate between long and short term patterns.
Therefore, the aim of this paper is to understand how an LSTM performs compared with a standard recurrent neural network, and also to assess how far into the future an LSTM can predict. If the LSTM can make reliable predictions 30 minutes into the future, it could become a helpful tool for companies hoping to optimize their CC usage in order to reduce their spending and their ecological footprint.

A. Research questions
The research questions posed by this paper are the following:
1) Does a univariate LSTM produce more accurate single-step forecasts compared to a standard RNN?
2) What accuracy does a multistep LSTM retain when predicting 30 minutes into the future?

B. Paper disposition
The paper's disposition is the following. First, Section II aims to define, and give the reader a better understanding of, cloud computing and its use in corporate environments. Section III describes related work regarding forecasting within cloud computing. Section IV continues with an explanation of necessary theory regarding machine learning and neural networks. In Section V the methodology used in this paper is explained. The findings of this paper are presented in Section VI and followed up in Section VII with a discussion concerning these findings. Lastly, Section VIII summarizes the conclusions and Section IX provides ideas for future work.

II. BACKGROUND AND OBJECTIVES
This section first aims to give the reader necessary background regarding cloud computing and then to discuss what adopting CC means for companies in practice and what interest companies might have in this research.

A. Definition of cloud computing
Cloud computing is a collective name for the delivery of different types of computing services across the internet [6]. A commonly used definition for cloud computing is the NIST definition [7]:
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

B. Cloud computing market
Firstly, there are what are called private clouds. A private cloud's resources are exclusively used by the company or organization which either hosts it for themselves or pays a service provider to host it. Secondly, there are public clouds. A public cloud is owned and hosted by a third-party company known as a cloud service provider (CSP), which owns and manages all connected hardware, software and support infrastructure [6]. The CSPs are what constitutes the cloud computing market available to everyone.
The largest CSPs to date are Amazon Web Services (AWS) with 32%, Microsoft Azure with 19% and Google Cloud with 7% [8] market shares. The paper will be limited to these companies since they realize advantages only available at their economies of scale [4] and thus are able to fulfil what is required by the NIST definition.
For the purposes of this paper, the term cloud computing can be distilled further. There are three main categories within cloud computing, and the categories provide different levels of service and control respectively.
The highest level of service is Software as a Service (SaaS), which is a final product provided by the CSP. Examples of SaaS are Microsoft Outlook, Google Gmail or any other application that the user connects to via the internet. The CSP manages all the underlying software and infrastructure [6].
The intermediate level is Platform as a Service (PaaS). It provides an environment for software developers to develop, test, deliver and manage applications without worrying about the infrastructure [6].
Finally, there is Infrastructure as a Service (IaaS), which offers access to physical or virtual computers (several virtual computers can be hosted on the same physical machine) and data storage.
To keep the explanation simple, only one CSP is used as an example, since there might be smaller, but not important to this paper, differences between CSPs. For the most part, AWS's implementation of IaaS is used as an example since they are the largest cloud service provider.
The major CSPs provide the option to reserve an instance over a time period, in which case the customer pays for the reserved capacity regardless of usage but at a discounted price. If the customer knows it will no longer need all of its reserved capacity, it can sell the abundant capacity in a marketplace provided by AWS. However, the sell price is usually lower than what was paid for that capacity [9]. Microsoft Azure does not provide a marketplace, but customers can terminate reserved instances to get a refund, although there is a termination fee of 12 percent. In a perfect market there should be no price difference, which would allow selling abundant capacity at will and purchasing more capacity when required. A perfect market would thus make it easy to match capacity with demand, but the market is currently far from perfect. This is evident from the fact that the two largest CSPs together maintain an over 50% market share.

C. Cloud computing pricing
With AWS's on-demand instances the customer pays for every second (minimum 60 seconds) or hour (machines that do not run Linux get charged by the hour) of usage; this is what is known as pay-as-you-go.

Depending on hardware configuration there is a differing price for the usage amount, meaning that it is in the customer's interest to match the hardware capacity with the requirements to avoid paying unnecessary costs.
With on-demand instances it is possible to scale capacity up or down, and therefore it is possible to match capacity with fluctuating capacity requirements. The problem is that the adjustments are not guaranteed to be committed instantly; in the case of Microsoft Azure, scaling can take up to 30 minutes. To save cost it would be beneficial for the customer to reliably know – up to 30 minutes in advance – whether they should scale up, scale down or keep the current level.
There is a specific reason why this, and previous, research focuses on the CPU utilization part of CC. Here, the reason for choosing to analyze CPU – and not the other parts that together with CPU constitute cloud computing, such as bandwidth and memory – will be elaborated upon.
Pricing for both memory and bandwidth (data transfer) relies on bytes used. Although there is different underlying hardware optimized for different types of applications with varying pricing, storage pricing for both AWS and Azure is directly related to how many gigabytes are used on average per month [9] [10]. Storage can of course fluctuate over time, and thus affect the price per month, but it is usually not as volatile as CPU usage. The cost of bandwidth is also calculated based on bytes used each month, and AWS offers a discount for larger volumes [11] [10]. To save money on bandwidth it would likely be best for the customer to look at ways to reduce data transfer.
When it comes to CPU the issues are more complex. As mentioned, it is common for CSPs to use pay-as-you-go pricing. The customer is billed based on the amount of time the CPU was used in some way; it could be CPU cycles, seconds or hours. This may be interpreted as customers being charged the minimal amount for their computing needs, but that notion is somewhat misleading.
AWS and Google Cloud call it vCPU and Azure calls it vCore. The "v" stands for virtual, since customers do not have to worry about how the hardware infrastructure is implemented by the CSP. One vCore (or vCPU) essentially corresponds to a core on a physical CPU [12]. Consequently, more vCores roughly translates to more computing power.
The final cost that the customer gets billed is based in part on how much CPU time was consumed and in part on how many vCores were used during that time. This applies to all of AWS, Azure and Google Cloud [10] [13] [14]. In addition, the price varies depending on which generation of CPU the cores are a part of [15], but this will not affect this paper's research.
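To make the billing mechanics concrete, consider a purely hypothetical example (the rate is invented for illustration and is not an actual CSP price): an instance with 4 vCores running for 10 hours at 0.05 USD per vCore-hour would be billed 4 × 10 × 0.05 = 2 USD for CPU, while scaling the same workload down to 2 vCores for those hours would halve that part of the bill to 1 USD.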
This paper will focus on how companies can save money by scaling the number of vCores used to improve utilization. Part of the reason to use vCores is to allow scaling, so this feature should be taken advantage of [10].

D. Companies and cloud computing
More and more companies are switching over to public CC from private clouds and other types of on-premise computing [1]. There are many aspects associated with this change in how the companies operate.
First of all, cloud computing allows companies to focus on improving customer experience, increasing productivity and lowering cost, and, by enabling quicker time to market, it also helps in generating revenue.
Second of all, CC is helping companies during the digitization trends where users demand access to systems from anywhere and at any time. Also, the ability of CC to rapidly scale is well suited for companies building APIs, since the APIs might handle huge amounts of information, which increases the chances of unpredictable load.
Third and last, when switching to CC, a company can be seen as moving away from CapEx (capital expenditure) to OpEx (operational expenditure). Instead of paying for hardware, software and implementation, the costs are ongoing. This can be misinterpreted as reducing investment in IT while the costs of operating are increasing [1]. To avoid confusion the transition must be communicated clearly and might require new ways of analyzing IT.

E. Cost management
A company's cost management is crucial for its business to run efficiently. If costs exceed the profits it can have a detrimental effect on the future of the firm. In the worst case, poor cost management could even lead to bankruptcy. However, tracking and managing a firm's costs is not always easy. Therefore, companies should emphasise the importance of cost management and implement tools and metrics to help managers understand their actions.
Research has shown that computerised tools can be beneficial for managers in retaining that understanding [16]. As mentioned, companies moving their operations to CC also switch from CapEx to OpEx. Managers therefore have to reevaluate how their cost management is handled, which can be difficult with limited resources.
As was just suggested, computerized tools can be beneficial for understanding and managing cost. Microsoft Azure, for instance, has some tools built into its PaaS platform, which can be a great start for a new way to manage cost. However, managers should not be satisfied and stop at that point, because these tools may not be suited for that particular firm. Companies have therefore started to build tools to analyse costs on top of Microsoft Azure, tailored to the firm's own specific needs regarding cost management. This paper will provide an example of what such a tool could look like.

III. RELATED WORK
Prediction of workload in CC services has been researched extensively. Calheiros et al. [17] point out the importance of understanding the future demand to provide good quality of service, to avoid customer dissatisfaction and to avoid potentially losing customers. Therefore, predicting the workload is crucial to a company.

In their paper, Calheiros et al. implement an ARIMA (autoregressive integrated moving average) model, a variation of moving average, to make such a prediction. The outcome was shown to perform with an average accuracy of 91 percent.
Further research, by Cao et al. [18], has utilized the ensemble approach to predict multiple steps, from 5 up to 50 steps, where each step corresponds to 10 minutes. The models used for the ensemble were most similar pattern model, weighted nearest neighbor, weighted nearest neighbor model for differenced data, exponential smoothing and autoregression. Overall, the ensemble model showed more accurate results than the individual models. However, autoregression – which is a part of ARIMA in the paper of Calheiros et al. – performed the worst. Of course, the results are very much dependent on the data provided, and direct comparisons between the research can not be made in this context.
More recent research has instead applied machine learning models, where types of ARIMA models have been the benchmark for the research. The reason being, as the paper by Duggan et al. [5] suggests, that CPU utilization conforms to a non-stationary pattern which ARIMA models do not apprehend like machine learning models do.
Duggan et al. instead used a recurrent neural network to predict CPU utilization data from the CoMon project, a monitoring infrastructure for PlanetLab. The data was aggregated over multiple virtual machines, where each step was 5 minutes of CPU utilization. With their recurrent neural network, Duggan et al. were able to accurately predict 3 steps, or 15 minutes, into the future. The authors also emphasize the importance of the model predicting further than just one step ahead. The authors suggest that up to 20 to 30 minutes of prediction is suitable. By then, data centre management systems can reliably be informed of future demand and act accordingly in time.
Moving average was one of the models used as a baseline for comparison with the recurrent neural network. When predicting one step, the recurrent neural network was more accurate compared to moving average. Similarly, multi-step prediction with the recurrent neural network demonstrated better results than moving average for up to 3 steps.
As future work the authors proposed the LSTM model as a means for improvement, which has been applied by others [5]. The data which Nashold et al. [19] used to predict with LSTM has great similarities with the data used in this paper's research. Nashold et al. propose an alternative model to compare with the LSTM: instead of the simple moving average, the SARIMA (seasonal autoregressive integrated moving average) model. However, as this research also suggests, the results indicated LSTM to be superior in predicting time series data, even though much of the analysed data was indicated to be stationary.

IV. THEORY
This section attempts to present all the non-trivial theory necessary for understanding the methods used in this paper.

A. Time series prediction
The aim of this paper is to implement and compare different techniques for forecasting future CPU utilization. Forecasting is based on the assumption that previous occurrences influence the future. With this assumption, predictions can be made for future occurrences.
Mathematically, a function f of the observations at earlier time steps is used to forecast the value ŷ at time t. The choice of function f is crucial for making an accurate prediction. However, making such a choice requires care and proper analysis [20].

\hat{y}_t = f(x(t-1), x(t-2), x(t-3), \dots)   (1)

B. Machine learning
The models that are compared in this paper are both examples of machine learning. The area of machine learning is one of the most prominent subjects of research. Therefore, multiple fields within machine learning have emerged, but they all stem from the same set of rules. One of the most popular and accepted definitions of machine learning is by Tom Mitchell [21]: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." However, in recent times a more modern definition has been created by Ian Goodfellow. He defines machine learning in the following way: "Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions" [22].
The aim of a machine learning model is therefore to be able to predict the correct outcome with the help of some input, using statistics to estimate this function accurately. The function contains different types of parameters. There are hyper-parameters, which are set before training, and there are ordinary parameters, which are updated during training with the aim of minimizing or maximizing the cost function. The ordinary parameters are updated with the help of an optimizer function, which calculates the parameters that optimize the cost function.
Furthermore, one of the sub-fields of machine learning that recently has revolutionized the field is deep learning. The field is inspired by the human brain and its connecting neurons. By applying this philosophy to machines, the models are able to perform as well as the human brain and even exceed humans. However, the design of a brain and a neural network is far from similar. The machine equivalent of a neuron is called a node. A node is just a function which takes in data and outputs another value with the help of a non-linear activation function. Connecting multiple nodes together in layers creates a neural network, which has been proven to solve tasks which exceed humans' capacity.

These layers are called hidden layers, and how many layers are needed and how to connect them depends on the task. Therefore, multiple types of artificial neural networks have been created to serve different tasks, like recurrent neural networks (RNNs).

C. Recurrent neural networks

Figure 1. RNN architecture, folded to the left and unfolded to the right [23]

A recurrent neural network (RNN) is one of the machine learning models that will be used for predictions in this paper. RNNs work by recurrently using their own output to predict the next output with the help of the next input, as shown in Figure 1, thus creating a sort of memory of past data. Therefore, RNNs are well suited for sequenced data and time series prediction. Furthermore, there exist different definitions of standard RNNs. In this paper a standard RNN is defined as an Elman network with the equations of the RNN shown below [23].

h_t = \sigma(W_h x_t + U_h h_{t-1} + b_h)   (2)

y_t = \sigma(W_y h_t + b_y)   (3)
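As a small illustration of the Elman equations (2)-(3), the following sketch performs one forward step with random placeholder weights; it is not the trained network used later in this paper, only a demonstration of the recurrence.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x_t, h_prev, Wh, Uh, bh, Wy, by):
    """One step of an Elman RNN following equations (2)-(3)."""
    h_t = sigmoid(Wh @ x_t + Uh @ h_prev + bh)   # new hidden state (2)
    y_t = sigmoid(Wy @ h_t + by)                 # output (3)
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 1, 11, 1                 # toy sizes for illustration
Wh = rng.standard_normal((n_hidden, n_in))
Uh = rng.standard_normal((n_hidden, n_hidden))
Wy = rng.standard_normal((n_out, n_hidden))
bh, by = np.zeros(n_hidden), np.zeros(n_out)

h = np.zeros(n_hidden)
for x_t in np.array([[0.2], [0.4], [0.6]]):      # feed a short input sequence
    h, y = elman_step(x_t, h, Wh, Uh, bh, Wy, by)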
The most basic of RNNs have, however, a rather insufficient memory because of the vanishing gradient problem [24]. Therefore, other types of RNNs were created to account for the lack of memory, in order to be able to remember longer sequences and improve accuracy. One of these RNNs is called long short-term memory (LSTM), which this paper uses to compare with the standard RNN.
1) Long short-term memory: The LSTM architecture is based on three gates: the forget gate f_t (4), the input gate i_t (5) and the output gate o_t (6) [25]. An overall preview of the architecture is shown in Figure 2.

Figure 2. LSTM architecture with forget gate f_t, input gate i_t and output gate o_t [25]

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)   (4)

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)   (5)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)   (6)

The activation functions that are used in the LSTM cell are the logistic function (7) and the hyperbolic tangent (8). The logistic function outputs values between 0 and 1. In this context, 1 means a value that is remembered and 0 means a value that is disposed of. The hyperbolic tangent function is used to update the input vector with values between -1 and 1.

\sigma(x) = \frac{1}{1 + e^{-x}}   (7)

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}   (8)

The difference between the forget gate and the input gate is the candidate vector ĉ_t (9), which is used together with the input gate in the calculation of the cell state c_t (10), which in turn is used to update the hidden state h_t (11) together with the output gate vector.

\hat{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)   (9)

c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t   (10)

h_t = o_t \odot \tanh(c_t)   (11)

The hidden state vector and the cell state vector are then fed back into the LSTM to account for the recurrent feature of the network. Therefore, with every iteration the hidden state and the cell state are updated to fit the data accordingly, while the weights and biases stay the same until updated with back-propagation through time (BPTT) [26].
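To make the gate equations (4)-(11) concrete, the following is a minimal NumPy sketch of a single LSTM cell step; the weight matrices and the input are randomly initialized placeholders rather than trained values.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following equations (4)-(11).
    W and b hold the weights and biases of the four gate computations."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate (4)
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate (5)
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate (6)
    c_hat = np.tanh(W["c"] @ z + b["c"])        # candidate cell state (9)
    c_t = f_t * c_prev + i_t * c_hat            # new cell state (10)
    h_t = o_t * np.tanh(c_t)                    # new hidden state (11)
    return h_t, c_t

# Toy dimensions: 1 input feature and 11 hidden units, as used later in this paper
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 11
W = {k: rng.standard_normal((n_hidden, n_hidden + n_in)) * 0.1 for k in "fioc"}
b = {k: np.zeros(n_hidden) for k in "fioc"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(np.array([0.5]), h, c, W, b)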
2) Optimizer Adam: The optimizer which is used to update the weights for each input sequence during training is Adam (adaptive moment estimation). The method uses an adaptive learning rate to update the parameters, meaning that the learning rate is adapted individually for each parameter. Adam achieves this by using estimates of both the first and the second moment of the gradient of the cost function, where m_t is the mean and v_t is the uncentered variance. β_1 and β_2 are hyper-parameters which are recommended to be initialized to 0.9 and 0.999 respectively.

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t   (12)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2   (13)

m_t and v_t are initialized to zeros in the beginning, which contributes to a bias towards zero. To counteract this bias, the corrected estimates in equations (14) and (15) are calculated and used when updating the weights with equation (16). When updating the weights, the authors of the Adam paper also recommend setting the hyper-parameters η (learning rate) and ε to 0.001 and 10^{-7} respectively, and these are rarely changed [27].

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}   (14)

\hat{v}_t = \frac{v_t}{1 - \beta_2^t}   (15)

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t   (16)
D. Machine learning implementation in practice
In practice, implementing a machine learning model to solve a problem can be a nuisance and demand persistence. Successfully finding an accurate model is not a trivial task. Having a systematic design framework when implementing a machine learning model can be a savior. Therefore, this paper adheres to the following common practices.
Andrew Ng proposes four key points to consider for a practitioner [28]:
• Determine your goals - what error metric you are going to use and what your target value is. The goals and error metrics should be driven by the task the machine learning model is intended to solve.
• Set up an end-to-end pipeline as soon as possible. Also make sure to find an appropriate estimation of the performance metrics.
• Measure the system accordingly to find performance bottlenecks in the model due to over-fitting, under-fitting, defects in data or software bugs.
• Frequently make incremental changes in regards to finding more data, adjusting hyper-parameters, or changing the model based on findings regarding performance bottlenecks.
With this design framework in mind, there are also three fundamental concepts within machine learning that should be elaborated upon.
1) Error measurements: There exists an abundance of standard error measurements. Which measurement eventually gets chosen should stem from the characteristics of the problem that the model is facing. The chosen error measurement can then be used to measure the model's performance. A better performing model will produce smaller errors. One measurement of a model's performance is the mean squared error (MSE) [22].

MSE = \frac{1}{m} \sum_{i=1}^{m} (\hat{y} - y)_i^2   (17)

Upon examination of the equation above, it is apparent that if the predictions, ŷ, are closer to the real values, y, the error becomes smaller. When squaring the difference between the prediction and the real value, three things happen. Firstly, it is ensured that all summed values are positive, to avoid positive and negative values cancelling each other and thus underestimating the error. Secondly, larger individual errors receive greater punishment, since their weight in the final error will be greater than the weight of their initial size. Thirdly, the dimension of the measured error unit changes. The m in the equation is the total number of predictions, and by dividing by m the mean squared error of all the predictions is obtained.
The fact that the error becomes smaller as predictions get closer to the real values – and even reaches 0 when predictions are completely accurate (although this is unlikely) – can be taken advantage of when training a machine learning model [22]. To improve the performance, the model simply has to reduce the error.
As mentioned, there are many different measures to choose from, and they should be chosen with regard to the problem at hand. To broaden the perspectives and help interpretation of results, a second error measurement was chosen as well: mean absolute error (MAE).

MAE = \frac{1}{m} \sum_{i=1}^{m} |(\hat{y} - y)_i|   (18)

Unlike MSE, the MAE is only a function of one characteristic instead of three. MAE can therefore be a more natural error measurement to interpret [29].
2) Training and testing: Furthermore, one should acknowledge the importance of how to go about testing the model's performance. It is not recommended to use all of the provided data when training the algorithm, because that would make it hard to measure the algorithm's ability to generalize. Therefore, it is recommended to split the data into a training set, a validation set and a test set with a common ratio of 8:1:1. The training set and validation set are used during training, where the validation set is used to alert for over-fitting of the training data. The test set is used to understand how the model performs on unseen data, and it is therefore attractive to adjust the model to find the lowest possible test error.
3) Baseline models: When a test error has been calculated, the question of what a decent error result is emerges. In practice, the error will never reach 0, so a practitioner has to find another way of benchmarking the test error [22]. What is most commonly used is a baseline model, which is not necessarily a machine learning model. For time series prediction there exist plenty of methods, but moving average and weighted moving average are two common ones. Moving average is simply calculated as

\hat{y}_{t+1} = \frac{1}{m} \sum_{i=t-m+1}^{t} y_i ,   (19)

where t+1 is the next time-step to be predicted and m is the total number of steps used in the average.
If the test error is worse than that of the baseline model, the practitioner should then use the design framework's fourth key point and change the model accordingly.
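To tie the error measures (17)-(18) and the baseline (19) together, here is a small, self-contained sketch that computes a moving-average forecast over a toy series and scores it with MSE and MAE; the series and window length are made-up illustration values, not the paper's dataset.

import numpy as np

def moving_average_forecast(series, m):
    """Predict y_{t+1} as the mean of the m latest observations (equation 19)."""
    preds = []
    for t in range(m, len(series)):
        preds.append(series[t - m:t].mean())
    return np.array(preds)

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)       # equation (17)

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))      # equation (18)

# Toy CPU-utilization-like series (percent) and a window of 5 steps
y = np.array([3.1, 2.8, 3.5, 4.0, 3.7, 5.2, 6.1, 5.8, 4.9, 4.2])
m = 5
y_hat = moving_average_forecast(y, m)
print("MSE:", mse(y_hat, y[m:]), "MAE:", mae(y_hat, y[m:]))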
E. One-way ANOVA
To ensure that results are faithful, an analysis of variance (ANOVA) is used. First, a null-hypothesis must be established. If the null-hypothesis were to be true, there is no significance to be found in the results. An alternative hypothesis, which is considered true if the null-hypothesis is rejected, must also be prepared.

An F-test is then conducted to calculate a probability (p-value) which indicates how likely the observed results would be if the null-hypothesis were true. If the probability is below a chosen significance level called α, five percent being a common threshold for α, the null-hypothesis can be rejected and the alternative hypothesis be considered as truthful [30].
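As described in Section V, this paper later uses the Python library Scipy for this test; a minimal sketch of such a one-way ANOVA on two sets of error measurements is shown below, with made-up example numbers in place of the real results.

from scipy import stats

# Hypothetical MSE results from repeated runs of two models (illustrative values only)
lstm_mse = [0.87, 0.88, 0.86, 0.88, 0.87]
rnn_mse = [0.92, 0.94, 0.91, 0.93, 0.92]

# One-way ANOVA: the F-test returns the F statistic and the p-value
f_stat, p_value = stats.f_oneway(lstm_mse, rnn_mse)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.5f} < {alpha}: reject the null-hypothesis")
else:
    print(f"p = {p_value:.5f} >= {alpha}: cannot reject the null-hypothesis")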
F. Central Processing Unit (CPU)
Since a general understanding of a CPU is beneficial for the full understanding of this paper, a simple explanation is provided.
The central processing unit is the primary component that processes instructions in a machine. These instructions vary from logical, arithmetic and control operations to I/O (input/output) operations. Essentially, the CPU is the brain of a machine. The data processed in the CPU is interpreted in the form of bytes, 1's and 0's. The particular process used to execute commands is called the instruction cycle. The instruction cycle is divided into three stages: fetch, decode and execute. First, some data is fetched in bytes from the program memory. The instructions are then decoded into signals which the internal components of the CPU can comprehend. The CPU can then start processing the instruction and make the executions in the form of actions [31].

V. METHODOLOGY
This section details how the study was conducted. First, the data used for the experiments is explained in detail. Then follows an overview of which models were used and how they were implemented.

A. Description of dataset
The data used to train and test the models was received from the engineering consultancy firm Afry (Afry is their brand name; the company is formally called ÅF Pöyry AB). Afry operates globally, but the lion's share of operations is within Europe, and in particular across Scandinavia. As a consequence, the behavior of the roughly 16000 employees will to a great extent be related to Scandinavian office hours, weekdays and holidays. This is apparent in Figure 3, which will be discussed in more detail.
The data reaches from March 21 to May 10, year 2021, and is from a database with company news connected to the intranet. Every time an employee logs into the intranet or visits the main page, a request is made to read the database so news can be displayed. Each read, write or delete operation requires CPU power, and more active users require more CPU power.
In Figure 3 a pattern is discernible. The first day in the dataset is a Sunday, which can be seen as low activity in the figure. During the weeks, higher activity can be seen at the start of the workdays with a decline throughout the day. During the weekends the activity is close to 0. The second week coincided with the Easter holiday from Friday to Monday, which is reflected as low activity for those days in the figure.
The data points in the dataset are the max CPU utilization percentage for each minute during the months that the dataset stretches over. A choice was made to use max CPU utilization instead of average CPU utilization to avoid under-provisioning. In total there are 58691 data points.

Figure 3. CPU percentage utilization from March 21 to May 10, 2021 (outliers removed in graph to improve readability)

B. Forecasting models used
Three different forecasting models are implemented:
1) LSTM,
2) Standard RNN,
3) Moving Average.
LSTM and standard RNN are both part of this paper's research, whereas moving average will serve as a baseline model for analysing results. Since LSTM is a type of RNN, the parameters that need to be chosen are identical for LSTM and standard RNN.
For the first research question, each model uses the 60 previous minutes as input to predict one step, or one minute, ahead. The second research question uses the same number of input steps but instead predicts several steps ahead. Both the LSTM and RNN have 1 feature as input (historical CPU), meaning they are univariate, and use 11 hidden units to produce 1 output. Adam is used as the optimizer for both LSTM and RNN, and the learning rate was 0.001.
Moving average also used 60 input steps, since it is fair to compare the models under the same conditions. If a different number of input steps were chosen, it would complicate comparing performance, since a model using less input for its predictions is an entirely different aspect that would need to be considered.

C. Procedure

In order to be able to train the LSTM and RNN models, the input data X was reshaped into the dimensions (number of occurrences, number of timesteps in the sequence, number of features). The dimensions of X in this paper were (73364, 60, 1). Each occurrence sequence X_t was then matched with the next value in the dataset as the output y_t. Furthermore, the datasets X and y were divided into 58691 data points (80%) for training, 7337 points (10%) for validation and 7337 points (10%) for testing. Outlier data points were observed in the training data but were not removed.
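A minimal sketch of this windowing and 8:1:1 split is shown below; the function and variable names are illustrative, and the random data stands in for the real CPU series.

import numpy as np

def make_sequences(series, n_steps=60):
    """Build (occurrences, timesteps, features) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        y.append(series[i + n_steps])
    X = np.array(X).reshape(-1, n_steps, 1)   # 1 feature: historical CPU
    return X, np.array(y)

series = np.random.rand(1000)                 # placeholder for the CPU series
X, y = make_sequences(series, n_steps=60)

# 8:1:1 split into training, validation and test sets
n = len(X)
n_train, n_val = int(0.8 * n), int(0.1 * n)
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]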
The experiments were conducted in two parts, one part for each research question.
1) First research question: The goal of the first part was to compare the results of the LSTM with the RNN to see which performs better one step ahead. Moving average was used as a naive algorithm baseline for comparison. If the models do not outperform the naive baseline, then they can be regarded as useless.
First, the neural networks were trained on the training dataset over 150 epochs. MSE is used as the loss function for both models. If the validation loss – the loss when the model tests itself on the validation dataset – increased for 3 consecutive epochs, then the training was stopped early to avoid over-fitting. The validation MSE error was then optimized by manually tuning the hyper-parameters accordingly.
All three models then used the test data – data which they had not seen before – to make predictions. The models repeatedly use 60 input steps to predict one step ahead over the whole test data set. Two measurements of error are used as the result of the models' performance: mean squared error (MSE) and mean absolute error (MAE). The error measurements are calculated using the predicted values compared with the actual values of the test data set.
The reasoning behind using two measurements of error is to get more insight into the type and magnitude of the errors made. MAE will be of the same dimension as the predicted value and thus easier to translate into what impact the error could have on a real world application. By comparing MSE with MAE it is also possible to get an indication of whether a few large errors were made (since they are given more weight in MSE) or if many smaller errors were made.
The whole procedure was done four times over to calculate a standard deviation of the error measurements. The presented error measurement results are the average error measurements of all four passes.
2) Second research question: The second part was conducted in a similar manner to the first part. The objective was to determine how many minutes into the future the LSTM is able to accurately predict, with a cap at 30 minutes. The same neural networks trained in part one were used. Again, 60 input steps were used to predict the subsequent values. To predict n minutes ahead, the model made n single-step predictions. Each predicted value was then seen by the model as the latest "true" value among the 60 previous values used as inputs for the next prediction, meaning it replaced the oldest value of the input values used in the previous prediction.
The errors were counted for the last and final step. It was decided to only count the final step since what "path" the model took to get to the final value is not important. Rather, the intended purpose is predicting what the CPU utilization will be in up to 30 minutes.
The results of the LSTM were compared to both RNN and moving average as baselines. Again, error measurements are used as the result of performance.
Like the previous research question, the procedure was carried out four times over to calculate a standard deviation of the error measurements. Again, the presented error measurement results are the average of all four passes.
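The recursive multi-step scheme just described can be sketched as follows; model is assumed to be a trained single-step Keras model like the one sketched in the next subsection, and the window length of 60 matches the procedure above.

import numpy as np

def predict_n_steps(model, last_window, n_steps=30):
    """Recursively predict n steps ahead by feeding each prediction back in.

    last_window: array of the 60 most recent observations, shape (60, 1).
    """
    window = last_window.copy()
    predictions = []
    for _ in range(n_steps):
        y_hat = model.predict(window[np.newaxis, :, :], verbose=0)[0, 0]
        predictions.append(y_hat)
        # Drop the oldest value and append the prediction as the latest "true" value
        window = np.concatenate([window[1:], [[y_hat]]], axis=0)
    return predictions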
D. Code implementation
When implementing the machine learning models, Python was used together with the libraries Tensorflow and Keras to build the model frameworks for RNN and LSTM. The Keras library, which sits on top of Tensorflow, can create a model with a few lines of code [32]. LSTM and RNN are, as mentioned, both recurrent neural networks, which makes the algorithm identical except for changing the model from LSTM to SimpleRNN in Keras and vice versa. Below, the algorithm in pseudo code is provided.

Algorithm 1 for LSTM/RNN model
Input: input data format: (number of occurrences, timesteps in sequence, features)
Output: trained model
Initialisation: Prepare training, validation and test data with ratio 8:1:1 for input (X) and output (y)
1: Create Sequential model
2: Add LSTM/RNN layer with number of units and input shape: (timesteps, features)
3: Add Dense layer with 1 unit with linear activation function
4: Compile model with mean squared error as loss function
5: Train model with the provided data
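A sketch of what Algorithm 1 could look like in Keras is given below; the layer size, the early-stopping patience of 3 epochs and the Adam learning rate follow the values reported in this paper, while the exact code used by the authors is not published, so this should be read as an illustrative reconstruction rather than the original implementation.

import numpy as np
from tensorflow import keras

def build_model(cell="LSTM", n_units=11, n_steps=60, n_features=1):
    """Sequential model per Algorithm 1; swap LSTM for SimpleRNN to get the standard RNN."""
    layer = keras.layers.LSTM if cell == "LSTM" else keras.layers.SimpleRNN
    model = keras.Sequential([
        layer(n_units, input_shape=(n_steps, n_features)),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model

# Placeholder data with the shapes used in the procedure above
X_train, y_train = np.random.rand(100, 60, 1), np.random.rand(100)
X_val, y_val = np.random.rand(20, 60, 1), np.random.rand(20)

model = build_model("LSTM")
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=150, callbacks=[early_stop], verbose=0)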
E. Hypothesis testing
To determine whether the differences in the calculated error metrics between LSTM and RNN were statistically significant, the one-way ANOVA test was used.
The null-hypothesis was formulated as "there is no significant difference between the means of the models". Consequently, the alternative hypothesis could be formulated as "there is a significant difference between the results of the models".
P-values were calculated using 5 occurrences from each model. The Python library Scipy was used for these calculations. The threshold α was decided to be five percent. If the resulting p-values are below 0.05, the null-hypothesis can be rejected.

VI. RESULT
This section provides the obtained results of the experiments that were conducted. First, the findings from single-step prediction of CPU utilization are presented, followed by the multi-step prediction.

A. Single-step prediction
Table I shows the accuracy of the three models used in this experiment. Both mean squared error (MSE) and mean absolute error (MAE) were used to better comprehend the accuracy of the models. Looking at Table I, in all three cases MSE and MAE correlated with each other, such that if one of the two was higher, the other one was higher as well. This perhaps suggests that none of the models were too volatile in their predictions, with varying large and small errors, but instead somewhat consistent.
For moving average the deviations were non-existent, which is expected due to the nature of the model simply using an average each pass. For both RNN and LSTM the deviations were just a small fraction of the respective error results.
Moving average had the highest MSE and MAE and LSTM had the lowest. As expected, the differences in MSE between the algorithms were much larger than the differences in MAE.

Table I
SINGLE-STEP: AVERAGE ACCURACY ON TEST DATA
Algorithm        MSE (st.dev)      MAE (st.dev)
Moving Average   1.1663 (0.0000)   0.7373 (0.0000)
RNN              0.9199 (0.0193)   0.6782 (0.0058)
LSTM             0.8755 (0.0070)   0.6643 (0.0018)

Furthermore, by visually inspecting the predictions in Figure 4 it is apparent that all three methods have made fairly accurate predictions and follow the pattern quite well. A few decimals of difference in MSE and MAE do not seem to be sufficient to completely disqualify any of the methods used. However, while that statement holds true, the moving average seems to predict prematurely compared to the real data, which in turn impairs the accuracy of the prediction.
When observing the RNN and the LSTM, both methods converge with a similar pattern with a few distinctions. LSTM seems more adept at predicting the peaks; nevertheless, it also often falls short here.

Figure 4. Visualized single-step prediction of 30 percent of test data

While predicting one step (one minute) at a time has some relevancy for comparing models, its practical applications are less apparent. In practice the multi-step prediction would be much more useful.

B. Multi-step prediction
In the case of the multi-step predictions, the deviation from the real data for the RNN and LSTM is larger than for the previous single-step predictions. Looking at Table II, both MAE and MSE are larger, regardless of it being 10, 20 or 30 steps. Looking at the results for 10 steps ahead, RNN has lower MSE and MAE than moving average had for single-step prediction. Furthermore, LSTM still retains relatively better accuracy compared to moving average and RNN, and it still performs well for 30 steps. Comparing MSE and MAE between RNN and LSTM, the errors for LSTM at 20 steps are lower than the errors at 10 steps for RNN.
Like the first research question, the standard deviations for both RNN and LSTM were small compared to the errors themselves.

Table II
MULTI-STEP: AVERAGE ACCURACY ON TEST DATA
Number of steps   Algorithm   MSE (st.dev)      MAE (st.dev)
10 steps          RNN         1.0768 (0.0432)   0.7273 (0.0128)
                  LSTM        0.9353 (0.0161)   0.6848 (0.0045)
20 steps          RNN         1.3114 (0.0491)   0.7933 (0.0135)
                  LSTM        1.0225 (0.0233)   0.7113 (0.0069)
30 steps          RNN         1.6145 (0.0336)   0.8650 (0.0122)
                  LSTM        1.1247 (0.0349)   0.7439 (0.0098)

Compared with Figure 4, both predictions in Figure 5 are less accurate and less smooth as a result. The most drastic change is the RNN prediction, where the model shows piecewise linear patterns which do not correspond to the real data whatsoever. LSTM behaved better overall, which was confirmed by the error metrics in Table II. Where both models struggled the most was still during drastic changes of CPU capacity.

Figure 5. Visualized 30-step predictions of 30 percent of test data

With further investigation regarding performance during CPU capacity peaks, Figure 6 confirms earlier observations. During low workload, both the RNN and the LSTM performed significantly better than during high workloads. Comparing the models, LSTM is still more adaptable during these high peaks. However, the MAE is still largely affected by the deviations during high peaks. During low workload, the MAE is around 0.5, which is lower than the aggregated single-step prediction

by around 0.18. During the peaks the errors of LSTM go up to between 1.5 and 2.0, while the RNN is drastically above that, at 2.0 to 3.5. The range of the MAE is therefore drastically larger for RNN compared to LSTM.

Figure 6. MAE calculated over 50 data points comparing RNN and LSTM for 30 predicted steps

C. Statistical significance of results
In regards to the hypothesis test for the calculated error metrics between RNN and LSTM, Table III shows that both MSE and MAE are statistically significant. The p-values for single-step and 30 steps are both drastically below the chosen threshold for α, which was five percent. This result shows that the differences are not due to chance. Comparing the single-step result with the multi-step result shows that the further into the future the models predict, the smaller the p-values seem to become.

Table III
ANOVA TEST: P-VALUE FOR ERROR METRICS
Number of steps   MSE p-value    MAE p-value
Single step       0.00129        0.00091
30 steps          1.55916e-08    1.27780e-07

VII. DISCUSSION
First of all, the results of the ANOVA test show that the results are trustworthy and not due to chance. Therefore further discussion is justified. The results themselves seem to suggest that LSTM is an improvement over the standard RNN. Both LSTM and RNN also appear to outperform the baseline, moving average, which is an important aspect to consider. There are some considerations that have to be kept in mind, however. These considerations will be discussed after a more thorough analysis of the results.

A. Result takeaways
1) First research question: The aim of the first research question was to see if LSTM could outperform the RNN, since RNN had seen success in previous work. When tested on previously unseen data, LSTM obtained better results than both RNN and the baseline moving average, with regard to the statistically significant MAE and MSE. The noteworthy results indicate that the LSTM model is superior in retaining information from previously seen data.
Comparing both error metrics, MSE was larger than MAE, but less so for LSTM. This indicates that LSTM does not make relatively big errors that are heavily punished by MSE, but rather makes many small errors. This could be an advantage as it might instill confidence in the model's predictions. Although it will not be completely accurate, the model is reliable in such a way that it will usually be near the truth and not completely wrong. The standard deviations were negligible when compared to the errors themselves, thus indicating that the results are trustworthy and not due to mere chance.
One predicament, which can be seen in Figure 4, is that the model consistently underestimated the volatility of the peaks. Visually, the moving average is seemingly better at conforming to the largest peak. However, this is not aligned with the error metrics calculated. One reason for this can be that the moving average, by its inherent algorithm, handles almost linear data structures well. Therefore, the moving average can be interpreted as decent where the data is somewhat stationary, but it does not perform during non-stationary intervals, which the LSTM is much better suited for.
2) Second research question: The purpose of the second research question was to see if the model could actually be useful in practice. For it to be useful, it would require an ability to predict up to 30 minutes into the future. The reason being that, currently, it is not possible to adjust capacity with just a minute's notice.
As expected, predicting multiple steps ahead was associated with greater errors compared to the single-step predictions, while still being statistically significant. Interestingly though, MSE increased significantly more than MAE, and the increase seems to correlate with how many steps were predicted. The simple reason for this pattern could be that previously predicted steps are used to predict the next step. Therefore, the next step inherits the other steps' deviations from the original data.
One aspect of the result of the multi-step prediction is that the LSTM errors for 20 steps are lower than those of the RNN at 10 steps, suggesting that the LSTM can predict 10 minutes further than the RNN. A result which mostly can be attributed to how the models predict during peaks.
With Figure 6 clarifying the structure of the errors, the largest MAE errors can be seen to arise during the peaks of workload. An algorithm that can diminish these errors during large changes in workload could perform even better than this LSTM. Regardless, with 30 steps predicted, the LSTM only had an MAE error of around 0.74. In simple terms, the LSTM deviated on average by 0.74 percent of the true CPU capacity. An error which may not negatively affect the uncertainty too much in managers' decision making.

B. Applicability for businesses

Since companies will incorporate cloud based solutions in their businesses, tools to help guide the managers to make better decisions will be important. The reason is the complexity of data and cloud management that companies have to account for. Therefore, making some tasks automated and liberating managers to focus on more difficult tasks at hand can indeed be beneficial.
One task that could be automated with accurate predictive tools is regulating the cloud capacity, making sure that the business is not overpaying for the cloud providers' services while still operating at an optimal level. Predictive models such as the LSTM can therefore be a great tool to have in a company's toolbox. To excel even more, the model should be integrated with additional software to be able to change the number of vCores needed during a particular time. This will indeed demand more ingenious work to handle. However, if the benefits transcend the drawbacks, the question of acting upon an implementation should be an obvious choice.
While the incentive for creating an optimizing tool usually is budgetary, managers should acknowledge the environmental aspect of using excessive computing power in their businesses. Cloud waste can to some extent be minimized by the provider, but the best understanding of one's business is retained by the ones who operate that business. This makes the argument that cloud waste is not only an issue that has to be solved on the host end of the cloud, but rather through co-operation by the involved parties. A step in the right direction towards achieving zero cloud waste could be implementing a tool incorporating a model similar to the LSTM, saving both managers' time and effort in the process while reducing unnecessary cost and keeping the ecological footprint at a minimum.

C. Other considerations
When analysing results there are a few points that are believed to be very important to bear in mind. In total, six considerations are presented.
Firstly, very little data (given the context) was used to train the model. The model still seems to perform well, but the results can likely be improved upon. In the real world one way to improve performance is to collect more data. However, the collection of data must be contrasted with the expected benefit from improved results [22].
In the context of this paper's research, collecting more data requires little effort but a lot of time and patience; Microsoft Azure only stores historical minute-wise data for the past month. What is believed to be more important in this context is that having a model which performs well with as little training data as possible could in itself be an advantage.
Secondly, this paper only uses a univariate LSTM. It is very likely that a multivariate model (a model which uses more than one feature) could produce better results. Examples of features to consider are: hour of the day, day of the week, weekend or workday, holiday or not, and so on. An attempt was made to use a multivariate LSTM but it was abandoned at an early stage due to bad performance, which is believed to be due to the curse of dimensionality [22]. The curse of dimensionality entails the need for more data, which could be seen as a disadvantage when evaluating the model holistically.
Thirdly, the data was not very volatile. It followed a somewhat clear pattern over the weeks, and in addition the CPU utilization percentage remained low even during the peaks. It is difficult to say how the model would fare if there was a bigger spread between the peaks and troughs.
Fourthly, when evaluating the model on the testing data, the data still came from the same company. Thus it is not known how well the trained model generalizes to other companies. If it is possible to develop a model that generalizes between similar companies, it would be advantageous. This would enable companies without much data of their own, or without the computational capacity to train a model on their existing data, to instead implement a pre-trained model, perhaps eliminating the problems that were mentioned as being associated with collecting more data.
Fifthly, what is considered a good result is very much subjective. When comparing the models it is reasonable to say that the LSTM achieved good results if it outperformed both the RNN and the baseline moving average. In the eyes of a manager, however, it is difficult to say what is regarded as a good enough result. A good enough result could be both higher or lower, and it could differ between managers.
Lastly, if a company were to apply this, there is an important problem that has to be solved for it to be practical. The problem stems from the fact that the measurement used is CPU utilization percentage. If the company adjusts the number of vCores, meaning increased or decreased computational power, the model has no understanding of how that affects the measurement. To solve this, the values must somehow be translated to a different measurement of computational power, which might require collaborating with the cloud service providers.

VIII. CONCLUSION
When embarking upon this research topic, the aim was to increase the ability to predict CPU utilization of a cloud server by using an LSTM. Previous research had shown that RNN performs well with this type of prediction, and this paper shows that LSTM outperforms the RNN, although on a somewhat different dataset with a clearer pattern. With little training data the LSTM was able to obtain results with a high degree of accuracy for one-step predictions as well as up to 30 steps.
Even if a few uncertainties remain that need to be solved before this paper's findings can see practical use without afterthought, the findings should at least serve as a proof of concept. This paper has shown that:
1) LSTM outperforms a standard RNN when predicting CPU utilization,
2) LSTM can predict CPU utilization up to 30 steps (30 minutes) while retaining a high degree of accuracy.

IX. FUTURE WORK
As mentioned, several uncertainties remain which could be investigated in future work by other researchers. These suggestions are:
1) Train an LSTM model with more data and with data of a different pattern,
12
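One conceivable way to handle this, assuming the number of provisioned vCores is logged alongside the utilization metric, is to rescale the measured percentage into an absolute load (vCores actually in use) before training, and only convert predictions back into a percentage of whatever capacity is provisioned at prediction time. The sketch below only illustrates that idea; the helper names and the example numbers are hypothetical and were not part of this study.

```python
import pandas as pd

def to_absolute_load(cpu_percent: pd.Series, vcores: pd.Series) -> pd.Series:
    """Translate CPU utilization percentage into the number of vCores in use.

    Hypothetical helper: assumes both series share the same minute-wise index
    and that `vcores` is the capacity provisioned at each time step.
    """
    return cpu_percent / 100.0 * vcores

def to_percent(absolute_load: pd.Series, vcores_now: float) -> pd.Series:
    """Convert a predicted absolute load back into a utilization percentage
    for the capacity that is currently provisioned."""
    return absolute_load / vcores_now * 100.0

# Made-up example: the server is scaled from 4 to 8 vCores halfway through.
# Trained on the absolute load, a model does not mistake the scaling event
# for a sudden drop in demand.
df = pd.DataFrame({
    "cpu_percent": [60.0, 80.0, 40.0, 42.0],
    "vcores":      [4,    4,    8,    8],
})
df["absolute_load"] = to_absolute_load(df["cpu_percent"], df["vcores"])
print(df)
```

The weakness of such a rescaling is that it treats every vCore as delivering the same amount of compute, which may not hold across instance types; this is where collaboration with the cloud service providers could still be needed.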
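The sketch below, referenced in the second consideration above, shows how simple calendar features could be derived from a minute-wise CPU series for a multivariate model. It is an assumption-laden illustration: the feature set, the helper name and the holiday list are hypothetical and were not used in the experiments of this paper.

```python
import pandas as pd

def add_calendar_features(cpu: pd.Series, holidays=frozenset()) -> pd.DataFrame:
    """Build a small multivariate feature frame from a minute-wise CPU series.

    Hypothetical helper: the feature set and the holiday list are assumptions,
    not the configuration used in this paper.
    """
    df = pd.DataFrame({"cpu_percent": cpu})
    idx = df.index  # expected to be a DatetimeIndex with minute resolution
    df["hour_of_day"] = idx.hour                         # 0-23
    df["day_of_week"] = idx.dayofweek                    # 0 = Monday
    df["is_weekend"] = (idx.dayofweek >= 5).astype(int)
    df["is_holiday"] = idx.strftime("%Y-%m-%d").isin(holidays).astype(int)
    return df

# Example with one day of synthetic minute-wise data.
index = pd.date_range("2021-05-01", periods=24 * 60, freq="min")
cpu = pd.Series(50.0, index=index)
features = add_calendar_features(cpu, holidays={"2021-05-01"})
print(features.head())
```

If such features were used, the hour of the day could additionally be encoded cyclically (for example with sine and cosine components) so that 23:59 and 00:00 are treated as adjacent rather than far apart.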
VIII. CONCLUSION

When embarking upon this research topic, the aim was to improve the ability to predict the CPU utilization of a cloud server by using an LSTM. Previous research had shown that an RNN performs well on this type of prediction, and this paper shows that the LSTM outperforms the RNN, although on a somewhat different dataset with a clearer pattern. With little training data, the LSTM was able to obtain results with a high degree of accuracy for one-step predictions as well as for predictions up to 30 steps ahead. Even if a few uncertainties remain that need to be resolved before this paper's findings can see practical use without afterthought, the findings should at least serve as a proof of concept. This paper has shown that:

1) LSTM outperforms a standard RNN when predicting CPU utilization,
2) LSTM can predict CPU utilization up to 30 steps (30 minutes) ahead while retaining a high degree of accuracy.

IX. FUTURE WORK

As mentioned, several uncertainties remain which could be investigated in future work by other researchers. These suggestions are:

1) Train an LSTM model with more data and with data of a different pattern,
2) Test a multivariate LSTM model,
3) Evaluate how well a pre-trained model generalizes to a separate but similar company or application,
4) Research further how to apply the model in practice and what the practical implications are of a model that can accurately predict CPU utilization on a cloud server.

ACKNOWLEDGMENT

Foremost, we would like to express our sincerest gratitude to the organizations and people who gave us invaluable support in our research.

A sincere thank you to Kemal Karahmetovic at Afry for always being available to help us when necessary.

We also wish to show our gratitude to the other employees at Afry who helped us on at least one occasion. A big thank you to Patrik Sjölin, Emelie Hedqvist and Peter Wallberg.

We want to thank our supervisors at KTH Royal Institute of Technology. Thank you Jonas Beskow and Mattias Wiggberg for valuable feedback during our research.

REFERENCES

[1] J. Ward, "The rise and rise of Cloud Computing", ey.com, 2019. [Online]. Available: https://www.ey.com/en_ie/technology/the-rise-and-rise-of-cloud-computing.
[2] "Making the cloud pay: How industrial companies can accelerate impact from the cloud", mckinsey.com, 2020. [Online]. Available: https://www.mckinsey.com/industries/advanced-electronics/our-insights/making-the-cloud-pay-how-industrial-companies-can-accelerate-impact-from-the-cloud#.
[3] S. Lohr, "Cloud Computing Is Not the Energy Hog That Had Been Feared (Published 2020)", Nytimes.com, 2020. [Online]. Available: https://www.nytimes.com/2020/02/27/technology/cloud-computing-energy-usage.html.
[4] M. Armbrust et al., "A view of cloud computing", dl.acm.org, 2021. [Online]. Available: https://dl.acm.org/doi/fullHtml/10.1145/1721654.1721672#T1.
[5] M. Duggan, K. Mason, J. Duggan, E. Howley and E. Barrett, "Predicting Host CPU Utilization in Cloud Computing using Recurrent Neural Networks", 2017, doi: 10.23919/ICITST.2017.8356348.
[6] "What Is Cloud Computing? A Beginner's Guide — Microsoft Azure", Azure.microsoft.com, 2021. [Online]. Available: https://azure.microsoft.com/en-us/overview/what-is-cloud-computing/.
[7] P. Mell and T. Grance, "The NIST definition of Cloud Computing", nist.gov, 2011. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf.
[8] "AWS leads $42 bn global Cloud services market in Q1, Microsoft follows", Business-standard.com, 2021. [Online]. Available: https://www.business-standard.com/article/technology/aws-leads-42-bn-global-cloud-services-market-in-q1-microsoft-follows-121050200180_1.html.
[9] "Amazon EC2 Reserved Instances", Amazon Web Services, Inc., 2021. [Online]. Available: https://aws.amazon.com/ec2/pricing/reserved-instances/.
[10] "Pricing - Azure SQL Database Single Database — Microsoft Azure", Azure.microsoft.com, 2021. [Online]. Available: https://azure.microsoft.com/en-us/pricing/details/azure-sql-database/single/#pricing.
[11] "Amazon S3 Simple Storage Service Pricing - Amazon Web Services", Amazon Web Services, Inc., 2021. [Online]. Available: https://aws.amazon.com/s3/pricing/.
[12] "Knowledge center — Microsoft Azure", Azure.microsoft.com, 2021. [Online]. Available: https://azure.microsoft.com/en-us/resources/knowledge-center/what-is-a-vcore/.
[13] "Optimize CPU options - Amazon Elastic Compute Cloud", Docs.aws.amazon.com, 2021. [Online]. Available: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-optimize-cpu.html.
[14] "Machine types — Compute Engine Documentation — Google Cloud", Google Cloud, 2021. [Online]. Available: https://cloud.google.com/compute/docs/machine-types.
[15] S. Stein, "Översikt över köpmodell för virtuella kärnor - Azure SQL Database & Azure SQL Managed Instance", Docs.microsoft.com, 2021. [Online]. Available: https://docs.microsoft.com/sv-se/azure/azure-sql/database/service-tiers-vcore?tabs=azure-portal.
[16] Z. Mansor, R. Razali, J. Yahaya, S. Yahya and N. Arshad, "Issues and Challenges of Cost Management in Agile Software Development Projects", Advanced Science Letters, vol. 22, pp. 1981-1984, 2016, doi: 10.1166/asl.2016.7752.
[17] R. N. Calheiros, E. Masoumi, R. Ranjan and R. Buyya, "Workload Prediction Using ARIMA Model and Its Impact on Cloud Applications' QoS", IEEE Transactions on Cloud Computing, vol. 3, no. 4, pp. 449-458, Oct.-Dec. 2015, doi: 10.1109/TCC.2014.2350475.
[18] J. Cao, J. Fu, M. Li and J. Chen, "CPU load prediction for cloud environment based on a dynamic ensemble model", Software: Practice and Experience, vol. 44, pp. 793-804, 2014, doi: 10.1002/spe.2231.
[19] L. Nashold and R. Krishnan, "Using LSTM and SARIMA Models to Forecast Cluster CPU Usage", arXiv.org, 2021. [Online]. Available: https://arxiv.org/abs/2007.08092.
[20] J. D. Hamilton, Time Series Analysis. Princeton University Press, 1994.
[21] T. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[22] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning. MIT Press, 2016.
[23] J. L. Elman, "Finding Structure in Time", Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990, doi: 10.1016/0364-0213(90)90002-E.
[24] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116, 1998.
[25] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997, doi: 10.1162/neco.1997.9.8.1735.
[26] P. J. Werbos, "Backpropagation through time: what it does and how to do it", Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, 1990.
[27] D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization", arXiv.org, 2021. [Online]. Available: https://arxiv.org/abs/1412.6980.
[28] A. Ng, Machine Learning Yearning, 1st ed. 2018.
[29] C. Willmott and K. Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance", Int-res.com, 2021. [Online]. Available: https://www.int-res.com/articles/cr2005/30/c030p079.pdf.
[30] T. Abenius, "Envägs variansanalys (ANOVA) för test av olika väntevärde i flera grupper", Math.chalmers.se, 2021. [Online]. Available: http://www.math.chalmers.se/Stat/Grundutb/CTH/lma136/1112/25-ANOVA.pdf.
[31] D. Kuck, Computers and Computations, vol. 1. John Wiley & Sons, Inc., 1978, p. 12. ISBN 978-0471027164.
[32] "Keras documentation: About Keras", Keras.io, 2021. [Online]. Available: https://keras.io/about/.

AUTHORS

Axel Rooth is currently studying Industrial Engineering and Management and is about to pursue a Master's degree in machine learning at KTH Royal Institute of Technology in Stockholm, Sweden. He contributed to all parts of the research paper and was mainly responsible for implementing the algorithms.
"Hacking life since 98"

Filip Nääs Starberg is currently studying Industrial Engineering and Management and is about to pursue a Master's degree in machine learning at KTH Royal Institute of Technology in Stockholm, Sweden. He is a good person and wants a career where he can earn a living. He contributed to all parts of the research paper but was mainly responsible for researching the cloud computing market.
"Earning cred since 98"
TRITA-EECS-EX-2021:366

www.kth.se