
Predicting file lifetimes for data placement in multi-tiered storage systems for HPC

Luis Thomas (luis.thomas@ensta-bretagne.org), ENSTA Bretagne, Lab-STICC, CNRS, UMR 6285, Brest, France
Sebastien Gougeaud (sebastien.gougeaud@cea.fr), CEA, Bruyères-le-Châtel, France
Stéphane Rubini (stephane.rubini@univ-brest.fr), Univ. Brest, Lab-STICC, CNRS, UMR 6285, Brest, France
Philippe Deniel (philippe.deniel@cea.fr), CEA, Bruyères-le-Châtel, France
Jalil Boukhobza (jalil.boukhobza@ensta-bretagne.fr), ENSTA Bretagne, Lab-STICC, CNRS, UMR 6285, Brest, France
Abstract

The emergence of Exascale machines in HPC will have the foreseen consequence of putting more pressure on the storage systems in place, not only in terms of capacity but also of bandwidth and latency. With a limited budget we cannot imagine using only storage class memory, which leads to the use of a heterogeneous tiered storage hierarchy. In order to make the most efficient use of the high performance tier in this storage hierarchy, we need to be able to place user data on the right tier at the right time. In this paper, we assume a 2-tier storage hierarchy with a high performance tier and a high capacity archival tier. Files are placed on the high performance tier at creation time and moved to the capacity tier once their lifetime expires (that is, once they are no longer accessed). The main contribution of this paper lies in the design of a file lifetime prediction model based solely on the file's path, using a Convolutional Neural Network. Results show that our solution strikes a good trade-off between accuracy and under-estimation. Compared to previous work, our model reaches a comparable accuracy (around 98.60% versus 98.84%) while reducing under-estimations by almost 10x, down to 2.21% (compared to 21.86%). The reduction in under-estimations is crucial as it avoids misplacing files in the capacity tier while they are still in use.

CCS Concepts: • Information systems → Information storage systems; Storage management; Hierarchical storage management; Information lifecycle management.

Keywords: Data placement, Multi-Tier Storage, File lifetime, Convolutional Neural Network, Machine Learning, High Performance Computing, Heterogeneous Storage, Storage Hierarchy

ACM Reference Format:
Luis Thomas, Sebastien Gougeaud, Stéphane Rubini, Philippe Deniel, and Jalil Boukhobza. 2021. Predicting file lifetimes for data placement in multi-tiered storage systems for HPC. In Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '21), April 26, 2021, Online, United Kingdom. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3439839.3458733

1 Introduction

The amount of data generated in the world seems to grow exponentially, from 4 zettabytes in 2013 to an estimated 185 zettabytes in 2025 [4]. This trend will continue as new supercomputers are already reaching the exaflop in AI-driven workloads [37]. Storage has always been a performance bottleneck in High Performance Computing (HPC) [28][36][20]. In order to cope with the amount of data that Exascale machines will produce, storage systems need to improve both in terms of capacity and speed (bandwidth and latency) [20]. Since no single storage technology is capable of fulfilling both those roles at a reasonable cost [13], we need to use a hierarchy of different storage technologies. The combined use of Flash-based drives, magnetic drives (HDD) and tapes is standard [24] in HPC systems. Memory class storage such as 3D XPoint is expected to join the aforementioned hierarchy in production to partially fill the gap between DRAM and Flash [17][5].

The management of this hierarchy is a complex endeavour. Parallel distributed file systems such as Lustre [12] are used to manage the vast number of disks and the data across those disks. The standard architecture used in HPC is a three-tier hierarchy [19] composed of clusters of Flash-based drives, magnetic drives and tapes, each grouped together in storage
nodes and managed by a parallel distributed file system or an object store such as DAOS [22]. Because the storage capacity available on the high performance tier is often an order of magnitude lower than that of the other tiers [24], the fastest tier of this hierarchy should only contain a subset of files, for the time period during which they require such performance.

In our work, we used a hierarchy of two tiers: a high performance tier and a capacity tier used for archival purposes. The aforementioned issue of data placement over the storage tiers can be translated into two main research questions: 1) when to store files in the performance tier and 2) when to move them to the capacity tier. How data should be moved/scheduled from one tier to another is out of the scope of this paper, as is the prefetching issue.

Several state-of-the-art solutions investigated the data placement of objects or files in high performance tiers [32] [35] [34]. They were mainly designed for hybrid storage systems composed of flash-based disks and magnetic drives. A large proportion of those studies, such as [23] [3], relied on continuous system monitoring of I/O patterns [27] to infer the moment at which data should be placed in such a tier and when they should be evicted. While those strategies performed well in terms of I/O performance, the exhaustive monitoring used makes them hardly usable in large production systems [15] [2].

In [25], the authors relied solely on file names to infer the time duration during which files are accessed, thus drastically reducing the monitoring cost. In fact, they observed that in some CEA (French Atomic Energy Commission) [10][6] supercomputers [7] [8], 45.80% of the files were no longer accessed 10 seconds after their creation [25]. They defined the duration between the last access to a file and its creation time as the file lifetime. They used machine learning approaches to predict file lifetimes based only on their paths. Paths were chosen as input for two reasons: (1) the path of the file is known at creation time and (2) it often contains metadata about the user and/or the application that created it. With this prediction in hand, they placed each file in the high performance tier once created and moved it to the capacity tier when the file lifetime expired.

The performance of the previous solution was measured according to two metrics: (1) the accuracy, defined as the percentage of lifetime predictions that are within an order of magnitude of the real value; (2) the under-estimation, defined as the percentage of predictions that are below the real lifetime value. The latter is particularly important as an underestimated file lifetime would result in a premature file eviction leading to a performance degradation. In this previous work, the authors used two Convolutional Neural Networks (CNN): (1) a CNN trained for accuracy and (2) another trained to produce the least amount of under-estimations, using different cost functions. Then trade-offs were found.

In this paper, like previous work [25], we use the access path of a file to predict when a file will last be accessed. While previous work used two Convolutional Neural Networks with two different loss functions, we propose to use a single Convolutional Neural Network with a custom loss function, in order to reduce the computing power necessary for a prediction. To achieve this, we designed the custom loss function to optimize the trade-off between accuracy and under-estimations. The intuition behind its design stems from multi-task learning [16]: we used the weighted average of the two loss functions used in previous work [25]. Our objective is to increase the accuracy of our predictions while maintaining a low level of under-estimations. We also used early stopping to prevent overfitting.

Using a dataset of 5 million entries from a production system at CEA, and through several experiments, we empirically found that the best weights/ratios between the two loss functions were around 0.2 for the loss function yielding high accuracy and 0.8 for the loss function yielding low under-estimations. With this setting we were able to train our network to reach 98.60% accuracy while keeping under-estimations under 2.21%. In comparison, previous work achieved 98.84% accuracy and 21.86% under-estimations using their highly accurate loss function, while the other loss function yielded 90.91% accuracy and 0.88% under-estimations. Our solution is also more flexible, since we can tune the weights of the loss function to either favor high accuracy or low under-estimations.

We first go over the background of this study by describing the storage system used at CEA along with the software used to manage the data life cycle. Then we describe the solution we designed, and evaluate and analyze our results. We then go over the related work and finally conclude this paper with our final analysis and future work.

2 Background

This section introduces the overall system architecture considered, sketched in Fig. 1. The first subsection describes the HPC multi-tiered storage system architecture, represented by the two storage servers in the top left corner of Fig. 1 (high performance and high capacity tiers). The second subsection introduces the Robinhood Policy Engine, represented on the right side of Fig. 1; Robinhood is a software that provides fast access to file metadata in HPC storage systems.
2.1 The Storage Hierarchy in HPC

Scientific applications such as weather simulations, genomic sequencing or geoscience necessitate petabytes of storage for datasets and results. These data are archived and keep growing larger. Caching and prefetching strategies were put in place in order to speed up the process of reading and writing to storage, which is several orders of magnitude slower than main memory (DRAM), itself orders of magnitude slower than caches and CPU registers [18]. In order to get the best latency and throughput, a common solution is to use Flash-based drives as an intermediate between traditional magnetic drives and main memory [38] [4]. The use of tape is the final piece of the puzzle, as it is used to archive the data. Tapes have a latency of around 20 seconds or more [24]. Data stored on tapes does not consume any power, which makes tape an ideal solution for long term storage. Finally, tape as a technology has the best $/GB ratio and is necessary to reach the exabyte at a feasible economic cost. As an example, 60% of the storage capacity of the TGCC [9], a CEA supercomputer, relies on tapes.

[Figure 1. Simplified view of the system]

2.2 Robinhood Policy Engine

Robinhood Policy Engine [21] is a versatile tool to manage the contents of large file systems, developed by the CEA [6]. It maintains a replica of the file system metadata in a database that can be queried at will. It makes it possible to schedule mass actions on file system entries by defining attribute-based policies. Robinhood is linked to the file system through the changelog, as shown on the right side of Fig. 1. It stores all metadata information about the file system in a SQL database, which makes it possible to retrieve the state of the file system at any time. In a production system, this translates to millions of file entries, with their associated creation time, last access and last modification time. We used the data extracted from Robinhood in the proposed contribution.

3 Predicting file lifetime for data placement

This section aims to describe the full scope of our contribution, beginning with an overview of our solution inside the existing storage system. We explain how our training and validation data are processed and describe how our solution works.

3.1 Overall predicting system architecture

Our contribution consists in a file lifetime predicting system that takes file paths as input. Fig. 1 shows a simplified view of the existing data management system where our solution would be integrated. In the top right corner are Lustre [12] instances with 2 storage servers; those servers contain different media types. The one on the right is a metadata server: it contains metadata about the other Lustre file systems and is connected to them by a secondary network called the "management network", with lower latency and bandwidth than the High Performance Data Network. The applications run on compute nodes which are not represented in this figure; applications are connected to the Lustre storage servers through the High Performance Data Network.

On the right side of the figure, the block labeled "Robinhood Policy Engine" [21] represents the software solution used to create migration policies between the different tiers. Robinhood is connected to the management network and reads metadata changes from the metadata server. This is represented by the "changelog reader" in the figure. These metadata changes are processed and stored in a database. The database then receives SQL queries to check the state of the file systems without having to contact the metadata server directly. An administrator can then create a policy by periodically checking the state of the file system through the Robinhood database; such policies include data migrations or purges.

Our solution would stand inside Robinhood and filter only the "file creation" events. The path of the file would be processed to generate a prediction of its lifetime. The prediction would then update the database entry related to the file. These predictions are to be used to create a data migration policy: the files with expired lifetimes would be
moved from the high performance tier to the capacity tier. In Fig. 1, this would translate into a migration from the SSD storage server (high performance tier) to the HDD storage server (capacity or archival storage), thus granting us more space for newly created files on the performance tier.
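To make this integration concrete, the following Python sketch shows how such a policy could be wired together. It is purely illustrative: predict_lifetime, db and migrate are hypothetical placeholders standing in for the trained model, the Robinhood database and the migration machinery; they are not actual Robinhood APIs.

    import time

    # Hypothetical glue code: predictions drive a migration policy.
    def on_file_creation(path, created_at, db, predict_lifetime):
        # The predicted lifetime is log10(seconds); store an absolute expiry time.
        db.update(path, predicted_expiry=created_at + 10 ** predict_lifetime(path))

    def run_policy(db, migrate):
        # Periodically move files whose predicted lifetime has expired
        # from the performance tier to the capacity tier.
        for entry in db.query(predicted_expiry_before=time.time(), tier="performance"):
            migrate(entry.path, src="performance", dst="capacity")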

3.2 File lifetime training framework

In order to work, our machine learning solution has to be trained on real examples. Training is an iterative process; we refer to each training pass as an epoch, i.e. one complete processing of the entire dataset by the neural network.

[Figure 2. View of the training process]

Fig. 2 shows the overall CNN training process. In our case, we used data from a production Robinhood database, which had to be processed in order to remove fields with missing information. Those data are then transformed into a form that can be understood by a machine learning model. Then we design a model using Tensorflow and Keras to describe the layers of our Convolutional Neural Network and its hyperparameters, such as the learning rate or the number of epochs. The model then goes through training and its performance is evaluated by a loss function (also referred to as a cost function). The loss function is fundamental to the learning process as it defines how the neural network will change its parameters. Previous work [25] used two different loss functions to achieve either high accuracy or low under-estimations; we propose a custom loss function using both of them. The three main components are described in the next sections.

3.3 Dataset processing

Data processing is a necessary part of our training system; its role is to transform the data from its source into a form that a machine learning solution can use. In our case we need fixed-size inputs, while file paths have variable sizes. We also need, in this phase, to define and normalize the file lifetime metric.

Our solution uses data from a production system at CEA [10]. They were extracted from a Robinhood database used to monitor a production file system. Raw data extracted in this manner were the result of an SQL query: a Comma Separated Value (CSV) file that can be viewed as a table. This table contains fields with information on each file in the file system. We had access to the path, time of last access (or read), time of last modification (write), creation time and other associated metadata for each file.

The first step of data processing consists in processing each field of the CSV file in order to remove incoherent data, such as a creation time later than the last access time. The path column of each field was then converted from a character string of variable length (e.g. /foo/bar/user/project/...) to a fixed-size array of 256 integers. This size was chosen based on the commonly observed size of the paths. When a path was longer than 256 characters it was truncated from the start, because we observed that the right part of the path contains the more relevant information, such as the username, the file extension, etc. The transformed paths were considered as inputs for our CNN.

[Figure 3. Lifetime distribution]

The second step of data processing is to define the lifetime metric. There are three fields containing time information available in Robinhood: the creation time, the time of last access and the time of last modification. We came up with two possible definitions for the lifetime:

    Def. 1 = log10(last modification time − creation time)
    Def. 2 = log10(last access time − creation time)

The decimal logarithm is used to discretize lifetime values. Fig. 3 shows the distribution of lifetimes for each definition. The x-axis represents the decimal logarithm (log10) of the lifetime in seconds and the y-axis is the percentage of files close to that lifetime; the bars represent the number of files belonging to a certain lifetime slice in our dataset. It shows, from left to right, the lifetime distribution according to Def. 1 and Def. 2. We decided to use Def. 2, i.e. the time of last access minus the creation time, because it contains more relevant information: the last access can be either a write or a read, while the last modification is always a write operation. This means that a prediction based on Def. 1 might evict some files before a read, which would result in a performance penalty. So we defined the lifetime as the decimal logarithm of the difference between the time of last access and the creation time.
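As a concrete illustration, the path encoding and lifetime labeling described above could look like the following Python sketch. The zero left-padding of short paths and the clamping of sub-second lifetimes are our assumptions; the paper only specifies the 256-integer width, the truncation from the start, and the log10 definition.

    import numpy as np
    from math import log10

    PATH_LEN = 256

    def encode_path(path):
        # Keep the rightmost 256 characters (they carry the username,
        # file extension, etc.) and left-pad shorter paths with zeros.
        codes = [ord(c) for c in path[-PATH_LEN:]]
        return np.array([0] * (PATH_LEN - len(codes)) + codes, dtype=np.int32)

    def lifetime_label(creation_time, last_access_time):
        # Def. 2: decimal logarithm of (last access - creation), in seconds.
        return log10(max(last_access_time - creation_time, 1))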
3.4 Model Architecture

The CNN model architecture used in this paper was inspired by previous work [25], which itself relied on [31]. A simplified representation of the model is shown in Fig. 4. The architecture is made of 12 layers: 1 input layer, 10 hidden layers and 1 output layer. Given the current trends in computer vision, this is a small CNN by its number of layers, but a rather large one in terms of parameters. For example, Inception v4 [33] contains 70 layers for 36.5 million parameters, while our model contains around 100 million parameters (also referred to as "weights"). We kept the same network layout as it performed adequately on this task in previous work [25] despite its number of parameters.

[Figure 4. Simplified view of the model]

The training starts with the input layer: it fetches data from the dataset and passes it to the embedding layer.

3.4.1 Embedding Layer. The embedding layer transforms the 256 bytes long input into a matrix of 256 by 32 floating point numbers. We kept this size from previous work [25] and confirmed it through multiple tries with different embedding sizes (16, 32, 48, 64). Fig. 5 shows an example with an 8 byte long input and an embedding size of 4. Each character is mapped to an array of floating point numbers; these numbers change during training because they are learning parameters. We use embedding for two reasons: first, in order to leverage a CNN, we need to transform the file paths into matrices of fixed size; second, because the alternative method, one-hot encoding, takes too much space [30]. In our case one-hot encoding would have multiplied the size of the input matrix by 3.

[Figure 5. Example of character embedding]
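The toy example of Fig. 5 can be reproduced in a few lines of Keras. The vocabulary size of 128 (ASCII) is an assumption; the real model uses a 256-character input with an embedding size of 32.

    import numpy as np
    import tensorflow as tf

    # An 8-character input and an embedding size of 4, as in Fig. 5.
    emb = tf.keras.layers.Embedding(input_dim=128, output_dim=4)
    path = np.array([[ord(c) for c in "/foo/bar"]])  # shape (1, 8): character codes
    vectors = emb(path)                              # shape (1, 8, 4): learned floats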
3.4.2 Convolutional Layers. The next layers are the convolutional layers; they extract spatial features from the embedding layer that make sense to the neural network. Theoretically, a convolutional layer produces a set of matrices containing features about the data it gets fed. Convolutional layers are defined by two parameters: 1) the number of output matrices and 2) the size of the kernel used (128 and 5 for example). Pooling layers are used to reduce the dimension of the input matrices of the next convolutional layer. Our model contains two convolutional layers and two pooling layers, as shown in Fig. 4. The process of finding the right number of layers is empirical and, again, was the result of experiments in previous work [25].

The output of a convolutional layer can be represented as a series of matrices containing features. The layer that sits between the convolutional layers and the dense layers in Fig. 4 is the flatten layer; its role is to serialize the output of the last convolutional layer into a 1-dimensional array. Dense layers are defined by the number of neurons they contain. Each value in the output of the flatten layer is connected to each neuron in the first dense layer, and the dense layers are fully connected to each other. Almost all of the learning parameters are located in the dense layers; the weights in the previous layers (convolution, embedding) are negligible.

The output of the network takes the output of a dense layer made of a single node and returns the lifetime of the file as previously defined. This model solves a regression problem, since it returns a single numerical value instead of probabilities of belonging to classes.
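Putting the layers together, a Keras sketch of this architecture could look as follows. The filter count (128) and kernel size (5) are the example values given above; the dense layer widths are placeholders, since the paper only states that the dense layers hold most of the roughly 100 million parameters.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(path_len=256, vocab_size=128, embed_dim=32):
        return tf.keras.Sequential([
            layers.InputLayer(input_shape=(path_len,)),  # encoded file path
            layers.Embedding(vocab_size, embed_dim),     # each character -> 32 floats
            layers.Conv1D(128, 5, activation="relu"),
            layers.MaxPooling1D(2),
            layers.Conv1D(128, 5, activation="relu"),
            layers.MaxPooling1D(2),
            layers.Flatten(),                            # serialize to a 1-D array
            layers.Dense(4096, activation="relu"),       # widths are assumptions
            layers.Dense(4096, activation="relu"),
            layers.Dense(1),                             # predicted log10 lifetime
        ])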
3.5 Loss function

The goal of training a CNN, like any other network, is to minimize a loss function. The loss function is used during training to compute the changes to be applied to the CNN in order to increase accuracy in the next iteration. It takes as inputs the predicted lifetime value and the real one.

We present 4 loss functions in this subsection: the first is one of the most commonly used in regression tasks, and the following two were used by previous work [25] to train their CNNs. Finally we present our custom loss function, which makes use of the two from previous work, hence the need to explain how they work.

One of the most commonly used loss functions in regression tasks is the Mean Squared Error (MSE), or L2 [26]. MSE is the mean of the squared distances between the real and predicted values; it is defined as:

    MSE = (1/n) Σ_{i=1..n} (y_i − ỹ_i)²

where y represents the real value and ỹ the predicted value. MSE should be used when the target values follow a normal distribution, as this loss function significantly penalizes large errors. We observed that our data do not follow a normal distribution, which explains why previous work [25] chose another loss function that is less affected by outliers in the data.

Previous work [25] used logcosh, another regression loss function, in order to achieve high accuracy. It is defined as:

    L = (1/n) Σ_{i=1..n} log(cosh(y_i − ỹ_i))

where y represents the real value and ỹ the predicted value. It works like MSE except that it is less affected by outliers in the dataset. Previous work [25] also used quantile99 to achieve a low level of under-estimations. quantile99 is the name given by [25] to the quantile loss function when the quantile is 0.99. The quantile loss function is defined as:

    L_q = (1/n) Σ_{i=1..n} max(q·(y_i − ỹ_i), (q − 1)·(y_i − ỹ_i))

where q represents the quantile value, y the real value and ỹ the predicted value. This loss function weighs over-estimations and under-estimations differently based on the quantile value. The closer the quantile value is to 1, the more it penalizes under-estimations. The same is true for over-estimations: the closer the quantile is to 0, the more it penalizes over-estimations. A quantile value of 0.5 penalizes both directions symmetrically (it reduces to a scaled Mean Absolute Error).

The loss function used to train our model is a custom one. We realized through [16] that using the weighted average of quantile99 and logcosh allowed us to combine their properties while allowing for higher flexibility thanks to the weights used. The custom loss function can be represented as:

    custom_loss() = w_a · logcosh() + w_u · quantile99()

where w_a + w_u = 1. By modifying the values of the weights w_a and w_u we can either favor high accuracy or low under-estimations.

The goal of this approach is to allow our solution to be adapted according to the system state, for instance disk usage, I/O bandwidth, performance delta between tiers, the QoS required by the application, etc.

We evaluated different weight configurations in our experimentation, that is (w_a, w_u) ∈ {(0.5, 0.5), (0.4, 0.6), (0.3, 0.7), (0.2, 0.8), (0.1, 0.9)}; the results we obtained are discussed in Section 4.
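A minimal TensorFlow implementation of this weighted loss, reusing Keras' built-in log-cosh and writing the quantile (pinball) term directly, could read as follows; the factory name make_custom_loss is ours.

    import tensorflow as tf

    def make_custom_loss(w_a=0.2, w_u=0.8, q=0.99):
        # Weighted average of logcosh and quantile99; w_a + w_u must equal 1.
        def custom_loss(y_true, y_pred):
            logcosh = tf.reduce_mean(tf.keras.losses.log_cosh(y_true, y_pred))
            err = y_true - y_pred
            quantile99 = tf.reduce_mean(tf.maximum(q * err, (q - 1.0) * err))
            return w_a * logcosh + w_u * quantile99
        return custom_loss

    # model.compile(optimizer="adam", loss=make_custom_loss(0.2, 0.8))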
4 Experimental Evaluation

This section covers the methodology used to evaluate our solution and the results we obtained against previous work [25]. We also discuss the flexibility our solution gains by tuning the weights of our custom loss function.

4.1 Methodology

We trained the convolutional neural network in a Docker container which had access to 4 Tesla V100 GPUs. We used an optimization called mixed precision in order to reduce the training time on GPU. We used Tensorflow 2 [1] to conduct all our experiments. With this optimization enabled, it took 2 hours to train our model (while the training lasted 4 hours without mixed precision). In order to further reduce training time we used early stopping, which monitors whether the network is over-trained and, if so, stops the training.

The dataset we used contains 5 million entries of real data from a production system in HPC. We pre-processed this dataset in order to keep only the path and the lifetime of each file, using the method discussed in Section 3. We used 70% of the data for training and 30% for validation. The validation data were chosen randomly.

The metrics used to evaluate the performance of the network are the accuracy and the under-estimations.
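In TensorFlow 2 terms, this training setup and the two evaluation metrics can be sketched as below; the early-stopping patience is an assumption, as the paper does not give it.

    import numpy as np
    import tensorflow as tf

    tf.keras.mixed_precision.set_global_policy("mixed_float16")  # faster GPU training

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)

    def accuracy_within_order(y_true, y_pred):
        # Accurate = within one order of magnitude; lifetimes are log10-scaled.
        return np.mean(np.abs(y_pred - y_true) <= 1.0)

    def underestimation_rate(y_true, y_pred):
        # Fraction of predictions below the real lifetime.
        return np.mean(y_pred < y_true)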
4.2 Results

Table 1. Comparison of previous work against our solution

    evaluation metric    logcosh   quantile99   our solution
    accuracy             98.84%    90.91%       98.60%
    under-estimations    21.86%    0.88%        2.21%

Table 1 shows the performance of logcosh and quantile99 against our solution; best values are emphasized. In this table, the weights of the loss function used for our solution were 0.2 for logcosh and 0.8 for quantile99. While not reaching the accuracy of logcosh, our solution manages to be 7.5% more precise than quantile99. On the other hand, the number of under-estimations is almost divided by ten compared to logcosh.

Table 2. Comparison of the different weighted loss functions for each (w_a, w_u) value

    eval. metric   (0.5,0.5)  (0.4,0.6)  (0.3,0.7)  (0.2,0.8)  (0.1,0.9)
    accuracy       98.83%     98.75%     98.74%     98.60%     97.95%
    under-est.     6.22%      2.53%      2.48%      2.21%      1.34%

By tuning the weights of our loss function we can further approach either the accuracy of logcosh or the under-estimations of quantile99. Table 2 shows the influence of the weights of our loss function on the metrics we observed; best performances are emphasized. A (0.5, 0.5) split between logcosh and quantile99 yields an accuracy only 0.01% lower than logcosh alone while dividing under-estimations by more than 3. In terms of under-estimations, our solution is still outperformed by previous work using quantile99, by 0.46% in the (0.1, 0.9) experiment. Overall we can observe that logcosh seems to have a bigger influence on the evaluation metrics than quantile99: accuracy declines slowly even as the proportion of quantile99 gets larger. Since we defined an accurate prediction as a value within an order of magnitude of the real value, it is possible that the predicted values get closer to leaving that interval as the proportion of quantile99 increases. This would explain
the steep decline in precision between the (0.2, 0.8) and the (0.1, 0.9) experiments.

The conclusion we have drawn is that minimizing under-estimations is costly in terms of accuracy; through multiple experiments, we were able to quantify the cost of reducing under-estimations in terms of accuracy.

5 Related work

Predicting file lifetimes with Machine Learning was investigated in [25], which was the main inspiration of this work. The authors used two machine learning algorithms: random forest and a convolutional neural network (CNN). They showed that, on a dataset of 5 million entries, random forest was outperformed by the CNN both in terms of accuracy using logcosh and in terms of under-estimations using quantile99 as loss functions. Another observation they made from the data they recovered was that 45.80% of the files had a lifetime under 10 seconds.

eXpose [31] applies character embedding to URLs in a cybersecurity context. The embedding was used to train a Convolutional Neural Network to recognize malicious URLs. It is a highly influential work for the previously cited paper [25].

Learning I/O Access Patterns to improve prefetching in SSDs [11] leverages Long Short-Term Memory (LSTM) networks, an extension of Recurrent Neural Networks, to predict at the firmware level (inside an SSD) which Logical Block Address will be used. In [14], the authors evaluated different learning algorithms, this time to model the overall performance of SSDs for container-based virtualization, in order to avoid interference that would cause Service Level Objective violations in a cloud environment.

Data-Jockey [32] is a data movement scheduling solution that aims to reduce the complexity of moving data across a storage hierarchy. It uses user-made policies to balance the load across all storage nodes and jobs. It has a view of the entire supercomputer and its end goal is to create a scheduler analogous to SLURM for data migration jobs.

Archivist [29] is a data placement solution for hybrid embedded storage; it uses a neural network to classify files and outperforms the baseline in access latency under certain conditions.

UMAMI [23] uses a range of different monitoring tools to capture a holistic view of the system. The authors show that their holistic approach allows them to analyse performance problems more efficiently.

The work presented in this paper follows and increments the approach in [25]. It upgrades the previous solution by 1) enhancing its performance in terms of accuracy and under-estimations and 2) providing the administrator with a higher degree of flexibility by supplying a set of solutions allowing a trade-off between accuracy and under-estimations.

6 Conclusions and Future Work

In this paper we proposed a method to predict the lifetime of a file based on its path using a Convolutional Neural Network. We used data from a CEA production storage system to train our solution. We designed a custom loss function to get our Convolutional Neural Network to achieve a trade-off between high accuracy and low under-estimations. We then evaluated our solution against previous work [25], which used two loss functions: logcosh and quantile99. The results we obtained show that, when using weights of 0.8 for quantile99 and 0.2 for logcosh, our solution was able to reach an accuracy close to that of previous work using only logcosh (98.60% compared to 98.84%) while reducing under-estimations by almost 10 times, down to 2.21% (compared to 21.86%). Furthermore, our solution is more flexible compared to previous work, since it allows system administrators to tune the weights of the loss function during incremental training to either favor high precision or low under-estimations.

In future work we would like to investigate the use of our solution on a 3-tiered storage system (we used 2 tiers in this work). In such a context, we need to investigate new parameters to learn when to move data from one tier to another while still keeping the burden of monitoring low. Finally, we would like to explore the use of a deeper CNN architecture in order to reduce the number of parameters, much like what is done in computer vision.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016). USENIX, 265–283. https://doi.org/10.5555/3026877.3026899 arXiv:1605.08695

[2] Anthony Agelastos, Benjamin Allan, Jim Brandt, Paul Cassella, Jeremy Enos, Joshi Fullop, Ann Gentile, Steve Monk, Nichamon Naksinehaboon, Jeff Ogden, Mahesh Rajan, Michael Showerman, Joel Stevenson, Narate Taerat, and Tom Tucker. 2014. The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 154–165. https://doi.org/10.1109/SC.2014.18

[3] Djillali Boukhelef, Jalil Boukhobza, Kamel Boukhalfa, Hamza Ouarnoughi, and Laurent Lemarchand. 2019. Optimizing the cost of DBaaS object placement in hybrid storage systems. Future Generation Computer Systems 93 (April 2019), 176–187. https://doi.org/10.1016/j.future.2018.10.030

[4] Jalil Boukhobza and Pierre Olivier. 2017. Flash Memory Integration (1st ed.). ISTE Press - Elsevier.

[5] Jalil Boukhobza, Stéphane Rubini, Renhai Chen, and Zili Shao. 2017. Emerging NVM: A Survey on Architectural Integration and Research Challenges. ACM Trans. Des. Autom. Electron. Syst. 23, 2, Article 14 (Nov. 2017), 32 pages. https://doi.org/10.1145/3131848

[6] CEA. 2020. CEA - HPC - Computing centers. Retrieved 2021-02-23 from http://www-hpc.cea.fr/en/complexe/computing-ressources.htm

[7] CEA. 2020. CEA - HPC - TERA. Retrieved 2021-02-23 from http://www-hpc.cea.fr/en/complexe/tera.htm

[8] CEA. 2020. CEA - HPC - TGCC. Retrieved 2021-02-23 from http://www-hpc.cea.fr/en/complexe/tgcc.htm

[9] CEA. 2020. CEA - HPC - TGCC Storage system. Retrieved 2021-02-23 from http://www-hpc.cea.fr/en/complexe/tgcc-storage-system.htm

[10] CEA. 2020. English Portal - The CEA: a key player in technological research. Retrieved 2021-02-23 from https://www.cea.fr/english/Pages/cea/the-cea-a-key-player-in-technological-research.aspx

[11] Chandranil Chakraborttii and Heiner Litz. 2020. Learning I/O Access Patterns to Improve Prefetching in SSDs. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). https://www.researchgate.net/publication/344379801_Learning_IO_Access_Patterns_to_Improve_Prefetching_in_SSDs

[12] Sean Cochrane, Ken Kutzer, and L. McIntosh. 2009. Solving the HPC I/O bottleneck: Sun™ Lustre™ storage system. Sun BluePrints™ Online 820 (2009). http://nz11-agh1.ifj.edu.pl/public_users/b14olsze/Lustre.pdf

[13] Tom Coughlin. 2011. New storage hierarchy for consumer computers. In 2011 IEEE International Conference on Consumer Electronics (ICCE). 483–484. https://doi.org/10.1109/ICCE.2011.5722696 ISSN: 2158-4001.

[14] Jean Emile Dartois, Jalil Boukhobza, Anas Knefati, and Olivier Barais. 2019. Investigating Machine Learning Algorithms for Modeling SSD I/O Performance for Container-based Virtualization. IEEE Transactions on Cloud Computing (2019). https://doi.org/10.1109/TCC.2019.2898192

[15] Richard Evans. 2020. Democratizing Parallel Filesystem Monitoring. In 2020 IEEE International Conference on Cluster Computing (CLUSTER). 454–458. https://doi.org/10.1109/CLUSTER49012.2020.00065

[16] Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz Elibol. 2019. A Comparison of Loss Weighting Strategies for Multi task Learning in Deep Neural Networks. IEEE Access 7 (2019), 141627–141632. https://doi.org/10.1109/ACCESS.2019.2943604

[17] Takahiro Hirofuchi and Ryousei Takano. 2020. A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module. IEICE Transactions on Information and Systems E103.D, 5 (May 2020), 1168–1172. https://doi.org/10.1587/transinf.2019EDL8141 arXiv:2002.06018

[18] Bruce Jacob, Spencer Ng, and David Wang. 2007. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

[19] Kathy Kincade. 2019. UniviStor: Next-Generation Data Storage for Heterogeneous HPC. Retrieved 2021-02-23 from https://cs.lbl.gov/news-media/news/2019/univistor-a-next-generation-data-storage-tool-for-heterogeneous-hpc-storage/

[20] S. Klasky, Hasan Abbasi, M. Ainsworth, Jong Youl Choi, Matthew Curry, T. Kurc, Q. Liu, Jay Lofstead, Carlos Maltzahn, Manish Parashar, Norbert Podhorszki, Eric Suchyta, F. Wang, M. Wolf, C.S. Chang, R. Churchill, and Stéphane Ethier. 2016. Exascale Storage Systems the SIRIUS Way. Journal of Physics: Conference Series 759 (Oct. 2016), 012095. https://doi.org/10.1088/1742-6596/759/1/012095

[21] Thomas Leibovici. 2015. Taking back control of HPC file systems with Robinhood Policy Engine. International Workshop on the Lustre Ecosystem: Challenges and Opportunities (2015). arXiv:1505.01448 http://arxiv.org/abs/1505.01448

[22] Zhen Liang, Johann Lombardi, Mohamad Chaarawi, and Michael Hennecke. 2020. DAOS: A Scale-Out High Performance Storage Stack for Storage Class Memory. In Lecture Notes in Computer Science, Vol. 12082 LNCS. Springer, 40–54. https://doi.org/10.1007/978-3-030-48842-0_3

[23] Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, and Philip Carns. 2017. UMAMI: A recipe for generating meaningful metrics through holistic I/O performance analysis. In Proceedings of PDSW-DISCS 2017 - 2nd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, held in conjunction with SC 2017. 55–60. https://doi.org/10.1145/3149393.3149395

[24] Jakob Lüttgau, Michael Kuhn, Kira Duwe, Yevhen Alforov, Eugen Betke, Julian Kunkel, and Thomas Ludwig. 2018. Survey of Storage Systems for High-Performance Computing. Supercomputing Frontiers and Innovations 5, 1 (April 2018), 31–58. https://doi.org/10.14529/jsfi180103

[25] Florent Monjalet and Thomas Leibovici. 2019. Predicting File Lifetimes with Machine Learning. In High Performance Computing, Vol. 11887 LNCS. Springer, 288–299. https://doi.org/10.1007/978-3-030-34356-9_23

[26] Feiping Nie, Zhanxuan Hu, and Xuelong Li. 2018. An investigation for loss functions widely used in machine learning. Communications in Information and Systems 18, 1 (2018), 37–52. https://doi.org/10.4310/cis.2018.v18.n1.a2

[27] Hamza Ouarnoughi, Jalil Boukhobza, Frank Singhoff, and Stéphane Rubini. 2014. A multi-level I/O tracer for timing and performance storage systems in IaaS cloud. In 3rd IEEE International Workshop on Real-Time and Distributed Computing in Emerging Applications (REACTION). IEEE Computer Society, 1–8.

[28] John K. Ousterhout. 1990. Why Aren't Operating Systems Getting Faster As Fast as Hardware? 1990 Summer USENIX Annual Technical Conference (1990), 247–256.

[29] Jinting Ren, Xianzhang Chen, Yujuan Tan, Duo Liu, Moming Duan, Liang Liang, and Lei Qiao. 2019. Archivist: A Machine Learning Assisted Data Placement Mechanism for Hybrid Storage Systems. In 2019 IEEE 37th International Conference on Computer Design (ICCD). 676–679. https://doi.org/10.1109/ICCD46524.2019.00098 ISSN: 2576-6996.

[30] Pau Rodríguez, Miguel Bautista, Jordi Gonzàlez, and Sergio Escalera. 2018. Beyond One-hot Encoding: lower dimensional target embedding. Image and Vision Computing 75 (May 2018). https://doi.org/10.1016/j.imavis.2018.04.004

[31] Joshua Saxe and Konstantin Berlin. 2017. eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys. arXiv:1702.08568 [cs] (Feb. 2017). http://arxiv.org/abs/1702.08568

[32] Woong Shin, Christopher Brumgard, Bing Xie, Sudharshan Vazhkudai, Devarshi Ghoshal, Sarp Oral, and Lavanya Ramakrishnan. 2019. Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 511–522. https://doi.org/10.1109/IPDPS.2019.00061 ISSN: 1530-2075.

[33] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In 31st AAAI Conference on Artificial Intelligence (AAAI 2017). AAAI Press, 4278–4284. https://doi.org/10.5555/3298023.3298188 arXiv:1602.07261

[34] Bharti Wadhwa, Surendra Byna, and Ali Butt. 2018. Toward Transparent Data Management in Multi-Layer Storage Hierarchy of HPC Systems. In 2018 IEEE International Conference on Cloud Engineering (IC2E). 211–217. https://doi.org/10.1109/IC2E.2018.00046

[35] Lipeng Wan, Zheng Lu, Qing Cao, Feiyi Wang, Sarp Oral, and Bradley Settlemyer. 2014. SSD-optimized workload placement with adaptive learning and classification in HPC environments. In 2014 30th Symposium on Mass Storage Systems and Technologies (MSST). 1–6. https://doi.org/10.1109/MSST.2014.6855552

[36] Wenguang Wang. 2004. Storage Management for Large Scale Systems. Ph.D. Dissertation. https://doi.org/10.5555/1123838

[37] HPC Wire. 2020. Fujitsu and RIKEN Take First Place Worldwide in TOP500, HPCG, and HPL-AI with Supercomputer Fugaku. Retrieved 2021-01-25 from https://www.hpcwire.com/off-the-wire/fujitsu-and-riken-take-first-place-worldwide-in-top500-hpcg-and-hpl-ai-with-supercomputer-fugaku/

[38] Orcun Yildiz, Amelie Zhou, and Shadi Ibrahim. 2017. Eley: On the Effectiveness of Burst Buffers for Big Data Processing in HPC Systems. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). 87–91. https://doi.org/10.1109/CLUSTER.2017.73
