You are on page 1of 5

Boosting Data Access Based on Predictive Caching

Chen Han Liao JianGuo Zheng


Glory Sun Management school
Glory Sun Management school
DongHua University
DongHua University
Shanghai China 200051
Shanghai China 200051
chenhanliao@googlemail.com
zjg@dhu.edu.cn

ABSTRACT- Storage system behaviors are recorded in trace period a predictor for file frequency is constructed from observations
files. The file system trace monitors the file operations from of file system activity in trace. The file system can then use this
model to predict the frequency of newly-created files.
time to time. We show that once a file is created with a set of
attributes, such as name, type, permission mode, owner and
owner group, its future access frequency is predictable. A
In this paper, we first show that the future file access
regression-tree-based predictive model is established to predict frequency is predictable and strongly related to the file
whether a file will be frequently accessed or not. By consulting attributes at creation time. Then, we construct a regression­
with the rules generated from the predictive model over diverse tree-based predictive model to infer a new file's future
real-system NFS traces, it can predict a newly created file's frequency class (high/hot, low/cold). Finally, we present an
future access frequency with a sufficient accuracy. We further evolutionary storage system integrating the above predictor.
introduce an evolutionary storage system, which the predicted The rest of the paper is organized as follows: Section 2
frequency information could be used to decide what files to discusses the related works. Section 3 describes and analyses
keep in a flash memory. The trace-driven experimental results three NFS traces used in our analysis and the statistical
indicate that the performance speedup due to the prediction­ evidence of file access frequency prediction. Section 4
enabled optimization is 2-4. presents FFP (File Frequency Predictor). Section 5 validates
FFP. Section 6 presents an evolutionary storage system
I. INTRODUCTION integrating FFP. Section 7 concludes.
The file system designers have suggested that the
IT. RELATED WORK
knowledge behind the system behaviors and file attributes
such as file access pattern, size, and name can conduct the Since the gap of processing speed between I/O and CPU
improvement and the design of file caching policies and disk performance increases every few years, many efforts have
layout. The recent researches [1] [2] have illustrated that file been devoted to address it through optimizing various
size, lifespan and access mode (write, read, lookup) are aspects of file system. One of those efforts is to reduce the
predictable to improve file pre-fetching and caching. number of disk requests by improving the caching efficiency.
Unfortunately, no work has been reported in predicting file Another solution is to reduce the disk requesting time
access frequency. Since the file access frequency can through re-organizing the disk layout of the relative files'
directly reflect a file's usage and decide whether or not the blocks.
file is hot, therefore, the storage system may use the file One of such heuristics is the Fast File System (FFS) [5].
access frequency information to improve its performance It was designed to handle files in different manners
such as file caching and disk layout. In this paper, we show regarding to their size. It places small files on disk so that
that applications already give useful hints to a file system, in they are near their metadata. Furthermore, it assumes that
the form of file attributes at creation time, and that the file files in the same directory are most likely accessed together
system can successfully predict its future access frequencies in a short period.
from these hints (Fig.1). The file access frequency in our In addition to file size, other properties such as access
work is defined as the frequency of various file operations, type: write-mostly or read-mostly, have been confirmed
which include read, write, execute, lookup and rename. useful to design various file system policies. For example,
Log-structured File System (LFS) [14] proposed by John K.
Ousterhout and Fred Douglis, treats the disk as a circular log
and writes sequentially to the head of the log. The design of
log-structured file systems is based on the hypothesis that
write latency is the bottle neck of modern file systems. As a
result, a Hybrid file system structure designed against the
nature of read and write has been found useful to form high
performance file system [12].
Besides above heuristics, it is noticed that file access
frequency is also useful. A data file that is currently being
heavily accessed by users is a high frequency file. In a
Figure I. Regression-tree-based predictor FFP is a bridge linking the past Hierarchical Storage Management system (HSM), file 110
(attributes at creation) and the future (frequency). During the training
may be constantly monitored in order to migrate the high

978-1-61284-486-2/111$26.00 ©2011 IEEE

93
frequency files to the fastest storage devices or to internal Table I gives a summary of the average hourly operation
memory for better performance. HSM was first implemented counts and mixes for the workloads captured in the traces.
by IBM on their mainframe computers [16] to reduce the These show that there are differences between these
cost of data storage, and to simplify the retrieval of data workloads, at least in terms of the operation mix. CAMPUS
from slower media. The user would not need to know where is dominated by reads and more than 85% of the operations
the data was stored and how to get it back, the computer are either reads or writes. DEAS03 has proportionally fewer
would retrieve the data automatically. The only difference to reads and writes and more metadata requests (getattr, lookup,
the user was the speed at which data is returned. and access) than CAMPUS, but reads are still the most
Based on our investigation, the problem is that all of common operation. On EECS03, meta-data operations
those heuristics are embedded into file system it selves, comprise the majority of the workload.
which means if the heuristics are wrong or dynamic the
system performance will degrade [2]. In addition, file
system workloads may change over time, the designed
policies may take critical time to adapt to it. In previous
work, a solution to this problem is that enable applications
to supply hints to the file systems [5] [13 ]. Unfortunately,
this solution requires the modification of applications, which
hasn't been widely deployed.
In fact, the workloads, user behaviors and applications
already provide very rich hints to conduct file system
behaviors. An attributes based file properties prediction
method has been introduced recently [2]. It utilizes machine
learning approach to analysis the previous file system
behaviors and then makes decisions advising file systems
Figure 2. The cumulative file accesses versus the cumulative file number,
the estimated future properties of a new file, such as size,
both expressed as percentages. It was found that more than eighty
life span and even write-mostly or read-mostly.
percent of the file accesses are probably going to less than ten
However, to the best of our knowledge, there is no
percent of the files.
reported investigation against file access frequency
prediction so far. In this work, we will show that once a file
In Fig.2, we plot the cumulative file accesses versus the
is created with a set of attributes, such as name, type,
cumulative file number, both expressed as percentages. The
permission mode, owner and owner group, its future access
file accesses in our work include read, write, execute, lookup
frequency is predictable. A regression-tree-based predictive
and rename operations. The curve ofDEAS03 in Fig.2 shows
model will then be established. The predicted frequency
94.9% of the file accesses going to 10% of the files. This
information will be used to decide what files to keep in a
result with curve of EECS03 and CAMPUS become less
flash memory. Hence the work is believed to be timely and
skewed. In EECS03 88.9% of the file accesses are going to
novel considering file access frequency is useful to realize
less than 10% of the total files whereas in CAMPUS 84.1%
most of the performance gains.
of the file accesses are going to less than 10% of the total
files. CAMPUS is similar to EECS03, but contains a lot of
III. THE TRACES AND THEIR SKEWS
sequential access patterns. We can see that some file accesses
The Traces we used were taken from three servers that are highly skewed and experience shows that only a small
have different users and behaviors: DEAS03 [1], EECS03 [2] fraction of files in a file system is actively and frequently
and CAMPUS [3 ]. used. To identifY the associations between the create-time
For the privacy issue, some information of these three attributes of a file and its future access frequency, we first
traces was encoded by using the method described in [8]. scan the traces to obtain the previous knowledge and
The encoded traces contain anonymized UIDs, GIDs, and IP calculate the file access frequency. In these 3 NFS traces, the
address remapped to the new values. There are no attribute "fid" uniquely identifies a file; therefore, we can
information lost among those identifiers and other variables. record the access frequency of a file by accumulating the
During the anonymization UIDs, GIDs, and host IP numbers number of occurrences of a certain "fid". Once we obtain the
are simply remapped to new values, so no information is lost access frequency of a file in a short period, say peak hours
about the relationship between these identifiers and other (1Oam-l pm), we are able to measure the statistical
variables in the data. All three traces were collected with association between each attribute and the relative file access
nfsdump rI 8 . frequency.
Host read write lookup getattr access The fact is that some of the associations are clear and
DEAS03 48.7% 15.7% 3.4% 29.2% 1.4% easily to be identified. For example, the lecturers regularly
EECS03 24.3% 12.3% 27.0% 3 .2% 20.0% update his/her lecture notes in the file server. Consequently,
CAMPUS 64.5% 21.3% 5.8% 2.3% 2.9% in the next short period, the files created with
UID="professor A" will be accessed frequently. In contrast,
TABLE I. THE AVERAGE PERCENTAGE OF READ. WRITE. LOOKUP. GET
for example, if a student submits his/her coursework to the
ATTRIBUTES, AND ACCESS OPERATIONS FOR A FOURTEEN DAY TRACE.

91
file server, in the next short period, the files created with can easily capture the trends of future file access based on the
UID="student B" will not hold a good probability of being current regression-tree's layout. Secondly, regression-tree
accessed as frequent as previous one regarding to the model can generate important insights describing a situation
popularity reason. (its alternatives, probabilities, and costs) and their
Similarly to this, file mode can also provide useful preferences for outcomes. These information make the
information to file access frequency. According to the predicting results more reliable.
observation in [2], any files with a mode of 777 in DEAS03 CART decision tree method compared to other
trace is likely to live for less than one second and contain advantages: not only simple, but the beginning is not
zero bytes of data. It suggests that if a file is created with a generating an appropriate size tree, but generate a sufficiently
mode that makes it readable and writeable by any user on the large classification and regression tree, the tree thus obtained
system is actually never read or written by anyone. is better robustness and higher accuracy.
To statistically confirm the degree that how strong an To determine the frequency boundary between High &
attribute of a file relates to the file's access frequency, we use Low, the capacity of the caching devices is considered. Since
Chi-square test [9] to roughly analyze the association our implementation of FFP is based on a flash memory­
between high access frequency files and each attribute. Chi­ oriented storage system which stores frequently used files in
square test lets us know the degree of confidence we can flash memory for caching and prefetching, the boundary of
have in accepting or rejecting a hypothesis. Concretely, the file access frequency is determined in such a way that the
hypothesis of our Chi-square test is whether or not a high amount of 110 caused by accessing High frequency files
access frequency file's attributes and its access frequency should fit in with the maximal space of the flash memory. As
class (high) are associated enough. Given access frequency shown in figure2 of section 3, the average workloads data
boundaries are 599, 19, and 17 respectively regarding to the skew is around 90%. In fact, various workloads may have
10% top frequent accessed files out of the total files. different data skew. For example, in our implementation,
It has shown that the confidence values supported by each according to the workloads data skew, we choose the flash
attribute are different in each trace. However, some attributes memory with capacity equal to 10% of the hard disk storage
such as UID always show great confidence to the file access capacity. That is, if hard disk is 10 GB, then flash memory is
frequency, which is one of the representative cases we have 1GB. Consequently, the file access frequency boundary is
discussed at the beginning of this section. Chi-square test determined accordingly.
provides us with sufficient evidence that file type, file mode, According to the average data skew in these three traces,
UID, GlD and access frequencies there is a considerable 599 is set to the frequency boundary between low frequency
confidence. Especially the UID is always shown great and high frequency of the sample from DEAS03. Then, we
confidence. In the next section, we discuss our regression­ can obtain the frequency class in accordance of their real
tree based predictive model. access frequency.
The Chi-square test has provided us the evidence that TABLE II. TRAINING SAMPLE FROM DEAS03.
files attributes are strongly related to files access frequency.
File File Access
It suggests us that if we know a file's attributes, we then can type mode
tHO GID
frequency
predict a file's frequency class (high or low) based on a
I 180 18b34 6 7964
predefined frequency boundary. Apart from those
representative attributes mentioned above, some other I 180 18ad4 6 242
attributes such as the file's creation time, file name and suffix 1 180 18aa2 6 299
may also support file access frequency. In this paper, we
1 180 18e09 18b3b 567
focus on 4 attributes: file mode, file type, user ID, and group
ID. 2 180 18e09 18b3b 21546
Taking into account our projected targets -- the file access 2 7ff 0 0 6689
frequency -- is continuous data. To accurately evaluate the
2 5eO 18a89 18a89 7145
confidence given by files' attributes, we use regression-tree
as our predictive model. We first obtain training sample to I 5e9 18aa2 18ae2 54
build the predictive model, and then validate the model with
2 leO 18aee 18ad5 3657
new files to check whether or not the predictions are accurate.
There are many learning algorithms aim to classify The process of constructing classification and regression
continuous data such as neural networks(NN) and support tree of the given training samples is as follows:
vector machines(SVM), This paper argues that the most I) Starting from the root tl, so that i=1
appropriate method is Classification and regression 2) tl at the root node, search questions, find the next node
tree(CART) [7], which utilizes the information gain to in the result set makes the biggest decrease in impurity
evaluate the confidence supported by an attribute. Compared splitting variable. Variable to split the node with the split into
with other supervised learning models, regression-tree has a ti-I and ti-2.
few advantages that are of benefits to our problem. First of 3) If a node ti place, never again have a further significant
all, regression-tree is simple to understand and interpret reduction in impurity is a leaf node to node ti. Otherwise, go
unlike neural networks and support vector machines. Due to to Step i = i-2
the change of storage system behaviors, system administrator

95
4) By the purity of rules or regulations to determine the
misclassification cost of the class label the leaf nodes. The SVM 82.3% 91.9% 80.1%
division process is to generate a sufficiently large tree shown CART 96.4% 91.7% 83.1%
in Fig.6. Specifically, in the classification and regression tree
at each node of the search problem sets, calculation of nodes
Validation results show that the regression trees, neural
produced on each split reduces the value of the non-purity,
networks and support vector machines can be used for file
and select the lower value of the largest non-purity of the
access frequency prediction, but the regression tree than the
split optimum split for the node. According to the optimal
neural network and support vector machine with high
split to split the result set to generate a sufficiently large tree.
accuracy.
The following tables show the accuracy of the predictions
during the peak hours on Tuesday, Wednesday and Thursday.
Here we predict that the frequency of visits to ask the top
10% of the high frequency of file access. Of course, you can
. .
flash memory capacIty to change th'IS ratIo.
Frequency DEAS03 EECS03 CAMPUS
class
High 90.1% 92.5% 93.3%
Low 98.2% 91.2% 90.6%

TABLE IV. PREDICTION ACCURACY OF FFP ON TUESDAY.

Frequency DEAS03 EECS03 CAMPUS


class
High 88.4% 88.0% 90.1%
Low 93 .6% 89.6% 87.7%

TABLEY. PREDICTION ACCURACY OF FFP ON WEDNESDAY

Frequency DEAS03 EECS03 CAMPUS


Figure 3. A Regression-tree of the Training Sample.
class
The classifier we built in this section is based on the High 82.6% 87.7% 84.1%
regression-tree classification algorithm. However, the Low 88.6% 85.3% 82.2%
training sample we used in the example for building the
TABLE VI. PREDICTION ACCURACY OF FFP ON THURSDAY.
classifier is very small. Generally, more training data used
for building the classifier more precise the result reaches the The validation results show that the accuracy of the
reality. In next section, we will discuss the validation of FFP. classifier decreases after few days. This is caused by the
slightly changed user behaviors and files' life span. For
IV. THE VALIDATION OF FFP
instance, some lecture notes might be accessed frequently in
For other time-irrelevant classification problems, training the next few days after the corresponding lectures were given.
samples and test samples are normally taken from same However, those lecture notes' access frequency may be no
workloads to guarantee that both samples have same data longer frequent once some new lectures are given and their
characteristics. However, in our problem, those traces are relative lecture notes are created. Hence, there is a need to
obtained in an uninterrupted time series over several days. build a new predictive model to adapt the file system change.
Thus, each hour's trace may differ from each other in terms The possible way is to rebuild the predictive model when the
of file access frequency. It is impossible to split the whole precision of the model becomes unacceptable.
time series workload into first half and second half to build
classifier and validate it. As we can imagine, in peak hours, V. AN EVOLUTIONARY STORAGE SYSTEM
file access frequency could be higher than rest of hours. INTEGRATING FFP

Taking into account the supervised learning algorithm, Degradation in performance over time is a serious and
neural network and Support Vector Machine can also be the common concern in magnetic disks. Sometimes, people
target of continuous data classification, we use the neural benchmark their disks when new, and then many months
network and Support Vector Machine instead of regression later, and are surprised to find that the disk is getting slower.
trees to predict file access frequency and compared the In fact, the disk most likely has not changed at all, but the
accuracy of three algorithms. The following table show the second benchmark may have been run on tracks closer to the
accuracy of the predictions during the three algorithms. center of the disk [17]. The EvoStor system uses a flash
TABLE III. ALGORITHM COMPARISON AMONG NN, SVM AND memory-oriented file system by extending EXT2. A new
CART device driver for flash memories has been written, equipped
DEAS03 EECS03 CAMPUS with garbage collection and other flash-related management
strategies [19]. Fig.9 shows a typical cylinder access
NN 83.6% 87.9% 85.2%

96
distribution over 25,000 requests on a Quantum Atlas disk. In [2] Daniel Ellard, Michael Mesnier, EnO Thereska, G. R. Ganger, and
Margo Seltzer, "Attribute-Based Prediction of File Properties",
most environments, disk accesses display significant spatial
Harvard Computer Science Group Technical Report TR-14-03,
and temporal locality. Indeed, as the narrowness of the peak
December 2003.
in FigA (a) shows, only a small number of cylinders are
[3] Michael Mesnier, Eno Thereska, Daniel Ellard, Gregory R. Ganger,
accessed with any significant frequency. FigA (b) shows the and Margo Seltzer, "File Classification in Selt:* Storage Systems",
effect of allocating the most frequently-accessed files Proceedings of the First International Conference on Autonomic
predicted by FFP together into the flash memory with a Computing (lCAC-04), New York, May 2004.
capacity of 10% of the disk space. [4] Carl Staelin and H. Garcia-Molina, "Clustering active disk data to
improve disk performance," Tech. Rep. CS-TR-283-90, Dept. of
Computer Science, Princeton Univ., Princeton, N. J. Sept. 1990.
[5] Pei Cao, Edward W. Felten, Anna R. Karlin, and Kai Li,
"Implementation and Performance of Integrated Application­
Controlled File Caching, Prefetching, and Disk Scheduling," ACM
Transactions on Computer Systems, 14(4):311-343, 1996.
[6] Daniel Ellard, Jonathan Ledlie, Pia Malkani, and Margo Seltzer.
"Passive NFS Tracing of Email and Research Workloads," In
Proceedings of the Second USENIX Conference on File and Storage
Technologies (FAST'03), pages 203-216, San Francisco, CA, March
2003.
[7] Daniel Ellard, Jonathan Ledlie, and Margo Seltzer, 'The Utility of File
Names," Technical Report TR-05-03, Harvard University Division of
Engineering and Applied Sciences, 2003.
[8] Gregory R. Ganger and M. Frans Kaashoek, "Embedded lnodes and
Explicit Grouping: Exploiting Disk Bandwidth for Small Files," In
USENIX Annual Technical Conference, pages 1-17, 1997.
[9] James T. McClave, Frank H. Dietrich II, and Terry Sincich,
"Statistics," Prentice Hall, 1997.
[10] B. Bouchon-Meunier, Ronald R Yager and Lotfi Asker Zadeh, "Fuzzy
logic and soft computing," Singapore, River Edge, NJ:World Scientific,
1995.
Figure 4. KDD in engineering: the predicted frequency information could [11] Sape Mullender and Andrew Tanenbaum, "Immediate Files," In
be used to decide what files to keep in a flash memory. Measured on Software - Practice and Experience, number 4 in 14, April 1984.
a Quantum Atlas disk.
[12] Keith Muller and Joseph Pasquale, "A High Performance Multi­
Structured File System Design," In Proceedings of the 13th ACM
VI. CONCLUSION Symposium on Operating Systems Principles (SOSP-91), pages 56--{'7,
Asilomar, Pacific Grove, CA, October 1991.
In this paper, we have described KDD technology
[13] R. Hugo Patterson, Garth A Gibson, Eka Ginting, Daniel Stodolsky,
embedded in a commercial product (An Evolutionary Storage
and Jim Zelenka. "Informed Prefetching and Caching," In ACM SOSP
System integrating file frequency predictor, FFP). We Proceedings, 1995.
showed that in some systems file attributes are strongly [14] Mendel Rosenblum and John K. Ousterhout, 'The Design and
related to its access frequency. We have also presented a Implementation of a Log-Structured File System," ACM Transactions
method of predicting file access frequency. During the on Computer Systems, 1O(1 ):26-52, 1992.
training period a predictor for file frequency is constructed [15] SOS project home page, htlp//www.eecs.harvard. edu/soslindex.html,
from observations of file system activity in trace. The file 2006.
system can then use this model to predict the frequency of [16] IBM Company home page, "IBM Mainframe Introduction",
newly-created files. The frequency information could be used http//www-
03. ibm. comlibmlhistory/exhibits/mainframe/mainframe intro.html,
to decide what files to keep in a flash memory. The trace­
_

2006
driven experimental results indicate that the performance
[17] Frank Wang, Y Deng, N. Helian, V Khare, S. Wu, C Liao & A
speedup due to the prediction-enabled optimization is 2-4. Parker, "Speeding Up a Magnetic Disk by Clustering Frequent Data",
IEEE Transactions on Magnetics, eg13, 2006.
ACKNOWLEDGEMENTS
[18] Daniel Ellard and Margo Seltzer. "New NFSTracing Tools and
We would like to thank Jonathan Ledlie and other SOS Techniques for System Analysis" In Proceedings of the Seventeenth
Annual Large Installation System Administration Conference
[11] project researchers for providing us the simulation traces
(LISA'03), pages 73-85, San Diego, CA, October 2003.
and the heuristics used in our design. This project was
supported by the National Natural Science Foundation of
China (70971020) and the Natural Science Foundation of
Shanghai (06ZRI4004).

REFERENCE

[1] T. M. Kroeger and D. D. E. Long, "The Case for Eflicient File Access
Pattern Modeling," in Proceedings of the Seventh Workshop on Topics
in Operating systems, 1999.

97

You might also like