Mismanagement of the persistent state of a system (all the executable files, configuration
settings and other data that govern how a system functions) causes reliability problems and
security vulnerabilities, and drives up operation costs. Recent research traces persistent
state interactions (how state is read, modified, etc.) to help troubleshooting, change
management and malware mitigation, but has been limited by the difficulty of collecting,
storing, and analyzing the 10s to 100s of millions of daily events that occur on a single
machine, much less the 1000s or more machines in many computing environments.
observed file system workloads and common systems management queries. Our lossless
log format compresses logs to only 0.5-0.9 bytes per interaction. In this log format, 1000
machine-days of logs (over 25 billion events) can be analyzed in less than 30 minutes.
We report on our deployment of FDR to 207 production machines at MSN, and show that a
single centralized collection machine can potentially scale to collecting and analyzing the
complete records of persistent state interactions from 4000+ machines. Furthermore, our
tracing technology is shipping as part of the Windows Vista OS.
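As a back-of-the-envelope check on these figures, the short sketch below derives the log volume and scan rate they imply. It uses only the numbers quoted above (25 billion events, 1000 machine-days, 0.5-0.9 bytes per interaction, 30 minutes). The variable names are ours, and because the text says "over 25 billion" and "less than 30 minutes", the results are rough lower bounds rather than measurements.

```python
# Rough arithmetic from the figures quoted above; not a measurement.
EVENTS = 25_000_000_000          # "over 25 billion events"
MACHINE_DAYS = 1000              # "1000 machine-days of logs"
BYTES_PER_EVENT = (0.5, 0.9)     # compressed size per interaction
QUERY_SECONDS = 30 * 60          # "less than 30 minutes"

# Average number of events one machine generates per day.
events_per_machine_day = EVENTS / MACHINE_DAYS

# Total compressed log size for the whole data set, in gigabytes (10^9 bytes).
total_gb_low = EVENTS * BYTES_PER_EVENT[0] / 1e9
total_gb_high = EVENTS * BYTES_PER_EVENT[1] / 1e9

# Compressed log size per machine per day, in megabytes (10^6 bytes).
mb_per_machine_day_low = events_per_machine_day * BYTES_PER_EVENT[0] / 1e6
mb_per_machine_day_high = events_per_machine_day * BYTES_PER_EVENT[1] / 1e6

# Minimum average scan rate needed to cover every event within the time budget.
events_per_second = EVENTS / QUERY_SECONDS

print(f"events per machine-day: {events_per_machine_day:,.0f}")
print(f"total log size: {total_gb_low:.1f}-{total_gb_high:.1f} GB")
print(f"per machine-day: {mb_per_machine_day_low:.1f}-{mb_per_machine_day_high:.1f} MB")
print(f"implied scan rate: {events_per_second / 1e6:.1f} million events/s")
```

Under these assumptions, a machine averages about 25 million events per day, the full 1000 machine-day data set occupies roughly 12.5 to 22.5 GB, and the analysis must cover on the order of 14 million events per second.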
Misconfigurations and other persistent state (PS) problems are among the primary causes
of failures and security vulnerabilities across a wide variety of systems, from individual
desktop machines to large-scale Internet services. MSN, a large Internet service, finds that
in one of its services running on a 7000-machine system, 70% of problems not solved by
rebooting were related to PS corruptions, while only 30% were hardware failures.
Oppenheimer et al. find that configuration errors are the largest category of operator
mistakes that lead to downtime in Internet services. Studies of wide-area networks show
that misconfigurations cause 3 out of 4 BGP routing announcements, and are also a
significant cause of extra load on DNS root servers [4,22]. Our own analysis of call logs from
a large software company’s internal help desk, responsible for managing corporate
desktops, found that a plurality of its calls (28%) were PS-related. Furthermore, most
reported security compromises are against known vulnerabilities; administrators are
wary of patching their systems because they do not know the state of their systems and
cannot predict the impact of a change [1,26,34].
settings to avoid the misconfigurations and inconsistencies that cause these reliability and
security problems. Recent work has shown that selectively logging how processes running
on a system interact with PS (e.g., read, write, create, delete) can be an important tool for
quickly troubleshooting configuration problems, managing the impact of software patches,
analyzing hacker break-ins, and detecting malicious websites exploiting web browsers
[17,35-37]. Unfortunately, each of these techniques is limited by the current infeasibility of
collecting and analyzing the complete logs of 10s to 100s of millions of events generated by
a single machine, much less the 1000s of machines in even a medium-sized computing and
There are three desired attributes in a tracing and analysis infrastructure. First is low
performance overhead on the monitored client, such that it is feasible to always be
collecting complete information for use by systems management tools. The second desired
attribute is an efficient method to store data, so that we can collect logs from many
machines over an extended period to provide a breadth and historical depth of data when
managing systems. Finally, the analysis of these large volumes of data has to be scalable, so
that we can monitor, analyze and manage today’s large computing
environments. Unfortunately, while many tracers have provided low overhead, none of the
state-of-the-art technologies for “always-on” tracing of PS interactions provides efficient
storage and analysis.
We present the Flight-Data Recorder (FDR), a high-performance, always-on tracer that
provides complete records of PS interactions. Our primary contribution is a domain-specific,
queryable and compressed log file format, designed to exploit workload
characteristics of PS interactions and key aspects of common-case queries, primarily that
most systems management tasks are looking for “the needle in the haystack,” searching for
a small subset of PS interactions that meet well-defined criteria. The result is a highly
We evaluate FDR’s performance overhead, compression rates, query performance, and
scalability. We also report our experiences with a deployment of FDR to monitor 207
production servers at various MSN sites. We describe how always-on tracing and analysis
improve our ability to do after-the-fact queries on hard-to-reproduce incidents, provide
insight into on-going system behaviors, and help administrators scalably manage large-
scale systems such as IT environments and Internet service clusters.
In the next section, we discuss related work and the strengths and weaknesses of current
approaches to tracing systems. We present FDR’s architecture and log format design in
Sections 3 and 4, and evaluate the system in Section 5. Section 6 presents several analysis
techniques that show how PS interactions can help systems management tasks like
troubleshooting and change management. In Section 7, we discuss the implications of this
work, and then conclude.
Throughout the paper, we use the term PS entries to refer to files and folders within the file system, as well as their equivalents within structured files such as the Windows Registry. A PS interaction is any kind of access, such as an open, read, write, close or delete operation.
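To make this terminology concrete, the following is a minimal sketch of how a PS interaction might be represented and searched. It is an uncompressed, in-memory illustration only and is not FDR's log format. The field names (`timestamp`, `process`, `entry`, `op`), the helper `find_interactions`, and all sample paths and process names are hypothetical. Case-insensitive prefix matching is likewise an assumption made here because Windows paths and registry keys are case-insensitive.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, Set


class Op(Enum):
    """The kinds of access named in the definition above."""
    OPEN = "open"
    READ = "read"
    WRITE = "write"
    CLOSE = "close"
    DELETE = "delete"


@dataclass(frozen=True)
class PSInteraction:
    """One access to a PS entry (hypothetical record layout)."""
    timestamp: float  # assumed: seconds since an arbitrary epoch
    process: str      # assumed: name of the process performing the access
    entry: str        # PS entry: a file/folder path or a registry key path
    op: Op


def find_interactions(log: Iterable[PSInteraction],
                      entry_prefix: str,
                      ops: Set[Op]) -> Iterator[PSInteraction]:
    """Yield interactions of the given kinds under a PS entry prefix.

    This is a plain linear scan, shown only to illustrate the shape of a
    "needle in the haystack" query over PS interactions.
    """
    prefix = entry_prefix.lower()
    for event in log:
        if event.op in ops and event.entry.lower().startswith(prefix):
            yield event


# Hypothetical sample data, invented purely for illustration.
sample_log = [
    PSInteraction(1.0, "setup.exe", r"C:\Program Files\App\app.dll", Op.WRITE),
    PSInteraction(2.0, "app.exe", r"C:\Program Files\App\app.dll", Op.READ),
    PSInteraction(3.0, "app.exe", r"HKLM\Software\App\Timeout", Op.READ),
    PSInteraction(4.0, "tool.exe", r"HKLM\Software\App\Timeout", Op.WRITE),
    PSInteraction(5.0, "cleanup.exe", r"C:\Temp\scratch.tmp", Op.DELETE),
]

# Which processes modified or deleted anything under this registry subtree?
changes = list(find_interactions(sample_log, r"hklm\software\app",
                                 {Op.WRITE, Op.DELETE}))
for event in changes:
    print(event.timestamp, event.process, event.op.value, event.entry)
```

On the sample data, the query selects only the write by `tool.exe` to the registry value. The reads of the same key, and the changes to entries outside the subtree, are filtered out.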
In this section, we discuss related research and common tools for tracing system
behaviors. We discuss related work on analyzing and applying these traces to solve
systems problems in Section 6. Table 1 compares the log sizes and performance overhead
of FDR and other systems described in this section for which we had data available.