
doi:10.1145/2076450.2076466

Article development led by queue.acm.org

Logs contain a wealth of information to help manage systems.

By Adam Oliner, Archana Ganapathi, and Wei Xu

Advances and Challenges in Log Analysis
Computer-system logs provide a glimpse into the states of a running system. Instrumentation occasionally generates short messages that are collected in a system-specific log. The content and format of logs can vary widely from one system to another and even among components within a system. A printer driver might generate messages indicating that it had trouble communicating with the printer, while a Web server might record which pages were requested and when.

As the content of the logs is varied, so are their uses. The printer log might be used for troubleshooting, while the Web-server log is used to study traffic patterns to maximize advertising revenue. Indeed, a single log may be used for multiple purposes: information about the traffic along different network paths, called flows, might help a user optimize network performance or detect a malicious intrusion; or call-detail records can monitor who called whom and when, and upon further analysis can reveal call volume and drop rates within entire cities.

This article provides an overview of some of the most common applications of log analysis, describes some of the logs that might be analyzed and the methods of analyzing them, and elucidates some of the lingering challenges. Log analysis is a rich field of research; while it is not our goal to provide a literature survey, we do intend to provide a clear understanding of why log analysis is both vital and difficult.

Many logs are intended to facilitate debugging. As Brian Kernighan wrote in Unix for Beginners in 1979, “The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.” Although today’s programs are orders of magnitude more complex than those of 30 years ago, many people still use printf to log to console or local disk, and use some combination of manual inspection and regular expressions to locate specific messages or patterns.

The simplest and most common use for a debug log is to grep for a specific message.
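As a minimal sketch of this kind of search (Python standing in for grep, with a hypothetical log path and message pattern):

import re

# Scan a server log for a suspected failure message. The path and
# pattern are illustrative, not from any particular system.
pattern = re.compile(r"connection (dropped|lost|refused)", re.IGNORECASE)
with open("server.log") as f:
    for lineno, line in enumerate(f, 1):
        if pattern.search(line):
            print(f"{lineno}: {line.rstrip()}")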


If a server operator believes that a program crashed because of a network failure, then he or she might try to find a “connection dropped” message in the server logs. In many cases, it is difficult to figure out what to search for, as there is no well-defined mapping between log messages and observed symptoms. When a Web service suddenly becomes slow, the operator is unlikely to see an obvious error message saying, “ERROR: The service latency increased by 10% because bug X, on line Y, was triggered.” Instead, users often perform a search for severity keywords such as “error” or “failure”. Such severity levels are often used inaccurately, however, because a developer rarely has complete knowledge of how the code will ultimately be used.

Furthermore, red-herring messages (for example, “no error detected”) may pollute the result set with irrelevant events. Consider the following message from the BlueGene/L supercomputer:

YY-MM-DD-HH:MM:SS NULL RAS BGLMASTER FAILURE ciodb exited normally with exit code 0

The FAILURE severity word is unhelpful, as this message may be generated during nonfailure scenarios such as system maintenance.

When a developer writes the print statement of a log message, it is tied to the context of the program source code. The content of the message, however, often excludes this context. Without knowledge of the code surrounding the print statement or what led the program onto that execution path, some of the semantics of the message may be lost—that is, in the absence of context, log messages can be difficult to understand.

An additional challenge is that log files are typically designed to represent a single stream of events. Messages from multiple sources, however, may be interleaved both at runtime (from multiple threads or processes) and statically (from different modules of a program). For runtime interleaving, a thread ID does not solve the problem because a thread can be reused for independent tasks. There have been efforts to include message contexts automatically (X-Trace,4 Dapper12) or to infer them from message contents,15 but these cannot completely capture the intents and expectations of the developer.

The static interleaving scenario is more challenging because different modules may be written by different developers. Thus, a single log message may have multiple interpretations. For example, a “connection lost” message might be of great importance to the author of the system networking library, but less so for an application author who is shielded from the error by underlying abstractions. It is often impossible for a shared-library author to predict which messages will be useful to users.

Logging usually implies some internal synchronization. This can complicate the debugging of multithreaded systems by changing the thread-interleaving pattern and obscuring the problem. (This is an example of a so-called heisenbug.) A key observation is that a program behaves nondeterministically only at certain execution points, such as clock interrupts and I/O. By logging all the nondeterministic execution points, you can faithfully replay the entire program.7,14 Replay is powerful because you can observe anything in the program by modifying the instrumentation prior to a replay. For concurrent programs or those where deterministic execution depends on large amounts of data, however, this approach may be impractical.
Log volume can be excessive in a large system. For example, logging every acquire and release operation on a lock object in order to debug lock contention may be prohibitively expensive. This difficulty is exacerbated in multimodule systems, where logs are also heterogeneous and therefore even less amenable to straightforward analysis. There is an inherent cost to collecting, storing, sorting, or indexing a large quantity of log messages, many of which might never be used. The return on investment for debug logging arises from its diagnostic power, which is difficult to measure.

Some users need aggregated or statistical information and not individual messages. In such cases, they can log only aggregated data or an approximation of aggregated data and still get a good estimate of the required statistics. Approximation provides statistically sound estimates of metrics that are useful to machine-learning analyses such as PCA (principal component analysis) and SVM (support vector machine8).
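As an illustrative sketch of logging aggregates rather than individual events (the window length and event names are arbitrary choices, not from any particular system):

import time
from collections import Counter

WINDOW = 60.0                  # seconds per aggregation window
counts = Counter()
window_start = time.time()

def log_event(event_type):
    # Count events; emit one aggregate record per window instead of
    # one log message per event.
    global window_start
    counts[event_type] += 1
    now = time.time()
    if now - window_start >= WINDOW:
        print(int(window_start), dict(counts))  # suitable as ML input
        counts.clear()
        window_start = now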


These techniques are critical in networked or large-scale distributed systems, where collecting even a single number from each component carries a heavy performance cost. This illustrates the potential benefits of tailoring instrumentation to particular analyses.

Machine-learning techniques, especially anomaly detection, are commonly used to discover interesting log messages. Machine-learning tools usually require input data as numerical feature vectors. It is nontrivial to convert free-text log messages into meaningful features. Recent work analyzed source code to extract semi-structured data automatically from legacy text logs and applied anomaly detection on features extracted from logs.15 On several open source systems and two Google production systems, the authors were able to analyze billions of lines of logs, accurately detect anomalies often overlooked by human eyes, and visualize the results in a single-page decision-tree diagram.
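A greatly simplified stand-in for such feature extraction (the cited work parses source code to recover message templates; this sketch approximates that by masking variable fields):

import re
from collections import Counter

def template(message):
    # Crudely normalize a message into a "type" by masking numbers.
    return re.sub(r"\d+", "#", message)

def feature_vectors(lines, window=1000):
    # Turn a raw log into per-window counts of message types,
    # one feature vector per window.
    vectors, counts = [], Counter()
    for i, line in enumerate(lines, 1):
        counts[template(line)] += 1
        if i % window == 0:
            vectors.append(dict(counts))
            counts = Counter()
    return vectors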
Challenges remain in statistical anomaly detection. Even if some messages are abnormal in a statistical sense, there may be no further evidence on whether these messages are the cause, the symptom, or simply innocuous. Also, statistical methods rely heavily on log quality, especially whether “important” events are logged. The methods themselves do not define what could be “important.”

Static program analysis can help discover the root cause of a specific message by analyzing paths in the program that could lead to the message. Static analysis can also reveal ways to improve log quality by finding divergence points, from which program execution might enter an error path; such points are excellent candidates for logging instrumentation.16 Static analysis techniques are usually limited by the size and complexity of the target system. It takes hours to analyze a relatively simple program such as Apache Web Server. Heuristics and domain knowledge of the target system usually make such analyses more effective.

Performance
Log analysis can help optimize or debug system performance. Understanding a system’s performance is often related to understanding how the resources in that system are used. Some logs are the same as in the case of debugging, such as logging lock operations to debug a bottleneck. Some logs track the use of individual resources, producing a time series. Resource-usage statistics often come in the form of cumulative use per time period (for example, b bits transmitted in the last minute). One might use bandwidth data to characterize network or disk performance, page swaps to characterize memory effectiveness, or CPU utilization to characterize load-balancing quality.

Like the debugging case, performance logs must be interpreted in context. Two types of contexts are especially useful in performance analysis: the environment in which the performance number occurs and the workload of the system.

Performance problems are often caused by interactions between components, and to reveal these interactions you may have to synthesize information from heterogeneous logs generated by multiple sources. Synthesis can be challenging. In addition to heterogeneous log formats, components in distributed systems may disagree on the exact time, making the precise ordering of events across multiple components impossible to reconstruct. Also, an event that is benign to one component (for example, a log flushing to disk) might cause serious problems for another (for example, because of the I/O resource contention). As the component causing the problem is unlikely to log the event, it may be hard to capture this root cause. These are just a few of the difficulties that emerge.

One approach to solving this problem is to compute influence, which infers relationships between components or groups of components by looking for surprising behavior that is correlated in time.10 For example, bursty disk writes might correlate in time with client communication errors; a sufficiently strong correlation suggests some shared influence between these two parts of the system. Influence can quantify the interaction between components that produce heterogeneous logs, even when those logs are sparse, incomplete, and without known semantics and even when the mechanism of the interaction is unknown. Influence has been applied to production systems ranging from autonomous vehicles such as Stanley13 (where it helped diagnose a dangerous swerving bug10) to supercomputers such as BlueGene/L1 (where it was able to analyze logs from more than 100,000 components in real time9).
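Purely as a toy stand-in for influence (the published method is considerably more sophisticated), one could correlate per-window event rates drawn from two heterogeneous logs; the data and threshold below are invented:

from statistics import correlation  # Python 3.10+

# Hypothetical per-minute event counts from two unrelated logs.
disk_write_bursts = [0, 3, 9, 1, 0, 7, 8, 0]
client_errors     = [0, 2, 8, 0, 0, 6, 9, 1]

r = correlation(disk_write_bursts, client_errors)
if abs(r) > 0.8:                    # illustrative threshold
    print(f"possible shared influence (r = {r:.2f})")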
heterogeneous logs, even when those provides analysis mechanisms, all spe-
Performance logs are sparse, incomplete, and with- cific to those queries. When applied to
Log analysis can help optimize or debug out known semantics and even when benchmark codes in a distributed sys-
system performance. Understanding a the mechanism of the interaction is tem, Fay showed single-digit percentage
system’s performance is often related unknown. Influence has been applied overheads. Dynamic program rewriting
to understanding how the resources in to production systems ranging from combined with sampling-based logging


Security
Logs are also used for security applications, such as detecting breaches or misbehavior, and for performing postmortem inspection of security incidents. Depending on the system and the threat model, logs of nearly any kind might be amenable to security analysis: logs related to firewalls, login sessions, resource utilization, system calls, network flows, and so on.

Intrusion detection often requires reconstructing sessions from logs. Consider an example related to intrusion detection—that is, detecting unauthorized access to a system. When a user logs into a machine remotely via SSH, that machine generates log entries corresponding to the login event. On Mac OS X, these look like the following messages (timestamp and hostname omitted), which show a user named user47 accessing the machine interactively from a specific IP address and port number; the first three messages correspond to the login and the last three to the logout:

sshd[12109]: Accepted keyboard-interactive/pam for user47 from 171.64.78.25 port 49153 ssh2
com.apple.SecurityServer[22]: Session 0x3551e2 created
com.apple.SecurityServer[22]: Session 0x3551e2 attributes 0x20
com.apple.SecurityServer[22]: Session 0x3551e2 dead
com.apple.SecurityServer[22]: Killing auth hosts
com.apple.SecurityServer[22]: Session 0x3551e2 destroyed

Common sense says the logout messages match the previous login messages because the hexadecimal session numbers match (0x3551e2); we know the second of the logout lines, which does not include the session number, is part of the logout event only because it is sandwiched between the other two. There is nothing syntactic about these lines that would reveal, a priori, that they are somehow associated with the lines generated at login, let alone each other.

In other words, each message is evidence of multiple semantic events, including the following: the execution of a particular line of code, the creation or destruction of an SSH session, and the SSH session as a whole.

A log analyst interested in security may then ask the deceptively simple question: Does this SSH session constitute a security breach?

The answer may depend on a number of factors, among them: Have there been an abnormally large number of failed login attempts recently? Is the IP address associated with user47 familiar? Did user47 perform any suspicious actions while the session was active? Is the person with username user47 on vacation and thus should not be logging in?

Note that only some of these questions can be answered using data in the logs. You can look for a large number of failed login attempts that precede this session, for example, but you cannot infer user47’s real identity, let alone his or her vacation schedule. Thus, a particular analysis works on logs that are commensurate with the type of attack it is meant to detect; more generally, the power of an analysis is limited by the information in the logs.

Log analysis for security may be signature based, in which the user tries to detect specific behaviors that are known to be malicious; or anomaly based, in which the user looks for deviation from typical or good behavior and flags this as suspicious. Signature methods can reliably detect attacks that match known signatures, but are insensitive to attacks that do not. Anomaly methods, on the other hand, face the difficulty of setting a threshold for calling an anomaly suspicious: too low, and false alarms make the tool useless; too high, and attacks might go undetected.

Security applications face the distinguishing challenge of an adversary. To avoid the notice of a log-analysis tool, an adversary will try to behave in such a way that the logs generated during the attack look—exactly or approximately—the same as the logs generated during correct operation. An analysis cannot do much about incomplete logs. Developers can try to improve logging coverage,16 making it more difficult for adversaries to avoid leaving evidence of their activities, but this does not necessarily make it easier to distinguish a “healthy” log from a “suspicious” one.
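A small sketch of the session-reconstruction step described above, grouping messages by their hexadecimal session number; the regular expression assumes the message format shown in the example:

import re
from collections import defaultdict

SESSION = re.compile(r"Session (0x[0-9a-fA-F]+)")

def group_sessions(lines):
    # Group messages by session number. Messages without one (such as
    # "Killing auth hosts") cannot be attributed this way and must be
    # assigned by their position between other messages.
    sessions = defaultdict(list)
    for line in lines:
        match = SESSION.search(line)
        if match:
            sessions[match.group(1)].append(line)
    return sessions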


Prediction
Log data can be used to predict and provision for the future. Predictive models help automate or provide insights for resource provisioning, capacity planning, workload management, scheduling, and configuration optimization. From a business viewpoint, predictive models can guide marketing strategy, ad placement, or inventory management.

Some analytical models are built and honed for a specific system. Experts manually identify dependencies and relevant metrics, quantify the relationships between components, and devise a prediction strategy. Such models are often used to build simulators that replay logs with anticipated workload perturbations or load volumes in order to ask what-if questions. Examples of using analytical models for performance prediction exist on I/O subsystems, disk arrays, databases, and static Web servers. This approach has a major practical drawback, however, in that real systems change frequently and analysis techniques must keep up with these changes.

Although the modeling techniques may be common across various systems, the log data mined to build the model, as well as the metrics predicted, may differ. For example, I/O subsystem and operating-system instrumentation containing a timestamp, event type, CPU profile, and other per-event metrics can be used to drive a simulator to predict I/O subsystem performance. Traces that capture I/O request rate, request size, run count, queue length, and other attributes can be leveraged to build analytical models to predict disk-array throughput.

Many analytical models are single tier: one model per predicted metric. In other scenarios a hierarchy of models is required to predict a single performance metric, based on predictions of other performance metrics. For example, Web server traces—containing timestamps, request type (GET vs. POST), bytes requested, URI, and other fields—can be leveraged to predict storage response time, storage I/O, and server memory. A model to predict server response time under various load conditions can be composed of models of the storage metrics and server memory. As another example, logs tracking record accesses, block accesses, physical disk transfers, throughputs, and mean response times can be used to build multiple levels of queuing network models to predict the effect of physical and logical design decisions on database performance.

One drawback of analytical models is the need for system-specific domain knowledge. Such models cannot be seamlessly ported to new versions of the system, let alone to other systems. As systems become more complex, there is a shift toward using statistical models of historical data to anticipate future workloads and performance.

Regression is the simplest statistical modeling technique used in prediction. It has been applied to performance counters, which measure execution time and memory subsystem impact. For example, linear regression applied to these logs was used to predict execution time of data-partitioning layouts for libraries on parallel processors, while logistic regression was used to predict a good set of compiler flags. CART (classification and regression trees) used traces of disk requests specifying arrival time, logical block number, blocks requested, and read/write type to predict the response times of requests and workloads in a storage system.

Both simple regression and CART models can predict a single metric per model. Performance metrics, however, often have interdependencies that must each be predicted to make an informed scheduling or provisioning decision. Various techniques have been explored to predict multiple metrics simultaneously. One method adapts canonical correlation analysis to build a model that captures interdependencies between a system’s input and performance characteristics, and leverages the model to predict the system’s performance under arbitrary input. Recent work used KCCA (kernel canonical correlation analysis) to model a parallel database system and predict execution time, records used, disk I/Os, and other such metrics, given query characteristics such as operators used and estimated data cardinality.6 The same technique was adapted to model and predict performance of map-reduce jobs.5

Although these techniques show the power of statistical learning techniques for performance prediction, their use poses some challenges.

Extracting feature vectors from event logs is a nontrivial, yet critical, step that affects the effectiveness of a predictive model. Event logs often contain non-numeric data (for example, categorical data), but statistical techniques expect numeric input with some notion of distributions defined on the data. Converting non-numeric information in events into meaningful numeric data can be tedious and requires domain knowledge about what the events represent. Thus, even given a prediction, it can be difficult to identify the correct course of action.

Predictive models often provide a range of values rather than a single number; this range sometimes represents a confidence interval, meaning the true value is likely to lie within that interval. Whether or not to act on a prediction is a decision that must weigh the confidence against the costs (that is, whether acting on a low-confidence prediction is better than doing nothing). Acting on a prediction may depend on whether the log granularity matches the decision-making granularity. For example, per-query resource-utilization logs do not help with task-level scheduling decisions, as there is insufficient insight into the parallelism and lower-level resource-utilization metrics.
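To make the simplest of these techniques concrete, here is a minimal least-squares sketch that predicts one metric from one log-derived feature; the numbers are invented:

# Hypothetical training data extracted from Web-server logs.
xs = [100, 200, 400, 800, 1600]   # bytes requested per request
ys = [12, 19, 41, 77, 160]        # observed response times (ms)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

print(f"predicted time for 1,000 bytes: {slope * 1000 + intercept:.1f} ms")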


Reporting and Profiling
Another use for log analysis is to profile resource utilization, workload, or user behavior. Logs that record characteristics of tasks from a cluster’s workload can be used to profile resource utilization at a large data center. The same data might be leveraged to understand inter-arrival times between jobs in a workload, as well as diurnal patterns.

In addition to system management, profiling is used for business analytics. For example, Web-server logs characterize visitors to a Web site, which can yield customer demographics or conversion and drop-off statistics. Web-log analysis techniques range from simple statistics that capture page popularity trends to sophisticated time-series methods that describe access patterns across multiple user sessions. These insights inform marketing initiatives, content hosting, and resource provisioning.

A variety of statistical techniques have been used for profiling and reporting on log data. Clustering algorithms such as k-means and hierarchical clustering group similar events. Markov chains have been used for pattern mining where temporal ordering is essential.

Many profiling and alerting techniques require hints in the form of expert knowledge. For example, the k-means clustering algorithm requires the user either to specify the number of clusters (k) or to provide example events that serve as seed cluster centers. Other techniques require heuristics for merging or partitioning clusters. Most techniques rely on mathematical representations of events, and the results of the analysis are presented in similar terms. It may then be necessary to map these mathematical representations back into the original domain, though this can be difficult without understanding the log semantics.

Classifying log events is often challenging. To categorize system performance, for example, you may profile CPU utilization and memory consumption. Suppose you have a performance profile for high CPU utilization and low memory consumption, and a separate profile of events with low CPU utilization and high memory consumption; when an event arrives containing low CPU utilization and low memory consumption, it is unclear to which of the two profiles (or both) it should belong. If there are enough such events, the best choice might be to include a third profile. There is no universally applicable rule for how to handle events that straddle multiple profiles or how to create such profiles in the first place.

Although effective for grouping similar events and providing high-level views of system behavior, profiles do not translate directly to operationally actionable insights. The task of interpreting a profile and using it to make business decisions, to modify the system, or even to modify the analysis, usually falls to a human.
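The CPU/memory example above can be made concrete with a nearest-centroid assignment, the core of a k-means step; the profile centers and event are invented:

# Two profile centers in (CPU%, memory%) space.
profiles = {"high-cpu/low-mem": (90, 10), "low-cpu/high-mem": (10, 90)}
event = (12, 15)   # low CPU and low memory: fits neither profile well

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for name, center in profiles.items():
    print(name, dist2(event, center))
# The two distances are comparable, which is the ambiguity described
# above; enough such events may justify adding a third profile.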
Logging Infrastructures
A logging infrastructure is essential for supporting the variety of applications described here. It requires at least two features: log generation and log storage.

Most general-purpose logs are unstructured text. Developers use printf and string concatenations to generate messages because these primitives are well understood and ubiquitous. This kind of logging has drawbacks, however. First, serializing variables into text is expensive (almost 80% of the total cost of printing a message). Second, the analysis needs to parse the text message, which may be complicated and expensive.
On the storage side, infrastructures such as syslog aggregate messages from network sources. Splunk indexes unstructured text logs from syslog and other sources, and it performs both real-time and historical analytics on the data. Chukwa archives data using Hadoop to take advantage of distributed computing infrastructure.11

Choosing the right log-storage solution involves the following trade-offs:
˲ Cost per terabyte (upfront and maintenance)
˲ Total capacity
˲ Persistence guarantees
˲ Write access characteristics (for example, bandwidth and latency)
˲ Read access characteristics (random access vs. sequential scan)
˲ Security considerations (access control and regulation compliance)
˲ Integration with existing infrastructure

There is no one-size-fits-all policy for log retention, which makes choosing and configuring log solutions a challenge. Logs that are useful for business intelligence are typically considered more important than debugging logs and thus are kept for a longer time. In contrast, most debug logs are stored for as long as possible but without any retention guarantee, meaning they may be deleted under resource pressure.

Log-storage solutions are more useful when coupled with alerting and reporting capabilities. Such infrastructures can be leveraged for debugging, security, and other system-management tasks. Various log-storage solutions facilitate alerting and reporting, but they leave many open challenges pertaining to alert throttling, report acceleration, and forecasting capabilities.


Conclusion
The applications and examples in this article demonstrate the degree to which system management has become log-centric. Whether used for debugging problems or provisioning resources, logs contain a wealth of information that can pinpoint, or at least implicate, solutions.

Although log-analysis techniques have made much progress recently, several challenges remain. First, as systems become increasingly composed of many, often distributed, components, using a single log file to monitor events from different parts of the system is difficult. In some scenarios logs from entirely different systems must be cross-correlated for analysis. For example, a support organization may correlate phone-call logs with Web-access logs to track how well the online documentation for a product addresses frequently asked questions and how many customers concurrently search the online documentation during a support call. Interleaving heterogeneous logs is seldom straightforward, especially when timestamps are not synchronized or present across all logs and when semantics are inconsistent across components.

Second, the logging process itself requires additional management. Controlling the verbosity of logging is important, especially in the event of spikes or potential adversarial behavior, to manage overhead and facilitate analysis. The logging mechanism should also not be a channel to propagate malicious activity. It remains a challenge to minimize instrumentation overhead while maximizing information content.

A third challenge is that although various analytical and statistical modeling techniques can mine large quantities of log data, they do not always provide actionable insights. For example, statistical techniques could reveal an anomaly in the workload or that the system’s CPU utilization is high but not explain what to do about it. The interpretation of the information is subjective, and whether the information is actionable or not depends on many factors. It is important to investigate techniques that trade off efficiency, accuracy, and actionability.

There are several promising research directions. Since humans will likely remain a part of the process of interpreting and acting on logs for the foreseeable future, advances in visualization techniques should prove worthwhile.

Program analysis methods, both static and dynamic, promise to increase our ability to automatically characterize the interactions and circumstances that caused a particular sequence of log messages. Recent work aims to modify existing instrumentation so that logs are either more amenable to various kinds of analysis or provide more comprehensive information. Although such modifications are not always possible, insights into how to generate more useful logs are often accompanied by insights into how to analyze existing logs. Mechanisms to validate the usefulness of log messages would improve log quality, making log analysis more efficient.

As many businesses become increasingly dependent on their computing infrastructure—not to mention businesses where the computing infrastructure or the services they provide are the business itself—the relationship between users and systems grows in importance. We have seen a rise in tools that try to infer how the system influences users: how latency affects purchasing decisions; how well click patterns describe user satisfaction; and how resource-scheduling decisions change the demand for such resources. Conversely, recent work suggests that user activity can be useful for system debugging. Further exploration of the relationships between user behavior (workload) and system behavior may prove useful for understanding what logs to use, when, and for what purpose.

These research directions, as well as better logging standards and best practices, will be instrumental in improving the state of the art in log analysis.

Related articles on queue.acm.org

Modern Performance Monitoring
Mark Purdy
http://queue.acm.org/detail.cfm?id=1117404

Network Forensics
Ben Laurie
http://queue.acm.org/detail.cfm?id=1016982

The Pathologies of Big Data
Adam Jacobs
http://queue.acm.org/detail.cfm?id=1563874

References
1. BlueGene/L Team. An overview of the BlueGene/L Supercomputer. IEEE Supercomputing and IBM Research Report (Nov. 2002).
2. Cantrill, B.M., Shapiro, M.W. and Leventhal, A.H. Dynamic instrumentation of production systems. Usenix 2004 Annual Technical Conference (Boston, MA, June 2004); http://www.usenix.org/event/usenix04/tech/general/full_papers/cantrill/cantrill.pdf.
3. Erlingsson, Ú., Peinado, M., Peter, S. and Budiu, M. Fay: Extensible distributed tracing from kernels to clusters. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (Cascais, Portugal, Oct. 2011); http://research.google.com/pubs/archive/37199.pdf.
4. Fonseca, R., Porter, G., Katz, R., Shenker, S. and Stoica, I. X-Trace: A pervasive network-tracing framework. Usenix Symposium on Networked Systems Design and Implementation (Cambridge, MA, Apr. 2007).
5. Ganapathi, A., Chen, Y., Fox, A., Katz, R.H. and Patterson, D.A. Statistics-driven workload modeling for the cloud. Workshop on Self-Managing Database Systems at ICDE (2010), 87−92.
6. Ganapathi, A., Kuno, H.A., Dayal, U., Wiener, J.L., Fox, A., Jordan, M.I. and Patterson, D.A. Predicting multiple metrics for queries: Better decisions enabled by machine learning. International Conference on Data Engineering (2009), 592−603.
7. Altekar, G. and Stoica, I. ODR: Output-deterministic replay for multicore debugging. ACM Symposium on Operating System Principles (2009), 193−206.
8. Nguyen, X., Huang, L. and Joseph, A. Support vector machines, data reduction, and approximate kernel matrices. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (2008), 137−153.
9. Oliner, A.J. and Aiken, A. Online detection of multi-component interactions in production systems. In Proceedings of the International Conference on Dependable Systems and Networks (Hong Kong, 2011); http://adam.oliner.net/files/oliner_dsn_2011.pdf.
10. Oliner, A.J., Kulkarni, A.V. and Aiken, A. Using correlated surprise to infer shared influence. In Proceedings of the International Conference on Dependable Systems and Networks (Chicago, IL, 2010), 191−200; http://adam.oliner.net/files/oliner_dsn_2010.pdf.
11. Rabkin, A. and Katz, R. Chukwa: A system for reliable large-scale log collection. Usenix Conference on Large Installation System Administration (2010), 1−15.
12. Sigelman, B., Barroso, L., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S. and Shanbhag, C. Dapper, a large-scale distributed systems tracing infrastructure. Google Technical Report; http://research.google.com/archive/papers/dapper-2010-1.pdf.
13. Thrun, S. et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics 23, 9 (2006), 661−692.
14. Xu, M. et al. A “flight data recorder” for enabling full-system multiprocessor deterministic replay. In Proceedings of the 30th Annual International Symposium on Computer Architecture (San Diego, CA, June 2003).
15. Xu, W., Huang, L., Fox, A., Patterson, D. and Jordan, M. Detecting large-scale system problems by mining console logs. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (Big Sky, MT, Oct. 2009).
16. Yuan, D., Zheng, J., Park, S., Zhou, Y. and Savage, S. Improving software diagnosability via log enhancement. In Proceedings of Architectural Support for Programming Languages and Operating Systems (Newport Beach, CA, Mar. 2011); http://opera.ucsd.edu/paper/asplos11-logenhancer.pdf.

Adam Oliner is a postdoctoral scholar in electrical engineering and computer sciences at UC Berkeley, working with Ion Stoica and the AMP (Algorithms, Machines, and People) Lab.

Archana Ganapathi is a research engineer at Splunk, where she focuses on large-scale data analytics. She has spent much of her research career analyzing production datasets to model system behavior.

Wei Xu is a software engineer at Google, where he works on Google’s debug logging and monitoring infrastructure. His research interest is in cluster management and debugging.

© 2012 ACM 0001-0782/12/02 $10.00
