2010 IEEE Symposium on Security and Privacy

Outside the Closed World:
On Using Machine Learning For Network Intrusion Detection

Robin Sommer
International Computer Science Institute, and
Lawrence Berkeley National Laboratory

Vern Paxson
International Computer Science Institute, and
University of California, Berkeley

Abstract—In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

Keywords: anomaly detection; machine learning; intrusion detection; network security.

I. INTRODUCTION

Traditionally, network intrusion detection systems (NIDS) are broadly classified based on the style of detection they are using: systems relying on misuse detection monitor activity with precise descriptions of known malicious behavior, while anomaly detection systems have a notion of normal activity and flag deviations from that profile.¹ Both approaches have been extensively studied by the research community for many years. However, in terms of actual deployments, we observe a striking imbalance: in operational settings, of these two main classes we find almost exclusively only misuse detectors in use—most commonly in the form of signature systems that scan network traffic for characteristic byte sequences.

This situation is somewhat striking when considering the success that machine learning—which frequently forms the basis for anomaly detection—sees in many other areas of computer science, where it often results in large-scale deployments in the commercial world. Examples from other domains include product recommendation systems such as used by Amazon [3] and Netflix [4]; optical character recognition systems (e.g., [5], [6]); natural language translation [7]; and also spam detection, as an example closer to home [8].

In this paper we set out to examine the differences between the intrusion detection domain and other areas where machine learning is used with more success. Our main claim is that the task of finding attacks is fundamentally different from other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We believe that a significant part of the problem already originates in the premise, found in virtually any relevant textbook, that anomaly detection is suitable for finding novel attacks; we argue that this premise does not hold with the generality commonly implied. Rather, the strength of machine-learning tools is finding activity that is similar to something previously seen, without the need however to precisely describe that activity up front (as misuse detection must).

In addition, we identify further characteristics that our domain exhibits that are not well aligned with the requirements of machine learning. These include: (i) a very high cost of errors; (ii) lack of training data; (iii) a semantic gap between results and their operational interpretation; (iv) enormous variability in input data; and (v) fundamental difficulties for conducting sound evaluation. While these challenges may not be surprising for those who have been working in the domain for some time, they can be easily lost on newcomers. To address them, we deem it crucial for any effective deployment to acquire deep, semantic insight into a system's capabilities and limitations, rather than treating the system as a black box, as unfortunately often seen.

We stress that we do not consider machine learning an inappropriate tool for intrusion detection. Its use requires care, however: the more crisply one can define the context in which it operates, the better promise the results may hold. Likewise, the better we understand the semantics of the detection process, the more operationally relevant the system will be. Consequently, we also present a set of guidelines meant to strengthen future intrusion detection research.

¹ Other styles include specification-based [1] and behavioral detection [2]. These approaches focus respectively on defining allowed types of activity in order to flag any other activity as forbidden, and analyzing patterns of activity and surrounding context to find secondary evidence of attacks.

1081-6011/10 $26.00 © 2010 IEEE
DOI 10.1109/SP.2010.25

We structure the remainder of the paper as follows. In Section II we begin with a brief discussion of machine learning as it has been applied to intrusion detection in the past. In Section III we identify the specific challenges machine learning faces in our domain. In Section IV we present recommendations that we hope will help to strengthen future research, and we briefly summarize in Section V.

II. MACHINE LEARNING IN INTRUSION DETECTION

Anomaly detection systems find deviations from expected behavior: based on a notion of normal activity, they report deviations from that profile as alerts. The basic assumption underlying any anomaly detection system, that malicious activity exhibits characteristics not observed for normal usage, was first introduced by Denning in her seminal work on the host-based IDES system [9] in 1987. To capture normal activity, IDES (and its successor NIDES [10]) used a combination of statistical metrics and profiles. Since then, many more approaches have been pursued. Often they borrow schemes from the machine learning community, such as information theory [11], neural networks [12], support vector machines [13], genetic algorithms [14], artificial immune systems [15], and many more. Chandola et al. provide a survey of anomaly detection in [16].

By "machine learning" we mean algorithms that are first trained with reference input to "learn" its specifics (either supervised or unsupervised), to then be deployed on previously unseen input for the actual detection process. For ease of exposition we will use the term anomaly detection somewhat narrowly to refer to detection approaches that rely primarily on such machine learning. While our terminology is deliberately a bit vague, we believe it captures what many in the field intuitively associate with the term "anomaly detection", and in our discussion we focus on anomaly detection systems that utilize such machine learning approaches.

It can be surprising at first to realize that despite extensive academic research efforts on anomaly detection, the success of such systems in operational environments has been very limited: compared to the extensive body of research, anomaly detection has not obtained much traction in the "real world". Those systems found in operational deployment are most commonly based on statistical profiles of heavily aggregated traffic, such as Arbor's Peakflow [19] and Lancope's StealthWatch [20].² While highly helpful, such devices operate with a much more specific focus than the generality that research papers often envision. We see this situation as suggestive that many anomaly detection systems from the academic world do not live up to the requirements of operational settings.

² We note that for commercial solutions it is always hard to say what they do exactly, as specifics of their internals are rarely publicly available.

III. CHALLENGES OF USING MACHINE LEARNING

In other domains, the very same machine learning tools that form the basis of anomaly detection systems have proven to work with great success, and they are regularly used in commercial settings where large quantities of data render manual inspection infeasible. We believe that this "success discrepancy" arises because the intrusion detection domain exhibits particular characteristics that make the effective deployment of machine learning approaches fundamentally harder than in many other contexts.

In the following we identify these differences, with an aim of raising the community's awareness of the unique challenges anomaly detection faces when operating on network traffic. We note that our examples from other domains are primarily for illustration, as there is of course a continuous spectrum for many of the properties discussed (e.g., spam detection faces a similarly adversarial environment as intrusion detection does). We also note that we are network security researchers, not experts on machine learning, and thus we argue mostly at an intuitive level rather than attempting to frame our statements in the formalisms employed by the machine learning community. We believe, however, that these intuitive arguments match well with what a more formal analysis would yield, based on discussions with colleagues who work with machine learning on a daily basis.

Throughout the discussion, we frame our mindset around the goal of using an anomaly detection system effectively in the "real world", i.e., in large-scale, operational environments. We focus on network intrusion detection, as that is our main area of expertise, though we believe that similar arguments hold for host-based systems.

A. Outlier Detection

Anomaly detection approaches must grapple with a set of well-recognized problems [18]: the detectors tend to generate numerous false positives; attack-free data for training is hard to find; and attackers can evade detection by gradually teaching a system to accept malicious activity as benign. Our discussion in this paper aims to develop a different general point: that much of the difficulty with anomaly detection systems stems from using tools borrowed from the machine learning community in inappropriate ways.

Fundamentally, machine learning algorithms excel much better at finding similarities than at identifying activity that does not belong there: the classic machine learning application is a classification problem, rather than discovering meaningful outliers as required by an anomaly detection system [21]. Consider product recommendation systems such as that used by Amazon [3]: they employ collaborative filtering, matching each of a user's purchased (or positively rated) items with other similar products, where similarity is determined by products that tend to be bought together. While in some such applications one is also looking for outliers, such as monitoring credit card spending patterns for fraudulent activity, the data in those domains tends to be much more structured: the space for representing credit card transactions is of relatively low dimensionality and semantically much more well-defined than network traffic [17].
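To make the one-class setting described above concrete, the following sketch (our illustration, not anything from the paper) trains scikit-learn's OneClassSVM on synthetic "benign" feature vectors only, and then flags everything outside the learned region, whether or not it corresponds to an attack. The feature vectors and all numbers are invented.

```python
# Minimal contrast between classification and outlier detection: a one-class
# model sees only benign data at training time, mirroring the anomaly
# detection setting. Synthetic 2-D "features" stand in for NIDS features.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Training data: benign activity only.
benign = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# nu bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(benign)

# At detection time, anything outside the learned region is flagged (-1).
new_benign = rng.normal(loc=0.0, scale=1.0, size=(5, 2))
odd_points = rng.normal(loc=5.0, scale=0.5, size=(5, 2))
print(model.predict(new_benign))  # mostly +1: consistent with the profile
print(model.predict(odd_points))  # mostly -1: "anomalous" -- but merely
                                  # unseen, not necessarily an attack
```

Note how the model never receives a single attack specimen; it can only answer "does this resemble what I saw before?", which is exactly the limitation the text develops next.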

In some sense, outlier detection is also a classification problem: there are two classes, "normal" and "not normal", and the objective is determining which of the two more likely matches an observation. However, a basic rule of machine learning is that one needs to train a system with specimens of all classes and, crucially, that the number of representatives found in the training set for each class should be large [22]. Yet for anomaly detection aiming to find novel attacks, by definition one cannot train on the attacks of interest, but only on normal traffic, and thus one has only a single category to compare new activity against. In other words, one often winds up training an anomaly detection system with the opposite of what it is supposed to find, a setting certainly not ideal. Consider this quote from Witten et al. [21]:

"The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption. ... [The assumption] is not of much practical use in real-life problems because they rarely involve 'closed' worlds in which you can be certain that all cases are covered."

If one had a true classification problem with multiple alternatives to choose from, it would suffice to have a model just crisp enough to separate the classes. For anomaly detection aiming to find novel, previously unseen attacks, however, one needs a perfect model of normality for any reliable decision: the system faces a much different kind of question, with a much less clear answer.

The observation that machine learning works much better for such true classification problems then leads to the conclusion that anomaly detection is likely in fact better suited for finding variations of known attacks, rather than previously unknown malicious activity. In such settings, one can train the system with specimens of the attacks as they are known and with normal background traffic, and thus achieve a much more reliable decision process.

B. High Cost of Errors

In intrusion detection, the relative cost of any misclassification is extremely high compared to many other machine learning applications. A false positive requires spending expensive analyst time examining the reported incident only to eventually determine that it reflects benign underlying activity. As argued by Axelsson, even a very small rate of false positives can quickly render an NIDS unusable [23]. False negatives, on the other hand, have the potential to cause serious damage to an organization: even a single compromised system can seriously undermine the integrity of the IT infrastructure. An anomaly detection system thus faces a much more stringent limit on the number of errors that it can tolerate. It is illuminating to compare such high costs with the impact of misclassifications in other domains:

• Product recommendation systems can readily tolerate errors, as these do not have a direct negative impact. While for the seller a good recommendation has the potential to increase sales, a bad choice rarely hurts beyond a lost opportunity to have made a more enticing recommendation; if recommendations do not align well with the customers' interest, they will most likely just continue shopping rather than take a damaging step such as switching to a different seller. In fact, according to [3], many product pairs have no common customers, and systems deliberately make more unlikely guesses on occasion. As Greg Linden, author of the recommendation engine behind Amazon, said: "Recommendations involve a lot of guesswork. Our error rate will always be high." [24]

• OCR technology can likewise tolerate errors much more readily than an anomaly detection system. Statistical language models associate probabilities with results, allowing for postprocessing of a system's initial output [25], and spelling and grammar checkers are commonly employed to clean up results, weeding out the obvious mistakes. In addition, users have been trained not to expect perfect documents but to proofread results where accuracy is important; it is much quicker for a human eye to check the spelling of a word than to validate a report of, say, a web server compromise. Similarly, contemporary automated language translation operates at relatively large error rates [7], and while recent progress has been impressive, nobody would expect more than a rough translation.

• Spam detection is an example from the security domain where machine learning has been applied with success: originally proposed by Graham [8], Bayesian frameworks trained with large corpora of both spam and ham have evolved into a standard tool for reliably identifying unsolicited mail. Spam detection faces a highly unbalanced cost model, however: false positives (i.e., ham declared as spam) can prove very expensive, but false negatives (spam not identified as such) do not have a significant impact. This discrepancy can allow for "lopsided" tuning, leading to systems that emphasize finding obvious spam fairly reliably, yet exhibit less reliability for new variations hitherto unseen.

Overall, an anomaly detection system can benefit from none of these softening factors: the relative cost of any misclassification is extremely high, and thus even a performance that would be considered excellent in other domains can prove operationally unusable.
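The cost asymmetry compounds with the rarity of attacks. As a worked illustration of Axelsson's base-rate argument [23] (all numbers below are invented for illustration, not taken from the paper), even an excellent detector can produce alarms that are overwhelmingly false:

```python
# Base-rate fallacy: with rare attacks, even a tiny false positive rate
# swamps the true alarms. Illustrative rates only.
tpr = 0.99    # detection rate: P(alarm | attack)
fpr = 0.001   # false positive rate: P(alarm | benign)
base = 1e-5   # fraction of events that are actually attacks

# Bayes' rule: probability that a given alarm reflects a real attack.
p_attack_given_alarm = (tpr * base) / (tpr * base + fpr * (1 - base))
print(f"P(attack | alarm) = {p_attack_given_alarm:.3%}")  # roughly 1%

# Per 10 million events: expected true vs. false alarms.
events = 10_000_000
print("true alarms: ", int(tpr * base * events))        # ~99
print("false alarms:", int(fpr * (1 - base) * events))  # ~10,000
```

Under these (hypothetical) numbers, roughly 99 out of every 100 alarms waste analyst time, which is precisely the operational failure mode described above.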

C. Semantic Gap

Anomaly detection systems face a key challenge of transferring their results into actionable reports for the network operator. In many studies, we observe a lack of this crucial final step, which we term the semantic gap. Unfortunately, in the intrusion detection community we find a tendency to limit the evaluation of anomaly detection systems to an assessment of a system's capability to reliably identify deviations from the normal profile. While doing so indeed comprises an important ingredient for a sound study, we argue that one cannot stop at that point. After all, the objective of deploying an intrusion detection system is to find attacks, and thus a detector that does not allow for bridging this gap is unlikely to meet operational expectations.

The common experience with anomaly detection systems producing too many false positives supports this view: by definition, a machine learning algorithm does not make any mistakes within its model of normality, yet for the operator it is the results' interpretation that matters. Those familiar with anomaly detection are usually the first to acknowledge that such systems are not targeting to identify malicious behavior but just report what has not been seen before, whether benign or not. The basic challenge with regard to the semantic gap is understanding how the features the anomaly detection system operates on relate to the semantics of the network environment. Thus, after identifying the activity to report, a final step needs to interpret the results from an operator's point of view: "What does it mean?" Answering this question goes to the heart of the difference between finding "abnormal activity" and finding "attacks".

When addressing the semantic gap, one consideration is the incorporation of local security policies, and here we encounter a fundamental observation about operational networks: the degree to which they differ. Many security constraints are a site-specific property. Activity that is fine in an academic setting can be banned in an enterprise network, and even inside a single organization, department policies can differ widely. It is thus crucial for a NIDS to accommodate such differences. The natural strategy to address site-specifics is having the system "learn" them during training with normal traffic; however, one cannot simply assert this as the solution to the question of adapting a NIDS to different sites. One needs to explicitly demonstrate it, since for any given choice of features there will be a fundamental limit to the kind of determinations a NIDS can develop from them.

For example, an environment might tolerate peer-to-peer traffic as long as it is not used for distributing inappropriate content, a decision out of reach for any of today's systems. When examining only NetFlow records, it is hard to imagine how one might spot inappropriate content, and reporting just the usage of P2P applications is likely not particularly useful, unless the environment flat-out bans such usage. More often than not, security policies are not defined crisply on a technical level, and sometimes originate in the imprecise legal language found in the "terms of service" to which users must agree [26]. Yet such vague guidelines are actually common in many environments, and an anomaly detection system would need to have a notion of what is deemed "appropriate" or "egregiously large" in that particular environment.

As another example, consider the exfiltration of personally identifying information (PII). In many threat models, loss of PII ranks quite high, as it has the potential for causing major damage, either directly in financial terms or due to publicity or political fallout. On a technical level, some forms of PII are not that hard to describe: e.g., social security numbers as well as bank account numbers follow specific schemes that one can verify automatically.⁴ But an anomaly detection system developed in the absence of such descriptions has little hope of finding PII, and even given examples of PII and non-PII it will likely have difficulty distilling rules for accurately distinguishing one from the other.³

³ We note that in fact the literature holds some fairly amazing demonstrations of how much more information a dataset can provide than what we might intuitively expect: Wright et al. [27] infer the language spoken on encrypted VOIP sessions; Yen et al. [28] identify the particular web browser a client uses from flow-level data; Narayanan et al. [29] identify users in the anonymized Netflix datasets via correlation with their public reviews in a separate database; and Kumar et al. [30] determine from lossy and remote packet traces the number of disks attached to systems infected by the "Witty" worm, as well as their uptime to millisecond precision. However, these examples all demonstrate the power of exploiting structural knowledge informed by very careful examination of the particular domain of study; such results are not obtainable by simply expecting an anomaly detection system to develop inferences about "peculiar" activity.

⁴ With limitations of course: Japanese phone numbers look a lot like US social security numbers, as the Lawrence Berkeley National Laboratory noticed when monitoring for them in email [31].
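As a hypothetical illustration of the kind of structural check alluded to above (our sketch, not a tool from the paper), a few lines of validation can already encode the published format rules for US social security numbers, with no learning involved:

```python
# Rule-based PII check: US SSNs follow a documented scheme (area numbers
# 000, 666, and 900-999 are never issued; group and serial are nonzero).
# A simplified illustration, not a production PII scanner.
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def looks_like_ssn(text: str) -> bool:
    for area, group, serial in SSN_RE.findall(text):
        if area in ("000", "666") or area >= "900":
            continue  # invalid area number: not a real SSN
        if group == "00" or serial == "0000":
            continue  # group/serial must be nonzero
        return True
    return False

print(looks_like_ssn("my ssn is 219-09-9999"))  # True
print(looks_like_ssn("order 666-12-3456"))      # False (invalid area)
# Caveat per the footnote above: purely syntactic checks still misfire,
# e.g., on foreign phone numbers that happen to match the pattern [31].
```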

D. Diversity of Network Traffic

Network traffic often exhibits much more diversity than people intuitively expect, which leads to misconceptions about what anomaly detection technology can realistically achieve in operational environments. Even within a single network, the network's most basic characteristics, such as bandwidth, duration of connections, and application mix, can exhibit immense variability, rendering them unpredictable over short time intervals (seconds to hours). The widespread prevalence of strong correlations and "heavy-tailed" data transfers [32], [33] regularly leads to large bursts of activity. It is crucial to acknowledge that in networking such variability occurs regularly: it does not represent anything unusual. For an anomaly detection system, however, such variability can prove hard to deal with, as it makes it difficult to find a stable notion of "normality".

One way to reduce the diversity of Internet traffic is to employ aggregation. While highly variable over small-to-medium time intervals, traffic properties tend to greater stability when observed over longer time periods (hours to days, sometimes weeks). For example, in most networks time-of-day and day-of-week effects exhibit reliable patterns: if during today's lunch break the traffic volume is twice as large as during the corresponding time slots last week, that likely reflects something unusual occurring. Not coincidentally, one form of anomaly detection system we do find in operational deployment is the kind that operates on highly aggregated information, such as "volume per hour" or "connections per source" (see the sketch at the end of this subsection). On the other hand, incidents found by these systems tend to be rather noisy anyway, and often straightforward to find with other approaches (e.g., simple threshold schemes).

Finally, we note that traffic diversity is not restricted to packet-level features, but extends to application-layer information as well, both in terms of syntactic and semantic variability. Syntactically, protocol specifications often purposefully leave room for interpretation, and in heterogeneous traffic streams there is ample opportunity for corner-case situations to manifest (see the discussion of "crud" in [34]). Semantically, features derived from application protocols can be just as fluctuating as network-layer packets (see, e.g., [35], [36]).
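Returning to aggregation, here is a minimal sketch of the week-over-week comparison described above. The record format and the 2x threshold are invented for illustration:

```python
# Anomaly detection on aggregated traffic: compare this week's per-hour
# byte volume against the same weekday/hour slot in the previous week.
from collections import defaultdict
from datetime import datetime

def hourly_volume(flows):
    """Sum bytes per (weekday, hour) slot; flows are (datetime, bytes) pairs."""
    vols = defaultdict(int)
    for ts, nbytes in flows:
        vols[(ts.weekday(), ts.hour)] += nbytes
    return vols

def flag_bursts(this_week, last_week, factor=2.0):
    """Report slots whose volume grew by more than `factor` week over week."""
    return [slot for slot, vol in this_week.items()
            if slot in last_week and vol > factor * last_week[slot]]

last = hourly_volume([(datetime(2010, 4, 26, 12), 4_000_000)])  # Monday noon
this = hourly_volume([(datetime(2010, 5, 3, 12), 9_500_000)])   # Monday noon
print(flag_bursts(this, last))  # [(0, 12)]: lunch-hour volume more than doubled
```

As the text notes, such coarse-grained detection is robust precisely because aggregation averages out short-term diversity; the price is that it can only surface volume-level incidents.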
E. Difficulties with Evaluation

For an anomaly detection system, a thorough evaluation is particularly crucial to perform, as experience shows that many promising approaches turn out in practice to fall short of one's expectations. That said, devising sound evaluation schemes is not easy, and in fact often turns out to be more difficult than building the detector itself. We discuss evaluation challenges in terms of the difficulties for (i) finding the right data, and then (ii) interpreting results.

1) Difficulties of Data: Arguably the most significant challenge an evaluation faces is the lack of appropriate public datasets for assessing anomaly detection systems. In other domains, we often find standardized test suites available, or the possibility to collect an appropriate corpus, or both. For automatic language translation, "a large training set of the input-output behavior that we seek to automate is available to us in the wild" [37]. For spam detectors, dedicated "spam feeds" [38] provide large collections of spam free of privacy concerns; getting collections of "ham" is more difficult, however even a small number of private mail archives can already yield a large corpus [39]. For OCR, sophisticated methods have been devised to generate ground-truth automatically [40]. In the intrusion detection domain, we often have neither standardized test sets, nor any appropriate, readily available data.

Given the lack of publicly available data, it is natural to ask why we find such a striking gap in our community. The primary reason clearly arises from the data's sensitive nature: the inspection of network traffic can reveal highly sensitive information, including confidential or personal communications, an organization's business secrets, or its users' network access patterns. Any breach of such information can prove catastrophic not only for the organization itself, but also for affected third parties. In the face of such high risks, researchers frequently encounter insurmountable organizational and legal barriers when they attempt to provide datasets to the community.⁵

Given this difficulty, researchers have pursued two alternative routes in the past: simulation and anonymization. Data generated by simulation can have the major benefit of being free of sensitivity concerns. The two publicly available datasets that have provided something of a standardized setting in the past, the DARPA/Lincoln Labs packet traces [41], [42] and the KDD Cup dataset derived from them [43], are now a decade old and no longer adequate for any current study. The DARPA dataset contains multiple weeks of network activity from a simulated Air Force network, generated in 1998 and refined in 1999. Not only is this data synthetic, and no longer even close to reflecting contemporary attacks, but it also has been so extensively studied over the years that most members of the intrusion detection community deem it wholly uninteresting if a NIDS now reliably detects the attacks it contains. (Indeed, the DARPA data faced pointed criticisms not long after its release [44].) The KDD dataset represents a distillation of the DARPA traces into features for machine learning: not only does it inherit the shortcomings of the DARPA data, but the features have also turned out to exhibit unfortunate artifacts [45].

⁵ We note that the lack of public network data is not limited to the intrusion detection domain; we see effects similar to the overuse of the DARPA dataset in empirical network research. The ClarkNet-HTTP [46] dataset contains two weeks' worth of HTTP requests to ClarkNet's web server, recorded in 1995. While researchers at ClarkNet stopped using these logs for their own studies in 1997, in total researchers have used the traces for evaluations in more than 90 papers published between 1995 and 2007, 13 of these in 2007 [47]!

Internet traffic is already exceedingly difficult to simulate realistically by itself [48], and in general this remains out of reach; evaluating an anomaly detection system that strives to find novel attacks using only simulated activity will often lack any plausible degree of realism or relevance.

One can also sanitize captured data by, e.g., removing or anonymizing potentially sensitive information [49], [50]. However, despite intensive efforts [52], [53], publishing such datasets has garnered little traction to date, mostly, one suspects, for the fear that information can still leak even after scrubbing; on theoretical grounds this fear is well justified. Furthermore, even if a scrubbed dataset is available, one must consider the limits that anonymization imposes on one's evaluation, including artifacts that tend to be removed during the anonymization process [55]. Finally, it is crucial to realize that activity found in a small laboratory network differs fundamentally from the aggregate traffic seen upstream where NIDSs are commonly deployed [26]; conclusions drawn from analyzing a small environment cannot be generalized to settings of larger scale [51].

2) Mind the Gap: The semantic gap requires any study to perform an explicit final step that tends to be implicit in other domains: changing perspective to that of a user of the system. In addition to correctly identifying attacks, an anomaly detection system also needs to support the operator in understanding the activity and enabling a quick assessment of its impact. Suppose a system correctly finds a previously unknown web server exploit, yet only reports it as "HTTP traffic of host did not match the normal profile". The operator will spend significant additional effort figuring out what happened, even if already having sufficient trust in the system to take its alerts seriously. In other domains we do not see a comparable problem: if the detector reports a mail as spam, there is not much room for interpretation left.

We argue that when evaluating an anomaly detection system, understanding the system's semantic properties, i.e., the operationally relevant activity that it can detect as well as the blind spots every system will necessarily have, is much more valuable than identifying a concrete set of parameters for which the system happens to work best for a particular input. The specifics of network environments differ too widely to allow for predicting performance in other settings based on just numbers. With insight into the conceptual capabilities of a system, however, an operator can judge a detector's potential to support different operational concerns as required.

3) Adversarial Setting: A final characteristic unique to the intrusion detection domain concerns the adversarial environment such systems operate in. In other applications of machine learning we do not see a comparable problem: users of OCR systems won't try to conceal characters in the input, nor do Amazon customers have much incentive (or opportunity) to mislead the company's recommendation system [54]. Network intrusion detection, however, must grapple with a classic arms race: attackers and defenders each improve their tools in response to the other side devising new techniques, and attackers can adjust their activity to avoid detection, which poses a fundamentally hard problem for any NIDS [57]. Anomaly detection faces further risks due to the nature of the underlying machine learning: by definition, such systems look precisely for activity that does not match a normal profile, and attackers can attempt to craft malicious activity that does. In [58], Barreno et al. present a taxonomy of attacks on machine-learning systems; in [59], Fogla and Lee present an automated approach to mutate attacks so that they match a system's normal profile.

There is unfortunately no general answer to countering evasion. From a research perspective, assuming a threat model that includes sophisticated, targeted attackers, addressing evasion is a stimulating topic to explore. However, we argue that from a practical perspective, the impact of the adversarial setting is not necessarily as significant as one might initially believe. While evasion is not an easy task, exploiting the specifics of a machine learning implementation requires significant effort, time, and expertise on the attacker's side: in [56], Tan et al. demonstrated the amount of effort it can require just to understand a single parameter's impact on a conceptually simple anomaly detection system. Considering that most of today's attacks do not deliberately target handpicked victims, but simply exploit whatever sites they find vulnerable, the risk of an anomaly detector falling victim to a sophisticated evasion attack is small in many environments. It thus appears prudent to focus first on addressing the many other challenges in using machine learning effectively. We return to these points in Section IV-D1.

IV. RECOMMENDATIONS FOR USING MACHINE LEARNING

In light of the points developed above, we now formulate guidelines that we hope will help to strengthen future research on anomaly detection. We note that we view these guidelines as touchstones rather than as firm rules, and there is certainly room for further discussion within the wider intrusion detection community.

If we could give only one recommendation on how to improve the state of anomaly detection research, it would be: Understand what the system is doing. The intrusion detection community does not benefit any further from yet another study measuring the performance of some previously untried combination of a machine learning scheme with a particular feature set, applied to something like the DARPA dataset.

The nature of our domain is such that one can always find a variation that works slightly better than anything else in a particular setting; such results, on their own, do not constitute a definite contribution to the progress of the field. A common pitfall is starting with the premise to use machine learning (or, worse, a particular machine-learning approach) and then looking for a problem to solve. Such a starting point is biased and thus rarely leads to the best solution to a problem. As discussed by Duda et al. [22], there are "no context-independent [...] reasons to favor one learning [...] method over another" (emphasis added); they call this the "no free lunch theorem". Of course machine learning is not a "silver bullet" guaranteed to appropriately match a particular detection task, and this fact, while obvious for those working in the domain for some time, can be easily lost on newcomers. The point we wish to convey is that we are working in an area where insight matters much more than just numerical results. After identifying the activity to detect, the next step is a neutral assessment of what constitutes the right sort of tool for the task: in some cases it will be an anomaly detector, but in others a rule-based approach might hold more promise. When settling on a specific machine-learning algorithm as the appropriate tool, one should have an answer for why the particular choice promises to perform well in the intended setting, not only on strictly mathematical grounds, but considering domain-specific properties.

A. Understanding the Threat Model

Before starting to develop an anomaly detector, one needs to consider the anticipated threat model, as that establishes the framework for choosing trade-offs. Questions to address include:

• What kind of environment does the system target? Operation in a small network faces very different challenges than for a large enterprise or backbone network, and academic environments impose different requirements than commercial enterprises.

• What do missed attacks cost? Possible answers range from "very little" to "lethal." A site's determination will depend on its security demands as well as on other deployed attack detectors.

• What skills and resources will attackers have? If a site deems itself at high risk for explicit targeting by an attacker, it needs to anticipate much more sophisticated attacks than those incurred by potential victims of indiscriminate "background radiation" activity.

• What concern does evasion pose? The degree to which attackers might analyze defense techniques and seek to circumvent them determines the robustness requirements for any detector.

Operators can make informed decisions only when a system's threat model is clearly stated.

B. Keeping the Scope Narrow

It is crucial to have a clear picture of what problem a system targets: what specifically are the attacks to be detected? The more narrowly one can define the target activity, the better one can tailor a detector to its specifics and reduce the potential for misclassifications. There are no perfect detectors in intrusion detection, hence one always must settle for less-than-ideal solutions.

A good example for the kind of mindset we deem vital for sound anomaly detection studies is the work on web-based attacks by Kruegel et al. [60]. From the outset, the authors focus on a very specific class of attacks: exploiting web servers with malformed query parameters. Such attacks share conceptual similarities, yet differ in their specifics sufficiently to make writing signatures impractical, which convincingly argues for the need of anomaly detection. The authors clearly motivate the choice of features by comparing characteristics of benign and malicious requests (e.g., the typical length of a query's parameters tends to be short, while a successful buffer overflow attempt likely requires long shellcode sequences and padding; see the sketch at the end of this subsection). Laying out the landscape like this sets up the stage for a well-grounded study.

A substantive part of answering the Why? question is identifying the feature set the detector will work with: insight into the features' significance (in terms of the domain) goes a long way towards reliable detection. A common pitfall here is the temptation to base the feature set on a dataset that happens to be at hand for evaluation. If one cannot make a solid argument for the relation of the features to the attacks of interest, the resulting study risks foundering on serious flaws. Note that if existing systems target similar activity, it can be illuminating to understand their shortcomings in order to motivate how the proposed approach avoids similar problems.
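To convey the flavor of such a narrowly scoped, feature-motivated detector, the following is our simplified sketch in the spirit of the parameter-length feature discussed by Kruegel et al. [60], not the authors' actual system: it learns the mean and variance of benign query-parameter lengths and bounds the probability of a new length via the Chebyshev inequality.

```python
# Per-parameter length model: benign CGI parameters are short, so a
# 400-byte value (shellcode plus padding) stands out. Training lengths
# below are made up for illustration.
import statistics

class LengthModel:
    def __init__(self, benign_lengths):
        self.mean = statistics.mean(benign_lengths)
        self.var = statistics.pvariance(benign_lengths)

    def anomaly_score(self, length: int) -> float:
        """Chebyshev upper bound on P(|X - mean| >= deviation); small = suspicious."""
        dist = abs(length - self.mean)
        if dist == 0:
            return 1.0
        return min(1.0, self.var / (dist * dist))

model = LengthModel([4, 6, 5, 7, 5, 6, 4, 8, 5, 6])
print(model.anomaly_score(7))    # ~0.73: unremarkable
print(model.anomaly_score(400))  # ~1e-5: flagged as highly unusual
```

The point of the example is not the particular statistic but the mindset: each feature is tied to an argument about how benign and malicious inputs differ.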

C. Reducing the Costs

Per the discussion in Section III-B, it follows that one obtains enormous benefit from reducing the costs associated with using an anomaly detection system. Anecdotally, the number one complaint about anomaly detection systems is the excessive number of false positives they commonly report. An anomaly detection system does not necessarily make more mistakes than machine learning systems deployed in other domains, yet the high cost associated with each error often conflicts with effective operation. Thus, limiting false positives must be a top priority for any anomaly detection system, and one should choose trade-offs conservatively: to demonstrate that the system can achieve a tolerable amount of false positives without unacceptably compromising on its detection rate.

Likely the most important step towards fewer mistakes is reducing the system's scope, as discussed in Section IV-B; machine learning works best when trained using activity similar to that targeted for detection. The setup of the underlying machine-learning problem also has a direct impact on the number of false positives. For example, aggregating or averaging features over suitable time intervals often proves helpful, assuming the threat model allows for the coarser granularity. Likewise, one should carefully examine the features for their particular properties, as some will be more invariant than others: the set of destination ports a particular internal host contacts will likely fluctuate quite a bit for typical client systems, but we might often find the set of ports on which it accepts incoming connections to be stable over extended periods of time.

Even then, we might still be able to reduce the costs by providing the analyst with additional information designed to accelerate the manual inspection process, or by post-processing and cross-checking results with the support of additional tools. For example, Gu et al.'s "BotHunter" system uses a "statistical payload anomaly detection engine" as one tool among others (Snort signatures, and a typical scan detector), and a final stage correlates the output of all of them [61]. Anagnostakis et al.'s "Shadow Honeypots" validate the results of anomaly detectors with an instrumented copy of the protected system [62]. An anomaly detection system also requires a strategy to deal with the natural diversity of network traffic (Section III-D): while a system can in principle adapt to different environments through learning, demonstrating that it does requires evaluation using data from multiple sources, as we discuss next.

D. Evaluation

When evaluating an anomaly detection system, the primary objective should be to develop insight into the system's capabilities: What can it detect, and why? What can it not detect, and why not? How reliably does it operate? Where does it break? In our experience, the number one reason that conference submissions on anomaly detection fail arises from a failure to adequately explore these issues, particularly regarding a failure to examine whether simpler, non-machine-learning approaches might work equally well. Without a clear objective, no anomaly detection system can be meaningfully assessed. We discuss evaluation separately in terms of working with data, and interpreting results.

1) Working with data: The single most important step for sound evaluation concerns obtaining appropriate data to work with. As noted in Section III-E1, the DARPA and KDD Cup traces cannot serve as viable datasets. The "gold standard" here is obtaining access to a dataset containing real network traffic from as large an environment as possible⁶, and ideally multiple of these from different networks. Work with actual traffic greatly strengthens a study, and it often pays to consider potential data sources early on when designing the detector. In our experience, the best way to obtain such data is to provide a clear benefit in return to the network's operators, either by research that aims to directly help to improve operations, or by exchanging the access for work on an unrelated area of importance to the operators. Alternatively, one might need to plan strategically by sending a student or staff member for an extended stay, or by working with companies that control large quantities of the data of interest.

⁶ Results from large environments usually transfer directly to smaller networks; results collected in small environments rarely apply directly to large ones.

Note that the options for obtaining data differ with the setting. Honeypots [63] can provide data (usually) free of sensitivity concerns, though they cannot provide insight into how malicious traffic manifests differently from benign "background" traffic. Mediated trace access can also be a viable strategy [64]: rather than bringing the data to the experimenter, bring the experiment to the data; researchers send their analysis programs to data providers who then run them on their behalf and return the output. Note, however, that in contemporary research mediated access is suitable mostly for basic functionality tests and flow-level analyses, and less so for settings requiring low-latency real-time detection.

Once acquired, the datasets require a careful assessment of their characteristics: to interpret results correctly, one must not only understand what the data contains, but also how it is flawed. No dataset is perfect. Often measurements include artifacts that can impact the results (such as filtering or unintended loss), or unrelated noise that one can safely filter out if readily identified (e.g., an internal vulnerability scan run by the security department). See [65] for further discussion of issues relating to working with network data.

Finally, when evaluating an anomaly detection system, one almost always needs multiple datasets. First, to test whether the system can adapt to different environments through learning, evaluation requires data from multiple sources. Second, one must train the system with different data than used for ultimate evaluation. Subdividing is a standard approach for performing training and detection on different traffic even when one has only a single dataset from the examined environment. (This is a basic requirement for sound science, yet overlooked surprisingly often; see [21] for a set of standard techniques one can apply when having only limited data available.) A minimal sketch of such a split follows.
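The sketch below (our illustration; the record format is invented) splits a single trace by time rather than at random, so that the detector is always evaluated on traffic it has not been trained on:

```python
# Subdividing one dataset: train on the first week, evaluate on the second.
from datetime import datetime, timedelta

def time_split(flows, cutoff):
    """Train on everything before the cutoff; evaluate on the rest."""
    train = [f for f in flows if f["ts"] < cutoff]
    test = [f for f in flows if f["ts"] >= cutoff]
    return train, test

start = datetime(2010, 5, 1)
flows = [{"ts": start + timedelta(minutes=i), "bytes": 1000 + i}
         for i in range(14 * 24 * 60)]  # two weeks of per-minute records

train, test = time_split(flows, start + timedelta(days=7))
print(len(train), len(test))  # 10080 10080
```

A time-based cut is one of several defensible choices; the key property, as the next paragraph explains, is that whatever splitting one uses must not be biased with respect to the features the detector examines.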

Subdividing can work well if it is performed in advance of the actual study. Note, however, that the splitting must be unbiased with regards to the features the anomaly detection system examines; if so, one can select subsets of the available data via random sampling. For example, when operating on a per-flow basis, one should flow-sample the dataset rather than packet-sample it.

An important but often overlooked additional consideration is to include in an evaluation an inspection of the true positives and negatives as well, not only the misclassifications. This requires reliable ground-truth, which can be notoriously hard to obtain for an anomaly detection system that aims to spot previously unseen activity; it can be highly beneficial to consider the question of ground-truth very early in a study. One approach is to use a different mechanism to label the input, with the caveat that the evaluation can then only be as good as this other technique: one must collect ground-truth via a mechanism orthogonal (unrelated) to how the detector works. Sometimes a subset of the data can arguably be labeled in this fashion with high accuracy; provided that the subset is formed in a fashion independent from how the detector under development operates, one can extrapolate from performance on the subset to broader performance. Another solution is manual labeling, often however infeasible given the large amount of data a NIDS operates on; if faced with too many instances to manually examine, one can employ random sampling to select an appropriately sized subset for direct inspection. A final compromise is to inject a set of attacks deemed representative of the kind the anomaly detection system should detect. If one cannot find a sound way to obtain ground-truth for the evaluation at all, it becomes questionable to pursue the work further.

2) Understanding results: The most important aspect of interpreting results is to understand their origins. A sound evaluation frequently requires relating input and output on a very low level. This need arises from the opacity of the decision process: with machine learning, it is often not apparent what the system learned, even when it produces correct results. A classic illustration of this problem comes from a Pentagon project of the 1980s [66]: a neural network was trained to detect tanks in photos, and in the initial evaluation it was indeed able to correctly separate photos depicting tanks from those which did not. As later cross-checking revealed, however, the datasets used for training and evaluation shared a subtle property: photos of tanks were taken on a cloudy day, while all others had a blue sky. In this contrived example, the neural network had simply learned to detect the color of the sky.

Researchers thus need to manually examine false positives, and to understand their significance one needs to relate such false positives to the semantics of the traffic. It is hardly helpful to frame them in the mathematical terms of the detection logic ("activity exceeded the distance metric's threshold"); if when doing so one cannot determine why the system incorrectly reported a particular instance, this indicates a lack of insight into the anomaly detection system's operation. False negatives often prove harder to investigate than false positives, because they again require reliable ground-truth, as discussed above.

A separate consideration concerns how an evaluation compares results with other systems found in the literature, themselves often based on different principles. Doing so requires care to ensure fair treatment: as a first step, a comparative study needs to reproduce the results reported in the literature for the "foreign" system. Performance considerations also merit attention: many machine learning algorithms are best suited for offline batch operation, and less so for settings requiring low-latency real-time detection, whereas non-machine-learning detectors often prove significantly easier to implement in a streaming fashion even at high data rates.

Finally, the most convincing real-world test of any anomaly detection system is to solicit feedback from operators who run the system in their network. If they genuinely deem the system helpful in their daily routine, that provides compelling support for the study. The successful operation of an anomaly detection system typically requires significant experience with the particular system, as it needs to be tuned to the local setting, and that experience can prove cumbersome to collect if the underlying objective is instead to understand the new system. Nevertheless, such an assessment forms a crucial part of the story and merits careful attention.

We can in fact turn around the notion of understanding the origins of anomaly detection results: machine learning is often underappreciated as potentially providing a means to an end. That is, one employs it not to ultimately detect malicious activity, but rather to understand the significance of different features of benign and malicious activity, which can then eventually serve as the basis for a non-machine-learning detector. Changing the emphasis from gaining insight into how an anomaly detection system achieves its results to instead illuminating the problem space, machine learning can sometimes serve very effectively at the beginning of a study, in an approach similar in spirit to Principal Component Analysis, which aims to find which among a wide set of features contribute the most to particular clusters of activity [22]. As a simple example, consider spam classification: by examining which phrases a Bayesian classifier employs most effectively, one might discover that certain parts of messages (e.g., Received headers, subject lines, MIME tags) provide disproportionate detection power. One might then realize that a detector that directly examines those components, perhaps not employing any sort of Bayesian-based analysis but instead building on separate domain knowledge, can provide more effective classification by leveraging the structural properties of the domain.
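A hedged sketch of that inspection step (our illustration, with an invented toy corpus; a real study would use a large labeled mail archive):

```python
# Use a learning method to illuminate the problem space: train a naive
# Bayes spam classifier on token counts, then inspect which tokens carry
# the most weight toward a "spam" decision.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mails = [
    "cheap meds buy now",          # spam
    "buy cheap watches now",       # spam
    "meeting notes attached",      # ham
    "lunch at noon works for me",  # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(mails)
clf = MultinomialNB().fit(X, labels)

# Per-token log-probability ratio: large positive values mark the tokens
# the classifier relies on most heavily to call something spam.
ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
tokens = np.array(vec.get_feature_names_out())
for tok, r in sorted(zip(tokens, ratio), key=lambda p: -p[1])[:5]:
    print(f"{tok:10s} {r:+.2f}")
```

If such an analysis showed, say, header-derived tokens dominating, one could build a simpler, structural detector around headers directly, exactly the "means to an end" use of machine learning described above.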

V. CONCLUSION

Our work examines the surprising imbalance between the extensive amount of research on machine learning-based anomaly detection pursued in the academic intrusion detection community, versus the lack of operational deployments of such systems. We argue that this discrepancy stems in large part from specifics of the problem domain that make it significantly harder to apply machine learning effectively than in many other areas of computer science where such schemes are used with greater success. The domain-specific challenges include: (i) the need for outlier detection, while machine learning instead performs better at finding similarities; (ii) very high costs of classification errors, which render error rates as encountered in other domains unrealistic; (iii) a semantic gap between detection results and their operational interpretation; (iv) the enormous variability of benign traffic, making it difficult to find stable notions of normality; (v) significant challenges with performing sound evaluation; and (vi) the need to operate in an adversarial setting. While none of these render machine learning an inappropriate tool for intrusion detection, we deem their unfortunate combination in this domain as a primary reason for its lack of success.

To overcome these challenges, we provide a set of guidelines for applying machine learning to network intrusion detection. In particular, we argue for the importance of obtaining insight into the operation of an anomaly detection system in terms of its capabilities and limitations from an operational point of view. It is crucial to acknowledge that the nature of the domain is such that one can always find schemes that yield marginally better ROC curves than anything else has for a specific given setting. Such results, however, do not contribute to the progress of the field without any semantic understanding of the gain.

We hope for this discussion to contribute to strengthening future research on anomaly detection by pinpointing the fundamental challenges it faces. We do not consider our discussion as final, and we look forward to the intrusion detection community engaging in an ongoing dialog on this topic.

ACKNOWLEDGMENTS

We would like to thank Gerald Friedland for discussions and feedback, as well as the anonymous reviewers for their valuable suggestions. This work was supported in part by NSF Awards NSF-0433702 and CNS-0905631. Opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work was also supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

REFERENCES

[1] C. Ko, M. Ruschitzka, and K. Levitt, "Execution Monitoring of Security-Critical Programs in Distributed Systems: A Specification-based Approach," in Proc. IEEE Symposium on Security and Privacy, 1997.
[2] D. Ellis, J. Aiken, K. Attwood, and S. Tenaglia, "A Behavioral Approach to Worm Detection," in Proc. ACM CCS WORM Workshop, 2004.
[3] G. Linden, B. Smith, and J. York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[4] J. Bennett and S. Lanning, "The Netflix Prize," in Proc. KDD Cup and Workshop, 2007.
[5] R. Smith, "An Overview of the Tesseract OCR Engine," in Proc. International Conference on Document Analysis and Recognition, 2007.
[6] L. Vincent, "Google Book Search: Document Understanding on a Massive Scale," in Proc. International Conference on Document Analysis and Recognition, 2007.
[7] F. Och and H. Ney, "The Alignment Template Approach to Statistical Machine Translation," Comput. Linguist., vol. 30, no. 4, pp. 417–449, 2004.
[8] P. Graham, "A Plan for Spam," in Hackers & Painters. O'Reilly, 2004.
[9] D. Denning, "An Intrusion-Detection Model," IEEE Transactions on Software Engineering, vol. 13, no. 2, pp. 222–232, 1987.
[10] H. Javitz and A. Valdes, "The NIDES Statistical Component: Description and Justification," SRI International, Tech. Rep., 1993.
[11] W. Lee and D. Xiang, "Information-Theoretic Measures for Anomaly Detection," in Proc. IEEE Symposium on Security and Privacy, 2001.
[12] Z. Zhang, J. Li, C. Manikopoulos, J. Jorgenson, and J. Ucles, "HIDE: a Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification," in Proc. IEEE Workshop on Information Assurance and Security, 2001.
[13] W. Hu, Y. Liao, and V. Vemuri, "Robust Anomaly Detection Using Support Vector Machines," in Proc. International Conference on Machine Learning, 2003.
[14] C. Sinclair, L. Pierce, and S. Matzner, "An Application of Machine Learning to Network Intrusion Detection," in Proc. Computer Security Applications Conference, 1999.
[15] S. Hofmeyr, "An Immunological Model of Distributed Detection and its Application to Computer Security," Ph.D. dissertation, University of New Mexico, 1999.
[16] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," University of Minnesota, Tech. Rep., 2007.

REFERENCES

[1] C. Sinclair, L. Pierce, and S. Matzner, "An Application of Machine Learning to Network Intrusion Detection," in Proc. Computer Security Applications Conference, 1999.
[2] D. Denning, "An Intrusion-Detection Model," IEEE Transactions on Software Engineering, vol. 13, no. 2, pp. 222–232, 1987.
[3] G. Linden, B. Smith, and J. York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[4] J. Bennett and S. Lanning, "The Netflix Prize," in Proc. KDD Cup and Workshop, 2007.
[5] L. Vincent, "Google Book Search: Document Understanding on a Massive Scale," in Proc. International Conference on Document Analysis and Recognition, 2007.
[6] R. Smith, "An Overview of the Tesseract OCR Engine," in Proc. International Conference on Document Analysis and Recognition, 2007.
[7] F. Och and H. Ney, "The Alignment Template Approach to Statistical Machine Translation," Computational Linguistics, vol. 30, no. 4, pp. 417–449, 2004.
[8] P. Graham, "A Plan for Spam," in Hackers & Painters. O'Reilly, 2004.
[9] D. Ellis, J. Aiken, K. Attwood, and S. Tenaglia, "A Behavioral Approach to Worm Detection," in Proc. ACM CCS WORM Workshop, 2004.
[10] H. Javitz and A. Valdes, "The NIDES Statistical Component: Description and Justification," SRI International, Tech. Rep., 1993.
[11] W. Hu, Y. Liao, and V. Vemuri, "Robust Anomaly Detection Using Support Vector Machines," in Proc. International Conference on Machine Learning, 2003.
[12] Z. Zhang, J. Li, C. Manikopoulos, J. Jorgenson, and J. Ucles, "HIDE: a Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification," in Proc. IEEE Workshop on Information Assurance and Security, 2001.
[13] W. Lee and D. Xiang, "Information-Theoretic Measures for Anomaly Detection," in Proc. IEEE Symposium on Security and Privacy, 2001.
[14] C. Ko, M. Ruschitzka, and K. Levitt, "Execution Monitoring of Security-Critical Programs in Distributed Systems: A Specification-based Approach," in Proc. IEEE Symposium on Security and Privacy, 1997.
[15] S. Hofmeyr, "An Immunological Model of Distributed Detection and its Application to Computer Security," Ph.D. dissertation, University of New Mexico, 1999.
[16] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," University of Minnesota, Tech. Rep., 2007.
[17] R. Sommer, "Viable Network Intrusion Detection in High-Performance Environments," Ph.D. dissertation, TU München, 2005.
[18] C. Gates and C. Taylor, "Challenging the Anomaly Detection Paradigm: A Provocative Discussion," in Proc. Workshop on New Security Paradigms, 2006.
[19] "Peakflow SP," http://www.arbornetworks.com/en/peakflow-sp.html.
[20] "StealthWatch," http://www.lancope.com/products/.
[21] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann, 2005.
[22] R. Duda, P. Hart, and D. Stork, Pattern Classification (2nd edition). Wiley Interscience, 2001.
[23] S. Axelsson, "The Base-Rate Fallacy and Its Implications for the Difficulty of Intrusion Detection," in Proc. ACM Conference on Computer and Communications Security, 1999.
[24] "Make Data Useful," Greg Linden, Data Mining Seminar, Stanford University, 2006, http://glinden.blogspot.com/2006/12/slides-from-my-talk-at-stanford.html.
[25] C. Wright, L. Ballard, F. Monrose, and G. Masson, "Language Identification of Encrypted VoIP Traffic: Alejandra y Roberto or Alice and Bob?" in Proc. USENIX Security Symposium, 2007.
[26] R. Lippmann, R. Cunningham, D. Fried, I. Graf, K. Kendall, S. Webster, and M. Zissman, "Results of the DARPA 1998 Offline Intrusion Detection Evaluation," in Proc. Recent Advances in Intrusion Detection, 1999.
[27] R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das, "The 1999 DARPA Off-line Intrusion Detection Evaluation," Computer Networks, vol. 34, no. 4, pp. 579–595, 2000.
[28] J. McHugh, "Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratories," ACM Transactions on Information and System Security, vol. 3, no. 4, November 2000.
[29] M. Mahoney and P. Chan, "An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection," in Proc. Recent Advances in Intrusion Detection, 2003.
[30] A. Kumar, V. Paxson, and N. Weaver, "Exploiting Underlying Structure for Detailed Reconstruction of an Internet-scale Event," in Proc. ACM SIGCOMM Internet Measurement Conference, 2005.
[31] Jim Mellander, Lawrence Berkeley National Laboratory, via personal communication, 2008.
[32] W. Willinger, M. Taqqu, R. Sherman, and D. Wilson, "Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level," IEEE/ACM Transactions on Networking, vol. 5, no. 1, 1997.
[33] A. Feldmann, A. Gilbert, and W. Willinger, "Data Networks As Cascades: Investigating the Multifractal Nature of Internet WAN Traffic," in Proc. ACM SIGCOMM, 1998.
[34] V. Paxson, "Bro: A System for Detecting Network Intruders in Real-Time," Computer Networks, vol. 31, no. 23–24, pp. 2435–2463, 1999.
[35] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, "YouTube Traffic Characterization: A View From the Edge," in Proc. ACM SIGCOMM Internet Measurement Conference, 2007.
[36] A. Nazir, S. Raza, and C.-N. Chuah, "Unveiling Facebook: A Measurement Study of Social Network Based Applications," in Proc. ACM SIGCOMM Internet Measurement Conference, 2008.
[37] A. Halevy, P. Norvig, and F. Pereira, "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, vol. 24, no. 2, 2009.
[38] D. Anderson, C. Fleizach, S. Savage, and G. Voelker, "Spamscatter: Characterizing Internet Scam Hosting Infrastructure," in Proc. USENIX Security Symposium, 2007.
[39] G. Cormack and T. Lynam, "Online Supervised Spam Filter Evaluation," ACM Transactions on Information Systems, vol. 25, no. 3, 2007.
[40] J. van Beusekom, F. Shafait, and T. Breuel, "Automated OCR Ground Truth Generation," in Proc. Document Analysis Systems (DAS), 2008.
[41] R. Bolton and D. Hand, "Statistical Fraud Detection: A Review," Statistical Science, vol. 17, no. 3, 2002.
[42] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: A General and Efficient Weighted Finite-State Transducer Library," in Proc. International Conference on Implementation and Application of Automata, 2007.
[43] "KDD Cup Data," http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[44] S. Floyd and V. Paxson, "Difficulties in Simulating the Internet," IEEE/ACM Transactions on Networking, vol. 9, no. 4, 2001.
[45] A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets," in Proc. IEEE Symposium on Security and Privacy, 2008.
[46] "ClarkNet-HTTP," http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html.
[47] Martin Arlitt, via personal communication, 2008.
[48] T.-F. Yen, X. Huang, F. Monrose, and M. Reiter, "Browser Fingerprinting from Coarse Traffic Summaries: Techniques and Implications," in Proc. Conference on Detection of Intrusions and Malware & Vulnerability Assessment, 2009.
[49] "tcpdpriv," http://ita.ee.lbl.gov/html/contrib/tcpdpriv.html.
[50] J. Xu, J. Fan, M. Ammar, and S. Moon, "On the Design and Performance of Prefix-Preserving IP Traffic Trace Anonymization," in Proc. ACM SIGCOMM Internet Measurement Workshop, 2001.
[51] R. Pang, M. Allman, V. Paxson, and J. Lee, "The Devil and Packet Trace Anonymization," Computer Communication Review, vol. 36, no. 1, January 2006.
[52] "The Internet Traffic Archive (ITA)," http://ita.ee.lbl.gov.
[53] "PREDICT," http://www.predict.org.
[54] S. Coull, C. Wright, F. Monrose, M. Collins, and M. Reiter, "Playing Devil's Advocate: Inferring Sensitive Information from Anonymized Network Traces," in Proc. Network and Distributed System Security Symposium, 2007.
[55] K. Tan and R. Maxion, "'Why 6?' Defining the Operational Limits of Stide, an Anomaly-Based Intrusion Detector," in Proc. IEEE Symposium on Security and Privacy, 2002.
[56] K. Killourhy and R. Maxion, "Toward Realistic and Artifact-Free Insider-Threat Data," in Proc. Computer Security Applications Conference, 2007.
[57] T. Ptacek and T. Newsham, "Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection," Secure Networks, Inc., Tech. Rep., January 1998.
[58] P. Fogla and W. Lee, "Evading Network Anomaly Detection Systems: Formal Reasoning and Practical Techniques," in Proc. ACM Conference on Computer and Communications Security, 2006.
[59] M. Barreno, B. Nelson, R. Sears, A. Joseph, and J. Tygar, "Can Machine Learning Be Secure?" in Proc. ACM Symposium on Information, Computer and Communications Security, 2006.
[60] C. Kruegel and G. Vigna, "Anomaly Detection of Web-based Attacks," in Proc. ACM Conference on Computer and Communications Security, 2003.
[61] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee, "BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation," in Proc. USENIX Security Symposium, 2007.
[62] K. Anagnostakis, S. Sidiroglou, P. Akritidis, K. Xinidis, E. Markatos, and A. Keromytis, "Detecting Targeted Attacks Using Shadow Honeypots," in Proc. USENIX Security Symposium, 2005.
[63] N. Provos and T. Holz, Virtual Honeypots: From Botnet Tracking to Intrusion Detection. Addison Wesley, 2007.
[64] P. Mittal, V. Paxson, R. Sommer, and M. Winterrowd, "Securing Mediated Trace Access Using Black-box Permutation Analysis," in Proc. ACM Workshop on Hot Topics in Networks, 2009.
[65] V. Paxson, "Strategies for Sound Internet Measurement," in Proc. ACM SIGCOMM Internet Measurement Conference, 2004.
[66] N. Fraser, "Neural Network Follies," 1998, http://neil.fraser.name/writing/tank.