February 2011
Application of Machine Learning and Crowdsourcing to Detection of Cybersecurity Threats
Abstract
We are applying machine learning and crowdsourcing to cybersecurity, with the goal of
developing a toolkit for the detection of complex cyber threats, which are often undetectable by
traditional tools. It will serve as an “extra layer of armor” that supplements the standard
defenses. The initial results include (1) an architecture for sharing security warnings among users
and (2) machine learning techniques for identifying malicious websites. The public release of the
system is available at http://cyberpsa.com.
Introduction
We can broadly divide cybersecurity threats into two categories. The first comprises
vulnerabilities caused by factors outside the end user’s control, such as security flaws in
applications and protocols. The traditional remedies include using firewalls and antivirus
software, distributing patches that fix newly discovered problems, and amending protocols.
While the defense against such threats is still an ongoing battle, software engineers have been
effective in countering most of them.
The second category, which has historically received less attention, includes the problems
caused by “careless” user actions. For example, an attacker may convince inexperienced users to
install a fake antivirus, which in reality corrupts their computers. As another example, an
attacker may use deceptive email and web advertisements, as well as phishing [Kumaraguru et
al., 2009], to trick users into falling victim to scams that go beyond traditional software
attacks, such as disclosing sensitive information or paying for fake product offers. The number of
such threats has grown in recent years, as more and more people conduct their daily activities
through the Internet, thus providing attackers with opportunities to exploit user naïveté.
While web browsers and operating systems now include some defenses against such threats, they
are often insufficient. The attackers have been effective in finding ways to trick the users into
bypassing the security barriers. The detection of such threats is difficult for both humans and
automated systems because malicious websites tend to look legitimate and use effective
deception techniques.
Our approach to these threats is based on crowdsourcing
combined with machine learning and natural language processing. We are working on a
distributed system that enables users to report threats spotted on the web, and applies machine
learning to integrate their reports. This idea is analogous to user-review mechanisms, where
people share their experiences with specific products. The novel characteristics of the developed
system include the following.
• Integration with crowdsourced question answering, similar to Yahoo Answers.
From the user’s point of view, the developed system acts as a personal security assistant. It
gathers relevant information, learns from the user’s feedback, and helps the user to identify
potential threats.
The initial work has led to the development of a crowdsourcing architecture, as well as
machine learning algorithms for detection of two specific security threats: scam websites and
cross-site request forgeries.
Crowdsourcing architecture
We have developed an architecture, called SmartNotes, that helps users to share their experience
related to web threats, and integrates the wisdom gathered from all its users. It enables users to
rate websites, post comments, and ask and answer related questions. Furthermore, it combines
these user reports with automated threat analysis.
User interface: The system’s main screen (Figure 1) allows making comments and
asking questions about a specific website. The user can select a rating (positive, neutral, or
negative), add comments, and post questions to be answered by other users. By default, the
comments are for the currently open web page, but the user can also post comments for the entire
web domain. For instance, when she is looking at a specific product on Amazon, she may enter
notes about that product page or about the entire amazon.com service. The user can specify
whether her notes are private, visible to her friends, or public. When the user visits a webpage,
she can read notes by others about it. She can also search the entire database of notes about all
webpages. In addition, the user can invoke automated scam detection, which calculates the
probability that the current website is malicious.
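The notes described above carry three independent attributes: a target scope (page or domain), a visibility level (private, friends, or public), and an optional rating. A minimal sketch of such a data model, with hypothetical field names that are not taken from the actual SmartNotes schema, might look like:

```python
# Illustrative data model for shared notes; all names are assumptions for the
# sake of the sketch, not the actual SmartNotes schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Note:
    author: str
    url: str
    scope: str                     # "page" or "domain"
    visibility: str                # "private", "friends", or "public"
    rating: Optional[str] = None   # "positive", "neutral", or "negative"
    text: str = ""

def visible_to(note, viewer, friends):
    """Decide whether `viewer` may read `note` under the three visibility levels."""
    if note.visibility == "public":
        return True
    if note.visibility == "friends":
        return viewer == note.author or viewer in friends.get(note.author, set())
    return viewer == note.author   # private: author only

note = Note(author="alice", url="https://amazon.com/some-product",
            scope="domain", visibility="friends", rating="positive",
            text="Reliable seller.")
friends = {"alice": {"bob"}}
print(visible_to(note, "bob", friends))    # True: bob is alice's friend
print(visible_to(note, "carol", friends))  # False: carol is not
```

The three visibility levels map naturally onto a single access check, which keeps the sharing rules in one place regardless of whether a note covers a page or a whole domain.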
Main components: The distributed system consists of three components (solid boxes in
Figure 2), which communicate through HTTP requests (dashed lines in Figure 2).
• SmartNotes browser extension is written in JavaScript and uses the Chrome extension API to
interact with the browser.
• SmartNotes web service is written in C#.NET and includes a SQL Server database. It exposes
methods for reading and writing notes, and supports other actions available to the users, such
as rating websites and posting questions.
• Host Analyzer web service is also written in C#.NET. It includes all data-analysis algorithms,
such as scam detection, parsing of user comments, and integration of user opinions with the
automated analysis.
Scam detection
Web scam is fraudulent or intentionally misleading information posted on the web, such as false
promises to help find work at home or cure various diseases, usually with the purpose of tricking
people into sending money or disclosing sensitive information. The challenge of detecting such
scams is largely unaddressed. For legal reasons, search engines are reluctant to block scammers
unless they have specific strong proof of fraudulent activity, such as confirmed instances of
malware distribution. The initial research on scam detection includes the work of Anderson et
al. [2007], who analyzed spam email to extract addresses of scam websites; and that of Cormack
et al. [2010], who addressed the problem of preventing scammers from tricking search engines
Currently, the most common approach to fighting web scam is blacklisting. Several
online services maintain lists of suspicious websites, usually compiled through user reports. For
example, Web of Trust (mywot.com) allows users to rate webpages on vendor reliability,
trustworthiness, privacy, and child safety, and displays the average ratings. As another example,
hosts-file.net and spamcop.net provide databases of malicious sites. Blacklisting, however, has
several limitations. In particular, a list may not include recently created scam websites, as well as
old sites moved to new domain names. Also, it may mistakenly include legitimate sites because
of inaccurate or malicious user reports.
We are developing a system that reduces the omissions and biases in blacklists by
supplementing them with measurements that are hard to manipulate. We have created a web
service, called Host Analyzer
(Figure 2), for gathering information about websites from various trusted online sources that
(Figure 2), for gathering information about websites from various trusted online sources that
provide such data. It currently collects forty-three features describing websites from eleven
sources. Examples of these features include ratings and traffic ranks for a given website;
geographic location of the website server; and the number of positive and negative comments
about the website. The system applies machine learning to these features to
evaluate the chances that a specific website poses a security threat. The learning module
constructs a classifier based on a database of known legitimate and malicious websites, and the
system then uses it to estimate the probability that previously unseen websites are malicious. We
have tested it using ten-fold cross-validation on a database of 837 manually labeled websites.
The precision of this technique is 98.0%; the recall is 98.1%; and the AUC measure, defined as
the area under the ROC curve, is 98.6%. Intuitively, these results mean that the system correctly
classifies about 98% of websites, with few false alarms.
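The evaluation protocol above can be sketched in a few lines. The report does not specify the classifier, though its citation of Schmidt et al. on L1 regularization suggests an L1-regularized logistic regression; the sketch below assumes scikit-learn and synthetic feature data in place of the real 837-site, 43-feature database, so the printed metrics are not the reported ones.

```python
# Sketch of 10-fold cross-validated evaluation of a website classifier.
# Assumptions: scikit-learn; synthetic data standing in for the real database;
# L1-regularized logistic regression as a plausible (not confirmed) model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
n_sites, n_features = 837, 43            # mirrors the database described above
X = rng.normal(size=(n_sites, n_features))
w = rng.normal(size=n_features)
y = (X @ w + rng.normal(scale=0.5, size=n_sites) > 0).astype(int)  # 1 = malicious

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# Out-of-fold probability estimates over all ten folds:
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

print(f"precision = {precision_score(y, pred):.3f}")
print(f"recall    = {recall_score(y, pred):.3f}")
print(f"AUC       = {roc_auc_score(y, proba):.3f}")
```

Using out-of-fold predictions, as here, gives each website a score from a model that never saw it during training, which is what makes the precision, recall, and AUC figures honest estimates of performance on unseen sites.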
Detection of cross-site request forgery
A cross-site request forgery (CSRF) is an attack through a web browser, in which a malicious
website uses a trusted browser session to send unauthorized requests to a target site [Barth et al.,
2008]. For example, Zeller and Felten [2008] described CSRF attacks that stole the user’s email
address and performed unauthorized money transfers. When a user visits a website, the browser
creates a session cookie that accompanies all subsequent requests from all browser windows
while the session is active, thus enabling web applications to maintain the state of their
interaction with the user. The browser provides the session information even if the request is
generated by a different website. If the user has an active session with site1.com, all requests sent
to site1.com include that information. If the user opens a (possibly malicious) site2.com, which
generates a (possibly unauthorized) request to site1.com, it will also include the site1.com session
information. This functionality is essential because some sites, such as advertising and payment-
processing servers, maintain the transaction state of requests from multiple domains; however, it
creates the vulnerability exploited by CSRF. A web application cannot determine whether a
request comes from the user or from a malicious site, since it contains the same session
information in both cases. The existing defenses require the developers of web applications to
adopt certain protocols. While these defenses are effective, developers occasionally fail to adopt
them, leaving some applications unprotected.
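The ambient-cookie behavior described above can be illustrated with a toy model of a browser, in which cookies are keyed by the target domain and attached to every request to that domain regardless of which page initiated it. All names here are illustrative, not part of any real browser API.

```python
# Toy model of ambient session cookies: the cookie for a target domain is sent
# with every request to that domain, no matter which site originated the
# request. This is exactly the gap that CSRF exploits.
class ToyBrowser:
    def __init__(self):
        self.cookies = {}  # target domain -> session cookie

    def log_in(self, domain):
        self.cookies[domain] = f"session-for-{domain}"

    def request(self, origin, target):
        # The cookie for `target` is attached even when `origin` is another site.
        return {"origin": origin, "target": target,
                "cookie": self.cookies.get(target)}

browser = ToyBrowser()
browser.log_in("site1.com")

# Legitimate request initiated by site1.com itself:
print(browser.request("site1.com", "site1.com")["cookie"])  # session-for-site1.com

# A request forged by site2.com carries the very same session cookie:
print(browser.request("site2.com", "site1.com")["cookie"])  # session-for-site1.com
```

Because both requests arrive with identical session information, site1.com has no way to tell them apart, which is the ambiguity the detection technique below tries to resolve on the browser side.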
Figure 3. Example graph of cross-site requests, where the nodes are domains and the edges are
requests. The solid nodes are the domains visited by the user, whereas the unfilled nodes are
accessed indirectly through cross-site requests. The dashed lines are CSRF attacks.
We are working on a machine learning technique for enhancing standard defenses, which
prevents attacks against unprotected sites by spotting malicious HTTP requests. It learns patterns
of legitimate requests, detects deviations from these patterns, and warns the user about
suspicious requests.
We represent patterns of requests by a directed graph, where the nodes are web domains
and the edges are HTTP requests. We show an example in Figure 3, where the solid nodes are
domains visited by the user, and the unfilled nodes are domains accessed indirectly, through
requests from the visited domains. In the example of Figure 3, all sites except Bank show
advertising materials from the Ads server. Furthermore, both Email and Bank show a news bar,
which requires cross-site requests to News. A CSRF attack occurs when the Malicious site sends
an unauthorized request to one of the other domains, shown by the dashed lines in the figure.
If there are no active browser sessions when the system starts building the graph, a CSRF
attack cannot occur on the first visit to a website. Therefore, when the system adds a new node,
its first incoming edge is a legitimate request. In the naïve version, we allow no incoming
requests for the directly accessed (solid) nodes and only one incoming edge for every indirectly
accessed (unfilled) node. If the system detects requests that do not match this pattern, it considers
them suspicious. In the example of Figure 3, the system would only allow requests from the solid
nodes to their “nearby” unfilled nodes within the same “corner” of the graph. It would give
warnings for requests between different corners, such as a request from Bank to News. The
justification for this approach comes from the observation that most legitimate requests are due
to the web application design in which the contents are distributed across servers.
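The naïve rule can be sketched as follows: the system records a directed graph of observed requests, treats each indirectly accessed node's first incoming edge as legitimate, and flags any incoming edge to a directly visited node or any second incoming source at an indirect node. Class and domain names are illustrative.

```python
# Sketch of the naive detection rule described above (illustrative names):
# solid (directly visited) nodes allow no incoming cross-site requests; each
# unfilled (indirectly accessed) node allows a single incoming edge.
class RequestGraph:
    def __init__(self):
        self.visited = set()   # domains the user opened directly ("solid" nodes)
        self.incoming = {}     # domain -> set of source domains seen so far

    def visit(self, domain):
        self.visited.add(domain)

    def observe(self, source, target):
        """Record a cross-site request; return True if it looks suspicious."""
        sources = self.incoming.setdefault(target, set())
        if target in self.visited:
            suspicious = True  # no incoming requests allowed for solid nodes
        else:
            # The first incoming edge is legitimate; a repeat of a known edge
            # is fine; a second distinct source is suspicious.
            suspicious = len(sources) >= 1 and source not in sources
        sources.add(source)
        return suspicious

g = RequestGraph()
g.visit("email.com"); g.visit("bank.com")
print(g.observe("email.com", "ads.com"))       # False: first edge into ads.com
print(g.observe("bank.com", "ads.com"))        # True: a second incoming domain
print(g.observe("malicious.com", "bank.com"))  # True: request into a solid node
```

As the paragraph above notes, this rule alone is too strict: shared servers such as ad networks legitimately receive edges from many domains, which motivates the refinement that follows.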
While the naïve approach is effective for spotting attacks, it produces numerous false
positives, that is, warnings for legitimate requests. In the example of Figure 3, it would produce
warnings when multiple sites generate requests to Ads and News. To prevent such false positives,
we use the observation that, when a site receives legitimate requests from multiple domains, it
usually receives requests from a large number of domains. Thus, the most suspicious case is
when a domain receives requests from two or three sites, whereas the situation when it receives
requests from tens of sites is usually normal. The system thus identifies domains with a large
number of incoming edges and does not give warnings for HTTP requests sent to them. We also
apply two additional techniques.
• Trusted domains: The system uses the website evaluation technique described
in the previous section, and does not warn about any requests from trustworthy domains.
• Sensitive data: The system identifies sessions that are likely to involve sensitive data, and
uses stricter thresholds for spotting potentially malicious requests that affect these sessions. It
views a session as sensitive if either (1) the user has entered a password when starting this
session or (2) the related website uses the HTTPS protocol rather than HTTP.
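The popularity refinement and the sensitivity rule above can both be sketched as small predicates. The popularity cut-off and the session field names below are assumed values for illustration, not figures taken from the report.

```python
# Illustrative sketches of the two refinements above; the threshold and the
# session field names are assumptions, not values from the report.
POPULAR = 10  # assumed cut-off for "receives requests from many domains"

def warn_on_request(incoming_sources, source):
    """Warn only when a new source appears at a target that is not widely shared."""
    if source in incoming_sources or not incoming_sources:
        return False                 # a known edge, or the target's first edge
    # Two or three distinct sources is the most suspicious case; a target with
    # tens of sources (e.g., an ad server) is treated as normal.
    return len(incoming_sources) + 1 < POPULAR

def is_sensitive(session):
    """Sensitive if the session began with a password or the site uses HTTPS."""
    return session.get("password_entered", False) or \
           session.get("url", "").startswith("https://")

ads = {f"site{i}.com" for i in range(20)}  # widely shared ad server
bank = {"bank.com"}                        # target with one known source
print(warn_on_request(ads, "new-site.com"))       # False: widely shared target
print(warn_on_request(bank, "malicious.com"))     # True: only a second source
print(is_sensitive({"url": "https://bank.com"}))  # True: HTTPS session
```

In the full system, a request would be checked against these predicates in sequence, with `is_sensitive` tightening the warning threshold rather than acting as a separate alarm.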
System release
The initial version of the developed system, called SmartNotes, is
available at http://cyberpsa.com. This public release includes mechanisms for the manual rating
of websites and sharing free-text comments about potential threats, as well as the initial
automated mechanism for evaluating the chances that a website poses a threat.
Future work
We will continue the work on application of machine learning and crowdsourcing to automated
and semi-automated detection of various threats. The specific goals are as follows.
• Detection of newly evolving threats, which are not yet addressed by the standard defenses.
• Detection of cyber attacks by their observed “symptoms,” in addition to the traditional
approach of directly analyzing the attacking code, which will help to identify new attacks.
• Detection of scams that post misleading claims with the purpose of defrauding users rather
than corrupting their computers.
References
[Anderson et al., 2007] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M.
Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the
Sixteenth USENIX Security Symposium, 2007.
[Cormack et al., 2010] Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke.
Efficient and effective spam filtering and re-ranking for large web datasets. Department of
Computer Science, University of Waterloo, 2010. Unpublished manuscript.
[Barth et al., 2008] Adam Barth, Collin Jackson, and John C. Mitchell. Robust defenses for
cross-site request forgery. In Proceedings of the Fifteenth ACM Conference on Computer and
Communications Security, pages 75–88, 2008.
[Schmidt et al., 2007] Mark Schmidt, Glenn Fung, and Rómer Rosales. Fast optimization
methods for L1 regularization: A comparative study and two new approaches. In Proceedings of
the European Conference on Machine Learning, pages 286–297, 2007.
[Sharifi et al., 2010] Mehrbod Sharifi, Eugene Fink, and Jaime G. Carbonell. Learning of
personalized security settings. In Proceedings of the IEEE International Conference on Systems,
Man, and Cybernetics, pages 3428–3432, 2010.
[Zeller and Felten, 2008] William Zeller and Edward W. Felten. Cross-site request forgeries:
Exploitation and prevention. Computer Science Department, Princeton University, 2008.
Unpublished manuscript.