February 2011
Application of Machine Learning and Crowdsourcing to Detection of Cybersecurity Threats
Abstract
We are applying machine learning and crowdsourcing to cybersecurity, with the goal of
developing a toolkit for the detection of complex cyber threats, which are often undetectable by
traditional tools. It will serve as an “extra layer of armor” that supplements the standard
defenses. The initial results include (1) an architecture for sharing security warnings among users
and (2) machine learning techniques for identifying malicious websites. The public release of the
system is available at http://cyberpsa.com.
Introduction
We can broadly divide cybersecurity threats into two categories. The first comprises
vulnerabilities caused by factors outside the end user’s control, such as security flaws in
applications and protocols. The traditional remedies include using firewalls and antivirus
software, distributing patches that fix newly discovered problems, and amending protocols.
While the defense against such threats is still an ongoing battle, software engineers have been
effective in countering most of them.
The second category, which has historically received less attention, includes the problems
caused by “careless” user actions. For example, an attacker may convince inexperienced users to
install a fake antivirus, which in reality corrupts their computers. As another example, an
attacker may use deceptive email and web advertisements, as well as phishing [Kumaraguru et
al., 2009], to trick users into falling victim to scams that go beyond traditional software
attacks, such as disclosing sensitive information or paying for fake product offers. The number of
such threats has grown in recent years, as more and more people conduct their daily activities
through the Internet, thus providing attackers with opportunities to exploit user naïveté.
While web browsers and operating systems now include some defenses against such threats, they
are often insufficient. The attackers have been effective in finding ways to trick the users into
bypassing the security barriers. The detection of such threats is difficult for both humans and
automated systems because malicious websites tend to look legitimate and use effective
deception techniques.
Our approach to these threats is based on crowdsourcing
combined with machine learning and natural language processing. We are working on a
distributed system that enables users to report threats spotted on the web, and applies machine
learning to integrate their reports. This idea is analogous to user-review mechanisms, where
people share their experiences with specific products. The novel characteristics of the developed
system include the following.
• Integration with crowdsourced question answering, similar to Yahoo Answers.
From the user’s point of view, the developed system acts as a personal security assistant. It
gathers relevant information, learns from the user’s feedback, and helps the user to identify
potential threats.
The initial work has led to the development of a crowdsourcing architecture, as well as
machine learning algorithms for detection of two specific security threats: scam websites and
cross-site request forgeries.
Crowdsourcing architecture
We have developed an architecture, called SmartNotes, that helps users to share their experience
related to web threats, and integrates the wisdom gathered from all its users. It enables users to
rate websites, post comments, and ask and answer related questions. Furthermore, it combines
these user reports with automated threat analysis.
User interface: The system’s main screen (Figure 1) allows making comments and
asking questions about a specific website. The user can select a rating (positive, neutral, or
negative), add comments, and post questions to be answered by other users. By default, the
comments are for the currently open web page, but the user can also post comments for the entire
web domain. For instance, when she is looking at a specific product on Amazon, she may enter
notes about that product page or about the entire amazon.com service. The user can specify
whether her notes are private, visible to her friends, or public. When the user visits a webpage,
she can read notes by others about it. She can also search the entire database of notes about all
webpages. In addition, the user can invoke automated scam detection, which calculates the
probability that the current website is malicious.
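The notes described above carry three independent attributes: a target scope (page or domain), a visibility level (private, friends, or public), and an optional rating. A minimal sketch of such a data model, with hypothetical field names that are not taken from the actual SmartNotes schema, might look like:

```python
# Illustrative data model for shared notes; all names are assumptions for the
# sake of the sketch, not the actual SmartNotes schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Note:
    author: str
    url: str
    scope: str                     # "page" or "domain"
    visibility: str                # "private", "friends", or "public"
    rating: Optional[str] = None   # "positive", "neutral", or "negative"
    text: str = ""

def visible_to(note, viewer, friends):
    """Decide whether `viewer` may read `note` under the three visibility levels."""
    if note.visibility == "public":
        return True
    if note.visibility == "friends":
        return viewer == note.author or viewer in friends.get(note.author, set())
    return viewer == note.author   # private: author only

note = Note(author="alice", url="https://amazon.com/some-product",
            scope="domain", visibility="friends", rating="positive",
            text="Reliable seller.")
friends = {"alice": {"bob"}}
print(visible_to(note, "bob", friends))    # True: bob is alice's friend
print(visible_to(note, "carol", friends))  # False: carol is not
```

The three visibility levels map naturally onto a single access check, which keeps the sharing rules in one place regardless of whether a note covers a page or a whole domain.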
Main components: The distributed system consists of three components (solid boxes in
Figure 2), which communicate through HTTP requests (dashed lines in Figure 2).
• SmartNotes browser extension is written in JavaScript and uses the Chrome extension API to
interact with the browser.
• SmartNotes web service is written in C#.NET and includes a SQL Server database. It exposes
methods for reading and writing notes, and supports other actions available to the users, such
as rating websites and posting questions.
• Host Analyzer web service is also written in C#.NET. It includes all data-analysis algorithms,
such as scam detection, parsing of user comments, and integration of user opinions with the
automated analysis.
Scam detection
Web scam is fraudulent or intentionally misleading information posted on the web, such as false
promises to help find work at home or cure various diseases, usually with the purpose of tricking
people into sending money or disclosing sensitive information. The challenge of detecting such
scams is largely unaddressed. For legal reasons, search engines are reluctant to block scammers
unless they have specific strong proof of fraudulent activity, such as confirmed instances of
malware distribution. The initial research on scam detection includes the work of Anderson et
al. [2007], who analyzed spam email to extract addresses of scam websites; and that of Cormack
et al. [2010], who addressed the problem of preventing scammers from tricking search engines
Currently, the most common approach to fighting web scam is blacklisting. Several
online services maintain lists of suspicious websites, usually compiled through user reports. For
example, Web of Trust (mywot.com) allows users to rate webpages on vendor reliability,
trustworthiness, privacy, and child safety, and displays the average ratings. As another example,
hosts-file.net and spamcop.net provide databases of malicious sites. Blacklisting, however, has
several limitations. In particular, a list may not include recently created scam websites, as well as
old sites moved to new domain names. Also, it may mistakenly include legitimate sites because
of inaccurate or malicious user reports.
We are developing a system that reduces the omissions and biases in blacklists by
supplementing them with measurements that are hard to manipulate. We have created a web
service, called Host Analyzer
(Figure 2), for gathering information about websites from various trusted online sources that
(Figure 2), for gathering information about websites from various trusted online sources that
provide such data. It currently collects forty-three features describing websites from eleven
sources. Examples of these features include ratings and traffic ranks for a given website;
geographic location of the website server; and the number of positive and negative comments
about the website. The system applies machine learning to these features to
evaluate the chances that a specific website poses a security threat. The learning module
constructs a classifier based on a database of known legitimate and malicious websites, and the
system then uses it to estimate the probability that previously unseen websites are malicious. We
have tested it using ten-fold cross-validation on a database of 837 manually labeled websites.
The precision of this technique is 98.0%; the recall is 98.1%; and the AUC measure, defined as
the area under the ROC curve, is 98.6%. Intuitively, these results mean that the system correctly
classifies about 98% of websites, with few false alarms.
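The evaluation protocol above can be sketched in a few lines. The report does not specify the classifier, though its citation of Schmidt et al. on L1 regularization suggests an L1-regularized logistic regression; the sketch below assumes scikit-learn and synthetic feature data in place of the real 837-site, 43-feature database, so the printed metrics are not the reported ones.

```python
# Sketch of 10-fold cross-validated evaluation of a website classifier.
# Assumptions: scikit-learn; synthetic data standing in for the real database;
# L1-regularized logistic regression as a plausible (not confirmed) model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
n_sites, n_features = 837, 43            # mirrors the database described above
X = rng.normal(size=(n_sites, n_features))
w = rng.normal(size=n_features)
y = (X @ w + rng.normal(scale=0.5, size=n_sites) > 0).astype(int)  # 1 = malicious

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# Out-of-fold probability estimates over all ten folds:
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

print(f"precision = {precision_score(y, pred):.3f}")
print(f"recall    = {recall_score(y, pred):.3f}")
print(f"AUC       = {roc_auc_score(y, proba):.3f}")
```

Using out-of-fold predictions, as here, gives each website a score from a model that never saw it during training, which is what makes the precision, recall, and AUC figures honest estimates of performance on unseen sites.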
Detection of cross-site request forgery
A cross-site request forgery (CSRF) is an attack through a web browser, in which a malicious
website uses a trusted browser session to send unauthorized requests to a target site [Barth et al.,
2008]. For example, Zeller and Felten [2008] described CSRF attacks that stole the user’s email
address and performed unauthorized money transfers. When a user visits a website, the browser
creates a session cookie that accompanies all subsequent requests from all browser windows
while the session is active, thus enabling web applications to maintain the state of their
interaction with the user. The browser provides the session information even if the request is
generated by a different website. If the user has an active session with site1.com, all requests sent
to site1.com include that information. If the user opens a (possibly malicious) site2.com, which
generates a (possibly unauthorized) request to site1.com, it will also include the site1.com session
information. This functionality is essential because some sites, such as advertising and payment-
processing servers, maintain the transaction state of requests from multiple domains; however, it
creates the vulnerability exploited by CSRF. A web application cannot determine whether a
request comes from the user or from a malicious site, since it contains the same session
information in both cases. The existing defenses require the developers of web applications to
adopt certain protocols. While these defenses are effective, developers occasionally fail to adopt
them, leaving some applications unprotected.
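The ambient-cookie behavior described above can be illustrated with a toy model of a browser, in which cookies are keyed by the target domain and attached to every request to that domain regardless of which page initiated it. All names here are illustrative, not part of any real browser API.

```python
# Toy model of ambient session cookies: the cookie for a target domain is sent
# with every request to that domain, no matter which site originated the
# request. This is exactly the gap that CSRF exploits.
class ToyBrowser:
    def __init__(self):
        self.cookies = {}  # target domain -> session cookie

    def log_in(self, domain):
        self.cookies[domain] = f"session-for-{domain}"

    def request(self, origin, target):
        # The cookie for `target` is attached even when `origin` is another site.
        return {"origin": origin, "target": target,
                "cookie": self.cookies.get(target)}

browser = ToyBrowser()
browser.log_in("site1.com")

# Legitimate request initiated by site1.com itself:
print(browser.request("site1.com", "site1.com")["cookie"])  # session-for-site1.com

# A request forged by site2.com carries the very same session cookie:
print(browser.request("site2.com", "site1.com")["cookie"])  # session-for-site1.com
```

Because both requests arrive with identical session information, site1.com has no way to tell them apart, which is the ambiguity the detection technique below tries to resolve on the browser side.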
Figure 3. Example graph of cross-site requests, where the nodes are domains and the edges are
requests. The solid nodes are the domains visited by the user, whereas the unfilled nodes are
accessed indirectly through cross-site requests. The dashed lines are CSRF attacks.
We are working on a machine learning technique for enhancing standard defenses, which
prevents attacks against unprotected sites by spotting malicious HTTP requests. It learns patterns
of legitimate requests, detects deviations from these patterns, and warns the user about
suspicious requests.
We represent patterns of requests by a directed graph, where the nodes are web domains
and the edges are HTTP requests. We show an example in Figure 3, where the solid nodes are
domains visited by the user, and the unfilled nodes are domains accessed indirectly, through
requests from the visited domains. In the example of Figure 3, all sites except Bank show
advertising materials from the Ads server. Furthermore, both Email and Bank show a news bar,
which requires cross-site requests to News. A CSRF attack occurs when the Malicious site sends
an unauthorized request to one of the other domains, shown by the dashed lines in the figure.
If there are no active browser sessions when the system starts building the graph, a CSRF
attack cannot occur on the first visit to a website. Therefore, when the system adds a new node,
its first incoming edge is a legitimate request. In the naïve version, we allow no incoming
requests for the directly accessed (solid) nodes and only one incoming edge for every indirectly
accessed (unfilled) node. If the system detects requests that do not match this pattern, it considers
them suspicious. In the example of Figure 3, the system would only allow requests from the solid
nodes to their “nearby” unfilled nodes within the same “corner” of the graph. It would give
warnings for requests between different corners, such as a request from Bank to News. The
justification for this approach comes from the observation that most legitimate requests are due
to the web application design in which the contents are distributed across servers.
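The naïve rule can be sketched as follows: the system records a directed graph of observed requests, treats each indirectly accessed node's first incoming edge as legitimate, and flags any incoming edge to a directly visited node or any second incoming source at an indirect node. Class and domain names are illustrative.

```python
# Sketch of the naive detection rule described above (illustrative names):
# solid (directly visited) nodes allow no incoming cross-site requests; each
# unfilled (indirectly accessed) node allows a single incoming edge.
class RequestGraph:
    def __init__(self):
        self.visited = set()   # domains the user opened directly ("solid" nodes)
        self.incoming = {}     # domain -> set of source domains seen so far

    def visit(self, domain):
        self.visited.add(domain)

    def observe(self, source, target):
        """Record a cross-site request; return True if it looks suspicious."""
        sources = self.incoming.setdefault(target, set())
        if target in self.visited:
            suspicious = True  # no incoming requests allowed for solid nodes
        else:
            # The first incoming edge is legitimate; a repeat of a known edge
            # is fine; a second distinct source is suspicious.
            suspicious = len(sources) >= 1 and source not in sources
        sources.add(source)
        return suspicious

g = RequestGraph()
g.visit("email.com"); g.visit("bank.com")
print(g.observe("email.com", "ads.com"))       # False: first edge into ads.com
print(g.observe("bank.com", "ads.com"))        # True: a second incoming domain
print(g.observe("malicious.com", "bank.com"))  # True: request into a solid node
```

As the paragraph above notes, this rule alone is too strict: shared servers such as ad networks legitimately receive edges from many domains, which motivates the refinement that follows.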
While the naïve approach is effective for spotting attacks, it produces numerous false
positives, that is, warnings for legitimate requests. In the example of Figure 3, it would produce
warnings when multiple sites generate requests to Ads and News. To prevent such false positives,
we use the observation that, when a site receives legitimate requests from multiple domains, it
usually receives requests from a large number of domains. Thus, the most suspicious case is
when a domain receives requests from two or three sites, whereas the situation when it receives
requests from tens of sites is usually normal. The system thus identifies domains with a large
number of incoming edges and does not give warnings for HTTP requests sent to them. We also
apply two additional techniques.
• Trusted domains: The system uses the website evaluation technique described
in the previous section, and does not warn about any requests from trustworthy domains.
• Sensitive data: The system identifies sessions that are likely to involve sensitive data, and
uses stricter thresholds for spotting potentially malicious requests that affect these sessions. It
views a session as sensitive if either (1) the user has entered a password when starting this
session or (2) the related website uses the HTTPS protocol rather than HTTP.
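The popularity refinement and the sensitivity rule above can both be sketched as small predicates. The popularity cut-off and the session field names below are assumed values for illustration, not figures taken from the report.

```python
# Illustrative sketches of the two refinements above; the threshold and the
# session field names are assumptions, not values from the report.
POPULAR = 10  # assumed cut-off for "receives requests from many domains"

def warn_on_request(incoming_sources, source):
    """Warn only when a new source appears at a target that is not widely shared."""
    if source in incoming_sources or not incoming_sources:
        return False                 # a known edge, or the target's first edge
    # Two or three distinct sources is the most suspicious case; a target with
    # tens of sources (e.g., an ad server) is treated as normal.
    return len(incoming_sources) + 1 < POPULAR

def is_sensitive(session):
    """Sensitive if the session began with a password or the site uses HTTPS."""
    return session.get("password_entered", False) or \
           session.get("url", "").startswith("https://")

ads = {f"site{i}.com" for i in range(20)}  # widely shared ad server
bank = {"bank.com"}                        # target with one known source
print(warn_on_request(ads, "new-site.com"))       # False: widely shared target
print(warn_on_request(bank, "malicious.com"))     # True: only a second source
print(is_sensitive({"url": "https://bank.com"}))  # True: HTTPS session
```

In the full system, a request would be checked against these predicates in sequence, with `is_sensitive` tightening the warning threshold rather than acting as a separate alarm.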
System release
The initial version of the developed system, called SmartNotes, is
available at http://cyberpsa.com. This public release includes mechanisms for the manual rating
of websites and sharing free-text comments about potential threats, as well as the initial
automated mechanism for evaluating the chances that a website poses a threat.
Future work
We will continue the work on application of machine learning and crowdsourcing to automated
and semi-automated detection of various threats. The specific goals are as follows.
• Detection of newly evolving threats, which are not yet addressed by the standard defenses.
• Detection of cyber attacks by their observed “symptoms,” in addition to the traditional
approach of directly analyzing the attacking code, which will help to identify new attacks.
• Detection of scams that post misleading claims with the purpose of defrauding users rather
than corrupting their computers.
References
[Anderson et al., 2007] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M.
Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the
Sixteenth USENIX Security Symposium, 2007.
[Cormack et al., 2010] Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke.
Efficient and effective spam filtering and re-ranking for large web datasets. Department of
Computer Science, University of Waterloo, 2010. Unpublished manuscript.
[Barth et al., 2008] Adam Barth, Collin Jackson, and John C. Mitchell. Robust defenses for
cross-site request forgery. In Proceedings of the Fifteenth ACM Conference on Computer and
Communications Security, pages 75–88, 2008.
[Schmidt et al., 2007] Mark Schmidt, Glenn Fung, and Rómer Rosales. Fast optimization
methods for L1 regularization: A comparative study and two new approaches. In Proceedings of
the European Conference on Machine Learning, pages 286–297, 2007.
[Sharifi et al., 2010] Mehrbod Sharifi, Eugene Fink, and Jaime G. Carbonell. Learning of
personalized security settings. In Proceedings of the IEEE International Conference on Systems,
Man, and Cybernetics, pages 3428–3432, 2010.
[Zeller and Felten, 2008] William Zeller and Edward W. Felten. Cross-site request forgeries:
Exploitation and prevention. Computer Science Department, Princeton University, 2008.
Unpublished manuscript.