
Application of Machine Learning and Crowdsourcing

to Detection of Cybersecurity Threats

February 2011

Eugene Fink, Mehrbod Sharifi, and Jaime G. Carbonell


eugenefink@cmu.edu, mehrbod@cs.cmu.edu, jgc@cs.cmu.edu
Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
phone: (412) 268-6593

Research sponsor: Department of Homeland Security


Abstract

We are applying machine learning and crowdsourcing to cybersecurity, with the aim of

developing a toolkit for detecting complex cyber threats, which are often undetectable by

traditional tools. It will serve as an “extra layer of armor” that supplements the standard

defenses. The initial results include (1) an architecture for sharing security warnings among users

and (2) machine learning techniques for identifying malicious websites. The public release of the

developed system is available at http://cyberpsa.com. This project is part of the work on

advanced data analysis at the CCICADA Center of Excellence.

Keywords: Cybersecurity, web scam, machine learning, crowdsourcing.


Introduction

We can broadly divide cybersecurity threats into two categories. The first comprises vulnerabilities

caused by factors outside the end user’s control, such as security flaws in applications and

protocols. The traditional remedies include using firewalls and antivirus software, distributing

patches that fix newly discovered problems, and amending protocols. While the defense against

such threats is still an ongoing battle, software engineers have been effective in countering most

threats and reducing the risk to an acceptable level in most cases.

The second category, which has historically received less attention, includes the problems

caused by “careless” user actions. For example, an attacker may convince inexperienced users to

install a fake antivirus program, which in reality corrupts their computers. As another example, an

attacker may use deceptive email and web advertisements, as well as phishing [Kumaraguru et

al., 2009], to trick users into falling victim to scams that go beyond traditional software

attacks, such as disclosing sensitive information or paying for fake product offers. The number of

such threats has grown in recent years, as more and more people conduct their daily activities

through the Internet, thus providing attackers with opportunities to exploit user naïveté.

While web browsers and operating systems now include some defenses against such threats, they

are often insufficient. The attackers have been effective in finding ways to trick the users into

bypassing the security barriers. The detection of such threats is difficult for both humans and

automated systems because malicious websites tend to look legitimate and use effective

deception techniques.


To improve defenses against these threats, we have taken a crowdsourcing approach,

combined with machine learning and natural language processing. We are working on a

distributed system that enables users to report threats spotted on the web, and applies machine

learning to integrate their reports. This idea is analogous to user-review mechanisms, where

people share their experiences with specific products. The novel characteristics of the developed

system are as follows.

• Integration with crowdsourced question answering, similar to Yahoo Answers, which helps to

encourage user participation.

• Application of machine learning and language processing to analyze user feedback.

• Synergy of user feedback with automated threat detection.

From the user’s point of view, the developed system acts as a personal security assistant. It

gathers relevant information, learns from the user’s feedback, and helps the user to identify

websites that may pose a threat.

The initial work has led to the development of a crowdsourcing architecture, as well as

machine learning algorithms for detection of two specific security threats: scam websites and

cross-site request forgery.


Figure 1. The main screen of the SmartNotes system.

Crowdsourcing architecture

We have developed an architecture, called SmartNotes, that helps users to share their experience

related to web threats, and integrates the wisdom gathered from all its users. It enables users to

rate websites, post comments, and ask and answer related questions. Furthermore, it combines

human opinions with automated threat detection.

User interface: The system’s main screen (Figure 1) allows the user to make comments and

ask questions about a specific website. The user can select a rating (positive, neutral, or

negative), add comments, and post questions to be answered by other users. By default, the

comments are for the currently open web page, but the user can also post comments for the entire

web domain. For instance, when she is looking at a specific product on Amazon, she may enter

notes about that product page or about the entire amazon.com service. The user can specify

whether her notes are private, visible to her friends, or public. When the user visits a webpage,

she can read notes by others about it. She can also search the entire database of notes about all


webpages. In addition, the user can invoke automated scam detection, which calculates the

chances that a given webpage poses a threat.

[Figure 2 diagram omitted. Components shown: the Web Browser running the SmartNotes Browser Extension (used by multiple users), the SmartNotes Web Service handling ratings and comments, the Host Analyzer Web Service producing scam warnings, and external data sources.]

Figure 2. The distributed crowdsourcing architecture. The SmartNotes service collects comments of multiple users. The Host Analyzer service gathers data about websites from trusted online sources and uses them to calculate the chances that a given website poses a threat.

Main components: The distributed system consists of three components (solid boxes in

Figure 2), which communicate through HTTP requests (dashed lines in Figure 2).

• SmartNotes browser extension provides a graphical user interface, which is written in

JavaScript and uses the Chrome extension API to interact with the browser.

• SmartNotes web service is written in C#.NET and includes a SQL Server database. It exposes

methods for reading and writing notes, and supports other actions available to the users, such

as login and account administration.

• Host Analyzer web service is also written in C#.NET. It includes all data-analysis algorithms,

such as scam detection, parsing of user comments, and integration of user opinions with the

automated threat detection.
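These components communicate only through HTTP, so each can be exercised independently. As a purely hypothetical sketch of such a call, the Python fragment below shows how a client (for example, the browser extension) might query the Host Analyzer; the endpoint URL, parameter name, and response field are illustrative assumptions, since the paper does not document the actual interface.

    # Hypothetical sketch of a client querying the Host Analyzer web
    # service over HTTP. The endpoint URL, parameter name, and response
    # field are illustrative assumptions; the actual SmartNotes and Host
    # Analyzer interfaces are not documented in this paper.
    import json
    import urllib.parse
    import urllib.request

    def analyze_host(domain):
        query = urllib.parse.urlencode({"domain": domain})
        url = "https://hostanalyzer.example.com/analyze?" + query  # assumed endpoint
        with urllib.request.urlopen(url) as response:
            result = json.load(response)
        # Assumed response field: estimated probability that the site is malicious.
        return result["threat_probability"]

    print(analyze_host("example.com"))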


Detection of scam websites

Web scam is fraudulent or intentionally misleading information posted on the web, such as false

promises to help find work at home or cure various diseases, usually with the purpose of tricking

people into sending money or disclosing sensitive information. The challenge of detecting such

scams is largely unaddressed. For legal reasons, search engines are reluctant to block scammers

unless they have specific strong proof of fraudulent activity, such as confirmed instances of

malware distribution. The initial research on scam detection includes the work of Anderson et

al. [2007], who analyzed spam email to extract addresses of scam websites; and that of Cormack

et al. [2010], who addressed the problem of preventing scammers from tricking search engines

into giving them undeservedly high rankings.

Currently, the most common approach to fighting web scam is blacklisting. Several

online services maintain lists of suspicious websites, usually compiled through user reports. For

example, Web of Trust (mywot.com) allows users to rate webpages on vendor reliability,

trustworthiness, privacy, and child safety, and displays the average ratings. As another example,

hosts-file.net and spamcop.net provide databases of malicious sites. Blacklisting, however, has

several limitations. In particular, a list may not include recently created scam websites, as well as

old sites moved to new domain names. Also, it may mistakenly include legitimate sites because

of inaccurate or intentionally biased reports.

We are developing a system that reduces the omissions and biases in blacklists by

integrating information from various heterogeneous sources, particularly focusing on quantitative

measurements that are hard to manipulate. We have created a web service, called Host Analyzer

(Figure 2), for gathering information about websites from various trusted online sources that

provide such data. It currently collects forty-three features describing websites from eleven


sources. Examples of these features include ratings and traffic ranks for a given website;

geographic location of the website server; and the number of positive and negative comments

provided through Web of Trust and other similar services.

We have applied logistic regression with L1 regularization [Schmidt et al., 2007] to

evaluate the chances that a specific website poses a security threat. The learning module

constructs a classifier based on a database of known legitimate and malicious websites, and the

system then uses it to estimate the probability that previously unseen websites are malicious. We

have tested it using ten-fold cross-validation on a database of 837 manually labeled websites.

The precision of this technique is 98.0%; the recall is 98.1%; and the AUC measure, defined as

the area under the ROC curve, is 98.6%. Intuitively, these results mean that the system correctly

determines whether a website is malicious in 49 out of 50 cases.
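As a minimal sketch of this classification step (using Python and scikit-learn in place of the authors’ C#.NET implementation, with synthetic placeholders for the forty-three features and 837 labeled websites), the training and ten-fold evaluation might look as follows:

    # Minimal sketch of the scam classifier: L1-regularized logistic
    # regression evaluated with ten-fold cross-validation. The feature
    # matrix and labels are synthetic placeholders; the real system uses
    # 43 website features gathered from 11 online sources.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_sites, n_features = 837, 43                 # sizes taken from the paper
    X = rng.normal(size=(n_sites, n_features))    # placeholder feature values
    y = rng.integers(0, 2, size=n_sites)          # placeholder labels: 1 = malicious

    # The L1 penalty drives the weights of uninformative features to zero.
    classifier = LogisticRegression(penalty="l1", solver="liblinear")

    for metric in ("precision", "recall", "roc_auc"):
        scores = cross_val_score(classifier, X, y, cv=10, scoring=metric)
        print(metric, round(scores.mean(), 3))

    # After fitting on the labeled database, predict_proba would give the
    # estimated probability that a previously unseen website is malicious.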

Detection of cross-site request forgery

A cross-site request forgery (CSRF) is an attack through a web browser, in which a malicious

website uses a trusted browser session to send unauthorized requests to a target site [Barth et al.,

2008]. For example, Zeller and Felten [2008] described CSRF attacks that stole the user’s email

address and performed unauthorized money transfers. When a user visits a website, the browser

creates a session cookie that accompanies all subsequent requests from all browser windows

while the session is active, thus enabling web applications to maintain the state of their

interaction with the user. The browser provides the session information even if the request is

generated by a different website. If the user has an active session with site1.com, all requests sent

to site1.com include that information. If the user opens a (possibly malicious) site2.com, which

generates a (possibly unauthorized) request to site1.com, it will also include the site1.com session


information. This functionality is essential because some sites, such as advertising and

payment-processing servers, maintain the transaction state of requests from multiple domains; however, it

creates the vulnerability exploited by CSRF. A web application cannot determine whether a

request comes from the user or from a malicious site, since it contains the same session

information in both cases. The existing defenses require the developers of web applications to

adopt certain protocols. While these defenses are effective, developers occasionally fail to

implement them properly.
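The toy Python simulation below illustrates the underlying problem: the browser selects cookies by the destination of a request and ignores which page generated it, so a server cannot distinguish forged requests from genuine ones. All domain names here are illustrative.

    # Toy simulation of browser cookie handling, illustrating why CSRF
    # works: the cookie jar is keyed only by the destination domain, so
    # the origin of a request plays no role. Domain names are illustrative.
    session_cookies = {"site1.com": "session=abc123"}   # one active session

    def browser_request(origin, destination):
        # The browser attaches whatever cookie it holds for `destination`,
        # regardless of which page (`origin`) generated the request.
        return {"origin": origin,
                "to": destination,
                "cookie": session_cookies.get(destination)}

    # A request genuinely made by the user while browsing site1.com:
    print(browser_request("site1.com", "site1.com"))
    # A forged request generated by a page on site2.com; it carries the
    # same site1.com session cookie, so the server cannot tell them apart:
    print(browser_request("site2.com", "site1.com"))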

[Figure 3 diagram omitted. Nodes shown: Email, News, Ads, Malicious, and Bank, connected by request edges; the dashed edges from Malicious to Email and Bank mark the CSRF attacks.]

Figure 3. Example graph of cross-site requests, where the nodes are domains and the edges are
requests. The solid nodes are the domains visited by the user, whereas the unfilled nodes are
accessed indirectly through cross-site requests. The dashed lines are CSRF attacks.

We are working on a machine learning technique that enhances the standard defenses and

prevents attacks against unprotected sites by spotting malicious HTTP requests. It learns patterns

of legitimate requests, detects deviations from these patterns, and warns the user about

potentially malicious sites and requests.

We represent patterns of requests by a directed graph, where the nodes are web domains

and the edges are HTTP requests. We show an example in Figure 3, where the solid nodes are

domains visited by the user, and the unfilled nodes are domains accessed indirectly, through

requests from the visited domains. In the example of Figure 3, all sites except Bank show

advertising materials from the Ads server. Furthermore, both Email and Bank show a news bar,


which requires cross-site requests to News. A CSRF attack occurs when the Malicious site sends

forged requests, shown by dashed lines, to Email and Bank.

If there are no active browser sessions when the system starts building the graph, a CSRF

attack cannot occur on the first visit to a website. Therefore, when the system adds a new node,

its first incoming edge is a legitimate request. In the naïve version, we allow no incoming

requests for the directly accessed (solid) nodes and only one incoming edge for every indirectly

accessed (unfilled) node. If the system detects requests that do not match this pattern, it considers

them suspicious. In the example of Figure 3, the system would only allow requests from the solid

nodes to their “nearby” unfilled nodes within the same “corner” of the graph. It would give

warnings for requests between different corners, such as a request from Bank to News. The

justification for this approach comes from the observation that most legitimate requests arise

from web application designs in which content is distributed across multiple servers.
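A minimal Python sketch of this naive rule follows; the event interface (one callback for direct visits, one for cross-site requests) is an illustrative assumption about how a browser extension might feed the detector.

    # Sketch of the naive request-graph rule: a directly visited (solid)
    # domain should receive no cross-site requests, and an indirectly
    # accessed (unfilled) domain is allowed a single incoming edge, namely
    # the first one observed, which is legitimate because no session with
    # that domain can exist yet. The event interface is an assumption.
    visited = set()        # domains the user navigated to directly
    first_source = {}      # indirect domain -> the one allowed request source

    def on_user_visit(domain):
        visited.add(domain)

    def on_cross_site_request(source, target):
        """Return True if the request is suspicious under the naive rule."""
        if target in visited:
            return True                    # solid nodes get no incoming edges
        if target not in first_source:
            first_source[target] = source  # first edge: assumed legitimate
            return False
        return first_source[target] != source   # extra incoming edge

    on_user_visit("email.example")
    on_user_visit("bank.example")
    print(on_cross_site_request("email.example", "ads.example"))      # False
    print(on_cross_site_request("malicious.example", "bank.example")) # True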

While the naïve approach is effective for spotting attacks, it produces numerous false

positives, that is, warnings for legitimate requests. In the example of Figure 3, it would produce

warnings when multiple sites generate requests to Ads and News. To prevent such false positives,

we use the observation that, when a site receives legitimate requests from multiple domains, it

usually receives requests from a large number of domains. Thus, the most suspicious case is

when a domain receives requests from two or three sites, whereas the situation when it receives

requests from tens of sites is usually normal. The system thus identifies domains with a large

number of incoming edges and does not give warnings for HTTP requests sent to them. We also

use two heuristics to improve identification of legitimate requests.

• Trusted websites: The system automatically estimates domain trustworthiness, as described

in the previous section, and does not warn about any requests from trustworthy domains.


• Sensitive data: The system identifies sessions that are likely to involve sensitive data, and

uses stricter thresholds for spotting potentially malicious requests that affect these sessions. It

views a session as sensitive if either (1) the user has entered a password when starting this

session or (2) the related website uses the HTTPS protocol rather than HTTP.
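A sketch combining the fan-in observation with these two heuristics might look as follows; the threshold values and data structures are illustrative assumptions, not the system’s actual parameters.

    # Sketch of the refined rule: domains that receive requests from many
    # distinct sources (such as ad servers) are exempted, trusted domains
    # are never flagged, and sensitive sessions use a stricter threshold.
    # Both threshold values are illustrative assumptions.
    from collections import defaultdict

    incoming = defaultdict(set)   # target domain -> distinct request sources
    trusted = set()               # filled by the scam-detection module
    sensitive = set()             # domains with password or HTTPS sessions

    FAN_IN_NORMAL = 10            # assumed cutoff for "tens of sites"
    FAN_IN_SENSITIVE = 30         # assumed stricter cutoff

    def is_suspicious(source, target):
        incoming[target].add(source)
        if source in trusted:
            return False          # heuristic 1: trusted websites
        # Heuristic 2: stricter threshold for sensitive sessions.
        cutoff = FAN_IN_SENSITIVE if target in sensitive else FAN_IN_NORMAL
        fan_in = len(incoming[target])
        # One source is the legitimate first edge; a few sources is the
        # most suspicious case; a large fan-in suggests a shared service.
        return 1 < fan_in < cutoff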

System release

We have implemented the initial crowdsourcing system as a Chrome browser extension,

available at http://cyberpsa.com. This public release includes mechanisms for manually rating

websites and sharing free-text comments about potential threats, as well as the initial

automated mechanism for evaluating the chances that a website poses a threat.

Future work

We will continue the work on applying machine learning and crowdsourcing to automated

and semi-automated detection of various threats. The specific goals are as follows.

• Detection of newly evolving threats, which are not yet addressed by the standard defenses.

• Detection of cyber attacks by their observed “symptoms” in addition to using the traditional

approach of directly analyzing the attacking code, which will help to identify new

re-implementations of known malware.

• Detection of “nontraditional” threats that go beyond malware attacks, such as posting

misleading claims with the purpose of defrauding users rather than corrupting their computers.


References

[Anderson et al., 2007] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M. Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the Sixteenth USENIX Security Symposium, 2007.

[Barth et al., 2008] Adam Barth, Collin Jackson, and John C. Mitchell. Robust defenses for cross-site request forgery. In Proceedings of the Fifteenth ACM Conference on Computer and Communications Security, pages 75–88, 2008.

[Cormack et al., 2010] Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Department of Computer Science, University of Waterloo, 2010. Unpublished manuscript.

[Kumaraguru et al., 2009] Ponnurangam Kumaraguru, Justin Cranshaw, Alessandro Acquisti, Lorrie Cranor, Jason Hong, Mary Ann Blair, and Theodore Pham. School of phish: A real-world evaluation of anti-phishing training. In Proceedings of the Fifth Symposium on Usable Privacy and Security, pages 1–12, 2009.

[Schmidt et al., 2007] Mark Schmidt, Glenn Fung, and Rómer Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Proceedings of the European Conference on Machine Learning, pages 286–297, 2007.

[Sharifi et al., 2010] Mehrbod Sharifi, Eugene Fink, and Jaime G. Carbonell. Learning of personalized security settings. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pages 3428–3432, 2010.

[Zeller and Felten, 2008] William Zeller and Edward W. Felten. Cross-site request forgeries: Exploitation and prevention. Computer Science Department, Princeton University, 2008. Unpublished manuscript.
