Chapter XLI
Data Mining
Mark Last
Ben-Gurion University of the Negev, Israel
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
ABSTRACT
Data mining is a growing collection of computational techniques for automatic analysis of structured, semi-struc-
tured, and unstructured data with the purpose of identifying important trends and previously unknown behavioral
patterns. Data mining is widely recognized as the most important and central technology for homeland security in
general and for cyber warfare in particular. This chapter covers the following relevant areas of data mining:
• Web mining is the application of data mining techniques to web-based data. While Web usage mining is
already used by many intrusion detection systems, Web content mining can lead to automated identification
of terrorist-related content on the Web.
• Web information agents are responsible for filtering and organizing unrelated and scattered data in large
amounts of web documents. Agents represent a key technology to cyber warfare due to their capability
to monitor multiple diverse locations, communicate their findings asynchronously, collaborate with each
other, and profile possible threats.
• Anomaly detection and activity monitoring. Real-time monitoring of continuous data streams can lead to
timely identification of abnormal, potentially criminal activities. Anomalous behavior can be automatically
detected by a variety of data mining methods.
INTRODUCTION
Data mining (DM) is a rapidly growing collection
of computational techniques for automatic analysis
of structured, semi-structured, and unstructured
data with the purpose of identifying various kinds of
previously unknown behavioral patterns. According
to Mena (2004), data mining is widely recognized
as the most important and central technology for
homeland security in general and for cyber warfare
in particular. This relatively new field emerged in the
beginning of the 1990s as a combination of methods
and algorithms from statistics, pattern recognition,
and machine learning. The difference between data
mining and knowledge discovery in databases (KDD)
is defined as follows: data mining refers to the
application of pattern extraction algorithms to data,
while KDD is the overall process of “identifying valid,
novel, potentially useful, and ultimately understand-
able patterns in data” (Fayyad, Piatetsky-Shapiro
& Smyth, 1996, p. 6). The complete KDD process
includes such stages as data selection; data cleaning
and pre-processing; data reduction and transformation;
choosing data mining tasks, methods and tools; data
mining (searching for patterns of ultimate interest);
interpretation of data mining results; and action upon
discovered knowledge.
BACKGROUND
Dozens of computational techniques related to various
data mining tasks have emerged over the last 15 years.
Selected examples of common data mining tasks
and algorithms are briefly described below.
Association rules: Association rule mining is
aimed at finding interesting association or correlation
relationships among a large set of data items (Han &
Kamber, 2001). The extracted patterns (association
rules) usually have the form “if event X occurs, then
event Y is likely.” Events X and Y may represent items
bought in a purchase transaction, documents viewed in
a user session, medical symptoms of a given patient,
and many other phenomena recorded in a database
over time. Extracted rules are evaluated by two main
parameters: support, which is the probability that a
transaction contains both X and Y, and confidence,
which is the conditional probability that a transaction
having X also contains Y. Scalable algorithms, such as
Apriori (Srikant & Agrawal, 1996), have been devel-
oped for mining association rules in large databases
containing millions of multi-item transactions.
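The two rule-evaluation parameters defined above can be estimated directly from a list of transactions. The following is a minimal sketch, not the Apriori algorithm itself; the session data and the function names are hypothetical and serve only to illustrate the definitions of support and confidence.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Estimate of P(transaction contains Y | transaction contains X)."""
    x_items, y_items = frozenset(x), frozenset(y)
    support_x = support(transactions, x_items)
    if support_x == 0:
        return 0.0  # X never occurs, so the rule cannot be evaluated
    return support(transactions, x_items | y_items) / support_x


# Hypothetical user sessions: each set lists the pages viewed in one session.
sessions = [
    frozenset({"home", "forum", "download"}),
    frozenset({"home", "forum"}),
    frozenset({"home", "news"}),
    frozenset({"forum", "download"}),
]

# Rule "if forum is viewed, then download is likely".
rule_support = support(sessions, {"forum", "download"})
rule_confidence = confidence(sessions, {"forum"}, {"download"})
print(rule_support, rule_confidence)
```

In this toy data set, two of the four sessions contain both pages, and two of the three sessions that contain "forum" also contain "download". A scalable miner would avoid rescanning the full transaction list for every candidate rule.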
Cluster analysis: A cluster is a collection of data
objects (e.g., Web documents) that are similar to each
other within the same cluster, while being dissimilar to
the objects in any other cluster (Han & Kamber, 2001).
One of the most important goals of cluster analysis
is to discover hidden patterns, which characterize
groups of seemingly unrelated objects (transactions,
individuals, documents, etc.). Clustering of “normal
walks of life” can also serve as a basis for the task of
anomaly detection: an outlier, which does not belong
to any normal cluster, may be an indication of abnor-
mal, potentially malicious behavior (Last & Kandel,
2005, chap. 4 & 6). A survey of leading clustering
methods is presented in Data Clustering: A Review
(Jain, Murty, & Flynn, 1999).
Predictive modeling: The task of predictive modeling
is to predict (anticipate) future outcomes of
complex, poorly understood processes based on
automated analysis of historic data. Predicting future
behaviors (especially attacks) of terrorist and other
malicious groups is an example of such a task. Han and
Kamber (2001) refer to prediction of continuous values
as prediction, while prediction of nominal class labels
(e.g., terrorist vs. non-terrorist documents) is regarded
by them as classification. Common classification
models include artificial neural networks (ANN;
Mitchell, 1997), decision trees (Quinlan, 1993),
Bayesian networks (Mitchell, 1997), info-fuzzy
networks (IFN; Last & Maimon, 2004), and so forth.
Visual data mining: Visual data mining is the pro-
cess of discovering implicit but useful knowledge from
large data sets using visualization techniques. Since “a
picture is worth a thousand words,” the human eye can
identify patterns, trends, structure, irregularities, and
relationships among data much faster in a represen-
tative landscape than in a spreadsheet. Scatter plots,
boxplots, and frequency histograms are examples of
techniques used by descriptive data mining. Last and
Kandel (1999) use the concepts of fuzzy set theory to
automate the process of human perception based on
pre-defined objective parameters.
This chapter will cover in depth several applica-
tion areas of data mining in the cyber warfare and
cyber terrorism domain, namely: Web mining, Web
information agents, and anomaly detection (closely
related to activity monitoring).
WEB MINING
The military pressure put on the al-Qaeda leadership
in Afghanistan after 9/11 has dramatically increased
the role of the Internet in the infrastructure of global
terrorist organizations (Corera, 2004). In terrorism
expert Peter Bergen’s words:
They lost their base in Afghanistan, they lost their
training camps, they lost a government that allowed
them do what they want within a country. Now they’re
surviving on internet to a large degree. It is really their
new base (ibid).
Beyond propaganda and ideology, jihadist sites
seem to be heavily used for practical training in kidnap-
ping, explosive preparation, and other “core” terrorist
activities, which were once taught in Afghan training
camps. The former U.S. Deputy Defense Secretary Paul
D. Wolfowitz, in a testimony before the House Armed
Services Committee, called such Web sites “cyber
sanctuaries” (Lipton & Lichtblau, 2004). Of course,
al-Qaeda is not the sole source of terror-related Web
sites. According to a recent estimate, the total number
of such Web sites has increased from only 12 in 1997
to around 4,300 in 2005 (Talbot, 2005).
Due to the extent of Internet usage by terrorist
organizations, cyber space has become a valuable
source of information on terrorists' current activities
and intentions (Last & Kandel, 2005, chap. 1 & 2).
In this new kind of war, frequently called the “cyber
war” or the “web war,” homeland security agencies
face a variety of extremely difficult challenges (Mena,
2004). Terrorist organizations can post their information
on the Web at any location (Web server), in any
form (Web page, Internet forum posting, chat room
communication, e-mail message, etc.), and in any
language. Moreover, they can take that information
off-line within hours or even minutes. Accurate and
timely identification of such material in the midst of
massive Web traffic is by far the most challenging task
currently faced by the intelligence community.
Furthermore, homeland security analysts are
interested in identifying who is behind the posted
material, what links they might have to active ter-
ror groups, and what threat, if any, they might pose.
They would also like to identify temporal trends in
terrorist-related content and track down the “target
audience” of individual and public online messages.
The current number of known terrorist sites is so large
that a continuous manual analysis of their multilingual
content is definitely out of the question. This is why
the automated Web mining approach is so important
for the cyber war against international terror.
Most common Web mining tasks can be divided
into three different categories: Web usage, Web
structure, and Web content mining. The main goal
of Web usage mining is to gather information about
Web system users and to examine the relationships
between Web pages from the user point of view. Web
usage mining methods include mining access logs to
find common usage patterns of Web pages, calculating
page ratings, and so forth. Many intrusion detection
systems are routinely using Web usage mining tech-
niques to spot potential intruders among normal Web
users. In Web structure mining, the usual aim is to
extract some useful information about a document by
examining its hyperlinks to other documents. More
sophisticated link analysis techniques, combined
with information extraction (IE) tools, can discover
other types of links between unstructured documents
such as links between locations, organizations, and
individuals (Mena, 2004). As shown by Ben-Dov,
Wu, Feldman, and Cairns (2004), these techniques
can reveal complex terrorist networks.
Applying data mining algorithms to the content
of Web documents in order to generate an efficient
representation of content patterns (such as patterns of
terror-related pages) is a typical application of Web
content mining. Traditional information retrieval and
Web mining methods represent textual documents
with a vector-space model, which utilizes a series of
numeric values associated with each document. Each
value is associated with a specific key word or key
phrase that may appear in a document. However, this
popular method of document representation does not
capture important structural information, such as the
order and proximity of term occurrence or the loca-
tion of a term within the document. This structural
information may be critical for a text categorization
system, which is required to make an accurate distinc-
tion between terrorist pages (such as guidelines for
preparing future terrorist attacks) and normal pages
(which may be news reports about terrorist attacks in
the past). Both types of pages may include nearly the
same set of key words, though their structure could
be radically different.
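The limitation described above can be seen in a small sketch. Two hypothetical sentences with different meanings share exactly the same term-frequency vector, while a representation that keeps word order (here, adjacent word pairs) tells them apart. The bigram list is used only as an illustration of structural information; it is an assumption of this example and not the graph-based representation discussed elsewhere in this chapter.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Lower-case alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text: str) -> Counter:
    """Vector-space (bag-of-words) representation: term -> occurrence count."""
    return Counter(tokens(text))


def ordered_bigrams(text: str) -> List[Tuple[str, str]]:
    """Adjacent word pairs, which preserve order and proximity of terms."""
    words = tokens(text)
    return list(zip(words, words[1:]))


# Hypothetical documents built from the same words in a different order.
doc_a = "police stopped the attack on the station"
doc_b = "the attack on the police station stopped"

same_terms = term_vector(doc_a) == term_vector(doc_b)
same_structure = ordered_bigrams(doc_a) == ordered_bigrams(doc_b)
print(same_terms, same_structure)
```

A classifier that sees only `term_vector` output cannot distinguish the two documents, whereas any order-aware representation can.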
Last, Markov, and Kandel (2006) describe an ad-
vanced graph-based methodology for multilingual de-
tection of terrorist documents. The proposed approach
is evaluated on a collection of 648 Web documents
in Arabic. The results demonstrate that documents
downloaded from several known terrorist sites can
be reliably discriminated from the content of Arabic
news reports using a simple decision tree.
Most Web content mining techniques assume a
static nature of the Web content. This approach is
inadequate for long-term monitoring of Web traffic,
since both the users' interests and the content of most
Web sites are subject to continuous changes over time.
Timely detection of an ongoing trend in certain Web
content may trigger periodic retraining of the data-
mining algorithm. In addition, the characteristics of
the trend itself (e.g., an increased occurrence of certain
key phrases) may indicate some important changes
in the online behavior of the monitored Web site and
its users. Chang, Healey, McHugh, and Wang (2001)
have proposed several methods for change and trend
detection in dynamic Web content, where a trend is
recognized by a change in frequency of certain topics
over a period of time. Another trend discovery system
for mining dynamic content of news Web sites is pre-
sented by Mendez-Torreblanca, Montes-y-Gomez, and
Lopez-Lopez (2002). A novel, fuzzy-based method for
identifying short-term and long-term trends in dynamic
Web content is proposed by Last (2005).
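The notion of a trend as a change in the frequency of a topic over time can be sketched as follows. This is a deliberately simplified illustration, assuming that a topic is represented by a single key phrase and that documents are already grouped into time windows; it does not reproduce any of the cited methods, and all names and data are hypothetical.

```python
from typing import List, Tuple


def phrase_frequency(documents: List[str], phrase: str) -> float:
    """Share of the documents in one time window that mention the phrase."""
    if not documents:
        return 0.0
    needle = phrase.lower()
    return sum(needle in doc.lower() for doc in documents) / len(documents)


def frequency_change(windows: List[List[str]], phrase: str) -> Tuple[float, List[float]]:
    """Frequency per window, and the difference between last and first window."""
    freqs = [phrase_frequency(window, phrase) for window in windows]
    return freqs[-1] - freqs[0], freqs


# Hypothetical snapshots of one monitored site, oldest window first.
windows = [
    ["weather report for monday", "local sports results"],
    ["training manual posted", "local sports results"],
    ["training manual updated", "new training manual"],
]

change, freqs = frequency_change(windows, "training manual")
print(freqs, change)
```

A positive `change` indicates that the phrase has become more frequent over the monitored period. Such a signal could, for example, be used to decide when to retrain a content mining model.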
WEB INFORMATION AGENTS
An intelligent software agent is an autonomous pro-
gram designed to perform a human-like function over
a network or the Internet. Specifically, information
agents are responsible for filtering and organizing
unrelated and scattered data such as large amounts of
unstructured Web documents. Agents represent a key
technology for homeland security due to their capability
to monitor multiple diverse locations, communicate
their findings asynchronously, collaborate with each
other, analyze conditions, issue real-time alerts, and
profile possible threats (Mena, 2004).
Autonomous information agents are an evolving
solution to the problem of inaccurate and incomplete
search indexes (Cesarano, d’Acierno, & Picariello,
2003; Klusch, 2001; Pant, Srinivasan, & Menczer,
2004; Yu, Koo, & Liddy, 2000). The basic idea of
the search agent technology is to imitate the behavior
of an expert user by submitting a query to several
search engines in parallel, determining automatically
the relevancy of retrieved pages, and then following
the most promising links from those pages. The pro-
cess goes on using the links on the new pages until
the agent resources are exhausted, there are no more
pages to browse, or the system objectives are reached
(Pant et al., 2004).
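The search loop described above can be sketched as a best-first crawl. To keep the example self-contained, a small in-memory dictionary (the hypothetical `WEB`) stands in for the real Web, a list of seed pages stands in for the results returned by the search engines, and relevance is assumed to be the share of query terms found on a page. The three stopping conditions mirror those named in the text: exhausted resources, no more pages, or objectives reached.

```python
import heapq
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB: Dict[str, Tuple[str, List[str]]] = {
    "a": ("news portal", ["b", "c"]),
    "b": ("explosive preparation manual", ["d"]),
    "c": ("sports results", []),
    "d": ("manual for explosive devices", ["a"]),
}


def relevance(text: str, query_terms: Set[str]) -> float:
    """Assumed relevance measure: share of query terms present in the page."""
    return len(set(text.lower().split()) & query_terms) / len(query_terms)


def crawl(seeds: List[str], query: str, max_pages: int = 10,
          wanted: int = 2, threshold: float = 0.5) -> Tuple[List[str], Set[str]]:
    """Follow links from the most relevant pages first."""
    query_terms = set(query.lower().split())
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(0.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited: Set[str] = set()
    relevant: List[str] = []
    # Stop when the frontier is empty, the page budget is spent,
    # or enough relevant pages have been found.
    while frontier and len(visited) < max_pages and len(relevant) < wanted:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        text, links = WEB[url]
        score = relevance(text, query_terms)
        if score >= threshold:
            relevant.append(url)
        for link in links:
            if link not in visited:
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant, visited


relevant, visited = crawl(["a"], "explosive manual")
print(relevant, sorted(visited))
```

Here the link found on the relevant page "b" is followed before the remaining link from the irrelevant page "a", so page "c" is never fetched.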
Intelligent information agents can be classified
in several ways (Klusch, 2001). They can be either
cooperative or non-cooperative with each other. Agent
functionality is usually based on a set of information
processing rules that may be explicitly specified by the
user, acquired by a knowledge engineer, or induced
by data mining algorithms. The most popular data mining
techniques used by information agent systems include
artificial neural networks, genetic algorithms, reinforcement
learning, and case-based reasoning. Agents
can also be adaptive, that is, continue to learn from the
environment and change their behavior accordingly.
According to Klusch (2001), any information agent
should possess the following key capabilities: access
to heterogeneous sites and resources on the Web (from
static pages to Web-based applications), retrieving
and filtering data from any kind of digital medium
(including documents written in any language and
multimedia information), processing of ontological
knowledge (e.g., expressed by semantic networks),
and information visualization.
ANOMALY DETECTION AND
ACTIVITY MONITORING
In activity monitoring, analysis of data streams is
applied in order to detect interesting behavior,
each occurrence of which is referred to as a “positive
activity.” Positive activities should be different from each
other and should be different from the non-positive
monitored activities. Indication of a positive activity
is called an alarm (Fawcett & Provost, 1999). Many
applications use activity monitoring,
such as computer intrusion detection, fraud detection,
crisis monitoring, network performance monitoring,
and news story monitoring. The representation of
the input may differ completely from one application
to another. The input data can be, for
example, a feature vector, a collection of documents,
or a stream of numbers. In activity monitoring the
goal is to issue an accurate alarm on time. To
address this goal, data mining techniques such as
classification, regression, and time series analysis are
applied.
Anomaly detection relies on models of the intended
behavior of users and applications and interprets de-
viations from this “normal” behavior as evidence of
malicious (e.g., terrorist-related) activity (Kruegel &
Vigna, 2003). This approach is complementary with
respect to signature-based detection, where a number of
attack descriptions (usually in the form of signatures)
are matched against the stream of input data, looking
for evidence that one of the expected attacks (e.g., a
known computer virus) is taking place. The basic as-
sumption underlying anomaly detection is that attack
patterns differ from normal behavior.
There are two main stages in anomaly detection.
In the first stage, a representation of the “normal”
behavior is obtained by applying some data mining
algorithms to examples of normal behavior. In the
second stage, events different from the “normal” behavior
are detected and classified as suspected to be
malicious. Most intrusion detection systems that use
anomaly detection (Sequeira & Zaki, 2002) monitor
user actions and operations rather than content ac-
cessed by the users as an audit source.
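The two stages can be illustrated with a very small sketch. It assumes that user activity is summarized by a single number (a hypothetical count of requests per minute) and uses a mean and standard deviation as the "normal" profile; this simple statistical model is a stand-in chosen for brevity, not a method prescribed by the works cited above.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_normal_profile(values: List[float]) -> Tuple[float, float]:
    """Stage 1: summarize examples of normal behavior as (mean, std. deviation)."""
    return mean(values), pstdev(values)


def is_anomalous(value: float, profile: Tuple[float, float],
                 z_threshold: float = 3.0) -> bool:
    """Stage 2: flag an event that deviates too far from the normal profile."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical training data: requests per minute observed for normal users.
normal_requests_per_minute = [10, 12, 11, 9, 10, 13, 11, 12]
profile = fit_normal_profile(normal_requests_per_minute)

# New events to be checked against the learned profile.
flags = [is_anomalous(v, profile) for v in [11, 14, 60]]
print(profile, flags)
```

Only the last event lies more than three standard deviations from the learned mean. The threshold is an assumption of this example and controls the trade-off between missed attacks and false alarms.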
In Elovici, Kandel et al. (2004) and Elovici, Shapira
et al. (2005), a Terrorist Detection System (TDS) is
presented aimed at tracking down suspected terrorists
by analyzing the content of information they access
on the Web. The system operates in two modes: the
training mode is activated off-line, and the detection
mode is operated in real-time. In the training mode,
TDS is provided with Web pages of normal users
from which it derives their normal behavior profile by
applying data mining (clustering) algorithms to the
training data. In the detection mode, TDS performs
real-time monitoring of the traffic emanating from the
monitored group of users, analyzes the content of the
Web pages they access, and generates an alarm if a user
accesses abnormal information, that is, the content of
the information accessed is “very” dissimilar to the
typical content in the monitored environment.
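The detection mode can be sketched as a similarity test between an accessed page and the profile of normal content. The example below is not the TDS implementation: it assumes that the clusters of normal pages are already given, represents each cluster by the summed term vector of its pages, and uses cosine similarity with an arbitrary threshold. All page texts, names, and the threshold value are hypothetical.

```python
import math
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Term-frequency vector of a page."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(pages: List[str]) -> Counter:
    """Summed term vector of a cluster (cosine similarity ignores its scale)."""
    total: Counter = Counter()
    for page in pages:
        total.update(vectorize(page))
    return total


# Training mode (assumed output of a clustering step over normal users' pages).
normal_clusters = [
    ["football match results", "football league table results"],
    ["course schedule exam dates", "exam schedule and course notes"],
]
centroids = [centroid(cluster) for cluster in normal_clusters]


def raises_alarm(page: str, centroids: List[Counter], threshold: float = 0.2) -> bool:
    """Detection mode: alarm if the page is dissimilar to every normal cluster."""
    vec = vectorize(page)
    return max(cosine(vec, c) for c in centroids) < threshold


alarm_normal = raises_alarm("football results today", centroids)
alarm_abnormal = raises_alarm("explosive preparation guide", centroids)
print(alarm_normal, alarm_abnormal)
```

The threshold plays the role of the “very” dissimilar criterion mentioned above. Raising it makes the detector more sensitive at the cost of more false alarms.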
CONCLUSION
This chapter has briefly covered the wide potential of
employing data mining and Web mining techniques
as cyber warfare tools in the global campaign against
terrorists who are using cyber space for their malicious
interests. It is important to understand that applications
of data mining technology to cyber warfare are in no
way limited to the methods covered in this chapter.
We believe that as computers become more powerful
and Web users become better connected to each other,
we will see more information technologies aiding in
the war on terror, along with a higher level of technological
sophistication exhibited by cyber terrorists.
Examples of promising directions in the future of cyber
warfare research include cross-lingual Web content
mining, real-time data and Web mining, distributed
data mining, and many others.
REFERENCES
Ben-Dov, M., Wu, W., Feldman, R., & Cairns, P. A.
(2004). Improving knowledge discovery by combining
text-mining and link analysis techniques. Workshop
on Link Analysis, Counter-Terrorism, and Privacy,
in Conjunction with SIAM International Conference
on Data Mining.
Cesarano, C., d’Acierno, A., & Picariello, A. (2003,
November 7-8). An intelligent search agent system
for semantic information retrieval on the internet.
Proceedings of the Fifth ACM International Workshop
on Web Information and Data Management, New
Orleans, LA (pp. 111-117).
Chang, G., Healey, M. J., McHugh, J. A. M., & Wang,
J. T. L. (2001). Mining the World Wide Web: An in-
formation search approach. Norwell, MA: Kluwer
Academic Publishers.
Corera, G. (2004, October 6). Web wise terror network.
BBC NEWS. Retrieved October 9, 2004, from
http://news.bbc.co.uk/go/pr/fr/-/1/hi/world/3716908.stm
Elovici, Y., Kandel, A., Last, M., Shapira, B., Zaafrany,
O., Schneider, M., & Friedman, M. (2004). Terrorist
detection system. Proceedings of the 8th European
Conference on Principles and Practice of Knowledge
Discovery in Databases (PKDD 2004), Pisa, Italy
(LN in Artificial Intelligence 3202, pp. 540-542).
Springer-Verlag.
Elovici, Y., Shapira, B., Last, M., Zaafrany, O., Fried-
man, M., Schneider, M., & Kandel, A. (2005, May
19-20). Content-based detection of terrorists browsing
the web using an advanced terror detection system
(ATDS). Proceedings of the IEEE International
Conference on Intelligence and Security Informatics
(IEEE ISI-2005), Atlanta, GA (pp. 244-255).
Fawcett, T. & Provost, F. (1999). Activity monitoring:
Noticing interesting changes in behavior. Proceedings
of the Fifth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, San Diego,
CA (pp. 53-62).
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996).
From data mining to knowledge discovery: An over-
view. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
& R. Uthurusamy (Eds.), Advances in knowledge
discovery and data mining (pp. 1-34). Menlo Park,
CA: AAAI/MIT Press.
Han, J., & Kamber, M. (2001). Data mining: Concepts
and techniques. Morgan Kaufmann.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data
clustering: A review. ACM Computing Surveys, 31(3),
264-323.
Klusch, M. (2001). Information agent technology for
the internet: A survey [Special Issue on Intelligent
Information Integration. D. Fensel (Ed.)]. Journal on
Data and Knowledge Engineering, 36(3), 337-372.
Kruegel, C., & Vigna, G. (2003, October 27-31).
Anomaly detection of web-based attacks. Proceed-
ings of the 10th ACM Conference on Computer and
Communications Security (CCS’03), Washington,
DC (pp. 251-261).
Last, M. (2005). Computing temporal trends in web
documents. Proceedings of the Fourth Conference of
the European Society for Fuzzy Logic and Technology
(EUSFLAT 2005) (pp. 615-620).
Last, M., & Kandel, A. (1999). Automated percep-
tions in data mining. Proceedings of the 1999 IEEE
International Fuzzy Systems Conference, Part I (pp.
190-197).
Last, M., & Kandel, A. (2005). Fighting terror in cyberspace.
Singapore: World Scientific, Series in Machine
Perception and Artificial Intelligence, Vol. 65.
Last, M., & Maimon, O. (2004). A compact and accurate
model for classification. IEEE Transactions on
Knowledge and Data Engineering, 16(2), 203-215.
Last, M., Markov, A., & Kandel, A. (2006, April 9).
Multi-lingual detection of terrorist content on the
web. Proceedings of the PAKDD’06 Workshop on
Intelligence and Security Informatics (WISI’06),
Singapore (LN in Computer Science, Vol. 3917) (pp.
16-30). Springer.
Lipton, E., & Lichtblau, E. (2004, September 23). Even
near home, a new front is opening in the terror battle.
The New York Times.
Mena, J. (2004). Homeland security techniques and
technologies. Charles River Media.
Mendez-Torreblanca, A., Montes-y-Gomez, M., &
Lopez-Lopez, A. (2002). A trend discovery system
for dynamic web content mining. Proceedings of
CIC-2002.
Mitchell, T. M. (1997). Machine learning. McGraw-
Hill.
Pant, G., Srinivasan, P., & Menczer, F. (2004). Crawl-
ing the web. In M. Levene & A. Poulovassilis (Eds.),
Web dynamics. Springer.
Quinlan, J. R. (1993). C4.5: Programs for machine
learning. Morgan Kaufmann.
Sequeira, K. & Zaki, M. (2002). ADMIT: Anomaly-
based data mining for intrusions. Proceedings of the
Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 386-395).
Srikant, R. & Agrawal, R. (1996). Mining quantitative
association rules in large relational tables. Proceedings
of the 1996 ACM SIGMOD International Conference
on Management of Data, Montreal, Quebec, Canada
(pp. 1-12).
Talbot, D. (2005, February). Terror’s server. Technology
Review. Retrieved March 17, 2005, from
http://www.technologyreview.com/articles/05/02/issue/feature_terror.asp
Yu, E. S., Koo, P. C., & Liddy, E. D. (2000). Evolv-
ing intelligent text-based agents. Proceedings of the
Fourth International Conference on Autonomous
Agents, Barcelona, Spain (pp. 388-395).
KEY TERMS
Activity Monitoring: The process of monitor-
ing the behavior of a large population of entities
for interesting events requiring action (Fawcett &
Provost, 1999).
Anomaly Detection: The process of detecting
anomalies (irregularities that cannot be explained by
existing domain models and knowledge) by monitor-
ing system activity and classifying it as either normal
or anomalous.
Data Mining: The process of applying pattern
extraction algorithms to data. Data mining is con-
sidered the core stage of knowledge discovery in
databases (KDD).
Intelligent Software Agent: An autonomous
software program designed to perform a human-like
function (e.g., information search). Basic capabilities of
intelligent agents include adaptation and learning.
Knowledge Discovery in Databases (KDD): The
overall process of “identifying valid, novel, potentially
useful, and ultimately understandable patterns in data”
(Fayyad et al., 1996).
Web Mining: The application of data mining
algorithms to discover useful patterns from the Web.
Three main categories of Web mining include Web
usage mining, Web structure mining, and Web con-
tent mining.

Each value is associated with a specific key word or key phrase that may appear in a document. and Lopez-Lopez (2002). Autonomous information agents is the evolving solution to the problem of inaccurate and incomplete search indexes (Cesarano. analyze conditions. Most Web content mining techniques assume a static nature of the Web content. which utilizes a series of numeric values associated with each document. Feldman. the characteristics of the trend itself (e. Web InFormatIon agents An intelligent software agent is an autonomous program designed to perform a human-like function over a network or the Internet. Pant. Both types of pages may include nearly the same set of key words. Last. issue real-time alerts. where a trend is recognized by a change in frequency of certain topics over a period of time. & Picariello. and Wang (2001) have proposed several methods for change and trend detection in dynamic Web content. and profile possible threats (Mena. & Liddy (2000). A novel. can discover other types of links between unstructured documents such as links between locations.. this popular method of document representation does not capture important structural information. The process goes on using the links on the new pages until the agent resources are exhausted. determining automatically the relevancy of retrieved pages. The proposed approach is evaluated on a collection of 648 Web documents in Arabic. Agents represent a key technology to homeland security due to their capability to monitor multiple diverse locations. In addition.. Markov.Data Mining with information extraction (IE) tools. 2001. Montes-y-Gomez. 2003. Koo. Srinivasan.g. & Menczer (2004. However. McHugh. Timely detection of an ongoing trend in certain Web content may trigger periodic retraining of the data- mining algorithm. information agents are responsible for filtering and organizing unrelated and scattered data such as large amounts of unstructured Web documents. d’Acierno. 
or the system objectives are reached (Pant et al. Another trend discovery system for mining dynamic content of news Web sites is presented by Mendez-Torreblanca. organizations.  . and then following the most promising links from those pages. Healey. 2004). 2004).
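
The search-agent loop described in this section (score retrieved pages for relevance, follow the most promising links, and stop when resources are exhausted, no pages remain, or the objective is reached) can be illustrated with a minimal best-first crawl. This is only a sketch under stated assumptions, not the method of any cited system: the in-memory `TOY_WEB` dictionary, the term-overlap `relevance` score, and the `max_pages` and `goal_hits` parameters are all hypothetical stand-ins for real search engines, relevance models, and resource budgets.

```python
import heapq

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("news portal with many topics", ["a", "b"]),
    "a": ("training camp announcement", ["c"]),
    "b": ("sports results and weather", ["d"]),
    "c": ("camp location details", []),
    "d": ("cooking recipes", []),
}


def relevance(text, query_terms):
    """Fraction of query terms that occur in the page text (illustrative score)."""
    words = set(text.lower().split())
    return sum(term in words for term in query_terms) / len(query_terms)


def best_first_crawl(web, seeds, query_terms, max_pages=10, goal_hits=None):
    """Visit pages in order of estimated promise until a stop condition holds."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    visited, hits = set(), []
    while frontier:  # stop: no more pages to browse
        if len(visited) >= max_pages:  # stop: agent resources exhausted
            break
        if goal_hits is not None and len(hits) >= goal_hits:  # stop: objective reached
            break
        _, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        score = relevance(text, query_terms)
        if score > 0:
            hits.append((url, score))
        # Outgoing links inherit the score of the page they were found on.
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return hits
```

Calling `best_first_crawl(TOY_WEB, ["seed"], ["training", "camp"])` visits page `a` before page `b` only because of alphabetical tie-breaking, but then follows the link from `a` to `c` ahead of `b` because `a` scored higher. Inheriting the parent page's score is one simple way to rank unvisited links; a real agent would likely use anchor text or a learned model instead.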

Intelligent information agents can be classified in several ways (Klusch, 2001). They can be either cooperative or non-cooperative with each other. Agents can also be adaptive, that is, continue to learn from the environment and change their behavior accordingly. Agent functionality is usually based on a set of information processing rules that may be explicitly specified by the user, acquired by a knowledge engineer, or induced by data mining algorithms. The most popular data mining techniques used by information agent systems include artificial neural networks, genetic algorithms, reinforcement learning, and case-based reasoning. According to Klusch (2001), any information agent should possess the following key capabilities: access to heterogeneous sites and resources on the Web (from static pages to Web-based applications), retrieving and filtering data from any kind of digital medium (including documents written in any language and multimedia information), processing of ontological knowledge (e.g., expressed by semantic networks), and information visualization.

ANOMALY DETECTION AND ACTIVITY MONITORING

In activity monitoring, analysis of data streams is applied in order to detect interesting behaviors, which are referred to as “positive activities.” Positive activities should be different from each other and should be different from the non-positive monitored activities. Indication of a positive activity is called an alarm (Fawcett & Provost, 1999). There exist many applications that use activity monitoring such as computer intrusion detection, fraud detection, crisis monitoring, network performance monitoring, and news story monitoring. The representation of the input in these applications might be completely different from each other. The input data can be, for example, a feature vector, a collection of documents, or a stream of numbers. In activity monitoring the goal is to issue an accurate alarm on time. In order to address this goal, data mining techniques like classification, regression, and time series analysis are applied.

Anomaly detection relies on models of the intended behavior of users and applications and interprets deviations from this “normal” behavior as evidence of malicious (e.g., terrorist-related) activity (Kruegel & Vigna, 2003). This approach is complementary with respect to signature-based detection, where a number of attack descriptions (usually in the form of signatures) are matched against the stream of input data, looking for evidence that one of the expected attacks (e.g., a known computer virus) is taking place. The basic assumption underlying anomaly detection is that attack patterns differ from normal behavior. There are two main stages in anomaly detection. In the first stage, a representation of the “normal” behavior is obtained by applying some data mining algorithms to examples of normal behavior. In the second stage, events different from the “normal” behavior are detected and classified as suspected to be malicious.

Most intrusion detection systems that use anomaly detection (Sequeira & Zaki, 2002) monitor user actions and operations rather than content accessed by the users as an audit source. In Elovici, Kandel et al. (2004) and Elovici, Shapira et al. (2005), a Terrorist Detection System (TDS) is presented aimed at tracking down suspected terrorists by analyzing the content of information they access on the Web. The system operates in two modes: the training mode is activated off-line, and the detection mode is operated in real-time. In the training mode, TDS is provided with Web pages of normal users from which it derives their normal behavior profile by applying data mining (clustering) algorithms to the training data. In the detection mode, TDS performs real-time monitoring of the traffic emanating from the monitored group of users, analyzes the content of the Web pages they access, and generates an alarm if a user accesses abnormal information, that is, the content of the information accessed is “very” dissimilar to the typical content in the monitored environment.
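
The two-stage scheme described in this section (learn a profile of normal content off-line, then raise an alarm when accessed content is very dissimilar to it) can be illustrated with a small content-based sketch. It is not the TDS implementation: the single centroid below is an assumed simplification of the clustering step, pages are represented as plain term-frequency vectors, and the `threshold` value of 0.2 is a hypothetical choice rather than a published setting.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector for one page (bag-of-words, vector-space model)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_profile(normal_pages):
    """Training mode (off-line): summarize normal pages as one centroid.

    Assumption: a single centroid stands in for the clustering step; a
    fuller version would keep one centroid per cluster of normal content.
    """
    centroid = Counter()
    for page in normal_pages:
        centroid.update(tf_vector(page))
    return centroid


def is_anomalous(page, centroid, threshold=0.2):
    """Detection mode: alarm when the page is too dissimilar to the profile."""
    return cosine(tf_vector(page), centroid) < threshold


# Usage: build the profile once, then check each newly accessed page.
profile = train_profile([
    "football match results today",
    "weather forecast for today",
    "football league table results",
])
alarm = is_anomalous("chemical synthesis instructions", profile)
```

With this toy profile, a page that shares no terms with the normal pages has similarity 0 and triggers an alarm, while a page such as "football results today" is well above the threshold. Because cosine direction is unaffected by scaling, using summed rather than averaged term counts for the centroid does not change the result.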

CONCLUSION

This chapter has briefly covered the wide potential of employing data mining and Web mining techniques as cyber warfare tools in the global campaign against terrorists who are using cyber space for their malicious interests. It is important to understand that applications of data mining technology to cyber warfare are in no way limited to the methods covered in this chapter. We believe that as computers become more powerful and Web users become better connected to each other we will see more information technologies aiding in the war on terror, along with a higher level of technological sophistication exposed by cyber terrorists. Examples of promising directions in the future of cyber warfare research include cross-lingual Web content mining, real-time data and Web mining, distributed data mining, and many others.

REFERENCES

Ben-Dov, M., Wu, W., Feldman, R., & Cairns, P. A. (2004). Improving knowledge discovery by combining text-mining and link analysis techniques. Workshop on Link Analysis, Counter-Terrorism, and Privacy, in Conjunction with SIAM International Conference on Data Mining.

Cesarano, C., d’Acierno, A., & Picariello, A. (2003, November 7-8). An intelligent search agent system for semantic information retrieval on the internet. Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, New Orleans, LA (pp. 111-117).

Chang, G., Healey, M. J., McHugh, J. A. M., & Wang, J. T. L. (2001). Mining the World Wide Web: An information search approach. Norwell, MA: Kluwer Academic Publishers.

Corera, G. (2004, October 6). Web wise terror network. BBC NEWS. Retrieved October 9, 2004, from http://news.bbc.co.uk/go/pr/fr/-/1/hi/world/3716908.stm

Elovici, Y., Kandel, A., Last, M., Shapira, B., Zaafrany, O., Schneider, M., & Friedman, M. (2004). Terrorist detection system. Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004), Pisa, Italy (LN in Artificial Intelligence 3202, pp. 540-542). Springer-Verlag.

Elovici, Y., Shapira, B., Last, M., Zaafrany, O., Friedman, M., Schneider, M., & Kandel, A. (2005, May 19-20). Content-based detection of terrorists browsing the web using an advanced terror detection system (ATDS). Proceedings of the IEEE International Conference on Intelligence and Security Informatics (IEEE ISI-2005), Atlanta, GA (pp. 244-255).

Fawcett, T., & Provost, F. (1999). Activity monitoring: Noticing interesting changes in behavior. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA (pp. 53-62).

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 1-34). Menlo Park, CA: AAAI/MIT Press.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Klusch, M. (2001). Information agent technology for the internet: A survey [Special Issue on Intelligent Information Integration, D. Fensel (Ed.)]. Journal on Data and Knowledge Engineering, 36(3), 337-372.

Kruegel, C., & Vigna, G. (2003, October 27-31). Anomaly detection of web-based attacks. Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS’03), Washington, DC (pp. 251-261).

Last, M. (2005). Computing temporal trends in web documents. Proceedings of the Fourth Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2005) (pp. 615-620).

Last, M., & Kandel, A. (1999). Automated perceptions in data mining. Proceedings of the 1999 IEEE International Fuzzy Systems Conference, Part I (pp. 190-197).

Last, M., & Kandel, A. (2005). Fighting terror in cyberspace. Singapore: World Scientific. Series in Machine Perception and Artificial Intelligence, Vol. 65.

Last, M., & Maimon, O. (2004). A compact and accurate model for classification. IEEE Transactions on Knowledge and Data Engineering, 16(2), 203-215.

Last, M., Markov, A., & Kandel, A. (2006, April 9). Multi-lingual detection of terrorist content on the web. Proceedings of the PAKDD’06 Workshop on Intelligence and Security Informatics (WISI’06), Singapore (LN in Computer Science, Vol. 3917) (pp. 16-30). Springer.

Lipton, E., & Lichtblau, E. (2004, September 23). Even near home, a new front is opening in the terror battle. The New York Times.

Mena, J. (2004). Homeland security techniques and technologies. Charles River Media.

Mendez-Torreblanca, A., Montes-y-Gomez, M., & Lopez-Lopez, A. (2002). A trend discovery system for dynamic web content mining. Proceedings of CIC-2002.

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Pant, G., Srinivasan, P., & Menczer, F. (2004). Crawling the web. In M. Levene & A. Poulovassilis (Eds.), Web dynamics. Springer.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Sequeira, K., & Zaki, M. (2002). ADMIT: Anomaly-based data mining for intrusions. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 386-395).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada (pp. 1-12).

Talbot, D. (2005, February). Terror’s server. Technology Review. Retrieved March 17, 2005, from http://www.technologyreview.com/articles/05/02/issue/feature_terror.asp

Yu, E. S., Koo, P. C., & Liddy, E. D. (2000). Evolving intelligent text-based agents. Proceedings of the Fourth International Conference on Autonomous agents, Barcelona, Spain (pp. 388-395).

KEY TERMS

Activity Monitoring: The process of monitoring the behavior of a large population of entities for interesting events requiring action (Fawcett & Provost, 1999).

Anomaly Detection: The process of detecting anomalies (irregularities that cannot be explained by existing domain models and knowledge) by monitoring system activity and classifying it as either normal or anomalous.

Data Mining: The process of applying pattern extraction algorithms to data. Data mining is considered the core stage of knowledge discovery in databases (KDD).

Intelligent Software Agent: An autonomous software program designed to perform a human-like function (e.g., information search). Basic capabilities of intelligent agents include adaptation and learning.

Knowledge Discovery in Databases (KDD): The overall process of “identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad et al., 1996).

Web Mining: The application of data mining algorithms to discover useful patterns from the Web. Three main categories of Web mining include Web usage mining, Web structure mining, and Web content mining.