INTRODUCTION
Most of us, without doubt, already know what the World Wide Web is and have used it
extensively. The World Wide Web (or the Web for short) has had an impact on almost every
aspect of our lives. It is the biggest and most widely known information source that is
easily accessible and searchable. It consists of billions of interconnected documents
(called Web pages) which are authored by millions of people. Since its inception, the Web
has dramatically changed our information seeking behaviour. Before the Web, finding
information meant asking a friend or an expert, or buying/borrowing a book to read.
However, with the Web, everything is only a few clicks away from the comfort of our
homes or offices. Not only can we find needed information on the Web, but we can also
easily share our information and knowledge with others.
The Web has also become an important channel for conducting businesses. We can
buy almost anything from online stores without needing to go to a physical shop. The Web
also provides convenient means for us to communicate with each other, to express our
views and opinions on anything, and to discuss with people from anywhere in the world.
The Web is truly a virtual society. In this chapter, we introduce the Web, its history, and
the topics that we will discuss in the seminar.
The idea of hypertext was invented in 1965 by Ted Nelson, who also created the
well-known hypertext system Xanadu (http://xanadu.com/). Hypertext that also allows
other media (e.g., image, audio and video files) is called hypermedia.
The Web was invented in 1989 by Tim Berners-Lee, who, at that time, worked at
CERN (Conseil Européen pour la Recherche Nucléaire, or the European Laboratory for
Particle Physics) in Switzerland. He coined the term “World Wide Web,” wrote the first
World Wide Web server, httpd, and the first client program (a browser and editor), named
WorldWideWeb. Initially, the proposal did not receive the needed support. However, in 1990,
Berners-Lee re-circulated the proposal and received the support to begin the work. With
this project, Berners-Lee and his team at CERN laid the foundation for the future
development of the Web as a distributed hypertext system. They introduced their server
and browser, the protocol used for communication between clients and the server, the
HyperText Transfer Protocol (HTTP), the HyperText Markup Language (HTML) used for
authoring Web documents, and the Universal Resource Locator (URL).
The next significant event in the development of the Web was the arrival of
Mosaic. In February of 1993, Marc Andreessen from the University of Illinois’ NCSA
(National Center for Supercomputing Applications) and his team released the first "Mosaic
for X" graphical Web browser for UNIX. A few months later, different versions of Mosaic
were released for Macintosh and Windows operating systems. This was an important
event. For the first time, a Web client, with a consistent and simple point-and-click
graphical user interface, was implemented for the three most popular operating systems
available at the time. It soon made big splashes outside the academic circle where it had
begun. In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc
Andreessen, and they founded the company Mosaic Communications (later renamed
Netscape Communications). Within a few months, the Netscape browser was released to
the public, which started the explosive growth of the Web. Internet Explorer from
Microsoft entered the market in August 1995 and began to challenge Netscape.
The creation of the World Wide Web by Tim Berners-Lee and the subsequent release
of the Mosaic browser are often regarded as the two most significant contributing factors
to the success and popularity of the Web.
INTERNET:
The Web would not be possible without the Internet, which provides the
communication network for the Web to function. The Internet started with the computer
network ARPANET in the Cold War era. It grew out of a project in the United States
aimed at maintaining control over missiles and bombers after a nuclear
attack. It was supported by the Advanced Research Projects Agency (ARPA), which was part
of the Department of Defense in the United States. The first ARPANET connections were
made in 1969, and in 1972, it was demonstrated at the First International Conference on
Computers and Communication, held in Washington D.C. At the conference, ARPA
scientists linked computers together from 40 different locations.
SEARCH ENGINES:
With information being shared worldwide, there was a need for individuals to find
information in an orderly and efficient manner. Thus began the development of search
engines. The search system Excite was introduced in 1993 by six Stanford University
students. EINet Galaxy was established in 1994 as part of the MCC Research Consortium
at the University of Texas. Jerry Yang and David Filo created Yahoo! in 1994, which
started out as a listing of their favourite Web sites, and offered directory search. In
subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista,
Inktomi, Ask Jeeves, Northern Light, etc.
Google was launched in 1998 by Sergey Brin and Larry Page based on their
research project at Stanford University. Microsoft began to commit to search in 2003
and launched the MSN search engine in spring 2005; before that, it had licensed search
technology from other companies. Yahoo! provided a general search capability in 2004
after it purchased Inktomi in 2003.
W3C was formed in December 1994 by MIT and CERN as an international
organization to lead the development of the Web. W3C's main objective was “to promote
standards for the evolution of the Web and interoperability between WWW products by
producing specifications and reference software.” The first International Conference on
World Wide Web (WWW) was also held in 1994, which has been a yearly event ever
since. From 1995 to 2001, the growth of the Web boomed. Investors saw commercial
potential in the Web and poured money into Internet start-ups, fueling the dot-com boom.
The rapid growth of the Web in the last decade makes it the largest publicly
accessible data source in the world. The Web has many unique characteristics, which make
mining useful information and knowledge a fascinating and challenging task. Let us
review some of these characteristics.
1. The amount of data/information on the Web is huge and still growing. The coverage of
the information is also very wide and diverse. One can find information on almost
anything on the Web.
2. Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages,
unstructured texts, and multimedia files (images, audio, and video).
3. Information on the Web is heterogeneous. Due to the diverse authorship of Web pages,
multiple pages may present the same or similar information using completely different
words and/or formats. This makes integration of information from multiple pages a
challenging problem.
4. A significant amount of information on the Web is linked. Hyperlinks exist among Web
pages within a site and across different sites. Within a site, hyperlinks serve as information
organization mechanisms. Across different sites, hyperlinks represent implicit conveyance
of authority to the target pages. That is, those pages that are linked (or pointed) to by many
other pages are usually high quality pages or authoritative pages simply because many
people trust them.
5. The information on the Web is noisy. The noise comes from two main sources. First, a
typical Web page contains many pieces of information, e.g., the main content of the page,
navigation links, advertisements, copyright notices, privacy policies, etc. For a particular
application, only part of the information is useful. The rest is considered noise. To perform
fine-grained Web information analysis and data mining, the noise should be removed.
Second, because the Web does not have quality control of information, i.e., one can
publish almost anything one likes, a large amount of information on the Web is of low
quality, erroneous, or even misleading.
6. The Web is also about services. Most commercial Web sites allow people to perform
useful operations at their sites, e.g., to purchase products, to pay bills, and to fill in forms.
7. The Web is dynamic. Information on the Web changes constantly. Keeping up with the
change and monitoring the change are important issues for many applications.
8. The Web is a virtual society. The Web is not only about data, information and services,
but also about interactions among people, organizations and automated systems. One can
communicate with people anywhere in the world easily and instantly, and also express
one’s views on anything in Internet forums, blogs and review sites.
All these characteristics present both challenges and opportunities for mining and
discovery of information and knowledge from the Web.
A traditional data mining application typically consists of the following steps:
• Pre-processing: The raw data is usually not suitable for mining, for various reasons.
It may need to be cleaned in order to remove noise or abnormalities. The data may also
be too large and/or involve many irrelevant attributes, which calls for data reduction
through sampling and attribute selection.
• Data mining: The processed data is then fed to a data mining algorithm, which
produces patterns or knowledge.
• Post-processing: In many applications, not all discovered patterns are useful. This step
identifies the useful ones for the application. Various evaluation and visualization
techniques are used to make the decision.
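To make these three steps concrete, here is a minimal Python sketch, with toy transaction data and an invented minimum-support threshold, that runs a simple frequent-pair task end to end:

from collections import Counter
from itertools import combinations

# Toy transaction data; in practice this comes from a database or log files.
raw_transactions = [
    ["beer", "chips", "chips"],   # duplicate item: needs cleaning
    ["beer", "diapers"],
    ["beer", "chips", "diapers"],
    [],                           # empty record: an abnormality to remove
]

# Pre-processing: de-duplicate items within a transaction, drop empty records.
transactions = [sorted(set(t)) for t in raw_transactions if t]

# Data mining: count co-occurring item pairs (a minimal frequent-pattern step).
pair_counts = Counter(p for t in transactions for p in combinations(t, 2))

# Post-processing: keep only patterns meeting a minimum support threshold.
MIN_SUPPORT = 2
patterns = {pair: n for pair, n in pair_counts.items() if n >= MIN_SUPPORT}
print(patterns)  # {('beer', 'chips'): 2, ('beer', 'diapers'): 2}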
The whole process (also called the data mining process) is almost always iterative.
It usually takes many rounds to achieve final satisfactory results, which are then
incorporated into real-world operational tasks. Traditional data mining uses structured data
stored in relational tables, spreadsheets, or flat files in tabular form. With the growth
of the Web and text documents, Web mining and text mining are becoming increasingly
important and popular.
Web mining aims to discover useful information or knowledge from the Web
hyperlink structure, page content, and usage data. Although Web mining uses many data
mining techniques, as mentioned above, it is not purely an application of traditional data
mining due to the heterogeneity and semi-structured or unstructured nature of the Web
data. Many new mining tasks and algorithms were invented in the past decade. Based on
the primary kinds of data used in the mining process, Web mining tasks can be categorized
into three types: Web structure mining, Web content mining and Web usage mining.
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services. Web mining should be decomposed
into these subtasks:
1. Resource finding: the task of retrieving intended Web documents. This is the
process of retrieving data from text sources available on the Web, such as
electronic magazines and newsletters or the text contents of HTML documents.
2. Information selection and pre-processing: the automatic transformation of the
information retrieved in the resource-finding step. These transformations include
removing stop words, finding phrases in the training corpus, and transforming the
representation to a relational or first-order logic form.
3. Generalization: the automatic discovery of general patterns in the selected data.
Data mining and machine learning techniques are often used for generalization.
4. Analysis: the validation and/or interpretation of the mined patterns. People play a
very important role in this last step of the information and knowledge discovery
process.
Web mining is categorized into three areas of interest based on the part of the Web
to be mined:
1. Web content mining
• the discovery of useful information from Web contents, data and documents
• two different points of view: the IR view and the DB view
2. Web structure mining
• models the link structure and topology of hyperlinks
• categorization of Web pages
3. Web usage mining
• mines secondary data derived from user interactions with the Web
Web content mining is the process of extracting knowledge from the content of
documents or their descriptions. Web structure mining is the process of inferring
knowledge from the Web organization and links between references and referents in the
Web.
In this seminar, we will discuss all these three types of mining. However, due to
the richness and diversity of information on the Web, there are a large number of Web
mining tasks. We will not be able to cover them all. We will only focus on some important
tasks and their algorithms.
The Web mining process is similar to the data mining process. The difference is
usually in the data collection. In traditional data mining, the data is often already collected
and stored in a data warehouse. For Web mining, data collection can be a substantial task,
especially for Web structure and content mining, which involves crawling a large number
of target Web pages. We will devote a whole chapter to crawling.
Once the data is collected, we go through the same three-step process: data pre-
processing, Web data mining and post-processing. However, the techniques used for each
step can be quite different from those used in traditional data mining.
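For illustration, the following is a minimal sketch of the data-collection step, a breadth-first crawler written against only the Python standard library; the seed URL, page limit, and politeness delay are illustrative choices, not prescriptions from the text:

import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10, delay=1.0):
    """Breadth-first crawl starting from seed, fetching at most max_pages."""
    seen, frontier, pages = {seed}, deque([seed]), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable pages
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)  # politeness: do not hammer the server
    return pages

# Example usage: pages = crawl("https://example.com/")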
Data mining has attracted a great deal of attention in the information industry and
in society as a whole in recent years, due to the wide availability of huge amounts of data
and the imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from market
analysis, fraud detection, and customer retention, to production control and science
exploration.
Since the 1960s, database and information technology has been evolving
systematically from primitive file processing systems to sophisticated and powerful
database systems. The research and development in database systems since the 1970s has
progressed from early hierarchical and network database systems to the development of
relational database systems, data modelling tools, and indexing and accessing methods. In
addition, users gained convenient and flexible data access through query languages, user
interfaces, optimized query processing, and transaction management. Efficient methods
for on-line transaction processing (OLTP), where a query is viewed as a read-only
transaction, have contributed substantially to the evolution and wide acceptance of
relational technology as a major tool for efficient storage, retrieval, and management of
large amounts of data.
Database technology since the mid-1980s has been characterized by the popular
adoption of relational technology and an upsurge of research and development activities
on new and powerful database systems. These promote the development of advanced data
models such as extended-relational, object-oriented, object-relational, and deductive
models. Application-oriented database systems, including spatial, temporal, multimedia,
active, stream, and sensor, and scientific and engineering databases, knowledge bases, and
office information bases, have flourished. Issues related to the distribution, diversification,
and sharing of data have been studied extensively. Heterogeneous database systems and
Internet-based global information systems such as the World Wide Web (WWW) have also
emerged and play a vital role in the information industry.
The steady and amazing progress of computer hardware technology in the past
three decades has led to large supplies of powerful and affordable computers, data
collection equipment, and storage media. This technology provides a great boost to the
database and information industry, and makes a huge number of databases and information
repositories available for transaction management, information retrieval, and data analysis.
Data can now be stored in many different kinds of databases and information
repositories. One data repository architecture that has emerged is the data warehouse, a
repository of multiple heterogeneous data sources organized under a unified schema at a
single site in order to facilitate management decision making. Data warehouse technology
includes data cleaning, data integration, and on-line analytical processing (OLAP), that is,
analysis techniques with functionalities such as summarization, consolidation, and
aggregation as well as the ability to view information from different angles. Although
OLAP tools support multidimensional analysis and decision making, additional data
analysis tools are required for in-depth analysis, such as data classification, clustering, and
the characterization of data changes over time. In addition, huge volumes of data can be
accumulated beyond databases and data warehouses. Typical examples include the World
Wide Web and data streams, where data flow in and out like streams, as in applications
like video surveillance, telecommunication, and sensor networks. The effective and
efficient analysis of data in such different forms becomes a challenging task.
The abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data rich but information poor situation. The fast-growing, tremendous
amount of data, collected and stored in large and numerous data repositories, has far
exceeded our human ability for comprehension without powerful tools. As a result, data
collected in large data repositories become “data tombs”—data archives that are seldom
visited. Consequently, important decisions are often made based not on the information-
rich data stored in data repositories, but rather on a decision maker’s intuition, simply
because the decision maker does not have the tools to extract the valuable knowledge
embedded in the vast amounts of data. In addition, consider expert system technologies,
which typically rely on users or domain experts to manually input knowledge into
knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is
extremely time-consuming and costly. Data mining tools perform data analysis and may
uncover important data patterns, contributing greatly to business strategies, knowledge
bases, and scientific and medical research. The widening gap between data and
information calls for a systematic development of data mining tools that will turn data
tombs into “golden nuggets” of knowledge.
Simply stated, data mining refers to extracting or “mining” knowledge from large
amounts of data. The term is actually a misnomer. Remember that the mining of gold from
rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data
mining should have been more appropriately named “knowledge mining from data,”
which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not
reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid
term characterizing the process that finds a small set of precious nuggets from a great deal
of raw material. Thus, such a misnomer that carries both “data” and “mining” became a
popular choice. Many other terms carry a similar or slightly different meaning to data
mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as
simply an essential step in the process of knowledge discovery. Knowledge discovery as a
process is depicted in Figure 1 and consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
We agree that data mining is a step in the knowledge discovery process. However,
in industry, in media, and in the database research milieu, the term data mining is
becoming more popular than the longer term knowledge discovery from data.
Therefore, in this seminar, we choose to use the term data mining. We adopt a broad view of
data mining functionality: data mining is the process of discovering interesting knowledge
from large amounts of data stored in databases, data warehouses, or other information
repositories.
Based on this view, the architecture of a typical data mining system may have the
following major components (Figure 2):
Database, data warehouse, World Wide Web, or other information repository: This is
one or a set of databases, data warehouses, spreadsheets, or other kinds of information
repositories. Data cleaning and data integration techniques may be performed on the data.
Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and metadata
(e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution
analysis.
User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results. In addition, this component allows the user
to browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
Although there are many “data mining systems” on the market, not all of them can
perform true data mining. A data analysis system that does not handle large amounts of
data should be more appropriately categorized as a machine learning system, a statistical
data analysis tool, or an experimental system prototype. A system that can only perform
data or information retrieval, including finding aggregate values, or that performs
deductive query answering in large databases should be more appropriately categorized as
a database system, an information retrieval system, or a deductive database system.
Memory Based Reasoning - This technique has results similar to neural networks but goes
about it differently. MBR looks for “neighbour” data points rather than patterns. If you
look at insurance claims and want to know which ones the adjudicators should look at and
which they can just let go through the system, you would set up a set of claims you want
adjudicated and let the technique find similar claims.
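The neighbour-finding idea behind MBR is essentially k-nearest-neighbour classification. A minimal sketch, assuming scikit-learn is available and using invented claim features:

from sklearn.neighbors import KNeighborsClassifier

# Toy claim records: [claim_amount, days_since_policy_start]
claims = [[500, 300], [15000, 10], [700, 250], [20000, 5]]
labels = ["auto-approve", "adjudicate", "auto-approve", "adjudicate"]

# MBR-style model: classify a new claim by its most similar past claims.
model = KNeighborsClassifier(n_neighbors=3).fit(claims, labels)
print(model.predict([[18000, 7]]))  # ['adjudicate']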
Link Analysis - This is another technique for associating like records. It is not used very
much, but some tools have been created just for this purpose. As the name suggests, the
technique tries to find links among customers, transactions, etc., and demonstrate those links.
Visualization - This technique helps users understand their data. Visualization makes the
bridge from text based to graphical presentation. Such things as decision tree, rule, cluster
and pattern visualization help users see data relationships rather than read about them.
Many of the stronger data mining programs have made strides in improving their visual
content over the past few years. This is really the vision of the future of data mining and
analysis. Data volumes have grown to such huge levels that it will soon be impossible for
humans to process them effectively by any text-based method. We will probably see an
approach to data mining using visualization appear that will be something like Microsoft’s
Photosynth. The technology is there, it will just take an analyst with some vision to sit
down and put it together.
Decision Tree/Rule Induction - Decision trees use real data mining algorithms. Decision
trees help with classification and spit out information that is very descriptive, helping
users to understand their data. A decision tree process will generate the rules followed in a
process. For example, a lender at a bank goes through a set of rules when approving a
loan. Based on the loan data a bank has, the outcomes of the loans (default or paid), and
limits of acceptable levels of default, the decision tree can set up the guidelines for the
lending institution. These decision trees are very similar to the first decision support (or
expert) systems.
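The lending example can be sketched in a few lines. This assumes scikit-learn; the loan records and feature names are invented for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy loan history: [income_in_thousands, debt_ratio_percent]
loans = [[80, 10], [25, 60], [60, 20], [30, 55], [90, 15], [20, 70]]
outcomes = ["paid", "default", "paid", "default", "paid", "default"]

tree = DecisionTreeClassifier(max_depth=2).fit(loans, outcomes)

# The induced rules read as human-readable lending guidelines.
print(export_text(tree, feature_names=["income", "debt_ratio"]))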
Genetic Algorithms - GAs are techniques that act like bacteria growing in a Petri dish.
You set up a data set, then give the GA the ability to try different things and a way to score
whether a direction or outcome is favourable. The GA will move in a direction that will hopefully optimize the
final result. GAs are used mostly for process optimization, such as scheduling, workflow,
batching, and process re-engineering. Think of GA as simulations run over and over to
find optimal results and the infrastructure around being able to both run the simulations
and the ways to set up which results are optimal.
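A minimal sketch of the generate-score-select loop a GA runs, using only the Python standard library; the bit-string encoding and the toy fitness function stand in for a real scheduling or workflow objective:

import random

TARGET_LEN = 12  # length of the bit-string encoding a candidate solution

def fitness(candidate):
    # Placeholder objective: maximize the number of 1-bits.
    return sum(candidate)

def evolve(pop_size=20, generations=50, mutation_rate=0.05):
    population = [[random.randint(0, 1) for _ in range(TARGET_LEN)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        # Crossover and mutation: rebuild the population from survivors.
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, TARGET_LEN)
            child = a[:cut] + b[cut:]
            children.append([bit ^ 1 if random.random() < mutation_rate else bit
                             for bit in child])
        population = survivors + children
    return max(population, key=fitness)

print(evolve())  # converges toward all 1s for this toy objective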
Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A
comprehensive data mining system usually provides multiple and/or integrated data
mining functionalities.
Classification according to the kinds of techniques utilized: Data mining systems can
be categorized according to the underlying data mining techniques employed. These
techniques can be described according to the degree of user interaction involved (e.g.,
autonomous systems, interactive exploratory systems, query-driven systems) or the
methods of data analysis employed (e.g., database-oriented or data warehouse– oriented
techniques, machine learning, statistics, visualization, pattern recognition, neural
networks, and so on). A sophisticated data mining system will often adopt multiple data
mining techniques or work out an effective, integrated technique that combines the merits
of a few individual approaches.
Classification according to the applications adapted: Data mining systems can also be
categorized according to the applications they adapt. For example, data mining systems
may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail,
and so on. Different applications often require the integration of application-specific
methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific
mining tasks.
Each user will have a data mining task in mind, that is, some form of data analysis
that he or she would like to have performed. A data mining task can be specified in the
form of a data mining query, which is input to the data mining system. A data mining
query is defined in terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery in order to direct
the mining process, or examine the findings from different angles or depths. The data
mining primitives specify the following.
The set of task-relevant data to be mined: This specifies the portions of the database or
the set of data in which the user is interested. This includes the database attributes or data
warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about
the domain to be mined is useful for guiding the knowledge discovery process and for
evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs
regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: They may be
used to guide the mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes.
The following figure shows the architecture of Web mining. It is divided
into two stages: stage 1 contains all the data, and stage 2 performs the analysis.
According to analysis target, web mining can be divided into three different types,
which are web usage mining, web content mining and web structure mining.
Data mining is the nontrivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data (Fayyad). The most commonly used
techniques in data mining are artificial neural networks, decision trees, genetic algorithms,
the nearest neighbour method, and rule induction. Data mining research has drawn on a
number of other fields such as inductive learning, machine learning and statistics etc.
Inductive learning – Induction means the inference of information from data, and
inductive learning is a model-building process in which the database is analyzed to find
patterns. The main strategies are supervised learning and unsupervised learning.
Statistics: used to detect unusual patterns and explain patterns using statistical models
such as linear models.
A data mining model can be a discovery model, in which the system automatically
discovers important information hidden in the data, or a verification model, which takes a
hypothesis from the user and tests its validity against the data.
The Web contains a collection of pages that includes countless hyperlinks and huge
volumes of access and usage information. Because of the ever-increasing amount of
information in cyberspace, knowledge discovery and web mining are becoming critical for
successfully conducting business in the cyber world. Web mining is the discovery and
analysis of useful information from the web. Web mining is the use of data mining
techniques to automatically discover and extract information from web documents and
services (content, structure, and usage).
Web mining can be viewed in two ways:
i. Process-centric view – Web mining as a sequence of tasks.
ii. Data-centric view – Web mining in terms of the types of Web data used in the
mining process.
The important data mining techniques applied in the web domain include
Association Rule, Sequential pattern discovery, clustering, path analysis, classification and
outlier discovery.
1. Association Rule Mining: Predicts the association and correlation among sets of
items, where the presence of one set of items in a transaction implies (with a
certain degree of confidence) the presence of other items. That is, it:
1) Discovers the correlations between pages that are most often referenced together
in a single server session/user session.
2) Provide the information:
i. What are the set of pages frequently accessed together by web users?
ii. What page will be fetched next?
iii. What are paths frequently accessed by web users?
3) Associations and correlations:
i. Page association from usage data – user sessions, user transactions.
ii. Page associations from content data – similarity based on content analysis
iii. Page associations based on structure -- link connectivity between pages.
Advantages:
Guide for web site restructuring – by adding links that interconnect pages often
viewed together.
Improve system performance by pre-fetching Web data (see the sketch below).
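As a toy illustration of mining page associations from usage data, the sketch below counts pages co-occurring in sessions and derives rule confidences; the sessions and the minimum-support threshold are invented:

from collections import Counter
from itertools import combinations

# Toy user sessions: each is the set of pages visited in one session.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/blog"},
    {"/products", "/cart"},
]

page_counts = Counter(p for s in sessions for p in s)
pair_counts = Counter(frozenset(p) for s in sessions
                      for p in combinations(sorted(s), 2))

# Rule A -> B holds with confidence = support(A and B) / support(A).
for pair, joint in pair_counts.items():
    a, b = sorted(pair)
    if joint >= 2:  # minimum support threshold
        print(f"{a} -> {b}: confidence {joint / page_counts[a]:.2f}")
        print(f"{b} -> {a}: confidence {joint / page_counts[b]:.2f}")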
2. Sequential pattern discovery: Applied to web access server transaction logs. The
purpose is to discover sequential patterns that indicate user visit patterns over a
certain period. That is, the order in which URLs tend to be accessed.
Advantage:
Useful user trends can be discovered.
Predictions concerning visit pattern can be made.
To improve website navigation.
Personalize advertisements.
Dynamically reorganize link structure and adopt web site contents to individual
client requirements or to provide clients with automatic recommendations that
best suit customer profiles.
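The simplest form of this idea can be sketched by counting consecutive page transitions in ordered sessions and predicting the likely next page; production systems use dedicated algorithms such as GSP or PrefixSpan, and the sessions below are invented:

from collections import Counter

# Ordered click sequences, one per session.
sessions = [
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/products", "/cart"],
    ["/home", "/blog", "/products"],
]

# Count consecutive page transitions across all sessions.
transitions = Counter((a, b) for s in sessions for a, b in zip(s, s[1:]))

# Predict the most likely next page after /products.
candidates = {b: n for (a, b), n in transitions.items() if a == "/products"}
print(max(candidates, key=candidates.get))  # '/cart'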
3. Clustering: Group together items (users, pages, etc.,) that have similar
characteristics.
a) Page clusters: groups of pages that seem to be conceptually related according to
users’ perception.
b) User clusters: groups of users that seem to behave similarly when navigating a
Web site (see the sketch below).
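As an illustration of clustering visitors by behaviour, the sketch below groups sessions by two per-session statistics using k-means; it assumes scikit-learn, and the feature values are invented:

from sklearn.cluster import KMeans

# Per-session visitor statistics: [total_time_seconds, pages_viewed]
visitors = [[30, 2], [45, 3], [600, 25], [550, 20], [40, 2], [620, 30]]

# Two behavioural groups emerge: quick bouncers vs. engaged browsers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(visitors)
print(kmeans.labels_)  # e.g., [0 0 1 1 0 1]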
To use data mining on our Web site, we have to establish and record visitor and
item characteristics, and visitor interactions. Visitor site statistics are per-session
characteristics, such as total time, pages viewed, and so on.
Web mining research is a converging area drawing on several research
communities, such as the database, information retrieval, and AI communities, especially
machine learning and natural language processing. The World Wide Web is a
popular and interactive medium to gather information today. The WWW provides every
Internet citizen with access to an abundance of information. Users encounter some
problems when interacting with the web.
Common problems Web marketers want to solve are how to target advertisements
(targeting), personalize Web pages (personalization), create Web pages that show products
often bought together (associations), classify articles automatically (classification),
characterize groups of similar visitors (clustering), estimate missing data, and predict
future behaviour.
Thus, web mining refers to the overall process of discovering potentially useful
and previously unknown information or knowledge from Web data. Web mining aims
at finding and extracting relevant information that is hidden in Web-related data, in
particular in text documents published on the Web. Like data mining, Web mining is a
multi-disciplinary effort that draws techniques from fields like information retrieval,
statistics, machine learning, natural language processing and others. Web mining can be a
promising tool to address the shortcomings of search engines: incomplete indexing,
retrieval of irrelevant information, and the unverified reliability of retrieved information. It is
essential to have a system that helps the user find relevant and reliable information easily
and quickly on the Web. Web mining not only discovers information from mounds of data
on the WWW, but also monitors and predicts user visit patterns. This gives designers more
reliable information in structuring and designing a web site.
Given the rate of growth of the web, scalability of search engines is a key issue, as
the amount of hardware and network resources needed is large and expensive. In
addition, search engines are popular tools, so they have heavy constraints on query answer
time. So, the efficient use of resources can improve both scalability and answer time. One
tool to achieve this goal is web mining.
Web content mining is the process of extracting useful information from the
contents of web documents. Content data is the collection of facts a web page is designed
to contain. It may consist of text, images, audio, video, or structured records such as lists
and tables. Application of text mining to web content has been the most widely researched.
Issues addressed in text mining include topic discovery and tracking, extracting
association patterns, clustering of web documents and classification of web pages.
Research activities on this topic have drawn heavily on techniques developed in other
disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP).
Web content mining is an automatic process that goes beyond keyword extraction.
Since the content of a text document presents no machine-readable semantics, some
approaches suggest restructuring the document content into a representation that
could be exploited by machines. The usual approach to exploiting known structure in
documents is to use wrappers to map documents to some data model. Techniques using
lexicons for content interpretation are yet to come. There are two groups of web content
mining strategies: those that directly mine the content of documents and those that improve
on the content search of other tools like search engines.
The structure of a typical web graph consists of web pages as nodes, and
hyperlinks as edges connecting related pages. Web structure mining is the process of
discovering structure information from the web. This can be further divided into two kinds
based on the kind of structure information used.
Hyperlinks
A hyperlink is a structural unit that connects a location in a web page to a different
location, either within the same web page or on a different web page. A hyperlink that
connects to a different part of the same page is called an intra-document hyperlink, and a
hyperlink that connects two different pages is called an inter-document hyperlink.
Document Structure
In addition, the content within a Web page can also be organized in a tree
structured format, based on the various HTML and XML tags within the page. Mining
efforts here have focused on automatically extracting document object model (DOM)
structures out of documents (Wang and Liu 1998; Moh, Lim, and Ng 2000).
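A small sketch of recovering a page's tag tree with Python's standard html.parser; the sample HTML is invented:

from html.parser import HTMLParser

class DomSketch(HTMLParser):
    """Print the tag tree of a page, indentation showing nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

html = "<html><body><div><h1>Title</h1><p>Text</p></div></body></html>"
DomSketch().feed(html)
# Prints: html / body / div / h1 / p, indented by nesting level.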
The World Wide Web can reveal more information than just the information contained
in documents. For example, links pointing to a document indicate the popularity of the
document, while links coming out of a document indicate the richness or perhaps the
variety of topics covered in the document.
User logs are collected by the web server and typically include IP address, page
reference and access time.
Web servers record and accumulate data about user interactions whenever
requests for resources are received. Analyzing the web access logs of different web sites
can help understand the user behaviour and the web structure, thereby improving the
design of this colossal collection of resources. There are two main tendencies in web usage
mining, driven by the applications of the discoveries: general access pattern tracking and
customized usage tracking.
The World Wide Web has grown in the past few years from a small research
community to the biggest and most popular way of communication and information
dissemination. Every day, the WWW grows by roughly a million electronic pages, adding
to the hundreds of millions already on-line. WWW serves as a platform for exchanging
various kinds of information, ranging from research papers, and educational content, to
multimedia content and software. The continuous growth in the size and the use of the
WWW imposes new methods for processing these huge amounts of data. Because of its
rapid and chaotic growth, the resulting network of information lacks organization and
structure. Moreover, the content is published in various diverse formats.
Web data are those that can be collected and used in the context of Web
Personalization. These data are classified into four categories: content, structure, usage,
and user profile data.
Content data are presented to the end-user appropriately structured. They can be
simple text, images, or structured data, such as information retrieved from databases.
Structure data represent the way content is organized. They can be either data
entities used within a Web page, such as HTML or XML tags, or data entities used
to put a Web site together, such as hyperlinks connecting one page to another.
Usage data represent a Web site’s usage, such as a visitor’s IP address, time and
date of access, complete path accessed, referrers’ address, and other attributes that
can be included in a Web access log.
User profile data provide information about the users of a Web site. A user profile
contains demographic information for each user of a Web site, as well as
information about users’ interests and preferences. Such information is acquired
either explicitly, through registration forms or questionnaires, or inferred implicitly
from usage data.
PROS
Web mining has many advantages, which make this technology attractive to
corporations, including government agencies. This technology has enabled
ecommerce to do personalized marketing, which eventually results in higher trade
volumes. The government agencies are using this technology to classify threats and fight
against terrorism. The predictive capability of mining applications can benefit
society by identifying criminal activities. Companies can establish better customer
relationship by giving them exactly what they need. Companies can understand the needs
of the customer better and they can react to customer needs faster. The companies can
find, attract and retain customers; they can save on production costs by utilizing the
acquired insight of customer requirements. They can increase profitability by target
pricing based on the profiles created. They can even identify customers who might defect
to a competitor and try to retain them by providing promotional offers, thus reducing
the risk of losing those customers.
CONS
Web mining, itself, doesn’t create issues, but this technology when used on data of
personal nature might cause concerns. The most criticized ethical issue involving web
mining is the invasion of privacy. Privacy is considered lost when information concerning
an individual is obtained, used, or disseminated, especially if this occurs without their
knowledge or consent. The obtained data will be analyzed, and clustered to form profiles;
the data will be made anonymous before clustering so that there are no personal profiles.
Thus these applications de-individualize the users by judging them by their mouse clicks.
De-individualization, can be defined as a tendency of judging and treating people on the
basis of group characteristics instead of on their own individual characteristics and merits.
Another important concern is that companies collecting the data for a specific purpose
might use the data for totally different purposes, which essentially violates the user’s
interests.
CONTENT MINING
Database approach: focuses on “integrating and organizing the heterogeneous and
semi-structured data on the web into more structured and high-level collections of
resources”. These metadata, or generalizations, are organized into structured collections
that can then be accessed and analyzed.
Multilevel Databases: The main idea behind this approach is that the lowest level of the
database contains semi-structured information stored in various Web repositories, such as
hypertext documents. At the higher level meta data or generalizations are extracted from
lower levels and organized in structured collections, i.e. relational or object-oriented
databases. For example, Han et al. use a multilayered database where each layer is
obtained via generalization and transformation operations performed on the lower layers.
Kholsa et al. propose the creation and maintenance of meta-databases at each information-
providing domain and the use of a global schema for the meta-database. This allows the
incremental integration of a portion of the schema from each information source, rather
than relying on a global heterogeneous database schema. The ARANEUS system extracts relevant
information from hypertext documents and integrates these into higher-level derived Web
Hypertexts which are generalizations of the notion of database views.
Web Query Systems: Many Web-based query systems and languages utilize standard
database query languages such as SQL, structural information about Web documents, and
even natural language processing for the queries that are used in World Wide Web
searches. W3QL combines structure queries, based on the organization of hypertext
documents, and content queries, based on information retrieval techniques. WebLog, a
logic-based query language for restructuring, extracts information from Web information
sources. Lorel and UnQL query heterogeneous and semi-structured information on the
Web using a labelled graph data model. TSIMMIS extracts data from heterogeneous and
semi-structured information sources and correlates them to generate an integrated database
representation of the extracted information.
This is perhaps the most widely studied research topic of Web content mining. One
of the reasons for its importance and popularity is that structured data on the Web are often
very important, as they represent the essential information of their host pages, e.g., lists of
products and services. Extracting such data allows one to provide value-added services,
e.g., comparative shopping, and meta-search. Structured data is also easier to extract
compared to unstructured texts. This problem has been studied by researchers in AI,
database and data mining, and Web communities. There are several approaches to
structured data extraction, which is also called wrapper generation. The first approach is to
manually write an extraction program for each Web site based on observed format patterns
of the site. This approach is very labour intensive and time consuming. It thus does not
scale to a large number of sites. The second approach is wrapper induction or wrapper
learning, which is the main technique currently. Wrapper learning works as follows: The
user first manually labels a set of training pages. A learning system then generates rules
from the training pages. The resulting rules are then applied to extract target items from
Web pages. The third approach is the automatic approach: structured data objects on
the Web are normally database records retrieved from underlying databases and displayed
in Web pages with some fixed templates. Automatic methods aim to find
patterns/grammars from the Web pages and then use them to extract data.
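A toy illustration of the manual-wrapper idea: a regular expression written against one site's observed HTML template. The product markup is invented, and wrapper-induction systems learn such rules from labeled pages instead of having them hand-coded:

import re

# Observed template of a (hypothetical) product listing page.
page = """
<div class="item"><span class="name">USB cable</span><span class="price">$3.99</span></div>
<div class="item"><span class="name">Mouse</span><span class="price">$12.50</span></div>
"""

# Hand-written extraction rule keyed to this site's fixed template.
rule = re.compile(r'class="name">(.*?)</span><span class="price">\$([\d.]+)')

records = [(name, float(price)) for name, price in rule.findall(page)]
print(records)  # [('USB cable', 3.99), ('Mouse', 12.5)]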
Most Web pages can be seen as text documents. Extracting information from Web
documents has also been studied by many researchers. The research is closely related to
text mining, information retrieval and natural language processing. Current techniques are
mainly based on machine learning and natural language processing to learn extraction
rules. Recently, a number of researchers also make use of common language patterns
(common sentence structures used to express certain facts or relations) and redundancy of
information on the Web to find concepts, relations among concepts and named entities.
The patterns can be automatically learnt or supplied by human users. Another direction of
research in this area is Web question-answering. Although question-answering was first
studied in the information retrieval literature, it has become very important on the Web,
which offers the largest publicly available source of information for answering users’
questions.
Due to the sheer scale of the Web and diverse authorships, various Web sites may
use different syntaxes to express similar or related information. In order to make use of or
to extract information from multiple sites to provide value added services, e.g.,
metasearch, deep Web search, etc., one needs to semantically integrate information from
multiple sources. Recently, several researchers attempted this task. Two popular problems
related to the Web are (1) Web query interface integration, to enable querying multiple
Web databases and (2) schema matching, e.g., integrating Yahoo!’s and Google’s directories
to match concepts in the hierarchies. The ability to query multiple deep Web databases is
attractive and interesting because the deep Web contains a huge amount of information or
data that is not indexed by general search engines.
A typical Web page consists of many blocks or areas, e.g., main content areas,
navigation areas, advertisements, etc. It is useful to separate these areas automatically for
different mining applications.
Consumer opinions used to be very difficult to obtain before the Web was
available. Companies usually conduct consumer surveys or engage external consultants to
find such opinions about their products and those of their competitors. Now much of the
information is publicly available on the Web. There are numerous Web sites and pages
containing consumer opinions, e.g., customer reviews of products, forums, discussion
groups, and blogs. This online word-of-mouth behaviour represents new and measurable
sources of information for marketing intelligence. Techniques are now being developed to
exploit these sources to help companies and individuals to gain such information
effectively and easily. For instance, one approach proposes a feature-based summarization
method to automatically analyze consumer opinions in customer reviews from online merchant sites
and dedicated review sites. The result of such a summary is useful to both potential
customers and product manufacturers.
Web Structure Mining operates on the Web’s hyperlink structure. This graph
structure can provide information about page ranking or authoritativeness and enhance
search results through filtering; that is, it tries to discover the model underlying the link
structures of the Web. This model is used to analyze the similarity and relationships
between different Web sites, using the hyperlink structure of the Web as an additional
information source. This type of mining can be further divided into two kinds based on the
kind of structural data used.
a) HYPERLINKS:
A hyperlink is a structural unit that connects a web page to a different location, either
within the same web page (intra-document hyperlink) or to a different web page
(inter-document hyperlink).
b) DOCUMENT STRUCTURE:
In addition, the content within a web page can also be organized in a tree
structured format, based on various HTML and XML tags within the page. Mining
efforts here have focused on automatically extracting document object model
(DOM) structures out of documents.
Web structure mining uses the hyperlink structure of the Web to yield useful
information, including definitive pages specification, hyperlinked communities
identification, Web pages categorization and Web site completeness evaluation. Web
structure mining can be divided into two categories based on the kind of structured data
used:
1. Web graph mining: The Web provides additional information about how different
documents are connected to each other via hyperlinks. The Web can be viewed as a
(directed) graph whose nodes are Web pages and whose edges are hyperlinks
between them.
2. Deep Web mining: The Web also contains a vast amount of non-crawlable content. This
hidden part of the Web is referred to as the deep Web or the hidden Web.
Compared to the static surface Web, the deep Web contains a much larger amount
of high-quality structured information.
Most mining algorithms that improve the performance of Web search are
based on two assumptions.
(a) Hyperlinks convey human endorsement. If there exists a link from page A to
page B, and these two pages are authored by different people, then the first author
found the second page valuable. Thus the importance of a page can be propagated
to those pages it links to.
(b) Pages that are co-cited by a certain page are likely related to the same topic.
The popularity or importance of a page is correlated with the number of incoming
links to some extent, and related pages tend to be clustered together through
dense linkages among them.
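Assumption (a), that endorsement propagates along links, is what the PageRank algorithm formalizes. Below is a minimal power-iteration sketch in plain Python; the four-page graph and the damping factor of 0.85 are illustrative:

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # simplified handling of dangling pages
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share  # endorsement propagates along links
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))  # C collects the most endorsement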
Web information extraction has the goal of pulling out information from a
collection of Web pages and converting it to a homogeneous form that is more readily
digested and analyzed by both humans and machines. The result of IE can then be used
to support further analysis and mining tasks.
It is usually difficult or even impossible to directly obtain the structures of the Web
sites’ backend databases without cooperation from the sites. Instead, the sites present two
other distinguishing structures: Interface schema and result schema. The interface schema
is the schema of the query interface, which exposes attributes that can be queried in the
backend database. The result schema is the schema of the query results, which exposes
attributes that are shown to users.
Web Usage Mining is a part of Web Mining, which, in turn, is a part of Data
Mining. Just as Data Mining involves extracting meaningful and valuable
information from large volumes of data, Web Usage Mining involves mining the usage
characteristics of the users of Web Applications. This extracted information can then be
used in a variety of ways, such as improvement of the application, detection of fraudulent
elements, etc.
The major problem with Web Mining in general and Web Usage Mining in
particular is the nature of the data they deal with. With the upsurge of the Internet in this
millennium, Web data has become huge in volume, and transactions and usage records
accumulate by the second. Apart from the volume of the data, the data is not
completely structured. It is in a semi-structured format, so it needs a lot of
preprocessing and parsing before the actual extraction of the required information. Here
we take up a small part of the Web Usage Mining process, which involves
Preprocessing, User Identification, Bot Removal and Analysis of the usage data.
In Web Usage Mining, data can be collected in server logs, browser logs, proxy
logs, or obtained from an organization's database. These data collections differ in terms of
the location of the data source, the kinds of data available, the segment of population from
which the data was collected, and methods of implementation.
There are many kinds of data that can be used in Web Mining.
1. Content: The visible data in the Web pages or the information which was meant to be
imparted to the users. A major part of it includes text and graphics (images).
2. Structure: Data which describes the organization of the website. It is divided into two
types. Intra-page structure information includes the arrangement of various HTML or
XML tags within a given page. The principal kind of inter-page structure information
is the hyperlinks used for site navigation.
3. Usage: Data that describes the usage patterns of Web pages, such as IP addresses, page
references, and the date and time of accesses and various other information depending
on the log format.
The data sources used in Web Usage Mining may include web data repositories like:
1. WEB SERVER LOGS: These are logs which maintain a history of page requests. The
W3C maintains a standard format for web server log files, but other proprietary formats
exist. More recent entries are typically appended to the end of the file. Information about
the request, including client IP address, request date/time, page requested, HTTP code,
bytes served, user agent, and referrer are typically added.
These data can be combined into a single file, or separated into distinct logs, such
as an access log, error log, or referrer log. However, server logs typically do not collect
user-specific information. These files are usually not accessible to general Internet users,
only to the webmaster or other administrative person. A statistical analysis of the server
log may be used to examine traffic patterns by time of day, day of week, referrer, or user
agent. Efficient web site administration, adequate hosting resources and the fine tuning of
sales efforts can be aided by analysis of the web server logs. Marketing departments of
any organization that owns a website should be trained to understand these powerful tools.
A Web server log is an important source for performing Web Usage Mining
because it explicitly records the browsing behaviour of site visitors. The data recorded in
server logs reflects the (possibly concurrent) access of a Web site by multiple users. These
log files can be stored in various formats such as Common log or Extended log formats.
However, the site usage data recorded by server logs may not be entirely reliable due to
caching: pages served from a browser or proxy cache leave no record in the server log.
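For illustration, the sketch below parses one entry in the widely used Combined Log Format with a regular expression; the log line itself is invented:

import re

# One (invented) Combined Log Format entry.
line = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /products.html HTTP/1.1" 200 2326 '
        '"http://example.com/home.html" "Mozilla/5.0"')

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

entry = pattern.match(line).groupdict()
print(entry["ip"], entry["request"], entry["status"], entry["referrer"])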
The Web server also relies on other utilities such as CGI scripts to handle data sent
back from client browsers. Web servers implementing the CGI standard parse the URI of
the requested file to determine if it is an application program. The URI for CGI programs
may contain additional parameter values to be passed to the CGI application. Once the
CGI program has completed its execution, the Web server sends the output of the CGI
application back to the browser.
2. PROXY SERVER LOGS: A Web proxy is a caching mechanism which lies between
client browsers and Web servers. It helps to reduce the load time of Web pages as well as
the network traffic load at the server and client side. Proxy server logs contain the HTTP
requests from multiple clients to multiple Web servers. This may serve as a data source to
discover the usage pattern of a group of anonymous users, sharing a common proxy server.
The performance of proxy caches depends on their ability to predict future page
requests correctly.
Some of the key pieces of information that can be obtained from such logs are:
1. Number of Hits: This number usually signifies the number of times any resource is
accessed in a Website. A hit is a request to a web server for a file (web page, image,
JavaScript, Cascading Style Sheet, etc.). When a web page is served, the number of
"hits" or "page hits" is equal to the number of files requested. Therefore, one page load
does not always equal one hit, because pages are often made up of images and other
files which add to the number of hits counted.
2. Visitor Referring Website: The referring website gives the URL of the website which
referred a visitor to the website in consideration.
3. Visitor Referral Website: The referral website gives the URL of the website which is
being referred to by the website in consideration.
4. Time and Duration: This information in the server logs gives the time of access and
how long the Website was used by a particular visitor.
5. Path Analysis: Path analysis gives the analysis of the path a particular user has
followed in accessing contents of a Website.
6. Visitor IP Address: This information gives the Internet Protocol (IP) address of the
visitors who visited the Website in consideration.
7. Browser Type: This information records the type of browser that was used for
accessing the Website.
8. Cookies: A message given to a Web browser by a Web server. The browser stores the
message in a text file called a cookie. The message is then sent back to the server each
time the browser requests a page from the server. The main purpose of cookies is to
identify users and possibly prepare customized Web pages for them. When you enter a
Web site using cookies, you may be asked to fill out a form providing such information
as your name and interests. This information is packaged into a cookie and sent to your
Web browser, which stores it for later use. The next time you go to the same Web site,
your browser will send the cookie to the Web server. The server can use this
information to present you with custom Web pages. So, for example, instead of seeing
just a generic welcome page you might see a welcome page with your name on it.
9. Platform: This information gives the type of Operating System that was used to access
the Website.
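Several of the metrics above reduce to simple aggregation once the log has been parsed.
A minimal Python sketch (the field names follow the parser sketched earlier and are our
own convention):

    from collections import Counter

    def summarize(entries):
        """Compute a few of the metrics above from parsed log entries."""
        hits = len(entries)                              # every file request is a hit
        unique_ips = len({e["ip"] for e in entries})     # distinct visitor IPs
        bandwidth = sum(int(e["bytes"]) for e in entries if e["bytes"].isdigit())
        failed = sum(1 for e in entries if e["status"].startswith(("4", "5")))
        browsers = Counter(e["agent"] for e in entries)  # browser type distribution
        return {"hits": hits, "unique_ips": unique_ips,
                "bandwidth_bytes": bandwidth, "failed_requests": failed,
                "top_agents": browsers.most_common(3)}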
This usage information can be applied to improve the Website in several ways:
1. Eliminating or Combining Low Visit Pages: The pages which are not frequently
accessed by users can either be removed or have their content merged with frequently
accessed pages.
2. Redesigning Pages to Help User Navigation: To help the user navigate through the
website in the best possible manner, the information obtained can be used to redesign
the structure of the Website.
3. Redesigning Pages for Search Engine Optimization: Analysis of user patterns shows
how the content and other information in the website can be improved, and this can be
used to redesign pages for Search Engine Optimization so that search engines index the
website at a proper rank.
6.6.1 PREPROCESSING
Usage preprocessing is arguably the most difficult task in the Web Usage Mining process
due to the incompleteness of the available data. Unless a client-side tracking mechanism is
used, only the IP address, user agent, and server-side click-stream are available to identify
users and server sessions. Some of the typically encountered problems are described below.
Assuming each user has now been identified (through cookies, logins, or
IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since
page requests from other servers are not typically available, it is difficult to know when a
user has left a Web site. A thirty minute timeout is often used as the default method of
breaking a user's click-stream into sessions. When a session ID is embedded in each URI,
the definition of a session is set by the content server.
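The timeout heuristic is straightforward to implement. A minimal Python sketch, assuming
each user's click-stream is already available as (timestamp, page) pairs (the thirty-minute
threshold is the conventional default, not a fixed standard):

    from datetime import timedelta

    TIMEOUT = timedelta(minutes=30)

    def sessionize(clicks):
        """Split one user's time-ordered click-stream into sessions
        whenever the gap between consecutive requests exceeds TIMEOUT."""
        sessions, current, last_time = [], [], None
        for timestamp, page in sorted(clicks):
            if last_time is not None and timestamp - last_time > TIMEOUT:
                sessions.append(current)
                current = []
            current.append(page)
            last_time = timestamp
        if current:
            sessions.append(current)
        return sessions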
While the exact content served as a result of each user action is often available
from the request field in the server logs, it is sometimes necessary to have access to the
content server information as well. Since content servers can maintain state variables for
each active session, the information necessary to determine exactly what content is served
by a user request is not always available in the URI. The final problem encountered when
preprocessing usage data is that of inferring cached page references. The only reliable
method of tracking cached page views is to monitor usage from the client side; however,
the referrer field of each request can be used to detect some of the instances when cached
pages have been viewed. Consider, for example, a sample log in which IP address
123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and
209.45.78.3 are responsible for a fourth session. Using a combination of referrer and agent
information, the log entries can be divided into three sessions of A-B-F-O-G, L-R, and
A-B-C-J. Path completion would then add two page references to the first session, giving
A-B-F-O-F-B-G, and one reference to the third session, giving A-B-A-C-J. Without using
cookies, an embedded session ID, or a client-side data collection mechanism, such cached
references can only be inferred, never observed directly.
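Path completion itself can be sketched as a backtracking walk over the site's link graph:
when a requested page is not linked from the current page, we assume the user went back
through cached pages until reaching one that does link to it. The following Python sketch
is our simplified illustration; the link graph would come from the site structure:

    def complete_path(session, links):
        """Insert inferred backtracking references into a session.
        links maps each page to the set of pages it links to."""
        completed = session[:1]
        for page in session[1:]:
            # Walk back through the visited pages until one links to the
            # next request, recording each inferred cached revisit.
            stack = list(completed)
            while stack and page not in links.get(stack[-1], set()):
                stack.pop()
                if stack:
                    completed.append(stack[-1])   # inferred revisit from cache
            completed.append(page)
        return completed

    # The session A-B-F-O-G from the example above becomes A-B-F-O-F-B-G
    # when G is linked only from B.
    links = {"A": {"B", "C"}, "B": {"F", "G"}, "F": {"O"}, "O": set(), "C": {"J"}}
    print(complete_path(["A", "B", "F", "O", "G"], links))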
Content preprocessing consists of converting the text, image, scripts, and other files such as
multimedia into forms that are useful for the Web Usage Mining process. Often, this
consists of performing content mining such as classification or clustering. While applying
data mining to the content of Web sites is an interesting area of research in its own right, in
the context of Web Usage Mining the content of a site can be used to filter the input to, or
output from the pattern discovery algorithms. For example, results of a classification
algorithm could be used to limit the discovered patterns to those containing page views
about a certain subject or class of products. In addition to classifying or clustering page
views based on topics, page views can also be classified according to their intended use.
Page views can be intended to convey information (through text, graphics, or other
multimedia), gather information from the user, allow navigation (through a list of
hypertext links), or some combination uses. The intended use of a page view can also filter
the sessions before or after pattern discovery.
In order to run content mining algorithms on page views, the information must first
be converted into a quantifiable format. Some version of the vector space model is
typically used to accomplish this. Text files can be broken up into vectors of words.
Keywords or text descriptions can be substituted for graphics or multimedia. The content
of static page views can be easily preprocessed by parsing the HTML and reformatting the
information or running additional algorithms as desired. Dynamic page views present
more of a challenge. Content servers that employ personalization techniques and/or draw
upon databases to construct the page views may be capable of forming more page views
than can be practically preprocessed. A given set of server sessions may only access a
fraction of the page views possible for a large dynamic site. Also the content may be
revised on a regular basis. The content of each page view to be preprocessed must be
"assembled", either by an HTTP request from a crawler, or a combination of template,
script, and database accesses. If only the portion of page views that are accessed are
preprocessed, the output of any classification or clustering algorithms may be skewed.
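As a small illustration of the vector space conversion mentioned above, page text can be
reduced to a bag-of-words term-frequency vector (a real system would add stop-word
removal, stemming, and TF-IDF weighting):

    import re
    from collections import Counter

    def to_vector(text):
        """Break a page's text into a term-frequency vector."""
        words = re.findall(r"[a-z]+", text.lower())
        return Counter(words)

    page = "Discount guitars and guitar amplifiers. Guitars ship free."
    print(to_vector(page))   # Counter({'guitars': 2, 'guitar': 1, ...})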
Web Usage mining can be used to uncover patterns in server logs but is often carried out
only on samples of data. The mining process will be ineffective if the samples are not a
good representation of the larger body of data.
Pattern discovery draws upon methods and algorithms developed from several
fields such as statistics, data mining, machine learning and pattern recognition. However,
it is not the intent of this paper to describe all the available algorithms and techniques
derived from these fields; interested readers should consult the standard data mining and
machine learning literature. This section describes the kinds of mining activities that have
been applied to the Web domain.
Methods developed from other fields must take into consideration the different kinds of
data abstractions and prior knowledge available for Web Mining.
For example, in association rule discovery, the notion of a transaction for market-basket
analysis does not take into consideration the order in which items are selected. However,
in Web Usage Mining, a server session is an ordered sequence of pages requested by a
user. Furthermore, due to the difficulty in identifying unique sessions, additional prior
knowledge is required (such as imposing a default timeout period, as was pointed out in
the previous section).
1. Statistical Analysis
Statistical techniques are the most common method to extract knowledge about visitors to
a Web site. By analyzing the session file, one can perform different kinds of descriptive
statistical analyses (frequency, mean, median, etc.) on variables such as page views,
viewing time and length of a navigational path. Many Web traffic analysis tools produce a
periodic report containing statistical information such as the most frequently accessed
pages, the average view time of a page, or the average length of a path through the site.
2. Association Rules
Association rule generation can be used to relate pages that are most often referenced
together in a single server session. In the context of Web Usage Mining, association rules
refer to sets of pages that are accessed together with a support value exceeding some
specified threshold. These pages may not be directly connected to one another via
hyperlinks. For example, association rule discovery using the Apriori algorithm may
reveal a correlation between users who visit a page containing electronic products and
users who visit a page about sporting equipment. Aside from being applicable for
business and marketing applications, the presence or absence of such rules can help Web
designers to restructure their Web site. The association rules may also serve as a heuristic
for pre-fetching documents in order to reduce user-perceived latency when loading a page
from a remote site.
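To make the idea concrete, the core counting step can be sketched in a few lines of
Python, in the spirit of Apriori's candidate counting (the sessions are invented; a full
Apriori implementation would grow itemsets level by level):

    from itertools import combinations
    from collections import Counter

    def frequent_pagesets(sessions, min_support, size=2):
        """Return page sets of the given size whose support (fraction of
        sessions containing all pages in the set) meets min_support."""
        counts = Counter()
        for session in sessions:
            for itemset in combinations(sorted(set(session)), size):
                counts[itemset] += 1
        n = len(sessions)
        return {s: c / n for s, c in counts.items() if c / n >= min_support}

    sessions = [["electronics", "sports"], ["electronics", "books"],
                ["electronics", "sports", "music"]]
    print(frequent_pagesets(sessions, min_support=0.5))
    # {('electronics', 'sports'): 0.666...}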
3. Clustering
Clustering groups together users or pages that have similar characteristics. Clusters of
users with similar browsing patterns are useful for market segmentation and
personalization, while clusters of pages with related content can assist search engines and
recommendation systems.
4. Classification
Classification is the task of mapping a data item into one of several predefined classes. In
the Web domain, one is interested in developing a profile of users belonging to a particular
class or category. This requires extraction and selection of features that best describe the
properties of a given class or category. Classification can be done by using supervised
inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers,
k-nearest neighbour classifiers, Support Vector Machines etc. For example, classification
on server logs may lead to the discovery of interesting rules such as : 30% of users who
placed an online order in /Product/Music are in the 18-25 age group and live on the West
Coast.
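As a hedged sketch of such a classifier, the example below trains a decision tree with
scikit-learn on invented session features (the encoding of age, page visits, and region is
our own assumption, not a real log format):

    from sklearn.tree import DecisionTreeClassifier

    # Toy features per session: [visitor age, visited /Product/Music, West Coast]
    X = [[22, 1, 1], [45, 0, 0], [19, 1, 1], [33, 1, 0], [24, 1, 1]]
    y = [1, 0, 1, 0, 1]   # 1 = placed an online order in /Product/Music

    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(clf.predict([[21, 1, 1]]))   # predicted class for a new session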
5. Sequential Patterns
The technique of sequential pattern discovery attempts to find inter-session patterns such
that the presence of a set of items is followed by another item in a time-ordered set of
sessions or episodes. By using this approach, Web marketers can predict future visit
patterns which will be helpful in placing advertisements aimed at certain user groups.
Other types of temporal analysis that can be performed on sequential patterns include
trend analysis, change point detection, and similarity analysis.
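A minimal flavour of this in Python: counting how often one page is followed by another
later in the same session, the simplest kind of sequential pattern (sessions invented):

    from collections import Counter

    def frequent_followups(sessions):
        """Count ordered pairs (a, b) where page b is requested
        after page a within the same session."""
        counts = Counter()
        for session in sessions:
            for i, a in enumerate(session):
                for b in session[i + 1:]:
                    counts[(a, b)] += 1
        return counts

    sessions = [["home", "laptops", "checkout"], ["home", "laptops"]]
    print(frequent_followups(sessions).most_common(2))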
6. Dependency Modeling
Dependency modeling is another useful pattern discovery task in Web Mining. The goal
here is to develop a model capable of representing significant dependencies among the
various variables in the Web domain. As an example, one may be interested in building a
model representing the different stages a visitor undergoes while shopping in an online
store, based on the actions chosen (i.e. from a casual visitor to a serious potential buyer).
There are several probabilistic learning techniques that can be employed to model the
browsing behaviour of users. Such techniques include Hidden Markov Models and
Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a
theoretical framework for analyzing the behaviour of users but is potentially useful for
predicting future Web resource consumption. Such information may help develop
strategies to increase the sales of products offered by the Web site or improve the
navigational convenience of users.
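As a minimal stand-in for the richer probabilistic models mentioned above, a first-order
Markov chain over page views can be estimated directly from sessions (the stages and
data are invented):

    from collections import Counter, defaultdict

    def transition_probabilities(sessions):
        """Estimate P(next page | current page) from observed sessions."""
        counts = defaultdict(Counter)
        for session in sessions:
            for a, b in zip(session, session[1:]):
                counts[a][b] += 1
        return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
                for a, nxt in counts.items()}

    sessions = [["browse", "cart", "buy"], ["browse", "cart", "leave"],
                ["browse", "leave"]]
    print(transition_probabilities(sessions)["cart"])   # {'buy': 0.5, 'leave': 0.5}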
This is the final step in the Web Usage Mining process. After preprocessing and pattern
discovery, the obtained usage patterns are analyzed to filter out uninteresting information
and extract the useful information. Methods such as SQL (Structured Query Language)
processing and OLAP (Online Analytical Processing) can be used for this purpose.
The major application areas of Web Usage Mining are:
1. Personalization
2. System Improvement
3. Site Modification
4. Business Intelligence
5. Usage Characterization
Some representative systems that apply Web Usage Mining are:
1. LETIZIA
Letizia is an application that assists a user browsing the Internet. As the user operates a
conventional Web browser such as Mozilla, the application tracks usage patterns and
attempts to predict items of interest by performing concurrent and autonomous exploration
of links from the user's current position. The application uses a best-first search augmented
by heuristics inferring user interest from browsing behaviour.
2. WEBSIFT
WebSIFT (the Web Site Information Filter) performs Web Usage Mining from server
logs, dividing the process into preprocessing, pattern discovery, and pattern analysis
phases; it uses the site's content and structure as domain knowledge to identify which of
the discovered usage patterns are genuinely interesting.
3. ADAPTIVE WEBSITES
Adaptive websites, proposed by Perkowitz and Etzioni, automatically improve their
organization and presentation by learning from visitor access patterns, for example by
synthesizing index pages for groups of pages that are frequently visited together.
We used different web server log analyzers, such as Web Expert Lite 6.1 and Analog 6.0,
to analyze various sample web server logs. The key information obtained was: Hits (Total
Hits, Visitor Hits, Average Hits per Day, Average Hits per Visitor, Failed Requests); Page
Views (Total Page Views, Average Page Views per Day, Average Page Views per Visitor);
Visitors (Total Visitors, Average Visitors per Day, Total Unique IPs); Bandwidth (Total
Bandwidth, Visitor Bandwidth, Average Bandwidth per Day, Average Bandwidth per Hit,
Average Bandwidth per Visitor); and access data such as files and images, referrers, and
user agents.
In this section we briefly describe the new concepts introduced by the web mining
research community.
Searching the web involves two main steps: Extracting the pages relevant to a
query and ranking them according to their quality. Ranking is important as it helps the
user look for “quality” pages that are relevant to the query. Different metrics have been
proposed to rank web pages according to their quality. We briefly discuss two of the
prominent ones.
PAGE RANK
Page Rank is a metric for ranking hypertext documents based on their quality.
Page, Brin, Motwani, and Winograd (1998) developed this metric for the popular search
engine Google (Brin and Page 1998). The key idea is that a page has a high rank if it is
pointed to by many highly ranked pages. So, the rank of a page depends upon the ranks of
the pages pointing to it. This process is repeated iteratively until the ranks of all pages
converge. Intuitively, the approach can be viewed as a stochastic analysis of a random
walk on the web graph. The rank of a page p can be written as

PR(p) = d (1/n) + (1 − d) Σ_{(q,p) ∈ G} PR(q) / OutDegree(q)

where n is the number of pages and G is the set of hyperlinks (edges) of the web graph.
The first term on the right hand side of the equation is the probability that a random web
surfer arrives at page p directly, by typing the URL, from a bookmark, or because it is
his/her homepage. Here d is the probability of such a direct jump, and the second term is
the probability of arriving at p by following a hyperlink from some page q.
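A small power-iteration sketch of this computation in Python (the three-page graph is
invented, and d follows the convention above, i.e. the probability of a direct jump):

    def pagerank(graph, d=0.15, iterations=50):
        """Iterate PR(p) = d*(1/n) + (1-d)*sum(PR(q)/OutDegree(q)) over
        all pages q that link to p. graph maps page -> set of out-links."""
        n = len(graph)
        pr = {p: 1.0 / n for p in graph}
        for _ in range(iterations):
            pr = {p: d / n + (1 - d) * sum(pr[q] / len(graph[q])
                                           for q in graph if p in graph[q])
                  for p in graph}
        return pr

    graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
    print(pagerank(graph))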
HUBS AND AUTHORITIES
Hubs and authorities can be viewed as “fans” and “centers” in a bipartite core of a
web graph, where the “fans” represent the hubs and the “centers” represent the authorities.
The hub and authority scores computed for each web page indicate the extent to which the
web page serves as a hub pointing to good authority pages or as an authority on a topic
pointed to by good hubs. The scores are computed for a set of pages related to a topic
using an iterative procedure called HITS (Kleinberg 1999). First a query is submitted to a
search engine and a set of relevant documents is retrieved. This set, called the “root set,” is
then expanded by including web pages that point to those in the “root set” and are pointed
by those in the “root set.” This new set is called the “base set.” An adjacency matrix A is
then formed such that A(i,j) = 1 if there exists at least one hyperlink from page i to page j,
and A(i,j) = 0 otherwise. The HITS algorithm is then used to compute the hub and
authority scores for this set of pages.
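A sketch of the HITS iteration on such an adjacency structure (the tiny “fan/center” base
set is invented; scores are rescaled by their maximum each round to keep them bounded):

    def hits(adj, iterations=20):
        """Compute hub and authority scores.
        adj maps page i -> set of pages j with a hyperlink i -> j."""
        pages = set(adj) | {j for out in adj.values() for j in out}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # authority: sum of hub scores of pages pointing to p
            auth = {p: sum(hub[i] for i in adj if p in adj[i]) for p in pages}
            # hub: sum of authority scores of pages p points to
            hub = {p: sum(auth[j] for j in adj.get(p, ())) for p in pages}
            max_a, max_h = max(auth.values()) or 1.0, max(hub.values()) or 1.0
            auth = {p: s / max_a for p, s in auth.items()}
            hub = {p: s / max_h for p, s in hub.items()}
        return hub, auth

    adj = {"fan1": {"center"}, "fan2": {"center", "other"}}
    hub, auth = hits(adj)
    print(auth["center"], hub["fan2"])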
There have been modifications and improvements to the basic page rank and hubs
and authorities approaches, such as SALSA (Lempel and Moran 2000), topic-sensitive
page rank (Haveliwala 2002), and web page reputations (Mendelzon and Rafiei 2000). These
different hyperlink based metrics have been discussed by Desikan, Srivastava, Kumar, and
Tan (2002).
WEB ROBOTS
Web robots are software programs that automatically traverse the hyperlink
structure of the web to locate and retrieve information. The importance of separating robot
behaviour from human behaviour prior to building user behaviour models has been
illustrated by Kohavi. First, e-commerce retailers are particularly concerned about the
unauthorized deployment of robots for gathering business intelligence at their web sites.
Second, web robots tend to consume considerable network bandwidth at the expense of
other users. Sessions due to web robots also make it difficult to perform click-stream
analysis on the usage data.
INFORMATION SCENT
Information scent is a concept that uses the snippets of information present around the
links in a page as a “scent” to evaluate the quality of content of the page it points to, and
the cost of accessing such a page. The key idea is to model a user at a given page as
“foraging” for information, and following a link with a stronger “scent.” The “scent” of a
path depends on how likely it is to lead the user to relevant information, and is determined
by a network flow algorithm called spreading activation. The snippets, graphics, and other
information around a link are called “proximal cues.” The user’s desired information need
is expressed as a weighted keyword vector.
The similarity between the proximal cues and the user’s information need is
computed as “proximal scent.” With the proximal cues from all the links and the user’s
information need vector, a “proximal scent matrix” is generated. Each element in the
matrix reflects the extent of similarity between the link’s proximal cues and the user’s
information need. If enough information is not available around the link, a “distal scent” is
computed with the information about the link described by the contents of the pages it
points to. The proximal scent and the distal scent are then combined to give the scent
matrix. The probability that a user would follow a link is then decided by the scent or the
value of the element in the scent matrix.
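The similarity computation at the heart of this step can be illustrated with a cosine
measure between a link's proximal-cue vector and the user's weighted keyword vector
(the vectors are invented; spreading activation itself is beyond this sketch):

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse keyword-weight vectors."""
        dot = sum(w * v.get(k, 0.0) for k, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    need = {"guitar": 0.8, "price": 0.5}    # user's information need vector
    cue = {"guitar": 1.0, "review": 0.4}    # proximal cues around one link
    print(cosine(need, cue))                # one entry of the scent matrix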
One of the significant impacts of publishing on the web has been the close
interaction now possible between authors and their readers. In the pre web era, a reader’s
level of interest in published material had to be inferred from indirect measures such as
buying and borrowing, library checkout and renewal, opinion surveys, and in rare cases
feedback on the content. For material published on the web it is possible to track the click-
stream of a reader to observe the exact path taken through on-line published material. We
can measure times spent on each page, the specific link taken to arrive at a page and to
leave it, etc. Much more accurate inferences about readers’ interest in content can be
drawn from these observations. Mining the user click-stream for user behaviour, and using
it to adapt the “look-and-feel” of a site to a reader’s needs was first proposed by Perkowitz
and Etzioni. While the usage data of any portion of a web site can be analyzed, the most
significant, and thus “interesting,” is the one where the usage pattern differs significantly
from the link structure. This is so because the readers’ behaviour, reflected by web usage,
is very different from what the author would like it to be, reflected by the structure created
by the author. Knowledge extracted from structure data and usage data can therefore be
treated as evidence from independent sources, and combined in an evidential reasoning
framework to develop measures of interestingness.
In the panel discussion referred to earlier, preprocessing of web data to make it suitable for
mining was identified as one of the key issues for web mining. A significant amount of
work has been done in this area for web usage data, including user identification and
session creation, robot detection and filtering, and extracting usage path patterns. Cooley’s
Ph.D. dissertation provides a comprehensive overview of the work in web usage data
preprocessing. Preprocessing of web structure data, especially link information, has been
carried out for some applications, the most notable being Google style web search.
The web has had tremendous success in building communities of users and
information sources. Identifying such communities is useful for many purposes. Gibson,
Kleinberg, and Raghavan identified web communities as “a core of central authoritative
pages linked together by hub pages.” Their approach was to discover emerging web
communities while crawling. A different approach to this problem was taken by Flake,
Lawrence, and Giles who applied the “maximum-flow minimum cut model” to the web
graph for identifying “web communities.” The HITS and maximum-flow approaches have
been compared in the literature, each with its own strengths and weaknesses. Reddy and
Kitsuregawa propose a dense bipartite graph method, a relaxation to the complete bipartite
method followed by HITS approach, to find web communities. A related concept of
“friends and neighbours” was introduced by Adamic and Adar. They identified a group of
individuals with similar interests, who in the cyber-world would form a “community.”
Two people are termed “friends” if the similarity between their web pages is high.
Similarity is measured using features such as text, out-links, in-links and mailing lists.
With the web having become the fastest growing and most up to date source of
information, the research community has found it extremely useful to have online
repositories of publications. Lawrence observed that having articles online makes them
more easily accessible and hence more often cited than articles that are offline. Such
online repositories not only keep the researchers updated on work carried out at different
centres but also makes the interaction and exchange of information much easier. With such
information stored on the web, it becomes easier to point to the most frequently cited
papers on a topic, as well as related papers published earlier or later than a given paper.
This helps in understanding the state of the art in a particular field, helping
researchers to explore new areas. Fundamental web mining techniques are applied to
improve the search and categorization of research papers, and citing related articles.
Mining web data provides a lot of information, which can be better understood
with visualization tools. This makes concepts clearer than is possible with pure textual
representation. Hence, there is a need to develop tools that provide a graphical interface
that aids in visualizing results of web mining. Analyzing the web log data with
visualization tools has evoked a lot of interest in the research community. Chi, Pitkow,
Mackinlay, Pirolli, Gossweiler, and Card developed a web ecology and evolution
visualization (WEEV) tool to understand the relationship between web content, web
structure and web usage over a period of time. The site hierarchy is represented in a
circular form called the “Disk Tree” and the evolution of the web is viewed as a “Time
Tube.” Cadez, Heckerman, Meek, Smyth, and White present a tool called WebCANVAS
that displays clusters of users with similar navigation behaviour. Prasetyo, Pramudiono,
Takahashi, Toyoda, and Kitsuregawa developed Naviz, an interactive web log
visualization tool that is designed to display the user browsing pattern on the web site at a
global level, and then display each browsing path on the pattern displayed earlier in an
incremental manner. The support of each traversal is represented by the thickness of
the edge between the pages. Such a tool is very useful in analyzing user behaviour and
improving web sites.
Excitement about the web in the past few years has led to the web applications
being developed at a much faster rate in the industry than research in web related
technologies. Many of these are based on the use of web mining concepts, even though the
organizations that developed these applications, and invented the corresponding
technologies, did not consider it as such. We describe some of the most successful
applications in this section. Clearly, realizing that these applications use web mining is
largely a retrospective exercise. For each application category discussed below, we have
selected a prominent representative, purely for exemplary purposes. This in no way
implies that all the techniques described were developed by that organization alone. On
the contrary, in most cases the successful techniques were developed by a rapid “copy and
improve” approach to each other’s ideas.
Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed, “In a
traditional (brick-and-mortar) store, the main effort is in getting a customer to the store.
Once a customer is in the store they are likely to make a purchase—since the cost of going
to another store is high—and thus the marketing budget (focused on getting the customer
to the store) is in general much higher than the in store customer experience budget (which
keeps the customer in the store). In the case of an on-line store, getting in or out requires
exactly one click, and thus the main focus must be on customer experience in the store.”
Google is one of the most popular and widely used search engines. It provides
users access to information from over 2 billion web pages that it has indexed on its server.
The quality and quickness of the search facility makes it the most successful search
engine. Earlier search engines concentrated on web content alone to return the relevant
pages to a query. Google was the first to introduce the importance of the link structure in
mining information from the web. Page Rank, which measures the importance of a page, is
the underlying technology in all Google search products, and uses structural information
of the web graph to return high quality results.
The Google toolbar is another service provided by Google that seeks to make
search easier and informative by providing additional features such as highlighting the
query words on the returned web pages. The full version of the toolbar, if installed, also
sends the click-stream information of the user to Google. The usage statistics thus
obtained are used by Google to enhance the quality of its results. Google also provides
advanced search capabilities to search images and find pages that have been updated
within a specific date range. Built on top of Netscape’s Open Directory project, Google’s
web directory provides a fast and easy way to search within a certain topic or related
topics.
One of the biggest successes of America Online (AOL) has been its sizeable and
loyal customer base. A large portion of this customer base participates in various AOL
communities, which are collections of users with similar interests. In addition to providing
a forum for each such community to interact amongst themselves, AOL provides them
with useful information and services. Over time these communities have grown to be well-
visited waterholes for AOL users with shared interests. Applying web mining to the data
collected from community interactions provides AOL with a very good understanding of
its communities, which it has used for targeted marketing through advertisements and e-
mail solicitation. Recently, it has started the concept of “community sponsorship,”
whereby an organization, say Nike, may sponsor a community called “Young Athletic
Twenty Something.” In return, consumer survey and new product development experts of
the sponsoring organization get to participate in the community, perhaps without the
knowledge of other participants. The idea is to treat the community as a highly specialized
focus group, understand its needs and opinions on new and existing products, and also test
strategies for influencing opinions.
As individuals in a society where we have many more things than we need, the
allure of exchanging our useless stuff for some cash, no matter how small, is quite
powerful. This is evident from the success of flea markets, garage sales and estate sales.
The genius of eBay’s founders was to create an infrastructure that gave this urge a global
reach, with the convenience of doing it from one’s home PC. In addition, it popularized
auctions as a product selling and buying mechanism and provides the thrill of gambling
without the trouble of having to go to Las Vegas. All of this has made eBay one of the
most successful businesses of the internet era. Unfortunately, the anonymity of the web
has also created a significant problem for eBay auctions, as it is impossible to distinguish
real bids from fake ones. eBay is now using web mining techniques to analyze bidding
behaviour to determine if a bid is fraudulent (Colet 2002). Recent efforts are geared
towards understanding participants’ bidding behaviours/patterns to create a more efficient
auction market.
NEC Research Index, also known as CiteSeer, is one of the most popular online
bibliographic indices related to computer science. The key contribution of the CiteSeer
repository is its “Autonomous Citation Indexing” (ACI) (Lawrence, Giles, and Bollacker
1999). Citation indexing makes it possible to extract information about related articles.
Automating such a process reduces a lot of human effort, and makes it more effective and
faster. CiteSeer works by crawling the web and downloading research related papers.
Information about citations and the related context is stored for each of these documents.
The entire text and information about the document is stored in different formats.
Information about documents that are similar at a sentence level (percentage of sentences
that match between the documents), at a text level, or related through co-citation is also
given. Citation statistics are computed for documents, enabling the user to look at the
most cited or popular documents in the related field. CiteSeer also maintains a directory
of computer science related papers, to make search based on categories easier. These
documents are ordered by the number of citations.
Although we are going through an inevitable phase of irrational despair following a phase
of irrational exuberance about the commercial potential of the web, the adoption and usage
of the web continues to grow unabated. As the web and its usage grows, it will continue to
generate ever more content, structure, and usage data, and the value of web mining will
keep increasing. Outlined here are some research directions that must be pursued to ensure
that we continue to develop web mining technologies that will enable this value to be
realized.
Mining of market basket data, collected at the point-of-sale in any store, has been
one of the visible successes of data mining. However, this data provides only the end
result of the process, and only for decisions that ended in a purchase. Click-stream data
provides the opportunity for a detailed look at the decision making process
itself, and knowledge extracted from it can be used for optimizing, influencing the
process, etc. Underhill has conclusively proven the value of process information in
understanding users’ behaviour in traditional shops. Research needs to be carried out in (1)
extracting process models from usage data, (2) understanding how different parts of the
process model impact various web metrics of interest, and (3) how the process models
change in response to various changes that are made, i.e. changing stimuli to the user.
Society’s interaction with the web is changing the web as well as the way people
interact with each other. While storing the history of all this interaction in one place is
clearly too staggering a task, at least the changes to the web are being recorded by the
pioneering internet archive project. Research needs to be carried out in extracting temporal
models of how web content, web structures, web communities, authorities, hubs, etc.
evolve over time. Large organizations generally archive usage data from their web sites.
With these sources of data available, there is large scope for research into developing
techniques for analyzing how the web evolves over time.
As services over the web continue to grow, there will be a continuing need to make
them robust, scalable and efficient. Web mining can be applied to better understand the
behaviour of these services, and the knowledge extracted can be useful for various kinds
of optimizations. The successful application of web mining for predictive pre-fetching of
pages by a browser has been demonstrated by Pandey, Srivastava, and Shekhar. Analysis
of web logs is also needed for performance optimization of web services.
Research is needed in developing web mining techniques to improve various other aspects
of web services.
The anonymity provided by the web has led to a significant increase in attempted
fraud, from unauthorized use of individual credit cards to hacking into credit card
databases for blackmail purposes. Yet another example is auction fraud, which has been
increasing on popular sites like eBay. Since all these frauds are being perpetrated through
the internet, web mining is the perfect analysis technique for detecting and preventing
them. Research issues include developing techniques to recognize known frauds,
characterize them and recognize emerging frauds. The issues in cyber threat analysis and
intrusion detection are quite similar in nature.
While there are many benefits to be gained from web mining, a clear drawback is
the potential for severe violations of privacy. Public attitude towards privacy seems to be
almost schizophrenic, i.e. people say one thing and do quite the opposite. For example,
famous cases like those involving Amazon and Doubleclick seem to indicate that people
value their privacy, while experience at major e-commerce portals shows that over 97% of
all people accept cookies with no problems, and most of them actually like the
personalization features that are provided based on them. Spiekerman, Grossklags, and
Berendt have demonstrated that people were willing to provide fairly personal information
about themselves, which was completely irrelevant to the task at hand, if provided the
right stimulus to do so. Furthermore, explicitly bringing attention to information privacy
policies had practically no effect. One explanation of this seemingly contradictory attitude
towards privacy may be that we have a bi-modal view of privacy, namely that “I’d be
willing to share information about myself as long as I get some (tangible or intangible)
benefits from it, and as long as there is an implicit guarantee that the information will not
be abused.” The research issue generated by this attitude is the need to develop
approaches, methodologies and tools that can be used to verify and validate that a web
service is indeed using users' information in a manner consistent with its stated policies.