1. INTRODUCTION
With the explosive growth of information sources available on the World Wide Web,
it has become increasingly necessary for users to utilize automated tools in order to
find, extract, filter, and evaluate the desired information and resources. In addition,
with the transformation of the web into the primary tool for electronic commerce, it is
imperative for organizations and companies, who have invested millions in Internet
and Intranet technologies, to track and analyze user access patterns. These factors give
rise to the necessity of creating server-side and client-side intelligent systems that can
effectively mine for knowledge both across the Internet and in particular web
localities.
According to the Netcraft survey of March 2009, there were 224,749,695 sites across
all domains. Because search engines are built on exact keyword matching, and their
query languages are artificial, with a syntax and vocabulary far more restricted than
natural language, there are defects that no kind of search engine can overcome.
Narrow Searching Scope: The web pages indexed by any search engine are only a
tiny part of all the pages on the WWW, and the pages returned when a user submits a
query are another tiny part of the pages that engine has indexed.
Low Precision: Users cannot browse all the returned pages one by one, yet most of
those pages are irrelevant to the user's intent; they are highlighted and returned by the
search engine merely because they contain the keywords.
Web mining techniques could be used to solve the information overload problem
directly or indirectly. However, web mining techniques are not the only tools. Other
techniques and work from different research areas, such as databases (DB),
Information Retrieval (IR), Natural Language Processing (NLP), and the web
document community, could also be used.
INFORMATION RETRIEVAL
Information retrieval is the art and science of searching for information in documents,
searching for documents themselves, searching for metadata which describes
documents, or searching within databases, whether relational standalone databases or
hypertext networked databases such as the Internet or intranets, for text, sound,
images or data.
NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is concerned with the interactions between
computers and human (natural) languages. NLP is a form of human-to-computer
interaction where the elements of human language, be it spoken or written, are
formalized so that a computer can perform value-adding tasks based on that
interaction.
Natural language understanding is sometimes referred to as an AI-complete problem,
because natural-language recognition seems to require extensive knowledge about the
outside world and the ability to manipulate it.
Web mining is the application of machine learning (data mining) techniques to web-
based data for the purpose of learning or extracting knowledge. Web mining
encompasses a wide variety of techniques, including soft computing. Web mining
methodologies can generally be classified into one of three distinct categories: web
usage mining, web structure mining, and web content mining. In web usage mining
the goal is to examine web page usage patterns in order to learn about a web system's
users or the relationships between the documents. For example, the tool presented by
Masseglia et al. [MPC] creates association rules from web access logs, which
store the identity of pages accessed by users along with other information such as
when the pages were accessed and by whom; these logs are the focus of the data
mining effort, rather than the actual web pages themselves. Rules created by their
method could include, for example, "70% of the users that visited page A also visited
page B." Similarly, the method of Nasraoui et al. [NFJK] also examines web
access logs. The method employed in that paper is to perform a hierarchical clustering
in order to determine the usage patterns of different groups of users. Beeferman and
Berger [BB00] described a process they developed which determines topics related
to a user query using click-through logs and agglomerative clustering of bipartite
graphs. The transaction-based method developed in Ref. [Mer99] creates links
between pages that are frequently accessed together during the same session. Web
usage mining is useful for providing personalized web services, an area of web
mining research that has lately become active. It promises to help tailor web services,
such as web search engines, to the preferences of each individual user; recent
reviews of web personalization methods are available in the literature.
In the second category of web mining methodologies, web structure mining, we
examine only the relationships between web documents by utilizing the information
conveyed by each document's hyperlinks. Like the web usage mining methods
described above, the other content of the web pages is often ignored. In Ref. [KRRT],
Kumar et al. examined utilizing a graph representation of web page structure.
Here nodes in the graphs are web pages and edges indicate hyperlinks between pages.
By examining these "web graphs" it is possible to find documents or areas of interest.
One of the data sets used in this paper is publicly available and we will use it in our
experiments.
Here we briefly review some of the most popular vector-related distance measures
which will also be used in the experiments we perform. First, we have the well-known
Euclidean distance:

    d_E(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )

where x_i and y_i are the ith components of the vectors x = [x_1, x_2, ..., x_n] and
y = [y_1, y_2, ..., y_n], respectively. Euclidean distance measures the direct distance
between two points in the space R^n. For applications in text and document clustering and
classification, the cosine similarity measure [Sal89] is often used. We can convert this
to a distance measure as follows:

    d_cos(x, y) = 1 - (x · y) / (||x|| ||y||)

Here · indicates the dot product operation and ||…|| indicates the magnitude (length)
of a vector. If we take each document to be a point in R^n formed by the values
assigned to it under the vector model, each value corresponding to a dimension, then
the direction of the vector formed from the origin to the point representing the
document indicates the content of the document. Under the Euclidean distance,
documents with large differences in size have a large distance between them,
regardless of whether or not the content is similar, because the length of the vectors
differs. The cosine distance is length invariant, meaning only the direction of the
vectors is compared; the magnitude of the vectors is ignored.
Another popular distance measure for determining document similarity is the
extended Jaccard similarity [Sal89], which is converted to a distance measure as
follows:

    d_Jac(x, y) = 1 - (x · y) / (||x||^2 + ||y||^2 - x · y)
Jaccard distance has properties of both the Euclidean and cosine distance measures.
At high distance values, Jaccard behaves more like the cosine measure; at lower
values, it behaves more like Euclidean.
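As a concrete sketch, the three measures can be written in a few lines of Python over plain-list vectors (the function names are ours, not from any particular library):

```python
import math

def euclidean(x, y):
    # Direct distance between two points in R^n.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the vectors;
    # length-invariant: only the direction of the vectors matters.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def jaccard_distance(x, y):
    # Extended Jaccard similarity, converted to a distance.
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return 1.0 - dot / (sq_x + sq_y - dot)

doc_a = [2, 0, 1]   # toy term-frequency vectors
doc_b = [4, 0, 2]   # same direction as doc_a, but twice the length
print(euclidean(doc_a, doc_b))        # positive: the vector lengths differ
print(cosine_distance(doc_a, doc_b))  # ~0.0: identical direction
```

The example illustrates the length-invariance point made above: two documents with proportional term frequencies are far apart under Euclidean distance but at distance (approximately) zero under the cosine measure.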
4. WEB MINING
With the recent explosive growth of the amount of content on the Internet, it has
become increasingly difficult for users to find and utilize information and for content
providers to classify and catalog documents. Traditional web search engines often
return hundreds or thousands of results for a search, which is time consuming for
users to browse. On-line libraries, search engines, and other large document
repositories (e.g. customer support databases, product specification databases, press
release archives, news story archives, etc.) are growing so rapidly that it is difficult
and costly to categorize every document manually. In order to deal with these
problems, researchers look toward automated methods of working with web
documents so that they can be more easily browsed, organized, and cataloged with
minimal human intervention.
In contrast to the highly structured tabular data upon which most machine learning
methods are expected to operate, web and text documents are semi-structured. Web
documents have well-defined structures such as letters, words, sentences, paragraphs,
sections, punctuation marks, HTML tags, and so forth. We know that words make up
sentences, sentences make up paragraphs, and so on, but many of the rules governing
the order in which the various elements are allowed to appear are vague or ill-defined
and can vary dramatically between documents. It is estimated that as much as 85% of
all digital business information, most of it web-related, is stored in non-structured
formats (i.e. not in the tabular formats that are used in databases and
spreadsheets) [pet]. Developing improved methods of performing machine learning
techniques on this vast amount of non-tabular, semi-structured web data is therefore
highly desirable.
Clustering and classification have been useful and active areas of machine learning
research that promise to help us cope with the problem of information overload on the
Internet. With clustering the goal is to separate a given group of data items (the data
set) into groups called clusters such that items in the same cluster are similar to each
other and dissimilar to the items in other clusters. In clustering methods no labeled
examples are provided in advance for training (this is called unsupervised learning).
Under classification we attempt to assign a data item to a predefined category based
on a model that is created from pre-classified training data (supervised learning). In
more general terms, both clustering and classification come under the area of
knowledge discovery in databases or data mining. Applying data mining techniques to
web page content is referred to as web content mining which is a new sub-area of web
mining, partially built upon the established field of information retrieval.
When representing text and web document content for clustering and classification, a
vector-space model is typically used. In this model, each possible term that can appear
in a document becomes a feature dimension [Sal89]. The value assigned to each
dimension of a document may indicate the number of times the corresponding term
appears in it, or it may be a weight that takes into account other frequency
information, such as the number of documents in which the term appears. This
model is simple and allows the use of traditional machine learning methods that deal
with numerical feature vectors in a Euclidean feature space. However, it discards
information such as the order in which the terms appear, where in the document the
terms appear, how close the terms are to each other, and so forth. By keeping this kind
of structural information we could possibly improve the performance of various
machine learning algorithms. The problem is that traditional data mining methods are
often restricted to working on purely numeric feature vectors due to the need to
compute distances between data items or to calculate some representative of a cluster
of items (i.e. a centroid or center of a cluster), both of which are easily accomplished
in a Euclidean space. Thus either the original data needs to be converted to a vector of
numeric values by discarding possibly useful structural information (which is what we
are doing when using the vector model to represent documents) or we need to develop
new, customized methodologies for the specific representation.
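A minimal sketch of building such term-frequency vectors under the vector model (whitespace tokenization only, no weighting; the function name is ours):

```python
def to_vectors(docs):
    # Build the vocabulary: each distinct term becomes a feature dimension.
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for t in d.lower().split():
            v[index[t]] += 1   # value = number of times the term appears
        vectors.append(v)
    return vocab, vectors

vocab, vecs = to_vectors(["web mining mines the web", "data mining"])
# note how the order and positions of the terms are discarded,
# exactly the structural information the text says the model loses
```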
Graphs are important and effective mathematical constructs for modeling
relationships and structural information. Graphs (and their more restrictive form,
trees) are used in many different problems, including sorting, compression, traffic
flow analysis, resource allocation, etc. [CLR97]. In addition to problems where the
graph itself is processed by some algorithm (e.g. sorting by the depth first method or
finding the minimum spanning tree) it would be extremely desirable in many
applications, including those related to machine learning, to model data as graphs
since these graphs can retain more information than sets or vectors of simple atomic
features. Thus much research has been performed in the area of graph similarity in
order to exploit the additional information allowed by graph representations by
introducing mathematical frameworks for dealing with graphs. Some application
domains where graph similarity techniques have been applied include face
[WFKvdM97] and fingerprint [WJH] recognition, as well as anomaly detection in
communication networks [DBD+01]. In the literature, the work comes under several
different topic names including graph distance, (exact) graph matching, inexact graph
matching, error-tolerant graph matching, or error-correcting graph matching. In exact
graph matching we are attempting to determine if two graphs are identical. Inexact
graph matching implies we are attempting not to find a perfect matching, but rather a
"best" or "closest" matching. Error-tolerant and error-correcting are special cases of
inexact matching where the imperfections (e.g. missing nodes) in one of the graphs,
called the data graph, are assumed to be the result of some errors (e.g. from
transmission or recognition). We attempt to match the data graph to the most similar
model graph in our database. Graph distance is a numeric measure of dissimilarity
between graphs, with larger distances implying more dissimilarity. By graph
similarity, we mean we are interested in some measurement that tells us how similar
graphs are regardless if there is an exact matching between them.
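As one illustration of a graph distance, a measure based on the maximum common subgraph can be sketched. The sketch below assumes uniquely labeled nodes, so the maximum common subgraph reduces to intersecting node and edge sets (in the general case, finding it is NP-hard); the dict-of-sets representation and function name are ours:

```python
def mcs_distance(g1, g2):
    """Graph distance based on a (simplified) maximum common subgraph.

    g1, g2: graphs as dicts mapping each node label to the set of node
    labels it links to. With uniquely labeled nodes, the maximum common
    subgraph is just the intersection of the node and edge sets.
    """
    nodes = set(g1) & set(g2)
    edges = {(u, v) for u in nodes for v in (g1[u] & g2[u]) if v in nodes}
    mcs_size = len(nodes) + len(edges)          # |mcs(G1, G2)|
    size1 = len(g1) + sum(len(v) for v in g1.values())
    size2 = len(g2) + sum(len(v) for v in g2.values())
    # distance 0 for identical graphs, approaching 1 for disjoint ones
    return 1.0 - mcs_size / max(size1, size2)

g1 = {'A': {'B'}, 'B': set()}
g2 = {'A': {'B'}, 'B': {'A'}}
print(mcs_distance(g1, g2))   # larger distance implies more dissimilarity
```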
The web as we all know is the SINGLE largest source of data available. Web mining
aims to extract and mine useful knowledge from the web. It is used to understand the
customer behavior, evaluate the effectiveness of a website and also to help quantify
the success of a marketing campaign. Web mining is the integration of information
gathered by traditional data mining methodologies and techniques with information
gathered over the World Wide Web.
In traditional data mining, the data is structured, with well-defined tables, columns,
rows, keys, and constraints. The data in web mining is dynamic and rich in features
and patterns. Web mining involves analysis of the web server logs of a website,
whereas data mining involves using techniques to find relationships in large amounts
of data. Web mining often needs to react to evolving usage patterns in real time, e.g.
in merchandizing.
Data mining is also called knowledge discovery and data mining (KDD). It is the
extraction of useful patterns from data sources, e.g. databases, texts, web, images, etc.
Patterns must be valid, novel, potentially useful, and understandable. Classic data
mining tasks include:
Classification: mining patterns that can classify future (new) data into known
classes.
Association rule mining: mining any rule of the form X → Y, where X and Y
are sets of data items. E.g., Cheese, Milk → Bread [sup = 5%, conf = 80%]
Clustering: identifying a set of similarity groups in the data
Sequential pattern mining: a sequential rule A → B says that event A will be
immediately followed by event B with a certain confidence
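Support and confidence for an association rule like the cheese-and-milk example above can be computed directly; a small sketch over toy market-basket transactions (item names and function name are illustrative):

```python
def rule_stats(transactions, x, y):
    # support(X → Y)    = fraction of transactions containing X ∪ Y
    # confidence(X → Y) = support(X ∪ Y) / support(X)
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

baskets = [["cheese", "milk", "bread"],
           ["cheese", "milk"],
           ["milk", "bread"],
           ["cheese", "milk", "bread"]]
sup, conf = rule_stats(baskets, ["cheese", "milk"], ["bread"])
# rule {cheese, milk} → {bread}: support = 2/4, confidence = 2/3
```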
The web is a collection of inter-related files on one or more web servers. Web mining
is a multi-disciplinary effort that draws techniques from fields like information retrieval,
statistics, machine learning, natural language processing, and others. Web mining has
a new character compared with traditional data mining. First, the objects of web
mining are a large number of web documents which are heterogeneously distributed,
and each data source is itself heterogeneous; second, the web document itself is semi-
structured or unstructured and lacks semantics that the machine can understand.
4.2. HISTORY
The term “Web Mining” was first used in [E1996], where it was defined in a ‘task
oriented’ manner. An alternate ‘data oriented’ definition was given in [CMS1997].
The first panel discussion on the topic took place at ICTAI 1997 [SM1997], and it
remains a continuing forum.
WebKDD workshops with ACM SIGKDD, 1999, 2000, 2001, 2002, … ; 60 –
90 attendees
SIAM Web analytics workshop 2001, 2002, …
Special issues of DMKD journal, SIGKDD Explorations
Papers in various data mining conferences & journals
Surveys [MBNL 1999, BL 1999, KB2000]
1. Today World Wide Web is flooded with billions of static and dynamic web
pages created with programming languages such as HTML, PHP and ASP. It
is a significant challenge to search for useful and relevant information on the web.
2. Creating knowledge from available information.
3. As the coverage of information is very wide and diverse, personalization of
the information is a tedious process.
4. Learning customer and individual user patterns.
5. Complexity of Web pages far exceeds the complexity of any conventional text
document. Web pages on the internet lack uniformity and standardization.
6. Much of the information present on web is redundant, as the same piece of
information or its variant appears in many pages.
7. The web is noisy i.e. a page typically contains a mixture of many kinds of
information like, main content, advertisements, copyright notice, navigation
panels.
8. The web is dynamic, information keeps on changing constantly. Keeping up
with the changes and monitoring them are very important.
9. The Web is not only about disseminating information but also about services.
Many Web sites and pages enable people to perform operations with input
parameters, i.e., they provide services.
10. The most important challenge faced is invasion of privacy. Privacy is
considered lost when information concerning an individual is obtained, used,
or disseminated without that individual's knowledge or consent.
Web Content mining is the discovery of useful information from web contents / data /
documents. Web data contents: text, image, audio, video, metadata and hyperlinks.
Web content mining is an automatic process that goes beyond keyword extraction.
Since the content of a text document presents no machine-readable semantics, some
approaches have suggested restructuring the document content into a representation
that could be exploited by machines. The usual approach to exploiting known structure
in documents is to use wrappers to map documents to some data model. Techniques
using lexicons for content interpretation are yet to come. There are two groups of web
content mining strategies: Those that directly mine the content of documents and
those that improve on the content search of other tools like search engines.
Web Content Mining deals with discovering useful information or knowledge from
web page contents. Web content mining analyzes the content of Web resources.
Content data is the collection of facts that are contained in a web page. It consists of
unstructured data such as free text, images, audio, and video; semi-structured data
such as HTML documents; and more structured data such as data in tables or
database-generated HTML pages. The primary Web resources that are mined in Web content
mining are individual pages. They can be used to group, categorize, analyze, and
retrieve documents. There are several issues in Web Content Mining:
Developing intelligent tools for IR
Finding keywords and key phrases
Discovering grammatical rules and collocations
Hypertext classification/categorization
Extracting key phrases from text documents
Learning extraction models/rules
Hierarchical clustering
Predicting word relationships
Developing Web query systems such as WebOQL and XML-QL
Mining multimedia data, e.g. mining satellite images (Fayyad, et al. 1996) and
mining images to identify small volcanoes on Venus (Smyth, et al. 1996)
World Wide Web can reveal more information than just the information contained in
documents. For example, links pointing to a document indicate the popularity of the
document, while links coming out of a document indicate the richness or perhaps the
variety of topics covered in the document. This can be compared to bibliographical
citations. When a paper is cited often, it ought to be important. The PageRank and
CLEVER methods take advantage of this information conveyed by the links to find
pertinent web pages. By means of counters, higher levels accumulate the number of
artifacts subsumed by the concepts they hold. Counters of hyperlinks into and out of
documents retrace the structure of the web artifacts summarized.
Web structure mining is the process of discovering structure information from the
web. The structure of a typical web graph consists of web pages as nodes, and
hyperlinks as edges connecting related pages. This can be further divided into two
kinds based on the kind of structure information used.
Hyperlinks
A hyperlink is a structural unit that connects a location in a web page to a different
location, either within the same web page or on a different web page. A hyperlink that
connects to a different part of the same page is called an Intra-document hyperlink,
and a hyperlink that connects two different pages is called an inter-document
hyperlink.
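The intra- versus inter-document distinction can be checked mechanically by resolving each href against the page it appears on; a sketch using Python's standard `urllib.parse` (the URLs are illustrative):

```python
from urllib.parse import urljoin, urldefrag

def classify_link(page_url, href):
    # Resolve the href against the page it appears on, then compare the
    # fragment-free URLs: same document -> intra, different -> inter.
    target = urljoin(page_url, href)
    page, _ = urldefrag(page_url)
    doc, _ = urldefrag(target)
    return "intra-document" if doc == page else "inter-document"

print(classify_link("http://example.org/a.html", "#section2"))  # intra-document
print(classify_link("http://example.org/a.html", "b.html"))     # inter-document
```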
Document Structure
In addition, the content within a Web page can also be organized in a tree structured
format, based on the various HTML and XML tags within the page. Mining efforts
here have focused on automatically extracting document object model (DOM)
structures out of documents.
Web structure mining focuses on the hyperlink structure within the Web itself. The
different objects are linked in some way. Simply applying the traditional processes
and assuming that the events are independent can lead to wrong conclusions.
However, the appropriate handling of the links could lead to potential correlations,
and then improve the predictive accuracy of the learned models.
Two algorithms that have been proposed to deal with these potential correlations are:
1. HITS and
2. Page Rank.
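The PageRank computation can be sketched as a simple power iteration over the link graph; this is an illustrative implementation (function and page names are ours), not the production algorithm:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank sketch. `graph` maps each page to the list of
    pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if outlinks:
                share = rank[p] / len(outlinks)   # split rank over outlinks
                for q in outlinks:
                    new[q] += damping * share
            else:
                # dangling page: distribute its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
# C receives links from both A and B, so it ends up with the highest rank
```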
6.2.2. HITS
Hyperlink-induced topic search (HITS) is an iterative algorithm for mining the Web
graph to identify topic hubs and authorities. Authorities are pages with good
sources of content that are referred to by many other pages or highly ranked pages for
a given topic; hubs are pages with good sources of links. The algorithm takes as input
search results returned by traditional text indexing techniques, and filters these results
to identify hubs and authorities. The number and weight of hubs pointing to a page
determine the page's authority. The algorithm assigns weight to a hub based on the
authoritativeness of the pages it points to. If many good hubs point to a page p, then
authority of that page p increases. Similarly if a page p points to many good
authorities, then hub of page p increases.
After the computation, HITS outputs the pages with the largest hub weight and the
pages with the largest authority weights, which is the search result of a given topic.
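A minimal sketch of the iterative hub/authority computation described above (unoptimized; the page names are illustrative):

```python
def hits(graph, iterations=50):
    """Iterative HITS sketch. `graph` maps each page to the list of
    pages it links to."""
    pages = set(graph) | {q for links in graph.values() for q in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p = total hub weight of the pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub of p = total authority weight of the pages p points to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# toy query graph: H1, H2 are candidate hubs; A, B are candidate authorities
hub, auth = hits({"H1": ["A"], "H2": ["A", "B"], "A": [], "B": []})
# A collects hub weight from both H1 and H2, so it gets the top authority
# score; H2 points to both authorities, so it becomes the top hub
```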
Web usage mining is a process of extracting useful information from server logs, i.e.
users' history. Web usage mining is the process of finding out what users are looking
for on the Internet.
Web usage mining focuses on techniques that could predict the behavior of users
while they are interacting with the WWW. It collects the data from Web log records
to discover user access patterns of Web pages. Usage data captures the identity or
origin of web users along with their browsing behavior at a web site.
Web servers record and accumulate data about user interactions whenever requests for
resources are received. Analyzing the web access logs of different web sites can help
understand the user behavior and the web structure, thereby improving the design of
this colossal collection of resources. There are two main tendencies in Web Usage
Mining driven by the applications of the discoveries: General Access Pattern Tracking
and Customized Usage Tracking. The general access pattern tracking analyzes the
web logs to understand access patterns and trends. These analyses can shed light on
better structure and grouping of resource providers. Many web analysis tools existed
but they are limited and usually unsatisfactory. We have designed a web log data
mining tool, Web Log Miner, and proposed techniques for using data mining and
OnLine Analytical Processing (OLAP) on treated and transformed web access files.
Applying data mining techniques on access logs unveils interesting access patterns
that can be used to restructure sites in a more efficient grouping, pinpoint effective
advertising locations, and target specific users for specific selling ads.
Customized usage tracking analyzes individual trends. Its purpose is to customize
web sites to users. The information displayed, the depth of the site structure and the
format of the resources can all be dynamically customized for each user over time
based on their access patterns.
While it is encouraging and exciting to see the various potential applications of web
log file analysis, it is important to know that the success of such applications depends
on what and how much valid and reliable knowledge one can discover from the large
raw log data. Current web servers store limited information about the accesses. Some
scripts custom-tailored for some sites may store additional information. However, for
an effective web usage mining, an important cleaning and data transformation step
before analysis may be needed.
In the use and mining of web data, the most direct source of data is the web log files
on the web server. Web log files record the visitor's browsing behavior very
clearly. Web log files include the server log, agent log, and client log (IP address,
URL, page reference, access time, cookies, etc.).
Web servers have the ability to log all requests. The most commonly used web server
log format is the Common Log Format (CLF); the newer Extended Log Format allows
configuration of the log file. Design of a Web Log Miner: the web log is filtered to
generate a relational database; a data cube is generated from the database; OLAP is
used to drill-down and roll-up in the cube; and OLAM is used for mining interesting
knowledge.
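A CLF line can be parsed with a regular expression; the sketch below (the field names are ours) handles the standard host, ident, authuser, date, request, status, and bytes layout:

```python
import re

# One Common Log Format (CLF) line:
#   host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_clf(line):
    # Returns a dict of named fields, or None for malformed lines.
    m = CLF.match(line)
    return m.groupdict() if m else None

entry = parse_clf('127.0.0.1 - alice [10/Oct/2000:13:55:36 -0700] '
                  '"GET /index.html HTTP/1.0" 200 2326')
```

Filtering such parsed records into a relational table is the first step of the Web Log Miner design sketched above.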
There are several available research projects and commercial products that analyze
those patterns for different purposes. The applications generated from this analysis
can be classified as personalization, system improvement, site modification, business
intelligence and usage characterization.
The Web Usage Mining can be decomposed into the following three main sub tasks:
6.3.1. Pre-processing
It is necessary to perform data preparation to convert the raw data for further
processing. The data actually collected is generally incomplete, redundant, and
ambiguous. In order to mine the knowledge more effectively, pre-processing the
collected data is essential. Preprocessing can provide accurate, concise data for data
mining. Data preprocessing includes data cleaning, user identification, user session
identification, access path supplement, and transaction identification.
The main task of data cleaning is to remove the redundant web log data which
is not associated with the useful data, narrowing the scope of data objects.
Identifying the individual user must be done after data cleaning. The purpose of
user identification is to identify each user uniquely. It can be accomplished by
means of cookie technology, user registration techniques, and investigative
rules.
User session identification should be done on the basis of user
identification. The purpose is to divide each user's access information into
several separate session processes. The simplest way is to use a time-out
estimation approach: when the time interval between page requests
exceeds a given value, the user is taken to have started a new session.
Because of the widespread use of page caching technology and proxy
servers, the access path recorded by the web server access logs may not be the
complete access path of the user. An incomplete access log does not accurately
reflect the user's access patterns, so it is necessary to supplement the access
path. Path supplement can be achieved by using the website topology to
analyze the pages.
Transaction identification is based on the user's session recognition; its
purpose is to divide or combine transactions according to the demands of the
data mining task, in order to make them appropriate for data mining
analysis.
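The time-out estimation approach above can be sketched as follows, assuming the log has already been cleaned into (user, timestamp, page) tuples sorted by time (a simplifying assumption) and using a common 30-minute threshold:

```python
def sessionize(requests, timeout=1800):
    """Time-out based session identification (sketch).

    requests: list of (user, timestamp_seconds, page), sorted by time.
    A gap larger than `timeout` (default 30 minutes) between two requests
    of the same user starts a new session.
    """
    last_seen = {}    # user -> timestamp of that user's previous request
    session_of = {}   # user -> id of that user's current session
    sessions = {}     # session id -> ordered list of pages
    next_id = 0
    for user, ts, page in requests:
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_of[user] = next_id    # start a new session
            sessions[next_id] = []
            next_id += 1
        sessions[session_of[user]].append(page)
        last_seen[user] = ts
    return sessions

log = [("u1", 0, "/a"), ("u1", 100, "/b"), ("u1", 5000, "/c")]
print(sessionize(log))   # {0: ['/a', '/b'], 1: ['/c']}
```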
7. WEB CRAWLER
A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner. Other terms for web crawlers are ants, automatic
indexers, bots, worms, web spiders, and web robots. Search engines use spidering as
a means of providing up-to-date data. Crawlers can also be used for automating
maintenance tasks on a Web site, such as checking links or validating HTML code.
Crawlers can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses (usually for spam), eg. anita at cs dot sunysb dot edu ;
mueller{remove this}@cs.sunysb.edu. A Web crawler is one type of bot, or software
agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler
visits these URLs, it identifies all the hyperlinks in the page and adds them to the list
of URLs to visit, called the crawl frontier.
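The seed/frontier loop can be sketched as follows; `fetch_links` is a hypothetical stand-in for downloading a page and extracting its hyperlinks, so the example runs on a toy in-memory web (a real crawler would fetch and parse HTML, honor robots.txt, and throttle its requests):

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Frontier-based crawl sketch. `fetch_links(url)` returns the
    hyperlink URLs found on a page (an assumed callback, not a real
    library call)."""
    frontier = deque(seeds)   # URLs still to visit: the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)   # grow the frontier
    return visited

toy_web = {"/seed": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": []}
print(sorted(crawl(["/seed"], lambda u: toy_web.get(u, []))))
# ['/a', '/b', '/c', '/seed']
```

Using a queue (FIFO) for the frontier gives a breadth-first crawl; swapping in a priority queue would allow "best pages first" strategies.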
So-called personalized service means that when a user browses a website, the site
meets each user's browsing interests as far as possible and constantly adjusts to adapt
as those interests change, so that each user feels he or she is a unique user of the
website.
In order to achieve personalized service, the site first has to obtain and collect
information on clients to grasp customers' spending habits, hobbies, and consumer
psychology, and can then provide targeted personalized service. Obtaining consumer
spending behavior patterns is very difficult with the traditional marketing approach,
but it can be done using web mining techniques.
Web mining can predict trend within the retrieved information to indicate future
values. For example, an electronic auction company provides information about items
to auction, previous auction details, etc. Predictive modeling can be utilized to
analyze the existing information, and to estimate the values for auctioneer items or
number of people participating in future auctions.
The predicting capability of the mining application can also benefit society by
identifying criminal activities.
9.1. PROS
Web mining essentially has many advantages which make this technology attractive
to corporations, including government agencies. This technology has enabled
ecommerce to do personalized marketing, which eventually results in higher trade
volumes. Government agencies are using this technology to classify threats and
fight against terrorism. The predicting capability of the mining application can benefit
the society by identifying criminal activities. The companies can establish better
customer relationship by giving them exactly what they need. Companies can
understand the needs of the customer better and they can react to customer needs
faster. The companies can find, attract and retain customers; they can save on
production costs by utilizing the acquired insight of customer requirements. They can
increase profitability by target pricing based on the profiles created. They can even
identify customers who might defect to a competitor; the company can then try to
retain such a customer by providing promotional offers, thus reducing the risk of
losing the customer.
9.2. CONS
Web mining in itself does not create issues, but when the technology is applied to data of a personal nature it can raise concerns. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without the individual's knowledge or consent. The collected data are analyzed and clustered to form profiles; the data are made anonymous before clustering so that no personal profiles result. These applications therefore de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of on their own individual characteristics and merits.
Another important concern is that companies collecting data for a specific purpose might use the data for an entirely different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade the personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of privacy invasion. Companies that buy the data are obliged to make it anonymous, and they are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of a release, and any inaccuracies in it can result in serious lawsuits; however, there is no law preventing them from trading the data.
Some mining algorithms might use controversial attributes such as sex, race, religion, or sexual orientation to categorize individuals; such practices may violate anti-discrimination legislation. The applications make it hard to detect the use of such attributes, and there is no strong rule against algorithms that employ them. The result could be the denial of a service or privilege to an individual based on race, religion, or sexual orientation; at present this situation is avoided only by the high ethical standards maintained by the data mining company. The collected data are made anonymous so that neither the data nor the extracted patterns can be traced back to an individual. Although this might appear to pose no threat to privacy, much additional information can in fact be inferred by combining separate pieces of data about the same user.
With the rapid growth of the World Wide Web, web mining has become a very active and popular research topic, and it now plays an important role in helping e-commerce websites and e-services understand how their sites and services are used and in providing better services for their customers and users.
Recently, many researchers have focused on developing new web mining techniques and algorithms that have not yet been applied in real application environments. It may therefore be an important time to shift the research focus to application areas such as e-commerce and e-services.
The input consists of web pages and web server log files. A web robot (webbot) is used to retrieve the pages of the website. The webbot is a very fast web walker with support for regular expressions, SQL logging facilities, and many other features; it can be used to check links, find bad HTML, map out a website, download images, etc. In parallel, the web server log files are downloaded and processed through a sessionizer, and a LOGML file is generated.
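The sessionizing step described above can be sketched in Python. This is a minimal illustration, not the tool's actual code: it assumes pre-parsed `(ip, timestamp, url)` triples and a conventional 30-minute inactivity timeout, and it groups requests from the same client into sessions.

```python
# Hypothetical sessionizer sketch: group (ip, timestamp, url) log entries
# into per-user sessions, starting a new session after 30 minutes of
# inactivity. Field names, data, and the timeout are illustrative.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(entries):
    """Return {ip: [session, ...]}, each session a list of URLs in order."""
    sessions = {}   # ip -> list of sessions
    last_seen = {}  # ip -> timestamp of the most recent request
    for ip, ts, url in sorted(entries, key=lambda e: e[1]):
        if ip not in last_seen or ts - last_seen[ip] > TIMEOUT:
            sessions.setdefault(ip, []).append([])  # start a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = ts
    return sessions

log = [
    ("10.0.0.1", datetime(2009, 3, 1, 10, 0), "/index.html"),
    ("10.0.0.1", datetime(2009, 3, 1, 10, 5), "/courses.html"),
    ("10.0.0.1", datetime(2009, 3, 1, 11, 0), "/index.html"),  # > 30 min gap
]
print(sessionize(log)["10.0.0.1"])
# → [['/index.html', '/courses.html'], ['/index.html']]
```

In a real pipeline the sessions would then be serialized into the LOGML file rather than printed.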
The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming, and integrating data, loading it into a database, and later generating graphs in XGMML. User sessions are extracted from the web logs, yielding results roughly attributable to specific users. The sessions are then converted into a special format for sequence mining with cSPADE (constrained SPADE: Sequential PAttern Discovery using Equivalence classes). The output is the set of frequent contiguous sequences with a given minimum support. These are imported into a database, and non-maximal frequent sequences are removed. Different queries are then executed against this data according to criteria such as the support of each pattern or the length of the patterns, and different URLs that correspond to the same web page are unified in the final results. The Visualization Stage maps the extracted data and attributes into visual images, realized through VTK extended with support for graphs. The result is interactive 3D/2D visualizations that analysts can use to compare actual web surfing patterns with expected patterns.
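The pruning of non-maximal sequences mentioned in the pipeline can be sketched as follows. This is an illustrative reimplementation, not the project's database queries: a frequent contiguous sequence is dropped when it occurs as a contiguous subsequence of some longer frequent sequence.

```python
# Illustrative pruning of non-maximal frequent contiguous sequences.
# The pattern data is invented; sequences are tuples of page identifiers.
def is_contiguous_subseq(small, big):
    """True when `small` appears as a contiguous run inside a longer `big`."""
    n = len(small)
    return small != big and any(
        big[i:i + n] == small for i in range(len(big) - n + 1))

def prune_non_maximal(frequent):
    """Keep only sequences not contained in any longer frequent sequence."""
    return [s for s in frequent
            if not any(is_contiguous_subseq(s, t) for t in frequent)]

patterns = [("A",), ("A", "B"), ("A", "B", "C"), ("D",)]
print(prune_non_maximal(patterns))
# → [('A', 'B', 'C'), ('D',)]
```

The quadratic scan is fine for illustration; with many patterns one would index sequences by their items first.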
One can see user access paths scattering from the first page of the website (the node in the center) to clusters of web pages corresponding to faculty pages, course home pages, etc. The Strahler number, a numerical measure of branching complexity, is used to assign colors to the edges.
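The Strahler measure used for edge coloring can be computed with a short recursion. This is a minimal sketch on a toy site tree, assuming the access paths form a tree rooted at the home page; the site map is invented.

```python
# Minimal Strahler-number computation on a tree of pages. A leaf scores 1;
# an internal node scores its highest child score, plus one when the two
# highest child scores are equal. The toy site map below is illustrative.
def strahler(tree, node):
    """tree maps each node to a list of its children."""
    children = tree.get(node, [])
    if not children:
        return 1
    scores = sorted((strahler(tree, c) for c in children), reverse=True)
    if len(scores) > 1 and scores[0] == scores[1]:
        return scores[0] + 1
    return scores[0]

site = {"/": ["/a", "/b"], "/a": ["/a1", "/a2"], "/b": []}
print(strahler(site, "/"))   # → 2
```

Higher numbers mark the more heavily branching parts of the access tree, which is why the measure works well for choosing edge colors.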
Adding a third dimension enables the visualization of more information and clarifies user behavior within and between clusters. The center node of the circular base is the first page of the website, from which users scatter to the different clusters of web pages. A color spectrum from red (entry points into clusters) to blue (exit points) clarifies the behavior of users. The cylinder-like part of the figure visualizes the web usage of surfers as they browse a long HTML document.
Frequent access patterns extracted by the web mining process are visualized as a white graph on top of an embedded, colored graph of web usage.
Similar to the previous picture, with the addition of another attribute: the frequency of each pattern, rendered as the thickness of the white tubes. This helps in the analysis of the results.
Superimposition of web usage on top of web structure with a higher-order layout. The top node is the first page of the website. The hierarchical layout makes analysis easier.
Figure(xxi). Higher Order layout for clear visualization and easier analysis
Using these visualizations, a web analyst can easily identify which parts of the website are cold, with few hits, and which are hot, with many hits, and classify them accordingly. This also paves the way for exploratory changes to the website, for example adding links from hot parts of the site to cold parts and then extracting, visualizing, and interpreting the changes in access patterns.
11. CONCLUSION
As the Web and its usage continue to grow, so does the opportunity to analyze Web data and extract all manner of useful knowledge from it. The past few years have seen the emergence of Web mining as a rapidly growing area, due to the efforts of the research community as well as the various organizations practicing it. The key component of web mining is the mining process itself. This research area is so large today because of the tremendous growth of information sources available on the Web and the recent interest in e-commerce. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign. Here we have described the key computer science contributions made in this field, including an overview of web mining, a taxonomy of web mining, the prominent successful applications, and some promising areas of future research.
12. REFERENCES