1. INTRODUCTION
With the explosive growth of information sources available on the World Wide Web,
it has become increasingly necessary for users to utilize automated tools in order to
find, extract, filter, and evaluate the desired information and resources. In addition,
with the transformation of the web into the primary tool for electronic commerce, it is
imperative for organizations and companies, who have invested millions in Internet
and Intranet technologies, to track and analyze user access patterns. These factors give
rise to the necessity of creating server-side and client-side intelligent systems that can
effectively mine for knowledge both across the Internet and in particular web
localities.
According to the Netcraft survey of March 2009, there were 224,749,695 sites across
all domains. Because search engines are built on exact keyword matching, and their
query languages are artificial, with a syntax and vocabulary far more restricted than
natural language, there are defects that no kind of search engine can overcome.
Narrow Searching Scope: The web pages indexed by any search engine are only a
tiny part of all the pages on the WWW, and the pages returned when a user submits a
query are another tiny part of the pages that engine has indexed.
Low Precision: Users cannot browse all the returned pages one by one, yet most of
those pages are irrelevant to the user's intent; they are highlighted and returned by the
search engine merely because they contain the keywords.
Web mining techniques could be used to solve the information overload problem
directly or indirectly. However, web mining techniques are not the only tools. Other
techniques and work from different research areas, such as databases (DB),
Information Retrieval (IR), Natural Language Processing (NLP), and the web
document community, could also be used.
INFORMATION RETRIEVAL
Information retrieval is the art and science of searching for information in documents,
searching for documents themselves, searching for metadata which describes
documents, or searching within databases, whether relational standalone databases or
hypertext networked databases such as the Internet or intranets, for text, sound,
images or data.
NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is concerned with the interactions between
computers and human (natural) languages. NLP is a form of human-to-computer
interaction where the elements of human language, be it spoken or written, are
formalized so that a computer can perform value-adding tasks based on that
interaction.
Natural language understanding is sometimes referred to as an AI-complete problem,
because natural-language recognition seems to require extensive knowledge about the
outside world and the ability to manipulate it.
Web mining is the application of machine learning (data mining) techniques to web-
based data for the purpose of learning or extracting knowledge. Web mining
encompasses a wide variety of techniques, including soft computing. Web mining
methodologies can generally be classified into one of three distinct categories: web
usage mining, web structure mining, and web content mining. In web usage mining
the goal is to examine web page usage patterns in order to learn about a web system's
users or the relationships between the documents. For example, the tool presented by
Masseglia et al. [MPC] creates association rules from web access logs, which
store the identity of pages accessed by users along with other information such as
when the pages were accessed and by whom; these logs are the focus of the data
mining effort, rather than the actual web pages themselves. Rules created by their
method could include, for example, "70% of the users that visited page A also visited
page B." Similarly, the method of Nasraoui et al. [NFJK] also examines web
access logs. The method employed in that paper is to perform a hierarchical clustering
in order to determine the usage patterns of different groups of users. Beeferman and
Berger [BB00] described a process they developed which determines topics related
to a user query using click-through logs and agglomerative clustering of bipartite
graphs. The transaction-based method developed in Ref. [Mer99] creates links
between pages that are frequently accessed together during the same session. Web
usage mining is useful for providing personalized web services, an area of web
mining research that has lately become active. It promises to help tailor web services,
such as web search engines, to the preferences of each individual user; recent
reviews of web personalization methods are available in the literature.
In the second category of web mining methodologies, web structure mining, we
examine only the relationships between web documents by utilizing the information
conveyed by each document's hyperlinks. Like the web usage mining methods
described above, the other content of the web pages is often ignored. In Ref. [KRRT],
Kumar et al. examined utilizing a graph representation of web page structure.
Here nodes in the graphs are web pages and edges indicate hyperlinks between pages.
By examining these "web graphs" it is possible to find documents or areas of interest.
One of the data sets used in this paper is publicly available and we will use it in our
experiments.
Here we briefly review some of the most popular vector-related distance measures
which will also be used in the experiments we perform. First, we have the well-known
Euclidean distance:

    d_E(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )

where x_i and y_i are the ith components of the vectors x = [x_1, x_2, ..., x_n] and
y = [y_1, y_2, ..., y_n], respectively. Euclidean distance measures the direct distance
between two points in the space R^n. For applications in text and document clustering and
classification, the cosine similarity measure [Sal89] is often used. We can convert this
to a distance measure as follows:

    d_cos(x, y) = 1 - (x · y) / (||x|| ||y||)

Here · indicates the dot product operation and ||…|| indicates the magnitude (length)
of a vector. If we take each document to be a point in R^n formed by the values
assigned to it under the vector model, each value corresponding to a dimension, then
the direction of the vector formed from the origin to the point representing the
document indicates the content of the document. Under the Euclidean distance,
documents with large differences in size have a large distance between them,
regardless of whether or not the content is similar, because the length of the vectors
differs. The cosine distance is length invariant, meaning only the direction of the
vectors is compared; the magnitude of the vectors is ignored.
Another popular distance measure for determining document similarity is the
extended Jaccard similarity [Sal89], which is converted to a distance measure as
follows:

    d_Jac(x, y) = 1 - (x · y) / (||x||^2 + ||y||^2 - x · y)
Jaccard distance has properties of both the Euclidean and cosine distance measures.
At high distance values, Jaccard behaves more like the cosine measure; at lower
values, it behaves more like Euclidean.
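As a concrete sketch, the three measures can be written in a few lines of Python over plain-list vectors (the function names are ours, not from any particular library):

```python
import math

def euclidean(x, y):
    # Direct distance between two points in R^n.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the vectors;
    # length-invariant: only the direction of the vectors matters.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def jaccard_distance(x, y):
    # Extended Jaccard similarity, converted to a distance.
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return 1.0 - dot / (sq_x + sq_y - dot)

doc_a = [2, 0, 1]   # toy term-frequency vectors
doc_b = [4, 0, 2]   # same direction as doc_a, but twice the length
print(euclidean(doc_a, doc_b))        # positive: the vector lengths differ
print(cosine_distance(doc_a, doc_b))  # ~0.0: identical direction
```

The example illustrates the length-invariance point made above: two documents with proportional term frequencies are far apart under Euclidean distance but at distance (approximately) zero under the cosine measure.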
4. WEB MINING
With the recent explosive growth of the amount of content on the Internet, it has
become increasingly difficult for users to find and utilize information and for content
providers to classify and catalog documents. Traditional web search engines often
return hundreds or thousands of results for a search, which is time consuming for
users to browse. On-line libraries, search engines, and other large document
repositories (e.g. customer support databases, product specification databases, press
release archives, news story archives, etc.) are growing so rapidly that it is difficult
and costly to categorize every document manually. In order to deal with these
problems, researchers look toward automated methods of working with web
documents so that they can be more easily browsed, organized, and cataloged with
minimal human intervention.
In contrast to the highly structured tabular data upon which most machine learning
methods are expected to operate, web and text documents are semi-structured. Web
documents have well-defined structures such as letters, words, sentences, paragraphs,
sections, punctuation marks, HTML tags, and so forth. We know that words make up
sentences, sentences make up paragraphs, and so on, but many of the rules governing
the order in which the various elements are allowed to appear are vague or ill-defined
and can vary dramatically between documents. It is estimated that as much as 85% of
all digital business information, most of it web-related, is stored in non-structured
formats (i.e. not in the tabular formats that are used in databases and
spreadsheets) [pet]. Developing improved methods of performing machine learning
techniques on this vast amount of non-tabular, semi-structured web data is therefore
highly desirable.
Clustering and classification have been useful and active areas of machine learning
research that promise to help us cope with the problem of information overload on the
Internet. With clustering the goal is to separate a given group of data items (the data
set) into groups called clusters such that items in the same cluster are similar to each
other and dissimilar to the items in other clusters. In clustering methods no labeled
examples are provided in advance for training (this is called unsupervised learning).
Under classification we attempt to assign a data item to a predefined category based
on a model that is created from pre-classified training data (supervised learning). In
more general terms, both clustering and classification come under the area of
knowledge discovery in databases or data mining. Applying data mining techniques to
web page content is referred to as web content mining which is a new sub-area of web
mining, partially built upon the established field of information retrieval.
When representing text and web document content for clustering and classification, a
vector-space model is typically used. In this model, each possible term that can appear
in a document becomes a feature dimension [Sal89]. The value assigned to each
dimension of a document may indicate the number of times the corresponding term
appears in it, or it may be a weight that takes into account other frequency
information, such as the number of documents in which the term appears. This
model is simple and allows the use of traditional machine learning methods that deal
with numerical feature vectors in a Euclidean feature space. However, it discards
information such as the order in which the terms appear, where in the document the
terms appear, how close the terms are to each other, and so forth. By keeping this kind
of structural information we could possibly improve the performance of various
machine learning algorithms. The problem is that traditional data mining methods are
often restricted to working on purely numeric feature vectors due to the need to
compute distances between data items or to calculate some representative of a cluster
of items (i.e. a centroid or center of a cluster), both of which are easily accomplished
in a Euclidean space. Thus either the original data needs to be converted to a vector of
numeric values by discarding possibly useful structural information (which is what we
are doing when using the vector model to represent documents) or we need to develop
new, customized methodologies for the specific representation.
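A minimal sketch of building such term-frequency vectors under the vector model (whitespace tokenization only, no weighting; the function name is ours):

```python
def to_vectors(docs):
    # Build the vocabulary: each distinct term becomes a feature dimension.
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for t in d.lower().split():
            v[index[t]] += 1   # value = number of times the term appears
        vectors.append(v)
    return vocab, vectors

vocab, vecs = to_vectors(["web mining mines the web", "data mining"])
# note how the order and positions of the terms are discarded,
# exactly the structural information the text says the model loses
```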
Graphs are important and effective mathematical constructs for modeling
relationships and structural information. Graphs (and their more restrictive form,
trees) are used in many different problems, including sorting, compression, traffic
flow analysis, resource allocation, etc. [CLR97]. In addition to problems where the
graph itself is processed by some algorithm (e.g. sorting by the depth first method or
finding the minimum spanning tree) it would be extremely desirable in many
applications, including those related to machine learning, to model data as graphs
since these graphs can retain more information than sets or vectors of simple atomic
features. Thus much research has been performed in the area of graph similarity in
order to exploit the additional information allowed by graph representations by
introducing mathematical frameworks for dealing with graphs. Some application
domains where graph similarity techniques have been applied include face
[WFKvdM97] and fingerprint [WJH] recognition, as well as anomaly detection in
communication networks [DBD+01]. In the literature, the work comes under several
different topic names including graph distance, (exact) graph matching, inexact graph
matching, error-tolerant graph matching, or error-correcting graph matching. In exact
graph matching we are attempting to determine if two graphs are identical. Inexact
graph matching implies we are attempting not to find a perfect matching, but rather a
"best" or "closest" matching. Error-tolerant and error-correcting are special cases of
inexact matching where the imperfections (e.g. missing nodes) in one of the graphs,
called the data graph, are assumed to be the result of some errors (e.g. from
transmission or recognition). We attempt to match the data graph to the most similar
model graph in our database. Graph distance is a numeric measure of dissimilarity
between graphs, with larger distances implying more dissimilarity. By graph
similarity, we mean we are interested in some measurement that tells us how similar
graphs are regardless if there is an exact matching between them.
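As one illustration of a graph distance, a measure based on the maximum common subgraph can be sketched. The sketch below assumes uniquely labeled nodes, so the maximum common subgraph reduces to intersecting node and edge sets (in the general case, finding it is NP-hard); the dict-of-sets representation and function name are ours:

```python
def mcs_distance(g1, g2):
    """Graph distance based on a (simplified) maximum common subgraph.

    g1, g2: graphs as dicts mapping each node label to the set of node
    labels it links to. With uniquely labeled nodes, the maximum common
    subgraph is just the intersection of the node and edge sets.
    """
    nodes = set(g1) & set(g2)
    edges = {(u, v) for u in nodes for v in (g1[u] & g2[u]) if v in nodes}
    mcs_size = len(nodes) + len(edges)          # |mcs(G1, G2)|
    size1 = len(g1) + sum(len(v) for v in g1.values())
    size2 = len(g2) + sum(len(v) for v in g2.values())
    # distance 0 for identical graphs, approaching 1 for disjoint ones
    return 1.0 - mcs_size / max(size1, size2)

g1 = {'A': {'B'}, 'B': set()}
g2 = {'A': {'B'}, 'B': {'A'}}
print(mcs_distance(g1, g2))   # larger distance implies more dissimilarity
```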
The web as we all know is the SINGLE largest source of data available. Web mining
aims to extract and mine useful knowledge from the web. It is used to understand the
customer behavior, evaluate the effectiveness of a website and also to help quantify
the success of a marketing campaign. Web mining is the integration of information
gathered by traditional data mining methodologies and techniques with information
gathered over the World Wide Web.
In traditional data mining, the data is structured, with well-defined tables, columns,
rows, keys, and constraints. The data in web mining is dynamic and rich in features
and patterns. Web mining involves analysis of the web server logs of a website,
whereas data mining involves using techniques to find relationships in large amounts
of data. Web mining often needs to react to evolving usage patterns in real time, e.g.
in merchandizing.
Data mining is also called knowledge discovery and data mining (KDD). It is the
extraction of useful patterns from data sources, e.g. databases, texts, web, images, etc.
Patterns must be valid, novel, potentially useful, and understandable. Classic data
mining tasks include:
Classification: mining patterns that can classify future (new) data into known
classes.
Association rule mining: mining any rule of the form X → Y, where X and Y
are sets of data items. E.g., Cheese, Milk → Bread [sup = 5%, conf = 80%]
Clustering: identifying a set of similarity groups in the data
Sequential pattern mining: a sequential rule A → B says that event A will be
immediately followed by event B with a certain confidence
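Support and confidence for an association rule like the cheese-and-milk example above can be computed directly; a small sketch over toy market-basket transactions (item names and function name are illustrative):

```python
def rule_stats(transactions, x, y):
    # support(X → Y)    = fraction of transactions containing X ∪ Y
    # confidence(X → Y) = support(X ∪ Y) / support(X)
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

baskets = [["cheese", "milk", "bread"],
           ["cheese", "milk"],
           ["milk", "bread"],
           ["cheese", "milk", "bread"]]
sup, conf = rule_stats(baskets, ["cheese", "milk"], ["bread"])
# rule {cheese, milk} → {bread}: support = 2/4, confidence = 2/3
```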
The web is a collection of inter-related files on one or more web servers. Web mining
is a multi-disciplinary effort that draws techniques from fields like information retrieval,
statistics, machine learning, natural language processing, and others. Web mining has
a new character compared with traditional data mining. First, the objects of web
mining are a large number of web documents which are heterogeneously distributed,
and each data source is itself heterogeneous; second, the web document itself is semi-
structured or unstructured and lacks semantics that the machine can understand.
4.2. HISTORY
The term “Web Mining” was first used in [E1996], where it was defined in a ‘task
oriented’ manner. An alternate ‘data oriented’ definition was given in [CMS1997].
The first panel discussion on the topic took place at ICTAI 1997 [SM1997], and it
remains a continuing forum.
WebKDD workshops with ACM SIGKDD, 1999, 2000, 2001, 2002, … ; 60 –
90 attendees
SIAM Web analytics workshop 2001, 2002, …
Special issues of DMKD journal, SIGKDD Explorations
Papers in various data mining conferences & journals
Surveys [MBNL 1999, BL 1999, KB2000]
1. Today World Wide Web is flooded with billions of static and dynamic web
pages created with programming languages such as HTML, PHP and ASP. It
is a significant challenge to search for useful and relevant information on the web.
2. Creating knowledge from available information.
3. As the coverage of information is very wide and diverse, personalization of
the information is a tedious process.
4. Learning customer and individual user patterns.
5. Complexity of Web pages far exceeds the complexity of any conventional text
document. Web pages on the internet lack uniformity and standardization.
6. Much of the information present on web is redundant, as the same piece of
information or its variant appears in many pages.
7. The web is noisy i.e. a page typically contains a mixture of many kinds of
information like, main content, advertisements, copyright notice, navigation
panels.
8. The web is dynamic, information keeps on changing constantly. Keeping up
with the changes and monitoring them are very important.
9. The Web is not only about disseminating information but also about services.
Many Web sites and pages enable people to perform operations with input
parameters, i.e., they provide services.
10. The most important challenge faced is invasion of privacy. Privacy is
considered lost when information concerning an individual is obtained, used,
or disseminated without that individual's knowledge or consent.
Web Content mining is the discovery of useful information from web contents / data /
documents. Web data contents: text, image, audio, video, metadata and hyperlinks.
Web content mining is an automatic process that goes beyond keyword extraction.
Since the content of a text document presents no machine-readable semantics, some
approaches have suggested restructuring the document content into a representation
that could be exploited by machines. The usual approach to exploiting known structure
in documents is to use wrappers to map documents to some data model. Techniques
using lexicons for content interpretation are yet to come. There are two groups of web
content mining strategies: Those that directly mine the content of documents and
those that improve on the content search of other tools like search engines.
Web Content Mining deals with discovering useful information or knowledge from
web page contents. Web content mining analyzes the content of Web resources.
Content data is the collection of facts that are contained in a web page. It consists of
unstructured data such as free text, images, audio, and video; semi-structured data
such as HTML documents; and more structured data such as data in tables or
database-generated HTML pages. The primary Web resources that are mined in Web content
mining are individual pages. They can be used to group, categorize, analyze, and
retrieve documents. There are several issues in Web Content Mining:
Developing intelligent tools for IR
Finding keywords and key phrases
Discovering grammatical rules and collocations
Hypertext classification/categorization
Extracting key phrases from text documents
Learning extraction models/rules
Hierarchical clustering
Predicting word relationships
Developing Web query systems such as WebOQL and XML-QL
Mining multimedia data, e.g. mining satellite images (Fayyad, et al. 1996) and
mining images to identify small volcanoes on Venus (Smyth, et al. 1996)
World Wide Web can reveal more information than just the information contained in
documents. For example, links pointing to a document indicate the popularity of the
document, while links coming out of a document indicate the richness or perhaps the
variety of topics covered in the document. This can be compared to bibliographical
citations. When a paper is cited often, it ought to be important. The PageRank and
CLEVER methods take advantage of this information conveyed by the links to find
pertinent web pages. By means of counters, higher levels accumulate the number of
artifacts subsumed by the concepts they hold. Counters of hyperlinks into and out of
documents retrace the structure of the web artifacts summarized.
Web structure mining is the process of discovering structure information from the
web. The structure of a typical web graph consists of web pages as nodes, and
hyperlinks as edges connecting related pages. This can be further divided into two
kinds based on the kind of structure information used.
Hyperlinks
A hyperlink is a structural unit that connects a location in a web page to a different
location, either within the same web page or on a different web page. A hyperlink that
connects to a different part of the same page is called an Intra-document hyperlink,
and a hyperlink that connects two different pages is called an inter-document
hyperlink.
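The intra- versus inter-document distinction can be checked mechanically by resolving each href against the page it appears on; a sketch using Python's standard `urllib.parse` (the URLs are illustrative):

```python
from urllib.parse import urljoin, urldefrag

def classify_link(page_url, href):
    # Resolve the href against the page it appears on, then compare the
    # fragment-free URLs: same document -> intra, different -> inter.
    target = urljoin(page_url, href)
    page, _ = urldefrag(page_url)
    doc, _ = urldefrag(target)
    return "intra-document" if doc == page else "inter-document"

print(classify_link("http://example.org/a.html", "#section2"))  # intra-document
print(classify_link("http://example.org/a.html", "b.html"))     # inter-document
```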
Document Structure
In addition, the content within a Web page can also be organized in a tree structured
format, based on the various HTML and XML tags within the page. Mining efforts
here have focused on automatically extracting document object model (DOM)
structures out of documents.
Web structure mining focuses on the hyperlink structure within the Web itself. The
different objects are linked in some way. Simply applying the traditional processes
and assuming that the events are independent can lead to wrong conclusions.
However, the appropriate handling of the links could lead to potential correlations,
and then improve the predictive accuracy of the learned models.
Two algorithms that have been proposed to deal with these potential correlations are:
1. HITS and
2. Page Rank.
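The PageRank computation can be sketched as a simple power iteration over the link graph; this is an illustrative implementation (function and page names are ours), not the production algorithm:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank sketch. `graph` maps each page to the list of
    pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if outlinks:
                share = rank[p] / len(outlinks)   # split rank over outlinks
                for q in outlinks:
                    new[q] += damping * share
            else:
                # dangling page: distribute its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
# C receives links from both A and B, so it ends up with the highest rank
```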
6.2.2. HITS
Hyperlink-induced topic search (HITS) is an iterative algorithm for mining the Web
graph to identify topic hubs and authorities. Authorities are pages with good
sources of content that are referred to by many other pages or highly ranked pages for
a given topic; hubs are pages with good sources of links. The algorithm takes as input
search results returned by traditional text indexing techniques, and filters these results
to identify hubs and authorities. The number and weight of hubs pointing to a page
determine the page's authority. The algorithm assigns weight to a hub based on the
authoritativeness of the pages it points to. If many good hubs point to a page p, then
authority of that page p increases. Similarly if a page p points to many good
authorities, then hub of page p increases.
After the computation, HITS outputs the pages with the largest hub weight and the
pages with the largest authority weights, which is the search result of a given topic.
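A minimal sketch of the iterative hub/authority computation described above (unoptimized; the page names are illustrative):

```python
def hits(graph, iterations=50):
    """Iterative HITS sketch. `graph` maps each page to the list of
    pages it links to."""
    pages = set(graph) | {q for links in graph.values() for q in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p = total hub weight of the pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub of p = total authority weight of the pages p points to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# toy query graph: H1, H2 are candidate hubs; A, B are candidate authorities
hub, auth = hits({"H1": ["A"], "H2": ["A", "B"], "A": [], "B": []})
# A collects hub weight from both H1 and H2, so it gets the top authority
# score; H2 points to both authorities, so it becomes the top hub
```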
Web usage mining is a process of extracting useful information from server logs, i.e.
users' history. Web usage mining is the process of finding out what users are looking
for on the Internet.
Web usage mining focuses on techniques that could predict the behavior of users
while they are interacting with the WWW. It collects the data from Web log records
to discover user access patterns of Web pages. Usage data captures the identity or
origin of web users along with their browsing behavior at a web site.
Web servers record and accumulate data about user interactions whenever requests for
resources are received. Analyzing the web access logs of different web sites can help
understand the user behavior and the web structure, thereby improving the design of
this colossal collection of resources. There are two main tendencies in Web Usage
Mining driven by the applications of the discoveries: General Access Pattern Tracking
and Customized Usage Tracking. The general access pattern tracking analyzes the
web logs to understand access patterns and trends. These analyses can shed light on
better structure and grouping of resource providers. Many web analysis tools existed
but they are limited and usually unsatisfactory. We have designed a web log data
mining tool, Web Log Miner, and proposed techniques for using data mining and
OnLine Analytical Processing (OLAP) on treated and transformed web access files.
Applying data mining techniques on access logs unveils interesting access patterns
that can be used to restructure sites in a more efficient grouping, pinpoint effective
advertising locations, and target specific users for specific selling ads.
Customized usage tracking analyzes individual trends. Its purpose is to customize
web sites to users. The information displayed, the depth of the site structure and the
format of the resources can all be dynamically customized for each user over time
based on their access patterns.
While it is encouraging and exciting to see the various potential applications of web
log file analysis, it is important to know that the success of such applications depends
on what and how much valid and reliable knowledge one can discover from the large
raw log data. Current web servers store limited information about the accesses. Some
scripts custom-tailored for some sites may store additional information. However, for
an effective web usage mining, an important cleaning and data transformation step
before analysis may be needed.
In the use and mining of web data, the most direct source of data is the web log files
on the web server. Web log files record the visitor's browsing behavior very
clearly. Web log files include the server log, agent log, and client log (IP address,
URL, page reference, access time, cookies, etc.).
Web servers have the ability to log all requests. The most commonly used web server
log format is the Common Log Format (CLF); the newer Extended Log Format allows
configuration of the log file. Design of a Web Log Miner: the web log is filtered to
generate a relational database; a data cube is generated from the database; OLAP is
used to drill-down and roll-up in the cube; and OLAM is used for mining interesting
knowledge.
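A CLF line can be parsed with a regular expression; the sketch below (the field names are ours) handles the standard host, ident, authuser, date, request, status, and bytes layout:

```python
import re

# One Common Log Format (CLF) line:
#   host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_clf(line):
    # Returns a dict of named fields, or None for malformed lines.
    m = CLF.match(line)
    return m.groupdict() if m else None

entry = parse_clf('127.0.0.1 - alice [10/Oct/2000:13:55:36 -0700] '
                  '"GET /index.html HTTP/1.0" 200 2326')
```

Filtering such parsed records into a relational table is the first step of the Web Log Miner design sketched above.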
There are several available research projects and commercial products that analyze
those patterns for different purposes. The applications generated from this analysis
can be classified as personalization, system improvement, site modification, business
intelligence and usage characterization.
The Web Usage Mining can be decomposed into the following three main sub tasks:
6.3.1. Pre-processing
It is necessary to perform data preparation to convert the raw data for further
processing. The data actually collected is generally incomplete, redundant, and
ambiguous. In order to mine the knowledge more effectively, pre-processing the
collected data is essential. Preprocessing can provide accurate, concise data for data
mining. Data preprocessing includes data cleaning, user identification, user session
identification, access path supplement, and transaction identification.
The main task of data cleaning is to remove the redundant web log data which
is not associated with the useful data, narrowing the scope of data objects.
Identifying the individual user must be done after data cleaning. The purpose of
user identification is to identify each user uniquely. It can be accomplished by
means of cookie technology, user registration techniques, and investigative
rules.
User session identification should be done on the basis of user
identification. The purpose is to divide each user's access information into
several separate session processes. The simplest way is to use a time-out
estimation approach: when the time interval between page requests
exceeds a given value, the user is taken to have started a new session.
Because of the widespread use of page caching technology and proxy
servers, the access path recorded by the web server access logs may not be the
complete access path of the user. An incomplete access log does not accurately
reflect the user's access patterns, so it is necessary to supplement the access
path. Path supplement can be achieved by using the website topology to
analyze the pages.
Transaction identification is based on the user's session recognition; its
purpose is to divide or combine transactions according to the demands of the
data mining task, in order to make them appropriate for data mining
analysis.
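The time-out estimation approach above can be sketched as follows, assuming the log has already been cleaned into (user, timestamp, page) tuples sorted by time (a simplifying assumption) and using a common 30-minute threshold:

```python
def sessionize(requests, timeout=1800):
    """Time-out based session identification (sketch).

    requests: list of (user, timestamp_seconds, page), sorted by time.
    A gap larger than `timeout` (default 30 minutes) between two requests
    of the same user starts a new session.
    """
    last_seen = {}    # user -> timestamp of that user's previous request
    session_of = {}   # user -> id of that user's current session
    sessions = {}     # session id -> ordered list of pages
    next_id = 0
    for user, ts, page in requests:
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_of[user] = next_id    # start a new session
            sessions[next_id] = []
            next_id += 1
        sessions[session_of[user]].append(page)
        last_seen[user] = ts
    return sessions

log = [("u1", 0, "/a"), ("u1", 100, "/b"), ("u1", 5000, "/c")]
print(sessionize(log))   # {0: ['/a', '/b'], 1: ['/c']}
```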
7. WEB CRAWLER
A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner. Other terms for web crawlers are ants, automatic
indexers, bots, worms, web spiders, and web robots. Search engines use spidering as
a means of providing up-to-date data. Crawlers can also be used for automating
maintenance tasks on a Web site, such as checking links or validating HTML code.
Crawlers can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses (usually for spam), eg. anita at cs dot sunysb dot edu ;
mueller{remove this}@cs.sunysb.edu. A Web crawler is one type of bot, or software
agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler
visits these URLs, it identifies all the hyperlinks in the page and adds them to the list
of URLs to visit, called the crawl frontier.
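The seed/frontier loop can be sketched as follows; `fetch_links` is a hypothetical stand-in for downloading a page and extracting its hyperlinks, so the example runs on a toy in-memory web (a real crawler would fetch and parse HTML, honor robots.txt, and throttle its requests):

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Frontier-based crawl sketch. `fetch_links(url)` returns the
    hyperlink URLs found on a page (an assumed callback, not a real
    library call)."""
    frontier = deque(seeds)   # URLs still to visit: the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)   # grow the frontier
    return visited

toy_web = {"/seed": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": []}
print(sorted(crawl(["/seed"], lambda u: toy_web.get(u, []))))
# ['/a', '/b', '/c', '/seed']
```

Using a queue (FIFO) for the frontier gives a breadth-first crawl; swapping in a priority queue would allow "best pages first" strategies.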
So-called personalized service means that when a user browses a website, the site
meets each user's browsing interests as far as possible and constantly adjusts to adapt
as those interests change, so that each user feels he or she is a unique user of the
website.
In order to achieve personalized service, the site first has to obtain and collect
information on clients to grasp customers' spending habits, hobbies, and consumer
psychology, and can then provide targeted personalized service. Obtaining consumer
spending behavior patterns is very difficult with the traditional marketing approach,
but it can be done using web mining techniques.
Web mining can predict trend within the retrieved information to indicate future
values. For example, an electronic auction company provides information about items
to auction, previous auction details, etc. Predictive modeling can be utilized to
analyze the existing information, and to estimate the values for auctioneer items or
number of people participating in future auctions.
The predicting capability of the mining application can also benefit society by
identifying criminal activities.
9.1. PROS
Web mining essentially has many advantages which make this technology attractive
to corporations, including government agencies. This technology has enabled
ecommerce to do personalized marketing, which eventually results in higher trade
volumes. Government agencies are using this technology to classify threats and
fight against terrorism. The predicting capability of the mining application can benefit
the society by identifying criminal activities. The companies can establish better
customer relationship by giving them exactly what they need. Companies can
understand the needs of the customer better and they can react to customer needs
faster. The companies can find, attract and retain customers; they can save on
production costs by utilizing the acquired insight of customer requirements. They can
increase profitability by target pricing based on the profiles created. They can even
identify customers who might defect to a competitor; the company can then try to
retain such a customer by providing promotional offers, thus reducing the risk of
losing the customer.
9.2. CONS
Web mining in itself does not create issues, but when the technology is applied to data of a personal nature it can raise concerns. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without the individual's knowledge or consent. The collected data are analyzed and clustered to form profiles; the data are made anonymous before clustering so that no personal profiles result. These applications therefore de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of on their own individual characteristics and merits.
Another important concern is that companies collecting data for a specific purpose might use the data for an entirely different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade the personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of privacy invasion. Companies that buy the data are obliged to make it anonymous, and they are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of a release, and any inaccuracies in it can result in serious lawsuits; however, there is no law preventing them from trading the data.
Some mining algorithms might use controversial attributes such as sex, race, religion, or sexual orientation to categorize individuals; such practices may violate anti-discrimination legislation. The applications make it hard to detect the use of such attributes, and there is no strong rule against algorithms that employ them. The result could be the denial of a service or privilege to an individual based on race, religion, or sexual orientation; at present this situation is avoided only by the high ethical standards maintained by the data mining company. The collected data are made anonymous so that neither the data nor the extracted patterns can be traced back to an individual. Although this might appear to pose no threat to privacy, much additional information can in fact be inferred by combining separate pieces of data about the same user.
With the rapid growth of the World Wide Web, web mining has become a very active and popular research topic, and it now plays an important role in helping e-commerce websites and e-services understand how their sites and services are used and in providing better services for their customers and users.
Recently, many researchers have focused on developing new web mining techniques and algorithms that have not yet been applied in real application environments. It may therefore be an important time to shift the research focus to application areas such as e-commerce and e-services.
The input consists of web pages and web server log files. A web robot (webbot) is used to retrieve the pages of the website. The webbot is a very fast web walker with support for regular expressions, SQL logging facilities, and many other features; it can be used to check links, find bad HTML, map out a website, download images, etc. In parallel, the web server log files are downloaded and processed through a sessionizer, and a LOGML file is generated.
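The sessionizing step described above can be sketched in Python. This is a minimal illustration, not the tool's actual code: it assumes pre-parsed `(ip, timestamp, url)` triples and a conventional 30-minute inactivity timeout, and it groups requests from the same client into sessions.

```python
# Hypothetical sessionizer sketch: group (ip, timestamp, url) log entries
# into per-user sessions, starting a new session after 30 minutes of
# inactivity. Field names, data, and the timeout are illustrative.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(entries):
    """Return {ip: [session, ...]}, each session a list of URLs in order."""
    sessions = {}   # ip -> list of sessions
    last_seen = {}  # ip -> timestamp of the most recent request
    for ip, ts, url in sorted(entries, key=lambda e: e[1]):
        if ip not in last_seen or ts - last_seen[ip] > TIMEOUT:
            sessions.setdefault(ip, []).append([])  # start a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = ts
    return sessions

log = [
    ("10.0.0.1", datetime(2009, 3, 1, 10, 0), "/index.html"),
    ("10.0.0.1", datetime(2009, 3, 1, 10, 5), "/courses.html"),
    ("10.0.0.1", datetime(2009, 3, 1, 11, 0), "/index.html"),  # > 30 min gap
]
print(sessionize(log)["10.0.0.1"])
# → [['/index.html', '/courses.html'], ['/index.html']]
```

In a real pipeline the sessions would then be serialized into the LOGML file rather than printed.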
The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming, and integrating data, loading it into a database, and later generating graphs in XGMML. User sessions are extracted from the web logs, yielding results roughly attributable to specific users. The sessions are then converted into a special format for sequence mining with cSPADE (constrained SPADE: Sequential PAttern Discovery using Equivalence classes). The output is the set of frequent contiguous sequences with a given minimum support. These are imported into a database, and non-maximal frequent sequences are removed. Different queries are then executed against this data according to criteria such as the support of each pattern or the length of the patterns, and different URLs that correspond to the same web page are unified in the final results. The Visualization Stage maps the extracted data and attributes into visual images, realized through VTK extended with support for graphs. The result is interactive 3D/2D visualizations that analysts can use to compare actual web surfing patterns with expected patterns.
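The pruning of non-maximal sequences mentioned in the pipeline can be sketched as follows. This is an illustrative reimplementation, not the project's database queries: a frequent contiguous sequence is dropped when it occurs as a contiguous subsequence of some longer frequent sequence.

```python
# Illustrative pruning of non-maximal frequent contiguous sequences.
# The pattern data is invented; sequences are tuples of page identifiers.
def is_contiguous_subseq(small, big):
    """True when `small` appears as a contiguous run inside a longer `big`."""
    n = len(small)
    return small != big and any(
        big[i:i + n] == small for i in range(len(big) - n + 1))

def prune_non_maximal(frequent):
    """Keep only sequences not contained in any longer frequent sequence."""
    return [s for s in frequent
            if not any(is_contiguous_subseq(s, t) for t in frequent)]

patterns = [("A",), ("A", "B"), ("A", "B", "C"), ("D",)]
print(prune_non_maximal(patterns))
# → [('A', 'B', 'C'), ('D',)]
```

The quadratic scan is fine for illustration; with many patterns one would index sequences by their items first.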
One can see user access paths scattering from the first page of the website (the node in the center) to clusters of web pages corresponding to faculty pages, course home pages, etc. The Strahler number, a numerical measure of branching complexity, is used to assign colors to the edges.
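The Strahler measure used for edge coloring can be computed with a short recursion. This is a minimal sketch on a toy site tree, assuming the access paths form a tree rooted at the home page; the site map is invented.

```python
# Minimal Strahler-number computation on a tree of pages. A leaf scores 1;
# an internal node scores its highest child score, plus one when the two
# highest child scores are equal. The toy site map below is illustrative.
def strahler(tree, node):
    """tree maps each node to a list of its children."""
    children = tree.get(node, [])
    if not children:
        return 1
    scores = sorted((strahler(tree, c) for c in children), reverse=True)
    if len(scores) > 1 and scores[0] == scores[1]:
        return scores[0] + 1
    return scores[0]

site = {"/": ["/a", "/b"], "/a": ["/a1", "/a2"], "/b": []}
print(strahler(site, "/"))   # → 2
```

Higher numbers mark the more heavily branching parts of the access tree, which is why the measure works well for choosing edge colors.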
Adding a third dimension enables the visualization of more information and clarifies user behavior within and between clusters. The center node of the circular base is the first page of the website, from which users scatter to the different clusters of web pages. A color spectrum from red (entry points into clusters) to blue (exit points) clarifies the behavior of users. The cylinder-like part of the figure visualizes the web usage of surfers as they browse a long HTML document.
Frequent access patterns extracted by the web mining process are visualized as a white graph on top of an embedded, colored graph of web usage.
Similar to the previous picture, with the addition of another attribute: the frequency of each pattern, rendered as the thickness of the white tubes. This helps in the analysis of the results.
Superimposition of web usage on top of web structure with a higher-order layout. The top node is the first page of the website. The hierarchical layout makes analysis easier.
Figure(xxi). Higher Order layout for clear visualization and easier analysis
Using these visualizations, a web analyst can easily identify which parts of the website are cold, with few hits, and which are hot, with many hits, and classify them accordingly. This also paves the way for exploratory changes to the website, for example adding links from hot parts of the site to cold parts and then extracting, visualizing, and interpreting the changes in access patterns.
11. CONCLUSION
As the Web and its usage continue to grow, so does the opportunity to analyze Web data and extract all manner of useful knowledge from it. The past few years have seen the emergence of Web mining as a rapidly growing area, due to the efforts of the research community as well as the various organizations practicing it. The key component of web mining is the mining process itself. This research area is so large today because of the tremendous growth of information sources available on the Web and the recent interest in e-commerce. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign. Here we have described the key computer science contributions made in this field, including an overview of web mining, a taxonomy of web mining, the prominent successful applications, and some promising areas of future research.
12. REFERENCES