
WEB MINING

Data Mining vs. Web Mining


• Traditional data mining
– data is structured and relational
– well-defined tables, columns, rows, keys, and
constraints.
• Web data
– Semi-structured and unstructured
– readily available data
– rich in features and patterns
Web Mining
• The term was coined by Oren Etzioni (1996)

• Application of data mining techniques to automatically
discover and extract information from Web data
Web Mining
• Web is the single largest data source in the
world
• Due to heterogeneity and lack of structure of
web data, mining is a challenging task
• Multidisciplinary field:
– data mining, machine learning, natural language
processing, statistics, databases, information
retrieval, multimedia, etc.
Mining the World-Wide Web
• The WWW is huge, widely distributed, global
information service center for
– Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
• WWW provides rich sources for data mining

12/12/21
Web Mining: A more challenging task
• Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Limited customization to individual users
Why web mining is done
• Discovering new knowledge from the Web
• Personalized Web page synthesis
• Learning about individual users
Web Content Mining
 The Web contains many kinds of data:
 Government information
 Institutions' data
 Web applications
 Some Web content is hidden data, and some is generated dynamically
as a result of queries.
 The Web consists of several types of data, such as text, image, audio, video,
and metadata, as well as hyperlinks.
 Mining multiple types of data is called multimedia mining.
 The textual part of Web content data consists of unstructured data such as free
text, semi-structured data such as HTML documents, and more structured
data such as data in tables or database-generated HTML pages.
 Text mining techniques can be employed for Web content mining directly.
Web Structure Mining

Web structure mining is concerned with discovering the model underlying the link
structures of the Web.
It is used to study the topology of the hyperlinks, with or without the description of
the links.
This model can be used to categorize Web pages and is useful for generating
information such as the similarity and relationships between different websites.
Web structure mining can be used to discover authority sites for a subject, and
overview sites for that subject that point to many authorities.
Note that while Web content mining explores the structure within a document
(intra-document structure), Web structure mining studies the structure of
documents within the Web itself (inter-document structure).
 A collection of hyperlinked pages can be viewed as a directed graph G = (V, E).
 Out-degree: for a node p, the out-degree is the number of nodes to which it has links.
 In-degree: for a node p, the in-degree is the number of nodes that have links to it.
 The algorithms used for modeling web topology are
HITS
PageRank
CLEVER
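The directed-graph view and the degree definitions above can be sketched on a small example. The four-page graph and its adjacency list are made up for illustration:

```python
# A hypothetical four-page web graph as an adjacency list
# (page -> pages it links to; the page names are made up).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(page):
    # Number of pages that `page` links to.
    return len(links.get(page, []))

def in_degree(page):
    # Number of pages that link to `page`.
    return sum(page in targets for targets in links.values())

print(out_degree("A"), in_degree("C"))  # 2 3
```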
PageRank
The importance of a document is measured by counting citations, or backlinks, to
the given document.
A page can have a high PageRank if there are many pages that point to it, or if
some pages that point to it themselves have a high PageRank:

 PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

 where T1, ..., Tn are the pages that link to A, C(Ti) is the number of outgoing
 links of Ti, and d is the damping factor.

 PageRank can be calculated using an iterative algorithm, and corresponds to
the principal eigenvector of the normalized link matrix of the Web.
 PageRank forms a sort of probability distribution over the Web pages.
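The iterative calculation can be sketched in a few lines. The three-page graph is made up for illustration, and the update follows the classic PR(p) = (1 - d) + d · Σ PR(q)/C(q) formulation:

```python
def pagerank(links, d=0.85, iters=50):
    # PR(p) = (1 - d) + d * sum of PR(q)/C(q) over pages q linking to p,
    # where C(q) is the number of outgoing links of q.
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        pr = {
            p: (1 - d) + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return pr

# Toy graph (made up): A -> B, C; B -> C; C -> A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# C receives links from both A and B, so it ends up ranked highest.
```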
Social Network
The social network model uses an exponentially varying damping factor.
Social network analysis studies ways to measure the relative standing, or
importance, of individuals in a network. The same process can be mapped to
study the link structure of Web pages.
The basic principle here is that if a web page points a link to another web page,
then the former is, in some sense, endorsing the importance of the latter.
Links in this network may have different weights, corresponding to the strength
of the endorsement.
The standing s_j of a node j can be defined as

 s_j = Σ_{k=1..∞} Σ_i b^k (A^k)_{ij}

where A is the adjacency matrix of the network and b < 1 is the damping factor,
so paths of length k are damped exponentially by b^k (Katz's measure of standing).
Transverse and Intrinsic Links
A link is a transverse link if it is between pages with different domain
names.
A link is an intrinsic link if it is between pages with the same domain
name.
Intrinsic links convey much less information than transverse links about the
importance of the pages they point to.
Kleinberg therefore proposes deleting all intrinsic links from the graph.
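HITS, listed earlier among the topology-modeling algorithms, operationalizes this endorsement idea: good hubs point to good authorities, and good authorities are pointed to by good hubs. A minimal power-iteration sketch, with a made-up toy graph:

```python
import math

def hits(links, iters=50):
    # Hubs point to good authorities; authorities are pointed to by good hubs.
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth

# Made-up graph: H1 and H2 act as hubs for authorities A1 and A2.
hub, auth = hits({"H1": ["A1", "A2"], "H2": ["A1", "A2"], "A2": ["A1"]})
# A1 is pointed to by every other page, so it gets the top authority score.
```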

Reference Nodes and Index Nodes

(Botafogo et al.'s proposal)
Index node: a node whose out-degree is significantly larger than the average
out-degree of the graph.
Reference node: a node whose in-degree is significantly larger than the average
in-degree of the graph.
Clustering and Determining Similar Pages

Bibliographic coupling: two documents are related if they cite the same documents;
the coupling strength is the number of references they have in common.

Co-citation: two documents are related if they are frequently cited together; the
strength is the number of documents that cite both.

The similarity measure between two sub-clusters Sx and Sy can then be defined
from these coupling or co-citation strengths across the document pairs of the two
sub-clusters.

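Both measures can be computed directly from a citation graph. The toy citation data below is made up for illustration:

```python
# Toy citation data (made up): paper -> set of papers it cites.
cites = {
    "p1": {"a", "b", "c"},
    "p2": {"b", "c", "d"},
    "p3": {"x"},
}

def bibliographic_coupling(p, q):
    # Number of common documents cited by both p and q.
    return len(cites[p] & cites[q])

def co_citation(a, b):
    # Number of documents that cite both a and b.
    return sum(1 for refs in cites.values() if a in refs and b in refs)

print(bibliographic_coupling("p1", "p2"))  # 2 (shared references b, c)
print(co_citation("b", "c"))               # 2 (cited together by p1, p2)
```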

 The secondary data includes data from Web server access
logs, proxy server logs, browser logs, user profiles, registration data, user
sessions or transactions, cookies, user queries, bookmark data, mouse
clicks and scrolls, and any other data that results from these
interactions.
Web Usage Mining
• Mining Web log records to discover user access
patterns of Web pages
• Applications
– Target potential customers for electronic commerce
– Enhance the quality and delivery of Internet information
services to the end user
– Improve Web server system performance
– Identify potential prime advertisement locations
• Web logs provide rich information about Web
dynamics
– Typical Web log entry includes the URL requested, the IP
address from which the request originated, and a
timestamp
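A typical entry like this can be parsed with a regular expression. The sketch below assumes the widely used Common Log Format; the sample log line is made up:

```python
import re

# Common Log Format fields: client IP, identity, user, [timestamp],
# "request line", status code, bytes sent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+'
)

def parse_entry(line):
    # Pull out the fields usage mining needs: IP, timestamp, URL, status.
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_entry(
    '10.0.0.1 - - [12/Dec/2021:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'
)
print(entry["ip"], entry["url"])  # 10.0.0.1 /index.html
```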
Techniques for Web usage mining
• Construct multidimensional view on the Weblog
database
– Perform multidimensional OLAP analysis to find the top N
users, top N accessed Web pages, most frequently accessed
time periods, etc.
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and trends of
Web accessing
– May need additional information, e.g., user browsing
sequences of the Web pages in the Web server buffer
• Conduct studies to
– Analyze system performance, improve system design by Web
caching, Web page prefetching, and Web page swapping
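The top-N style of analysis above can be sketched without a full OLAP engine. The simplified log records below are made up for illustration:

```python
from collections import Counter

# Simplified Weblog records (made up): (user_ip, url, hour-of-day).
log = [
    ("10.0.0.1", "/home", 9), ("10.0.0.1", "/buy", 10),
    ("10.0.0.2", "/home", 9), ("10.0.0.1", "/home", 21),
]

# Top-N users, top-N accessed pages, most frequently accessed time periods.
top_users = Counter(ip for ip, _, _ in log).most_common(1)
top_pages = Counter(url for _, url, _ in log).most_common(1)
top_hours = Counter(hour for _, _, hour in log).most_common(1)

print(top_users, top_pages, top_hours)
# [('10.0.0.1', 3)] [('/home', 3)] [(9, 2)]
```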
Mining the World-Wide Web
• Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
• Processing pipeline:
Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube
→ (3) OLAP → Sliced and diced cube → (4) Data Mining → Knowledge
Text Mining
 The amount of information available on the Web has increased rapidly
(the information-explosion era).
 Users demand useful and reliable information from the Web in the shortest time
possible.
 Text mining, therefore, corresponds to the extension of the data mining
approach to textual data and is concerned with various tasks, such as
extraction of information implicitly contained in collections of documents, or
similarity-based structuring.
 Text collections, in general, lack the imposed structure of a traditional database.
 They contain a vast range of information, but it is difficult to decipher
automatically.
 Data mining deals with the extraction of interesting information (or patterns) from
structured data.
 As text collections are unstructured, particularly free-running text, specific
techniques called text mining techniques have to be developed to aid
knowledge discovery.
To perform Text Mining
1. Impose a structure on the textual data and use any known data mining technique.
2. Develop a specific technique.

When approaching Text Mining, related subjects also have to be
considered:
 Information Retrieval
 Information Extraction
 Computational Linguistics

Information Retrieval

IR is the automatic retrieval of all relevant documents while retrieving as
few of the non-relevant ones as possible.
 Recent trends in IR include modeling, clustering, classification, data
visualization, user interfaces, filtering, etc., which are instances of Text Mining.

Information Extraction
 Information Extraction has the goal of transforming a collection of documents,
usually with the help of an IR system, into information that is more readily
digested and analyzed.
 IE extracts relevant facts from the documents, while IR selects relevant
documents.
 Most IE systems use machine learning or data mining techniques to learn the
extraction patterns or rules for documents semi-automatically or automatically.
 IE is a kind of preprocessing stage in the Text Mining process: the step
after IR and before DM techniques are performed.
 IE can also be used to improve the indexing process, which is part of the IR
process.
 IE is an instance of Text Mining, as the summary of a document given by IE is a
form of information that did not exist before.
Computational Linguistics
Unstructured Text
Unstructured documents are free texts, such as news stories.
Traditionally, most research uses a bag-of-words representation of unstructured
documents and extracts different features from it.

Features

WORD OCCURRENCES
STOP WORDS
Feature selection includes removing case, punctuation, infrequent words,
and stop words.

LATENT SEMANTIC INDEXING

STEMMING

N-GRAM
PART-OF-SPEECH(POS)

POSITIONAL COLLOCATIONS

HIGHER ORDER FEATURES

Once the features are extracted, the text is represented as structured data and
traditional data mining techniques can be used.
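Several of the features above can be extracted in a few lines. A minimal sketch, assuming a small hand-picked stop-word list (real systems use much larger ones) and bigrams as the n-gram feature:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far larger).
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def features(text):
    # Lowercase, strip punctuation, drop stop words, then collect
    # word occurrences and bigrams (2-grams) as features.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    bag = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return bag, bigrams

bag, bigrams = features("The mining of the Web is text mining.")
print(bag["mining"])                # 2
print(bigrams[("text", "mining")])  # 1
```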
Episode Rule Discovery for Texts
Here the text is considered as sequential data, consisting of a sequence of
pairs (feature vector, index),
where a feature vector is an ordered set of features and the index contains
information about the position of the word in the sequence.
A text episode is defined as a pair E = (V, ≤),
where V is a collection of feature vectors and ≤ is a partial order on V.
Given a text sequence S, a text episode E = (V, ≤) occurs within S if there is a way
of satisfying the feature vectors in V, using the feature vectors in S, so that the
partial order ≤ is respected,
i.e., the feature vectors of V can be found within S in an order that satisfies ≤.
 The support of E in S is defined as the number of minimal occurrences of E in
S.
 The episode discovery technique of sequence mining can be used to discover
frequent episodes in a text.
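A much-simplified version of episode support can be sketched as follows. This is an illustrative assumption, not the full minimal-occurrence definition: it counts fixed-width windows in which the episode's words appear in order (a serial episode), whereas the general definition allows an arbitrary partial order:

```python
def episode_support(sequence, episode, width):
    # Simplified serial-episode support: the number of windows of the
    # given width in which the episode's words occur in order.
    # (The general definition allows an arbitrary partial order and
    # counts minimal occurrences; this sketch uses a total order.)
    count = 0
    for start in range(len(sequence) - width + 1):
        window = iter(sequence[start:start + width])
        if all(word in window for word in episode):  # in-order subsequence test
            count += 1
    return count

text = "data mining finds patterns data mining on text".split()
print(episode_support(text, ["data", "mining"], width=3))  # 3
```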
Hierarchy of Categories
It is necessary to organize search-result documents into meaningful groups:
Same author
Same year
Same publisher
Subject matter

Most documents discuss several different topics simultaneously.

So a better way is to describe documents with a set of categories as well as
attributes.
Feldman et al. proposed a data structure called a concept hierarchy.
A concept hierarchy is a directed acyclic graph of concepts, where each
concept is identified by a unique name.
An arc from concept a to concept b denotes that a is a more general concept than b.
Each text document is tagged with a set of concepts that correspond to its
content.
Tagging a document with a concept implicitly entails tagging it with all the
ancestors of that concept in the hierarchy.
So the document should be tagged with the lowest concepts possible.
The method to automatically tag documents to the hierarchy is a top-down
approach.
An evaluation function determines whether a
document currently tagged to a node can also be
tagged to any of its child nodes.
If so, the tag is moved down the hierarchy till it
cannot be moved any further.
The result of this process is a hierarchy of
documents: at each node there is a set of
documents sharing the common concept associated
with that node.
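The top-down tagging process can be sketched as below. The hierarchy, the document sets, and the evaluation function are all made up for illustration; a real evaluation function would be statistical rather than a simple keyword match:

```python
# Hypothetical concept hierarchy: concept -> more specific child concepts.
children = {"science": ["physics", "biology"], "physics": [], "biology": []}

def matches(doc_words, concept):
    # Toy evaluation function: the concept name occurs in the document.
    # (A real system would use a trained classifier or keyword statistics.)
    return concept in doc_words

def tag(doc_words, concept="science"):
    # Move the tag down the hierarchy while some child also matches,
    # so the document ends up tagged with the lowest concept possible.
    for child in children.get(concept, []):
        if matches(doc_words, child):
            return tag(doc_words, child)
    return concept

print(tag({"quarks", "physics", "energy"}))  # physics
print(tag({"poetry"}))                       # science
```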

 Popescul et al. considered the related problem of tagging keywords to a set of
documents arranged in a hierarchy, with some probability.
 The method follows a two-phase principle.
 It starts with a bag of words at the leaf level and moves up the hierarchy.
 The set of keywords for a non-leaf node is obtained by combining the
keywords of all its child nodes.
 After finding the keywords for the root node, the process continues with a
top-down pass:
 if a keyword at any node is equally probable for all of its child nodes, then the
keyword stays associated with the parent; otherwise it is moved down to the
specific child node.
Text Clustering
One popular text clustering algorithm is Ward's minimum-variance method.
It is an agglomerative hierarchical clustering technique and tends to generate very
compact clusters.
We can take either the Euclidean metric or the Hamming distance as the measure
of dissimilarity between feature vectors.
The clustering method begins with n clusters, one for each text.
At each stage, two clusters are merged to generate a new cluster:
the clusters Ck and Cl are merged into a new cluster Ckl so as to minimize the
increase in within-cluster variance (Ward's criterion).
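One merge step of the method can be sketched as follows. The distance formula is a standard form of Ward's criterion and an assumption here, since the lecture's exact formula is not shown; the text vectors are made up:

```python
def ward_distance(cx, cy):
    # Increase in within-cluster variance from merging cx and cy:
    #   d(Cx, Cy) = |Cx||Cy| / (|Cx| + |Cy|) * ||mean(Cx) - mean(Cy)||^2
    # (a standard form of Ward's criterion, assumed here).
    nx, ny = len(cx), len(cy)
    mx = [sum(col) / nx for col in zip(*cx)]
    my = [sum(col) / ny for col in zip(*cy)]
    return nx * ny / (nx + ny) * sum((a - b) ** 2 for a, b in zip(mx, my))

# Begin with one cluster per text vector, then merge the closest pair.
clusters = [[(0.0, 0.0)], [(0.1, 0.0)], [(5.0, 5.0)]]
i, j = min(
    ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
    key=lambda p: ward_distance(clusters[p[0]], clusters[p[1]]),
)
clusters[i] += clusters[j]
del clusters[j]
print(len(clusters), clusters[0])  # 2 [(0.0, 0.0), (0.1, 0.0)]
```

Repeating this merge step n - 1 times yields the full agglomerative hierarchy.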
Scatter/Gather
It is a method of grouping the documents using clustering.
Scatter/Gather is so named because it allows the user to scatter documents into
clusters, or groups, then gather a subset of these groups and re-scatter them to form
new groups.
Each cluster in Scatter/Gather is represented by a list of topical terms, that is, a list
of words that attempts to give the user the gist of what the documents in the cluster
are about.
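Extracting such topical terms for a cluster can be sketched very simply. A minimal sketch, assuming raw term frequency as the ranking signal (real Scatter/Gather uses more refined cluster summaries); the document cluster is made up:

```python
from collections import Counter

def topical_terms(cluster_docs, k=3):
    # Represent a cluster by its k most frequent terms, giving the user
    # the gist of what the documents in the cluster are about.
    counts = Counter(w for doc in cluster_docs for w in doc.lower().split())
    return [term for term, _ in counts.most_common(k)]

# Made-up cluster of short documents.
cluster = ["web mining techniques", "mining web logs", "web usage mining"]
print(topical_terms(cluster, k=2))  # ['web', 'mining']
```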
