112/12/21 6
Web Mining: A more challenging task
• Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Limited customization to individual users
Why web mining is done
• Discovering new knowledge from the web
• Personalized web page synthesis
• Learning about individual users
The Web contains many kinds of data.
Government information
Institutional data
Web applications
Some web content is hidden, and some is generated dynamically as a result of
queries.
The Web consists of several types of data, such as text, image, audio, video and
metadata, as well as hyperlinks.
Mining multiple types of data is called multimedia mining.
The textual parts of web content data consist of unstructured data such as free
text, semi-structured data such as HTML documents, and more structured
data such as data in tables or database-generated HTML pages.
Text mining techniques can be applied directly to web content mining.
Web Structure Mining
Web structure mining is concerned with discovering the model underlying the link
structures of the web.
It is used to study the topology of the hyperlinks, with or without the
descriptions of the links.
This model can be used to categorize web pages and is useful to generate
information such as similarity and relationship between different web sites.
Web structure mining can be used to discover authority sites for a subject and
overview (hub) sites for that subject, which point to many authorities.
Note that while web content mining explores the structure within a document
(intra-document structure), web structure mining studies the structure of
documents within the web itself (inter-document structure).
A collection of hyperlinked pages can be viewed as a directed graph G=(V,E),
where V is the set of pages and E is the set of hyperlinks between them.
Out-degree: For a node p, the out-degree is the number of nodes to which it has links.
In-degree: For a node p, the in-degree is the number of nodes that have links to it.
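The two degree definitions above can be illustrated with a minimal sketch. The edge list and the page names A to D are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical toy web graph: each pair (p, q) means page p links to page q.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

out_links = defaultdict(set)  # p -> pages that p links to
in_links = defaultdict(set)   # q -> pages that link to q
for src, dst in edges:
    out_links[src].add(dst)
    in_links[dst].add(src)

nodes = sorted({page for edge in edges for page in edge})
out_degree = {page: len(out_links[page]) for page in nodes}
in_degree = {page: len(in_links[page]) for page in nodes}

print(out_degree)  # {'A': 2, 'B': 1, 'C': 1, 'D': 1}
print(in_degree)   # {'A': 1, 'B': 1, 'C': 3, 'D': 0}
```

Sets are used so that repeated links between the same two pages are counted once, matching the "number of nodes" wording of the definitions.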
The algorithms used for modeling web topology are:
HITS
PageRank
CLEVER
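As a rough illustration of the first algorithm in this list, the sketch below shows the usual iterative form of HITS. It scores each page as an authority (pointed to by good overview pages) and as a hub, or overview page (pointing to good authorities). The function name `hits`, the iteration count and the toy link data are assumptions made for this example, not something taken from the slides.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict mapping page -> list of linked pages."""
    nodes = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to the page.
        auth = {n: 0.0 for n in nodes}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages the page links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in nodes}
        # Normalise both score vectors to unit length so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical example: two overview pages pointing to three content sites.
links = {
    "overview1": ["siteA", "siteB"],
    "overview2": ["siteA", "siteB", "siteC"],
}
hub, auth = hits(links)
print(max(hub, key=hub.get))            # overview2
print(auth["siteA"] > auth["siteC"])    # True
```

Here siteA and siteB end up as the strongest authorities because both overview pages point to them, and overview2 is the strongest hub because it points to every authority.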
PageRank
The importance of a document is measured by counting the citations, or backlinks,
to that document.
A page has a high PageRank if many pages point to it, or if some of the pages
that point to it themselves have a high PageRank.
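The recursive idea described above is commonly computed by power iteration. The following is a minimal sketch under stated assumptions: the damping factor of 0.85, the even redistribution of rank from pages with no out-links, and the five-page toy graph are choices made for this example, not values given in the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a dict page -> rank for a dict mapping page -> list of linked pages."""
    nodes = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page receives a small base share, independent of its backlinks.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page without out-links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in nodes:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical graph: C has many backlinks; D has one backlink, from the highly ranked C.
links = {"A": ["C"], "B": ["C"], "E": ["C"], "C": ["D"], "D": ["A"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this toy graph C ranks highest because three pages point to it, while D ranks well above B and E even though it has a single backlink, because that backlink comes from C.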
Reference Node
Clustering and Determining similar pages
Bibliographic coupling
Co-citation
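The two similarity measures listed above have standard definitions: the bibliographic coupling of two pages is the number of pages they both link to, and their co-citation count is the number of pages that link to both of them. A minimal sketch follows; the page names and link sets are hypothetical.

```python
# Hypothetical link data: page -> set of pages it links to.
links = {
    "P1": {"X", "Y", "Z"},
    "P2": {"X", "Y"},
    "P3": {"Z"},
}

def bibliographic_coupling(links, p, q):
    """Number of pages that both p and q link to (shared out-links)."""
    return len(links.get(p, set()) & links.get(q, set()))

def co_citation(links, p, q):
    """Number of pages that link to both p and q (shared in-links)."""
    return sum(1 for targets in links.values() if p in targets and q in targets)

print(bibliographic_coupling(links, "P1", "P2"))  # 2 (both link to X and Y)
print(co_citation(links, "X", "Y"))               # 2 (P1 and P2 link to both)
print(co_citation(links, "X", "Z"))               # 1 (only P1 links to both)
```

Either count can serve as a similarity score between pages when clustering them.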
Figure (four numbered steps): Data Cleaning, Data Cube Creation, OLAP, Data Mining
Text Mining
The amount of information available on the Web has increased rapidly
(Information-explosion era).
Users demand useful and reliable information from the Web in the shortest time
possible.
Text mining therefore corresponds to the extension of the data mining
approach to textual data. It is concerned with various tasks, such as the
extraction of information implicitly contained in a collection of documents, or
similarity-based structuring.
A text collection in general lacks the imposed structure of a traditional database.
It contains a vast range of information, but that information is difficult to
decipher automatically.
Data mining deals with the extraction of interesting information (or patterns) from
structured data.
Because a text collection is unstructured, particularly free-running text, specific
techniques called text mining techniques have to be developed to aid
knowledge discovery.
To perform text mining, either:
1. Impose a structure on the textual data and use any known data mining technique, or
2. Develop a technique specific to text.
Text mining also has to take other related subjects into account:
Information Retrieval
Information Extraction
Computational Linguistics
Information Retrieval
IR is the automatic retrieval of all relevant documents while retrieving as
few of the non-relevant ones as possible.
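The two goals in this definition are usually quantified as recall (the share of relevant documents that were retrieved) and precision (the share of retrieved documents that are relevant). The slides do not name these measures, so the sketch below is an illustration using the standard definitions; the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 documents truly relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(precision)  # 0.5 (2 of the 4 retrieved documents are relevant)
print(recall)     # 2 of the 3 relevant documents were retrieved
```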
Recent trends in IR include modeling, clustering, classification, data
visualization, user interfaces and filtering, which are instances of text mining.
Information Extraction
Information Extraction has the goal of transforming a collection of documents,
usually with the help of an IR system, into information that is more readily
digested and analyzed.
IE extracts relevant facts from the documents, while IR selects relevant
documents.
More and more IE systems use machine learning or data mining techniques to learn
the extraction patterns or rules for documents semi-automatically or automatically.
IE is a kind of preprocessing stage in the text mining process: it is the step
after IR and before DM techniques are applied.
IE can also be used to improve the indexing process, which is part of the IR
process.
IE is an instance of text mining, because the summary of a document produced by
IE is a form of information that did not exist before.
Computational Linguistics
Unstructured Text
Unstructured documents are free texts, such as news stories.
Traditionally, most research represents unstructured documents as bags of words
and extracts different features from them.
Features
Word Occurrences
STOP WORDS
Feature selection at this stage includes normalizing case and removing
punctuation, infrequent words and stop words.
STEMMING
N-GRAM
PART-OF-SPEECH(POS)
POSITIONAL COLLOCATIONS
Once the features are extracted, the text is represented as structured data and
traditional data mining techniques can be used.
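A few of the features listed above (word occurrences, stop words, stemming, n-grams) can be sketched in a handful of lines. The stop-word list and the suffix-stripping `naive_stem` function are deliberately tiny stand-ins invented for this example; a real system would use a full stop list and an established stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, drop punctuation and stop words, then stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, min_count=1):
    """Word-occurrence counts, dropping words seen fewer than min_count times."""
    counts = Counter(tokenize(text))
    return {word: c for word, c in counts.items() if c >= min_count}

text = "Web mining mines the links of Web pages"
tokens = tokenize(text)
print(tokens)              # ['web', 'min', 'mine', 'link', 'web', 'page']
print(bag_of_words(text))  # 'web' occurs twice, every other stem once
print(ngrams(tokens, 2)[0])  # ('web', 'min')
```

The resulting dictionary is structured data: each document becomes a row of counts, to which ordinary data mining techniques can be applied. The stems "min" and "mine" also show how crude this stemmer is compared with a real one.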
Episode Rule Discovery for Texts
Here, text is considered as sequential data consisting of a sequence of
pairs (feature vector, index),
where the feature vector is an ordered set of features and the index contains
information about the position of the word in the sequence.
A text episode is defined as a pair α = (V, ≤),
where V is a collection of feature vectors and ≤ is a partial order on V.
Given a text sequence S, a text episode α = (V, ≤) occurs within S if there is a way
of satisfying the feature vectors in V using the feature vectors in S so that the
partial order ≤ is respected,
i.e. the feature vectors of V can be found within S in an order that satisfies ≤.
The support of α in S is defined as the number of minimal occurrences of α in
S.
The episode discovery technique of sequence mining can be used to discover
frequent episodes in a text.
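Counting support can be illustrated for a simplified case. The sketch below makes two assumptions that narrow the definition above: each feature vector is reduced to the word itself, and the episode is serial, meaning its partial order is a total order given by a tuple of words. An occurrence spanning positions [s, e] is treated as minimal when no shorter span inside it also contains the episode.

```python
def earliest_end(tokens, episode, start):
    """Index at which `episode` is first completed when matching from `start`, else None."""
    k = 0
    for i in range(start, len(tokens)):
        if tokens[i] == episode[k]:
            k += 1
            if k == len(episode):
                return i
    return None

def minimal_occurrences(tokens, episode):
    """Return the (start, end) spans of all minimal occurrences of a serial episode."""
    candidates = []
    for s, tok in enumerate(tokens):
        if tok == episode[0]:
            e = earliest_end(tokens, episode, s)
            if e is not None:
                candidates.append((s, e))
    minimal = []
    for idx, (s, e) in enumerate(candidates):
        # Earliest ends never decrease as the start moves right, so it is enough to
        # check the next candidate: if it ends no later, it lies inside [s, e].
        nxt = candidates[idx + 1] if idx + 1 < len(candidates) else None
        if nxt is None or nxt[1] > e:
            minimal.append((s, e))
    return minimal

tokens = "data mining and text mining extend data analysis to text mining".split()
print(minimal_occurrences(tokens, ("text", "mining")))       # [(3, 4), (9, 10)]
print(len(minimal_occurrences(tokens, ("data", "mining"))))  # support = 2
print(minimal_occurrences("a a b a b".split(), ("a", "b")))  # [(1, 2), (3, 4)]
```

In the last call the span (0, 2) is not reported, because the shorter span (1, 2) inside it already contains the episode.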
Hierarchy of Categories
It is necessary to organize search result documents into meaningful groups:
Same author
Same year
Same publisher
Subject matter
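Grouping by any one of these attributes is a simple partition of the result list. The sketch below assumes each search result is a record with `title`, `author`, `year` and `subject` fields; the records themselves are hypothetical.

```python
from collections import defaultdict

# Hypothetical search results; field names and values are made up for illustration.
results = [
    {"title": "Doc 1", "author": "Author A", "year": 2001, "subject": "web mining"},
    {"title": "Doc 2", "author": "Author B", "year": 2001, "subject": "text mining"},
    {"title": "Doc 3", "author": "Author A", "year": 2003, "subject": "text mining"},
]

def group_by(records, field):
    """Map each distinct value of `field` to the titles of the records that carry it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["title"])
    return dict(groups)

print(group_by(results, "author"))   # {'Author A': ['Doc 1', 'Doc 3'], 'Author B': ['Doc 2']}
print(group_by(results, "year"))     # {2001: ['Doc 1', 'Doc 2'], 2003: ['Doc 3']}
```

Applying `group_by` again inside each group, for example by subject within author, would give a hierarchy of categories.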