
DWM Assignment 1

Roll No. N215 Name: Soumya Chhajed


Class : MBA Tech Date of Assignment: 21/9/20

1. Write detailed notes on the following: -


a. Web Content Mining

Web content mining is related to, but different from, data mining and text mining. It is
related to data mining because many data mining techniques can be applied to Web content
mining. It is related to text mining because much of the Web's content is text. However, it is
also quite different from data mining because Web data are mainly semi-structured and/or
unstructured, while data mining deals primarily with structured data. Web content mining
also differs from text mining because of the semi-structured nature of the Web, while text
mining focuses on unstructured texts. Web content mining thus requires creative applications
of data mining and/or text mining techniques as well as its own unique approaches. In the
past few years, there has been a rapid expansion of activity in the Web content mining area.
This is not surprising given the phenomenal growth of Web content and the significant
economic benefit of such mining. However, due to the heterogeneity and lack of structure of
Web data, automated discovery of targeted or unexpected knowledge still presents many
challenging research problems. This tutorial examines the following important Web content
mining problems and discusses existing techniques for solving them. Some other emerging
problems are also surveyed.

 Data/information extraction: The focus is on extraction of structured data from
Web pages, such as products and search results. Extracting such data allows one to
provide value-added services. Two main types of techniques, machine learning and
automatic extraction, are covered.
 Web information integration and schema matching: Although the Web contains a
huge amount of data, each web site (or even page) represents similar information
differently. How to identify or match semantically similar data is a very important
problem with many practical applications. Some existing techniques and problems are
examined.
 Opinion extraction from online sources: There are many online opinion sources,
e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions
(especially consumer opinions) is of great importance for marketing intelligence and
product benchmarking. We will introduce a few tasks and techniques to mine such
sources.
 Knowledge synthesis: Concept hierarchies or ontologies are useful in many
applications. However, generating them manually is very time consuming. A few
existing methods that explore the information redundancy of the Web will be
presented. The main application is to synthesize and organize the pieces of
information on the Web to give the user a coherent picture of the topic domain.
 Segmenting Web pages and detecting noise: In many Web applications, one only
wants the main content of the Web page, without advertisements, navigation links, or
copyright notices. Automatically segmenting Web pages to extract their main content
is an interesting problem. A number of interesting techniques have been proposed in
the past few years.
All these tasks present major research challenges, and their solutions also have immediate
real-life applications. The tutorial starts with a short motivation of Web content mining. It
then discusses the differences between Web content mining and text mining, and between
Web content mining and data mining. This is followed by a presentation of the above
problems and the current state-of-the-art techniques. Various examples are also given to help
participants better understand how this technology can be deployed to help businesses.
All parts of the tutorial have a mix of research and industry flavour, addressing seminal
research concepts and looking at the technology from an industry angle.
There are two approaches that are used for Web Content Mining:
1. Agent-based approach:
This approach involves intelligent systems. It usually relies on autonomous agents
that can identify relevant websites.

2. Data-based approach:
Data-Based approach is used to organize semi-structured data present on the internet
into structured data.

 Crawlers:
1. A robot (spider) traverses the hypertext structure of the Web.
2. It collects information from the pages it visits.
3. This information is used to construct indexes for search engines.
4. Traditional Crawler – visits the entire Web (in principle) and replaces the index.
5. Periodic Crawler – visits portions of the Web and updates a subset of the index.
6. Incremental Crawler – selectively searches the Web and incrementally
modifies the index.
7. Focused Crawler – visits pages related to a particular subject.
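The generic crawl loop described above can be sketched as follows. This is a minimal sketch: `fetch_links` is a hypothetical stand-in for the HTTP fetch and HTML parse a real crawler would perform.

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=10):
    """Breadth-first traversal of the Web's hypertext structure.

    fetch_links(url) -> list of outgoing links (a stand-in for an
    HTTP fetch plus HTML parse). Returns the list of visited pages,
    from which a search-engine index could be built.
    """
    visited = []
    seen = {start_url}
    frontier = deque([start_url])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)          # collect information from the page
        for link in fetch_links(url):
            if link not in seen:     # avoid revisiting pages
                seen.add(link)
                frontier.append(link)
    return visited
```

A periodic or incremental crawler would differ only in how `visited` is merged back into the index rather than in this traversal loop.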

 Focused Crawlers
1. Only visit links from a page if that page is determined to be relevant.
2. The classifier is static after the learning phase.
3. Components:
 Classifier, which assigns a relevance score to each page based on the
crawl topic.
 Distiller, which identifies hub pages.
 Crawler, which visits pages based on the classifier and distiller scores.
4. The classifier relates documents to topics.
5. The classifier also determines how useful outgoing links are.
6. Hub pages contain links to many relevant pages and must be visited even if
they do not have high relevance scores.

 Context Focused Crawlers


1. Context Graph:
 Context graph created for each seed document.
 Root is the seed document.
 Nodes at each level show documents with links to documents at next
higher level.
 Updated during crawl itself.
2. Approach:
 Construct context graph and classifiers using seed documents as
training data.
 Perform crawling using classifiers and context graph created.
 Virtual Web View
1. Multiple Layered Database (MLDB) built on top of the Web.
2. Each layer of the database is more generalized (and smaller) and centralized
than the one beneath it.
3. Upper layers of MLDB are structured and can be accessed with SQL type
queries.
4. Translation tools convert Web documents to XML.
5. Extraction tools extract desired information to place in first layer of MLDB.
6. Higher levels contain more summarized data obtained through generalizations
of the lower levels.

 Personalization
1. Web access or contents tuned to better fit the desires of each user.
2. Manual techniques identify user’s preferences based on profiles or
demographics.
3. Collaborative filtering identifies preferences based on ratings from similar
users.
4. Content based filtering retrieves pages based on similarity between pages and
user profiles.
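As a sketch of the collaborative-filtering idea in point 3 above (the function names and toy ratings are illustrative, not from any particular system), a user's rating of an unseen page can be predicted as a similarity-weighted average of ratings from similar users:

```python
from math import sqrt

def cosine_sim(r1, r2):
    """Cosine similarity over the items two users both rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = sqrt(sum(r1[i] ** 2 for i in common))
    n2 = sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (n1 * n2)

def predict(ratings, user, item):
    """Predict user's rating of item from ratings of similar users."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine_sim(ratings[user], r)   # weight by user similarity
        num += s * r[item]
        den += s
    return num / den if den else None
```

Content-based filtering (point 4) would instead compare page features to the user's profile rather than comparing users to each other.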

 Harvest System
The Harvest system is based on the use of caching, indexing, and crawling. Harvest is
actually a set of tools that facilitate gathering of information from diverse sources.
The Harvest design is centered around the use of gatherers and brokers. A gatherer
obtains information for indexing from an Internet service provider, while a broker
provides the index and query interface. The relationship between brokers and
gatherers can vary: brokers may interface directly with gatherers or may go through
other brokers to reach the gatherers. Indices in Harvest are topic-specific, as are
brokers. This topic-specific design avoids the scalability problems found in
non-topic-specific approaches.
Difference Between Web Content, Web Structure, and Web Usage Mining:

View of data:
 Web Content (IR view): unstructured; structured
 Web Content (DB view): semi-structured; the website as a DB
 Web Structure: link structure
 Web Usage: interactivity

Main data:
 Web Content (IR view): text documents; hypertext documents
 Web Content (DB view): hypertext documents
 Web Structure: link structure
 Web Usage: server logs; browser logs

Method:
 Web Content (IR view): machine learning; statistical (including NLP)
 Web Content (DB view): proprietary algorithms; association rules
 Web Structure: proprietary algorithms
 Web Usage: machine learning; statistical; association rules

Representation:
 Web Content (IR view): bag of words, n-grams, terms; phrases, concepts or ontology; relational
 Web Content (DB view): edge-labelled graph; relational
 Web Structure: graph
 Web Usage: relational table; graph

Application:
 Web Content (IR view): categorization; clustering; finding extraction rules; finding patterns in text
 Web Content (DB view): finding frequent substructures; website schema discovery
 Web Structure: categorization; clustering
 Web Usage: site construction, adaptation, and management
b. Web Structure Mining
Web Structure Mining focuses on analysis of the link structure of the Web, and one of its
purposes is to identify the more preferable documents. The different objects are linked in
some way. The intuition is that a hyperlink from document A to document B implies that the
author of document A thinks document B contains worthwhile information. Web structure
mining helps in discovering similarities between websites, discovering important sites for a
particular topic or discipline, and discovering web communities.
Simply applying the traditional processes and assuming that the events are independent can
lead to wrong conclusions. However, the appropriate handling of the links could lead to
potential correlations, and then improve the predictive accuracy of the learned models.
The goal of Web structure mining is to generate a structural summary of a Web site and its
pages. Technically, Web content mining mainly focuses on intra-document structure, while
Web structure mining tries to discover the link structure of the hyperlinks at the
inter-document level. Based on the topology of the hyperlinks, Web structure mining can
categorize Web pages and generate information such as the similarity and relationships
between different Web sites.
Web structure mining can also take another direction: discovering the structure of the Web
document itself. This type of structure mining can be used to reveal the structure (schema) of
Web pages, which is useful for navigation purposes and makes it possible to
compare/integrate Web page schemas. This type of structure mining facilitates the
introduction of database techniques for accessing information in Web pages by providing a
reference schema.

 Mine structure (links, graph) of the Web


 Techniques
– PageRank
– CLEVER
 Create a model of the Web organization.
 May be combined with content mining to more effectively retrieve important pages.

 Page Rank:
o Used by Google
o Prioritize pages returned from search by looking at Web structure.
o Importance of page is calculated based on number of pages which point to it –
Backlinks.
o Weighting is used to give more importance to backlinks coming from
important pages.
o PR(p) = c (PR(p1)/N1 + … + PR(pn)/Nn)
 PR(pi): PageRank of page pi, which points to the target page p.
 Ni: number of links going out of page pi.
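The formula above can be computed iteratively. The sketch below also adds the (1 − c) teleport term used in the standard damped formulation so that the ranks form a probability distribution; that term is an assumption beyond the bare formula given here.

```python
def pagerank(links, c=0.85, iters=50):
    """Iteratively compute PR(p) = c * sum(PR(i)/Ni over backlinks i),
    plus a (1-c)/n teleport term (standard damped formulation).

    links: page -> list of pages it points to.
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {p: (1 - c) / n for p in pages}
        for i, outs in links.items():
            if outs:
                share = pr[i] / len(outs)     # PR(i)/Ni split over outlinks
                for p in outs:
                    new[p] += c * share
        pr = new
    return pr
```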

 CLEVER:
o Identify authoritative and hub pages.
o Authoritative Pages :
 Highly important pages.
 Best source for requested information.
o Hub Pages :
 Contain links to highly important pages.
 HITS:
o Hyperlink-Induced Topic Search
o Based on a set of keywords, find set of relevant pages – R.
o Identify hub and authority pages for these.
 Expand R to a base set, B, of pages linked to or from R.
 Calculate weights for authorities and hubs.
o Pages with highest ranks in R are returned.
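The hub/authority weight calculation can be sketched as a mutual-reinforcement iteration. This is a simplified version under the assumption that the base set B has already been built from the keyword search; a real HITS run would construct B first.

```python
from math import sqrt

def hits(links, iters=50):
    """Iterate the HITS updates on a set of linked pages:
    authority(p) = sum of hub scores of pages linking to p,
    hub(p)       = sum of authority scores of pages p links to,
    normalising after each step."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[i] for i, outs in links.items() if p in outs)
                for p in pages}
        norm = sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return auth, hub
```

Pages with high authority scores correspond to the authoritative pages above; pages with high hub scores are the hub pages.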

c. Web Usage Mining

Web usage mining, a subset of data mining, is the extraction of various kinds of
interesting data that are readily available and accessible in the ocean of web pages that make
up the Internet, formally known as the World Wide Web (WWW). As an application of
data mining techniques, it helps analyze user activity on different web pages and
track it over a period of time. The web data it draws on falls into three major
categories.
1. Web Content Data: The common forms of web content data are HTML web pages,
images, audio, video, etc., with HTML being the main format. Though a page's rendering
may differ from browser to browser, the basic layout/structure remains the same everywhere,
which is why HTML is the most popular form of web content data. XML and dynamic server
pages such as JSP, PHP, etc. are other forms of web content data.
2. Web Structure Data: On a web page, content is arranged according to HTML tags
(known as intra-page structure information). Web pages usually have hyperlinks that connect
the main webpage to sub-pages; this is called inter-page structure information. In essence,
the relationships/links describing the connections between webpages constitute web
structure data.
3. Web Usage Data: The main sources of data here are the web server and the application
server. This involves log data collected by the two sources mentioned above. Log files are
created when a user/customer interacts with a web page. The data in this category can be
classified into three types based on its source:

 Server-side
 Client-side
 Proxy-side
There are other additional data sources also which include cookies, demographics, etc.
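Server-side usage data typically arrives as log lines. A minimal parser for such entries might look like the sketch below; the field layout assumed here is the standard Common Log Format shape, not a format mandated by the text.

```python
import re

# Common Log Format, the usual shape of server-side log entries:
#   host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

def parse_log_line(line):
    """Extract the fields a usage-mining pipeline needs from one entry."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None                       # malformed entry
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["bytes"] = int(rec["bytes"])
    return rec
```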

Types of Web Usage Mining based upon the Usage Data:

1. Web Server Data: The web server data generally includes the IP address, browser logs,
proxy server logs, user profiles, etc. User logs are collected by the web server.
2. Application Server Data: An added feature of commercial application servers is the
ability to build applications on top of them. Application server data mainly consists of logs
of various business events tracked by and recorded in the application server logs.
3. Application-level Data: An application can define various new kinds of events, and
enabling logging for them provides a record of past events.

Advantages of Web Usage Mining

 Government agencies benefit from this technology in countering terrorism.
 The predictive capabilities of mining tools have helped identify various criminal activities.
 Customer relationships are better understood by companies with the aid of these mining
tools, helping them satisfy customer needs faster and more efficiently.

Disadvantages of Web Usage Mining

 Privacy stands out as a major issue. Analyzing data for the benefit of customers is good, but
using the same data for something else can be dangerous. Using it without the individual's
knowledge can pose a big threat to the company.
 If a data mining company does not hold itself to high ethical standards, two or more
attributes can be combined to derive personal information about a user, which again is not
acceptable.
Some Techniques in Web Usage Mining

1. Association Rules: The most used technique in web usage mining is association rules.
This technique focuses on relations among the web pages that frequently appear
together in users' sessions. Pages accessed together are grouped into a single
server session. Association rules help in the reconstruction of websites using the access
logs, which generally contain information about requests arriving at the web server.
The major drawback of this technique is that producing so many sets of rules
together may leave some of the rules completely inconsequential and of no future use.
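A minimal sketch of mining page-to-page association rules from server sessions follows. The thresholds and session data are illustrative; a real system would use a full Apriori or FP-growth implementation over the access logs.

```python
from itertools import combinations
from collections import Counter

def page_association_rules(sessions, min_support=0.5, min_conf=0.6):
    """Find rules page_x -> page_y among pages that co-occur in
    user sessions, keeping pairs above a support threshold and
    rules above a confidence threshold."""
    n = len(sessions)
    page_count = Counter(p for s in sessions for p in set(s))
    pair_count = Counter(
        pair for s in sessions for pair in combinations(sorted(set(s)), 2))
    rules = []
    for (a, b), cnt in pair_count.items():
        if cnt / n < min_support:          # pair too rare in sessions
            continue
        for x, y in ((a, b), (b, a)):
            conf = cnt / page_count[x]     # P(y in session | x in session)
            if conf >= min_conf:
                rules.append((x, y, cnt / n, conf))
    return rules
```

Each rule is a tuple (antecedent page, consequent page, support, confidence); low-confidence rules are exactly the inconsequential ones the drawback above refers to.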
2. Classification: Classification maps a particular record to one of several predefined
classes. The main goal in web usage mining is to develop profiles of users/customers
associated with a particular class/category. This requires extracting the features best
suited to the associated class. Classification can be implemented by various algorithms,
including support vector machines, k-nearest neighbours, logistic regression, decision
trees, etc. For example, given a record of customers' purchase history over the last
6 months, customers can be classified into frequent and non-frequent
classes/categories. Multi-class classification is also possible in other cases.
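As a small illustration of the k-nearest-neighbours algorithm mentioned above, here is a sketch that classifies customers into frequent/non-frequent classes from two hypothetical features (purchases in the last 6 months, amount spent):

```python
def knn_classify(train, query, k=3):
    """k-Nearest Neighbours: label a record by majority vote of its
    k closest training records (squared Euclidean distance).

    train: list of (features, label) pairs.
    """
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    neighbours = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)   # majority vote
```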
3. Clustering: Clustering is a technique for grouping together a set of items with similar
features/traits. There are mainly two types of clusters: usage clusters and page clusters.
The clustering of pages can be readily performed based on the usage data. In usage-based
clustering, items that are commonly accessed/purchased together are automatically
organized into groups, and the clustering of users tends to establish groups of users
exhibiting similar browsing patterns. In page clustering, pages with related content are
grouped so that information can be retrieved from the web pages quickly.
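A plain k-means sketch of usage-based clustering is shown below. The feature vectors are illustrative; real usage clustering would work on session or purchase vectors extracted from the logs.

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    """Plain k-means: group usage records with similar features.
    Returns a list of cluster labels, one per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initial centroids
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = [min(range(k), key=lambda c: sum(
            (p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points]
        # move each centroid to the mean of its cluster
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members))
    return labels
```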

Applications of Web Usage Mining

1. Personalization of Web Content: The World Wide Web contains a vast amount of
information and is expanding very rapidly. The problem is that users' specific needs keep
growing, and they quite often do not get the results they are looking for. A solution to this is
web personalization: tailoring content to a user's needs based on their tracked navigational
behavior and interests. Web personalization includes recommender systems, check-box
customization, etc. Recommender systems are popular and are used by many companies.
2. E-commerce: Web usage mining plays a vital role in web-based companies, whose
ultimate focus is customer attraction, customer retention, cross-sales, etc. To build a
strong relationship with the customer, it is very useful for a web-based company to rely
on web usage mining, where it can gain a lot of insight into customers' interests. It also
shows the company how to improve its web design in some respects.
3. Prefetching and Caching: Prefetching means loading data before it is required, to
decrease the time spent waiting for that data, hence the term 'prefetch'. The results we get
from web usage mining can be used to devise prefetching and caching strategies, which in
turn can greatly reduce server response time.
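As a toy sketch of how mined navigation patterns can drive prefetching (the `Prefetcher` class and its API are hypothetical, not part of any real server):

```python
from collections import defaultdict, Counter

class Prefetcher:
    """Learn 'page A is usually followed by page B' from past sessions,
    then prefetch the most likely next page into a cache, cutting the
    wait the user would otherwise spend on a server round trip."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)  # page -> next-page counts
        self.cache = set()

    def train(self, sessions):
        # count consecutive page transitions mined from usage logs
        for session in sessions:
            for a, b in zip(session, session[1:]):
                self.next_counts[a][b] += 1

    def on_request(self, page):
        hit = page in self.cache                 # served from cache?
        if self.next_counts[page]:
            likely_next, _ = self.next_counts[page].most_common(1)[0]
            self.cache.add(likely_next)          # fetch before it is asked for
        return hit
```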
