Article · November 2019
Research on Web Data Mining
Mrs. Sunita S. Sane, Veermata Jijabai Technological Institute, Mumbai. Email: sssane@vjti.org.in
Mrs. Archana A. Shirke, Veermata Jijabai Technological Institute, Mumbai. Email: archanashirke25@gmail.com
ABSTRACT
Web Data Mining is the mining of Web data. Web Mining aims to discover useful information or knowledge from Web hyperlink structure, page content and usage data. Although Web Mining uses many Data Mining techniques, it is not purely an application of traditional Data Mining, due to the heterogeneity and semi-structured nature of Web data.

Keywords
Web Data Mining, Web Mining, Data Mining, Web Content Mining, Web Usage Mining, Web Structure Mining

1. INTRODUCTION
Web Mining research is a converging research area drawing on several research communities, such as the database, Information Retrieval and Artificial Intelligence communities [1]. It has become increasingly necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular Web localities [2].

The heterogeneity and the lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, and management of Web-based information difficult.

1.1 INTRODUCTION TO DATA MINING
Data Mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g. databases, texts, images, the Web, etc. [13]. The patterns must be valid, potentially useful and understandable. Data Mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, Data Mining is the process of finding correlations or patterns among dozens of fields in large relational databases. There are many Data Mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule Mining and sequential pattern Mining [15].

Classification: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clustering: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Association rule Mining: Data can be mined to identify associations. The beer-diaper example is an example of associative Mining.

Sequential pattern Mining: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data Mining consists of five major elements [8]:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.

1.2 INTRODUCTION TO WEB MINING
The huge amount of information available on the World Wide Web has led to the Mining of the Web. Web Mining can thus be defined as the use of Data Mining techniques to automatically discover and extract information from Web documents and services [2]. Web Mining is a cross area of Data Mining, Information Retrieval, Information Extraction and Artificial Intelligence. The Web is huge, diverse and dynamic, and thus raises scalability and multimedia issues [1]. Users can encounter the following problems when interacting with the Web: finding relevant information; creating new knowledge out of the information available on the Web; personalization of information; and learning about consumers or individual users. Web Mining
techniques could be used to solve the information overload problems above directly or indirectly. However, techniques from other research areas, such as databases (DB), Information Retrieval (IR), Natural Language Processing (NLP) and the Web document community, could also be used [2].

Web Mining can be decomposed into the following subtasks [2]:
1. Resource finding: the task of retrieving intended Web documents.
2. Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.

Resource finding is the process of retrieving data, either online or offline, from the text sources available on the Web, such as electronic newsletters, electronic newswires, newsgroups, and the text contents of HTML documents obtained by removing HTML tags; it also covers the manual selection of Web resources. The information selection and pre-processing step is any kind of transformation applied to the original data retrieved in the IR process. These transformations can be either preprocessing steps such as removing stop words and stemming, or pre-processing aimed at obtaining the desired representation, such as finding phrases in the training corpus or transforming the representation into relational or first-order logic form. Machine Learning or Data Mining techniques are typically used for generalization. Humans play an important role in the information and knowledge discovery process on the Web, since the Web is an interactive medium. Thus query-triggered knowledge discovery is as important as the more automatic data-triggered knowledge discovery.

Thus Web Mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from Web data. It is an extension of the standard process of knowledge discovery in databases (KDD) [2]. Web Mining is often associated with IR or IE. However, Web Mining, or information discovery on the Web, is not the same as IR or IE. IR is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant ones as possible. IR has the primary goal of indexing text and searching for useful documents in a collection, and nowadays research in IR includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. One task that can be considered an instance of Web Mining is Web document classification or categorization, which can be used for indexing. Viewed in this respect, Web Mining is part of the (Web) IR process [2].

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed. IE aims to extract relevant facts from the documents, while IR aims to select relevant documents. IE is interested in the structure or representation of a document, while IR views the text in a document just as a bag of unordered words. Some IE systems use Machine Learning or Data Mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically. Within this view, Web Mining is part of the (Web) IE process [2].

The Web Mining process is similar to the Data Mining process. The difference is usually in the data collection. In traditional Data Mining, the data is often already collected and stored in a data warehouse. For Web Mining, data collection can be a substantial task, especially for Web Structure and Content Mining, which involves crawling a large number of target Web pages. The classification of retrieval and mining tasks for different types of data is given below [1].

Purpose | Any Data | Textual Data | Web-Related Data
Retrieving known data or documents efficiently and effectively | Data Retrieval | Information Retrieval | Web Retrieval
Finding new patterns or previously unknown knowledge | Data Mining | Text Mining | Web Mining

Figure 1: Classification of retrieval and mining process

2. WEB MINING CATEGORIES
Web Mining tasks can be categorized into three types [2]:
1. Web Content Mining (WCM) - Web Content Mining refers to the discovery of useful information from Web contents, including text, image, audio, video, etc. Research in Web Content Mining encompasses resource discovery from the Web, document categorization and clustering, and information extraction from Web pages.
2. Web Structure Mining (WSM) - Web Structure Mining studies the Web's hyperlink structure. It usually involves analysis of the in-links and out-links of a Web page, and it has been used for search engine result ranking.
3. Web Usage Mining (WUM) - Web Usage Mining focuses on analyzing search logs or other activity logs to find interesting patterns. One of the main applications of Web Usage Mining is to learn user profiles.
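As a small, concrete illustration of the structure-oriented category above, the in-links and out-links of a page can be counted directly from a hyperlink graph. This is a minimal sketch: the page names and the adjacency-list representation below are hypothetical, not taken from any particular system.

```python
from collections import defaultdict

# Hypothetical hyperlink graph: page -> list of pages it links to (its out-links).
out_links = {
    "home.html": ["about.html", "products.html"],
    "about.html": ["home.html"],
    "products.html": ["home.html", "about.html"],
}

# Derive in-links by inverting the out-link adjacency list.
in_links = defaultdict(list)
for page, targets in out_links.items():
    for target in targets:
        in_links[target].append(page)

for page in out_links:
    print(page, "out-links:", len(out_links[page]), "in-links:", len(in_links[page]))
```

Scaled up, this kind of in-link/out-link bookkeeping is the starting point for the link-analysis techniques discussed in Section 4.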
2.1 Web Content Mining
Web Content Mining is related to, but different from, Data Mining and Text Mining. It is related to Data Mining because many Data Mining techniques can be applied in Web Content Mining. It is related to Text Mining because much of the Web's content is text. However, it is also quite different from Data Mining because Web data is mainly semi-structured and/or unstructured, while Data Mining deals primarily with structured data. Web Content Mining is also different from Text Mining because of the semi-structured nature of the Web, whereas Text Mining focuses on unstructured texts. Web Content Mining thus requires creative applications of Data Mining and/or Text Mining techniques, as well as its own unique approaches. In the past few years there has been a rapid expansion of activities in the Web Content Mining area. This is not surprising given the phenomenal growth of Web content and the significant economic benefit of such Mining. However, due to the heterogeneity and lack of structure of Web data, the automated discovery of targeted or unexpected knowledge still presents many challenging research problems; the important Web Content Mining problems and existing techniques for solving them are examined in [1].

In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, as well as to extend database and Data Mining techniques to provide a higher level of organization.

2.2 Web Structure Mining
Web Structure Mining is the process of using graph theory to analyse the node and connection structure of a Web site [1]. According to the type of Web structural data, Web Structure Mining can be divided into two kinds. The first kind is extracting patterns from hyperlinks in the Web; a hyperlink is a structural component that connects a Web page to a different location. The other kind is Mining the document structure: using the tree-like structure of HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language) tags to analyse and describe the content within a Web page.

2.3 Web Usage Mining
Web Usage Mining is the application of Data Mining techniques to analyse and discover interesting patterns in users' usage data on the Web. The usage data records the users' behaviour when they browse or make transactions on a Web site.

Web Usage Mining is the type of Web Mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, the traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via tools such as CGI scripts. Analyzing such data can help these organizations to determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. Analysis of server access logs and user registration data can also provide valuable information on how to structure a Web site in order to create a more effective presence for the organization. For organizations using intranet technologies, such analysis can shed light on more effective management of workgroup communication and organizational infrastructure. Finally, for organizations that sell advertising on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users.

WUM can be decomposed into the following subtasks.

2.3.1 Data Pre-processing for Mining
It is necessary to perform data preparation to convert the raw data for further processing. This step has the following parts:
• Content Preprocessing: Content preprocessing is the process of converting text, images, scripts and other files into forms that can be used by usage Mining.
• Structure Preprocessing: The structure of a Web site is formed by the hyperlinks between page views. Structure preprocessing can be treated similarly to content preprocessing. However, each server session may have to construct a different site structure than others.
• Usage Preprocessing: The inputs of the preprocessing phase may include the Web server logs, referral logs, registration files, index server logs, and optionally usage statistics from a previous analysis. The outputs are the user session file, transaction file, site topology, and page classifications.

2.3.2 Pattern Discovery
This is the key component of Web Mining. Pattern discovery covers algorithms and techniques from several research areas, such as Data Mining, Machine
Learning, Statistics, and Pattern Recognition. It includes the following techniques:
• Statistical Analysis: Statistical analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file. By analyzing the statistical information contained in the periodic Web system report, the extracted report can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
• Association Rules: In the Web domain, the pages which are most often referenced together can be put in one single server session by applying association rule generation. Association rule Mining techniques can be used to discover unordered correlations between items found in a database of transactions.
• Clustering: Clustering analysis is a technique to group together users or data items (pages) with similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies.
• Classification: Classification is the technique of mapping a data item into one of several predefined classes. Classification can be done using supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc.
• Sequential Pattern: This technique intends to find inter-session patterns, such that one set of items follows the presence of another in a time-ordered set of sessions or episodes. Sequential patterns also include other types of temporal analysis, such as trend analysis, change point detection, or similarity analysis.
• Dependency Modeling: The goal of this technique is to establish a model that is able to represent significant dependencies among the various variables in the Web domain. The modeling technique provides a theoretical framework for analyzing the behavior of users, and is potentially useful for predicting future Web resource consumption.

2.3.3 Pattern Analysis
Pattern Analysis is the final stage of the whole Web Usage Mining process. The goal of this stage is to eliminate irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process. The output of Web Mining algorithms is often not in a form suitable for direct human consumption, and thus needs to be transformed into a format that can be assimilated easily. There are two common approaches to pattern analysis. One is to use a knowledge query mechanism such as SQL; the other is to construct a multi-dimensional data cube before performing OLAP operations. All these methods assume that the output of the previous phase has been structured.

3. BASIC MODELS
In its full generality, a model must build machine representations of world knowledge, and therefore involves an NL grammar for text, hypertext, and semi-structured data which will be useful for our learning applications. We discuss some such models in this section [8].

3.1 Models for structured text
1. Boolean Model: The simplest statistical model is the Boolean model. It uses the notion of exact matching of documents to the user query. Both the query and the retrieval are based on Boolean algebra.
2. Vector Space Model: A document in the vector space model is represented as a weight vector, in which each component weight is computed based on some variation of the TF or TF-IDF scheme. Documents are tokenized using simple syntactic rules (such as white-space delimiters in English) and tokens are stemmed to canonical form (e.g., 'reading' to 'read'; 'is', 'was', 'are' to 'be'). Each canonical token represents an axis in a Euclidean space.
3. Statistical Language Model: This model is based on probability and has foundations in statistical theory. It first estimates a language model for each document and then ranks documents by the likelihood of the query given the language model.
4. Probabilistic Model: This model is used for document generation, with the disclaimer that such models have no bearing on grammar or semantic coherence.
In spite of minor variations, all these models regard documents as multisets of terms, without paying attention to the ordering between terms. Therefore they are collectively called bag-of-words models.

3.2 Models for semi-structured data
Semi-structured data is a point of convergence for the Web and database communities: the former deals with documents, the latter with data. The form of that data is evolving from rigidly structured relational tables with numbers and strings toward the natural representation of complex real-world objects like books, papers, movies, jet engine components, and chip designs, without sending the application writer into contortions.
Object Exchange Model (OEM): In OEM, data is in the form of atomic or compound objects: atomic objects may be integers or strings; compound objects refer to other objects through labeled edges. HTML is a special case of such 'intra-document' structure.
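The OEM idea can be sketched with ordinary nested dictionaries: atomic objects as plain values, compound objects as dictionaries whose keys act as labeled edges. This is a simplified, hypothetical rendering (real OEM objects also carry object identifiers and may form cycles, which plain nesting cannot express):

```python
# Hypothetical OEM-style object: atomic objects are plain values (str, int);
# compound objects are dicts whose keys are labeled edges to sub-objects.
paper = {
    "title": "Research on Web Data Mining",   # atomic (string)
    "year": 2019,                             # atomic (integer)
    "author": {                               # compound object
        "name": "A. Shirke",
        "affiliation": "VJTI, Mumbai",
    },
}

def atoms(obj):
    """Recursively collect the atomic objects reachable via labeled edges."""
    if isinstance(obj, dict):
        result = []
        for child in obj.values():
            result.extend(atoms(child))
        return result
    return [obj]

print(atoms(paper))
```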
The above forms of irregular structure naturally encourage Data Mining techniques from the domain of 'standard' structured warehouses to be applied, adapted, and extended to discover useful patterns from semi-structured sources as well.

4. LINK ANALYSIS
In recent years, Web link structure has been widely used to infer important information about Web pages. Web Structure Mining has been largely influenced by research in social network analysis and citation analysis [8]. Citations (linkages) among Web pages are usually indicators of high relevance or good quality. We use the term in-links to indicate the hyperlinks pointing to a page and the term out-links to indicate the hyperlinks found in a page. Usually, the larger the number of in-links, the more useful a page is considered to be. The rationale is that a page referenced by many people is likely to be more important than a page that is seldom referenced. As in citation analysis, an often-cited article is presumed to be better than one that is never cited. In addition, it is reasonable to give a link from an authoritative source (such as Yahoo!) a higher weight than a link from an unimportant personal home page. By analyzing the pages containing a URL, we can also obtain the anchor text that describes it. Anchor text shows how other Web page authors annotate a page and can be useful in predicting the content of the target page. Several algorithms have been developed to address this issue.

5. THE SEMANTIC WEB
The Semantic Web is a term coined by Berners-Lee [17] for the vision of making the information on the Web machine-processable. The basic idea is to enrich Web pages with machine-processable knowledge that is represented in the form of ontologies [19]. Ontologies define certain types of objects and the relations between them. As ontologies are readily accessible (like other Web documents), a computer program can use them to draw inferences about the information provided on Web pages. One of the research challenges in this area is to annotate the information that is currently available on the Web with semantic tags. Typically, techniques from text classification, hypertext classification and information extraction are used for that purpose. A landmark application in this area was the WebKB project at Carnegie Mellon University (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam & Slattery, 2000). Its goal was to assign Web pages or parts of Web pages to entities in an ontology. A simple test ontology modeled knowledge about computer science departments: there are entities like students (graduate and undergraduate), faculty members (professors, researchers, lecturers, post-docs, ...), courses, projects, etc., and relations between these entities, such as “courses are taught by one lecturer and attended by several students” or “every graduate student is advised by a professor”. Many applications could be imagined for such an ontology. For example, it could enhance the capabilities of search engines by enabling them to answer queries like “Who teaches course X at university Y?” or “How many students are in department Z?”, or serve as a backbone for Web catalogues. A description of the first prototype system can be found in (Craven et al., 2000). Semantic Web Mining has emerged as a research field that focuses on the interactions of Web Mining and the Semantic Web.

6. WEB CRAWLING
Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches [13]. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. In addition, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses. A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation,
which combine to produce a wide variety of possible crawlable URLs. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

The behavior of a Web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading Web sites.
• A parallelization policy that states how to coordinate distributed Web crawlers.

The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query. Web crawlers that attempt to download
pages that are similar to each other are called focused crawlers or topical crawlers. The main problem in focused crawling is that, in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page.

7. WEB DATA MINING AND AGENT PARADIGM
Web Mining is often viewed from, or implemented within, an agent paradigm. Thus, Web Mining has a close relationship with software agents or intelligent agents. Indeed, some of these agents perform Data Mining tasks to achieve their goals. According to Green [4] there are three sub-categories of software agents: User Interface Agents, Distributed Agents, and Mobile Agents. The User Interface Agents that can be classified into the Web Mining agent category are information retrieval agents, information filtering agents and personal assistant agents. Distributed agent technology is concerned with problem solving by a group of agents; the relevant agents in this category are distributed agents for knowledge discovery or Data Mining. Delgado classifies user interface agents by the underlying information filtering technology into content-based filters, event-based filters and hybrid filters. In event-based filtering, the system tracks and follows the events that are inferred from the surfing habits of people on the Web. Some examples of such events are saving a URL into a bookmark folder, mouse clicks and scrolls, link traversal behavior, etc.

8. CONCLUSIONS
Web Data Mining is a young field, and researchers have begun to venture into it, especially by adapting Text Mining techniques. The key component of Web Mining is the Mining process itself. A lot of work still remains to be done in adapting known Mining techniques, as well as in developing new ones.

9. REFERENCES
[1] Raymond Kosala, Hendrik Blockeel, “Web Mining Research: A Survey”, SIGKDD Explorations, ACM SIGKDD, July 2000.
[2] Wang Bin, Liu Zhijing, “Web Mining Research”. In Proceedings of the 5th IEEE International Conference on Computational Intelligence and Multimedia Applications (ICCIMA’03), 2003.
[3] R. Cooley, B. Mobasher, and J. Srivastava, “Web Mining: Information and pattern discovery on the World Wide Web”. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), 1997.
[4] S. Green, L. Hurst, B. Nangle, P. Cunningham, F. Somers, and R. Evans, “Software agents: A review”, Technical Report TCD-CS-1997-06, Trinity College, University of Dublin, 1997.
[5] J. A. Delgado, “Agent-Based Information Filtering and Recommender Systems on the Internet”, PhD thesis, Dept. of Intelligence and Computer Science, Nagoya Institute of Technology, March 2000.
[6] Web site: http://www.celi.it
[7] Soumen Chakrabarti, “Mining the Web: Discovering Knowledge from Hypertext Data”, Morgan Kaufmann, 2003.
[8] Bing Liu, “Web Data Mining: Exploring Hyperlinks, Contents and Usage Data”, Springer, 2007.
[9] Hsinchun Chen and Michael Chau, “Web Mining: Machine Learning for Web Applications”, Annual Review of Information Science and Technology, University of Arizona.
[10] Chakrabarti, S. (2000). “Data mining for hypertext: A tutorial survey”. SIGKDD Explorations, 1(1), 1-11.
[11] Chakrabarti, S., Dom, B., & Indyk, P. (1998). “Enhanced hypertext categorization using hyperlinks”. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 307-318.
[12] Johannes Fürnkranz, “Web Mining” chapter, TU Darmstadt, Knowledge Engineering Group.
[13] Web site: www.wikipedia.com
[14] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson Education, 2003.
[15] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Elsevier, second edition, 2006.
[16] Tom Mitchell, “Machine Learning”, McGraw-Hill, 1997.
[17] Berners-Lee, Hendler & Lassila, “The Semantic Web”, 2001.
[18] Search engine: http://www.google.com
[19] Dieter Fensel, “Ontology versioning on the semantic web”, 2001.
