
WEB MINING

Presentation 1
CSE 590 DATA MINING Prof. Anita Wasilewska SUNY Stony Brook Presented By: Alka Simha 106677801 Avanthi Gupta 106616697 Megha Krishnamurthy 106616749

REFERENCES
Data Mining: Concepts & Techniques, by Jiawei Han and Micheline Kamber.
Presentation slides of Prof. Anita Wasilewska.
http://en.wikipedia.org/wiki/Web_mining
http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
http://searchcrm.techtarget.com/sDefinition/0,,sid11_gci789009,00.html
http://www.cs.rpi.edu/~youssefi/research/VWM/
http://www.galeas.de/webimining.html
R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.
S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.

OVERVIEW
What is Web Mining
Challenges in Web Mining
Data Mining vs Web Mining
Classification or Taxonomy
Applications of Web Mining
Conclusion

What is Web Mining


The web, as we all know, is the single largest source of data available. Web mining aims to extract and mine useful knowledge from the web. It is used to understand customer behavior, evaluate the effectiveness of a website and help quantify the success of a marketing campaign.

Because so much data is available on the World Wide Web, it has become very important for users to use automated tools to find the desired information resources. For example, a user uses Google or Yahoo search to find information. These factors give rise to the need for server-side and client-side intelligent systems that can effectively mine for knowledge. The information gathered through the Web is further evaluated using traditional data mining techniques such as clustering, classification and association.

SEARCHING THE WEB

http://infolab.stanford.edu/~ullman/mining/2008/slides/web_mining_overview.pdf

HOW BIG IS THE WEB


224,749,695 sites in total across all domains (Netcraft survey, March 2009)

http://news.netcraft.com/archives/web_server_survey.html

CHALLENGES IN WEB MINING


Finding useful and relevant information.
Creating knowledge from available information.
As the coverage of information is very wide and diverse, personalization of the information is a tedious process.
Learning customer and individual user patterns.
Much of the web information is redundant, as the same piece of information or a variant of it appears in many pages.
The web is noisy, i.e. a page typically contains a mixture of many kinds of information, such as main content, advertisements, copyright notices and navigation panels.
The web is dynamic: information keeps changing constantly. Keeping up with the changes and monitoring them is very important.

The Web is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.
The most important challenge faced is invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated without their knowledge or consent.

http://en.wikipedia.org/wiki/Web_mining

USES OF WEB MINING


This technology has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes.
The predictive capability of mining applications can benefit society by identifying criminal activities.
Companies can establish better customer relationships by giving customers exactly what they need.
Companies can understand the needs of the customer better and react to those needs faster.
Companies can find, attract and retain customers, and they can save on production costs by using the acquired insight into customer requirements.
They can increase profitability through target pricing based on the profiles created.
They can even find a customer who might defect to a competitor. The company can then try to retain that customer with promotional offers aimed specifically at them, thus reducing the risk of losing the customer.
http://en.wikipedia.org/wiki/Web_mining

WEB MINING vs DATA MINING


STRUCTURE
Data Mining: Data is structured and has well-defined tables, columns, rows, keys and constraints.
Web Mining: Data is dynamic and rich in features and patterns.

Web mining involves analysis of the web server logs of a website, whereas data mining involves using techniques to find relationships in large amounts of data.

SPEED
Web mining often needs to react to evolving usage patterns in real time, e.g. merchandising.

http://www.information-management.com/news/5458-1.html

WEB CRAWLERS
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders and Web robots.
Search engines use spidering as a means of providing up-to-date data.
Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.
Crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). Addresses are therefore often written in obfuscated form, e.g. anita at cs dot sunysb dot edu; mueller{remove this}@cs.sunysb.edu
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.
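The seed and crawl-frontier loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FAKE_WEB` dictionary and the `fetch` helper are hypothetical stand-ins for real HTTP requests (so the sketch runs without network access), and a real crawler would also honor robots.txt, rate limits and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web" (an assumption, not from the slides).
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a> <a href="/c.html">C</a>',
    "http://example.org/c.html": "no links here",
}


def fetch(url):
    """Stand-in for an HTTP GET: returns the page HTML, or None if unknown."""
    return FAKE_WEB.get(url)


def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


pages = crawl(["http://example.org/"])
```

Using a queue gives a breadth-first visiting order; swapping it for a priority queue would be one way to favor some URLs over others.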

April 21, 2009


WEB MINING TAXONOMY


Web Mining is divided into three areas:

Web Content Mining
- Identify information within given web pages
- Distinguish personal home pages from other web pages

Web Structure Mining
- Infer knowledge from the World-Wide Web organization and the links between references and referents in the Web

Web Usage Mining
- Also known as Web Log Mining
- Extract interesting patterns and trends in web access logs

WEB CONTENT MINING


Discovery of useful information from web contents / data / documents
Web data contents: text, image, audio, video, metadata and hyperlinks
Pre-processing data before web content mining: feature selection
Post-processing data can reduce ambiguous search results
Web Page Content Mining: mines the contents of documents directly
Search Engine Mining: improves on the content search of other tools like search engines
Web Content Mining is related to data mining and text mining
- It is related to data mining because many data mining techniques can be applied in Web content mining
- It is related to text mining because much of the web content is text

Issues in Web Content Mining


Developing intelligent tools for IR
- Finding keywords for key phrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting word relationships
Developing Web query systems
- WebOQL, XML-QL
Mining multimedia data
- Mining images from satellites (Fayyad, et al. 1996)
- Mining images to identify small volcanoes on Venus (Smyth, et al. 1996)

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

WEB STRUCTURE MINING


The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as edges connecting two related pages
Web Structure Mining is the process of discovering structure information from the Web
- Finding information about the web pages and inference on hyperlinks
- Retrieving information about the relevance and the quality of the web page
This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level
Finding authoritative Web pages
- Retrieving pages that are not only relevant but are also of high quality, or authoritative on the topic

WEB STRUCTURE MINING


Hyperlinks can be used to infer the notion of authority
- The Web consists not only of pages, but also of hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of latent human annotation
- A hyperlink pointing to another Web page can be considered the author's endorsement of that page
To discover the link structure of the hyperlinks at the inter-document level and to generate a structural summary about the Website and Web pages:
- Based on the hyperlinks, categorizing the Web pages and the generated information
- Discovering the structure of the Web document itself
- Discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain
The research at the hyperlink level is also called Hyperlink Analysis
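As a rough illustration of treating a hyperlink as an endorsement, the sketch below scores each page by the number of distinct pages linking to it. This in-link count is a deliberately crude stand-in chosen for this example; the slides do not name a specific authority algorithm, and the page names are made up.

```python
from collections import Counter


def inlink_counts(links):
    """Count distinct incoming links per page.

    links: iterable of (source_page, target_page) hyperlinks.
    Duplicate links and self-links are ignored, since neither adds
    an independent endorsement.
    """
    counts = Counter()
    for source, target in set(links):
        if source != target:
            counts[target] += 1
    return counts


# Hypothetical link structure of a small department site.
links = [
    ("home", "faculty"),
    ("courses", "faculty"),
    ("home", "courses"),
    ("faculty", "faculty"),   # self-link, ignored
    ("home", "faculty"),      # duplicate, ignored
]
scores = inlink_counts(links)
most_endorsed = scores.most_common(1)[0][0]
```

Link-analysis methods used in practice weight an endorsement by the standing of the page that gives it; the count above treats every link equally.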

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

WEB USAGE MINING


Web usage mining is also known as Web log mining
What is usage mining?
- Discovering user navigation patterns from web data
- Predicting user behavior while the user interacts with the web
- Helps to improve large collections of resources
Typical sources of data:
- Automatically generated data stored in server access logs, referrer logs, agent logs and client-side cookies
- User profiles
- Metadata: page attributes, content attributes, usage data

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt


WEB USAGE MINING


Applications:
- Target potential customers for electronic commerce
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization of sites
- Improve site design
- Fraud/intrusion detection
- Predict users' actions (allows pre-fetching)

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Problems with Web Logs


Session boundaries are unclear: typically a 30-minute timeout is used
Web content may be dynamic
- May not be able to reconstruct what the user saw
Spiders and automated agents request web pages automatically
Like most data mining tasks, web log mining requires preprocessing
- To identify users
- To match sessions to other data
- To fill in missing data
- Essentially, to reconstruct the click stream
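The 30-minute timeout rule can be sketched as follows. This is a minimal illustration, assuming each request has already been reduced to a (user key, timestamp, URL) triple; how the user key is derived (IP address, cookie, login) is exactly the identification problem the slide raises and is left open here.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split requests into sessions using an inactivity timeout.

    requests: iterable of (user_key, timestamp, url) triples.
    Returns a list of (user_key, [urls]) sessions in order of first request.
    """
    sessions = []
    last_seen = {}   # user_key -> (time of last request, index of open session)
    for user, ts, url in sorted(requests, key=lambda r: r[1]):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > timeout:
            # First request from this user, or gap too long: open a new session.
            sessions.append((user, [url]))
            idx = len(sessions) - 1
        else:
            idx = prev[1]
            sessions[idx][1].append(url)
        last_seen[user] = (ts, idx)
    return sessions


# Hypothetical click stream for two clients.
day = datetime(2009, 4, 21)
requests = [
    ("10.0.0.1", day.replace(hour=10, minute=0), "/"),
    ("10.0.0.2", day.replace(hour=10, minute=2), "/"),
    ("10.0.0.1", day.replace(hour=10, minute=5), "/a"),
    ("10.0.0.1", day.replace(hour=10, minute=50), "/b"),  # 45 min gap
]
sessions = sessionize(requests)
```

Here the 45-minute gap before `/b` exceeds the timeout, so the first client ends up with two sessions.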

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Problems with Web Logs


Identifying users
- Clients may have multiple streams
- Clients may access the web from multiple hosts
- Proxy servers: many clients/one address
- Proxy servers: one client/many addresses
Data not in log
- POST data (i.e., CGI request) not recorded
- Cookie data stored elsewhere
Other issues
- When does a session end?
- Pages may be cached

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Web Log Data Mining Applications


Association rules
- Find pages that are often viewed together
Clustering
- Cluster users based on browsing patterns
- Cluster pages based on content
Classification
- Relate user attributes to patterns
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
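Finding pages that are often viewed together can be illustrated by counting how many sessions contain each pair of pages and keeping the pairs above a minimum support. This sketch covers only the frequent-pair step, not full association rule generation with confidence, and the session data is invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return {(page_a, page_b): support} for pairs viewed together.

    sessions: list of page lists, one per user session.
    min_support: minimum fraction of sessions that must contain both pages.
    """
    pair_counts = Counter()
    for pages in sessions:
        # Each pair counts once per session, regardless of order or repeats.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    n = len(sessions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Hypothetical sessions.
sessions = [
    ["/", "/courses", "/faculty"],
    ["/", "/faculty"],
    ["/", "/courses"],
    ["/faculty", "/"],
]
pairs = frequent_page_pairs(sessions, min_support=0.5)
```

With these four sessions, the home page and the faculty page appear together in three of them, so that pair has support 0.75.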

Web Logs
Web servers have the ability to log all requests
Web server log formats:
- Most use the Common Log Format (CLF)
- The newer Extended Log Format allows configuration of the log file
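A Common Log Format line has the fields host, identity, user, timestamp, request line, status code and response size. The sketch below parses one such line with a regular expression; the field names in the pattern and the sample log line are choices made for this example.

```python
import re
from datetime import datetime

# host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no body was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf(sample)
```

Note that `%b` parses English month abbreviations only under the default locale, which is assumed here.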

Design of a Web Log Miner:
- Web log is filtered to generate a relational database
- A data cube is generated from the database
- OLAP is used to drill down and roll up in the cube
- OLAM is used for mining interesting knowledge

[Figure: Web log -> (1) Data Cleaning -> Database -> (2) Data Cube Creation -> Data Cube -> (3) OLAP -> Sliced and diced cube -> (4) Data Mining -> Knowledge]
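The cube operations named in the design can be illustrated on a toy scale. The sketch below is an assumption-laden simplification: the cube is a plain dictionary of hit counts keyed by (day, hour, page), these three dimensions are chosen only for the example, and a real OLAP engine would add hierarchies and precomputed aggregates.

```python
from collections import Counter


def build_cube(records):
    """Build a hit-count cube: cell (day, hour, page) -> number of requests."""
    return Counter(records)


def roll_up(cube, keep):
    """Aggregate away all dimensions except those whose indexes are in keep."""
    result = Counter()
    for cell, hits in cube.items():
        result[tuple(cell[i] for i in keep)] += hits
    return result


def slice_cube(cube, dim, value):
    """Keep only the cells where dimension number dim equals value."""
    return Counter({cell: hits for cell, hits in cube.items() if cell[dim] == value})


# Hypothetical cleaned log records: (day, hour, page).
records = [
    ("Mon", 9, "/"),
    ("Mon", 9, "/"),
    ("Mon", 10, "/courses"),
    ("Tue", 9, "/"),
    ("Tue", 9, "/courses"),
]
cube = build_cube(records)
hits_per_page = roll_up(cube, keep=(2,))    # roll up over day and hour
hits_per_day = roll_up(cube, keep=(0,))     # roll up over hour and page
monday = slice_cube(cube, dim=0, value="Mon")
```

Drilling down is the reverse direction: going from `hits_per_page` back to the finer (day, hour, page) cells of `cube`.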

Web Logs

http://mate.dm.uba.ar/~pfmislej/web%20mining/web%20mining.pdf

WEB MINING APPLICATIONS


Personalization, recommendation engines
Web-commerce applications
Intelligent web search
Hypertext classification and categorization
Information/trend monitoring
Analysis of online communities
Improving the relationship between the website and the user
- Recommendations to modify the web site structure and content
- Web personalization
Intelligent web sites
- These are systems that, based on user behavior, allow changes to be made to the current web site structure and content

paginas.fe.up.pt/~ec/files_0506/slides/06_WebMining.pdf

Personalization of Webpages

http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf

CONCLUSION
The Web has been adopted as a critical communication and information medium by a majority of the population.
Web data is growing at a significant rate.
A number of new Computer Science concepts and techniques have been developed.
Many successful applications exist.
Web mining is a fertile area of research.
Privacy: a real debate is needed.

VISUAL WEB MINING


Presentation 2
CSE 590 DATA MINING Prof. Anita Wasilewska SUNY Stony Brook Presented By: Alka Simha 106677801 Avanthi Gupta 106616697 Megha Krishnamurthy 106616749

Visual Web Mining


WWW2004, May 17-22, 2004, New York, New York, USA. ACM 1-58113-912-8/04/0005
Amir H. Youssefi, Rensselaer Polytechnic Institute, youssefi@cs.rpi.edu
David J. Duke, University of Bath, d.duke@bath.ac.uk
Mohammed J. Zaki, Rensselaer Polytechnic Institute, zaki@cs.rpi.edu

International World Wide Web Conference, May 17-22, 2004

References
http://www.cs.rpi.edu/~zaki/PS/WWW04.pdf
http://www.cs.rpi.edu/~youssefi/research/VWM/
http://www.vtk.org/
http://www.w3.org/Robot/
http://www.cs.rpi.edu

Overview
What is Visual Web Mining
Abstract
Introduction
Visual Web Mining Architecture
Visual Representation
Design and Implementation of Diagrams
Conclusion

What is Visual Web Mining


Application of information visualization techniques to the results of Web Mining in order to further amplify the perception of extracted patterns and to visually explore new ones in the web domain.

http://www.cs.rpi.edu/~youssefi/research/VWM/

Abstract
Analysis of web site usage data involves two significant challenges:
- Volume of data arising from the growth of the web.
- Structural complexity of web sites.

In this paper, the authors:
- Apply Data Mining and Information Visualization techniques to the web domain in order to benefit from the power of both human visual perception and computing.
- Apply Data Mining techniques to large web data sets and use Information Visualization methods on the results.

GOAL: To correlate the outcomes of mining Web Usage Logs and the extracted Web Structure by visually superimposing the results.

Introduction
Information Visualization
Visual representations of abstract data, using computer-supported, interactive visual interfaces to reinforce human cognition; thus enabling the viewer to gain knowledge about the internal structure of the data and relationships in it.

Visual Web Mining Framework


Provides a prototype implementation for applying information visualization techniques to the results of Data Mining.
User Session: a compact sequence of web accesses by a user.

Visualization is used in order to understand:
- The structure of a particular website.
- Web surfers' behavior when visiting that website.

Due to the large dataset and the structural complexity of the sites, 3D visual representations are used.
Implemented using an open source toolkit called the Visualization Toolkit (VTK).
- VTK consists of a C++ class library and several interpreted interface layers including Tcl/Tk, Java, and Python.

http://www.vtk.org/

Visual Web Mining Architecture


Input: Web pages and Web server log files.
A web robot (webbot) is used to retrieve the pages of the website.
- The webbot is a very fast Web walker with support for regular expressions, SQL logging facilities, and many other features. It can be used to check links, find bad HTML, map out a web site, download images, etc.
In parallel, Web server log files are downloaded and processed through a sessionizer, and a LOGML file is generated.
The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming and integrating data, finally loading it into a database and later generating graphs in XGML.
http://www.w3.org/Robot/

Visual Web Mining Architecture


User sessions are extracted from the web logs, which yields results roughly related to a specific user.
User sessions are then converted into a special format for sequence mining using cSPADE (constrained SPADE; SPADE stands for Sequential PAttern Discovery using Equivalence classes).
Outputs:
- Frequent contiguous sequences with a given minimum support.
- These are imported into a database, and non-maximal frequent sequences are removed.
- Different queries are executed against this data according to some criterion, e.g. support of each pattern, length of patterns, etc.
- Different URLs which correspond to the same webpage are unified in the final results.
The Visualization Stage: maps the extracted data and attributes into visual images, realized through VTK extended with support for graphs.
Result: interactive 3D/2D visualizations which can be used by analysts to compare actual web surfing patterns with expected patterns.
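To make the output of the mining stage concrete, the sketch below finds frequent contiguous sequences by brute-force enumeration and then drops the non-maximal ones. It is not cSPADE, which reaches the same kind of result far more efficiently; the function names, the `max_len` cap and the sample sessions are assumptions made for this example, and support is counted as the number of sessions containing a sequence.

```python
from collections import Counter


def contiguous_runs(session, max_len):
    """All distinct contiguous runs of 1..max_len pages in one session."""
    runs = set()
    for start in range(len(session)):
        for length in range(1, max_len + 1):
            if start + length <= len(session):
                runs.add(tuple(session[start:start + length]))
    return runs


def frequent_contiguous(sessions, min_support, max_len=5):
    """Runs contained in at least min_support sessions, with their support."""
    counts = Counter()
    for session in sessions:
        counts.update(contiguous_runs(session, max_len))
    return {seq: c for seq, c in counts.items() if c >= min_support}


def is_contiguous_in(short, long):
    """True if the tuple short occurs as a contiguous run inside long."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_only(frequent):
    """Remove every sequence that is contained in another frequent sequence."""
    return {
        seq: c
        for seq, c in frequent.items()
        if not any(seq != other and is_contiguous_in(seq, other)
                   for other in frequent)
    }


# Hypothetical user sessions.
sessions = [
    ["/", "/courses", "/cs101"],
    ["/", "/courses", "/cs101", "/hw"],
    ["/", "/faculty"],
    ["/courses", "/cs101"],
]
frequent = frequent_contiguous(sessions, min_support=2)
maximal = maximal_only(frequent)
```

In this data the path from the home page through the course list to one course page is the only maximal frequent sequence; its shorter sub-paths are frequent too but are removed as non-maximal.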

Visual Representation
Structures:
- Graphs
Extract a spanning tree from the site structure, and use this as the framework for presenting access-related results through glyphs (small graphical symbols) and color mapping.

- Stream Tubes
Variable-width tubes showing access paths with different traffic are introduced on top of the web graph structure.

Design and Implementation of Diagrams

This is a visualization of the web graph of the Computer Science department of Rensselaer Polytechnic Institute. Strahler numbers are used for assigning colors to edges. One can see user access paths scattering from the first page of the website (the node in the center) to clusters of web pages corresponding to faculty pages, course home pages, etc.

2D visualization layout with Strahler Coloring applied on web usage logs


The Strahler number is a numerical measure of branching complexity; here it is used to assign colors to the edges.
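The Strahler number of a tree node follows a simple rule: a leaf gets 1; an inner node takes the largest value among its children, plus 1 if that largest value occurs at two or more children. The sketch below computes it recursively on a small made-up site tree; how the values are then mapped to edge colors is not specified here.

```python
def strahler(tree, node):
    """Strahler number of node in a rooted tree.

    tree: dict mapping each node to the list of its children.
    """
    children = tree.get(node, [])
    if not children:
        return 1                      # leaves have Strahler number 1
    values = sorted((strahler(tree, child) for child in children), reverse=True)
    if len(values) >= 2 and values[0] == values[1]:
        return values[0] + 1          # the maximum is attained at least twice
    return values[0]


# Hypothetical spanning tree of a small site.
site = {
    "home": ["faculty", "courses"],
    "faculty": ["prof_a", "prof_b"],
    "courses": ["cs101"],
}
root_value = strahler(site, "home")
```

Here the faculty branch forks into two leaves and gets value 2, while the courses branch is a simple chain and stays at 1, so more heavily branching parts of the site receive higher numbers.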

http://www.cs.rpi.edu

Adding a third dimension enables visualization of more information and clarifies user behavior in and between clusters. The center node of the circular basement is the first page of the web site, from which users scatter to different clusters of web pages. The color spectrum from red (entry points into clusters) to blue (exit points) clarifies the behavior of users. The cylinder-like part of this figure is a visualization of the web usage of surfers as they browse a long HTML document.

3D visualization layout with Strahler Coloring applied on web usage logs

Left: One can observe long user sessions as strings falling off. These are a special type of long session in which the user navigates a sequence of web pages that come one after the other, e.g., sections of a long document. In many cases, web pages were found with many nodes connected by Next/Up/Previous hyperlinks. Right: An enlarged view of the same visualization.

Frequent access patterns extracted by the web mining process are visualized as a white graph on top of an embedded and colorful graph of web usage.

Superimposition of Frequent Patterns extracted from Web Mining on top of Web Usage

Similar to the last picture, with the addition of another attribute, i.e., the frequency of a pattern, which is rendered as the thickness of the white tubes. This helps in the analysis of results.

Thickness of the tubes represents frequency of found patterns

Superimposition of Web Usage on top of Web Structure with a higher-order layout. The top node is the first page of the website. The hierarchical output of the layouts makes analysis easier.

Higher Order layout for clear visualization and easier analysis

Left: Superimposition of website dynamics (colored) on top of its static structure (gray). Right: Zoom view of the colored region with the layout of Web Usage taken from the Web Graph basement. The basement itself is removed for clarity.

Conclusion
- Using the visualizations, a web analyst can easily identify which parts of the website are cold parts with few hits and which parts are hot ones with many hits, and classify them accordingly. This also paves the way for making exploratory changes to the website, for example adding links from hot parts of the web site to cold parts and then extracting, visualizing and interpreting the changes in access patterns.

SPADE OVERVIEW
An algorithm based on Apriori for fast discovery of frequent sequences.
Needs three database scans in order to extract sequential patterns.
Given: a database of customer transactions, each with the following characteristics: sequence-id or customer-id, transaction-time and the item involved in the transaction.
The aim is to obtain typical behaviors according to the user's viewpoint.

A user's browsing access pattern is amplified by a different coloring. Depending on the link structure of the underlying pages, we can see vertical access patterns of a user drilling down the cluster, making a cylinder shape. Users following links down a hierarchy of webpages make a cone shape, and users going up hierarchies, e.g., back to the main page of the website, make a funnel shape.

Amplification of a user session: Clickstream (Bottom Left) in drill-down cylinder, Cone Scatter (Top Right) and Funnel Backoff to main page of website (Top Right)
