Presentation 1
CSE 590 DATA MINING
Prof. Anita Wasilewska, SUNY Stony Brook
Presented By: Alka Simha (106677801), Avanthi Gupta (106616697), Megha Krishnamurthy (106616749)
REFERENCES
- Jiawei Han and Micheline Kamber. Data Mining: Concepts & Techniques.
- Presentation Slides of Prof. Anita Wasilewska.
- http://en.wikipedia.org/wiki/Web_mining
- http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
- http://searchcrm.techtarget.com/sDefinition/0,,sid11_gci789009,00.html
- http://www.cs.rpi.edu/~youssefi/research/VWM/
- http://www.galeas.de/webimining.html
- R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
- R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1, 5-32, 1999.
- S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.
- Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
- Y. S. Maarek and I. Z. B. Shaul. Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.
OVERVIEW
- What is Web Mining
- Challenges in Web Mining
- Data Mining vs. Web Mining
- Classification or Taxonomy
- Applications of Web Mining
- Conclusion
Because of the large amount of data available on the World Wide Web, it has become very important for users to rely on automated tools to find the information resources they want. For example, a user searches with Google or Yahoo to find information. These factors create the need for server-side and client-side intelligent systems that can effectively mine for knowledge. The information gathered through the Web is then evaluated using traditional data mining techniques such as clustering, classification, and association.
http://infolab.stanford.edu/~ullman/mining/2008/slides/web_mining_overview.pdf
http://news.netcraft.com/archives/web_server_survey.html
The Web is also about services. Many Web sites and pages let people perform operations with input parameters, i.e., they provide services. The most important challenge is invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated without their knowledge or consent.
http://en.wikipedia.org/wiki/Web_mining
Web mining involves analysis of the web server logs of a website, whereas data mining involves using techniques to find relationships in large amounts of data.
SPEED: Web mining often needs to react to evolving usage patterns in real time, e.g., merchandising.
http://www.information-management.com/news/5458-1.html
WEB CRAWLERS
- A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
- Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, and Web robots.
- Search engines use spidering as a means of providing up-to-date data.
- Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.
- Crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam), e.g., anita at cs dot sunysb dot edu ; mueller{remove this}@cs.sunysb.edu
- A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
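The seed and crawl-frontier loop described above can be sketched as follows. This is a minimal illustration, not a production crawler: the `crawl` function, the `fetch` callback, and the in-memory `PAGES` dictionary are all hypothetical names introduced here so the example runs without network access, and breadth-first ordering is an assumption.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seeds.

    `fetch` is a caller-supplied function mapping a URL to its HTML,
    or to None when the page cannot be retrieved (an assumption made
    here to keep the sketch independent of any HTTP library).
    """
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisits
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/index">home</a>',
    "/c": "",
}

order = crawl(["/index"], PAGES.get)
print(order)
```

A real crawler would also resolve relative URLs, respect robots.txt, and rate-limit its requests; those concerns are left out of this sketch.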
Web Mining
- Web Content Mining: Identify information within given web pages; distinguish personal home pages from other web pages.
- Web Structure Mining: Infer knowledge from the organization of the World Wide Web and the links between references and referents in the Web.
- Web Usage Mining: Also known as Web Log Mining; extract interesting patterns and trends from web access logs.
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
- services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitate personalization of sites
- Improve site design
- Fraud/intrusion detection
- Predict users' actions (allows pre-fetching)
Spiders and automated agents request web pages automatically.
Like most data mining tasks, web log mining requires preprocessing:
- To identify users
- To match sessions to other data
- To fill in missing data
- Essentially, to reconstruct the click stream
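One way to picture the click-stream reconstruction step is the sketch below. It rests on assumptions that are not in the slides: a user is approximated by a key such as IP address plus user agent, and a gap of more than 30 minutes starts a new session. The names `sessionize`, `SESSION_TIMEOUT`, and the sample data are hypothetical.

```python
from collections import defaultdict

# Assumed heuristic: more than 30 minutes of inactivity ends a session.
SESSION_TIMEOUT = 30 * 60  # seconds


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, url) tuples into per-user sessions.

    Returns a list of (user_key, [url, ...]) pairs, one per session,
    with the URLs of each session in time order.
    """
    by_user = defaultdict(list)
    for user_key, ts, url in requests:
        by_user[user_key].append((ts, url))

    sessions = []
    for user_key, hits in by_user.items():
        hits.sort()                      # order each user's hits by time
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:   # long gap: close the session
                sessions.append((user_key, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user_key, current))
    return sessions


# Hypothetical requests; timestamps are seconds since an arbitrary start.
requests = [
    ("10.0.0.1|Mozilla", 0, "/index"),
    ("10.0.0.1|Mozilla", 60, "/a"),
    ("10.0.0.2|Opera", 30, "/index"),
    ("10.0.0.1|Mozilla", 4000, "/b"),
]

sessions = sessionize(requests)
for user, path in sessions:
    print(user, path)
```

Requests from spiders would normally be filtered out before this step, for example by user agent, so that they do not appear as very long sessions.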
Web Logs
- Web servers have the ability to log all requests.
- Web server log formats: most use the Common Log Format (CLF).
- The newer Extended Log Format allows configuration of the log file.
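As an illustration of what a Common Log Format entry contains, the sketch below parses one line into named fields (host, identity, user, timestamp, request line, status, size). The function name `parse_clf` and the sample line are introduced here for illustration only.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf(sample)
print(record["host"], record["request"], record["status"], record["size"])
```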
Design of a Web Log Miner:
- The web log is filtered to generate a relational database.
- A data cube is generated from the database.
- OLAP is used to drill down and roll up in the cube.
- OLAM is used for mining interesting knowledge.
[Figure: web log miner pipeline. Web log, Database, Data Cube, Knowledge; step 4 is labeled Data Mining.]
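The roll-up and drill-down operations in the design above can be pictured with a small sketch. It assumes the log has already been cleaned into records with three dimensions (day, site section, status); the records, the `DIMENSIONS` tuple, and the `aggregate` function are hypothetical, and a real OLAP engine would do far more than count.

```python
from collections import Counter

# Hypothetical cleaned log records: (day, site section, HTTP status).
records = [
    ("Mon", "/products", 200),
    ("Mon", "/products", 200),
    ("Mon", "/support", 404),
    ("Tue", "/products", 200),
    ("Tue", "/support", 200),
]

DIMENSIONS = ("day", "section", "status")


def aggregate(rows, dims):
    """Count hits grouped by the chosen dimensions (one cuboid of the cube)."""
    positions = [DIMENSIONS.index(d) for d in dims]
    return Counter(tuple(row[i] for i in positions) for row in rows)


base = aggregate(records, ("day", "section", "status"))  # most detailed view
by_day = aggregate(records, ("day",))                    # roll-up to day
by_day_section = aggregate(records, ("day", "section"))  # drill-down from day

print(by_day)
print(by_day_section)
```

Rolling up drops dimensions and so merges counts; drilling down adds a dimension back and splits them again.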
Web Logs
http://mate.dm.uba.ar/~pfmislej/web%20mining/web%20mining.pdf
paginas.fe.up.pt/~ec/files_0506/slides/06_WebMining.pdf
Personalization of Webpages
http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
CONCLUSION
- The Web has been adopted as a critical communication and information medium by a majority of the population.
- Web data is growing at a significant rate.
- A number of new Computer Science concepts and techniques have been developed.
- Many successful applications exist.
- Fertile area of research.
- Privacy: real debate needed.
References
- http://www.cs.rpi.edu/~zaki/PS/WWW04.pdf
- http://www.cs.rpi.edu/~youssefi/research/VWM/
- http://www.vtk.org/
- http://www.w3.org/Robot/
- http://www.cs.rpi.edu
Overview
- What is Visual Web Mining
- Abstract
- Introduction
- Visual Web Mining Architecture
- Visual Representation
- Design and Implementation of diagrams
- Conclusion
http://www.cs.rpi.edu/~youssefi/research/VWM/
Abstract
Analysis of web site usage data involves two significant challenges:
- Volume of data arising from the growth of the web.
- Structural complexity of web sites.
In this paper:
- Data Mining and Information Visualization techniques are applied to the web domain in order to benefit from the power of both human visual perception and computing.
- Data Mining techniques are applied to large web data sets, and Information Visualization methods are used on the results.
GOAL: To correlate the outcomes of mining Web Usage Logs with the extracted Web Structure by visually superimposing the results.
Introduction
Information Visualization
Visual representations of abstract data, using computer-supported, interactive visual interfaces to reinforce human cognition; thus enabling the viewer to gain knowledge about the internal structure of the data and relationships in it.
- Due to the large dataset and the structural complexity of the sites, 3D visual representations are used.
- Implemented using an open source toolkit called the Visualization Toolkit (VTK).
- VTK consists of a C++ class library and several interpreted interface layers including Tcl/Tk, Java, and Python.
http://www.vtk.org/
Visual Representation
Structures:
- Graphs: Extract a spanning tree from the site structure and use it as the framework for presenting access-related results through glyphs (small graphical markers) and color mapping.
- Stream Tubes: Variable-width tubes, showing access paths with different traffic, are introduced on top of the web graph structure.
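Extracting a spanning tree from a site's link graph can be sketched as below. The slides do not say which traversal is used, so breadth-first search from the home page is an assumption here, and the `site` graph and `bfs_spanning_tree` function are hypothetical.

```python
from collections import deque


def bfs_spanning_tree(graph, root):
    """Return a child -> parent map for a BFS spanning tree rooted at `root`.

    `graph` maps each page to the list of pages it links to. Links that
    lead to an already reached page (back links, cross links) are left
    out of the tree, which removes the cycles of the site graph.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in parent:
                parent[target] = page
                queue.append(target)
    return parent


# Hypothetical site structure with back links and a cross link.
site = {
    "/": ["/docs", "/news"],
    "/docs": ["/docs/ch1", "/"],
    "/docs/ch1": ["/docs/ch2", "/docs"],
    "/news": ["/docs/ch1"],
}

tree = bfs_spanning_tree(site, "/")
for page, parent_page in tree.items():
    print(page, "<-", parent_page)
```

The resulting tree gives every page exactly one parent, which is what makes it usable as a layout framework on which usage data can then be drawn.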
http://www.cs.rpi.edu
Adding a third dimension enables visualization of more information and clarifies user behavior in and between clusters. The center node of the circular basement is the first page of the web site, from which users scatter to different clusters of web pages. A color spectrum from red (entry points into clusters) to blue (exit points) clarifies the behavior of users. The cylinder-like part of this figure is a visualization of the web usage of surfers as they browse a long HTML document.
Left: One can observe long user sessions as strings falling off. These are a special type of long session in which the user navigates a sequence of web pages that come one after the other, e.g., sections of a long document. In many cases, such web pages were found to have many nodes connected by Next/Up/Previous hyperlinks. Right: An enlarged view of the same visualization.
Frequent access patterns extracted by the web mining process are visualized as a white graph on top of an embedded and colorful graph of web usage.
Superimposition of Frequent Patterns extracted from Web Mining on top of Web Usage
Similar to the last picture, with the addition of another attribute, i.e., the frequency of each pattern, which is rendered as the thickness of the white tubes. This helps in the analysis of the results.
Superimposition of Web Usage on top of Web Structure with a higher-order layout. The top node is the first page of the website. The hierarchical output of the layouts makes analysis easier.
Left: Superimposition of website dynamics (colored) on top of its static structure (gray). Right: Zoom view of the colored region, with the layout of Web Usage taken from the Web Graph basement. The basement itself is removed for clarity.
Conclusion
- Using the visualizations, a web analyst can easily identify which parts of the website are cold (few hits) and which are hot (many hits), and classify them accordingly. This also paves the way for making exploratory changes to the website, for example adding links from hot parts of the site to cold parts and then extracting, visualizing, and interpreting the changes in access patterns.
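The hot/cold classification can also be expressed numerically, as in the sketch below. The slides identify hot and cold parts visually and give no cutoff, so the `threshold` parameter, the `classify_pages` function, and the sample hits are assumptions made purely for illustration.

```python
from collections import Counter


def classify_pages(hits, threshold):
    """Label each page 'hot' if it has at least `threshold` hits, else 'cold'."""
    counts = Counter(hits)
    return {page: ("hot" if n >= threshold else "cold")
            for page, n in counts.items()}


# Hypothetical page requests taken from a usage log.
hits = ["/", "/", "/", "/docs", "/docs", "/faq"]
labels = classify_pages(hits, threshold=2)
print(labels)
```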
SPADE OVERVIEW
- An algorithm based on Apriori for fast discovery of frequent sequences.
- Needs three database scans in order to extract sequential patterns.
- Given: a database of customer transactions, each of which has the following characteristics: sequence-id or customer-id, transaction-time, and the item involved in the transaction.
- The aim is to obtain typical behaviors according to the user's viewpoint.
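To make the input format concrete, the sketch below stores the transactions as (sequence-id, transaction-time, item) rows, converts them to per-item id-lists (the vertical layout SPADE works on), and joins two id-lists to count how many sequences contain one item followed later by another. It is a heavily simplified illustration limited to 2-item sequences, not the full SPADE algorithm; the data and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: (sequence_id, transaction_time, item).
db = [
    (1, 10, "A"), (1, 20, "B"), (1, 30, "C"),
    (2, 10, "A"), (2, 20, "C"),
    (3, 10, "B"), (3, 20, "A"), (3, 30, "C"),
]


def vertical_format(rows):
    """Map each item to its id-list of (sequence_id, time) occurrences."""
    idlists = defaultdict(list)
    for sid, time, item in rows:
        idlists[item].append((sid, time))
    return idlists


def support(idlist):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


def temporal_join(first, second):
    """Id-list for 'first, then later second' within the same sequence."""
    earliest = {}
    for sid, time in first:
        earliest[sid] = min(time, earliest.get(sid, time))
    return [(sid, time) for sid, time in second
            if sid in earliest and time > earliest[sid]]


def frequent_2_sequences(rows, min_support):
    """Return {(a, b): support} for frequent 'a then b' patterns."""
    idlists = vertical_format(rows)
    frequent = {item: ids for item, ids in idlists.items()
                if support(ids) >= min_support}
    result = {}
    for a, ids_a in frequent.items():
        for b, ids_b in frequent.items():
            count = support(temporal_join(ids_a, ids_b))
            if count >= min_support:
                result[(a, b)] = count
    return result


patterns = frequent_2_sequences(db, min_support=2)
print(patterns)
```

Only items that are frequent on their own are joined, which is the Apriori-style pruning idea: a sequence cannot be frequent if one of its items is not.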
A user's browsing access pattern is amplified by a different coloring. Depending on the link structure of the underlying pages, we can see the vertical access patterns of a user drilling down into a cluster, making a cylinder shape. Users following links down a hierarchy of web pages make a cone shape, and users going up hierarchies, e.g., back to the main page of the website, make a funnel shape.
Amplification of a user session: Clickstream (Bottom Left) in drill-down cylinder, Cone Scatter (Top Right), and Funnel Backoff to main page of website (Top Right)