» World Wide Web - a brief history
» Introduction to Data Mining
» Data Mining Process & Techniques
» Web Mining
» Data Mining vs. Web Mining
» Classification of Web Mining
» Benefits & Application Areas of Web Mining
» Web Mining Software
» Summary


World-Wide Web - a brief history
Who invented the World-Wide Web? (Sir) Tim Berners-Lee invented the World Wide Web in 1989 while working at CERN, including the URL scheme and HTML, and in 1990 wrote the first server (httpd) and the first browser.
Web's characteristics:
» Billions of documents authored by millions of diverse people
» Distributed over millions of computers, connected by a variety of media
» Large size, dynamic content, a time dimension, and multilingual content
» Different data types: text, images, hyperlinks, and user usage information

Mining Large Data Sets - Motivation
» There is often information "hidden" in the data that is not readily evident
» Human analysts may take weeks to discover useful information
» Much of the data is never analyzed at all

Data Mining - Definition
» It is commonly defined as the process of extracting meaningful information from data sources, e.g. databases, texts, images, the web, etc.
» It is the process of performing automated extraction and generating predictive information from large data banks, which enables us to understand current market trends and to take proactive measures to gain maximum benefit from them.
8/12/10

Data Mining Process

Data Mining Tasks
» Data mining makes use of various algorithms to perform a variety of tasks. These algorithms examine the sample data of a problem and determine a model that comes close to solving the problem.
» A predictive model enables you to predict the values of data by making use of known results from a different set of sample data. The tasks that form part of the predictive model are:
  » Classification
  » Regression
  » Time Series Analysis

Data Mining Tasks Contd.
» A descriptive model enables you to determine the patterns and relationships in sample data. The tasks that form part of the descriptive model are:
  » Clustering
  » Summarization
  » Association rules
  » Sequence discovery

Data Mining Tasks Contd.
» Classification: enables you to classify data in a large data bank into a predefined set of classes. Ex: People with age less than 40 and salary > 40k trade on-line.
» Regression: enables you to forecast data values based on present and past values. Ex: helps an organization predict the need for recruiting new employees and for purchases, based on the past and current growth rate.
» Time Series Analysis: enables you to predict future values when the current set of values is time dependent (monthly, yearly, ...).
» Summarization: enables you to summarize a large chunk of data contained in a web page.

Data Mining Tasks Contd.
» Clustering: enables you to create new groups (clusters) based on the study of patterns and relations between values of data in a data bank. It is similar to classification but does not require you to predefine groups (also called Unsupervised Learning). Ex: Users A and B access similar URLs.
» Association Rules: define certain rules of associativity between data items and then use those rules to establish relationships. Ex: Find the items that tend to be purchased together and specify their relationship.
» Sequence Discovery: enables you to determine the sequential patterns that might exist in a large and unorganized data bank. Ex: crime detection.
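As a rough illustration of the association-rule example (items that tend to be purchased together), the sketch below counts how often pairs of items co-occur in a few baskets and reports support and confidence for each frequent pair. The basket data, the function name `pair_rules`, and the 0.5 support threshold are all hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): (support, confidence of a -> b)} for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support >= min_support:
            # Confidence: of the baskets containing a, how many also contain b.
            rules[(a, b)] = (support, count / item_counts[a])
    return rules


# Hypothetical purchase baskets, invented for this example.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]
rules = pair_rules(baskets)
for pair, (support, confidence) in sorted(rules.items()):
    print(pair, round(support, 2), round(confidence, 2))
```

Only pairs are considered here; a full association-rule miner would also handle larger item sets.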

Data Mining Techniques
» Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. Any technique that helps extract more out of your data is useful. Data mining techniques include:
  » Statistical techniques: statistics is the branch of mathematics that deals with the collection and analysis of numerical data by using various methods and techniques.
  » Machine Learning: the process of building a computer system that is capable of acquiring data and integrating the data to generate useful knowledge.
  » Decision trees: a tree-shaped structure in which each branch represents a classification question, while the leaves of the tree represent partitions of the classified information.

Data Mining Techniques
» Hidden Markov Models: enable you to predict future actions to be taken in a time series. The model provides the probability of a future event when provided with the present and previous events.
» Genetic algorithms: if you have a certain set of sample data, a GA enables you to determine the best possible model out of a set of models to represent the sample data.
» Neural networks: a large set of historical data is analyzed in order to predict the output of a particular future situation or problem.

Data Mining vs. Web Mining
Traditional data mining
» Data is structured and relational
» Well-defined tables, columns, rows, keys, and constraints
Web data
» Semi-structured (HTML documents) and unstructured (free text)
» Readily available data
» Rich in features and patterns

Problems when interacting with the Web
» Finding relevant information
» Creating new knowledge out of the information available on the Web
» Personalization of the information
» Learning about consumers or individual users

Web Mining

Web Mining - Definition
» "Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data."
» The web mining process is similar to the data mining process; the difference is usually in the data collection.
» In data mining, the data is often already collected and stored in a data warehouse.
» In web mining, data collection can be a substantial task, especially for web structure and content mining, which involves crawling a large number of target web pages.

Web Mining - Subtasks
» Resource finding: retrieving intended documents
» Information selection/pre-processing: select and pre-process specific information from selected documents
» Generalization: discover general patterns at individual web sites as well as across multiple web sites
» Analysis: validation and/or interpretation of mined patterns

Web Mining Contd.
Web Mining is not IR:
» Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
Web Mining is not IE:
» Information extraction (IE) aims to extract the relevant facts from given documents
» IE systems for the general Web are not feasible
» Most focus on specific Web sites or content

Classification of Web Mining

Web Usage Mining
Web Usage Mining refers to the discovery of user access patterns from web usage logs. It involves the automatic discovery of patterns from one or more Web servers. The usage data records the user's behavior when the user browses or makes transactions on the web site; mining it helps to better understand and serve the needs of users or Web-based applications.

Web Usage Mining Contd.
Typical Sources of Data
» Automatically generated data: server access logs, proxy server logs, referrer logs, browser logs, bookmark data, mouse clicks and scrolls, and client-side cookies
» User profiles
» Meta data: page attributes, content attributes, usage data
Organizations often generate and collect large volumes of data; most of this information is usually generated automatically by Web servers and collected in server logs. Analyzing such data can help these organizations determine:
» The value of particular customers
» Cross-marketing strategies across products
» The effectiveness of promotional campaigns, etc.

Web Usage Mining Contd.
The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as:
» The number of accesses to the server
» The times or time intervals of visits
» The domain names and the URLs of users of the Web server
Two main categories:
» Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically
» Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web site or bias users towards the goals of the site

Web Usage Mining Contd.
» Web servers, Web proxies, and client applications can quite easily capture Web usage data.
» Web server log: every visit to the pages, what files have been requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used...
» By analyzing the Web usage data, web mining systems can discover useful knowledge about a system's usage characteristics and the users' interests, which has various applications:
  » Personalization and collaboration in Web-based systems
  » Marketing
  » Web site design and evaluation
  » Decision support
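To make the log fields listed above concrete, here is a minimal sketch that pulls the IP address, timestamp, requested file, status code, bytes sent, and browser string out of one log line. It assumes the widely used "combined" access-log layout; the slides do not say which format their server used, and the sample line and the name `parse_log_line` are made up for illustration.

```python
import re

# Pattern for the common "combined" access-log layout (an assumption;
# the slides do not specify a log format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Servers write "-" when no body was sent; treat that as zero bytes.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Made-up log line used only for illustration.
sample = (
    '192.0.2.10 - - [12/Aug/2010:10:15:32 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://example.com/start" "Mozilla/5.0"'
)
record = parse_log_line(sample)
print(record["ip"], record["path"], record["status"], record["bytes"])
```

Records parsed this way can then be counted or grouped, for example by IP address or by requested path, as a first step towards usage mining.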

Web Server Log - A Sample

Web Usage Mining Contd.
The technique of retrieving visitor-based information from web server log files and applying this information to analyze data is known as Web Log Mining. The major types of log files are:
» Access Log: maintains a list of all the web pages that visitors have requested.
» Agent Log: contains information about the browser that was used to explore the various web pages.

Web Content Mining
Web Content Mining (WCM) extracts or mines useful information or knowledge from web page contents. Patterns are extracted from online sources such as HTML files, text documents, images, e-books, email messages, audio, or video. The concept of WCM is far wider than searching for a specific term, keyword extraction, or simple statistics of words and phrases in documents. A tool that performs WCM can summarize a web page so that you need not read the complete document, saving time and energy.

Web Content Mining Contd.
The two basic approaches or models to implement WCM are:
» Local Knowledge Base Model: the abstract characterizations of several web pages are stored locally (i.e. references to several web sites relating to the categories are stored in a database, and based on the selection of the category the search is performed within the web site).
» Agent Based Model: this approach applies Artificial Intelligence systems known as Web Agents that can perform a search on behalf of a particular user for discovering and organizing documents in the web. Some web agents can apply individual user profiles for searching information from the web and organize and interpret the discovered information.

Preprocessing Content
Content Preparation:
» Extract text from HTML.
» Perform Stemming.
» Remove Stop Words.
» Calculate Collection Wide Word Frequencies (DF).
» Calculate per Document Term Frequencies (TF).
Vector Creation:
» Common Information Retrieval Technique.
» Each document (HTML page) is represented by a sparse vector of term weights.
» Typically, additional weight is given to terms appearing as keywords or in titles.
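The preparation and vector-creation steps can be sketched as follows. This is a minimal illustration, not the deck's own method: stemming and the extra weight for titles or keywords are left out, the stop-word list is a tiny stand-in, and the weight TF * log(N / DF) is one common choice that the slides do not name explicitly. The two HTML pages and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "from"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def tokenize(html):
    """Extract text from HTML, lowercase it, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return [w for w in words if w not in STOP_WORDS]


def term_vectors(pages):
    """Return one sparse {term: weight} vector per page, weight = TF * log(N / DF)."""
    tokenized = [tokenize(page) for page in pages]
    df = Counter()  # collection-wide document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(pages)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # per-document term frequency
        # Terms present in every page get weight 0 and are left out (sparse).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf if df[t] < n})
    return vectors


# Hypothetical pages, invented for this example.
pages = [
    "<html><body><h1>Web mining</h1><p>Mining the web logs</p></body></html>",
    "<html><body><p>Data mining of the data warehouse</p></body></html>",
]
vectors = term_vectors(pages)
print(vectors)
```

Storing only non-zero weights in a dictionary is what makes the vector sparse: most terms in the collection never occur in a given page.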

Common Mining Techniques
The more basic and popular data mining techniques include:
» Classification: classification on server logs using decision trees or a Naive Bayes classifier to discover the profiles of users belonging to a particular category.
» Clustering: can be used to group users exhibiting similar browsing patterns.
» Associations: can be used to relate pages that are most often referenced together in a single server session.
The other significant ideas are:
» Topic identification, tracking, and drift analysis
» Concept hierarchy creation
» Relevance of content
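As a small illustration of grouping users with similar browsing patterns, the sketch below compares the sets of pages each user visited using Jaccard similarity and greedily places users into groups. This greedy threshold grouping is a simple stand-in for a full clustering algorithm, and the visit data, the 0.5 threshold, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_users(visits, threshold=0.5):
    """Put each user in the first group whose first member is similar enough."""
    groups = []
    for user in sorted(visits):
        for group in groups:
            if jaccard(visits[user], visits[group[0]]) >= threshold:
                group.append(user)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([user])
    return groups


# Hypothetical pages visited per user, invented for this example.
visits = {
    "A": ["/home", "/products", "/cart"],
    "B": ["/home", "/products", "/checkout"],
    "C": ["/blog", "/about"],
}
groups = group_users(visits)
print(groups)
```

Users A and B share two of four distinct pages and end up together, while C forms its own group.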

Web Structure Mining
Web Structure Mining discovers useful knowledge from hyperlinks, which represent the structure of the web. A hyperlink is a structural component that connects a web page to a different location. Web structure mining extracts patterns from hyperlinks in the web and can be divided into two kinds:
» Using graph theory to analyze the node and connection structure of a web site.
» Mining the document structure: using the tree-like structure to analyze and describe the HTML or XML tags within a web page.

Web Structure Mining Contd.
Web structure is a useful source for extracting information such as:
» Web Page Classification
  » Classifying web pages according to various topics
» Quality of Web Page
  » The authority of a page on a topic
  » Ranking of web pages
» Which pages to crawl
  » Deciding which web pages to add to the collection of web pages
» Finding Related Pages
  » Given one relevant page, find all related pages

Web Structure Mining Contd.
Hyperlink-Induced Topic Search (HITS) is a common algorithm for knowledge discovery in the Web. The concept of HITS is:

Web Structure Mining
Identification of:
» Authorities: authoritative, high-quality web pages on broad topics
» Hubs: web pages that link to a collection of authorities
» A good authority is pointed to by many good hubs
» A good hub points to many good authorities
» In-links: the hyperlinks pointing to a page. Usually, the larger the number of in-links, the better a page is.
» Out-links: the hyperlinks found in a page.
Web structure mining has been largely influenced by research in:
» Social network analysis
» Citation analysis (bibliometrics)
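The mutual reinforcement between hubs and authorities can be sketched as a short iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is a minimal sketch of the HITS idea under simple assumptions (a fixed number of iterations, Euclidean normalization); the four-page link graph and the function name `hits` are hypothetical.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical four-page link graph, invented for this example.
links = {
    "hub1": ["siteA", "siteB"],
    "hub2": ["siteA", "siteB"],
    "siteA": [],
    "siteB": ["siteA"],
}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph siteA collects the most in-links from good hubs and so receives the highest authority score, while hub1 and hub2 receive equal hub scores.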

Web Structure Mining Contd.
» Each Web page is a node of the Web graph.
» The out-degree of a node is the number of distinct links originating at that node to other nodes.
» The probability, at any step, that the person will continue following links is a damping factor d = 0.85.
» N: number of web pages
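The quantities on this slide (damping factor d, out-degree, and N) match the random-surfer formulation of Page Rank, so the sketch below assumes that is the algorithm intended; the slide itself does not name it. Each page's rank is (1 - d) / N plus d times the rank passed on by the pages linking to it, where every page splits its rank evenly over its out-links. The three-page graph and the function name `pagerank` are hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    out = {p: set(links.get(p, [])) for p in pages}  # distinct out-links
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out[p])
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q in out[p]:
                # Each page passes its rank on in equal shares to its out-links.
                new_rank[q] += d * rank[p] / len(out[p])
        rank = new_rank
    return rank


# Hypothetical three-page graph, invented for this example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print({page: round(value, 3) for page, value in sorted(rank.items())})
```

Running a fixed number of iterations keeps the sketch short; a fuller version would stop once the ranks change by less than a small tolerance.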

Application Areas of Web Mining
» E-commerce
» Search Engines
» Personalization
» Website Design
Web mining applications
» Google
» Double Click
» AOL
» Ebay
» MyYahoo
» CiteSeer
» I-MODE
» v-TAG Web Mining Server

Applications Contd.
Amazon: A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a 'store visit'. Knowledge gained from Web mining is the key intelligence behind Amazon's features such as 'instant recommendations', 'purchase circles', 'wish-lists', etc.

Applications Contd.
Google
» Earlier search engines concentrated on the Web content to return the relevant pages to a query.
» Google was the first to introduce the importance of the link structure in mining information from the web.
» Page Rank, which measures the importance of a page, is the underlying technology in all Google search products.
» The Page Rank technology, which makes use of the structural information of the Web graph, is the key to returning quality results relevant to a query.

Benefits of Web Mining
» Match your available resources to visitor interests
» Increase the value of each visitor
» Improve the visitor's experience at the website
» Perform targeted resource management
» Collect information in new ways
» Test the relevance of content and web site architecture

Web Mining Software
» Web Miner
» Sinope Summarizer
» Teleport Pro
» Click Tracks

Summary
Major limitations of Web Mining research:
» Difficult to collect Web usage data across different Web sites.
» Lack of suitable test collections that can be reused by researchers.
Future research directions:
» Multimedia data mining: a picture is worth a thousand words.
» Multilingual knowledge extraction: Web page translations.
» The Hidden Web: forms, dynamically generated web pages.
» Semantic Web.
» Wireless Web: WML and HDML.

