1. INTRODUCTION
Most of us, without doubt, already know what the World Wide Web is and have used it extensively. The World Wide Web (or the Web for short) has impacted almost every aspect of our lives. It is the biggest and most widely known information source that is easily accessible and searchable. It consists of billions of interconnected documents (called Web pages) which are authored by millions of people. Since its inception, the Web has dramatically changed our information-seeking behaviour. Before the Web, finding information meant asking a friend or an expert, or buying/borrowing a book to read. With the Web, however, everything is only a few clicks away from the comfort of our homes or offices. Not only can we find needed information on the Web, but we can also easily share our information and knowledge with others. The Web has also become an important channel for conducting business. We can buy almost anything from online stores without needing to go to a physical shop. The Web also provides convenient means for us to communicate with each other, to express our views and opinions on anything, and to discuss issues with people from anywhere in the world. The Web is truly a virtual society. In this chapter, we introduce the Web, its history, and the topics that we will discuss in the seminar.

1.1 WHAT IS THE WORLD WIDE WEB?

The World Wide Web is officially defined as a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents. In simpler terms, the Web is an Internet-based computer network that allows users of one computer to access information stored on another through the world-wide network called the Internet. The Web's implementation follows a standard client-server model. In this model, a user relies on a program (called the client) to connect to a remote machine (called the server) where the data is stored.
Navigating through the Web is done by means of a client program called the browser, e.g., Netscape, Internet Explorer, Firefox, etc. Web browsers work by sending requests to remote servers for information and then interpreting the returned documents written in HTML, laying out the text and graphics on the user's computer screen on the client side.
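As a sketch of the request half of this exchange, the snippet below builds the raw HTTP/1.1 GET request a client sends for a URL. It is a simplified illustration: real browsers add many more headers, and `build_request` is a hypothetical helper, not a standard API.

```python
from urllib.parse import urlparse

def build_request(url):
    """Build the raw HTTP/1.1 GET request a client would send for a URL."""
    parts = urlparse(url)
    path = parts.path or "/"          # an empty path means the site root
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {parts.netloc}\r\n"
            f"Connection: close\r\n\r\n")

request = build_request("http://www.example.com/index.html")
print(request)
```

The server answers with a status line, headers, and the HTML body, which the browser then lays out for the user.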
WEB MINING Page 1
The operation of the Web relies on the structure of its hypertext documents. Hypertext allows Web page authors to link their documents to other related documents residing on computers anywhere in the world. To view these documents, one simply follows the links (called hyperlinks). The idea of hypertext was invented by Ted Nelson in 1965, who also created the well-known hypertext system Xanadu (http://xanadu.com/). Hypertext that also allows other media (e.g., image, audio and video files) is called hypermedia.
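To make the idea of hyperlinks concrete, here is a minimal sketch that extracts the link targets from a hypertext document using Python's standard library; the sample page is invented for illustration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags in a hypertext document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = ('<p>See <a href="http://xanadu.com/">Xanadu</a> and '
        '<a href="/history.html">history</a>.</p>')
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)
```

Following each extracted link to fetch further documents is exactly how both human readers and Web crawlers navigate the hypertext structure.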
The creation of the Web rested on three key technologies: the HyperText Transfer Protocol (HTTP) used for transferring Web documents, the HyperText Markup Language (HTML) used for authoring them, and the Universal Resource Locator (URL) used for addressing them. And so it began.

MOSAIC AND NETSCAPE BROWSERS:

The next significant event in the development of the Web was the arrival of Mosaic. In February 1993, Marc Andreessen from the University of Illinois NCSA (National Center for Supercomputing Applications) and his team released the first "Mosaic for X" graphical Web browser for UNIX. A few months later, different versions of Mosaic were released for the Macintosh and Windows operating systems. This was an important event: for the first time, a Web client with a consistent and simple point-and-click graphical user interface was implemented for the three most popular operating systems available at the time. It soon made big splashes outside the academic circle where it had begun. In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc Andreessen, and they founded the company Mosaic Communications (later renamed Netscape Communications). Within a few months, the Netscape browser was released to the public, which started the explosive growth of the Web. Internet Explorer from Microsoft entered the market in August 1995 and began to challenge Netscape. The creation of the World Wide Web by Tim Berners-Lee, followed by the release of the Mosaic browser, is often regarded as the two most significant contributing factors to the success and popularity of the Web.

INTERNET:

The Web would not be possible without the Internet, which provides the communication network for the Web to function. The Internet started with the computer network ARPANET in the Cold War era. It was produced as the result of a project in the United States aimed at maintaining control over its missiles and bombers after a nuclear attack. It was supported by the Advanced Research Projects Agency (ARPA), which was part of the Department of Defense in the United States.
The first ARPANET connections were made in 1969, and in 1972, it was demonstrated at the First International Conference on Computers and Communication, held in Washington D.C. At the conference, ARPA scientists linked computers together from 40 different locations.
In 1973, Vinton Cerf and Bob Kahn started to develop the protocol later to be called TCP/IP (Transmission Control Protocol/Internet Protocol). The next year, they published a paper on the Transmission Control Protocol, which marked the beginning of TCP/IP. This new protocol allowed diverse computer networks to interconnect and communicate with each other. In subsequent years, many networks were built, and many competing techniques and protocols were proposed and developed. However, ARPANET was still the backbone of the entire system. During this period, the network scene was chaotic. In 1982, TCP/IP was finally adopted, and the Internet, which is a connected set of networks using the TCP/IP protocol, was born.

SEARCH ENGINES:

With information being shared worldwide, there was a need for individuals to find information in an orderly and efficient manner. Thus began the development of search engines. The search system Excite was introduced in 1993 by six Stanford University students. EINet Galaxy was established in 1994 as part of the MCC Research Consortium at the University of Texas. Jerry Yang and David Filo created Yahoo! in 1994, which started out as a listing of their favourite Web sites and offered directory search. In subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves, Northern Light, etc. Google was launched in 1998 by Sergey Brin and Larry Page based on their research project at Stanford University. Microsoft started to commit to search in 2003 and launched the MSN search engine in spring 2005; before that, it had used search technology from other companies. Yahoo! provided a general search capability in 2004 after it purchased Inktomi in 2003.

W3C (THE WORLD WIDE WEB CONSORTIUM):

W3C was formed in December 1994 by MIT and CERN as an international organization to lead the development of the Web.
W3C's main objective was to promote standards for the evolution of the Web and interoperability between WWW products by producing specifications and reference software. The first International Conference on the World Wide Web (WWW) was also held in 1994, and it has been a yearly event ever since. From 1995 to 2001, the growth of the Web boomed. Investors saw commercial opportunities and became involved. Numerous businesses started on the Web, which led to irrational developments. Finally, the bubble burst in 2001. However, the development of the Web did not stop; it has only become more rational since.
Since anyone can write almost anything that one likes, a large amount of information on the Web is of low quality, erroneous, or even misleading.

6. The Web is also about services. Most commercial Web sites allow people to perform useful operations at their sites, e.g., to purchase products, to pay bills, and to fill in forms.

7. The Web is dynamic. Information on the Web changes constantly. Keeping up with the change and monitoring the change are important issues for many applications.

8. The Web is a virtual society. The Web is not only about data, information and services, but also about interactions among people, organizations and automated systems. One can communicate with people anywhere in the world easily and instantly, and also express one's views on anything in Internet forums, blogs and review sites.

All these characteristics present both challenges and opportunities for the mining and discovery of information and knowledge from the Web. To explore information mining on the Web, it is necessary to know data mining, which has been applied in many Web mining tasks. However, Web mining is not entirely an application of data mining. Due to the richness and diversity of information and other Web-specific characteristics discussed above, Web mining has developed many of its own algorithms.

1.3.1 WHAT IS DATA MINING?

Data mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and
visualization. There are many data mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule mining, and sequential pattern mining. We will discuss all of them in this seminar. A data mining application usually starts with an understanding of the application domain by data analysts (data miners), who then identify suitable data sources and the
target data. With the data, data mining can be performed, which is usually carried out in three main steps:

Pre-processing: The raw data is usually not suitable for mining for various reasons. It may need to be cleaned in order to remove noise or abnormalities. The data may also be too large and/or involve many irrelevant attributes, which calls for data reduction through sampling and attribute selection.

Data mining: The processed data is then fed to a data mining algorithm, which will produce patterns or knowledge.

Post-processing: In many applications, not all discovered patterns are useful. This step identifies the useful ones for applications. Various evaluation and visualization techniques are used to make the decision.

The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve final satisfactory results, which are then incorporated into real-world operational tasks. Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form. With the growth of the Web and text documents, Web mining and text mining are becoming increasingly important and popular.

1.3.2 WHAT IS WEB MINING?

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, as mentioned above it is not purely an application of traditional data mining due to the heterogeneity and semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms have been invented in the past decade. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three types: Web structure mining, Web content mining and Web usage mining. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.
Web mining should be decomposed into these subtasks:

1. Resource finding: The task of retrieving intended Web documents.

2. Information selection and pre-processing: Automatically selecting and pre-processing specific information from retrieved Web resources.

3. Generalization: Automatically discovering general patterns at individual Web sites as well as across multiple sites.

4. Analysis: Validation and/or interpretation of the mined patterns.

Resource finding is the process of retrieving data from text sources available on the Web, such as electronic magazines and newsletters or the text contents of HTML documents. The information selection and pre-processing step transforms the data retrieved in the resource finding step into a form usable by the mining algorithm. These transformations cover removing stop words, finding phrases in the training corpus, transforming the representation to relational or first-order logic form, etc. Data mining techniques and machine learning are often used for generalization. People play a very important role in the information and knowledge discovery process; this is important for the validation and/or interpretation in the last step.
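A toy sketch of the pre-processing transformations in step 2: tokenize, lowercase, and remove stop words. The stop-word list here is a small illustrative one, not a standard corpus.

```python
# Illustrative stop-word list; real systems use much larger standard lists.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(text):
    """Tokenize a document, strip punctuation, lowercase, drop stop words."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

doc = "The Web is a collection of interconnected documents."
print(preprocess(doc))
```

The surviving terms are what a generalization step (e.g., a learning algorithm over the corpus) would actually operate on.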
Finally, Web usage mining, also known as Web log mining, is the process of extracting interesting patterns from Web access logs. In this seminar, we will discuss all three types of mining. However, due to the richness and diversity of information on the Web, there are a large number of Web mining tasks. We will not be able to cover them all. We will only focus on some important tasks and their algorithms. The Web mining process is similar to the data mining process. The difference is usually in the data collection. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involves crawling a large number of target Web pages. We will devote a whole chapter to crawling. Once the data is collected, we go through the same three-step process: data pre-processing, Web data mining and post-processing. However, the techniques used for each step can be quite different from those used in traditional data mining.
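A minimal sketch of this three-step process on Web usage data might look like the following; the log lines and the frequency threshold are invented for illustration.

```python
import re
from collections import Counter

# Pattern for entries in the Common Log Format: host, timestamp,
# request line ("METHOD page PROTOCOL"), status code, and size.
LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) \S+')

raw_log = [
    '192.168.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'corrupted line with no useful fields',
    '192.168.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 4521',
    '192.168.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
]

# Pre-processing: keep only well-formed entries and extract the page field.
pages = [m.group(4) for line in raw_log if (m := LOG_PATTERN.match(line))]

# Web data mining: discover access-frequency patterns.
hits = Counter(pages)

# Post-processing: retain only pages above an interestingness threshold.
frequent = {page: n for page, n in hits.items() if n >= 2}
print(frequent)
```

Real usage-mining systems add steps such as user and session identification before pattern discovery, but the pre-process / mine / post-process shape is the same.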
2. DATA MINING
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention, to production control and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has witnessed an evolutionary path in the development of the following functionalities: data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining). For instance, the early development of data collection and database creation mechanisms served as a prerequisite for later development of effective mechanisms for data storage and retrieval, and query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target. Since the 1960s, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful database systems. The research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to the development of relational database systems, data modelling tools, and indexing and accessing methods. In addition, users gained convenient and flexible data access through query languages, user interfaces, optimized query processing, and transaction management. 
Efficient methods for on-line transaction processing (OLTP), where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data. Database technology since the mid-1980s has been characterized by the popular adoption of relational technology and an upsurge of research and development activities on new and powerful database systems. These promote the development of advanced data
models such as extended-relational, object-oriented, object-relational, and deductive models. Application-oriented database systems, including spatial, temporal, multimedia, active, stream, and sensor databases, scientific and engineering databases, knowledge bases, and office information bases, have flourished. Issues related to the distribution, diversification, and sharing of data have been studied extensively. Heterogeneous database systems and Internet-based global information systems such as the World Wide Web (WWW) have also emerged and play a vital role in the information industry. The steady and amazing progress of computer hardware technology in the past three decades has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. This technology provides a great boost to the database and information industry, and makes a huge number of databases and information repositories available for transaction management, information retrieval, and data analysis. Data can now be stored in many different kinds of databases and information repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. Data warehouse technology includes data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation as well as the ability to view information from different angles. Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time. In addition, huge volumes of data can be accumulated beyond databases and data warehouses.
Typical examples include the World Wide Web and data streams, where data flow in and out like streams, as in applications like video surveillance, telecommunication, and sensor networks. The effective and efficient analysis of data in such different forms becomes a challenging task. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data-rich but information-poor situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large data repositories become data tombs: data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker's intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data. In addition, consider expert system technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely time-consuming and costly. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research. The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into golden nuggets of knowledge.
Steps 1 to 4 are different forms of data pre-processing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation. We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data. Therefore, in this seminar, we choose to use the term data mining. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Based on this view, the architecture of a typical data mining system may have the following major components (Figure 2):

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Figure 2: ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM

Pattern evaluation module: This component typically employs interestingness measures (Section 1.5) and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user
to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms. From a data warehouse perspective, data mining can be viewed as an advanced stage of online analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data analysis. Although there are many data mining systems on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data should be more appropriately categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.
Cluster Detection/Market Basket Analysis: This is where the classic beer/diapers bought-together analysis came from. It finds groupings. Basically, this technique finds relationships in products or customers, or wherever you want to find associations in data.

Link Analysis: This is another technique for associating like records. It is not used too much, but there are some tools created just for this. As the name suggests, the technique tries to find links, whether in customers, transactions, etc., and demonstrate those links.

Visualization: This technique helps users understand their data. Visualization makes the bridge from text-based to graphical presentation. Such things as decision tree, rule, cluster and pattern visualization help users see data relationships rather than read about them. Many of the stronger data mining programs have made strides in improving their visual content over the past few years. This is really the vision of the future of data mining and analysis. Data volumes have grown to such huge levels that it will soon be impossible for humans to process them effectively by any text-based method. We will probably see an approach to data mining using visualization appear that will be something like Microsoft's Photosynth. The technology is there; it will just take an analyst with some vision to sit down and put it together.

Decision Tree/Rule Induction: Decision trees use real data mining algorithms. Decision trees help with classification and spit out information that is very descriptive, helping users to understand their data. A decision tree process will generate the rules followed in a process. For example, a lender at a bank goes through a set of rules when approving a loan. Based on the loan data a bank has, the outcomes of the loans (default or paid), and limits of acceptable levels of default, the decision tree can set up the guidelines for the lending institution.
These decision trees are very similar to the first decision support (or expert) systems.

Genetic Algorithms: GAs are techniques that act like bacteria growing in a Petri dish. You set up a data set, then give the GA the ability to try different things and a way to score whether a direction or outcome is favourable. The GA will move in a direction that will hopefully optimize the final result. GAs are used mostly for process optimization, such as scheduling, workflow, batching, and process re-engineering. Think of GAs as simulations run over and over to find optimal results, together with the infrastructure for running the simulations and the means to specify which results are optimal.
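As a toy illustration of this run-it-over-and-over idea, the sketch below evolves bit strings toward all ones (the classic OneMax task). The population size, mutation rate and fitness function are arbitrary choices for demonstration, not a recommended setup.

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable
LENGTH, POP, GENERATIONS = 20, 30, 40

def fitness(bits):
    """Favourable outcome: the more ones in the string, the better."""
    return sum(bits)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
initial_best = max(fitness(b) for b in population)
best = initial_best

for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]
    # Crossover and occasional mutation produce the next generation.
    children = []
    while len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LENGTH)
        child = a[:cut] + b[cut:]
        if random.random() < 0.1:
            child[random.randrange(LENGTH)] ^= 1  # flip one bit
        children.append(child)
    population = children
    best = max(best, max(fitness(b) for b in population))

print(initial_best, best)
```

Replacing the bit strings with schedules or workflow orderings, and the fitness function with a cost model, gives the process-optimization uses mentioned above.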
OLAP (Online Analytical Processing): OLAP allows users to browse data by following logical questions about the data. OLAP generally includes the ability to drill down into data, moving from highly summarized views of data into more detailed views. This is generally achieved by moving along hierarchies of data. For example, if one were analyzing populations, one could start with the most populous continent, then drill down to the most populous country, then to the state level, then to the city level, then to the neighbourhood level. OLAP also includes browsing up hierarchies (drill up), across different dimensions of data (drill across), and many other advanced techniques for browsing data, such as automatic time variation when drilling up or down time hierarchies. OLAP is by far the most implemented and used technique. It is also generally the most intuitive and easy to use.
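The population drill-down described above can be sketched in a few lines; the figures are illustrative, not real statistics, and a real OLAP engine would precompute such aggregates in a data cube rather than scan rows on demand.

```python
from collections import defaultdict

# Records follow a geographic hierarchy: (continent, country, population).
records = [
    ("Asia", "China", 1400),
    ("Asia", "India", 1380),
    ("Europe", "Germany", 83),
    ("Europe", "France", 68),
]

def rollup(rows, level):
    """Aggregate populations at the given level of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[-1]
    return dict(totals)

print(rollup(records, 0))              # summarized view: by continent
asia = [r for r in records if r[0] == "Asia"]
print(rollup(asia, 1))                 # drill down: countries within Asia
```

Drilling further (state, city, neighbourhood) just extends the hierarchy with more columns and repeats the same filter-then-aggregate step.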
2.3.2 DATA WAREHOUSES

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

2.3.3 TRANSACTIONAL DATABASES

In general, a transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction. The transactional database may have additional tables associated with it, which contain other information.
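A minimal sketch of such a transactional record layout, with a lookup by transaction identity number; the item names and IDs are invented for illustration.

```python
# Each record: a unique transaction ID plus the list of items purchased.
transactions = [
    {"trans_id": "T100", "items": ["bread", "milk", "beer"]},
    {"trans_id": "T200", "items": ["bread", "diapers"]},
    {"trans_id": "T300", "items": ["milk", "beer", "diapers"]},
]

def items_of(trans_id):
    """Look up the item list for a given transaction identity number."""
    for t in transactions:
        if t["trans_id"] == trans_id:
            return t["items"]
    return []

print(items_of("T200"))
```

This is exactly the input shape that market basket analysis and association rule mining operate on.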
Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, or multimedia data mining system, or a World Wide Web mining system.

Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data-warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique that combines the merits of a few individual approaches.
Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they adapt. For example, data mining Systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.
3. WEB MINING
The following figure briefly shows the architecture of Web mining. It is divided into two stages: Stage 1 covers the data preparation, and Stage 2 covers the analysis. According to the analysis target, Web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.
Figure 3: ARCHITECTURE OF WEB MINING. Stage 1 (data preparation) takes the raw log, registration data and transaction data through data cleaning, transaction identification, data integration and transformation, producing a clean log. Stage 2 (analysis) performs pattern discovery and pattern analysis, including path analysis, sequential patterns and intelligent agents.

Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad). The most commonly used techniques in data mining are artificial neural networks, decision trees, genetic algorithms, the nearest neighbour method, and rule induction. Data mining research has drawn on a number of other fields such as inductive learning, machine learning and statistics.
Machine learning is the automation of a learning process, where learning is based on observations of environmental statistics and transitions. Machine learning examines previous examples and their outcomes, and learns how to reproduce these and make generalizations about new cases. Inductive learning: induction means the inference of information from data, and inductive learning is a model-building process where the database is analyzed to find patterns. The main strategies are supervised learning and unsupervised learning. Statistics: used to detect unusual patterns and to explain patterns using statistical models such as linear models. A data mining model can be a discovery model, in which the system automatically discovers important information hidden in the data, or a verification model, which takes a hypothesis from the user and tests its validity against the data. The web contains a collection of pages that includes countless hyperlinks and huge volumes of access and usage information. Because of the ever-increasing amount of information in cyberspace, knowledge discovery and web mining are becoming critical for successfully conducting business in the cyber world. Web mining is the discovery and analysis of useful information from the web: the use of data mining techniques to automatically discover and extract information from web documents and services (content, structure, and usage).
1. Association Rule Mining: Predicts the association and correlation among sets of items, where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. That is, it 1) discovers the correlations between pages that are most often referenced together in a single server session/user session; 2) provides information such as: i. What sets of pages are frequently accessed together by web users? ii. What page will be fetched next? iii. What paths are frequently accessed by web users? 3) Associations and correlations: i. Page associations from usage data (user sessions, user transactions). ii. Page associations from content data (similarity based on content analysis). iii. Page associations based on structure (link connectivity between pages). Advantages: Guides web site restructuring by adding links that interconnect pages often viewed together; improves system performance by prefetching web data.
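As an illustration, the page-association idea above can be sketched in a few lines of Python. The sessions, page names, and support threshold are invented for the example; a real miner would run a full Apriori-style algorithm over cleaned server logs rather than only counting pairs:

```python
from itertools import combinations

# Hypothetical user sessions: each is the set of pages viewed together.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
    {"home", "products", "cart"},
]

min_support = 0.4  # fraction of sessions that must contain the pair


def pair_rules(sessions, min_support):
    """Find page pairs viewed together often, with support and confidence."""
    n = len(sessions)
    page_count, pair_count = {}, {}
    for s in sessions:
        for p in s:
            page_count[p] = page_count.get(p, 0) + 1
        for a, b in combinations(sorted(s), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support >= min_support:
            # Emit both directions: confidence of a -> b and of b -> a.
            rules.append((a, b, support, c / page_count[a]))
            rules.append((b, a, support, c / page_count[b]))
    return rules


for lhs, rhs, sup, conf in sorted(pair_rules(sessions, min_support)):
    print(f"{lhs} -> {rhs}  support={sup:.2f} confidence={conf:.2f}")
```

A rule such as "cart -> products" with high confidence is exactly the kind of pattern that suggests adding an interconnecting link or prefetching the products page.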
2. Sequential Pattern Discovery: Applied to web access server transaction logs. The purpose is to discover sequential patterns that indicate user visit patterns over a certain period, that is, the order in which URLs tend to be accessed. Advantages: Useful user trends can be discovered, and predictions concerning visit patterns can be made; improves website navigation; personalizes advertisements; dynamically reorganizes the link structure and adapts web site contents to individual client requirements, or provides clients with automatic recommendations that best suit customer profiles. 3. Clustering: Groups together items (users, pages, etc.) that have similar characteristics. a) Page clusters: groups of pages that seem to be conceptually related according to users' perception.
b) User clusters: groups of users that seem to behave similarly when navigating through a web site. 4. Classification: Maps a data item into one of several predetermined classes. Example: describing each user category using profiles. Classification algorithms include decision trees, the naïve Bayesian classifier, and neural networks. 5. Path Analysis: A technique that involves the generation of some form of graph that represents relations defined on web pages. This can be the physical layout of a web site, in which the web pages are nodes and the links between these pages are directed edges. Most graph work involves determining frequent traversal patterns, i.e., the more frequently visited paths in a web site. Example: What paths do users traverse before they go to a particular URL? To use data mining on our web site, we have to establish and record visitor and item characteristics, and visitor interactions. Visitor characteristics include: i. Demographics: tangible attributes such as home address, income, property, etc. ii. Psychographics: personality types such as early technology interest and buying tendencies. iii. Technographics: attributes of a visitor's system, such as operating system, browser, and modem speed. Item characteristics include: i. Web content information: media type, content category, URL. ii. Product information: product category, colour, size, price. Visitor interactions include: i. Visitor-item interactions: purchase history, advertising history, and preference information. ii. Visitor site statistics: per-session characteristics, such as total time, pages viewed, and so on. We have a lot of information about web visitors and content, but we probably are not making the best use of it. The existing OLAP systems can report only on directly
observed and easily correlated information. They rely on users to discover patterns and decide what to do with them, yet the information is often too complex for humans to discover these patterns using an OLAP system. To solve these problems, data mining techniques are utilized. The scope of data mining is: i. automated prediction of trends and behaviours; ii. automated discovery of previously unknown patterns. Web mining searches for: i. web access patterns, ii. web structure, iii. regularity and dynamics of web contents. Web mining research is a converging research area drawing on several research communities, such as the database, information retrieval, and AI communities, especially machine learning and natural language processing. The World Wide Web is a popular and interactive medium for gathering information today. The WWW provides every Internet citizen with access to an abundance of information, but users encounter some problems when interacting with the web: i. Finding relevant information (information overload: only a small portion of web pages contains truly relevant/useful information): a) low precision (the abundance problem: 99% of information is of no interest to 99% of people), due to the irrelevance of many of the search results, which makes it difficult to find the relevant information; b) low recall (limited coverage of the web: Internet sources hidden behind search interfaces), due to the inability to index all the information available on the web, which makes it difficult to find the unindexed information that is relevant. ii. Discovery of existing but hidden knowledge (search engines retrieve only about one third of the indexable web). iii. Personalization of the information (type and presentation of information): limited customization to individual users. iv. Learning about customers/individual users.
v. Lack of feedback on human activities. vi. Lack of multidimensional analysis and data mining support. vii. The web constitutes a highly dynamic information source. Not only does the web continue to grow rapidly, the information it holds also receives constant updates. News, stock market, service centre, and corporate sites revise their web pages regularly. Linkage information and access records also undergo frequent updates. viii. The web serves a broad spectrum of user communities. The Internet's rapidly expanding user community connects millions of workstations, with widely varying usage purposes. Many users lack good knowledge of the information network's structure, are unaware of a particular search's heavy cost, frequently get lost within the web's ocean of information, and face the lengthy waits required to retrieve search results. ix. Web page complexity far exceeds the complexity of any traditional text document collection. Although the web functions as a huge digital library, the pages themselves lack a uniform structure and contain far more variation in authoring style and content than any set of books or traditional text-based documents; moreover, searching it is extremely difficult. Common problems web marketers want to solve are how to target advertisements (targeting), personalize web pages (personalization), create web pages that show products often bought together (associations), classify articles automatically (classification), characterize groups of similar visitors (clustering), estimate missing data, and predict future behaviour. In general, web mining tasks are: i. mining web search engine data; ii. analyzing the web's link structures; iii. classifying web documents automatically; iv. mining web page semantic structure and page contents; v. mining web dynamics; vi. personalization. Thus, web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data. Web mining aims
at finding and extracting relevant information that is hidden in web-related data, in particular in text documents published on the web. Like data mining, it is a multidisciplinary effort that draws techniques from fields such as information retrieval, statistics, machine learning, natural language processing and others. Web mining can be a promising tool to address ineffective search engines that produce incomplete indexing, retrieval of irrelevant information, or unverified reliability of retrieved information. It is essential to have a system that helps the user find relevant and reliable information easily and quickly on the web. Web mining not only discovers information from mounds of data on the WWW, but also monitors and predicts user visit patterns. This gives designers more reliable information for structuring and designing a web site. Given the rate of growth of the web, the scalability of search engines is a key issue, as the amount of hardware and network resources needed is large and expensive. In addition, search engines are popular tools, so they have heavy constraints on query answer time. Efficient use of resources can therefore improve both scalability and answer time, and one tool to achieve this goal is web mining.
Figure 4: WEB MINING TAXONOMY

3.3.1 Web Content Mining Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to web content has been the most widely researched. Issues addressed in text mining include topic discovery and tracking, extracting association patterns, clustering of web documents, and classification of web pages. Research activities on this topic have drawn heavily on techniques developed in other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While there exists a significant body of work on extracting knowledge from images in the fields of image processing and computer vision, the application of these techniques to web content mining has been limited.
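Topic discovery and the other text-mining tasks above typically start from simple term statistics. A minimal, illustrative sketch follows; the page text and the tiny stop-word list are invented for the example, and a real system would use TF-IDF weighting over a whole collection:

```python
import re
from collections import Counter

# A hypothetical page's visible text.
page_text = """Web content mining extracts useful knowledge from web pages.
Content mining applies text mining techniques such as clustering and
classification to the text of web pages."""

# Minimal stop-word list for the sketch; a real system uses a fuller one.
stopwords = {"the", "of", "and", "to", "such", "as", "from", "a"}

words = re.findall(r"[a-z]+", page_text.lower())
counts = Counter(w for w in words if w not in stopwords)

# The most frequent content words act as crude topic keywords.
print(counts.most_common(3))
```

Even this crude frequency count surfaces "web", "mining" and "content" as candidate topic keywords for the page.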
Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content into a representation that can be exploited by machines. The usual approach to exploiting known structure in documents is to use wrappers to map documents to some data model. Techniques using lexicons for content interpretation are yet to come. There are two groups of web content mining strategies: those that directly mine the content of documents, and those that improve on the content search of other tools like search engines.

3.3.2 Web Structure Mining The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web. It can be further divided into two kinds based on the kind of structure information used. Hyperlinks: A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink. Document structure: In addition, the content within a web page can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents (Wang and Liu 1998; Moh, Lim, and Ng 2000). The World Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. This can be compared to bibliographic citations: when a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of this information conveyed by the links to find pertinent web
pages. By means of counters, higher levels accumulate the number of artefacts subsumed by the concepts they hold. 3.3.3 Web Usage Mining Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behaviour at a web site. Web usage mining itself can be classified further depending on the kind of usage data considered. Web server data: User logs are collected by the web server and typically include IP address, page reference and access time. Application server data: Commercial application servers such as WebLogic and StoryServer have significant features that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs. Application level data: New kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the above categories. Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help understand user behaviour and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in web usage mining, driven by the applications of the discoveries: general access pattern tracking and customized usage tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends.
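The web server data described above is typically recorded in the Common Log Format. A minimal sketch of cleaning such logs and identifying per-user sessions follows; the log lines, IP addresses, and the 30-minute idle-timeout heuristic are invented/assumed for the example:

```python
import re
from datetime import datetime, timedelta

# Hypothetical Common Log Format lines as a web server might record them.
log_lines = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:10 +0000] "GET /products.html HTTP/1.1" 200 511',
    '10.0.0.1 - - [10/Oct/2023:14:40:02 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:13:57:00 +0000] "GET /about.html HTTP/1.1" 200 310',
]

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) HTTP/[\d.]+" \d+ \d+$')
TIMEOUT = timedelta(minutes=30)  # a common session-timeout heuristic


def sessions(lines):
    """Group requests into per-IP sessions split by a 30-minute idle gap."""
    last_seen, result = {}, {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # data cleaning: skip malformed entries
        ip, ts, page = m.group(1), m.group(2), m.group(3)
        t = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
        prev = last_seen.get(ip)
        if prev is None or t - prev > TIMEOUT:
            result.setdefault(ip, []).append([])  # start a new session
        result[ip][-1].append(page)
        last_seen[ip] = t
    return result


print(sessions(log_lines))
```

The resulting per-session page lists are exactly the transactions that association-rule and sequential-pattern discovery operate on.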
are considered authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release can result in serious lawsuits, but there is no law preventing them from trading the data. Some mining algorithms might use controversial attributes like sex, race, religion, or sexual orientation to categorize individuals. These practices might be against anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the usage of such algorithms with such attributes. This process could result in the denial of a service or privilege to an individual based on race, religion or sexual orientation; at present this situation can be avoided only by the high ethical standards maintained by the data mining company. The collected data is made anonymous so that the obtained data and patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy, but in fact much extra information can be inferred by combining separate pieces of data about the user.
CONTENT MINING
Agent Based Approach 1. Intelligent Search Agents 2. Information Filtering/Categorization 3. Personalized Web Agents
4.2.1 AGENT BASED APPROACHES:
Agent-based approaches involve AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information. They focus on intelligent and autonomous web mining tools based on agent technology. i. Some intelligent web agents can use a user profile to search for relevant information, then organize and interpret the discovered information. Example: Harvest.
ii. Some use various information retrieval techniques and the characteristics of open hypertext documents to organize and filter retrieved information. Example: HyPursuit. iii. Some learn user preferences and use those preferences to discover information sources for those particular users. Agent-based web mining systems can be placed into the following three categories. Intelligent Search Agents: Several intelligent web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest, FAQ-Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources, to retrieve and interpret documents. Agents such as ShopBot and ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy. Information Filtering/Categorization: A number of web agents use various information retrieval techniques and the characteristics of open hypertext web documents to automatically retrieve and categorize them. HyPursuit uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents and to structure an information space. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to organize a collection of web documents based on conceptual information. Personalized Web Agents: This category of web agents learns user preferences and discovers web information sources based on these preferences, and on those of other individuals with similar interests.
4.2.2 DATABASE APPROACHES: The database approach focuses on integrating and organizing the heterogeneous and semi-structured data on the web into more structured, higher-level collections of resources. These metadata, or generalizations, are then organized into structured collections that can be accessed and analyzed.
Database approaches to web mining have focused on techniques for organizing the semi-structured data on the web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze them. Multilevel Databases: The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various web repositories, such as hypertext documents. At the higher levels, metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e., relational or object-oriented databases. For example, Han et al. use a multilayered database where each layer is obtained via generalization and transformation operations performed on the lower layers. Kholsa et al. propose the creation and maintenance of meta-databases at each information-providing domain and the use of a global schema for the meta-database; they advocate the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema. The ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived web hypertexts, which are generalizations of the notion of database views. Web Query Systems: Many web-based query systems and languages utilize standard database query languages such as SQL, structural information about web documents, and even natural language processing for the queries used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques. WebLog is a logic-based query language for restructuring and extracting information from web information sources. Lorel and UnQL query heterogeneous and semi-structured information on the web using a labelled graph data model. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.
products and services. Extracting such data allows one to provide value-added services, e.g., comparative shopping and meta-search. Structured data is also easier to extract than unstructured text. This problem has been studied by researchers in the AI, database, data mining, and web communities. There are several approaches to structured data extraction, which is also called wrapper generation. The first approach is to manually write an extraction program for each web site based on observed format patterns of the site. This approach is very labour-intensive and time-consuming, and thus does not scale to a large number of sites. The second approach is wrapper induction or wrapper learning, which is currently the main technique. Wrapper learning works as follows: the user first manually labels a set of training pages; a learning system then generates rules from the training pages; the resulting rules are then applied to extract target items from web pages. The third approach is the automatic approach: since structured data objects on the web are normally database records retrieved from underlying databases and displayed in web pages with some fixed templates, automatic methods aim to find the patterns/grammars in the web pages and then use them to extract data. 4.3.2 Unstructured Text Extraction Most web pages can be seen as text documents. Extracting information from web documents has also been studied by many researchers. The research is closely related to text mining, information retrieval and natural language processing. Current techniques are mainly based on machine learning and natural language processing to learn extraction rules. Recently, a number of researchers have also made use of common language patterns (common sentence structures used to express certain facts or relations) and the redundancy of information on the web to find concepts, relations among concepts, and named entities. The patterns can be automatically learnt or supplied by human users.
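The extraction-rule idea, whether hand-written, learned, or automatically induced, can be illustrated with a minimal hand-written wrapper. The HTML template and the rule below are invented for a hypothetical product-listing site; real wrappers handle attribute variations, nesting, and escaping that a single regular expression cannot:

```python
import re

# Hypothetical product-listing HTML following one fixed site template.
html = """
<div class="item"><span class="name">USB Cable</span><span class="price">$3.99</span></div>
<div class="item"><span class="name">Mouse</span><span class="price">$12.50</span></div>
"""

# A hand-written extraction rule keyed to the site's observed format pattern.
RULE = re.compile(
    r'<span class="name">(.*?)</span><span class="price">\$([\d.]+)</span>'
)

# Turn the template-formatted page into structured (name, price) records.
records = [(name, float(price)) for name, price in RULE.findall(html)]
print(records)
```

Because the rule is tied to one site's template, it must be rewritten for every site, which is precisely why wrapper induction and fully automatic pattern discovery were developed.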
Another direction of research in this area is web question-answering. Although question-answering was first studied in the information retrieval literature, it has become very important on the web, as the web offers the largest source of information and the objective of many web search queries is to obtain answers to simple questions. Such systems extend question-answering to the web by query transformation, query expansion, and answer selection. 4.3.3 Web Information Integration
Due to the sheer scale of the web and its diverse authorship, various web sites may use different syntaxes to express similar or related information. In order to make use of or extract information from multiple sites to provide value-added services, e.g., metasearch, deep web search, etc., one needs to semantically integrate information from multiple sources. Recently, several researchers have attempted this task. Two popular problems related to the web are (1) web query interface integration, to enable querying multiple web databases, and (2) schema matching, e.g., integrating Yahoo's and Google's directories to match concepts in the hierarchies. The ability to query multiple deep web databases is attractive and interesting because the deep web contains a huge amount of information or data that is not indexed by general search engines. 4.3.4 Building Concept Hierarchies Because of the huge size of the web, the organization of information is obviously an important issue. Although it is hard to organize the whole web, it is feasible to organize the web search results of a given query. A linear list of ranked pages produced by search engines is insufficient for many applications. The standard method for information organization is a concept hierarchy and/or categorization. The popular technique for hierarchy construction is text clustering, which groups similar search results together in a hierarchical fashion. An alternative approach exploits the existing organizational structures in the original web documents, emphasizing tags and language patterns, to perform data mining that finds important concepts, sub-concepts and their hierarchical relationships. In other words, it makes use of the information-redundancy property and the semi-structured nature of the web to find which concepts are important and what their relationships might be. This line of work aims to compile a survey article or a book on the web automatically.
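The clustering approach to organizing search results can be sketched as follows. The snippets and the similarity threshold are invented for the example, and raw word overlap (Jaccard similarity) stands in for the TF-IDF vectors a real system would use:

```python
# Hypothetical search-result snippets to organize.
snippets = [
    "python web mining tutorial",
    "web mining with python examples",
    "best pizza recipes",
    "easy pizza dough recipes",
]


def jaccard(a, b):
    """Word-overlap similarity between two snippets."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)


def cluster(snippets, threshold=0.3):
    """Greedy single-pass clustering: join a snippet to the first
    existing cluster whose representative is similar enough."""
    clusters = []
    for s in snippets:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


for c in cluster(snippets):
    print(c)
```

Applied recursively with decreasing thresholds, the same grouping step yields the hierarchy of concepts and sub-concepts described above.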
4.3.5 Segmenting Web Pages & Detecting Noise A typical web page consists of many blocks or areas, e.g., main content areas, navigation areas, advertisements, etc. It is useful to separate these areas automatically for several practical applications. For example, in web data mining tasks such as classification and clustering, identifying main content areas or removing noisy blocks (e.g., advertisements, navigation panels) enables one to produce much better results. It has been shown that the information contained in noisy blocks can seriously harm web data mining. Another application is web browsing on a small-screen device, such as a PDA. Identifying
different content blocks allows one to rearrange the layout of the page so that the main content can be seen easily without losing any other information from the page. 4.3.6 Mining Web Opinion Sources Consumer opinions used to be very difficult to obtain before the web was available. Companies usually conducted consumer surveys or engaged external consultants to find such opinions about their products and those of their competitors. Now much of this information is publicly available on the web. There are numerous web sites and pages containing consumer opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. This online word-of-mouth behaviour represents a new and measurable source of information for marketing intelligence. Techniques are now being developed to exploit these sources to help companies and individuals gain such information effectively and easily. For instance, one approach proposes a feature-based summarization method to automatically analyze consumer opinions in customer reviews from online merchant sites and dedicated review sites. The resulting summary is useful to both potential customers and product manufacturers.
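A toy version of feature-based opinion summarization can make the idea concrete. The reviews, feature list, and the tiny sentiment lexicons below are all invented for the sketch; real systems extract features and opinion words from data rather than hard-coding them:

```python
# Hypothetical review sentences and tiny sentiment lexicons for the sketch.
reviews = [
    "the battery life is great",
    "terrible battery drains fast",
    "the screen is great and bright",
]
features = {"battery", "screen"}
positive = {"great", "bright"}
negative = {"terrible", "fast"}


def summarize(reviews):
    """Per-feature counts of positive/negative opinion words that
    co-occur in the same sentence as the feature."""
    summary = {f: {"pos": 0, "neg": 0} for f in features}
    for sentence in reviews:
        words = set(sentence.split())
        for f in features & words:
            summary[f]["pos"] += len(words & positive)
            summary[f]["neg"] += len(words & negative)
    return summary


print(summarize(reviews))
```

The output, positive and negative mention counts per product feature, is the kind of summary the section describes as useful to both customers and manufacturers.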
identification, Web page categorization and Web site completeness evaluation. Web structure mining can be divided into two categories based on the kind of structured data used:
1. Web graph mining: The web provides additional information about how different documents are connected to each other via hyperlinks. The web can be viewed as a (directed) graph whose nodes are web pages and whose edges are the hyperlinks between them. 2. Deep web mining: The web also contains a vast amount of non-crawlable content. This hidden part of the web is referred to as the deep web or the hidden web. Compared to the static surface web, the deep web contains a much larger amount of high-quality structured information. Most of the mining algorithms that improve the performance of web search are based on two assumptions. (a) Hyperlinks convey human endorsement. If there exists a link from page A to page B, and these two pages are authored by different people, then the first author found the second page valuable. Thus the importance of a page can be propagated to the pages it links to. (b) Pages that are co-cited by a certain page are likely related to the same topic. The popularity or importance of a page is correlated to the number of incoming links to some extent, and related pages tend to be clustered together through dense linkages among them. Web information extraction has the goal of pulling information out of a collection of web pages and converting it to a homogeneous form that is more readily digested and analyzed by both humans and machines. The result of IE can be used to improve the indexing process, because IE removes irrelevant information in web pages, and it facilitates other advanced search functions due to the structured nature of the data. It is usually difficult or even impossible to directly obtain the structure of a web site's backend database without cooperation from the site. Instead, sites present two other distinguishing structures: the interface schema and the result schema. The interface schema is the schema of the query interface, which exposes the attributes that can be queried in the backend database. The result schema is the schema of the query results, which exposes the attributes that are shown to users.
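The endorsement assumption (a) above is what the PageRank computation formalizes: a page's importance is propagated along its out-links. A minimal power-iteration sketch over a hypothetical four-page link graph:

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank: each page shares its damped rank
    equally among the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank


for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```

Page C, which is endorsed by three different pages, ends up with the highest rank, while D, which no page links to, ends up with the lowest.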
performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After the data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match, so that any number of modules can be combined in any order, as the data analyst sees fit. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task. For instance, the format of the data for the association rule discovery task may differ from the format necessary for mining sequential patterns. Finally, a query mechanism allows the user (analyst) to exert more control over the discovery process by specifying various constraints.
searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information. The Web server also relies on other utilities such as CGI scripts to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI 1 of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server send the output of the CGI application back to the browser. 2. PROXY SERVER LOGS: A Web proxy is a caching mechanism which lies between client browsers and Web servers. It helps to reduce the load time of Web pages as well as the network traffic load at the server and client side. Proxy server logs contain the HTTP requests from multiple clients to multiple Web servers. This may serve as a data source to discover the usage pattern of a group of anonymous users, sharing a common proxy server. A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behaviour of a group of anonymous users sharing a common proxy server. 3. BROWSER LOGS: Various browsers like Mozilla, Internet Explorer etc. can be modified or various JavaScript and Java applets can be used to collect client side data. 
Client-side data collection can be implemented by using a remote agent, such as JavaScript or Java applets, or by modifying the source code of an existing browser, such as Mosaic or Mozilla, to enhance its data collection capabilities. Either implementation requires user cooperation, in enabling the JavaScript or Java applet functionality, or in voluntarily using the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates both the caching and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page; in fact, they may incur additional overhead, especially when the applet is loaded for the first time. JavaScript, on the other hand, consumes little interpretation time but cannot capture all user clicks (such as the reload or back buttons). These methods collect only single-user, single-site browsing behaviour. A modified browser is much more versatile and allows data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing users to adopt the browser for their daily browsing activities. This can be done by offering incentives, similar to the incentive programs offered by companies such as NetZero and AllAdvantage that reward users for clicking on banner advertisements while surfing the Web.
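Whatever the collection point, the raw usage data typically arrives as plain-text log lines. A sketch of parsing one server-log entry, assuming the widely used Common Log Format (the sample line is made up for illustration):

```python
import re
from datetime import datetime

# Sketch: parsing one server-log entry in Common Log Format (CLF).
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf(line):
    m = CLF.match(line)
    if m is None:
        return None  # malformed entries are dropped during data cleaning
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

line = '123.456.78.9 - - [25/Apr/1998:03:04:41 -0500] "GET /A.html HTTP/1.0" 200 3290'
entry = parse_clf(line)
```

Each parsed entry then feeds the cleaning, user identification, and sessionization steps discussed in this chapter.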
6. Path Analysis: Path analysis examines the path a particular user has followed in accessing the contents of a Website. 7. Visitor IP Address: This gives the Internet Protocol (IP) address of the visitors who visited the Website in question. 8. Browser Type: This identifies the type of browser that was used to access the Website. 9. Cookies: A cookie is a message given to a Web browser by a Web server. The browser stores the message in a text file and sends it back to the server each time the browser requests a page from that server. The main purpose of cookies is to identify users and possibly prepare customized Web pages for them. When you enter a Web site that uses cookies, you may be asked to fill out a form providing information such as your name and interests. This information is packaged into a cookie and sent to your Web browser, which stores it for later use. The next time you go to the same Web site, your browser sends the cookie to the Web server, and the server can use this information to present you with custom Web pages. For example, instead of a generic welcome page you might see a welcome page with your name on it. 10. Platform: This identifies the Operating System and related platform details used to access the Website.
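The cookie round trip described in item 9 can be sketched with Python's standard `http.cookies` module (the cookie name and value are made-up examples):

```python
from http.cookies import SimpleCookie

# Sketch: how a server-issued cookie lets a site recognize a returning
# visitor. The cookie name and value are illustrative.

# Server side: attach an identifier to the response.
response = SimpleCookie()
response["visitor_id"] = "abc123"
header = response["visitor_id"].OutputString()  # e.g. 'visitor_id=abc123'

# Client side: the browser stores the cookie and echoes it back on the
# next request to the same server, which parses it to identify the user.
returned = SimpleCookie()
returned.load(header)
visitor = returned["visitor_id"].value
```

In usage mining, such identifiers make user identification far more reliable than IP/agent heuristics alone.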
3. Redesigning Pages to Help User Navigation: The information obtained can be used to redesign the structure of the Website so that users can navigate through it in the best possible manner. 4. Redesigning Pages for Search Engine Optimization: Analysis of user patterns can guide improvements to the content and other information in the website, and these improvements can be used to redesign pages for Search Engine Optimization so that search engines index the website at a proper rank. 5. Helping Evaluate the Effectiveness of Advertising Campaigns: Important and business-critical advertisements can be placed on pages that are frequently accessed.
Multiple IP address/Single User - A user that accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.
Multiple Agent/Single User - Again, a user that uses more than one browser, even on the same machine, will appear as multiple users. Assuming each user has now been identified (through cookies, logins, or
IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty-minute timeout is often used as the default method of breaking a user's click-stream into sessions. When a session ID is embedded in each URI, the definition of a session is set by the content server. While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. The only viable method of tracking cached page views is to monitor usage from the client side, although the referrer field for each request can be used to detect some of the instances when cached pages have been viewed. As an example from a sample server log, IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.45.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session, giving A-B-F-O-F-B-G, and one reference to the third session, giving A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no way to determine that lines 12 and 13 are actually a single server session. 2. Content Pre-Processing: Content preprocessing consists of converting the text, images, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process.
Often, this consists of performing content mining such as classification or clustering. While applying
data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of these uses. The intended use of a page view can also filter the sessions before or after pattern discovery. In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model is typically used to accomplish this. Text files can be broken up into vectors of words, and keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running additional algorithms as desired. Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site, and the content may be revised on a regular basis. The content of each page view to be preprocessed must be assembled, either by an HTTP request from a crawler, or by a combination of template, script, and database accesses.
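A minimal bag-of-words sketch of the vector space model mentioned above, with cosine similarity between two page-view texts (the sample texts and the naive whitespace tokenizer are illustrative assumptions):

```python
from collections import Counter

# Sketch: converting page-view text into term-count vectors, a minimal
# bag-of-words version of the vector space model.

def to_vector(text):
    """Break text into a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda x: sum(c * c for c in x.values()) ** 0.5
    return dot / (norm(u) * norm(v))

a = to_vector("sporting equipment and sporting goods")
b = to_vector("electronic goods and equipment")
similarity = cosine(a, b)  # between 0 and 1
```

Real systems would add stemming, stop-word removal, and term weighting (e.g. TF-IDF) on top of this skeleton.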
If only the portion of page views that are accessed is preprocessed, the output of any classification or clustering algorithms may be skewed. 3. Structure Pre-Processing: The structure of a site is created by the hypertext links between page views, and it can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore links) poses more problems than static page views; a different site structure may have to be constructed for each server session.
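The thirty-minute timeout heuristic described earlier for breaking a user's click-stream into sessions can be sketched as follows (the timestamps and URIs are invented):

```python
TIMEOUT = 30 * 60  # default thirty-minute gap, in seconds

def sessionize(requests, timeout=TIMEOUT):
    """Break one user's click-stream, a time-ordered list of
    (timestamp, uri) pairs, into sessions at gaps longer than `timeout`."""
    sessions = []
    for ts, uri in requests:
        if not sessions or ts - sessions[-1][-1][0] > timeout:
            sessions.append([])  # gap too long: start a new session
        sessions[-1].append((ts, uri))
    return sessions

clicks = [(0, "/A"), (120, "/B"), (300, "/F"), (4000, "/L"), (4100, "/R")]
by_session = sessionize(clicks)
# the gap from 300 to 4000 exceeds 1800 s, so the stream splits in two
```

Embedded session IDs or client-side collection, where available, replace this heuristic with an exact session boundary.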
6.6.1 PATTERN DISCOVERY: Web Usage Mining can be used to uncover patterns in server logs, but it is often carried out only on samples of data, and the mining process will be ineffective if the samples are not a good representation of the larger body of data. Pattern discovery draws upon methods and algorithms developed in several fields such as statistics, data mining, machine learning and pattern recognition; it is not the intent of this chapter to describe all the available algorithms and techniques derived from these fields. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed in other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section). 1. Statistical Analysis: Statistical techniques are the most common method of extracting knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, the average view time of a page, or the average length of a path through a site. Such a report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URI.
Despite its lack of analytical depth, this type of knowledge can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
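The descriptive statistics described above can be computed directly from a parsed session file. A sketch, with a made-up session format and data:

```python
from collections import Counter
from statistics import mean, median

# Sketch: descriptive statistics over a session file (data is invented).
# Each session is a list of (uri, view_time_in_seconds) pairs.
sessions = [
    [("/home", 5), ("/products", 40), ("/cart", 15)],
    [("/home", 8), ("/products", 55)],
    [("/products", 30)],
]

page_hits = Counter(uri for s in sessions for uri, _ in s)
most_frequent = page_hits.most_common(1)[0]          # ('/products', 3)
avg_view_time = mean(t for s in sessions for _, t in s)
median_path_length = median(len(s) for s in sessions)
```

A periodic traffic report is essentially a batch of such aggregates over the most recent log window.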
2. Association Rules: Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm may reveal a correlation between users who visited a page containing electronic products and those who accessed a page about sporting equipment. Aside from being applicable to business and marketing applications, the presence or absence of such rules can help Web designers restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site. 3. Clustering: Clustering is a technique for grouping together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or to provide personalized Web content to users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs. 4. Classification: Classification is the task of mapping a data item into one of several predefined classes.
In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires the extraction and selection of features that best describe the properties of a given class or category. Classification can be done using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbour classifiers, Support Vector Machines, etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who
placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast. 5. Sequential Patterns: The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Using this approach, Web marketers can predict future visit patterns, which is helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, and similarity analysis. 6. Dependency Modeling: Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested in building a model representing the different stages a visitor undergoes while shopping in an online store, based on the actions chosen (i.e., from casual visitor to serious potential buyer). Several probabilistic learning techniques can be employed to model the browsing behaviour of users, including Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behaviour of users but is also potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or to improve the navigational convenience of users. 6.6.3 PATTERN ANALYSIS: This is the final step in the Web Usage Mining process. After preprocessing and pattern discovery, the obtained usage patterns are analyzed to filter out uninteresting information and extract the useful information. Methods such as SQL (Structured Query Language) processing and OLAP (Online Analytical Processing) can be used.
The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common
form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colours to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
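A sketch of such a knowledge-query mechanism, using an in-memory SQLite table of discovered association rules and SQL to filter out the uninteresting ones (the schema, thresholds, and data are all illustrative):

```python
import sqlite3

# Sketch: an SQL-based knowledge query over discovered patterns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rules (antecedent TEXT, consequent TEXT, "
            "support REAL, confidence REAL)")
con.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("/electronics", "/sports", 0.12, 0.61),
    ("/home",        "/cart",   0.02, 0.15),
    ("/products",    "/cart",   0.30, 0.80),
])

# Keep only rules above minimum support and confidence thresholds.
interesting = con.execute(
    "SELECT antecedent, consequent FROM rules "
    "WHERE support >= 0.10 AND confidence >= 0.50 "
    "ORDER BY confidence DESC").fetchall()
```

Loading the same table into a data cube would support the OLAP-style slicing mentioned above.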
3. ADAPTIVE WEBSITES: An adaptive website adjusts the structure, content, or presentation of information in response to measured user interaction with the site, with the objective of optimizing future user interactions; in other words, it automatically improves its organization and presentation by learning from user access patterns. User interaction patterns may be collected directly on the website or mined from Web server logs. One or more models of user interaction are created using artificial intelligence and statistical methods, and these models serve as the basis for tailoring the website to known and specific patterns of user interaction.
pointed to by good hubs. The scores are computed for a set of pages related to a topic using an iterative procedure called HITS (Kleinberg 1999). First, a query is submitted to a search engine and a set of relevant documents is retrieved. This set, called the root set, is then expanded by including web pages that point to those in the root set or are pointed to by those in the root set. This new set is called the base set. An adjacency matrix A is formed such that A(i,j) = 1 if there exists at least one hyperlink from page i to page j, and A(i,j) = 0 otherwise. The HITS algorithm is then used to compute the hub and authority scores for this set of pages. There have been modifications and improvements to the basic PageRank and hubs-and-authorities approaches, such as SALSA (Lempel and Moran 2000), topic-sensitive PageRank (Haveliwala 2002), and web page reputations (Mendelzon and Rafiei 2000). These different hyperlink-based metrics have been discussed by Desikan, Srivastava, Kumar, and Tan (2002).
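The HITS iteration described above can be sketched in a few lines. This is a plain power-iteration version with simple sum normalization, for illustration only:

```python
# Sketch of the HITS iteration: hub and authority scores computed from
# the adjacency matrix A of the base set (Kleinberg 1999).

def hits(A, iterations=50):
    n = len(A)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # authority score: sum of hub scores of pages pointing to it
        auths = [sum(A[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # hub score: sum of authority scores of pages it points to
        hubs = [sum(A[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # normalize so the scores do not grow without bound
        for vec in (auths, hubs):
            s = sum(vec)
            for k in range(n):
                vec[k] /= s
    return hubs, auths

# Pages 0 and 1 both link to page 2: 0 and 1 are hubs, 2 an authority.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
hubs, auths = hits(A)
```

Production implementations normalize with the L2 norm and test for convergence, but the mutual reinforcement between hubs and authorities is the same.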
7.2 ROBOT DETECTION AND FILTERING: SEPARATING HUMAN AND NON HUMAN WEB BEHAVIOUR
Web robots are software programs that automatically traverse the hyperlink structure of the web to locate and retrieve information. The importance of separating robot behaviour from human behaviour prior to building user behaviour models has been illustrated by Kohavi. First, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their web sites. Second, web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to web robots also make it difficult to perform click-stream analysis effectively on the web data. Conventional techniques for detecting web robots are based on identifying the IP address and user agent of the web clients. While these techniques are applicable to many well-known robots, they are not sufficient to detect camouflaged and previously unknown robots. Tan and Kumar proposed a classification-based approach that uses the navigational patterns in click-stream data to determine whether a session is due to a robot. Experimental results have shown that highly accurate classification models can be built using this approach; furthermore, these models are able to discover many camouflaged and previously unidentified robots.
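The conventional user-agent techniques mentioned above amount to simple heuristics. A sketch (the agent patterns and session format are illustrative assumptions, and, as noted, this approach misses camouflaged robots):

```python
# Sketch: conventional heuristics for flagging robot sessions.
# Patterns and session format are illustrative; camouflaged robots
# evade these checks, motivating classification on navigation patterns.

KNOWN_BOT_AGENTS = ("googlebot", "bingbot", "crawler", "spider")

def looks_like_robot(session):
    """Flag a session using simple IP/agent-style heuristics."""
    agent = session.get("user_agent", "").lower()
    if any(bot in agent for bot in KNOWN_BOT_AGENTS):
        return True
    if "/robots.txt" in session.get("pages", []):
        return True  # well-behaved robots fetch robots.txt first
    return False

session = {"user_agent": "Googlebot/2.1", "pages": ["/", "/a", "/b"]}
is_robot = looks_like_robot(session)
```

The classification-based approach replaces these hard-coded checks with features derived from the session's navigational pattern, such as breadth of traversal and the fraction of image requests.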
successful examples of building web-wide behavioural profiles, such as Alexa Research and DoubleClick. These approaches require browser cookies of some sort, and can provide a fairly detailed view of a user's browsing behaviour across the web.
preprocessing. Preprocessing of web structure data, especially link information, has been carried out for some applications, the most notable being Google style web search.
8. PROMINENT APPLICATIONS
Excitement about the web in the past few years has led to web applications being developed at a much faster rate in industry than research in web-related technologies. Many of these applications are based on the use of web mining concepts, even though the organizations that developed them, and invented the corresponding technologies, did not consider them as such. We describe some of the most successful applications in this section. Clearly, recognizing that these applications use web mining is largely a retrospective exercise. For each application category discussed below, we have selected a prominent representative, purely for exemplary purposes. This in no way implies that all the techniques described were developed by that organization alone; on the contrary, in most cases the successful techniques were developed by a rapid copy-and-improve approach to each other's ideas.
effort. However, the publishing industry is not very convinced about a fully automated approach to news distillation.
its communities, which it has used for targeted marketing through advertisements and email solicitation. Recently, it has started the concept of community sponsorship, whereby an organization, say Nike, may sponsor a community called Young Athletic Twenty Something. In return, consumer survey and new product development experts of the sponsoring organization get to participate in the community, perhaps without the knowledge of other participants. The idea is to treat the community as a highly specialized focus group, understand its needs and opinions on new and existing products, and also test strategies for influencing opinions.
8.7 CITESEER: DIGITAL LIBRARY AND AUTONOMOUS CITATION INDEXING
The NEC ResearchIndex, also known as CiteSeer, is one of the most popular online bibliographic indices related to computer science. The key contribution of the CiteSeer repository is its Autonomous Citation Indexing (ACI) (Lawrence, Giles, and Bollacker 1999). Citation indexing makes it possible to extract information about related articles, and automating the process greatly reduces human effort while making indexing faster and more effective. CiteSeer works by crawling the web and downloading research-related papers. Information about citations and the surrounding context is stored for each of these documents, and the full text and document metadata are stored in different formats. Information is also provided about documents that are similar at the sentence level (the percentage of sentences that match between the documents), similar at the text level, or related through co-citation. Citation statistics are computed for each document, enabling the user to find the most cited or popular documents in the related field. CiteSeer also maintains a directory of computer science papers to make category-based search easier. These documents are ordered by the number of citations.
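The sentence-level similarity mentioned above (the percentage of sentences shared between two documents) can be sketched as follows (the naive sentence-splitting rule and the sample texts are illustrative assumptions):

```python
# Sketch: sentence-level similarity between two documents, measured as
# the fraction of the first document's sentences found in the second.
# Splitting on '.' is a deliberately naive stand-in for real
# sentence segmentation.

def sentence_overlap(doc_a, doc_b):
    sents_a = {s.strip().lower() for s in doc_a.split(".") if s.strip()}
    sents_b = {s.strip().lower() for s in doc_b.split(".") if s.strip()}
    if not sents_a:
        return 0.0
    return len(sents_a & sents_b) / len(sents_a)

a = "We index citations. Queries are fast. Results are ranked."
b = "We index citations. Ranking differs. Results are ranked."
overlap = sentence_overlap(a, b)  # 2 of 3 sentences match
```

A high overlap score flags near-duplicate papers (e.g. a technical report and its conference version) for grouping in the index.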
9. RESEARCH DIRECTIONS
Although we are going through an inevitable phase of irrational despair following a phase of irrational exuberance about the commercial potential of the web, the adoption and usage of the web continues to grow unabated. As the web and its usage grows, it will continue to generate ever more content, structure, and usage data, and the value of web mining will keep increasing. Outlined here are some research directions that must be pursued to ensure that we continue to develop web mining technologies that will enable this value to be realized.
understanding users' behaviour in traditional shops. Research needs to be carried out in (1) extracting process models from usage data, (2) understanding how different parts of the process model impact various web metrics of interest, and (3) understanding how the process models change in response to various changes that are made, i.e., changing stimuli to the user.
characterize them and recognize emerging frauds. The issues in cyber threat analysis and intrusion detection are quite similar in nature.