
1. INTRODUCTION
The World Wide Web is something that most readers, without doubt, already know and have used extensively. The World Wide Web (or the Web for short) has affected almost every aspect of our lives. It is the biggest and most widely known information source that is easily accessible and searchable. It consists of billions of interconnected documents (called Web pages), which are authored by millions of people. Since its inception, the Web has dramatically changed our information-seeking behaviour. Before the Web, finding information meant asking a friend or an expert, or buying or borrowing a book to read. With the Web, everything is only a few clicks away from the comfort of our homes or offices. Not only can we find needed information on the Web, but we can also easily share our information and knowledge with others.

The Web has also become an important channel for conducting business. We can buy almost anything from online stores without needing to go to a physical shop. The Web also provides a convenient means for us to communicate with each other, to express our views and opinions on anything, and to hold discussions with people from anywhere in the world. The Web is truly a virtual society. In this chapter, we introduce the Web, its history, and the topics that we will discuss in the seminar.

1.1 WHAT IS THE WORLD WIDE WEB?

The World Wide Web is officially defined as a "wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents." In simpler terms, the Web is an Internet-based computer network that allows users of one computer to access information stored on another through the world-wide network called the Internet. The Web's implementation follows a standard client-server model. In this model, a user relies on a program (called the client) to connect to a remote machine (called the server) where the data is stored.

Navigating through the Web is done by means of a client program called the browser, e.g., Netscape, Internet Explorer, Firefox, etc. Web browsers work by sending requests to remote servers for information, then interpreting the returned documents written in HTML and laying out the text and graphics on the user's computer screen on the client side.
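To make the client side of this model concrete, the following minimal sketch builds the text of the request that a browser would send to a server for a given address. It only assembles the request and does not open a network connection. The function name build_get_request and the address http://example.org/index.html are illustrative assumptions, not part of any particular browser.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request for `url`."""
    parts = urlsplit(url)
    # An empty path means the root document of the server.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, resource, protocol version
        f"Host: {parts.netloc}",  # the server the client wants to talk to
        "Connection: close",
        "",                       # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical address used purely for illustration.
request_text = build_get_request("http://example.org/index.html")
print(request_text)
```

The server would answer such a request with a status line, headers, and the HTML document, which the browser then interprets and lays out.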
WEB MINING Page 1

The operation of the Web relies on the structure of its hypertext documents. Hypertext allows Web page authors to link their documents to other related documents residing on computers anywhere in the world. To view these documents, one simply follows the links (called hyperlinks). The idea of hypertext was invented by Ted Nelson in 1965, who also created the well-known hypertext system Xanadu (http://xanadu.com/). Hypertext that also allows other media (e.g., image, audio and video files) is called hypermedia.
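As a small illustration of how hyperlinks appear inside a hypertext document, the sketch below collects the targets of all links in an HTML page using Python's standard html.parser module. The class name LinkExtractor, the sample page, and the example.com and example.org addresses are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute targets of all <a href="..."> hyperlinks."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the address of the page.
                self.links.append(urljoin(self.base_url, value))


# A made-up page with one relative and one absolute hyperlink.
page = """<html><body>
<a href="/about.html">About</a>
<a href="http://example.org/news">News</a>
<p>No link here.</p>
</body></html>"""

extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
links = extractor.links
print(links)
```

Following the collected links from page to page is the basic operation behind both browsing and crawling.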

1.2 A BRIEF HISTORY OF THE WEB AND THE INTERNET


CREATION OF THE WEB: The Web was invented in 1989 by Tim Berners-Lee, who, at that time, worked at CERN (Conseil Européen pour la Recherche Nucléaire, the European Laboratory for Particle Physics) in Switzerland. He coined the term "World Wide Web", wrote the first World Wide Web server, httpd, and the first client program (a browser and editor), WorldWideWeb.

It began in March 1989 when Tim Berners-Lee submitted a proposal titled "Information Management: A Proposal" to his superiors at CERN. In the proposal, he discussed the disadvantages of hierarchical information organization and outlined the advantages of a hypertext-based system. The proposal called for a simple protocol that could request information stored in remote systems through networks, and for a scheme by which information could be exchanged in a common format and documents of individuals could be linked by hyperlinks to other documents. It also proposed methods for reading text and graphics using the display technology at CERN at that time. The proposal essentially outlined a distributed hypertext system, which is the basic architecture of the Web.

Initially, the proposal did not receive the needed support. However, in 1990, Berners-Lee re-circulated the proposal and received the support to begin the work. With this project, Berners-Lee and his team at CERN laid the foundation for the future development of the Web as a distributed hypertext system. They introduced their server and browser, the protocol used for communication between clients and the server, the Hyper Text Transfer Protocol (HTTP), the Hyper Text Markup Language (HTML) used for authoring Web documents, and the Universal Resource Locator (URL). And so it began.

MOSAIC AND NETSCAPE BROWSERS: The next significant event in the development of the Web was the arrival of Mosaic. In February 1993, Marc Andreessen from the University of Illinois NCSA (National Center for Supercomputing Applications) and his team released the first "Mosaic for X" graphical Web browser for UNIX. A few months later, different versions of Mosaic were released for the Macintosh and Windows operating systems. This was an important event. For the first time, a Web client with a consistent and simple point-and-click graphical user interface was implemented for the three most popular operating systems available at the time. It soon made a big splash outside the academic circle where it had begun. In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc Andreessen, and they founded the company Mosaic Communications (later renamed Netscape Communications). Within a few months, the Netscape browser was released to the public, which started the explosive growth of the Web. Internet Explorer from Microsoft entered the market in August 1995 and began to challenge Netscape. The creation of the World Wide Web by Tim Berners-Lee, followed by the release of the Mosaic browser, are often regarded as the two most significant contributing factors to the success and popularity of the Web.

INTERNET: The Web would not be possible without the Internet, which provides the communication network for the Web to function. The Internet started with the computer network ARPANET in the Cold War era. It was produced as the result of a project in the United States aiming at maintaining control over its missiles and bombers after a nuclear attack. It was supported by the Advanced Research Projects Agency (ARPA), which was part of the Department of Defense in the United States.
The first ARPANET connections were made in 1969, and in 1972, it was demonstrated at the First International Conference on Computers and Communication, held in Washington D.C. At the conference, ARPA scientists linked computers together from 40 different locations.
In 1973, Vinton Cerf and Bob Kahn started to develop the protocol later to be called TCP/IP (Transmission Control Protocol/Internet Protocol). In the next year, they published the paper "Transmission Control Protocol", which marked the beginning of TCP/IP. This new protocol allowed diverse computer networks to interconnect and communicate with each other. In subsequent years, many networks were built, and many competing techniques and protocols were proposed and developed. However, ARPANET was still the backbone of the entire system. During this period, the network scene was chaotic. In 1982, TCP/IP was finally adopted, and the Internet, which is a connected set of networks using the TCP/IP protocol, was born.

SEARCH ENGINES: With information being shared worldwide, there was a need for individuals to find information in an orderly and efficient manner. Thus began the development of search engines. The search system Excite was introduced in 1993 by six Stanford University students. EINet Galaxy was established in 1994 as part of the MCC Research Consortium at the University of Texas. Jerry Yang and David Filo created Yahoo! in 1994, which started out as a listing of their favourite Web sites and offered directory search. In subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves, Northern Light, etc. Google was launched in 1998 by Sergey Brin and Larry Page, based on their research project at Stanford University. Microsoft started to commit to search in 2003 and launched the MSN search engine in spring 2005; before that, it had used search engines from others. Yahoo! provided a general search capability in 2004 after it purchased Inktomi in 2003.

W3C (THE WORLD WIDE WEB CONSORTIUM): W3C was formed in December 1994 by MIT and CERN as an international organization to lead the development of the Web. W3C's main objective was to promote standards for the evolution of the Web and interoperability between WWW products by producing specifications and reference software. The first International Conference on World Wide Web (WWW) was also held in 1994, and it has been a yearly event ever since.

From 1995 to 2001, the growth of the Web boomed. Investors saw commercial opportunities and became involved. Numerous businesses started on the Web, which led to irrational developments. Finally, the bubble burst in 2001. However, the development of the Web did not stop; it has only become more rational since.

1.3 WEB DATA MINING


The rapid growth of the Web in the last decade makes it the largest publicly accessible data source in the world. The Web has many unique characteristics, which make mining useful information and knowledge a fascinating and challenging task. Let us review some of these characteristics.

1. The amount of data/information on the Web is huge and still growing. The coverage of the information is also very wide and diverse. One can find information on almost anything on the Web.
2. Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages, unstructured texts, and multimedia files (images, audio, and video).
3. Information on the Web is heterogeneous. Due to the diverse authorship of Web pages, multiple pages may present the same or similar information using completely different words and/or formats. This makes integration of information from multiple pages a challenging problem.
4. A significant amount of information on the Web is linked. Hyperlinks exist among Web pages within a site and across different sites. Within a site, hyperlinks serve as information organization mechanisms. Across different sites, hyperlinks represent implicit conveyance of authority to the target pages. That is, pages that are linked (or pointed) to by many other pages are usually high-quality or authoritative pages, simply because many people trust them.
5. The information on the Web is noisy. The noise comes from two main sources. First, a typical Web page contains many pieces of information, e.g., the main content of the page, navigation links, advertisements, copyright notices, privacy policies, etc. For a particular application, only part of the information is useful. The rest is considered noise. To perform fine-grained Web information analysis and data mining, the noise should be removed. Second, because the Web has no quality control of information, i.e., one can write almost anything that one likes, a large amount of information on the Web is of low quality, erroneous, or even misleading.
6. The Web is also about services. Most commercial Web sites allow people to perform useful operations at their sites, e.g., to purchase products, to pay bills, and to fill in forms.
7. The Web is dynamic. Information on the Web changes constantly. Keeping up with the change and monitoring the change are important issues for many applications.
8. The Web is a virtual society. The Web is not only about data, information and services, but also about interactions among people, organizations and automated systems. One can communicate with people anywhere in the world easily and instantly, and also express one's views on anything in Internet forums, blogs and review sites.

All these characteristics present both challenges and opportunities for mining and discovery of information and knowledge from the Web. To explore information mining on the Web, it is necessary to know data mining, which has been applied in many Web mining tasks. However, Web mining is not entirely an application of data mining. Due to the richness and diversity of information and the other Web-specific characteristics discussed above, Web mining has developed many of its own algorithms.

1.3.1 WHAT IS DATA MINING?

Data mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and visualization.

There are many data mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule mining, and sequential pattern mining. We will discuss all of them in this seminar.

A data mining application usually starts with an understanding of the application domain by data analysts (data miners), who then identify suitable data sources and the
target data. With the data, data mining can be performed, which is usually carried out in three main steps:

Pre-processing: The raw data is usually not suitable for mining for various reasons. It may need to be cleaned in order to remove noise or abnormalities. The data may also be too large and/or involve many irrelevant attributes, which calls for data reduction through sampling and attribute selection.
Data mining: The processed data is then fed to a data mining algorithm, which will produce patterns or knowledge.
Post-processing: In many applications, not all discovered patterns are useful. This step identifies the useful ones for applications. Various evaluation and visualization techniques are used to make the decision.

The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve final satisfactory results, which are then incorporated into real-world operational tasks. Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form. With the growth of the Web and text documents, Web mining and text mining are becoming increasingly important and popular.

1.3.2 WHAT IS WEB MINING?

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, as mentioned above it is not purely an application of traditional data mining, due to the heterogeneity and the semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms were invented in the past decade. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three types: Web structure mining, Web content mining and Web usage mining. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.

Web mining can be decomposed into these subtasks:

1. Resource finding: The task of retrieving intended Web documents.
2. Information selection and pre-processing: Automatically selecting and pre-processing specific information from retrieved Web resources.
3. Generalization: Automatically discovering general patterns at individual Web sites as well as across multiple sites.
4. Analysis: Validation and/or interpretation of the mined patterns.

Resource finding is the process of retrieving data from text sources available on the Web, such as electronic magazines and newsletters, or the text contents of HTML documents. The information selection and pre-processing step transforms the original data retrieved in the information retrieval (IR) process. These transformations cover removing stop words, finding phrases in the training corpus, transforming the representation to relational or first-order logic form, etc. Data mining techniques and machine learning are often used for generalization. People play a very important role in the information and knowledge discovery process, particularly in the validation and/or interpretation carried out in the last step.
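The pre-processing transformations mentioned above (removing stop words and finding phrases) can be sketched as follows. This is a minimal illustration under stated assumptions, not a production text pipeline: the stop-word list is a tiny hand-picked set, the sample sentence is made up, and "phrases" are approximated as adjacent word pairs that occur at least twice after stop-word removal.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def frequent_bigrams(tokens, min_count=2):
    """Treat adjacent word pairs seen at least `min_count` times as phrases."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pairs.items() if n >= min_count}


# Made-up sample text.
text = "Web mining is the mining of the Web. Web mining uses data mining."
tokens = remove_stop_words(tokenize(text))
phrases = frequent_bigrams(tokens)
print(tokens)
print(phrases)
```

Because the word pairs are counted after stop words are removed, two words separated only by a stop word in the original text count as adjacent; whether that is desirable depends on the application.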

1.4 WEB MINING CATEGORIES


Web mining is categorized into three areas of interest based on which part of the Web is mined:

1. Web content mining: the discovery of useful information from Web contents, data and documents. It can be approached from two different points of view: the information retrieval (IR) view and the database (DB) view.
2. Web structure mining: the modelling of link structures, the topology of hyperlinks, and the categorization of Web pages.
3. Web usage mining: the mining of secondary data derived from user interactions with the Web.

Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web structure mining is the process of inferring knowledge from the Web organization and links between references and referents in the Web. Finally, Web usage mining, also known as Web log mining, is the process of extracting interesting patterns from Web access logs.

In this seminar, we will discuss all three types of mining. However, due to the richness and diversity of information on the Web, there are a large number of Web mining tasks. We will not be able to cover them all; we will focus only on some important tasks and their algorithms.

The Web mining process is similar to the data mining process. The difference is usually in the data collection. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involve crawling a large number of target Web pages. We will devote a whole chapter to crawling. Once the data is collected, we go through the same three-step process: data pre-processing, Web data mining and post-processing. However, the techniques used for each step can be quite different from those used in traditional data mining.
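As a rough illustration of Web usage mining, the sketch below follows the three-step process on a handful of access-log entries: it parses the raw lines (pre-processing), counts page requests and collects the pages visited by each host (mining), and keeps only successful requests. The log lines are invented sample data, and the assumption that they follow the widely used Common Log Format is ours; real server logs may use a different layout.

```python
import re
from collections import Counter, defaultdict

# Pattern for one entry in the Common Log Format (assumed layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Invented sample entries, including one malformed line.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2310',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [01/Jan/2024:10:01:09 +0000] "GET /missing.html HTTP/1.1" 404 512',
    'not a log line',
]


def parse(lines):
    """Pre-processing: turn raw lines into records, skipping malformed ones."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


page_hits = Counter()               # how often each page was requested
paths_by_host = defaultdict(list)   # the sequence of pages each host visited
for record in parse(log_lines):
    if record["status"] == "200":   # keep only successful requests
        page_hits[record["path"]] += 1
        paths_by_host[record["host"]].append(record["path"])

print(page_hits.most_common())
print(dict(paths_by_host))
```

The per-host page sequences are the kind of input on which sequential pattern mining could later be run.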

2. DATA MINING
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention, to production control and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has witnessed an evolutionary path in the development of the following functionalities: data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining). For instance, the early development of data collection and database creation mechanisms served as a prerequisite for later development of effective mechanisms for data storage and retrieval, and query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target. Since the 1960s, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful database systems. The research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to the development of relational database systems, data modelling tools, and indexing and accessing methods. In addition, users gained convenient and flexible data access through query languages, user interfaces, optimized query processing, and transaction management. 
Efficient methods for on-line transaction processing (OLTP), where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data. Database technology since the mid-1980s has been characterized by the popular adoption of relational technology and an upsurge of research and development activities on new and powerful database systems. These promote the development of advanced data
models such as extended-relational, object-oriented, object-relational, and deductive models. Application-oriented database systems, including spatial, temporal, multimedia, active, stream, and sensor databases, scientific and engineering databases, knowledge bases, and office information bases, have flourished. Issues related to the distribution, diversification, and sharing of data have been studied extensively. Heterogeneous database systems and Internet-based global information systems such as the World Wide Web (WWW) have also emerged and play a vital role in the information industry.

The steady and amazing progress of computer hardware technology in the past three decades has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. This technology provides a great boost to the database and information industry, and makes a huge number of databases and information repositories available for transaction management, information retrieval, and data analysis.

Data can now be stored in many different kinds of databases and information repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. Data warehouse technology includes data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles. Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time.

In addition, huge volumes of data can be accumulated beyond databases and data warehouses. Typical examples include the World Wide Web and data streams, where data flow in and out like streams, as in applications like video surveillance, telecommunication, and sensor networks. The effective and efficient analysis of data in such different forms becomes a challenging task.

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large data repositories become "data tombs": data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker's intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data. In addition, consider expert system technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely time-consuming and costly. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research. The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into "golden nuggets" of knowledge.

Figure 1: DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY

2.1 DATA MINING BRIEF OVERVIEW


Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named "knowledge mining from data", which is unfortunately somewhat long. "Knowledge mining", a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer that carries both "data" and "mining" became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in Figure 1 and consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
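The seven steps above can be sketched as a chain of small functions. The example below is only an illustration under simple assumptions: the two "stores" and their purchase records are invented, the mining step merely counts pairs of items bought together, and the interestingness measure is a plain minimum count. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Invented purchase records from two hypothetical sources.
store_a = [
    {"id": 1, "items": "bread, milk"},
    {"id": 2, "items": "bread, butter, milk"},
    {"id": 3, "items": None},  # noisy record with no items
]
store_b = [
    {"id": 4, "items": "Milk, Bread"},
    {"id": 5, "items": "butter"},
]


def clean(records):
    """Step 1: remove records with missing item data."""
    return [r for r in records if r["items"]]


def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [r for source in sources for r in source]


def select(records):
    """Step 3: keep only the attribute relevant to the analysis."""
    return [r["items"] for r in records]


def transform(item_strings):
    """Step 4: turn each item string into a normalized set of items."""
    return [frozenset(i.strip().lower() for i in s.split(",")) for s in item_strings]


def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(counts, min_support):
    """Step 6: keep only pairs seen in at least `min_support` baskets."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{a} + {b}: {n} baskets" for (a, b), n in sorted(patterns.items())]


baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), min_support=2)
report = present(patterns)
print(report)
```

In practice the sequence is iterative: the result of the evaluation step often sends the analyst back to an earlier step with different selections or thresholds.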

Steps 1 to 4 are different forms of data pre-processing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation.

We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data. Therefore, in this seminar, we choose to use the term data mining. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

Based on this view, the architecture of a typical data mining system may have the following major components (Figure 2):

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Figure 2: ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user
to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms. From a data warehouse perspective, data mining can be viewed as an advanced stage of online analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data analysis. Although there are many data mining systems on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data should be more appropriately categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.

2.2 DATA MINING TECHNIQUES


Neural Networks/Pattern Recognition - Neural networks are used in a black-box fashion. One creates a training data set, lets the neural network learn patterns based on known outcomes, then sets the neural network loose on huge amounts of data. For example, a credit card company has 3,000 records, 100 of which are known fraud records. The data set is used to train the neural network so that it can tell the difference between the fraud records and the legitimate ones. The network learns the patterns of the fraud records. Then the network is run against the company's million-record data set, and it flags the records whose patterns are the same as or similar to those of the fraud records. Neural networks are known for not being very helpful in teaching analysts about the data; they just find patterns that match. Neural networks have been used for optical character recognition to help the Post Office automate the delivery process without having to use humans to read addresses.

Memory Based Reasoning - This technique has results similar to those of a neural network but goes about it differently. MBR looks for "neighbour" kinds of data, rather than patterns. If you look at insurance claims and want to know which ones the adjudicators should look at and which they can just let go through the system, you would set up a set of claims you want adjudicated and let the technique find similar claims.
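One common way to realize the "neighbour" idea behind memory based reasoning is a k-nearest-neighbour vote; we use it here as a stand-in, not as the definitive MBR method. In the sketch below, each past claim is described by two made-up numeric features (claim amount in thousands and days since the policy started) together with the decision it received, and a new claim gets the majority decision of its three closest past claims. All data values and names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical past claims: (amount in thousands, days since policy start) -> decision.
known_claims = [
    ((1.0, 400.0), "pass"),
    ((1.5, 380.0), "pass"),
    ((2.0, 420.0), "pass"),
    ((9.0, 20.0), "review"),
    ((8.5, 35.0), "review"),
    ((9.5, 15.0), "review"),
]


def classify(claim, k=3):
    """Label a claim by majority vote among its k nearest past claims."""
    neighbours = sorted(known_claims, key=lambda rec: math.dist(rec[0], claim))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(classify((9.0, 25.0)))   # a large claim soon after the policy started
print(classify((1.2, 390.0)))  # a small claim on a long-standing policy
```

The features are used unscaled here for brevity, so the one with the larger numeric range dominates the distance; in practice the features would normally be brought to comparable scales first.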
WEB MINING Page 16

Cluster Detection/Market Basket Analysis - This is where the classic analysis of beer and diapers being bought together came from. It finds groupings. Basically, this technique finds relationships among products, customers, or wherever else you want to find associations in data.

Link Analysis - This is another technique for associating like records. It is not used much, but there are some tools created just for it. As the name suggests, the technique tries to find links among customers, transactions, and so on, and to demonstrate those links.

Visualization - This technique helps users understand their data. Visualization makes the bridge from text-based to graphical presentation. Decision tree, rule, cluster, and pattern visualization help users see data relationships rather than read about them. Many of the stronger data mining programs have made strides in improving their visual content over the past few years. This is really the vision of the future of data mining and analysis. Data volumes have grown to such huge levels that it will soon be impossible for humans to process them effectively by any text-based method. We will probably see an approach to data mining using visualization appear that will be something like Microsoft's Photosynth. The technology is there; it will just take an analyst with some vision to sit down and put it together.

Decision Tree/Rule Induction - Decision trees use real data mining algorithms. Decision trees help with classification and produce information that is very descriptive, helping users to understand their data. A decision tree process will generate the rules followed in a process. For example, a lender at a bank goes through a set of rules when approving a loan. Based on the loan data a bank has, the outcomes of the loans (default or paid), and the limits of acceptable levels of default, the decision tree can set up the guidelines for the lending institution. These decision trees are very similar to the first decision support (or expert) systems.

Genetic Algorithms - GAs are techniques that act like bacteria growing in a Petri dish. You set up a data set, then give the GA the ability to try different things and a way to judge whether a direction or outcome is favourable. The GA will move in a direction that will hopefully optimize the final result. GAs are used mostly for process optimization, such as scheduling, workflow, batching, and process re-engineering. Think of a GA as a simulation run over and over to find optimal results, together with the infrastructure for running the simulations and for defining which results count as optimal.
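The "simulation run over and over" idea behind genetic algorithms can be sketched in a few lines. This is only a toy illustration: the bit-string genomes, the `count_ones` fitness function, and all parameter values are assumptions chosen for the example, not part of any particular data mining product.

```python
import random


def evolve(fitness, genome_length, population_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Evolve bit strings towards higher fitness; returns the best genome found."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


def count_ones(genome):
    """Toy fitness function: the more 1-bits, the 'better' the solution."""
    return sum(genome)


best = evolve(count_ones, genome_length=16)
print(best, count_ones(best))
```

In a scheduling or batching problem, the genome would encode a candidate schedule and the fitness function would score it; the loop of selection, crossover, and mutation stays the same.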

OLAP (Online Analytical Processing) - OLAP allows users to browse data by following logical questions about the data. OLAP generally includes the ability to drill down into data, moving from highly summarized views of data into more detailed views. This is generally achieved by moving along hierarchies of data. For example, if one were analyzing populations, one could start with the most populous continent, then drill down to the most populous country, then to the state level, then to the city level, then to the neighbourhood level. OLAP also includes browsing up hierarchies (drill up), across different dimensions of data (drill across), and many other advanced techniques for browsing data, such as automatic time variation when drilling up or down time hierarchies. OLAP is by far the most implemented and used technique. It is also generally the most intuitive and easy to use.
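The drill-down along a hierarchy can be sketched as repeated aggregation at different levels. The rows below use made-up round numbers rather than real population figures, and the names `FACTS`, `LEVELS`, and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (continent, country, city, population in millions).
# The figures are illustrative round numbers, not real statistics.
FACTS = [
    ("Asia", "India", "Mumbai", 20.0),
    ("Asia", "India", "Delhi", 30.0),
    ("Asia", "Japan", "Tokyo", 37.0),
    ("Europe", "France", "Paris", 11.0),
    ("Europe", "Germany", "Berlin", 4.0),
]
LEVELS = ("continent", "country", "city")


def roll_up(facts, level, where=None):
    """Sum the measure at one hierarchy level, optionally restricted to a parent."""
    depth = LEVELS.index(level)
    where = where or {}
    totals = defaultdict(float)
    for row in facts:
        if all(row[LEVELS.index(name)] == value for name, value in where.items()):
            totals[row[depth]] += row[-1]
    return dict(totals)


print(roll_up(FACTS, "continent"))                       # summarized view
print(roll_up(FACTS, "country", {"continent": "Asia"}))  # drill down into Asia
print(roll_up(FACTS, "city", {"country": "India"}))      # drill down into India
```

Drilling up is the same call with a coarser `level`; an OLAP server would typically precompute such totals instead of scanning the rows each time.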

2.3 KIND OF DATA


We examine a number of different data repositories on which mining can be performed. In principle, data mining should be applicable to any kind of data repository, as well as to transient data, such as data streams. Thus the scope of our examination of data repositories will include relational databases, data warehouses, transactional databases, advanced database systems, flat files, data streams, and the World Wide Web. Advanced database systems include object-relational databases and specific application-oriented databases, such as spatial databases, time-series databases, text databases, and multimedia databases. The challenges and techniques of mining may differ for each of the repository systems.

2.3.1 RELATIONAL DATABASES

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs involve mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.


2.3.2 DATA WAREHOUSES

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

2.3.3 TRANSACTIONAL DATABASES

In general, a transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction. The transactional database may have additional tables associated with it, which contain other information.

2.4 CLASSIFICATION OF DATA MINING SYSTEMS


Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:


Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, or multimedia data mining system, or a World Wide Web mining system.

Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique that combines the merits of a few individual approaches.

Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they are adapted to. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.


2.5 DATA MINING TASK PRIMITIVES


Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths. The data mining primitives specify the following.

The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).

The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs regarding relationships in the data are another form of background knowledge.

The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.

The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
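To make the five primitives concrete, a data mining query can be pictured as a small record with one field per primitive. The sketch below is purely illustrative: the class name `MiningQuery`, its field names, and the sample values are assumptions for the example and do not correspond to the query language of any real system.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """One field per task primitive (hypothetical structure for illustration)."""
    relevant_data: dict   # task-relevant data: source plus attributes/dimensions
    knowledge_kind: str   # e.g. "association", "classification", "clustering"
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measure -> minimum threshold
    presentation: tuple = ("rules", "tables")                 # expected output forms

    def is_interesting(self, **measures):
        """True if every thresholded measure reaches its minimum value."""
        return all(measures.get(name, 0.0) >= minimum
                   for name, minimum in self.interestingness.items())


query = MiningQuery(
    relevant_data={"table": "sales", "attributes": ["item", "customer_age"]},
    knowledge_kind="association",
    background_knowledge={"location": ["city", "country", "continent"]},
    interestingness={"support": 0.05, "confidence": 0.7},
)

print(query.is_interesting(support=0.10, confidence=0.80))  # passes both thresholds
print(query.is_interesting(support=0.01, confidence=0.90))  # support is too low
```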


3. WEB MINING
The following figure briefly shows the architecture of web mining. It is divided into two stages: stage 1 holds the data and its preparation, and stage 2 holds the analysis. According to the analysis target, web mining can be divided into three different types: web usage mining, web content mining, and web structure mining.

[Diagram: web mining architecture, split into STAGE 1 and STAGE 2. Processing steps: Data Cleaning, Transaction Identification, Data Integration, Transformation, Pattern Discovery, Pattern Analysis. Data: Server data log, Clean log, Transaction data, Registration data, Documents and Usage Attributes. Techniques and tools: Path Analysis, Association Rules, Sequential Pattern, Clusters and Classification Rules, OLAP/Visualisation Tools, Knowledge Query Mechanism, Database Query Language, Intelligent Agent.]
Figure 3: ARCHITECTURE OF WEB MINING

Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad). The most commonly used techniques in data mining are artificial neural networks, decision trees, genetic algorithms, the nearest neighbour method, and rule induction. Data mining research has drawn on a number of other fields, such as inductive learning, machine learning, and statistics.

Machine learning: Machine learning is the automation of a learning process, where learning is based on observations of environmental statistics and transitions. Machine learning examines previous examples and their outcomes and learns how to reproduce these and make generalizations about new cases.

Inductive learning: Induction means the inference of information from data, and inductive learning is a model-building process in which the database is analyzed to find patterns. The main strategies are supervised learning and unsupervised learning.

Statistics: Used to detect unusual patterns and to explain patterns using statistical models such as linear models.

Data mining models can be of two kinds. In a discovery model, the system automatically discovers important information hidden in the data. In a verification model, the system takes a hypothesis from the user and tests its validity against the data.

The web contains a collection of pages that includes countless hyperlinks and huge volumes of access and usage information. Because of the ever-increasing amount of information in cyberspace, knowledge discovery and web mining are becoming critical for successfully conducting business in the cyber world. Web mining is the discovery and analysis of useful information from the web; more precisely, it is the use of data mining techniques to automatically discover and extract information from web documents and services (content, structure, and usage).

3.1 APPROACHES OF WEB MINING


Two different approaches were taken in initially defining web mining.
i. Process-centric view: web mining is defined as a sequence of tasks.
ii. Data-centric view: web mining is defined in terms of the web data used in the mining process.

3.2 MINING TECHNIQUES


The important data mining techniques applied in the web domain include Association Rule, Sequential pattern discovery, clustering, path analysis, classification and outlier discovery.

1. Association Rule Mining: Predicts the association and correlation among sets of items, where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. That is, it:
1) Discovers the correlations between pages that are most often referenced together in a single server session/user session.
2) Provides information such as:
i. What are the sets of pages frequently accessed together by web users?
ii. What page will be fetched next?
iii. What are the paths frequently accessed by web users?
3) Finds associations and correlations:
i. Page associations from usage data: user sessions, user transactions.
ii. Page associations from content data: similarity based on content analysis.
iii. Page associations based on structure: link connectivity between pages.
Advantages: Guides web site restructuring by adding links that interconnect pages often viewed together. Improves system performance by prefetching web data.
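The phrase "with a certain degree of confidence" can be made concrete with the two standard measures for a rule A -> B: support (the fraction of sessions containing both A and B) and confidence (support of A and B together divided by support of A). The sketch below computes both for pairs of pages. The session data, the thresholds, and the function names are assumptions for illustration, and it checks every pair directly rather than using an optimized algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical user sessions: the set of pages requested in each session.
SESSIONS = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
    {"/home", "/products", "/cart"},
]


def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    itemset = set(itemset)
    return sum(1 for session in sessions if itemset <= session) / len(sessions)


def page_rules(sessions, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for page pairs above both thresholds."""
    pages = sorted(set().union(*sessions))
    rules = []
    for a, b in combinations(pages, 2):
        pair_support = support({a, b}, sessions)
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, sessions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


for lhs, rhs, sup, conf in page_rules(SESSIONS):
    print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

A rule such as /cart -> /products with high confidence suggests linking the two pages directly or prefetching one when the other is requested.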

2. Sequential pattern discovery: Applied to web access server transaction logs. The purpose is to discover sequential patterns that indicate user visit patterns over a certain period, that is, the order in which URLs tend to be accessed. Advantages: Useful user trends can be discovered. Predictions concerning visit patterns can be made. Website navigation can be improved. Advertisements can be personalized. The link structure can be dynamically reorganized and web site contents adapted to individual client requirements, or clients can be given automatic recommendations that best suit their customer profiles.

3. Clustering: Groups together items (users, pages, etc.) that have similar characteristics.
a) Page clusters: groups of pages that seem to be conceptually related according to users' perception.


b) User clusters: groups of users that seem to behave similarly when navigating through a web site.

4. Classification: Maps a data item into one of several predetermined classes. Example: describing each user's category using profiles. Classification algorithms include decision trees, the naive Bayesian classifier, and neural networks.

5. Path Analysis: A technique that involves the generation of some form of graph that represents relations defined on web pages. This can be the physical layout of a web site, in which the web pages are nodes and the links between these pages are directed edges. Most graphs are used to determine frequent traversal patterns, that is, the more frequently visited paths in a web site. Example: What paths do users traverse before they go to a particular URL?

To use data mining on our web site, we have to establish and record visitor characteristics, item characteristics, and visitor interactions.

Visitor characteristics include:
i. Demographics: tangible attributes such as home address, income, property, etc.
ii. Psychographics: personality types, such as early technology interest and buying tendencies.
iii. Technographics: attributes of the visitor's system, such as operating system, browser, and modem speed.

Item characteristics include:
i. Web content information: media type, content category, URL.
ii. Product information: product category, colour, size, price.

Visitor interactions include:
i. Visitor-item interactions: purchase history, advertising history, and preference information.
ii. Visitor-site statistics: per-session characteristics, such as total time, pages viewed, and so on.

We have a lot of information about web visitors and content, but we probably are not making the best use of it. The existing OLAP systems can report only on directly observed and easily correlated information. They rely on users to discover patterns and decide what to do with them. The information is often too complex for humans to discover these patterns using an OLAP system. To solve these problems, data mining techniques are utilized.

The scope of data mining is:
i. Automated prediction of trends and behaviours.
ii. Automated discovery of previously unknown patterns.

Web mining searches for:
i. Web access patterns.
ii. Web structure.
iii. Regularity and dynamics of web contents.

Web mining research is a converging research area from several research communities, such as the database, information retrieval, and AI research communities, especially machine learning and natural language processing.

The World Wide Web is a popular and interactive medium for gathering information today. The WWW provides every Internet citizen with access to an abundance of information. Users encounter some problems when interacting with the web:
i. Finding relevant information (information overload: only a small portion of web pages contain truly relevant or useful information):
a) Low precision (the abundance problem: 99% of the information is of no interest to 99% of people), which is due to the irrelevance of many of the search results. This makes it difficult to find the relevant information.
b) Low recall (limited coverage of the web: Internet sources hidden behind search interfaces), which is due to the inability to index all the information available on the web. This makes it difficult to find the unindexed information that is relevant.
ii. Discovery of existing but hidden knowledge (only 1/3rd of the indexable web is retrieved).
iii. Personalization of the information (type and presentation of information): limited customization to individual users.
iv. Learning about customers/individual users.

v. Lack of feedback on human activities.
vi. Lack of multidimensional analysis and data mining support.
vii. The web constitutes a highly dynamic information source. Not only does the web continue to grow rapidly, the information it holds also receives constant updates. News, stock market, service centre, and corporate sites revise their web pages regularly. Linkage information and access records also undergo frequent updates.
viii. The web serves a broad spectrum of user communities. The Internet's rapidly expanding user community connects millions of workstations and serves many different usage purposes. Many users lack good knowledge of the information network's structure, are unaware of a particular search's heavy cost, frequently get lost within the web's ocean of information, and face lengthy waits to retrieve search results.
ix. Web page complexity far exceeds the complexity of any traditional text document collection. Although the web functions as a huge digital library, the pages themselves lack a uniform structure and contain far more authoring style and content variations than any set of books or traditional text-based documents. Moreover, searching it is extremely difficult.

Common problems web marketers want to solve are how to target advertisements (targeting), personalize web pages (personalization), create web pages that show products often bought together (associations), classify articles automatically (classification), characterize groups of similar visitors (clustering), estimate missing data, and predict future behaviour.

In general, web mining tasks are:
i. Mining web search engine data.
ii. Analyzing the web's link structures.
iii. Classifying web documents automatically.
iv. Mining web page semantic structure and page contents.
v. Mining web dynamics.
vi. Personalization.

Thus, web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data. Web mining aims at finding and extracting relevant information that is hidden in web-related data, in particular in text documents that are published on the web. Like data mining, it is a multidisciplinary effort that draws techniques from fields like information retrieval, statistics, machine learning, natural language processing, and others. Web mining can be a promising tool to address ineffective search engines that produce incomplete indexing, retrieve irrelevant information, or return information of unverified reliability. It is essential to have a system that helps the user find relevant and reliable information easily and quickly on the web. Web mining not only discovers information from mounds of data on the WWW, but also monitors and predicts user visit patterns. This gives designers more reliable information for structuring and designing a web site.

Given the rate of growth of the web, the scalability of search engines is a key issue, as the amount of hardware and network resources needed is large and expensive. In addition, search engines are popular tools, so they have heavy constraints on query answer time. The efficient use of resources can therefore improve both scalability and answer time. One tool to achieve this goal is web mining.


3.3 WEB MINING TAXONOMY


Web mining can be broadly divided into three distinct categories, according to the kinds of data to be mined. Figure 4 shows the taxonomy.

Figure 4: WEB MINING TAXONOMY

3.3.1 Web Content Mining

Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. Application of text mining to web content has been the most widely researched. Issues addressed in text mining include topic discovery and tracking, extracting association patterns, clustering of web documents, and classification of web pages. Research activities on this topic have drawn heavily on techniques developed in other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While there exists a significant body of work in extracting knowledge from images in the fields of image processing and computer vision, the application of these techniques to web content mining has been limited.

Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content into a representation that could be exploited by machines. The usual approach to exploiting known structure in documents is to use wrappers to map documents to some data model. Techniques using lexicons for content interpretation are yet to come. There are two groups of web content mining strategies: those that directly mine the content of documents and those that improve on the content search of other tools like search engines.

3.3.2 Web Structure Mining

The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web. It can be further divided into two kinds based on the kind of structure information used.

Hyperlinks: A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink.

Document Structure: In addition, the content within a Web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents (Wang and Liu 1998; Moh, Lim, and Ng 2000).

The World Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document.
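The observation about incoming and outgoing links can be sketched on a small link graph. The first function simply counts links; the second is a simplified, PageRank-style iteration in which a page passes a share of its score to the pages it links to. The graph, the damping value of 0.85, and the function names are assumptions for this sketch, which is not a faithful reproduction of any published ranking algorithm.

```python
# Hypothetical site graph: page -> pages it links to (every target is also a key).
LINKS = {
    "home": ["products", "about"],
    "products": ["home", "cart"],
    "about": ["home"],
    "cart": ["home"],
}


def in_and_out_links(graph):
    """Count incoming links (popularity) and outgoing links (variety) per page."""
    out_degree = {page: len(targets) for page, targets in graph.items()}
    in_degree = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    return in_degree, out_degree


def rank_pages(graph, damping=0.85, iterations=50):
    """Simplified PageRank-style scores computed by repeated redistribution."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, targets in graph.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for other in graph:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


in_degree, out_degree = in_and_out_links(LINKS)
scores = rank_pages(LINKS)
print(in_degree)
print(sorted(scores, key=scores.get, reverse=True))
```

In both views a page gains importance from the links that point to it; the iterative score additionally weights each link by the importance of the page it comes from.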
This can be compared to bibliographic citations. When a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of this information conveyed by the links to find pertinent web pages. By means of counters, higher levels cumulate the number of artefacts subsumed by the concepts they hold.

3.3.3 Web Usage Mining

Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behaviour at a web site. Web usage mining itself can be classified further depending on the kind of usage data considered:

Web Server Data: User logs are collected by the web server and typically include IP address, page reference, and access time.

Application Server Data: Commercial application servers such as WebLogic and StoryServer have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these events.

It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the categories above.

Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help us understand user behaviour and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in web usage mining, driven by the applications of the discoveries: general access pattern tracking and customized usage tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends.
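General access pattern tracking on web server data can be sketched as three steps: parse each log line into IP address, access time, and page reference; group the requests of each IP address into sessions; and count which page tends to follow which. The sketch assumes log lines in the widely used Common Log Format and a 30-minute inactivity timeout for splitting sessions; the sample lines, the timeout, and the function names are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumed input: Common Log Format lines (sample data invented for this sketch).
LOG = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 +0000] "GET /home HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:10 +0000] "GET /products HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:57:00 +0000] "GET /home HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:58:30 +0000] "GET /products HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2000:15:10:00 +0000] "GET /home HTTP/1.0" 200 2326',
]
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?:GET|POST) (?P<page>\S+)[^"]*" \d{3} \S+'
)


def parse(line):
    """Extract (ip, access time, page) from one log line, or None if malformed."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["ip"], when, match["page"]


def sessions(records, timeout=timedelta(minutes=30)):
    """Split each IP's requests into sessions wherever the gap exceeds `timeout`."""
    by_ip = defaultdict(list)
    for ip, when, page in sorted(records, key=lambda r: (r[0], r[1])):
        visits = by_ip[ip]
        if visits and when - visits[-1][-1][0] <= timeout:
            visits[-1].append((when, page))
        else:
            visits.append([(when, page)])
    return [[page for _, page in visit]
            for visits in by_ip.values() for visit in visits]


def frequent_transitions(page_sessions):
    """Count how often one page is requested directly after another."""
    return Counter(pair for s in page_sessions for pair in zip(s, s[1:]))


records = [record for record in map(parse, LOG) if record is not None]
page_sessions = sessions(records)
print(page_sessions)
print(frequent_transitions(page_sessions).most_common(3))
```

Identifying a user by IP address alone is a rough heuristic, since several users can share one address; customized usage tracking would rely on registration data or application-level events instead.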


3.4 THE AXES OF WEB MINING


3.4.1 WWW Impact

The World Wide Web has grown in the past few years from a small research community to the biggest and most popular means of communication and information dissemination. Every day, the WWW grows by roughly a million electronic pages, adding to the hundreds of millions already on-line. The WWW serves as a platform for exchanging various kinds of information, ranging from research papers and educational content to multimedia content and software. The continuous growth in the size and the use of the WWW calls for new methods for processing these huge amounts of data. Because of its rapid and chaotic growth, the resulting network of information lacks organization and structure. Moreover, the content is published in various diverse formats.

3.4.2 Web data

Web data are those that can be collected and used in the context of Web personalization. These data are classified into four categories.

Content data are presented to the end-user appropriately structured. They can be simple text, images, or structured data, such as information retrieved from databases.

Structure data represent the way content is organized. They can be either data entities used within a Web page, such as HTML or XML tags, or data entities used to put a Web site together, such as hyperlinks connecting one page to another.

Usage data represent a Web site's usage, such as a visitor's IP address, time and date of access, complete path accessed, referrer's address, and other attributes that can be included in a Web access log.

User profile data provide information about the users of a Web site. A user profile contains demographic information for each user of a Web site, as well as information about users' interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.


3.5 WEB MINING PROS AND CONS


PROS Web mining essentially has many advantages which makes this technology attractive to corporations including the government agencies. This technology has enabled ecommerce to do personalized marketing, which eventually results in higher trade volumes. The government agencies are using this technology to classify threats and fight against terrorism. The predicting capability of the mining application can benefits the society by identifying criminal activities. The companies can establish better customer relationship by giving them exactly what they need. Companies can understand the needs of the customer better and they can react to customer needs faster. The companies can find, attract and retain customers; they can save on production costs by utilizing the acquired insight of customer requirements. They can increase profitability by target pricing based on the profiles created. They can even find the customer who might default to a competitor the company will try to retain the customer by providing promotional offers to the specific customer, thus reducing the risk of losing a customer or customers. CONS Web mining, itself, doesnt create issues, but this technology when used on data of personal nature might cause concerns. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent. The obtained data will be analyzed, and clustered to form profiles; the data will be made anonymous before clustering so that there are no personal profiles. Thus these applications de-individualize the users by judging them by their mouse clicks. De-individualization, can be defined as a tendency of judging and treating people on the basis of group characteristics instead of on their own individual characteristics and merits. 
Another important concern is that companies collecting data for a specific purpose might use the data for a totally different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies which buy the data are obliged to make it anonymous, and these companies are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of the release, and any inaccuracies in the release may result in serious lawsuits, but there is no law preventing them from trading the data.

Some mining algorithms might use controversial attributes like sex, race, religion, or sexual orientation to categorize individuals. These practices might be against anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the usage of such algorithms with such attributes. This process could result in the denial of a service or a privilege to an individual based on race, religion or sexual orientation. At present, this situation can be avoided only through high ethical standards maintained by the data mining company. The collected data is made anonymous so that the obtained data and the obtained patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy, but in fact much additional information can be inferred by an application that combines two separate pieces of data about the user.


4. WEB CONTENT MINING


Web content mining is the process of extracting useful information from the content of Web documents. Semi-structured Web page text contains logical structure, semantic content and layout. Topic discovery, extraction of association patterns, clustering of Web documents and classification of Web pages are some of the research issues in text mining. These activities use techniques from other disciplines such as IR (information retrieval), IE (information extraction) and NLP (natural language processing).

Automatic extraction of semantic relations and structures from the Web is a growing application of Web content mining. Several kinds of algorithms are used in this area: hierarchical clustering of terms to create concept hierarchies, formal concept analysis and association rule mining to learn generalized conceptual relations, and automatic extraction of structured data records from semi-structured HTML pages. The primary goal of each algorithm is to create a set of formally defined domain ontologies that represent Web site content. Common representation approaches are vector-space models, description logics, first-order logic, relational models and probabilistic relational models.

Structured data extraction is one of the most widely studied research topics of Web content mining. Structured data on the Web are often very important because they represent the essential information of their host pages. Extracting such data allows one to provide value-added services, e.g. shopping and meta-search. Structured data is also easier to extract than unstructured text. This problem has been studied by researchers in AI, databases and data mining.

Web content mining is the discovery of useful information from Web contents, data and documents; that is, the application of data mining techniques to content published on the Internet. The Web contains many kinds and types of data. Basically, Web content consists of several types of data, such as plain text (unstructured), images, audio, video and metadata, as well as HTML (semi-structured) or XML (structured) documents, dynamic documents and multimedia documents. Recent research on mining multiple types of data is termed multimedia data mining. Thus we could consider multimedia data mining as an instance of web content mining. The research around applying data mining techniques to unstructured text is termed knowledge discovery in texts, text data mining, or text mining. Hence we could consider text mining as an instance of web content mining. Research issues addressed in text mining are: topic discovery, extracting association patterns, clustering of web documents and classification of web pages.
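The vector-space representation mentioned above can be illustrated with a minimal sketch. It assumes a plain tf-idf weighting and cosine similarity, which the text does not prescribe; the function names and the three sample pages are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (weight = tf * idf)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical page texts used only for illustration.
pages = [
    "cheap laptop deals and laptop reviews",
    "laptop reviews and buying guide",
    "football scores and match reports",
]
vecs = tfidf_vectors(pages)
sim_01 = cosine(vecs[0], vecs[1])  # two pages about laptops
sim_02 = cosine(vecs[0], vecs[2])  # unrelated topics
```

Pages that share informative terms receive a higher similarity than pages that share only terms occurring everywhere, which is the property clustering and classification of Web documents rely on.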

4.1 ISSUES IN WEB CONTENT MINING:


1. developing intelligent tools for information retrieval
2. finding keywords and key phrases
3. discovering grammatical rules collections
4. hypertext classification/categorization
5. extracting key phrases from text documents
6. learning extraction rules
7. hierarchical clustering
8. predicting relationships

4.2 WEB CONTENT MINING APPROACHES:

[Figure: taxonomy of content mining approaches]
Agent Based Approach
1. Intelligent Search Agents
2. Information Filtering/Categorization
3. Personalized Web Agents
Data Base Approach
1. Multilevel Databases
2. Web Query Systems

4.2.1 AGENT BASED APPROACHES:

The agent-based approach involves AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information. Agent-based approaches focus on intelligent and autonomous web mining tools based on agent technology.
i. Some intelligent web agents use a user profile to search for relevant information, then organize and interpret the discovered information. Example: Harvest.
ii. Some use various information retrieval techniques and the characteristics of open hypertext documents to organize and filter retrieved information. Example: HyPursuit.
iii. Some learn user preferences and use those preferences to discover information sources for those particular users.

Agent-based Web mining systems can be placed into the following three categories:

Intelligent Search Agents: Several intelligent Web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest, FAQ-Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources, to retrieve and interpret documents. Agents such as ShopBot and ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy.

Information Filtering/Categorization: A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. HyPursuit uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents and to structure an information space. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information.

Personalized Web Agents: This category of Web agents learns user preferences and discovers Web information sources based on these preferences and on those of other individuals with similar interests.

4.2.2 DATA BASE APPROACHES:
The database approach focuses on integrating and organizing the heterogeneous and semi-structured data on the Web into more structured, higher-level collections of resources. Metadata, or generalizations, extracted from the Web are organized into these structured collections, which can then be accessed and analyzed.


Database approaches to Web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze it.

Multilevel Databases: The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher levels, metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e. relational or object-oriented databases. For example, Han et al. use a multilayered database where each layer is obtained via generalization and transformation operations performed on the lower layers. Khosla et al. propose the creation and maintenance of meta-databases at each information-providing domain and the use of a global schema for the meta-database. Another proposal is the incremental integration of a portion of the schema from each information source, rather than reliance on a global heterogeneous database schema. The ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views.

Web Query Systems: Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries that are used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques. WebLog is a logic-based query language for restructuring information extracted from Web information sources. Lorel and UnQL query heterogeneous and semi-structured information on the Web using a labelled graph data model. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.
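The idea of combining a structure query with a content query through a standard database language can be sketched as follows. This is not W3QL or any of the systems named above; it is only an illustration that assumes pages and hyperlinks have already been loaded into two relational tables. The table names, columns, URLs and the helper function are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a structured collection of Web resources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE links (src TEXT, dst TEXT);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("http://example.org/", "Home", "welcome to the example site"),
    ("http://example.org/mining", "Web mining", "notes on web usage mining"),
    ("http://example.org/contact", "Contact", "how to reach us"),
])
conn.executemany("INSERT INTO links VALUES (?, ?)", [
    ("http://example.org/", "http://example.org/mining"),
    ("http://example.org/", "http://example.org/contact"),
])

def linked_pages_mentioning(conn, start_url, keyword):
    """Structure condition (linked from start_url) joined with a content
    condition (body contains keyword), expressed in plain SQL."""
    rows = conn.execute(
        """
        SELECT p.url
        FROM links AS l
        JOIN pages AS p ON p.url = l.dst
        WHERE l.src = ? AND p.body LIKE ?
        ORDER BY p.url
        """,
        (start_url, f"%{keyword}%"),
    )
    return [url for (url,) in rows]

result = linked_pages_mentioning(conn, "http://example.org/", "mining")
```

Once the semi-structured data has been organized into tables, the structural part of a query becomes a join over the link table and the content part becomes an ordinary predicate on the page text.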

4.3 WEB CONTENT MINING TASK


4.3.1 Structured Data Extraction
This is perhaps the most widely studied research topic of Web content mining. One of the reasons for its importance and popularity is that structured data on the Web are often very important, as they represent the essential information of their host pages, e.g., lists of products and services. Extracting such data allows one to provide value-added services, e.g., comparative shopping and meta-search. Structured data is also easier to extract than unstructured text. This problem has been studied by researchers in the AI, database, data mining, and Web communities. There are several approaches to structured data extraction, which is also called wrapper generation. The first approach is to manually write an extraction program for each Web site based on observed format patterns of the site. This approach is very labour intensive and time consuming, and thus does not scale to a large number of sites. The second approach is wrapper induction or wrapper learning, which is currently the main technique. Wrapper learning works as follows: the user first manually labels a set of training pages. A learning system then generates rules from the training pages. The resulting rules are then applied to extract target items from Web pages. The third approach is the automatic approach. Since structured data objects on the Web are normally database records retrieved from underlying databases and displayed in Web pages with some fixed templates, automatic methods aim to find patterns or grammars in the Web pages and then use them to extract data.

4.3.2 Unstructured Text Extraction
Most Web pages can be seen as text documents. Extracting information from Web documents has also been studied by many researchers. The research is closely related to text mining, information retrieval and natural language processing. Current techniques are mainly based on machine learning and natural language processing to learn extraction rules. Recently, a number of researchers have also made use of common language patterns (common sentence structures used to express certain facts or relations) and the redundancy of information on the Web to find concepts, relations among concepts and named entities. The patterns can be automatically learnt or supplied by human users.
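As a minimal sketch of a human-supplied language pattern, the following example assumes the sentence structure "<concept> such as <instance>, <instance> and <instance>" and uses it to collect (instance, concept) pairs. The pattern, the function name and the sample text are illustrative assumptions, not a method taken from the literature discussed here, and a practical system would need many more patterns and some filtering.

```python
import re

# Hand-supplied pattern: a concept word, the phrase "such as", then one or
# more instance words separated by ", ", " and " or " or ".
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:(?:, | and | or )\w+)*)")

def extract_instances(text):
    """Return (instance, concept) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1)
        for instance in re.split(r", | and | or ", match.group(2)):
            pairs.append((instance, concept))
    return pairs

# Hypothetical text used only for illustration.
text = ("Many users prefer browsers such as Firefox, Opera and Chrome. "
        "Formats such as HTML or XML are common.")
pairs = extract_instances(text)
```

Because the same fact tends to be stated on many pages, even a simple pattern like this one can be applied across a large collection and the extracted pairs counted, which is how the redundancy of the Web compensates for the weakness of any single match.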
Another direction of research in this area is Web question-answering. Although question-answering was first studied in the information retrieval literature, it has become very important on the Web, because the Web offers the largest source of information and the objective of many Web search queries is to obtain answers to simple questions. Question-answering is extended to the Web by query transformation, query expansion, and then selection.

4.3.3 Web Information Integration


Due to the sheer scale of the Web and its diverse authorship, various Web sites may use different syntaxes to express similar or related information. In order to make use of or to extract information from multiple sites to provide value-added services, e.g., meta-search, deep Web search, etc., one needs to semantically integrate information from multiple sources. Recently, several researchers have attempted this task. Two popular problems related to the Web are (1) Web query interface integration, to enable querying multiple Web databases, and (2) schema matching, e.g., integrating Yahoo's and Google's directories to match concepts in the hierarchies. The ability to query multiple deep Web databases is attractive and interesting because the deep Web contains a huge amount of information or data that is not indexed by general search engines.

4.3.4 Building Concept Hierarchies
Because of the huge size of the Web, organization of information is obviously an important issue. Although it is hard to organize the whole Web, it is feasible to organize the Web search results of a given query. A linear list of ranked pages produced by search engines is insufficient for many applications. The standard method for information organization is concept hierarchy and/or categorization. The popular technique for hierarchy construction is text clustering, which groups similar search results together in a hierarchical fashion. An alternative approach does not use clustering. Instead, it exploits existing organizational structures in the original Web documents, emphasizing tags and language patterns, to perform data mining to find important concepts, sub-concepts and their hierarchical relationships. In other words, it makes use of the information redundancy property and the semi-structured nature of the Web to find which concepts are important and what their relationships might be. This work aims to compile a survey article or a book on the Web automatically.
4.3.5 Segmenting Web Pages & Detecting Noise
A typical Web page consists of many blocks or areas, e.g., main content areas, navigation areas, advertisements, etc. It is useful to separate these areas automatically for several practical applications. For example, in Web data mining, e.g., classification and clustering, identifying main content areas or removing noisy blocks (e.g., advertisements, navigation panels, etc.) enables one to produce much better results. It has been shown that the information contained in noisy blocks can seriously harm Web data mining. Another application is Web browsing using a small-screen device, such as a PDA. Identifying different content blocks allows one to re-arrange the layout of the page so that the main contents can be seen easily without losing any other information from the page.

4.3.6 Mining Web Opinion Sources
Consumer opinions used to be very difficult to obtain before the Web was available. Companies usually conducted consumer surveys or engaged external consultants to find such opinions about their products and those of their competitors. Now much of this information is publicly available on the Web. There are numerous Web sites and pages containing consumer opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. This online word-of-mouth behaviour represents a new and measurable source of information for marketing intelligence. Techniques are now being developed to exploit these sources to help companies and individuals gain such information effectively and easily. For instance, one proposed feature-based summarization method automatically analyzes consumer opinions in customer reviews from online merchant sites and dedicated review sites. The result of such a summary is useful to both potential customers and product manufacturers.
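A feature-based opinion summary can be sketched very roughly as follows. The text does not describe the actual method, so this example substitutes a deliberately simple technique: a fixed list of product features and a toy lexicon of opinion words, both hypothetical, with each sentence counted as positive or negative for every feature it mentions.

```python
import re
from collections import defaultdict

FEATURES = {"battery", "screen", "price"}   # hypothetical product features
POSITIVE = {"great", "good", "excellent"}   # toy opinion lexicon (assumption)
NEGATIVE = {"poor", "bad", "terrible"}

def summarize(reviews):
    """Count positive and negative sentences per mentioned feature."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for review in reviews:
        for sentence in re.split(r"[.!?]", review.lower()):
            words = set(re.findall(r"[a-z]+", sentence))
            for feature in FEATURES & words:
                if words & POSITIVE:
                    summary[feature]["positive"] += 1
                if words & NEGATIVE:
                    summary[feature]["negative"] += 1
    return dict(summary)

# Hypothetical customer reviews used only for illustration.
reviews = [
    "The battery is great. The screen is poor.",
    "Good price! Terrible battery.",
]
summary = summarize(reviews)
```

The resulting per-feature counts are the kind of summary that lets a potential customer or a manufacturer see at a glance which aspects of a product reviewers praise or criticize.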


5. WEB STRUCTURE MINING


Web Structure Mining operates on the Web's hyperlink structure. This graph structure can provide information about page ranking or authoritativeness and enhance search results through filtering; that is, it tries to discover the model underlying the link structures of the Web. This model is used to analyze the similarity and relationship between different web sites. Web structure mining uses the hyperlink structure of the Web as an additional information source. This type of mining can be further divided into two kinds based on the kind of structural data used.
a) HYPERLINKS: A hyperlink is a structural unit that connects a web page to a different location, either within the same web page (intra-document hyperlink) or in a different web page (inter-document hyperlink).
b) DOCUMENT STRUCTURE: In addition, the content within a web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents.

Web link analysis is used for:
1. ordering documents matching a user query (ranking)
2. deciding what pages to add to a collection
3. page categorization
4. finding related pages
5. finding duplicated web sites
6. finding the similarity between web sites

Web structure mining uses the hyperlink structure of the Web to yield useful information, including definitive page specification, hyperlinked community identification, Web page categorization and Web site completeness evaluation. Web structure mining can be divided into two categories based on the kind of structured data used:


1. Web graph mining: The Web provides additional information about how different documents are connected to each other via hyperlinks. The Web can be viewed as a (directed) graph whose nodes are Web pages and whose edges are the hyperlinks between them.
2. Deep Web mining: The Web also contains a vast amount of non-crawlable content. This hidden part of the Web is referred to as the deep Web or the hidden Web. Compared to the static surface Web, the deep Web contains a much larger amount of high-quality structured information.

Most mining algorithms that improve the performance of Web search are based on two assumptions. (a) Hyperlinks convey human endorsement. If there exists a link from page A to page B, and these two pages are authored by different people, then the first author found the second page valuable. Thus the importance of a page can be propagated to the pages it links to. (b) Pages that are co-cited by a certain page are likely to be related to the same topic. The popularity or importance of a page is correlated to some extent with the number of incoming links, and related pages tend to be clustered together through dense linkages among them.

Web information extraction has the goal of pulling out information from a collection of Web pages and converting it to a homogeneous form that is more readily digested and analyzed by both humans and machines. The result of IE could be used to improve the indexing process, because IE removes irrelevant information in Web pages and facilitates other advanced search functions due to the structured nature of the data. It is usually difficult or even impossible to directly obtain the structures of the Web sites' backend databases without cooperation from the sites. Instead, the sites present two other distinguishing structures: the interface schema and the result schema. The interface schema is the schema of the query interface, which exposes attributes that can be queried in the backend database. The result schema is the schema of the query results, which exposes attributes that are shown to users.
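Assumption (a) above, that importance propagates along hyperlinks, can be illustrated with a small sketch. The text does not name a specific algorithm; the example below uses a simplified PageRank-style iteration as a stand-in. The damping factor of 0.85, the iteration count, the function name and the four-page graph are all assumptions chosen for illustration.

```python
def link_scores(graph, damping=0.85, iterations=50):
    """Propagate importance along hyperlinks.

    graph maps each page to the list of pages it links to.
    Returns a dict of scores that sum to 1.
    """
    pages = set(graph) | {dst for targets in graph.values() for dst in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of the total score.
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = graph.get(src, [])
            if targets:
                # A page passes its score in equal parts to the pages it links to.
                share = damping * score[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # A page without out-links spreads its score over all pages.
                for p in pages:
                    new[p] += damping * score[src] / n
        score = new
    return score

# Hypothetical Web graph: A, B and D all link to C; C links back to A.
web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = link_scores(web)
best = max(scores, key=scores.get)
```

In this graph page C receives links from three different pages and therefore ends up with the highest score, while A benefits from being the only page that C endorses.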


6. WEB USAGE MINING


Web Usage Mining is a part of Web Mining, which, in turn, is a part of Data Mining. As Data Mining involves extracting meaningful and valuable information from large volumes of data, Web Usage Mining involves mining the usage characteristics of the users of Web applications. The extracted information can then be used in a variety of ways, such as improving the application, detecting fraudulent elements, etc. Web Usage Mining is often regarded as a part of the Business Intelligence of an organization rather than as a technical aspect. It is used for deciding business strategies through the efficient use of Web applications. It is also crucial for Customer Relationship Management (CRM), as it can ensure customer satisfaction as far as the interaction between the customer and the organization is concerned. The major problem with Web Mining in general, and Web Usage Mining in particular, is the nature of the data they deal with. With the upsurge of the Internet in this millennium, Web data has become huge in volume, and a lot of transactions and usages take place every second. Apart from its volume, the data is not completely structured. It is in a semi-structured format, so it needs a lot of preprocessing and parsing before the actual extraction of the required information. We have taken up a small part of the Web Usage Mining process, which involves the Preprocessing, User Identification, Bot removal and Analysis of the

6.1 WEB USAGE MINING ARCHITECTURE


The general architecture for the Web usage mining process divides it into two main parts, and the WEBMINER system implements parts of this general architecture. The first part includes the domain-dependent processes of transforming the Web data into a suitable transaction form. This includes preprocessing, transaction identification, and data integration components. The second part includes the largely domain-independent application of generic data mining and pattern matching techniques (such as the discovery of association rules and sequential patterns) as part of the system's data mining engine.

Data cleaning is the first step performed in the Web usage mining process. Some low-level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After the data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match, so that any number of modules can be combined in any order, as the data analyst sees fit. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task. For instance, the format of the data for the association rule discovery task may be different from the format necessary for mining sequential patterns. Finally, a query mechanism allows the user (analyst) to exert more control over the discovery process by specifying various constraints.
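One way to divide a user's log entries into smaller transactions is a time-gap heuristic, sketched below. The text does not fix a particular transaction identification module, so the 30-minute timeout, the tuple layout of the entries and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def identify_transactions(entries, timeout=1800):
    """Split each user's page references into transactions.

    entries: iterable of (user, timestamp_in_seconds, url) tuples.
    A new transaction starts when two consecutive references by the
    same user are more than `timeout` seconds apart.
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, url))

    transactions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current transaction.
                transactions.append((user, current))
                current = []
            current.append(url)
        transactions.append((user, current))
    return transactions

# Hypothetical cleaned log entries: (client IP, seconds since start, URL).
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.2", 30, "/"),
    ("10.0.0.1", 4000, "/contact"),
]
result = identify_transactions(log)
```

Because the function consumes and produces plain lists of references per user, its output could be passed to a further module that merges or splits transactions, in line with the matching input and output formats described above.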

6.2 WEB DATA


In Web Usage Mining, data can be collected in server logs, browser logs, proxy logs, or obtained from an organization's database. These data collections differ in terms of the location of the data source, the kinds of data available, the segment of the population from which the data was collected, and the methods of implementation. There are many kinds of data that can be used in Web Mining.
1. Content: The visible data in the Web pages, or the information which was meant to be imparted to the users. A major part of it consists of text and graphics (images).
2. Structure: Data which describes the organization of the website. It is divided into two types. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. The principal kind of inter-page structure information is the hyperlinks used for site navigation.
3. Usage: Data that describes the usage patterns of Web pages, such as IP addresses, page references, the date and time of accesses, and various other information depending on the log format.


6.3 DATA SOURCES


The data sources used in Web Usage Mining may include web data repositories such as:

1. WEB SERVER LOGS: These are logs which maintain a history of page requests. The W3C maintains a standard format for web server log files, but other proprietary formats exist. More recent entries are typically appended to the end of the file. Information about the request, including client IP address, request date/time, page requested, HTTP code, bytes served, user agent, and referrer, is typically recorded. These data can be combined into a single file, or separated into distinct logs, such as an access log, error log, or referrer log. However, server logs typically do not collect user-specific information. These files are usually not accessible to general Internet users, only to the webmaster or other administrative staff. A statistical analysis of the server log may be used to examine traffic patterns by time of day, day of week, referrer, or user agent. Efficient web site administration, adequate hosting resources and the fine tuning of sales efforts can be aided by analysis of the web server logs. Marketing departments of any organization that owns a website should be trained to understand these powerful tools.

A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behaviour of site visitors. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats, such as the Common log or Extended log formats. However, the site usage data recorded by server logs may not be entirely reliable due to the presence of various levels of caching within the Web environment. Cached page views are not recorded in a server log. In addition, any important information passed through the POST method will not be available in a server log. Packet sniffing technology is an alternative method of collecting usage data. Packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets.

The Web server can also store other kinds of usage information, such as cookies and query data, in separate logs. Cookies are tokens generated by the Web server for individual client browsers in order to automatically track the site visitors. Tracking individual users is not an easy task due to the stateless connection model of the HTTP protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns regarding user privacy. Query data is also typically generated by online visitors while searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information. The Web server also relies on other utilities, such as CGI scripts, to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server sends the output of the CGI application back to the browser.

2. PROXY SERVER LOGS: A Web proxy is a caching mechanism that lies between client browsers and Web servers, acting as an intermediate level of caching. Proxy caching helps to reduce the loading time of Web pages experienced by users as well as the network traffic load at the server and client sides. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy server logs (proxy traces) contain the actual HTTP requests from multiple clients to multiple Web servers. They may serve as a data source for characterizing the browsing behaviour of a group of anonymous users sharing a common proxy server.

3. BROWSER LOGS: Various browsers, such as Mozilla and Internet Explorer, can be modified, or JavaScript and Java applets can be used, to collect client-side data.
Client-side data collection can be implemented by using a remote agent, such as JavaScript or Java applets, or by modifying the source code of an existing browser, such as Mosaic or Mozilla, to enhance its data collection capabilities. The implementation of client-side data collection requires user cooperation, either in enabling the functionality of the JavaScript and Java applets, or in voluntarily using the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates the caching, bot identification and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page. In fact, they may incur some additional overhead, especially when the Java applet is loaded for the first time. JavaScript, on the other hand, consumes little interpretation time but cannot capture all user clicks (such as the reload or back buttons). These methods collect only single-user, single-site browsing behaviour. A modified browser is much more versatile and allows data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing users to use the browser for their daily browsing activities. This can be done by offering incentives to users who are willing to use the browser, similar to the incentive programs offered by companies such as NetZero and AllAdvantage that reward users for clicking on banner advertisements while surfing the Web.
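The fields that a Web server log typically records (client IP address, request date/time, page requested, HTTP code, bytes served, referrer and user agent) can be read from a log line with a small parser. The sketch below assumes the widely used Common Log Format with the optional referrer and user-agent fields appended (often called the combined format); actual server configurations may differ. The regular expression, the function name and the sample line are illustrative.

```python
import re
from datetime import datetime

# One entry: ip ident user [time] "method path protocol" status size,
# optionally followed by "referrer" "user agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means that no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical log entry used only for illustration.
line = ('192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')
record = parse_line(line)
```

Lines that do not match the assumed format yield None, so malformed entries can be counted or discarded during data cleaning rather than stopping the analysis.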

6.4 INFORMATION OBTAINED


1. Number of Hits: This number usually signifies the number of times any resource is accessed in a Website. A hit is a request to a web server for a file (web page, image, JavaScript, Cascading Style Sheet, etc.). When a web page is downloaded from a server, the number of "hits" or "page hits" is equal to the number of files requested. Therefore, one page load does not always equal one hit, because pages are often made up of images and other files which add to the number of hits counted.
2. Number of Visitors: A "visitor" is exactly what it sounds like: a human who navigates to your website and browses one or more pages on your site.
3. Visitor Referring Website: The referring website information gives the URL of the website that referred the visitor to the website under consideration.
4. Visitor Referral Website: The referral website information gives the URL of the website to which the website under consideration refers the visitor.
5. Time and Duration: This information in the server logs gives the time at which, and the duration for which, the Website was accessed by a particular user.

6. Path Analysis: The path a particular user has followed in accessing the contents of a Website.
7. Visitor IP Address: The Internet Protocol (IP) address of each visitor to the Website under consideration.
8. Browser Type: The type of browser that was used for accessing the Website.
9. Cookies: A cookie is a message given to a Web browser by a Web server. The browser stores the message in a text file, and the message is then sent back to the server each time the browser requests a page from that server. The main purpose of cookies is to identify users and possibly prepare customized Web pages for them. When you enter a Web site that uses cookies, you may be asked to fill out a form providing information such as your name and interests. This information is packaged into a cookie and sent to your Web browser, which stores it for later use. The next time you go to the same Web site, your browser will send the cookie to the Web server. The server can use this information to present you with custom Web pages. So, for example, instead of seeing just a generic welcome page you might see a welcome page with your name on it.
10. Platform: The type of operating system (and related platform details) used to access the Website.
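
Several of the items above (hits, visitor IP address, referring website, browser type, time of access) can be read directly from a server log line. The following is a minimal sketch, assuming the log is written in the NCSA combined log format; the sample lines, the regular expression, and the name `parse_line` are made up for illustration and are not taken from any particular analyzer.

```python
import re
from collections import Counter

# Pattern for one line of the NCSA combined log format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample lines, invented for this example.
sample_log = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08"',
    '192.0.2.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512 "http://example.com/index.html" "Mozilla/4.08"',
    '192.0.2.7 - - [10/Oct/2000:14:01:02 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Opera/6.0"',
]

records = [r for r in map(parse_line, sample_log) if r is not None]
total_hits = len(records)                        # every requested file counts as one hit
hits_per_ip = Counter(r["ip"] for r in records)  # IP address as a rough proxy for a visitor
browsers = Counter(r["agent"] for r in records)  # browser type taken from the agent field
```

Note that the first two sample lines come from a single page load (the page and its image), which is why one page view produces two hits.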

6.5 POSSIBLE ACTIONS


1. Shortening Paths of High Visit Pages: Frequently accessed pages are usually reached by following a particular path. These pages can be placed in an easily accessible part of the Website, thus decreasing the navigation path length.
2. Eliminating or Combining Low Visit Pages: Pages that are not frequently accessed by users can either be removed or have their content merged into frequently accessed pages.


3. Redesigning Pages to Help User Navigation: The information obtained can be used to redesign the structure of the Website so that users can navigate through it in the best possible manner.
4. Redesigning Pages for Search Engine Optimization: The content and other information in the website can be improved by analyzing user patterns, and this information can be used to redesign pages for Search Engine Optimization so that search engines index the website at a proper rank.
5. Helping to Evaluate the Effectiveness of Advertising Campaigns: Important and business-critical advertisements can be placed on pages that are frequently accessed.

6.6 WEB USAGE MINING PROCESS


6.6.1 PREPROCESSING

Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that will be more easily and effectively processed for the purpose of the user. The different types of preprocessing in Web Usage Mining are:

1. Usage Pre-Processing: Pre-processing relating to the usage patterns of users. Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the incompleteness of the available data. Unless a client-side tracking mechanism is used, only the IP address, agent, and server-side click-stream are available to identify users and server sessions. Some of the typically encountered problems are:

Single IP address/Multiple Server Sessions - Internet service providers (ISPs) typically have a pool of proxy servers through which users access the Web. A single proxy server may have several users accessing a Web site, potentially over the same time period.
Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.


Multiple IP address/Single User - A user who accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.
Multiple Agent/Single User - Again, a user who uses more than one browser, even on the same machine, will appear as multiple users.

Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty-minute timeout is often used as the default method of breaking a user's click-stream into sessions. When a session ID is embedded in each URI, the definition of a session is set by the content server. While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI.

The final problem encountered when preprocessing usage data is that of inferring cached page references. The only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.

As an illustration, consider a sample server log in which IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.45.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 of the log can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session, giving A-B-F-O-F-B-G, and one reference to the third session, giving A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.

2. Content Pre-Processing: Pre-processing of the content accessed. Content preprocessing consists of converting the text, images, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process.
Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of these uses. The intended use of a page view can also filter the sessions before or after pattern discovery.

In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model is typically used to accomplish this. Text files can be broken up into vectors of words. Keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running additional algorithms as desired. Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site. Also, the content may be revised on a regular basis. The content of each page view to be preprocessed must be "assembled", either by an HTTP request from a crawler, or by a combination of template, script, and database accesses.
If only the portion of page views that are accessed is preprocessed, the output of any classification or clustering algorithms may be skewed.

3. Structure Pre-Processing: Pre-processing related to the structure of the website. The structure of a site is created by the hypertext links between page views. The structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore dynamic links) poses more problems than static page views. A different site structure may have to be constructed for each server session.
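
The thirty-minute timeout rule and the path completion step described under usage preprocessing can be sketched as follows. This is only an illustration under stated assumptions: the function names `split_sessions` and `complete_path`, the timestamps, and the link structure `site_links` are hypothetical, with the links chosen so that the sessions A-B-F-O-G and A-B-C-J from the text can be completed.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # the default timeout mentioned in the text

def split_sessions(requests, timeout=TIMEOUT):
    """Split one user's time-ordered (timestamp, page) requests into sessions."""
    sessions = []
    last_time = None
    for timestamp, page in requests:
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])          # gap too long: start a new session
        sessions[-1].append(page)
        last_time = timestamp
    return sessions

def complete_path(session, links):
    """Insert cached page views implied by back-button use and the site links."""
    completed, stack = [], []            # stack simulates the browser's back history
    for page in session:
        # Most recent page in the history that has a link to the requested page.
        depth = next((i for i in range(len(stack) - 1, -1, -1)
                      if page in links.get(stack[i], ())), None)
        if depth is not None:
            while len(stack) - 1 > depth:
                stack.pop()
                completed.append(stack[-1])  # inferred cached view, absent from the log
        stack.append(page)
        completed.append(page)
    return completed

# Hypothetical timestamps and link structure, invented for this example.
start = datetime(2000, 1, 1, 12, 0)
clicks = [(start, "A"), (start + timedelta(minutes=5), "B"),
          (start + timedelta(minutes=50), "L")]
sessions = split_sessions(clicks)

site_links = {"A": {"B", "C"}, "B": {"F", "G"}, "F": {"O"}, "C": {"J"}}
first = complete_path(list("ABFOG"), site_links)
third = complete_path(list("ABCJ"), site_links)
```

With these assumed links, `first` is A-B-F-O-F-B-G and `third` is A-B-A-C-J, the completed paths given in the text. A request that no earlier page links to (for example a typed URL) is left as it is.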


6.6.2 PATTERN DISCOVERY

Web Usage Mining can be used to uncover patterns in server logs but is often carried out only on samples of data. The mining process will be ineffective if the samples are not a good representation of the larger body of data. Pattern discovery draws upon methods and algorithms developed in several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult the relevant literature in those fields. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed in other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).

1. Statistical Analysis: Statistical techniques are the most common method of extracting knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, the average view time of a page or the average length of a path through a site. This report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URI. Despite lacking depth in its analysis, this type of knowledge can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.


2. Association Rules: Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm may reveal a correlation between users who visited a page containing electronic products and those who accessed a page about sporting equipment. Aside from being applicable to business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.

3. Clustering: Clustering is a technique for grouping together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.

4. Classification: Classification is the task of mapping a data item into one of several predefined classes.
In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbour classifiers, Support Vector Machines, etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.

5. Sequential Patterns: The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns, which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, and similarity analysis.

6. Dependency Modeling: Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested in building a model representing the different stages a visitor undergoes while shopping in an online store, based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behaviour of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behaviour of users but is also potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience of users.

6.6.3 PATTERN ANALYSIS

This is the final step in the Web Usage Mining process. After preprocessing and pattern discovery, the obtained usage patterns are analyzed to filter out uninteresting information and extract the useful information. Methods such as SQL (Structured Query Language) processing and OLAP (Online Analytical Processing) can be used.
The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colours to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
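
The interplay between pattern discovery and pattern analysis can be illustrated with association rules over server sessions. The sketch below is a deliberate simplification, not the Apriori algorithm itself: it only considers pairs of pages, applies a support threshold (discovery), and then drops rules below a confidence threshold (analysis). The function name `page_pair_rules`, the thresholds, and the sample sessions are assumptions made up for the example.

```python
from itertools import combinations
from collections import Counter

def page_pair_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) rules for page pairs."""
    n = len(sessions)
    page_sets = [set(s) for s in sessions]       # association rules ignore visit order
    page_count = Counter(p for s in page_sets for p in s)
    pair_count = Counter(pair for s in page_sets
                         for pair in combinations(sorted(s), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                      # fraction of sessions containing both pages
        if support < min_support:
            continue                             # discovery: enforce the support threshold
        for x, y in ((a, b), (b, a)):
            confidence = count / page_count[x]   # share of sessions with x that also contain y
            if confidence >= min_confidence:     # analysis: filter out weak rules
                rules.append((x, y, support, confidence))
    return sorted(rules)

# Hypothetical sessions, invented for this example.
example_sessions = [
    ["/electronics", "/sports", "/cart"],
    ["/electronics", "/sports"],
    ["/electronics", "/home"],
    ["/sports", "/electronics"],
]
rules = page_pair_rules(example_sessions)
```

Here only the pair of /electronics and /sports survives the support threshold; the pairs involving /cart and /home occur in a single session each and are discarded.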

6.7 WEB USAGE MINING AREAS


1. Personalization
2. System Improvement
3. Site Modification
4. Business Intelligence
5. Usage Characterization

6.8 WEB USAGE MINING APPLICATIONS


1. LETIZIA: Letizia is an application that assists a user browsing the Internet. As the user operates a conventional Web browser such as Mozilla, the application tracks usage patterns and attempts to predict items of interest by performing concurrent and autonomous exploration of links from the user's current position. The application uses a best-first search augmented by heuristics that infer user interest from browsing behaviour.

2. WEBSIFT: The WebSIFT (Web Site Information Filter) system is another application that performs Web Usage Mining on server logs recorded in the extended NCSA format (which includes referrer and agent fields). This format is quite similar to the combined log format used for DSpace log files. The preprocessing algorithms include identifying users and server sessions, and identifying cached page references through the use of the referrer field. The system identifies interesting information and frequent item sets by mining usage data.


3. ADAPTIVE WEBSITES An adaptive website adjusts the structure, content, or presentation of information in response to measured user interaction with the site, with the objective of optimizing future user interactions. Adaptive websites are web sites that automatically improve their organization and presentation by learning from their user access patterns. User interaction patterns may be collected directly on the website or may be mined from Web server logs. A model or models are created of user interaction using artificial intelligence and statistical methods. The models are used as the basis for tailoring the website for known and specific patterns of user interaction.

6.9 ANALYSIS OF WEB SERVER LOGS


We used different web server log analyzers, such as Web Expert Lite 6.1 and Analog 6.0, to analyze various sample web server logs. The key information obtained was:
Hits: Total Hits, Visitor Hits, Average Hits per Day, Average Hits per Visitor, Failed Requests.
Page Views: Total Page Views, Average Page Views per Day, Average Page Views per Visitor.
Visitors: Total Visitors, Average Visitors per Day, Total Unique IPs.
Bandwidth: Total Bandwidth, Visitor Bandwidth, Average Bandwidth per Day, Average Bandwidth per Hit, Average Bandwidth per Visitor.
Access Data: files, images, etc., together with Referrers, User Agents, etc.
Analysis of the information obtained showed Web Usage Mining to be a powerful technique for Web site management and improvement.


7. KEY CONCEPTS OF WEB MINING


In this section we briefly describe the new concepts introduced by the web mining research community.

7.1 RANKING METRICS: FOR PAGE QUALITY AND RELEVANCE


Searching the web involves two main steps: extracting the pages relevant to a query and ranking them according to their quality. Ranking is important as it helps the user look for quality pages that are relevant to the query. Different metrics have been proposed to rank web pages according to their quality. We briefly discuss two of the prominent ones.

PAGE RANK

Page Rank is a metric for ranking hypertext documents based on their quality. Page, Brin, Motwani, and Winograd (1998) developed this metric for the popular search engine Google (Brin and Page 1998). The key idea is that a page has a high rank if it is pointed to by many highly ranked pages. So, the rank of a page depends upon the ranks of the pages pointing to it. This process is done iteratively until the ranks of all pages are determined. With N the number of pages in the web graph G and OutDegree(q) the number of hyperlinks on page q, the rank of a page p can be written as

$$PR(p) = \frac{d}{N} + (1 - d) \sum_{(q,p) \in G} \frac{PR(q)}{OutDegree(q)}$$

Intuitively, the approach can be viewed as a stochastic analysis of a random walk on the web graph. The first term on the right hand side of the equation is the probability that a random web surfer arrives at a page p by typing the URL or from a bookmark, or may have a particular page as his/her homepage. Here d is the probability that the surfer chooses a URL directly, rather than traversing a link, and 1 - d is the probability that a person arrives at a page by traversing a link. The second term on the right hand side of the equation is the probability of arriving at a page by traversing a link.

HUBS AND AUTHORITIES

Hubs and authorities can be viewed as fans and centers in a bipartite core of a web graph, where the fans represent the hubs and the centers represent the authorities. The hub and authority scores computed for each web page indicate the extent to which the web page serves as a hub pointing to good authority pages or as an authority on a topic pointed to by good hubs. The scores are computed for a set of pages related to a topic using an iterative procedure called HITS (Kleinberg 1999). First a query is submitted to a search engine and a set of relevant documents is retrieved. This set, called the root set, is then expanded by including web pages that point to pages in the root set and pages that are pointed to by pages in the root set. This new set is called the base set. An adjacency matrix A is formed such that if there exists at least one hyperlink from page i to page j, then $A_{i,j} = 1$, otherwise $A_{i,j} = 0$. The HITS algorithm is then used to compute the hub and authority scores for this set of pages.

There have been modifications and improvements to the basic Page Rank and hubs and authorities approaches, such as SALSA (Lempel and Moran 2000), topic-sensitive Page Rank (Haveliwala 2002) and web page reputations (Mendelzon and Rafiei 2000). These different hyperlink-based metrics have been discussed by Desikan, Srivastava, Kumar, and Tan (2002).
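
Both metrics can be computed by simple iteration on a small graph. The sketch below is illustrative only: the formula in the `page_rank` docstring is the random-surfer form assumed from the description above, the value d = 0.15 and the fixed iteration count are arbitrary choices, and both example graphs are made up. For simplicity, `page_rank` assumes every page has at least one outgoing link.

```python
def all_pages(links):
    """Every page that appears as a source or a target of a hyperlink."""
    return sorted(set(links) | {p for targets in links.values() for p in targets})

def page_rank(links, d=0.15, iterations=50):
    """Iterate PR(p) = d/N + (1 - d) * sum of PR(q)/OutDegree(q) over pages q linking to p."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}          # surfer chooses the URL directly
        for q, targets in links.items():
            for p in targets:                         # surfer traverses a link q -> p
                new_rank[p] += (1 - d) * rank[q] / len(targets)
        rank = new_rank
    return rank

def hits(links, iterations=50):
    """Return (hub, authority) scores using the iterative HITS updates."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # A good hub points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graphs, invented for this example.
ranks = page_rank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
hub_scores, auth_scores = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

In the first graph, page C is pointed to by both A and B and therefore ends up ranked above B. In the second, a1 is the stronger authority because both hubs point to it, and h1 is the stronger hub because it points to both authorities.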

7.2 ROBOT DETECTION AND FILTERING: SEPARATING HUMAN AND NON HUMAN WEB BEHAVIOUR
Web robots are software programs that automatically traverse the hyperlink structure of the web to locate and retrieve information. The importance of separating robot behaviour from human behaviour prior to building user behaviour models has been illustrated by Kohavi. First, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their web sites. Second, web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to web robots also make it difficult to perform click-stream analysis effectively on the web data. Conventional techniques for detecting web robots are based on identifying the IP address and user agent of the web clients. While these techniques are applicable to many well-known robots, they are not sufficient to detect camouflaged and previously unknown robots. Tan and Kumar proposed a classification based approach that uses the navigational patterns in click-stream data to determine if it is due to a robot. Experimental results have shown that highly accurate classification models can be built using this approach. Furthermore, these models are able to discover many camouflaged and previously unidentified robots.
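
The conventional, agent-based style of detection can be sketched in a few lines. This is a hedged illustration rather than the classification approach of Tan and Kumar: the token list, the `/robots.txt` heuristic, and the function name `is_probable_robot` are assumptions introduced for the example, and a camouflaged robot that sends a browser-like agent string would pass this check.

```python
# Illustrative substrings that often appear in robot agent strings (not exhaustive).
KNOWN_ROBOT_TOKENS = ("bot", "crawler", "spider", "slurp")

def is_probable_robot(session):
    """session: list of dicts with 'agent' and 'url' keys for one client."""
    for request in session:
        agent = request["agent"].lower()
        if any(token in agent for token in KNOWN_ROBOT_TOKENS):
            return True                      # agent string announces a robot
        if request["url"] == "/robots.txt":  # assumed heuristic: browsers rarely ask for this
            return True
    return False

# Hypothetical sessions, invented for this example.
human = [{"agent": "Mozilla/4.08", "url": "/index.html"}]
crawler = [{"agent": "ExampleBot/1.0", "url": "/index.html"}]
polite = [{"agent": "Mozilla/4.08", "url": "/robots.txt"}]
flags = [is_probable_robot(s) for s in (human, crawler, polite)]
```

Sessions flagged in this way would be removed before click-stream analysis so that the remaining data reflects human behaviour.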


7.3 INFORMATION SCENT: APPLYING FORAGING THEORY TO BROWSING BEHAVIOUR


Information scent is a concept that uses the snippets of information present around the links in a page as a scent to evaluate the quality of content of the page it points to, and the cost of accessing such a page. The key idea is to model a user at a given page as foraging for information, and following the link with the stronger scent. The scent of a path depends on how likely it is to lead the user to relevant information, and is determined by a network flow algorithm called spreading activation. The snippets, graphics, and other information around a link are called proximal cues. The user's information need is expressed as a weighted keyword vector. The similarity between the proximal cues and the user's information need is computed as the proximal scent. With the proximal cues from all the links and the user's information need vector, a proximal scent matrix is generated. Each element in the matrix reflects the extent of similarity between a link's proximal cues and the user's information need. If not enough information is available around the link, a distal scent is computed from the information about the link described by the contents of the pages it points to. The proximal scent and the distal scent are then combined to give the scent matrix. The probability that a user would follow a link is then decided by the scent, that is, the value of the corresponding element in the scent matrix.
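
The proximal scent computation can be illustrated as a similarity between weighted keyword vectors. The sketch below makes two assumptions that go beyond the text: cosine similarity is used as the similarity measure, and scents are normalized so that they sum to one in order to read them as link-following probabilities. The keyword weights, link names, and function names are made up for the example, and the spreading activation step is not modelled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse keyword vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def link_probabilities(need, proximal_cues):
    """Map each link to a follow probability proportional to its proximal scent."""
    scent = {link: cosine(need, cues) for link, cues in proximal_cues.items()}
    total = sum(scent.values())
    if total == 0:
        return {link: 1.0 / len(scent) for link in scent}  # no scent: all links equal
    return {link: value / total for link, value in scent.items()}

# Hypothetical information need and proximal cues, invented for this example.
information_need = {"laptop": 1.0, "price": 0.5}
cues = {
    "/laptops": {"laptop": 1.0, "deals": 1.0},
    "/reviews": {"laptop": 0.5, "review": 1.0},
    "/garden": {"plants": 1.0},
}
probabilities = link_probabilities(information_need, cues)
```

The link whose surrounding text shares the most weight with the information need receives the highest probability, while a link with no overlapping keywords receives none.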

7.4 USER PROFILES: UNDERSTANDING HOW USERS BEHAVE


The web has taken user profiling to new levels. For example, in a brick-and-mortar store, data collection happens only at the checkout counter, usually called the point-of-sale. This provides information only about the final outcome of a complex human decision making process, with no direct information about the process itself. In an on-line store, the complete click-stream is recorded, which provides a detailed record of every action taken by the user and thus a much more detailed insight into the decision making process. Adding such behavioural information to other kinds of information about users, for example demographic and psychographic information, allows a comprehensive user profile to be built, which can be used for many different purposes. While most organizations build profiles of user behaviour limited to visits to their own sites, there are successful examples of building web-wide behavioural profiles, such as Alexa Research and DoubleClick. These approaches require browser cookies of some sort, and can provide a fairly detailed view of a user's browsing behaviour across the web.

7.5 INTERESTINGNESS MEASURES: WHEN MULTIPLE SOURCES PROVIDE CONFLICTING EVIDENCE


One of the significant impacts of publishing on the web has been the close interaction now possible between authors and their readers. In the pre-web era, a reader's level of interest in published material had to be inferred from indirect measures such as buying and borrowing, library checkout and renewal, opinion surveys, and in rare cases feedback on the content. For material published on the web it is possible to track the click-stream of a reader to observe the exact path taken through on-line published material. We can measure the time spent on each page, the specific link taken to arrive at a page and to leave it, etc. Much more accurate inferences about readers' interest in content can be drawn from these observations. Mining the user click-stream for user behaviour, and using it to adapt the look-and-feel of a site to a reader's needs, was first proposed by Perkowitz and Etzioni. While the usage data of any portion of a web site can be analyzed, the most significant, and thus interesting, portion is the one where the usage pattern differs significantly from the link structure. This is so because the reader's behaviour, reflected by web usage, is very different from what the author would like it to be, reflected by the structure created by the author. Knowledge extracted from structure data and usage data can be treated as evidence from independent sources and combined in an evidential reasoning framework to develop measures of interestingness.

7.6 PREPROCESSING: MAKING WEB DATA SUITABLE FOR MINING


In the panel discussion referred to earlier, preprocessing of web data to make it suitable for mining was identified as one of the key issues for web mining. A significant amount of work has been done in this area for web usage data, including user identification and session creation, robot detection and filtering, and extracting usage path patterns. Cooley's Ph.D. dissertation provides a comprehensive overview of the work in web usage data preprocessing. Preprocessing of web structure data, especially link information, has been carried out for some applications, the most notable being Google-style web search.

7.7 IDENTIFYING WEB COMMUNITIES OF INFORMATION SOURCES


The web has had tremendous success in building communities of users and information sources. Identifying such communities is useful for many purposes. Gibson, Kleinberg, and Raghavan identified web communities as a core of central authoritative pages linked together by hub pages. Their approach was to discover emerging web communities while crawling. A different approach to this problem was taken by Flake, Lawrence, and Giles, who applied the maximum-flow minimum-cut model to the web graph for identifying web communities. The HITS and maximum-flow approaches have also been compared, with a discussion of the strengths and weaknesses of the two methods. Reddy and Kitsuregawa propose a dense bipartite graph method, a relaxation of the complete bipartite method followed by the HITS approach, to find web communities. A related concept of friends and neighbours was introduced by Adamic and Adar. They identified a group of individuals with similar interests, who in the cyber-world would form a community. Two people are termed friends if the similarity between their web pages is high. Similarity is measured using features such as text, out-links, in-links and mailing lists.

7.8 ONLINE BIBLIOMETRICS


With the web having become the fastest growing and most up-to-date source of information, the research community has found it extremely useful to have online repositories of publications. Lawrence observed that having articles online makes them more easily accessible and hence more often cited than articles that are offline. Such online repositories not only keep researchers updated on work carried out at different centres but also make the interaction and exchange of information much easier. With such information stored on the web, it becomes easier to point to the most frequently cited papers on a topic and also to related papers that have been published earlier or later than a given paper. This helps in understanding the state of the art in a particular field, helping researchers to explore new areas. Fundamental web mining techniques are applied to improve the search and categorization of research papers, and the citing of related articles.

7.9 VISUALIZATION OF THE WORLD WIDE WEB


Mining web data provides a lot of information, which can be better understood with visualization tools. Visualization makes concepts clearer than is possible with a purely textual representation. Hence, there is a need to develop tools that provide a graphical interface that aids in visualizing the results of web mining. Analyzing web log data with visualization tools has evoked a lot of interest in the research community. Chi, Pitkow, Mackinlay, Pirolli, Gossweiler, and Card developed a web ecology and evolution visualization (WEEV) tool to understand the relationship between web content, web structure and web usage over a period of time. The site hierarchy is represented in a circular form called the Disk Tree and the evolution of the web is viewed as a Time Tube. Cadez, Heckerman, Meek, Smyth, and White present a tool called WebCANVAS that displays clusters of users with similar navigation behaviour. Prasetyo, Pramudiono, Takahashi, Toyoda, and Kitsuregawa developed Naviz, an interactive web log visualization tool that is designed to display the user browsing pattern on the web site at a global level, and then display each browsing path on the pattern displayed earlier in an incremental manner. The support of each traversal is represented by the thickness of the edge between the pages. Such a tool is very useful in analyzing user behaviour and improving web sites.


8. PROMINENT APPLICATIONS
Excitement about the web in the past few years has led to web applications being developed in industry at a much faster rate than research in web-related technologies. Many of these applications are based on the use of web mining concepts, even though the organizations that developed them, and invented the corresponding technologies, did not consider them as such. We describe some of the most successful applications in this section. Clearly, realizing that these applications use web mining is largely a retrospective exercise. For each application category discussed below, we have selected a prominent representative, purely for exemplary purposes. This in no way implies that all the techniques described were developed by that organization alone. On the contrary, in most cases the successful techniques were developed by a rapid "copy and improve" approach to each other's ideas.

8.1 PERSONALIZED CUSTOMER EXPERIENCE IN B2C ECOMMERCE: AMAZON.COM


Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed, "In a traditional (brick-and-mortar) store, the main effort is in getting a customer to the store. Once a customer is in the store they are likely to make a purchase, since the cost of going to another store is high, and thus the marketing budget (focused on getting the customer to the store) is in general much higher than the in-store customer experience budget (which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store." This fundamental observation has been the driving force behind Amazon's comprehensive approach to personalized customer experience, based on the mantra "a personalized store for every customer" (Morphy 2001). A host of web mining techniques, such as associations between pages visited and click-path analysis, are used to improve the customer's experience during a store visit. Knowledge gained from web mining is the key intelligence behind Amazon's features such as instant recommendations, purchase circles, wish-lists, etc.
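
As an illustration of what "associations between pages visited" can mean in practice, the following sketch counts how often pairs of pages occur in the same session and derives simple support and confidence figures from those counts. It is a minimal example under stated assumptions, not Amazon's method: the function name, the page names, and the `min_support` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def page_associations(sessions, min_support=2):
    """Return rules (a, b, support, confidence) meaning 'sessions with a also had b'."""
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))          # ignore repeat views within a session
        page_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        if together < min_support:
            continue                          # too rare to be a useful association
        # confidence(a -> b) = sessions with both / sessions with a
        rules.append((a, b, together, together / page_counts[a]))
        rules.append((b, a, together, together / page_counts[b]))
    # Strongest rules first; ties broken by page names for a stable order.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules

# Hypothetical sessions: the set of product pages viewed in each visit.
sessions = [
    ["camera", "tripod", "memory_card"],
    ["camera", "memory_card"],
    ["camera", "tripod"],
    ["novel"],
]
rules = page_associations(sessions)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support}, confidence={confidence:.2f}")
```

A rule with high confidence, such as visitors of the tripod page also viewing the camera page, is the kind of pattern that could drive an instant recommendation.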

8.2 WEB SEARCH: GOOGLE


Google is one of the most popular and widely used search engines. It provides users access to information from over 2 billion web pages that it has indexed on its server. The quality and speed of its search facility make it the most successful search engine. Earlier search engines concentrated on web content alone to return the pages relevant to a query. Google was the first to introduce the importance of the link structure in mining information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products, and uses structural information of the web graph to return high-quality results. The Google Toolbar is another service provided by Google that seeks to make search easier and more informative by providing additional features such as highlighting the query words on the returned web pages. The full version of the toolbar, if installed, also sends the click-stream information of the user to Google. The usage statistics thus obtained are used by Google to enhance the quality of its results. Google also provides advanced search capabilities to search images and to find pages that have been updated within a specific date range. Built on top of Netscape's Open Directory project, Google's web directory provides a fast and easy way to search within a certain topic or related topics. The advertising program introduced by Google targets users by providing advertisements that are relevant to a search query. This avoids bothering users with irrelevant ads and has increased the clicks for the advertising companies by four to five times. According to B to B, a leading national marketing publication, Google was named a top 10 advertising property in the Media Power 50, which recognizes the most powerful and targeted business-to-business advertising outlets. One of the latest services offered by Google is Google News. It integrates news from the online versions of all newspapers and organizes them categorically to make it easier for users to read the most relevant news. It seeks to provide the latest information by constantly retrieving pages from news sites worldwide that are being updated on a regular basis. The key feature of this news page, like any other Google service, is that it integrates information from various web news sources through purely algorithmic means, and thus does not introduce any human bias or effort. However, the publishing industry is not very convinced about a fully automated approach to news distillation.
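
The way PageRank derives a page's importance from the link structure of the web graph can be illustrated with the commonly published power-iteration formulation. The sketch below is an assumption-laden teaching example, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the tiny four-page graph are all choices made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}   # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web graph: page "c" is linked to by three other pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this small graph the page with the most incoming links ends up with the highest score, while the page nobody links to keeps only the baseline share.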

8.3 WEB-WIDE TRACKING: DOUBLECLICK


Web-wide tracking, i.e. tracking an individual across all the sites he or she visits, is an intriguing and controversial technology. It can provide an understanding of an individual's lifestyle and habits to a level that is unprecedented, which is clearly of tremendous interest to marketers. A successful example of this is DoubleClick Inc.'s DART ad management technology. DoubleClick serves advertisements, which can be targeted on demographic or behavioural attributes, to the end-user on behalf of the client, i.e. the web site using DoubleClick's service. Sites that use DoubleClick's service are part of The DoubleClick Network, and the browsing behaviour of a user can be tracked across all sites in the network using a cookie. This allows DoubleClick's ad targeting to be based on very sophisticated criteria. Alexa Research has recruited a panel of more than 500,000 users, who have voluntarily agreed to have their every click tracked in return for some freebies. This is achieved through a browser bar that the panelist downloads from Alexa's website; it attaches to the browser and sends Alexa a complete click-stream of the panelist's web usage. Alexa was purchased by Amazon for its tracking technology. Clearly, web-wide tracking is a very powerful idea. However, the invasion of privacy it causes has not gone unnoticed, and both Alexa/Amazon and DoubleClick have faced very visible lawsuits. Microsoft's Passport technology also falls into this category. The value of this technology in applications such as cyber-threat analysis and homeland defense is quite clear, and it might be only a matter of time before these organizations are asked to provide information to law enforcement agencies.

8.4 UNDERSTANDING WEB COMMUNITIES: AOL


One of the biggest successes of America Online (AOL) has been its sizeable and loyal customer base. A large portion of this customer base participates in various AOL communities, which are collections of users with similar interests. In addition to providing a forum for each such community to interact amongst themselves, AOL provides them with useful information and services. Over time these communities have grown to be well-visited waterholes for AOL users with shared interests. Applying web mining to the data collected from community interactions provides AOL with a very good understanding of its communities, which it has used for targeted marketing through advertisements and e-mail solicitation. Recently, it has started the concept of community sponsorship, whereby an organization, say Nike, may sponsor a community called "Young Athletic Twenty Something". In return, consumer survey and new product development experts of the sponsoring organization get to participate in the community, perhaps without the knowledge of other participants. The idea is to treat the community as a highly specialized focus group, understand its needs and opinions on new and existing products, and also test strategies for influencing opinions.

8.5 UNDERSTANDING AUCTION BEHAVIOUR: EBAY


As individuals in a society where we have many more things than we need, the allure of exchanging our useless stuff for some cash, no matter how small, is quite powerful. This is evident from the success of flea markets, garage sales and estate sales. The genius of eBay's founders was to create an infrastructure that gave this urge a global reach, with the convenience of doing it from one's home PC. In addition, it popularized auctions as a product selling and buying mechanism, and it provides the thrill of gambling without the trouble of having to go to Las Vegas. All of this has made eBay one of the most successful businesses of the internet era. Unfortunately, the anonymity of the web has also created a significant problem for eBay auctions, as it is impossible to distinguish real bids from fake ones. eBay is now using web mining techniques to analyze bidding behaviour to determine if a bid is fraudulent (Colet 2002). Recent efforts are geared towards understanding participants' bidding behaviours and patterns to create a more efficient auction market.

8.6 PERSONALIZED PORTAL FOR THE WEB: MYYAHOO


Yahoo was the first to introduce the concept of a personalized portal, i.e. a web site designed to have its look-and-feel and content personalized to the needs of an individual end-user. This has been an extremely popular concept and has led to the creation of other personalized portals, such as Yodlee for private information like bank and brokerage accounts. Mining MyYahoo usage logs provides Yahoo with valuable insight into an individual's web usage habits, enabling Yahoo to provide personalized content, which in turn has led to the tremendous popularity of the Yahoo web site.

8.7 CITESEER: DIGITAL LIBRARY AND AUTONOMOUS CITATION INDEXING
NEC Research Index, also known as CiteSeer, is one of the most popular online bibliographic indices related to computer science. The key contribution of the CiteSeer repository is its Autonomous Citation Indexing (ACI) (Lawrence, Giles, and Bollacker 1999). Citation indexing makes it possible to extract information about related articles. Automating this process saves a lot of human effort and makes it faster and more effective. CiteSeer works by crawling the web and downloading research-related papers. Information about citations and the related context is stored for each of these documents. The entire text and information about the document is stored in different formats. Information is also given about documents that are similar at a sentence level (the percentage of sentences that match between the documents), similar at a text level, or related through co-citation. Citation statistics are computed for documents, enabling the user to look at the most cited or popular documents in the related field. CiteSeer also maintains a directory of computer science related papers to make category-based search easier. These documents are ordered by the number of citations.
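
The sentence-level similarity mentioned above, i.e. the percentage of sentences that match between two documents, can be approximated as follows. This is a rough sketch and not CiteSeer's actual procedure: the naive regular-expression sentence splitter, the lower-casing and whitespace normalization, and the choice to express the overlap relative to the first document are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Split text into a set of normalized sentences (naive punctuation-based split)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Collapse internal whitespace and ignore case so trivial differences still match.
    return {re.sub(r"\s+", " ", part).strip().lower() for part in parts if part.strip()}

def sentence_overlap(doc_a, doc_b):
    """Percentage of the sentences in doc_a that also appear in doc_b."""
    sentences_a = split_sentences(doc_a)
    sentences_b = split_sentences(doc_b)
    if not sentences_a:
        return 0.0
    return 100.0 * len(sentences_a & sentences_b) / len(sentences_a)

# Hypothetical document texts.
doc_a = "Web mining is useful. It has three branches. Usage mining studies logs."
doc_b = "Web mining is useful. Usage mining studies logs. Privacy matters."
print(f"{sentence_overlap(doc_a, doc_b):.1f}% of sentences in doc_a also occur in doc_b")
```

A high overlap between two papers would suggest that one is a revised or extended version of the other.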

9. RESEARCH DIRECTIONS
Although we are going through an inevitable phase of irrational despair following a phase of irrational exuberance about the commercial potential of the web, the adoption and usage of the web continue to grow unabated. As the web and its usage grow, it will continue to generate ever more content, structure, and usage data, and the value of web mining will keep increasing. Outlined here are some research directions that must be pursued to ensure that we continue to develop web mining technologies that will enable this value to be realized.

9.1 WEB METRICS AND MEASUREMENTS


From an experimental human behaviourist's viewpoint, the web is the perfect experimental apparatus. Not only does it provide the ability to measure human behaviour at a micro level, it also eliminates the bias of the subjects knowing that they are participating in an experiment, and it allows the number of participants to be many orders of magnitude larger than in conventional studies. However, we have not yet begun to appreciate the true impact of this revolutionary experimental apparatus for human behaviour studies. The Web Lab of Amazon is one of the early efforts in this direction. It is regularly used to measure the user impact of various proposed changes, on operational metrics such as site visits and visit/buy ratios as well as on financial metrics such as revenue and profit, before a deployment decision is made. For example, during Spring 2000 a 48-hour experiment on the live site was carried out, involving over one million user sessions, before the decision to change Amazon's logo was made. Research needs to be done in developing the right set of web metrics, and their measurement procedures, so that various web phenomena can be studied.
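
To make the notion of an operational web metric concrete, the following sketch compares the fraction of visits that end in a purchase for two versions of a site. It is a simplified illustration only: the variant labels, the session records, and the function name are hypothetical, and the metric is computed as purchases per visit (the inverse of a visit/buy ratio). A real experiment would also need a statistical significance test before any deployment decision.

```python
from collections import defaultdict

def buy_visit_ratio(sessions):
    """Return, per site variant, the fraction of visits that ended in a purchase.

    sessions is an iterable of (variant, purchased) pairs, one per visit.
    """
    visits = defaultdict(int)
    buys = defaultdict(int)
    for variant, purchased in sessions:
        visits[variant] += 1
        if purchased:
            buys[variant] += 1
    return {variant: buys[variant] / visits[variant] for variant in visits}

# Hypothetical visit records for the current site and a proposed change.
sessions = [
    ("current", True), ("current", False), ("current", False), ("current", False),
    ("proposed", True), ("proposed", True), ("proposed", False), ("proposed", False),
]
ratios = buy_visit_ratio(sessions)
for variant, ratio in sorted(ratios.items()):
    print(f"{variant}: {ratio:.0%} of visits ended in a purchase")
```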

9.2 PROCESS MINING


Mining of market basket data, collected at the point-of-sale in any store, has been one of the visible successes of data mining. However, this data captures only the end result of the process, and only for decisions that ended in a product purchase. Click-stream data provides the opportunity for a detailed look at the decision-making process itself, and knowledge extracted from it can be used for optimizing the process, influencing it, etc. Underhill has conclusively proven the value of process information in understanding users' behaviour in traditional shops. Research needs to be carried out in (1) extracting process models from usage data, (2) understanding how different parts of the process model impact various web metrics of interest, and (3) understanding how the process models change in response to various changes that are made, i.e. changing stimuli to the user.
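
One simple way to picture "extracting process models from usage data" is to estimate a first-order transition model from click-stream sessions, i.e. the probability of moving from one page to the next. The sketch below is one possible illustration under stated assumptions, not a method taken from the source: the first-order (memoryless) model, the artificial `<start>` and `<exit>` states, and the sample sessions are all hypothetical.

```python
from collections import Counter, defaultdict

def transition_model(sessions):
    """Estimate P(next page | current page) from ordered click-stream sessions."""
    counts = defaultdict(Counter)
    for path in sessions:
        # Artificial start and exit states capture where visits begin and end.
        steps = ["<start>"] + list(path) + ["<exit>"]
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1

    model = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        model[state] = {page: count / total for page, count in followers.items()}
    return model

# Hypothetical sessions from an online store.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product"],
    ["home", "search", "product", "cart"],
]
model = transition_model(sessions)
for state in sorted(model):
    for page, probability in sorted(model[state].items()):
        print(f"{state} -> {page}: {probability:.2f}")
```

In such a model, the probability of leaving the site from the cart page is the kind of process-level figure that market basket data alone cannot provide.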

9.3 TEMPORAL EVOLUTION OF THE WEB


Society's interaction with the web is changing the web as well as the way people interact with each other. While storing the history of all of this interaction in one place is clearly too staggering a task, at least the changes to the web are being recorded by the pioneering Internet Archive project. Research needs to be carried out in extracting temporal models of how web content, web structures, web communities, authorities, hubs, etc. evolve over time. Large organizations generally archive usage data from their web sites. With these sources of data available, there is large scope for research into techniques for analyzing how the web evolves over time.

9.4 WEB SERVICES PERFORMANCE OPTIMIZATION


As services over the web continue to grow, there will be a continuing need to make them robust, scalable and efficient. Web mining can be applied to better understand the behaviour of these services, and the knowledge extracted can be useful for various kinds of optimizations. The successful application of web mining to predictive pre-fetching of pages by a browser has been demonstrated by Pandey, Srivastava, and Shekhar. Analysis of web logs is necessary for web services performance optimization. Research is needed in developing web mining techniques to improve various other aspects of web services.

9.5 FRAUD AND THREAT ANALYSIS


The anonymity provided by the web has led to a significant increase in attempted fraud, from unauthorized use of individual credit cards to hacking into credit card databases for blackmail purposes. Yet another example is auction fraud, which has been increasing on popular sites like eBay. Since all these frauds are being perpetrated through the internet, web mining is the perfect analysis technique for detecting and preventing them. Research issues include developing techniques to recognize known frauds, to characterize them, and to recognize emerging frauds. The issues in cyber-threat analysis and intrusion detection are quite similar in nature.

9.6 WEB MINING AND PRIVACY


While there are many benefits to be gained from web mining, a clear drawback is the potential for severe violations of privacy. Public attitude towards privacy seems to be almost schizophrenic, i.e. people say one thing and do quite the opposite. For example, famous cases like those involving Amazon and DoubleClick seem to indicate that people value their privacy, while experience at major e-commerce portals shows that over 97% of all people accept cookies with no problems, and most of them actually like the personalization features that are provided based on them. Spiekermann, Grossklags, and Berendt have demonstrated that people were willing to provide fairly personal information about themselves, which was completely irrelevant to the task at hand, if provided the right stimulus to do so. Furthermore, explicitly bringing attention to information privacy policies had practically no effect. One explanation of this seemingly contradictory attitude towards privacy may be that we have a bi-modal view of privacy, namely that "I'd be willing to share information about myself as long as I get some (tangible or intangible) benefits from it, and as long as there is an implicit guarantee that the information will not be abused". The research issue generated by this attitude is the need to develop approaches, methodologies and tools that can be used to verify and validate that a web service is indeed using users' information in a manner consistent with its stated policies.
